BikeSaferPA: predicting outcomes for cyclists using Pennsylvania crash data, 2002-2021¶

Part II: Data visualization and exploration¶

In the second phase of the project, I explore the datasets 'cyclists' and 'crashes' and visualize some trends. Recall that I am particularly interested in the percentage of cyclists suffering serious injury or fatality among cyclists involved in crashes, and in particular I would like to examine how this percentage changes when I restrict to crashes in which certain factors are present.

Using the 'crashes' dataframe, I will:

  • Examine prevalences of all crashes involving bicycles, and crashes resulting in serious cyclist injury or fatality, by year, month of the year, day of the week, and time of the day.
  • Visualize the location data of crashes both statewide and locally in a selection of major urban areas.

Using the 'cyclists' dataframe, I will:

  • Examine the distributions of various features among the following cohorts:
    • all cyclists involved in crashes
    • cyclists who suffered serious injury
    • cyclists who suffered fatality
  • Identify crash factors which, when we condition on those factors being present, dramatically increase the percentage of cyclists being seriously injured or killed.
  • Identify pairs of crash factors which have a significant compounding effect on chance of serious cyclist injury or death.

First, import the necessary libraries and load in the dataframes prepared in Part I.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sns

from IPython.display import display, display_html
from ipywidgets import widgets

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth',None)

plt.style.use('fivethirtyeight')

import sys
np.set_printoptions(threshold=sys.maxsize)
In [2]:
cyclists = pd.read_csv('cyclists.csv')
crashes = pd.read_csv('crashes.csv')

Navigation:¶

  • Crash occurrences over time
    • Annual crash occurrences, 2002-2021
    • Crash occurrences by month of the year
    • Monthly crash occurrences, 2002-2021
    • Crash occurrences by day of the week
    • Crash occurrences by hour of the day
  • Visualizing crashes on a map
    • Statewide crash map
    • Philadelphia crash map
  • Prevalence of various feature values
    • Visualizing feature prevalence
    • Summary of feature value comparison
  • Visualizing pairs of features
  • Summarization of findings

Occurrences of crashes involving bicycles over time¶

The plot_over_time function will create three subplots according to desired input time period:

  • The counts of crashes involving bicycles by year/month/day/hour
  • The counts of crashes resulting in serious injury to cyclist by year/month/day/hour
  • The counts of crashes resulting in cyclist death by year/month/day/hour
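
The three counts listed above amount to a grouped aggregation. The following is a minimal sketch of that idea only, not the implementation of plot_over_time (which lives in lib.vis_data and is not shown here). It runs on a toy dataframe; CRASH_MONTH is a column used later in this notebook, while SERIOUS_INJ and FATAL are hypothetical 0/1 flag columns invented for illustration.

```python
import pandas as pd

# Toy stand-in for the crashes dataframe. CRASH_MONTH appears in the notebook;
# SERIOUS_INJ and FATAL are hypothetical 0/1 flag columns used for illustration.
toy = pd.DataFrame({
    'CRASH_MONTH': [1, 1, 6, 6, 6, 7],
    'SERIOUS_INJ': [0, 1, 0, 1, 0, 0],
    'FATAL':       [0, 0, 1, 0, 0, 0],
})

# One row per period: total crashes, crashes with serious injury, crashes with fatality
counts = toy.groupby('CRASH_MONTH').agg(
    all_crashes=('SERIOUS_INJ', 'size'),
    serious_injury=('SERIOUS_INJ', 'sum'),
    fatality=('FATAL', 'sum'),
)
print(counts)
```

Each column of the resulting table could then feed one subplot, for example via counts.plot(kind='bar', subplots=True).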

Annual crash totals from 2002-2021¶

In [3]:
from lib.vis_data import plot_over_time
In [4]:
plot_over_time(crashes)

Below is a display of the year-over-year percent changes in each type of crash. Decreases are shaded blue and increases are shaded red; the intensity of the hue reflects the magnitude of the percent change.

In [5]:
from lib.vis_data import perc_change_table
perc_change_table(crashes)
year 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
yearly change of:                                      
all crashes 2.3% 3.9% -13.4% -2.1% 7.2% -3.0% -4.7% 6.0% -11.1% 5.1% -0.4% -5.1% -3.0% 2.3% -12.4% -15.7% 5.7% -19.6% -7.3%
with serious injury 16.5% -9.1% -28.9% 17.2% 22.7% -12.0% 1.2% -19.5% 1.5% -11.9% 10.2% -27.7% 14.9% 79.6% 0.0% -3.1% 5.3% -29.3% 47.1%
with fatality 17.6% -30.0% 28.6% -27.8% 53.8% -60.0% 100.0% 31.2% -47.6% 54.5% -35.3% 72.7% -15.8% 0.0% 31.2% -14.3% -11.1% 37.5% 9.1%
with either 17.8% -12.6% -21.2% 7.3% 27.3% -21.4% 10.2% -10.3% -10.3% -2.6% 0.0% -13.2% 6.1% 61.4% 4.4% -5.1% 2.7% -20.0% 37.0%
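
Year-over-year figures like those in the table above can be computed with pandas' pct_change. The sketch below is an illustration under assumptions, not the code of perc_change_table: the annual counts are made-up values, not taken from the dataset.

```python
import pandas as pd

# Hypothetical annual crash counts (made-up values, not from the dataset)
annual_counts = pd.Series({2002: 200, 2003: 210, 2004: 189})

# Percent change relative to the previous year; the first year has no
# predecessor, so its entry is NaN and is dropped
yoy = annual_counts.pct_change().dropna() * 100
print(yoy.round(1))
```

Because each entry is relative to the previous year's count, a large percentage in a small category (such as fatalities) can correspond to a change of only a few crashes.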

Some observations based on these visualizations:

  • Annual counts of all cyclist-involved crashes experienced fairly small fluctuations from 2002-2021 with a general downward trend driven by steep declines in 2005, 2011, 2017, 2018, and 2020.
  • There is a worrisome trend in the annual counts of crashes with serious cyclist injury. There was a general downward trend that ended in 2014. The annual count spiked heavily in 2016 - despite only a negligible increase in total crashes that year - and stayed high until a significant decline in 2020. The annual count of crashes with serious injury spiked again in 2021, despite another decrease in total crashes that year!
  • Annual counts of cyclist fatalities fluctuated wildly and did not reflect the trend in total annual crash counts, but this is not surprising given that the sample set is so small - only 338 samples in the entire dataset, averaging around 17 per year. However, there has been a general upward trend in annual cyclist fatalities since 2008. Although the annual counts of total crashes and crashes with serious cyclist injury decreased sharply in 2020, the count of crashes with cyclist fatality spiked heavily that year and stayed high in 2021.
    • These observations are consistent with 2020 national cyclist fatality and injury statistics in this NHTSA report - nationally there was a 9% increase in cyclist fatality and a 21% decrease in cyclist injury in 2020 relative to 2019.
    • I know that the early years of the pandemic brought significant challenges and disruptions to the health care system and the emergency response infrastructure. One possible explanation for the 2020 spike in fatalities is that, because of strains in the system, some cyclists died who might otherwise have survived their serious injuries. However, this dataset doesn't supply sufficient evidence for such a hypothesis.

The conditions for cycling in urban settings are dramatically different from those in rural settings, and I expect that might have an effect on crash severity. I'll display the annual count graphs with rural crashes separated out from crashes in urban/urbanized settings via stacked bars:

In [6]:
plot_over_time(crashes,split_urban_rural=True)

I can see a few things:

  • Although rural crashes represent a fairly small percentage of all crashes (between 12% and 20% annually), they consistently represent a much larger percentage of crashes with serious cyclist injury and of crashes with cyclist fatality.
  • There are several years in which the percentage of rural crashes among crashes with cyclist fatality was much larger than usual. In particular, in 2005 and 2016 the majority of crashes with cyclist fatality were rural, and in 2020 half were rural.
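
The rural percentages quoted above are row-normalized shares. A minimal sketch of how such shares could be computed follows, on a toy dataframe. URBAN_RURAL and its three values appear later in this notebook; CRASH_YEAR is a hypothetical column name assumed for illustration.

```python
import pandas as pd

# Toy stand-in for the crashes dataframe; CRASH_YEAR is a hypothetical column name
toy = pd.DataFrame({
    'CRASH_YEAR':  [2019, 2019, 2019, 2019, 2020, 2020],
    'URBAN_RURAL': ['urban', 'urban', 'urbanized', 'rural', 'urban', 'rural'],
})

# Collapse urban and urbanized into one group, mirroring the stacked bars
setting = toy['URBAN_RURAL'].where(toy['URBAN_RURAL'] == 'rural', 'urban/urbanized')

# normalize='index' makes each year's row sum to 1
rural_share = pd.crosstab(toy['CRASH_YEAR'], setting, normalize='index')['rural'] * 100
print(rural_share)
```

Running the same computation on the subset of crashes with serious injury or fatality would give the shares to compare against.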

I'll continue to break down crash counts over time between rural and urban/urbanized settings.

Crash totals by month of the year¶

In [7]:
plot_over_time(crashes, feature='CRASH_MONTH',label='month',split_urban_rural=True)
In [8]:
from lib.vis_data import perc_change_table
perc_change_table(crashes, period='month')
month 1 2 3 4 5 6 7 8 9 10 11 12
monthly change of:                        
all crashes -29.2% -4.0% 93.7% 76.0% 41.2% 21.2% 3.5% 5.1% -13.6% -28.3% -39.1% -38.5%
with serious injury -50.8% 16.1% 88.9% 85.3% 38.9% 20.6% 0.9% 16.0% -32.4% -4.2% -44.4% -29.2%
with fatality -35.0% -38.5% 100.0% 25.0% 100.0% 2.5% 29.3% -24.5% -5.0% -26.3% -25.0% -4.8%
with either -47.0% 0.0% 90.9% 72.6% 48.3% 16.7% 5.6% 8.3% -28.6% -8.3% -42.0% -23.9%

The above table should be read carefully. The category for a particular month contains all samples from that month from every year, and so the percent change entries under '1' should be interpreted as the percent change between the December total (for all years) and the January total (for all years).
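
The December-to-January wraparound described above can be sketched as follows. This is an illustration with made-up monthly totals, not the code of perc_change_table; a plain pct_change would leave January empty, so the values are rotated by one position instead.

```python
import numpy as np
import pandas as pd

# Hypothetical crash totals per calendar month, summed over all years (made-up values)
month_totals = pd.Series(
    [50, 48, 90, 160, 225, 270, 280, 295, 255, 180, 110, 70],
    index=range(1, 13),
)

# np.roll shifts the values by one position with wraparound, so January's
# predecessor is December rather than a missing value
previous = pd.Series(np.roll(month_totals.to_numpy(), 1), index=month_totals.index)
monthly_change = (month_totals - previous) / previous * 100
print(monthly_change.round(1))
```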

For urban/urbanized crashes:

  • The total monthly crash count roughly resembles a truncated bell curve centered around August 1. This is consistent with monthly patterns in bicycle utilization counts and average time spent riding daily per person - see for instance this chart from the U.S. Bureau of Transportation Statistics.
  • The monthly count of crashes with serious cyclist injury follows a similar distribution, except for a suppression in the count in the month of September and an elevation in February.
  • The monthly count of crashes with cyclist fatality follows a similar pattern, except for somewhat elevated counts in January, May, July, and December and somewhat suppressed counts in April and October. These may not be significant effects, since there are only 338 cyclist fatality samples. That sample size is small enough that I should expect a noisy view of the actual distribution.

For rural crashes:

  • The percentage of crashes which are rural varies between 12% and 16% monthly, but is lower in winter and higher in warmer months. I expect that this is occurring for two reasons:
    • Recreational cycling is much more common in warmer months than in colder months, whereas commuting by bicycle has a less pronounced seasonal effect. Cycling is a more common mode of commuting in urban settings than in rural settings.
    • Rural roads and their shoulders are more likely to be obstructed by snow and ice for long periods of time in the winter months as compared to urban streets.
  • This monthly variation in percentage of crashes which are rural is more pronounced among crashes with serious cyclist injury or fatality. In particular, almost half of fatal crashes were rural in the months of June and July.

Monthly crash totals over 2002-2021¶

I'll take a quick look at long-term seasonality by plotting each month's actual counts rather than summing over years.

In [9]:
from lib.vis_data import plot_month_series
plot_month_series(crashes)

Some additional observations regarding these monthly count plots:

  • I see consistent seasonality in the monthly counts of all crashes, with a general downward trend which reflects the trend in annual counts of all crashes that I saw prior.
  • Recall that the highest annual counts of crashes with serious cyclist injury were in 2003, 2016-2019, and 2021. From this monthly plot I can see that the distribution in 2003 was heavily supported with high July-August numbers, whereas the distributions in recent years are broader with more support in spring and fall.
  • The noise in the fatality data (due to the small sample size) is quite apparent. However, I can observe more fatalities than usual in spring and fall of 2020.

Occurrences by day of the week¶

In [10]:
plot_over_time(crashes, feature='DAY_OF_WEEK',label='day of the week',split_urban_rural=True)

For urban/urbanized crashes:

  • On average, more crashes occur on weekdays than on weekend days, and more crashes on Saturday than Sunday. The average number of crashes across weekdays is fairly similar.
  • The variation in the number of crashes with serious injuries among weekdays reflects the variation in number of total crashes, with Friday and Wednesday having the most.
  • Significantly more crashes with cyclist fatality occur on Friday and Thursday than on the other weekdays. Afternoon commuting traffic is known to be heaviest on Thursday and Friday, which may account for the increase in fatalities on these days.

Rural crash variations during the week are quite different:

  • There is not much variation in the number of crashes from day to day. The count is smallest on Sunday and increases throughout the week, but this effect is less pronounced and Saturday actually has the largest count.
  • The count of crashes with serious cyclist injury is also fairly stable throughout the week, but the counts on Friday and Saturday are slightly higher.
  • The count of fatal crashes varies much more, peaking on Wednesday and Friday.

Occurrences by hour of the day¶

In [11]:
plot_over_time(crashes, feature='HOUR_OF_DAY',label='hour of the day',split_urban_rural=True,best_legend=True)

For urban/urbanized crashes:

  • Crash counts rise throughout the morning and afternoon, and peak during the afternoon rush hours - in fact, over 53% of crashes occur between 2pm-8pm and over 21% occur between 4pm-6pm.
  • Crashes with serious cyclist injury, and with cyclist fatality, are less densely concentrated near the afternoon rush - a significant portion happen in the morning, and an outsized portion overnight. In particular, 7% of all crashes happen between 10pm-5am, but 10% of serious injury crashes and 26% of fatal crashes happen during those same hours.

For rural crashes:

  • Crash counts similarly rise throughout the day, but almost none happen at night - 3% of rural crashes happen between 10pm-5am as compared to 7% of urban/urbanized crashes during that same time frame. Cycling during the night is much less common in a rural setting, as rural roads are much less well-lit.
  • The 6am spike shown in fatal urban/urbanized crashes is non-existent in fatal rural crashes.
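
Shares for an overnight window such as 10pm to 5am need a little care because the window wraps past midnight. A minimal sketch on a toy dataframe follows; only the HOUR_OF_DAY column name is taken from the notebook, and the values are made up.

```python
import pandas as pd

# Toy stand-in for the crashes dataframe; only HOUR_OF_DAY is taken from the notebook
toy = pd.DataFrame({'HOUR_OF_DAY': [23, 2, 4, 8, 12, 15, 16, 17, 17, 22]})

# The overnight window wraps past midnight, so it is the union of
# hours >= 22 and hours < 5 rather than a single between() range
overnight = (toy['HOUR_OF_DAY'] >= 22) | (toy['HOUR_OF_DAY'] < 5)

# The mean of a boolean mask is the fraction of rows inside the window
overnight_share = overnight.mean() * 100
print(f"{overnight_share:.0f}% of toy crashes fall in the overnight window")
```

Applying the same mask to the subsets with serious injury or fatality gives the figures to compare across cohorts.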

Uncomment the following if you want to take a closer look at hourly counts for each day of the week:

In [12]:
# days = ['Sun','Mon','Tues','Wed','Thurs','Fri','Sat']
# for i,day in enumerate(days):
#     plot_over_time(df = crashes[crashes.DAY_OF_WEEK==i+1], feature='HOUR_OF_DAY',label=f'hour of the day ({day})',kind='hist')

Visualizing locations of crashes on a map, over time¶

Statewide crashes by year¶

The function plot_map plots crash events on a map using DEC_LAT, DEC_LONG data. I won't do an in-depth analysis, but the visualization helps illustrate how the crash events are distributed throughout the state, and how the geographic distribution varies over time.

In [13]:
from lib.vis_data import plot_map
plot_map(crashes)

Focusing on Philadelphia¶

The plot_map function can also focus in on a particular municipality by passing in its municipality code and name. Philadelphia is the largest city and most significant source of crash events. In the period of 2002-2021:

  • 35.5% of PA crashes involving bicycles occurred in Philadelphia
  • 18.7% of PA crashes involving serious cyclist injury occurred in Philadelphia
  • 22.2% of PA crashes involving cyclist fatality occurred in Philadelphia
In [14]:
plot_map(df=crashes,city=(67301,'Philadelphia'))

Uncomment and run the following cell to see maps from other major urban areas in PA:

In [15]:
# cities_to_plot = [(67301,'Philadelphia'),
#                   (2301,'Pittsburgh'),
#                   (39301,'Allentown'),
#                   (6301,'Reading'),
#                   (25302,'Erie'),
#                   (35302,'Scranton'),
#                   (22301,'Harrisburg'),
#                   (14410,'State College')
#                  ]
# for city in cities_to_plot:
#     plot_map(crashes, city=city)

Visualizing prevalence of various feature values¶

In this section, I shall visualize how the values of features break down over three cohorts:

  • all cyclists involved in crashes
  • cyclists involved in crashes who were seriously injured
  • cyclists involved in crashes who were killed (i.e. suffered fatal injury)

Reminder: the data dictionary defines "serious injury" and "fatal injury" as:

  • serious injury: "incapacitating injury, including bleeding wounds and distorted members (amputations or broken bones), and requires transport of the patient from the scene."
  • fatal injury: "the person dies as a result of the injuries sustained in the crash within 30 days of the crash."
In [16]:
pd.concat([cyclists.INJ_SEVERITY.value_counts()[['susp_serious_injury','killed']],pd.Series({'total':cyclists.shape[0]})],axis=0)
Out[16]:
susp_serious_injury     1609
killed                   338
total                  26882
dtype: int64

Note that I will combine the cyclist groups who were seriously injured or killed because the number of samples of cyclists who were killed is very low (338 out of 26882, i.e. about 1.26% of samples). Such a small sample set can be noisy and result in less reliable conclusions. Moreover, I expect that the distinction between serious injury and fatality is heavily influenced by factors that are outside the scope of the dataset such as the underlying health of the cyclist, the quality of medical care they might have received, and the fine details of the collision geometry.
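
Combining the two groups amounts to building a single binary indicator from INJ_SEVERITY. A minimal sketch on a toy dataframe follows. The labels 'susp_serious_injury' and 'killed' come from the value counts above; the label 'no_injury' and the column name SERIOUS_OR_FATAL are hypothetical placeholders.

```python
import pandas as pd

# Toy stand-in for the cyclists dataframe. The two severity labels match the
# value counts above; 'no_injury' is a hypothetical placeholder label.
toy = pd.DataFrame({
    'INJ_SEVERITY': ['killed', 'susp_serious_injury', 'no_injury', 'no_injury']
})

# Hypothetical column name for the combined outcome (1 = serious injury or fatality)
toy['SERIOUS_OR_FATAL'] = (
    toy['INJ_SEVERITY'].isin(['susp_serious_injury', 'killed']).astype(int)
)

# Base rate of the combined cohort implied by the counts above
base_rate = (1609 + 338) / 26882 * 100
print(f"combined cohort: {base_rate:.2f}% of all cyclist samples")
```

The combined cohort is roughly 7% of all samples, nearly six times the size of the fatality cohort alone.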

I will focus on several types of features:

  • Features related to the cyclist:
    • demographic features: AGE_BINS, SEX
    • cyclist helmet feature: RESTRAINT_HELMET
    • features describing movement and position of the bicycle at the time of the crash: VEH_MOVEMENT, VEH_POSITION
    • features describing the bicycle's involvement in the crash: VEH_ROLE, IMPACT_SIDE
  • Features related to the crash event:
    • chronological features: DAY_OF_WEEK, HOUR_OF_DAY
    • manner of crash: COLLISION_TYPE
    • environment/location features: WEATHER, ILLUMINATION, INTERSECT_TYPE, LOCATION_TYPE, RELATION_TO_ROAD, URBAN_RURAL
    • roadway features: RDWY_ALIGNMENT, GRADE, SPEED_LIMIT, ROAD_CONDITION
    • traffic control device features: TCD_TYPE, TCD_FUNC_CD
  • Binary features indicating certain factors in the crash:
    • motor vehicle factors: BUS, SUV, SMALL_TRUCK, HEAVY_TRUCK, VAN, COMM_VEHICLE
    • driver behavior factors: AGGRESSIVE_DRIVING, NHTSA_AGG_DRIVING, CROSS_MEDIAN, LANE_DEPARTURE, RUNNING_STOP_SIGN, RUNNING_RED_LT, SPEEDING_RELATED, TAILGATING
    • driver condition factors: CELL_PHONE, DISTRACTED, DRINKING_DRIVER, DRUGGED_DRIVER, FATIGUE_ASLEEP, IMPAIRED_DRIVER
    • driver demographic factors: MATURE_DRIVER, YOUNG_DRIVER

Note that some categories are very rare! In the charts, I only include feature categories which are represented in at least 0.5% of some cohort of cyclist samples.
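
The 0.5% inclusion rule can be sketched as follows. This is an illustration under assumptions, not the code of feat_perc_comp: the function name perc_comparison, the column SERIOUS_OR_FATAL, and the toy data (including the 'fog' category) are all hypothetical.

```python
import pandas as pd

def perc_comparison(df, feature, target, min_perc=0.5):
    """Percentage breakdown of `feature` among all rows and among rows with target == 1."""
    all_perc = df[feature].value_counts(normalize=True) * 100
    sub_perc = df.loc[df[target] == 1, feature].value_counts(normalize=True) * 100
    table = pd.concat(
        [all_perc, sub_perc], axis=1, keys=['all', 'serious_or_fatal']
    ).fillna(0)
    # Keep a category if it reaches the threshold in at least one cohort
    return table[(table >= min_perc).any(axis=1)]

# Toy data: 'fog' makes up 0.2% of all rows and 0% of the serious cohort
toy = pd.DataFrame({
    'WEATHER': ['clear'] * 990 + ['rain'] * 8 + ['fog'] * 2,
    'SERIOUS_OR_FATAL': [1] * 300 + [0] * 700,
})
result = perc_comparison(toy, 'WEATHER', 'SERIOUS_OR_FATAL')
print(result)
```

Here 'rain' is kept because it reaches 0.8% among all rows even though it is absent from the serious cohort, while 'fog' falls below the threshold in both cohorts and is dropped.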

In [3]:
cat_features = ['AGE_BINS', 'SEX','RESTRAINT_HELMET',
                'VEH_MOVEMENT', 'VEH_POSITION','VEH_ROLE',
                'IMPACT_SIDE','URBAN_RURAL',
                'DAY_OF_WEEK','HOUR_OF_DAY','COLLISION_TYPE',
                'RDWY_ALIGNMENT','GRADE','SPEED_LIMIT','ROAD_CONDITION','WEATHER',
                'ILLUMINATION','INTERSECT_TYPE','LOCATION_TYPE', 'RELATION_TO_ROAD',
                'TCD_TYPE', 'TCD_FUNC_CD']
flag_features = ['BUS', 'HEAVY_TRUCK', 'SMALL_TRUCK', 'SUV','COMM_VEHICLE', 
                 'RUNNING_STOP_SIGN','RUNNING_RED_LT','SPEEDING_RELATED', 'TAILGATING',
                 'CROSS_MEDIAN', 'LANE_DEPARTURE','AGGRESSIVE_DRIVING','NHTSA_AGG_DRIVING',
                 'CELL_PHONE','DISTRACTED','DRINKING_DRIVER', 'DRUGGED_DRIVER',
                 'FATIGUE_ASLEEP','IMPAIRED_DRIVER',
                 'MATURE_DRIVER','YOUNG_DRIVER']

Feature breakdown visualization¶

In [4]:
from lib.vis_data import feat_perc_comp

# Define scheme for HTML output of dataframes, can have multiple per line
output=''
tables = [feat_perc_comp(feat,cyclists) for feat in cat_features]
tables = [(x,x.data.shape[0]) for x in tables]
tables = sorted(tables,key=lambda x:x[1],reverse=True)
tables = [table[0] for table in tables]
tables = tables+[feat_perc_comp(feat,cyclists) for feat in flag_features]
for table in tables:
    output += table._repr_html_()
    output += "\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"
display_html(output,raw=True)
Breakdown of HOUR_OF_DAY among cyclist groups
HOUR_OF_DAY all cyclists cyclists with serious injury or fatality
0.000000 0.97% 1.39%
1.000000 0.63% 0.98%
2.000000 0.52% 1.13%
3.000000 0.28% 0.72%
5.000000 0.70% 1.13%
6.000000 1.67% 2.72%
7.000000 2.99% 3.08%
8.000000 3.29% 2.57%
9.000000 3.02% 4.21%
10.000000 3.44% 2.77%
11.000000 4.25% 3.70%
12.000000 5.34% 4.57%
13.000000 5.57% 5.24%
14.000000 6.33% 6.53%
15.000000 8.80% 8.63%
16.000000 10.89% 9.25%
17.000000 10.91% 9.25%
18.000000 9.23% 9.35%
19.000000 7.50% 6.37%
20.000000 5.44% 5.81%
21.000000 4.05% 4.78%
22.000000 2.36% 3.39%
23.000000 1.58% 2.11%
            
Breakdown of VEH_MOVEMENT among cyclist groups
VEH_MOVEMENT all cyclists cyclists with serious injury or fatality
straight 80.84% 72.76%
turning_left 5.07% 7.42%
other 4.26% 5.51%
turning_right 1.95% 1.58%
changing_merging 1.86% 2.89%
passing_vehicle 1.44% 1.26%
curve_left 1.05% 2.78%
curve_right 0.91% 2.62%
stopped_in_lane 0.61% 0.71%
slowing_or_stopping_in_lane 0.55% 0.33%
avoiding 0.51% 0.60%
entering_lane 0.41% 0.66%
            
Breakdown of VEH_POSITION among cyclist groups
VEH_POSITION all cyclists cyclists with serious injury or fatality
right_lane_curb 35.41% 44.74%
other 20.11% 12.22%
unknown 10.11% 7.19%
oncoming_lane 7.59% 8.58%
shoulder_right 5.69% 8.94%
one_lane_road 4.74% 3.24%
right_of_trafficway 4.57% 4.57%
other_forward_lane 3.95% 2.57%
left_lane 2.48% 3.13%
left_of_trafficway 2.26% 2.36%
shoulder_left 1.77% 1.18%
right_lane 0.67% 0.46%
            
Breakdown of IMPACT_SIDE among cyclist groups
IMPACT_SIDE all cyclists cyclists with serious injury or fatality
front 39.49% 35.23%
unknown 22.24% 15.20%
left 12.85% 15.82%
rear 9.18% 15.46%
right 8.95% 9.60%
front_left 2.62% 3.90%
front_right 1.79% 1.39%
rear_left 1.04% 1.18%
non_collision 0.84% 0.98%
rear_right 0.76% 0.87%
            
Breakdown of AGE_BINS among cyclist groups
AGE_BINS all cyclists cyclists with serious injury or fatality
(0, 10] 12.46% 9.62%
(10, 20] 35.60% 32.20%
(20, 30] 18.60% 13.90%
(30, 40] 10.61% 11.92%
(40, 50] 10.55% 12.02%
(50, 60] 7.95% 11.76%
(60, 70] 3.11% 5.96%
(70, 80] 0.88% 1.99%
(80, 90] 0.21% 0.63%
            
Breakdown of COLLISION_TYPE among cyclist groups
COLLISION_TYPE all cyclists cyclists with serious injury or fatality
angle 67.80% 63.51%
sideswipe_same_dir 10.85% 8.27%
rear_end 6.25% 11.77%
hit_ped 5.67% 5.04%
head_on 5.33% 7.50%
sideswipe_opp_dir 2.80% 1.85%
other 0.49% 0.57%
non_collision 0.34% 0.72%
hit_fixed_obj 0.20% 0.62%
            
Breakdown of SPEED_LIMIT among cyclist groups
SPEED_LIMIT all cyclists cyclists with serious injury or fatality
15 3.41% 1.64%
20 2.37% 2.31%
25 70.05% 53.31%
30 2.69% 2.98%
35 13.46% 19.93%
40 3.48% 8.22%
45 2.88% 6.78%
50 0.19% 0.51%
55 1.25% 4.16%
            
Breakdown of DAY_OF_WEEK among cyclist groups
DAY_OF_WEEK all cyclists cyclists with serious injury or fatality
Sun 11.49% 12.28%
Mon 14.86% 13.66%
Tues 14.78% 12.28%
Wed 15.35% 15.67%
Thurs 14.77% 14.48%
Fri 15.41% 17.46%
Sat 13.34% 14.18%
            
Breakdown of GRADE among cyclist groups
GRADE all cyclists cyclists with serious injury or fatality
level 70.37% 62.40%
unknown 12.90% 8.89%
downhill 11.96% 19.16%
uphill 2.83% 4.88%
bottom_hill 1.42% 3.29%
top_hill 0.51% 1.39%
            
Breakdown of INTERSECT_TYPE among cyclist groups
INTERSECT_TYPE all cyclists cyclists with serious injury or fatality
four_way 42.47% 32.92%
midblock 34.70% 46.02%
T 19.60% 17.62%
Y 1.44% 1.69%
multi_leg 1.05% 1.03%
other 0.52% 0.26%
            
Breakdown of ILLUMINATION among cyclist groups
ILLUMINATION all cyclists cyclists with serious injury or fatality
daylight 76.07% 70.16%
dark_lit 17.20% 17.26%
dusk 3.14% 2.67%
dark_unlit 2.86% 8.27%
dawn 0.59% 1.59%
            
Breakdown of RELATION_TO_ROAD among cyclist groups
RELATION_TO_ROAD all cyclists cyclists with serious injury or fatality
on_roadway 92.25% 91.94%
shoulder 4.37% 5.84%
roadside 1.76% 1.34%
outside_trafficway 0.81% 0.52%
parking_lane 0.62% 0.10%
            
Breakdown of RESTRAINT_HELMET among cyclist groups
RESTRAINT_HELMET all cyclists cyclists with serious injury or fatality
no_restraint 76.12% 74.99%
bicycle_helmet 14.78% 20.03%
unknown 8.18% 4.01%
motorcycle_helmet 0.57% 0.62%
            
Breakdown of VEH_ROLE among cyclist groups
VEH_ROLE all cyclists cyclists with serious injury or fatality
struck 62.14% 67.64%
striking 36.05% 28.92%
striking_struck 1.38% 2.77%
non_collision 0.43% 0.67%
            
Breakdown of WEATHER among cyclist groups
WEATHER all cyclists cyclists with serious injury or fatality
clear 92.82% 92.47%
rain 6.10% 5.98%
cloudy 0.31% 0.52%
other 0.29% 0.57%
            
Breakdown of LOCATION_TYPE among cyclist groups
LOCATION_TYPE all cyclists cyclists with serious injury or fatality
not_applicable 91.64% 90.85%
driveway_parking_lot 6.97% 6.79%
ramp 0.54% 0.72%
bridge 0.38% 0.93%
            
Breakdown of TCD_TYPE among cyclist groups
TCD_TYPE all cyclists cyclists with serious injury or fatality
not_applicable 47.28% 56.96%
stop_sign 27.51% 23.04%
traffic_signal 24.30% 18.35%
other 0.35% 0.88%
            
Breakdown of URBAN_RURAL among cyclist groups
URBAN_RURAL all cyclists cyclists with serious injury or fatality
urban 77.37% 59.73%
rural 14.66% 32.10%
urbanized 7.97% 8.17%
            
Breakdown of TCD_FUNC_CD among cyclist groups
TCD_FUNC_CD all cyclists cyclists with serious injury or fatality
functioning_properly 50.60% 41.68%
no_controls 48.20% 57.64%
functioning_improperly 1.03% 0.57%
            
Breakdown of SEX among cyclist groups
SEX all cyclists cyclists with serious injury or fatality
M 81.84% 85.62%
F 18.16% 14.38%
            
Breakdown of RDWY_ALIGNMENT among cyclist groups
RDWY_ALIGNMENT all cyclists cyclists with serious injury or fatality
straight 97.00% 93.34%
curve 3.00% 6.66%
            
Breakdown of ROAD_CONDITION among cyclist groups
ROAD_CONDITION all cyclists cyclists with serious injury or fatality
dry 90.93% 90.75%
wet 8.19% 8.02%
            
Breakdown of BUS among cyclist groups
BUS all cyclists cyclists with serious injury or fatality
0 98.91% 98.46%
1 1.09% 1.54%
            
Breakdown of HEAVY_TRUCK among cyclist groups
HEAVY_TRUCK all cyclists cyclists with serious injury or fatality
0 98.77% 96.20%
1 1.23% 3.80%
            
Breakdown of SMALL_TRUCK among cyclist groups
SMALL_TRUCK all cyclists cyclists with serious injury or fatality
0 90.67% 86.39%
1 9.33% 13.61%
            
Breakdown of SUV among cyclist groups
SUV all cyclists cyclists with serious injury or fatality
0 84.44% 81.36%
1 15.56% 18.64%
            
Breakdown of COMM_VEHICLE among cyclist groups
COMM_VEHICLE all cyclists cyclists with serious injury or fatality
0 97.81% 94.71%
1 2.19% 5.29%
            
Breakdown of RUNNING_STOP_SIGN among cyclist groups
RUNNING_STOP_SIGN all cyclists cyclists with serious injury or fatality
0 98.59% 98.66%
1 1.41% 1.34%
            
Breakdown of RUNNING_RED_LT among cyclist groups
RUNNING_RED_LT all cyclists cyclists with serious injury or fatality
0 99.07% 98.92%
1 0.93% 1.08%
            
Breakdown of SPEEDING_RELATED among cyclist groups
SPEEDING_RELATED all cyclists cyclists with serious injury or fatality
0 98.27% 94.92%
1 1.73% 5.08%
            
Breakdown of TAILGATING among cyclist groups
TAILGATING all cyclists cyclists with serious injury or fatality
0 99.44% 99.54%
1 0.56% 0.46%
            
Breakdown of CROSS_MEDIAN among cyclist groups
CROSS_MEDIAN all cyclists cyclists with serious injury or fatality
0 99.45% 98.72%
1 0.55% 1.28%
            
Breakdown of LANE_DEPARTURE among cyclist groups
LANE_DEPARTURE all cyclists cyclists with serious injury or fatality
0 97.44% 94.14%
1 2.56% 5.86%
            
Breakdown of AGGRESSIVE_DRIVING among cyclist groups
AGGRESSIVE_DRIVING all cyclists cyclists with serious injury or fatality
0 78.05% 78.74%
1 21.95% 21.26%
            
Breakdown of NHTSA_AGG_DRIVING among cyclist groups
NHTSA_AGG_DRIVING all cyclists cyclists with serious injury or fatality
0 98.90% 98.05%
1 1.10% 1.95%
            
Breakdown of CELL_PHONE among cyclist groups
CELL_PHONE all cyclists cyclists with serious injury or fatality
0 99.61% 99.54%
            
Breakdown of DISTRACTED among cyclist groups
DISTRACTED all cyclists cyclists with serious injury or fatality
0 94.71% 93.58%
1 5.29% 6.42%
            
Breakdown of DRINKING_DRIVER among cyclist groups
DRINKING_DRIVER all cyclists cyclists with serious injury or fatality
0 98.66% 95.07%
1 1.34% 4.93%
            
Breakdown of DRUGGED_DRIVER among cyclist groups
DRUGGED_DRIVER all cyclists cyclists with serious injury or fatality
0 99.52% 97.64%
1 0.48% 2.36%
            
Breakdown of FATIGUE_ASLEEP among cyclist groups
FATIGUE_ASLEEP all cyclists cyclists with serious injury or fatality
0 99.76% 99.64%
            
Breakdown of IMPAIRED_DRIVER among cyclist groups
IMPAIRED_DRIVER all cyclists cyclists with serious injury or fatality
0 98.36% 93.58%
1 1.64% 6.42%
            
Breakdown of MATURE_DRIVER among cyclist groups
MATURE_DRIVER all cyclists cyclists with serious injury or fatality
0 89.10% 87.01%
1 10.90% 12.99%
            
Breakdown of YOUNG_DRIVER among cyclist groups
YOUNG_DRIVER all cyclists cyclists with serious injury or fatality
0 93.45% 91.58%
1 6.55% 8.42%
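
The tables above report the share of each feature value within a cohort. The quantity of interest stated at the outset, the percentage of cyclists seriously injured or killed when a factor is present, is the reverse conditional and follows from Bayes' rule. Below is a worked example using the IMPAIRED_DRIVER percentages above and the cohort counts from Out[16]; the function name is hypothetical, and the result is only approximate because the inputs are rounded percentages.

```python
def rate_given_factor(perc_in_serious, perc_in_all, base_rate):
    """P(serious or fatal | factor) via Bayes' rule; all arguments are fractions."""
    return perc_in_serious * base_rate / perc_in_all

# Share of all cyclist samples with serious injury or fatality
base_rate = (1609 + 338) / 26882

# IMPAIRED_DRIVER == 1: 6.42% of the serious/fatal cohort, 1.64% of all cyclists
impaired = rate_given_factor(0.0642, 0.0164, base_rate)
print(f"{base_rate:.1%} overall, {impaired:.1%} when IMPAIRED_DRIVER == 1")
```

Under these inputs, the rate of serious injury or fatality rises from roughly 7% overall to roughly 28% when an impaired driver is involved.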
            

Summary of feature value comparison¶

Each aspect of the crash below is listed with observations based on all cyclists, followed by any changes observed when restricting to those with serious injury or fatality.
cyclist age
  • All cyclists: 10-20 is the most common age range, followed by 20-30.
  • Serious injury or fatality: Older cyclists become more prevalent, dramatically so for those over 50.
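cyclist helmet use
  • All cyclists: The majority were recorded as not wearing a helmet, and over 2000 cyclists had 'unknown' helmet status.
  • Serious injury or fatality: The prevalence of bicycle helmets increases (14.78% to 20.03%), while the share recorded with no helmet decreases slightly (76.12% to 74.99%) and the 'unknown' share roughly halves.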
cyclist position, movement, road grade
  • All cyclists:
    • The vast majority were moving straight at the time of the collision. Aside from turning left, other movement categories are rare.
    • The most common position is in the right lane at the curb, but other positions are not uncommon.
    • The vast majority were traveling on level grade, but downhill collisions were not uncommon.
  • Serious injury or fatality:
    • The percentage moving straight decreases, and other categories increase: turning left, changing or merging, and navigating a curve.
    • The percentage in the right lane near the curb increases - this is the most prevalent position, but perhaps also a more dangerous one for cyclists.
    • The percentage on level road decreases and the percentage of every category involving hills increases.
collision time and day
  • All cyclists:
    • The incidence of cyclist collisions increased steadily throughout the day, peaking between 4pm and 6pm.
    • Overall, the vast majority of cyclists were involved in collisions during the daylight hours (when the vast majority of cycling occurs).
    • More cyclists were involved in collisions on weekdays than on weekends, which is consistent with my expectation that weekday commuting traffic would increase the incidence of collisions overall.
  • Serious injury or fatality:
    • The percentages for all hours from 10pm-6am increase to varying degrees, and the percentages for most daytime hours decrease.
    • The percentage for Friday increases and the percentages for Monday and Tuesday decrease. One might speculate that drivers lose patience towards the end of the workweek.
collision type and cyclist impact point
  • All cyclists:
    • Front impact points are most common.
    • Left, right, and rear impacts are not uncommon.
    • The vast majority of collisions are angle collisions.
    • The majority of cyclists were struck in the collision (as opposed to striking).
  • Serious injury or fatality:
    • Front impact points become less prevalent and left and rear become more prevalent.
    • Angle and sideswipe collisions become less prevalent and rear-end and head-on collisions more prevalent.
    • The cyclist being struck becomes a little more prevalent, and the percentage for striking and struck doubles.
urban, rural, or urbanized setting
  • All cyclists: The vast majority of cyclists were in a crash in an urban setting - consistent with my expectation that most cycling happens in urban settings.
  • Serious injury or fatality: The percentage in a rural setting more than doubles, and the percentage in an urban setting decreases - crashes in rural settings are more dangerous for cyclists.
speed limit where collision occurred
  • All cyclists: The vast majority of cyclists were traveling in a 25mph zone, a very common speed limit in urban settings.
  • Serious injury or fatality: The percentage traveling in higher speed limit zones increases.
illumination and weather-related conditions
  • All cyclists:
    • The vast majority were in daylight collisions - consistent with my expectation that most cycling happens in daylight.
    • The vast majority were in clear weather, and rain was the only other significant category at 6%.
    • The vast majority involved a dry road surface, and 8% a wet road surface.
  • Serious injury or fatality: The percentage traveling in daylight decreases, the percentage traveling in dark unlit conditions nearly triples, and the percentage traveling at dawn more than doubles.
crash location
  • The majority were in collisions in intersections (with 4-way the most common type), and 35% in midblock collisions.
  • About half were in collisions where traffic control devices are not relevant, and the other half roughly split between stop signs and traffic signals.
  • When present and controlled, traffic control devices were almost always functioning properly.
  • The roadway was straight in the vast majority of cases - only 3% of cyclists were in a collision on a curve. It's important to remember, though, that collisions in intersections are still classified as 'straight' if the roadway on which the cyclist is traveling does not curve in the area of the intersection.
  • The percentage in midblock collisions increases to 46%, which is consistent with my expectation that midblock areas are more dangerous for cyclists due to higher motorist speeds.
  • The percentage in areas where traffic control devices are not applicable increases significantly, which is consistent with the increased prevalence of midblock collisions, since these devices tend to be at intersections.
  • The percentage in areas with uncontrolled traffic control devices (e.g. stop signs) increases, which is consistent with my expectation that intersections with stop signs are more dangerous for cyclists than intersections with traffic signals.
  • The percentage of cyclists in collisions along a curved roadway more than doubled.
vehicle flags
  • Few cyclists were in collisions involving heavy trucks, buses, or commercial vehicles (1-2% for each).
  • Few cyclists were in collisions involving drinking drivers.
The percentages for buses, small trucks, and SUVs increase significantly. The percentage for heavy trucks triples, and the percentage for commercial vehicles doubles.
driver condition or behavior flags
  • Very few cyclists were involved in collisions which were speeding-related or in which a vehicle ran a stop sign, ran a red light, was tailgating, crossed the median, or made a lane departure (less than 3% for each).
  • Very few cyclists were involved in collisions with drinking drivers or drugged drivers (less than 2% for each).
  • While very few cyclists were involved in collisions with drivers flagged as using cell phones or fatigued or asleep, 5% were involved in collisions with distracted drivers.
  • Around 22% of cyclists were involved in collisions with drivers exhibiting at least one aggressive driving behavior, but only 1% with drivers exhibiting at least two different aggressive driving behaviors (i.e. NHTSA-qualified aggressive driving).
  • The percentage of cyclists in collisions with at least one NHTSA-qualifying aggressive driver increases.
  • The percentage for drinking drivers more than triples, and the percentage for drugged drivers increases five-fold.
  • The percentage in speeding-related collisions triples.
  • The percentage for distracted drivers increases.
  • The percentage in collisions in which a vehicle made a lane departure or crossed the median doubles.
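The observations above all come from the same comparison: the share of each category among all cyclists versus its share among cyclists who suffered serious injury or fatality. A minimal sketch of that comparison follows, run on invented toy data. The helper name `compare_distributions` and the toy values are illustrative assumptions and are not part of the notebook's own code.

```python
import pandas as pd

# Toy stand-in for the 'cyclists' dataframe; the values are invented for illustration.
toy = pd.DataFrame({
    'URBAN_RURAL': ['urban', 'urban', 'urban', 'rural', 'urban', 'rural', 'urban', 'urban'],
    'SERIOUS_OR_FATALITY': [0, 0, 1, 1, 0, 1, 0, 0],
})

def compare_distributions(df, feature, target='SERIOUS_OR_FATALITY'):
    """Share of each category among all rows vs. among rows where target == 1."""
    overall = df[feature].value_counts(normalize=True)
    serious = df.loc[df[target] == 1, feature].value_counts(normalize=True)
    out = pd.DataFrame({'all cyclists': overall, 'serious or fatal': serious}).fillna(0)
    # A ratio above 1 means the category is overrepresented in the serious/fatal cohort.
    out['ratio'] = out['serious or fatal'] / out['all cyclists']
    return out

comparison = compare_distributions(toy, 'URBAN_RURAL')
print(comparison)
```

A ratio well above 1 for a category (as for 'rural' in the toy data) corresponds to statements such as "the percentage in a rural setting more than doubles".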

Visualizing how pairing factors affects chance of serious injury or fatality¶

Approximately 7.4% of cyclists in the dataset suffered serious injury or fatality. In the previous section, I identified some particular factors which seem to especially strongly affect the percentage of cyclists suffering serious injury or fatality:

  • Cyclist aged 50 or older
  • Cyclist traveling on a curved road or on a hilly grade
  • Collision being rear-end type
  • Cyclist being both striking and struck
  • Collision occurring midblock
  • Collision in a rural setting
  • Collision in dark unlit conditions or at dawn
  • Collision occurring on a Friday
  • Collision being speeding-related
  • Involvement of a truck, SUV, or commercial vehicle
  • Involvement of a drugged driver, drinking driver, or distracted driver
  • Involvement of NHTSA-qualifying aggressive driver
  • Involvement of a young driver (under 20yo) or a mature driver (over 65yo)
In [5]:
cyclists['OVER_50'] = (cyclists.AGE>=50).astype('int')

filters = {'cyclist over 50':cyclists.OVER_50==1,
           'striking and struck':cyclists.VEH_ROLE=='striking_struck',
           'rear-end':cyclists.COLLISION_TYPE=='rear_end',
           'rear impact':cyclists.IMPACT_SIDE=='rear',
           'curved road':cyclists.RDWY_ALIGNMENT=='curve',
           'hill':~cyclists.GRADE.isin(['level','unknown']),
           'midblock':cyclists.INTERSECT_TYPE=='midblock',
           'rural':cyclists.URBAN_RURAL=='rural',
           'dark unlit':cyclists.ILLUMINATION=='dark_unlit',
           'dawn':cyclists.ILLUMINATION=='dawn',
           'Friday':cyclists.DAY_OF_WEEK==6,
           'speeding related':cyclists.SPEEDING_RELATED==1,
           'heavy truck':cyclists.HEAVY_TRUCK==1,
           'small truck':cyclists.SMALL_TRUCK==1,
           'SUV':cyclists.SUV==1,
           'commercial vehicle':cyclists.COMM_VEHICLE==1,
           'drugged driver':cyclists.DRUGGED_DRIVER==1,
           'drinking driver':cyclists.DRINKING_DRIVER==1,
           'distracted driver':cyclists.DISTRACTED==1,
           'NHTSA agg driver':cyclists.NHTSA_AGG_DRIVING==1,
           'driver under 20':cyclists.YOUNG_DRIVER==1,
           'driver over 65':cyclists.MATURE_DRIVER==1}
In [6]:
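# Share of cyclists with serious injury or fatality within each filtered subset
# (iterating over the masks directly avoids shadowing the builtin `filter`)
percents = [cyclists[mask].SERIOUS_OR_FATALITY.sum()/cyclists[mask].shape[0] for mask in filters.values()]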
percents = pd.DataFrame({'crash factor':filters.keys(),
              'serious injury or fatality percentage':percents}).set_index('crash factor')\
            .sort_values(by='serious injury or fatality percentage',ascending=False).transpose()
format_dict={col:'{:.2%}' for col in percents.columns}
percents.style.format(format_dict).background_gradient(axis=None,cmap='bwr',gmap=percents,vmin=.074-0.5,vmax=.074+0.5)
Out[6]:
crash factor drugged driver drinking driver heavy truck speeding related dark unlit dawn commercial vehicle curved road rural striking and struck rear-end NHTSA agg driver hill rear impact cyclist over 50 small truck midblock driver under 20 distracted driver SUV driver over 65 Friday
serious injury or fatality percentage 35.38% 26.74% 22.36% 21.29% 20.91% 19.50% 17.49% 16.34% 15.86% 14.52% 13.64% 12.84% 12.43% 12.20% 11.73% 10.56% 9.61% 9.31% 8.78% 8.68% 8.63% 8.21%

Note that all of these factors increase the probability that a cyclist will suffer a serious injury or fatality - with some having a very dramatic effect. When pairing these factors, I expect the effect to compound - e.g. I expect that a speeding-related collision involving a drinking driver will lead to a higher chance of serious injury or death than either factor alone.
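One way to quantify how dramatic each effect is: divide the conditioned percentage by the overall 7.4% baseline. The sketch below does this for a handful of factors, with percentages copied from the single-factor table above. The variable names are illustrative, and the ratio is only a descriptive summary, not a causal estimate.

```python
import pandas as pd

baseline = 0.074  # overall share of cyclists with serious injury or fatality

# A few single-factor percentages copied from the table above
rates = pd.Series({
    'drugged driver': 0.3538,
    'drinking driver': 0.2674,
    'heavy truck': 0.2236,
    'speeding related': 0.2129,
    'rural': 0.1586,
    'midblock': 0.0961,
    'Friday': 0.0821,
})

# Ratio of the conditioned percentage to the baseline percentage
ratio = (rates / baseline).round(2)
more_than_double = ratio[ratio > 2].index.tolist()
print(ratio)
print(more_than_double)
```

Factors such as 'Friday' sit only slightly above the baseline, while the driver-impairment factors are several times the baseline.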

The following table gives, for each pairing of factors from the above list, the percentage of cyclists suffering serious injury or fatality among those involved in a crash with BOTH of those factors. The diagonal entries are just the percentages from the above table, i.e. the percentages corresponding to single factors. Percentages are omitted for pairs with fewer than 27 samples (i.e. pairs that account for less than 0.1% of the entire dataset). Lowering this threshold would reveal more percentages, but observations gleaned from very small sample sets are unlikely to support reasonable conclusions about the distribution from which the dataset is drawn.
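To illustrate why percentages from very small subsets are unreliable, here is a sketch that computes an approximate 95% Wilson score interval for a proportion. The counts are hypothetical and chosen only to contrast a subset at the 27-sample threshold with one a hundred times larger. The notebook itself does not compute confidence intervals; this is an added illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: 5 serious outcomes among 27 cyclists vs. 500 among 2700
small = wilson_interval(5, 27)
large = wilson_interval(500, 2700)
print(small, large)
```

Both subsets have the same observed percentage, but the interval for the 27-sample subset is far wider, so a percentage reported for such a cell could easily differ substantially from the underlying rate.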

In the following table:

  • A cell is greyed out if that pair of factors corresponds to fewer than 0.1% of samples (i.e. fewer than 27 samples in our case)
  • A diagonal cell contains the percentage of cyclists suffering serious injury or fatality among those involved in crashes exhibiting that one factor (yellow cells)
  • An off-diagonal cell corresponds to the percentage of cyclists suffering serious injury or fatality among those involved in crashes exhibiting that pair of factors (red or blue cells):
    • A cell is some shade of red if its percentage is larger than the yellow percentage in the same column, i.e. if its pair of factors gives a higher probability of serious injury or fatality than the single factor corresponding to its column; the shade indicates the degree of change in probability
    • A cell is some shade of blue if its percentage is smaller than the yellow percentage in the same column, i.e. if its pair of factors gives a lower probability of serious injury or fatality than the single factor corresponding to its column; the shade indicates the degree of change in probability
In [7]:
from lib.vis_data import crosstab_percent, stylize_dataframe
_, percents = crosstab_percent(filters,cyclists)
stylize_dataframe(percents)
  cyclist over 50 striking and struck rear-end rear impact curved road hill midblock rural dark unlit dawn Friday speeding related heavy truck small truck SUV commercial vehicle drugged driver drinking driver distracted driver NHTSA agg driver driver under 20 driver over 65
cyclist over 50 11.73% 26.67% 20.22% 23.48% 24.54% 21.7% 15.69% 19.7% 23.85% 42.11% 13.03% 37.31% 27.45% 16.2% 13.29% 17.39% 28.3% 15.53% 19.44% 13.0% 14.78%
striking and struck 26.67% 14.52% 28.12% 25.0% 23.29% 14.13% 27.87% 25.42% 13.64% 15.52% 17.24% 18.42%
rear-end 20.22% 28.12% 13.64% 18.47% 31.75% 24.07% 17.46% 27.46% 37.34% 16.91% 27.47% 17.2% 19.43% 23.81% 46.67% 38.57% 18.39% 17.17% 22.02%
rear impact 23.48% 25.0% 18.47% 12.2% 21.9% 20.67% 19.16% 23.62% 31.51% 13.3% 30.16% 30.0% 18.14% 16.46% 20.93% 56.41% 45.78% 21.14% 28.57% 15.79% 16.0%
curved road 24.54% 31.75% 21.9% 16.34% 22.87% 20.3% 22.34% 42.5% 18.0% 44.44% 22.12% 21.67% 42.86% 19.05% 29.73% 13.68%
hill 21.7% 23.29% 24.07% 20.67% 22.87% 12.43% 16.13% 21.95% 26.52% 25.93% 15.18% 31.96% 33.33% 18.44% 14.05% 30.43% 70.37% 40.28% 15.16% 19.67% 14.96% 12.75%
midblock 15.69% 14.13% 17.46% 19.16% 20.3% 16.13% 9.61% 18.01% 27.39% 48.98% 10.68% 27.62% 24.46% 12.94% 11.52% 19.61% 50.77% 33.7% 13.21% 16.67% 12.2% 11.84%
rural 19.7% 27.87% 27.46% 23.62% 22.34% 21.95% 18.01% 15.86% 27.6% 33.33% 19.44% 39.85% 28.42% 17.99% 16.95% 25.45% 60.0% 31.52% 17.0% 23.53% 20.27% 15.18%
dark unlit 23.85% 37.34% 31.51% 42.5% 26.52% 27.39% 27.6% 20.91% 21.14% 50.0% 21.51% 23.48% 35.85% 16.07% 24.36% 22.89%
dawn 42.11% 25.93% 48.98% 33.33% 19.5% 20.0% 16.67%
Friday 13.03% 25.42% 16.91% 13.3% 18.0% 15.18% 10.68% 19.44% 21.14% 20.0% 8.21% 21.43% 17.74% 12.17% 10.55% 20.37% 30.56% 11.54% 8.89% 9.89% 10.23%
speeding related 37.31% 27.47% 30.16% 44.44% 31.96% 27.62% 39.85% 50.0% 21.43% 21.29% 36.36% 28.3% 51.43% 27.5% 25.49% 25.93%
heavy truck 27.45% 30.0% 33.33% 24.46% 28.42% 17.74% 22.36% 25.63%
small truck 16.2% 13.64% 17.2% 18.14% 22.12% 18.44% 12.94% 17.99% 21.51% 12.17% 36.36% 10.56% 31.75% 12.24% 22.5% 17.89% 12.5%
SUV 13.29% 15.52% 19.43% 16.46% 21.67% 14.05% 11.52% 16.95% 23.48% 16.67% 10.55% 28.3% 8.68% 42.59% 10.28% 7.55% 13.66% 10.95%
commercial vehicle 17.39% 23.81% 20.93% 30.43% 19.61% 25.45% 20.37% 25.63% 17.49% 27.27%
drugged driver 46.67% 56.41% 70.37% 50.77% 60.0% 35.38% 36.17%
drinking driver 28.3% 38.57% 45.78% 42.86% 40.28% 33.7% 31.52% 35.85% 30.56% 51.43% 31.75% 42.59% 36.17% 26.74% 24.14%
distracted driver 15.53% 17.24% 18.39% 21.14% 19.05% 15.16% 13.21% 17.0% 16.07% 11.54% 12.24% 10.28% 27.27% 8.78% 11.61% 10.56%
NHTSA agg driver 19.44% 28.57% 19.67% 16.67% 23.53% 8.89% 27.5% 22.5% 7.55% 12.84% 18.52% 17.5%
driver under 20 13.0% 17.17% 15.79% 29.73% 14.96% 12.2% 20.27% 24.36% 9.89% 25.49% 17.89% 13.66% 11.61% 18.52% 9.31%
driver over 65 14.78% 18.42% 22.02% 16.0% 13.68% 12.75% 11.84% 15.18% 22.89% 10.23% 25.93% 12.5% 10.95% 24.14% 10.56% 17.5% 8.63%
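The implementation of `crosstab_percent` lives in `lib.vis_data` and is not shown here. As a rough idea of how a table like the one above could be computed, the following sketch intersects each pair of boolean masks and leaves a cell empty when too few rows remain. The function `pairwise_rates`, its parameters, and the toy data are hypothetical stand-ins, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

def pairwise_rates(filters, df, target='SERIOUS_OR_FATALITY', min_count=27):
    """Rate of target == 1 among rows satisfying each pair of boolean masks.

    Cells backed by fewer than min_count rows are left as NaN.
    """
    names = list(filters)
    rates = pd.DataFrame(np.nan, index=names, columns=names)
    for a in names:
        for b in names:
            both = filters[a] & filters[b]   # a == b gives the single-factor rate
            n = int(both.sum())
            if n >= min_count:
                rates.loc[a, b] = df.loc[both, target].mean()
    return rates

# Toy data, invented for illustration
toy = pd.DataFrame({
    'RURAL': [1, 1, 1, 0, 0, 0, 1, 0],
    'DARK':  [1, 1, 0, 0, 1, 0, 0, 0],
    'SERIOUS_OR_FATALITY': [1, 0, 0, 0, 1, 0, 1, 0],
})
toy_filters = {'rural': toy.RURAL == 1, 'dark': toy.DARK == 1}
table = pairwise_rates(toy_filters, toy, min_count=2)
print(table)
```

The resulting table is symmetric by construction, with single-factor rates on the diagonal, which matches the layout described above.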

Many of these pairs have small sample sizes. I focus on a subset of this list:

In [8]:
few_filters = {'cyclist over 50':cyclists.OVER_50==1,
               'rear-end':cyclists.COLLISION_TYPE=='rear_end',
               'curved road':cyclists.RDWY_ALIGNMENT=='curve',
               'hill':~cyclists.GRADE.isin(['level','unknown']),
               'midblock':cyclists.INTERSECT_TYPE=='midblock',
               'rural':cyclists.URBAN_RURAL=='rural',
               'dark unlit':cyclists.ILLUMINATION=='dark_unlit',
               'speeding related':cyclists.SPEEDING_RELATED==1,
               'SUV':cyclists.SUV==1,
               'drinking driver':cyclists.DRINKING_DRIVER==1,
              }
_,percents = crosstab_percent(few_filters, cyclists)
stylize_dataframe(percents)
  cyclist over 50 rear-end curved road hill midblock rural dark unlit speeding related SUV drinking driver
cyclist over 50 11.73% 20.22% 24.54% 21.7% 15.69% 19.7% 23.85% 37.31% 13.29% 28.3%
rear-end 20.22% 13.64% 31.75% 24.07% 17.46% 27.46% 37.34% 27.47% 19.43% 38.57%
curved road 24.54% 31.75% 16.34% 22.87% 20.3% 22.34% 42.5% 44.44% 21.67% 42.86%
hill 21.7% 24.07% 22.87% 12.43% 16.13% 21.95% 26.52% 31.96% 14.05% 40.28%
midblock 15.69% 17.46% 20.3% 16.13% 9.61% 18.01% 27.39% 27.62% 11.52% 33.7%
rural 19.7% 27.46% 22.34% 21.95% 18.01% 15.86% 27.6% 39.85% 16.95% 31.52%
dark unlit 23.85% 37.34% 42.5% 26.52% 27.39% 27.6% 20.91% 50.0% 23.48% 35.85%
speeding related 37.31% 27.47% 44.44% 31.96% 27.62% 39.85% 50.0% 21.29% 28.3% 51.43%
SUV 13.29% 19.43% 21.67% 14.05% 11.52% 16.95% 23.48% 28.3% 8.68% 42.59%
drinking driver 28.3% 38.57% 42.86% 40.28% 33.7% 31.52% 35.85% 51.43% 42.59% 26.74%

Notice that for this smaller sublist of crash factors, every pair gives a higher probability of a cyclist suffering serious injury or fatality than either individual factor in the pair - and many give a much higher probability. For some factors, the effect is very dramatic - e.g. drinking driver, speeding related, SUV, and dark unlit compound one another quite a bit.

In [9]:
few_filters = {
               'curved road':cyclists.RDWY_ALIGNMENT=='curve',
               'dark unlit':cyclists.ILLUMINATION=='dark_unlit',
               'speeding related':cyclists.SPEEDING_RELATED==1,
               'drinking driver':cyclists.DRINKING_DRIVER==1,
              }
_,percents = crosstab_percent(few_filters,cyclists)
stylize_dataframe(percents)
  curved road dark unlit speeding related drinking driver
curved road 16.34% 42.5% 44.44% 42.86%
dark unlit 42.5% 20.91% 50.0% 35.85%
speeding related 44.44% 50.0% 21.29% 51.43%
drinking driver 42.86% 35.85% 51.43% 26.74%
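To read the compounding effect off the four-factor table, each off-diagonal cell can be compared with the larger of its two single-factor (diagonal) percentages. The sketch below does this with the values copied from the table above. The variable names are illustrative, and the "excess" measure is just one simple way to express compounding.

```python
import numpy as np
import pandas as pd

factors = ['curved road', 'dark unlit', 'speeding related', 'drinking driver']
# Percentages copied from the four-factor table above
pct = pd.DataFrame(
    [[16.34, 42.50, 44.44, 42.86],
     [42.50, 20.91, 50.00, 35.85],
     [44.44, 50.00, 21.29, 51.43],
     [42.86, 35.85, 51.43, 26.74]],
    index=factors, columns=factors)

single = np.diag(pct.values)
# Larger of the two single-factor percentages for each (row, column) pair
larger_single = np.maximum.outer(single, single)
# Percentage points by which the pair exceeds the worse of its two single factors
excess = pct - larger_single

# Keep each unordered pair once (strict upper triangle), largest excess first
upper = np.triu(np.ones(pct.shape, dtype=bool), k=1)
pairs = excess.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```

Every one of the six pairs has a positive excess, and the largest is for speeding-related crashes in dark unlit conditions.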

Summarization of findings¶

  1. The annual count of crashes involving cyclists in PA has shown a consistent downward trend since 2004, decreasing from above 1600 incidents to below 800 incidents in 2021. However, the annual counts of crashes involving serious cyclist injury or fatality have not declined significantly. In fact, in 2021 there were 103 crashes involving serious cyclist injury and 24 involving cyclist death - both the highest annual counts in this 20-year dataset!
  2. Regarding the distributions of certain crash features and their relationship with cyclist injury severity:
    • The majority of cyclists in collisions are between 10-30 years of age. However, older cyclists are overrepresented among cyclists suffering serious injury or fatality.
    • Around 75% of cyclists in collisions are traveling in a zone with a speed limit of 25mph or below, presumably due to the prevalence of low speed limits in urban settings. However, almost half of cyclists suffering serious injury or fatality were traveling in higher speed limit zones.
    • Midblock collisions were overrepresented among cyclists who suffered serious injury or fatality, possibly due to the higher vehicle speeds seen at midblock - 46% of cyclists suffering serious injury or fatality were in midblock collisions, as opposed to 35% of all cyclists.
    • 7.4% of cyclists involved in crashes suffered serious injury or fatality. Conditioning on any of the following crash factors more than doubles this percentage (corresponding percentages in parentheses):
      • Involvement of at least one drugged driver (35.4%) or drinking driver (26.7%)
      • Involvement of at least one heavy truck (22.4%) or commercial vehicle (17.5%)
      • The crash being speeding-related (21.3%)
      • The crash occurring in a dark unlit setting (20.9%) or at dawn (19.5%)
      • The crash occurring on a curved roadway (16.3%)
      • The crash occurring in a rural setting (15.9%)
    • When conditioned upon some pairs of these factors, the percentage of cyclists suffering serious injury or fatality surpasses 40%:
      • Speeding-related crashes with a drinking driver involved (51.4%)
      • Speeding-related crashes in dark unlit conditions (50%)
      • Speeding-related crashes on a curved roadway (44.4%)
      • Crashes involving a drinking driver on a curved roadway (42.9%)
      • Crashes on a curved roadway in dark unlit conditions (42.5%)