In the second phase of the project, I explore the datasets 'cyclists' and 'crashes' and visualize some trends. Recall that I am particularly interested in the percentage of cyclists suffering serious injury or fatality among cyclists involved in crashes, and in how this percentage changes when I restrict to crashes in which certain factors are present.
Using the 'crashes' dataframe, I will:
Using the 'cyclists' dataframe, I will:
First, import the necessary libraries and load in the dataframes prepared in Part I.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sns
from IPython.display import display, display_html
from ipywidgets import widgets
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth',None)
plt.style.use('fivethirtyeight')
import sys
np.set_printoptions(threshold=sys.maxsize)
cyclists = pd.read_csv('cyclists.csv')
crashes = pd.read_csv('crashes.csv')
The plot_over_time function will create three subplots according to desired input time period:
from lib.vis_data import plot_over_time
plot_over_time(crashes)
Below is a display of the year-over-year percent changes in each type of crash. Decreases are shaded blue and increases are shaded red; the intensity of the hue reflects the magnitude of the percent change.
from lib.vis_data import perc_change_table
perc_change_table(crashes)
| year | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| yearly change of: | |||||||||||||||||||
| all crashes | 2.3% | 3.9% | -13.4% | -2.1% | 7.2% | -3.0% | -4.7% | 6.0% | -11.1% | 5.1% | -0.4% | -5.1% | -3.0% | 2.3% | -12.4% | -15.7% | 5.7% | -19.6% | -7.3% |
| with serious injury | 16.5% | -9.1% | -28.9% | 17.2% | 22.7% | -12.0% | 1.2% | -19.5% | 1.5% | -11.9% | 10.2% | -27.7% | 14.9% | 79.6% | 0.0% | -3.1% | 5.3% | -29.3% | 47.1% |
| with fatality | 17.6% | -30.0% | 28.6% | -27.8% | 53.8% | -60.0% | 100.0% | 31.2% | -47.6% | 54.5% | -35.3% | 72.7% | -15.8% | 0.0% | 31.2% | -14.3% | -11.1% | 37.5% | 9.1% |
| with either | 17.8% | -12.6% | -21.2% | 7.3% | 27.3% | -21.4% | 10.2% | -10.3% | -10.3% | -2.6% | 0.0% | -13.2% | 6.1% | 61.4% | 4.4% | -5.1% | 2.7% | -20.0% | 37.0% |
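The internals of `perc_change_table` are not shown here, so the following is only a minimal sketch of how year-over-year percent changes like those above might be computed with pandas. The toy dataframe and the column name `CRASH_YEAR` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the 'crashes' dataframe; CRASH_YEAR is an assumed column name.
toy = pd.DataFrame({"CRASH_YEAR": [2002] * 4 + [2003] * 5 + [2004] * 3})

# Count crashes per year, then compute the change relative to the previous year.
yearly_counts = toy.groupby("CRASH_YEAR").size()
yearly_change = yearly_counts.pct_change().dropna()

# Format as signed percentages for display.
print(yearly_change.map("{:+.1%}".format))
```

The first year has no predecessor, which is why it is dropped; this matches the table above starting one year after the earliest crash year.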
Some observations based on these visualizations:
The conditions for cycling in urban settings are dramatically different from those in rural settings, and I expect that might have an effect on crash severity. I'll display the annual count graphs with rural crashes separated out from crashes in urban/urbanized settings via stacked bars:
plot_over_time(crashes,split_urban_rural=True)
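For readers who want the numbers behind the stacked bars, here is a rough sketch of how the urban/rural split might be tabulated. It assumes the crash data carries `CRASH_YEAR` and `URBAN_RURAL` columns (both assumptions for the 'crashes' dataframe) and uses toy values.

```python
import pandas as pd

# Toy data; the column names are assumed, and the values are illustrative only.
toy = pd.DataFrame({
    "CRASH_YEAR": [2002, 2002, 2002, 2003, 2003, 2003],
    "URBAN_RURAL": ["urban", "rural", "urbanized", "urban", "urban", "rural"],
})

# Collapse 'urban' and 'urbanized' into one group, keeping 'rural' separate.
setting = toy["URBAN_RURAL"].where(toy["URBAN_RURAL"] == "rural", "urban/urbanized")

# Rows are years, columns are the two settings, entries are crash counts.
counts = pd.crosstab(toy["CRASH_YEAR"], setting)
print(counts)
```

Calling `counts.plot(kind="bar", stacked=True)` on such a table would give a stacked bar chart of the same general shape.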
I can see a few things:
I'll continue to break down crash counts over time between rural and urban/urbanized settings.
plot_over_time(crashes, feature='CRASH_MONTH',label='month',split_urban_rural=True)
from lib.vis_data import perc_change_table
perc_change_table(crashes, period='month')
| month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| monthly change of: | ||||||||||||
| all crashes | -29.2% | -4.0% | 93.7% | 76.0% | 41.2% | 21.2% | 3.5% | 5.1% | -13.6% | -28.3% | -39.1% | -38.5% |
| with serious injury | -50.8% | 16.1% | 88.9% | 85.3% | 38.9% | 20.6% | 0.9% | 16.0% | -32.4% | -4.2% | -44.4% | -29.2% |
| with fatality | -35.0% | -38.5% | 100.0% | 25.0% | 100.0% | 2.5% | 29.3% | -24.5% | -5.0% | -26.3% | -25.0% | -4.8% |
| with either | -47.0% | 0.0% | 90.9% | 72.6% | 48.3% | 16.7% | 5.6% | 8.3% | -28.6% | -8.3% | -42.0% | -23.9% |
The above table should be read carefully. The category for a particular month contains all samples from that month from every year, and so the percent change entries under '1' should be interpreted as the percent change between the December total (for all years) and the January total (for all years).
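The wrap-around from December to January described above can be sketched as follows. This is an illustration with made-up monthly totals, not the implementation of `perc_change_table`.

```python
import numpy as np
import pandas as pd

# Made-up crash totals for months 1..12, each pooled over all years.
monthly_counts = pd.Series(
    [30, 28, 55, 90, 120, 150, 155, 160, 140, 100, 60, 40],
    index=range(1, 13),
)

# Previous month's total, with January wrapping around to December.
previous = pd.Series(np.roll(monthly_counts.to_numpy(), 1), index=monthly_counts.index)

# Percent change of each month relative to the month before it.
monthly_change = (monthly_counts - previous) / previous
print(monthly_change.map("{:+.1%}".format))
```

Unlike a plain `pct_change`, this leaves no missing value in the first position, since month 1 is compared against month 12.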
For urban/urbanized crashes:
For rural crashes:
I'll take a quick look at long-term seasonality by plotting each month's actual counts rather than summing over years.
from lib.vis_data import plot_month_series
plot_month_series(crashes)
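As a rough illustration of the difference between pooled monthly totals and actual month-by-month counts, the sketch below counts crashes per (year, month) pair. `CRASH_MONTH` appears in the calls above; `CRASH_YEAR` and the toy rows are assumptions, and this is not the code of `plot_month_series`.

```python
import pandas as pd

# Toy data; CRASH_YEAR is an assumed column name.
toy = pd.DataFrame({
    "CRASH_YEAR": [2002, 2002, 2002, 2003, 2003],
    "CRASH_MONTH": [6, 6, 7, 6, 7],
})

# One count per (year, month) pair instead of pooling each month across years.
month_series = toy.groupby(["CRASH_YEAR", "CRASH_MONTH"]).size()

# Replace the (year, month) index with monthly periods so it plots as a time series.
month_series.index = pd.PeriodIndex(
    [pd.Period(year=y, month=m, freq="M") for y, m in month_series.index]
)
print(month_series)
```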
Some additional observations regarding these monthly count plots:
plot_over_time(crashes, feature='DAY_OF_WEEK',label='day of the week',split_urban_rural=True)
For urban/urbanized crashes:
Rural crash variations during the week are quite different:
plot_over_time(crashes, feature='HOUR_OF_DAY',label='hour of the day',split_urban_rural=True,best_legend=True)
For urban/urbanized crashes:
For rural crashes:
Uncomment the following if you want to take a closer look at hourly counts for each day of the week:
# days = ['Sun','Mon','Tues','Wed','Thurs','Fri','Sat']
# for i,day in enumerate(days):
# plot_over_time(df = crashes[crashes.DAY_OF_WEEK==i+1], feature='HOUR_OF_DAY',label=f'hour of the day ({day})',kind='hist')
The function plot_map plots crash events on a map using DEC_LAT, DEC_LONG data. I won't do an in-depth analysis, but the visualization helps illustrate how the crash events are distributed throughout the state, and how the geographic distribution varies over time.
from lib.vis_data import plot_map
plot_map(crashes)
The plot_map function can also focus in on a particular municipality by passing in its municipality code and name. Philadelphia is the largest city and most significant source of crash events. In the period of 2002-2021:
plot_map(df=crashes,city=(67301,'Philadelphia'))
Uncomment and run the following cell to see maps from other major urban areas in PA:
# cities_to_plot = [(67301,'Philadelphia'),
# (2301,'Pittsburgh'),
# (39301,'Allentown'),
# (6301,'Reading'),
# (25302,'Erie'),
# (35302,'Scranton'),
# (22301,'Harrisburg'),
# (14410,'State College')
# ]
# for city in cities_to_plot:
# plot_map(crashes, city=city)
In this section, I shall visualize how the values of features break down over three cohorts:
Reminder: the data dictionary defines "serious injury" and "fatal injury" as:
pd.concat([cyclists.INJ_SEVERITY.value_counts()[['susp_serious_injury','killed']],pd.Series({'total':cyclists.shape[0]})],axis=0)
susp_serious_injury 1609 killed 338 total 26882 dtype: int64
Note that I will combine the cyclists who were seriously injured with those who were killed, because the number of cyclists who were killed is very low (338 out of 26882, i.e. about 1.26% of samples). Such a small sample can be noisy and result in less reliable conclusions. Moreover, I expect that the distinction between serious injury and fatality is heavily influenced by factors that are outside the scope of the dataset, such as the underlying health of the cyclist, the quality of medical care they might have received, and the fine details of the collision geometry.
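A minimal sketch of the combined target, assuming it is derived from `INJ_SEVERITY` (the `SERIOUS_OR_FATALITY` column is used later in this notebook, but how it was built in Part I is not shown here). The labels 'susp_serious_injury' and 'killed' come from the output above; the other labels in the toy data are placeholders.

```python
import pandas as pd

# Toy data; 'minor' and 'none' are placeholder labels, not taken from the dataset.
toy = pd.DataFrame(
    {"INJ_SEVERITY": ["killed", "susp_serious_injury", "minor", "none", "minor"]}
)

# Flag is 1 when the cyclist was seriously injured or killed, else 0.
toy["SERIOUS_OR_FATALITY"] = (
    toy["INJ_SEVERITY"].isin(["susp_serious_injury", "killed"]).astype(int)
)

# The mean of a 0/1 flag is the share of cyclists in the combined group.
rate = toy["SERIOUS_OR_FATALITY"].mean()
print(f"{rate:.1%}")
```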
I will focus on several types of features:
Note that some categories are very rare! The charts include only those feature categories that make up at least 0.5% of at least one cohort of cyclist samples.
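The 0.5% rule can be sketched as below. The helper `cohort_percentages` is hypothetical and only mirrors the two columns shown in the output tables; it is not the code of `feat_perc_comp`.

```python
import pandas as pd

def cohort_percentages(df, feature, flag="SERIOUS_OR_FATALITY", min_share=0.005):
    """Share of each category among all rows and among flagged rows (hypothetical helper)."""
    table = pd.DataFrame({
        "all cyclists": df[feature].value_counts(normalize=True),
        "cyclists with serious injury or fatality":
            df.loc[df[flag] == 1, feature].value_counts(normalize=True),
    }).fillna(0.0)
    # Keep a category if it reaches the threshold in at least one cohort.
    return table[(table >= min_share).any(axis=1)]

# Toy data: 'rare' makes up 0.3% of all rows and 0% of the flagged rows.
toy = pd.DataFrame({
    "URBAN_RURAL": ["urban"] * 800 + ["rural"] * 197 + ["rare"] * 3,
    "SERIOUS_OR_FATALITY": [1] * 50 + [0] * 750 + [1] * 30 + [0] * 167 + [0] * 3,
})
result = cohort_percentages(toy, "URBAN_RURAL")
print(result)
```

In this toy example the 'rare' category is dropped because it stays below 0.5% in both cohorts.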
cat_features = ['AGE_BINS', 'SEX','RESTRAINT_HELMET',
'VEH_MOVEMENT', 'VEH_POSITION','VEH_ROLE',
'IMPACT_SIDE','URBAN_RURAL',
'DAY_OF_WEEK','HOUR_OF_DAY','COLLISION_TYPE',
'RDWY_ALIGNMENT','GRADE','SPEED_LIMIT','ROAD_CONDITION','WEATHER',
'ILLUMINATION','INTERSECT_TYPE','LOCATION_TYPE', 'RELATION_TO_ROAD',
'TCD_TYPE', 'TCD_FUNC_CD']
flag_features = ['BUS', 'HEAVY_TRUCK', 'SMALL_TRUCK', 'SUV','COMM_VEHICLE',
'RUNNING_STOP_SIGN','RUNNING_RED_LT','SPEEDING_RELATED', 'TAILGATING',
'CROSS_MEDIAN', 'LANE_DEPARTURE','AGGRESSIVE_DRIVING','NHTSA_AGG_DRIVING',
'CELL_PHONE','DISTRACTED','DRINKING_DRIVER', 'DRUGGED_DRIVER',
'FATIGUE_ASLEEP','IMPAIRED_DRIVER',
'MATURE_DRIVER','YOUNG_DRIVER']
from lib.vis_data import feat_perc_comp
# Define scheme for HTML output of dataframes, can have multiple per line
output=''
tables = [feat_perc_comp(feat,cyclists) for feat in cat_features]
tables = [(x,x.data.shape[0]) for x in tables]
tables = sorted(tables,key=lambda x:x[1],reverse=True)
tables = [table[0] for table in tables]
tables = tables+[feat_perc_comp(feat,cyclists) for feat in flag_features]
for table in tables:
output += table._repr_html_()
output += "\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"
display_html(output,raw=True)
| HOUR_OF_DAY | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0.000000 | 0.97% | 1.39% |
| 1.000000 | 0.63% | 0.98% |
| 2.000000 | 0.52% | 1.13% |
| 3.000000 | 0.28% | 0.72% |
| 5.000000 | 0.70% | 1.13% |
| 6.000000 | 1.67% | 2.72% |
| 7.000000 | 2.99% | 3.08% |
| 8.000000 | 3.29% | 2.57% |
| 9.000000 | 3.02% | 4.21% |
| 10.000000 | 3.44% | 2.77% |
| 11.000000 | 4.25% | 3.70% |
| 12.000000 | 5.34% | 4.57% |
| 13.000000 | 5.57% | 5.24% |
| 14.000000 | 6.33% | 6.53% |
| 15.000000 | 8.80% | 8.63% |
| 16.000000 | 10.89% | 9.25% |
| 17.000000 | 10.91% | 9.25% |
| 18.000000 | 9.23% | 9.35% |
| 19.000000 | 7.50% | 6.37% |
| 20.000000 | 5.44% | 5.81% |
| 21.000000 | 4.05% | 4.78% |
| 22.000000 | 2.36% | 3.39% |
| 23.000000 | 1.58% | 2.11% |
| VEH_MOVEMENT | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| straight | 80.84% | 72.76% |
| turning_left | 5.07% | 7.42% |
| other | 4.26% | 5.51% |
| turning_right | 1.95% | 1.58% |
| changing_merging | 1.86% | 2.89% |
| passing_vehicle | 1.44% | 1.26% |
| curve_left | 1.05% | 2.78% |
| curve_right | 0.91% | 2.62% |
| stopped_in_lane | 0.61% | 0.71% |
| slowing_or_stopping_in_lane | 0.55% | 0.33% |
| avoiding | 0.51% | 0.60% |
| entering_lane | 0.41% | 0.66% |
| VEH_POSITION | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| right_lane_curb | 35.41% | 44.74% |
| other | 20.11% | 12.22% |
| unknown | 10.11% | 7.19% |
| oncoming_lane | 7.59% | 8.58% |
| shoulder_right | 5.69% | 8.94% |
| one_lane_road | 4.74% | 3.24% |
| right_of_trafficway | 4.57% | 4.57% |
| other_forward_lane | 3.95% | 2.57% |
| left_lane | 2.48% | 3.13% |
| left_of_trafficway | 2.26% | 2.36% |
| shoulder_left | 1.77% | 1.18% |
| right_lane | 0.67% | 0.46% |
| IMPACT_SIDE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| front | 39.49% | 35.23% |
| unknown | 22.24% | 15.20% |
| left | 12.85% | 15.82% |
| rear | 9.18% | 15.46% |
| right | 8.95% | 9.60% |
| front_left | 2.62% | 3.90% |
| front_right | 1.79% | 1.39% |
| rear_left | 1.04% | 1.18% |
| non_collision | 0.84% | 0.98% |
| rear_right | 0.76% | 0.87% |
| AGE_BINS | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| (0, 10] | 12.46% | 9.62% |
| (10, 20] | 35.60% | 32.20% |
| (20, 30] | 18.60% | 13.90% |
| (30, 40] | 10.61% | 11.92% |
| (40, 50] | 10.55% | 12.02% |
| (50, 60] | 7.95% | 11.76% |
| (60, 70] | 3.11% | 5.96% |
| (70, 80] | 0.88% | 1.99% |
| (80, 90] | 0.21% | 0.63% |
| COLLISION_TYPE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| angle | 67.80% | 63.51% |
| sideswipe_same_dir | 10.85% | 8.27% |
| rear_end | 6.25% | 11.77% |
| hit_ped | 5.67% | 5.04% |
| head_on | 5.33% | 7.50% |
| sideswipe_opp_dir | 2.80% | 1.85% |
| other | 0.49% | 0.57% |
| non_collision | 0.34% | 0.72% |
| hit_fixed_obj | 0.20% | 0.62% |
| SPEED_LIMIT | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 15 | 3.41% | 1.64% |
| 20 | 2.37% | 2.31% |
| 25 | 70.05% | 53.31% |
| 30 | 2.69% | 2.98% |
| 35 | 13.46% | 19.93% |
| 40 | 3.48% | 8.22% |
| 45 | 2.88% | 6.78% |
| 50 | 0.19% | 0.51% |
| 55 | 1.25% | 4.16% |
| DAY_OF_WEEK | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| Sun | 11.49% | 12.28% |
| Mon | 14.86% | 13.66% |
| Tues | 14.78% | 12.28% |
| Wed | 15.35% | 15.67% |
| Thurs | 14.77% | 14.48% |
| Fri | 15.41% | 17.46% |
| Sat | 13.34% | 14.18% |
| GRADE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| level | 70.37% | 62.40% |
| unknown | 12.90% | 8.89% |
| downhill | 11.96% | 19.16% |
| uphill | 2.83% | 4.88% |
| bottom_hill | 1.42% | 3.29% |
| top_hill | 0.51% | 1.39% |
| INTERSECT_TYPE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| four_way | 42.47% | 32.92% |
| midblock | 34.70% | 46.02% |
| T | 19.60% | 17.62% |
| Y | 1.44% | 1.69% |
| multi_leg | 1.05% | 1.03% |
| other | 0.52% | 0.26% |
| ILLUMINATION | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| daylight | 76.07% | 70.16% |
| dark_lit | 17.20% | 17.26% |
| dusk | 3.14% | 2.67% |
| dark_unlit | 2.86% | 8.27% |
| dawn | 0.59% | 1.59% |
| RELATION_TO_ROAD | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| on_roadway | 92.25% | 91.94% |
| shoulder | 4.37% | 5.84% |
| roadside | 1.76% | 1.34% |
| outside_trafficway | 0.81% | 0.52% |
| parking_lane | 0.62% | 0.10% |
| RESTRAINT_HELMET | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| no_restraint | 76.12% | 74.99% |
| bicycle_helmet | 14.78% | 20.03% |
| unknown | 8.18% | 4.01% |
| motorcycle_helmet | 0.57% | 0.62% |
| VEH_ROLE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| struck | 62.14% | 67.64% |
| striking | 36.05% | 28.92% |
| striking_struck | 1.38% | 2.77% |
| non_collision | 0.43% | 0.67% |
| WEATHER | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| clear | 92.82% | 92.47% |
| rain | 6.10% | 5.98% |
| cloudy | 0.31% | 0.52% |
| other | 0.29% | 0.57% |
| LOCATION_TYPE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| not_applicable | 91.64% | 90.85% |
| driveway_parking_lot | 6.97% | 6.79% |
| ramp | 0.54% | 0.72% |
| bridge | 0.38% | 0.93% |
| TCD_TYPE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| not_applicable | 47.28% | 56.96% |
| stop_sign | 27.51% | 23.04% |
| traffic_signal | 24.30% | 18.35% |
| other | 0.35% | 0.88% |
| URBAN_RURAL | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| urban | 77.37% | 59.73% |
| rural | 14.66% | 32.10% |
| urbanized | 7.97% | 8.17% |
| TCD_FUNC_CD | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| functioning_properly | 50.60% | 41.68% |
| no_controls | 48.20% | 57.64% |
| functioning_improperly | 1.03% | 0.57% |
| SEX | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| M | 81.84% | 85.62% |
| F | 18.16% | 14.38% |
| RDWY_ALIGNMENT | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| straight | 97.00% | 93.34% |
| curve | 3.00% | 6.66% |
| ROAD_CONDITION | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| dry | 90.93% | 90.75% |
| wet | 8.19% | 8.02% |
| BUS | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 98.91% | 98.46% |
| 1 | 1.09% | 1.54% |
| HEAVY_TRUCK | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 98.77% | 96.20% |
| 1 | 1.23% | 3.80% |
| SMALL_TRUCK | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 90.67% | 86.39% |
| 1 | 9.33% | 13.61% |
| SUV | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 84.44% | 81.36% |
| 1 | 15.56% | 18.64% |
| COMM_VEHICLE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 97.81% | 94.71% |
| 1 | 2.19% | 5.29% |
| RUNNING_STOP_SIGN | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 98.59% | 98.66% |
| 1 | 1.41% | 1.34% |
| RUNNING_RED_LT | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 99.07% | 98.92% |
| 1 | 0.93% | 1.08% |
| SPEEDING_RELATED | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 98.27% | 94.92% |
| 1 | 1.73% | 5.08% |
| TAILGATING | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 99.44% | 99.54% |
| 1 | 0.56% | 0.46% |
| CROSS_MEDIAN | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 99.45% | 98.72% |
| 1 | 0.55% | 1.28% |
| LANE_DEPARTURE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 97.44% | 94.14% |
| 1 | 2.56% | 5.86% |
| AGGRESSIVE_DRIVING | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 78.05% | 78.74% |
| 1 | 21.95% | 21.26% |
| NHTSA_AGG_DRIVING | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 98.90% | 98.05% |
| 1 | 1.10% | 1.95% |
| CELL_PHONE | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 99.61% | 99.54% |
| DISTRACTED | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 94.71% | 93.58% |
| 1 | 5.29% | 6.42% |
| DRINKING_DRIVER | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 98.66% | 95.07% |
| 1 | 1.34% | 4.93% |
| DRUGGED_DRIVER | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 99.52% | 97.64% |
| 1 | 0.48% | 2.36% |
| FATIGUE_ASLEEP | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 99.76% | 99.64% |
| IMPAIRED_DRIVER | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 98.36% | 93.58% |
| 1 | 1.64% | 6.42% |
| MATURE_DRIVER | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 89.10% | 87.01% |
| 1 | 10.90% | 12.99% |
| YOUNG_DRIVER | all cyclists | cyclists with serious injury or fatality |
|---|---|---|
| 0 | 93.45% | 91.58% |
| 1 | 6.55% | 8.42% |
| aspect of crash | observations based on all cyclists | any changes observed when restricting to those with serious injury or fatality |
|---|---|---|
| cyclist age | 10-20 is the most common range, followed by 20-30. | Older cyclists become more prevalent, markedly so for those over 50. |
| cyclist helmet use | The majority were recorded as not wearing a helmet, and over 2000 cyclists had 'unknown' helmet status. | The recorded share wearing a bicycle helmet rises (14.78% to 20.03%), while the 'unknown' share roughly halves (8.18% to 4.01%). |
| cyclist position, movement, road grade | | |
| collision time and day | | |
| collision type and cyclist impact point | | |
| urban, rural, or urbanized setting | The vast majority of cyclists were in a crash in an urban setting, consistent with my expectation that most cycling happens in urban settings. | The percentage in a rural setting more than doubles, and the percentage in an urban setting decreases: crashes in rural settings are more dangerous for cyclists. |
| speed limit where collision occurred | The vast majority of cyclists were traveling in a 25mph zone, a very common speed limit in urban settings. | The percentage traveling in higher speed limit zones increases. |
| illumination and weather-related conditions | | The percentage traveling in daylight decreases, the percentage traveling in dark unlit conditions nearly triples, and the percentage traveling at dawn more than doubles. |
| crash location | | |
| vehicle flags | | The percentages for buses, small trucks, and SUVs increase. The percentage for heavy trucks triples, and the percentage for commercial vehicles more than doubles. |
| driver condition or behavior flags | | |
Approximately 7.4% of cyclists in the dataset suffered serious injury or fatality. In the previous section, I identified some particular factors which seem to especially strongly affect the percentage of cyclists suffering serious injury or fatality:
cyclists['OVER_50'] = (cyclists.AGE>=50).astype('int')
filters = {'cyclist over 50':cyclists.OVER_50==1,
'striking and struck':cyclists.VEH_ROLE=='striking_struck',
'rear-end':cyclists.COLLISION_TYPE=='rear_end',
'rear impact':cyclists.IMPACT_SIDE=='rear',
'curved road':cyclists.RDWY_ALIGNMENT=='curve',
'hill':~cyclists.GRADE.isin(['level','unknown']),
'midblock':cyclists.INTERSECT_TYPE=='midblock',
'rural':cyclists.URBAN_RURAL=='rural',
'dark unlit':cyclists.ILLUMINATION=='dark_unlit',
'dawn':cyclists.ILLUMINATION=='dawn',
'Friday':cyclists.DAY_OF_WEEK==6,
'speeding related':cyclists.SPEEDING_RELATED==1,
'heavy truck':cyclists.HEAVY_TRUCK==1,
'small truck':cyclists.SMALL_TRUCK==1,
'SUV':cyclists.SUV==1,
'commercial vehicle':cyclists.COMM_VEHICLE==1,
'drugged driver':cyclists.DRUGGED_DRIVER==1,
'drinking driver':cyclists.DRINKING_DRIVER==1,
'distracted driver':cyclists.DISTRACTED==1,
'NHTSA agg driver':cyclists.NHTSA_AGG_DRIVING==1,
'driver under 20':cyclists.YOUNG_DRIVER==1,
'driver over 65':cyclists.MATURE_DRIVER==1}
# Serious-injury-or-fatality rate within each filtered subset (avoids shadowing the builtin `filter`)
percents = [cyclists[mask].SERIOUS_OR_FATALITY.sum()/mask.sum() for mask in filters.values()]
percents = pd.DataFrame({'crash factor':filters.keys(),
'serious injury or fatality percentage':percents}).set_index('crash factor')\
.sort_values(by='serious injury or fatality percentage',ascending=False).transpose()
format_dict={col:'{:.2%}' for col in percents.columns}
percents.style.format(format_dict).background_gradient(axis=None,cmap='bwr',gmap=percents,vmin=.074-0.5,vmax=.074+0.5)
| crash factor | drugged driver | drinking driver | heavy truck | speeding related | dark unlit | dawn | commercial vehicle | curved road | rural | striking and struck | rear-end | NHTSA agg driver | hill | rear impact | cyclist over 50 | small truck | midblock | driver under 20 | distracted driver | SUV | driver over 65 | Friday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| serious injury or fatality percentage | 35.38% | 26.74% | 22.36% | 21.29% | 20.91% | 19.50% | 17.49% | 16.34% | 15.86% | 14.52% | 13.64% | 12.84% | 12.43% | 12.20% | 11.73% | 10.56% | 9.61% | 9.31% | 8.78% | 8.68% | 8.63% | 8.21% |
Note that every one of these factors is associated with a higher rate of serious injury or fatality than the 7.4% baseline, some of them dramatically so. When factors are paired, I expect the effect to compound: for example, I expect a speeding-related collision involving a drinking driver to carry a higher chance of serious injury or death than either factor alone.
The following table gives, for each pairing of factors from the above list, the percentage of cyclists suffering serious injury or fatality having been involved in a crash with BOTH of those factors. The diagonal entries are just the percentages from the above table, i.e. the percentages corresponding to single factors. Percentages are omitted for pairs with fewer than 27 samples (i.e. pairs that correspond to less than 0.1% of the entire dataset). Lowering this threshold will reveal more percentages, but rates computed from very small samples are unlikely to support reliable conclusions about the distribution from which the dataset is drawn.
In the following table:
from lib.vis_data import crosstab_percent, stylize_dataframe
_, percents = crosstab_percent(filters,cyclists)
stylize_dataframe(percents)
| | cyclist over 50 | striking and struck | rear-end | rear impact | curved road | hill | midblock | rural | dark unlit | dawn | Friday | speeding related | heavy truck | small truck | SUV | commercial vehicle | drugged driver | drinking driver | distracted driver | NHTSA agg driver | driver under 20 | driver over 65 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cyclist over 50 | 11.73% | 26.67% | 20.22% | 23.48% | 24.54% | 21.7% | 15.69% | 19.7% | 23.85% | 42.11% | 13.03% | 37.31% | 27.45% | 16.2% | 13.29% | 17.39% | 28.3% | 15.53% | 19.44% | 13.0% | 14.78% | |
| striking and struck | 26.67% | 14.52% | 28.12% | 25.0% | 23.29% | 14.13% | 27.87% | 25.42% | 13.64% | 15.52% | 17.24% | 18.42% | ||||||||||
| rear-end | 20.22% | 28.12% | 13.64% | 18.47% | 31.75% | 24.07% | 17.46% | 27.46% | 37.34% | 16.91% | 27.47% | 17.2% | 19.43% | 23.81% | 46.67% | 38.57% | 18.39% | 17.17% | 22.02% | |||
| rear impact | 23.48% | 25.0% | 18.47% | 12.2% | 21.9% | 20.67% | 19.16% | 23.62% | 31.51% | 13.3% | 30.16% | 30.0% | 18.14% | 16.46% | 20.93% | 56.41% | 45.78% | 21.14% | 28.57% | 15.79% | 16.0% | |
| curved road | 24.54% | 31.75% | 21.9% | 16.34% | 22.87% | 20.3% | 22.34% | 42.5% | 18.0% | 44.44% | 22.12% | 21.67% | 42.86% | 19.05% | 29.73% | 13.68% | ||||||
| hill | 21.7% | 23.29% | 24.07% | 20.67% | 22.87% | 12.43% | 16.13% | 21.95% | 26.52% | 25.93% | 15.18% | 31.96% | 33.33% | 18.44% | 14.05% | 30.43% | 70.37% | 40.28% | 15.16% | 19.67% | 14.96% | 12.75% |
| midblock | 15.69% | 14.13% | 17.46% | 19.16% | 20.3% | 16.13% | 9.61% | 18.01% | 27.39% | 48.98% | 10.68% | 27.62% | 24.46% | 12.94% | 11.52% | 19.61% | 50.77% | 33.7% | 13.21% | 16.67% | 12.2% | 11.84% |
| rural | 19.7% | 27.87% | 27.46% | 23.62% | 22.34% | 21.95% | 18.01% | 15.86% | 27.6% | 33.33% | 19.44% | 39.85% | 28.42% | 17.99% | 16.95% | 25.45% | 60.0% | 31.52% | 17.0% | 23.53% | 20.27% | 15.18% |
| dark unlit | 23.85% | 37.34% | 31.51% | 42.5% | 26.52% | 27.39% | 27.6% | 20.91% | 21.14% | 50.0% | 21.51% | 23.48% | 35.85% | 16.07% | 24.36% | 22.89% | ||||||
| dawn | 42.11% | 25.93% | 48.98% | 33.33% | 19.5% | 20.0% | 16.67% | |||||||||||||||
| Friday | 13.03% | 25.42% | 16.91% | 13.3% | 18.0% | 15.18% | 10.68% | 19.44% | 21.14% | 20.0% | 8.21% | 21.43% | 17.74% | 12.17% | 10.55% | 20.37% | 30.56% | 11.54% | 8.89% | 9.89% | 10.23% | |
| speeding related | 37.31% | 27.47% | 30.16% | 44.44% | 31.96% | 27.62% | 39.85% | 50.0% | 21.43% | 21.29% | 36.36% | 28.3% | 51.43% | 27.5% | 25.49% | 25.93% | ||||||
| heavy truck | 27.45% | 30.0% | 33.33% | 24.46% | 28.42% | 17.74% | 22.36% | 25.63% | ||||||||||||||
| small truck | 16.2% | 13.64% | 17.2% | 18.14% | 22.12% | 18.44% | 12.94% | 17.99% | 21.51% | 12.17% | 36.36% | 10.56% | 31.75% | 12.24% | 22.5% | 17.89% | 12.5% | |||||
| SUV | 13.29% | 15.52% | 19.43% | 16.46% | 21.67% | 14.05% | 11.52% | 16.95% | 23.48% | 16.67% | 10.55% | 28.3% | 8.68% | 42.59% | 10.28% | 7.55% | 13.66% | 10.95% | ||||
| commercial vehicle | 17.39% | 23.81% | 20.93% | 30.43% | 19.61% | 25.45% | 20.37% | 25.63% | 17.49% | 27.27% | ||||||||||||
| drugged driver | 46.67% | 56.41% | 70.37% | 50.77% | 60.0% | 35.38% | 36.17% | |||||||||||||||
| drinking driver | 28.3% | 38.57% | 45.78% | 42.86% | 40.28% | 33.7% | 31.52% | 35.85% | 30.56% | 51.43% | 31.75% | 42.59% | 36.17% | 26.74% | 24.14% | |||||||
| distracted driver | 15.53% | 17.24% | 18.39% | 21.14% | 19.05% | 15.16% | 13.21% | 17.0% | 16.07% | 11.54% | 12.24% | 10.28% | 27.27% | 8.78% | 11.61% | 10.56% | ||||||
| NHTSA agg driver | 19.44% | 28.57% | 19.67% | 16.67% | 23.53% | 8.89% | 27.5% | 22.5% | 7.55% | 12.84% | 18.52% | 17.5% | ||||||||||
| driver under 20 | 13.0% | 17.17% | 15.79% | 29.73% | 14.96% | 12.2% | 20.27% | 24.36% | 9.89% | 25.49% | 17.89% | 13.66% | 11.61% | 18.52% | 9.31% | |||||||
| driver over 65 | 14.78% | 18.42% | 22.02% | 16.0% | 13.68% | 12.75% | 11.84% | 15.18% | 22.89% | 10.23% | 25.93% | 12.5% | 10.95% | 24.14% | 10.56% | 17.5% | 8.63% |
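The pairwise percentages above, including the 27-sample cutoff, can be sketched as follows. The helper `pairwise_rates` is hypothetical and is not the implementation of `crosstab_percent`; the toy data is constructed so that the pair has only 10 samples.

```python
import numpy as np
import pandas as pd

def pairwise_rates(masks, df, target="SERIOUS_OR_FATALITY", min_samples=27):
    """Rate of the 0/1 target among rows satisfying both masks (hypothetical helper)."""
    names = list(masks)
    rates = pd.DataFrame(np.nan, index=names, columns=names)
    for a in names:
        for b in names:
            both = masks[a] & masks[b]
            # Leave the entry missing when too few samples satisfy both conditions.
            if both.sum() >= min_samples:
                rates.loc[a, b] = df.loc[both, target].mean()
    return rates

# Toy data: 40 rural only, 20 speeding only, 10 with both, 130 with neither.
toy = pd.DataFrame({
    "URBAN_RURAL": ["rural"] * 40 + ["urban"] * 20 + ["rural"] * 10 + ["urban"] * 130,
    "SPEEDING_RELATED": [0] * 40 + [1] * 20 + [1] * 10 + [0] * 130,
    "SERIOUS_OR_FATALITY": [1] * 8 + [0] * 32 + [1] * 5 + [0] * 15
                           + [1] * 5 + [0] * 5 + [1] * 6 + [0] * 124,
})
masks = {
    "rural": toy.URBAN_RURAL == "rural",
    "speeding related": toy.SPEEDING_RELATED == 1,
}

strict = pairwise_rates(masks, toy)                  # pair is omitted (10 < 27)
relaxed = pairwise_rates(masks, toy, min_samples=5)  # pair is shown
print(relaxed)
```

With the default cutoff the off-diagonal entry stays empty, which is the behavior that produces the blank cells in the table above.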
Many of these pairs have small sample sizes. I focus on a subset of this list:
few_filters = {'cyclist over 50':cyclists.OVER_50==1,
'rear-end':cyclists.COLLISION_TYPE=='rear_end',
'curved road':cyclists.RDWY_ALIGNMENT=='curve',
'hill':~cyclists.GRADE.isin(['level','unknown']),
'midblock':cyclists.INTERSECT_TYPE=='midblock',
'rural':cyclists.URBAN_RURAL=='rural',
'dark unlit':cyclists.ILLUMINATION=='dark_unlit',
'speeding related':cyclists.SPEEDING_RELATED==1,
'SUV':cyclists.SUV==1,
'drinking driver':cyclists.DRINKING_DRIVER==1,
}
# few_filters = {
# 'drinking driver':cyclists.DRINKING_DRIVER==1,
# 'speeding related':cyclists.SPEEDING_RELATED==1,
# 'dark unlit':cyclists.ILLUMINATION=='dark_unlit',
# 'rural':cyclists.URBAN_RURAL=='rural',
# 'curved road':cyclists.RDWY_ALIGNMENT=='curve',
# 'midblock':cyclists.MIDBLOCK==1,
# 'SUV':cyclists.SUV==1}
_,percents = crosstab_percent(few_filters, cyclists)
stylize_dataframe(percents)
| | cyclist over 50 | rear-end | curved road | hill | midblock | rural | dark unlit | speeding related | SUV | drinking driver |
|---|---|---|---|---|---|---|---|---|---|---|
| cyclist over 50 | 11.73% | 20.22% | 24.54% | 21.7% | 15.69% | 19.7% | 23.85% | 37.31% | 13.29% | 28.3% |
| rear-end | 20.22% | 13.64% | 31.75% | 24.07% | 17.46% | 27.46% | 37.34% | 27.47% | 19.43% | 38.57% |
| curved road | 24.54% | 31.75% | 16.34% | 22.87% | 20.3% | 22.34% | 42.5% | 44.44% | 21.67% | 42.86% |
| hill | 21.7% | 24.07% | 22.87% | 12.43% | 16.13% | 21.95% | 26.52% | 31.96% | 14.05% | 40.28% |
| midblock | 15.69% | 17.46% | 20.3% | 16.13% | 9.61% | 18.01% | 27.39% | 27.62% | 11.52% | 33.7% |
| rural | 19.7% | 27.46% | 22.34% | 21.95% | 18.01% | 15.86% | 27.6% | 39.85% | 16.95% | 31.52% |
| dark unlit | 23.85% | 37.34% | 42.5% | 26.52% | 27.39% | 27.6% | 20.91% | 50.0% | 23.48% | 35.85% |
| speeding related | 37.31% | 27.47% | 44.44% | 31.96% | 27.62% | 39.85% | 50.0% | 21.29% | 28.3% | 51.43% |
| SUV | 13.29% | 19.43% | 21.67% | 14.05% | 11.52% | 16.95% | 23.48% | 28.3% | 8.68% | 42.59% |
| drinking driver | 28.3% | 38.57% | 42.86% | 40.28% | 33.7% | 31.52% | 35.85% | 51.43% | 42.59% | 26.74% |
Notice that for this smaller sublist of crash factors, almost all pairs show a higher rate of serious injury or fatality than either individual factor in the pair, and many show a much higher rate. For some factors the effect is very dramatic: for example, speeding related combined with drinking driver gives 51.43%, compared with 21.29% and 26.74% for each factor alone. More generally, drinking driver, speeding related, SUV, and dark unlit compound one another quite a bit.
# few_filters = {
# 'rear-end':cyclists.COLLISION_TYPE=='rear_end',
# 'curved road':cyclists.RDWY_ALIGNMENT=='curve',
# 'rural':cyclists.URBAN_RURAL=='rural',
# 'dark unlit':cyclists.ILLUMINATION=='dark_unlit',
# 'speeding related':cyclists.SPEEDING_RELATED==1,
# 'SUV':cyclists.SUV==1,
# 'drinking driver':cyclists.DRINKING_DRIVER==1,
# }
few_filters = {
'curved road':cyclists.RDWY_ALIGNMENT=='curve',
'dark unlit':cyclists.ILLUMINATION=='dark_unlit',
'speeding related':cyclists.SPEEDING_RELATED==1,
'drinking driver':cyclists.DRINKING_DRIVER==1,
}
_,percents = crosstab_percent(few_filters,cyclists)
stylize_dataframe(percents)
| | curved road | dark unlit | speeding related | drinking driver |
|---|---|---|---|---|
| curved road | 16.34% | 42.5% | 44.44% | 42.86% |
| dark unlit | 42.5% | 20.91% | 50.0% | 35.85% |
| speeding related | 44.44% | 50.0% | 21.29% | 51.43% |
| drinking driver | 42.86% | 35.85% | 51.43% | 26.74% |
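As a quick check of the compounding pattern in this last table, the sketch below compares each pair's rate with the larger of its two single-factor rates. The numbers are copied from the table above; the variable names are illustrative.

```python
import numpy as np

labels = ["curved road", "dark unlit", "speeding related", "drinking driver"]

# Percentages copied from the table above (diagonal = single-factor rates).
rates = np.array([
    [16.34, 42.50, 44.44, 42.86],
    [42.50, 20.91, 50.00, 35.85],
    [44.44, 50.00, 21.29, 51.43],
    [42.86, 35.85, 51.43, 26.74],
])

single = np.diag(rates)
# Larger of the two single-factor rates for every pair of factors.
baseline = np.maximum.outer(single, single)

# Does every off-diagonal pair exceed both of its single-factor rates?
off_diagonal = ~np.eye(len(labels), dtype=bool)
compounds = bool((rates > baseline)[off_diagonal].all())

# Ratio of each pair's rate to the larger single-factor rate.
lift = rates / baseline
print(compounds)
print(np.round(lift, 2))
```

This only compares observed percentages; it does not account for the sample size behind each pair.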