Which factors most dramatically impact a cyclist's risk of suffering serious injury or death in a crash?
Can we predict the severity of a cyclist's outcome based on these factors?
To address these questions I've built BikeSaferPA, a machine learning model designed to predict whether a cyclist involved in a crash in Pennsylvania will suffer serious injury or fatality as a result. I focused on a publicly accessible dataset of crash records in the state during the period 2002-2021, made available by the Pennsylvania Department of Transportation (PENNDOT). This series of Jupyter notebooks describes the process.
Within this dataset, I restricted attention to cyclists involved in crash incidents. The goals of this project are:
PENNDOT publishes a publicly accessible dataset containing records of all crashes in the state involving motor vehicles, cyclists, pedestrians, etc. The Pennsylvania dataset is unusual among crash datasets in that it includes significantly more data features about the bicycles and cyclists involved in these crashes than many analogous datasets. Datasets for individual years can be downloaded here.
The PENNDOT datasets contain several different .csv files for each year on record. Here is the full list, with the ones I'll use in bold:
Regarding identifiers:
Here are links to two helpful documents provided by PENNDOT:
Note: the data dictionary defines "serious injury" and "fatal injury" as:
In this first notebook, I will do the following:
First I import the necessary packages:
import numpy as np
import pandas as pd
import openpyxl
import requests
import json
import glob
import zipfile
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth',None)
import sys
np.set_printoptions(threshold=sys.maxsize)
The following block unzips a collection of ZIP files downloaded from this PENNDOT page.
These files are not included in the GitHub repository due to their size. If you want to try out this process, you'll need to download them from the PENNDOT page yourself. Your ZIP files should be placed in the directory './data/zip/', and the extracted CSV files will end up in './data/raw_csv/'.
# zip_files = glob.glob('data/zip/*.zip')
# for file in zip_files:
#     print(f'Unzipping {file}...')
#     with zipfile.ZipFile(file, 'r') as zip_ref:
#         zip_ref.extractall('data/raw_csv/')
For a given year, the extract_data function (from lib/get_data.py) will produce the following dataframes:
If you'd rather not download the PENNDOT files yourself and try this out, you can skip the next block and load my prepared raw CSV files instead.
# from lib.get_data import extract_data
# bicycle_list, person_list, crash_list, roadway_list = [],[],[],[]
# # Call dataframe collection function for each year
# YEARS = range(2002,2022)
# for year in YEARS:
#     bicycles, persons, crashes, roadway = extract_data(year)
#     bicycle_list.append(bicycles)
#     person_list.append(persons)
#     crash_list.append(crashes)
#     roadway_list.append(roadway)
# # Concatenate yearly dataframes and reindex
# bicycles = pd.concat(bicycle_list,axis=0).reset_index(drop=True)
# persons = pd.concat(person_list,axis=0).reset_index(drop=True)
# crashes = pd.concat(crash_list,axis=0).reset_index(drop=True)
# roadway = pd.concat(roadway_list,axis=0).reset_index(drop=True)
# del bicycle_list, person_list, crash_list, roadway_list
# # Write to CSV files
# bicycles.to_csv('./data/raw_csv/bicycles_raw.csv',index=False)
# persons.to_csv('./data/raw_csv/persons_raw.csv',index=False)
# crashes.to_csv('./data/raw_csv/crashes_raw.csv',index=False)
# roadway.to_csv('./data/raw_csv/roadway_raw.csv',index=False)
bicycles = pd.read_csv('./data/raw_csv/bicycles_raw.csv')
persons = pd.read_csv('./data/raw_csv/persons_raw.csv')
crashes = pd.read_csv('./data/raw_csv/crashes_raw.csv')
roadway = pd.read_csv('./data/raw_csv/roadway_raw.csv')
A few tasks are needed regarding the dataframes 'bicycles', 'crashes', and 'persons':
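Most of the cleaning below leans on two pandas idioms: coercing mixed-type code columns to numeric with `pd.to_numeric(errors='coerce')`, and recoding numeric codes to readable labels with `Series.replace`. Here is a minimal sketch on toy data (the codes and labels are illustrative, not from the PENNDOT dictionary):

```python
import numpy as np
import pandas as pd

# Toy series mimicking a raw PENNDOT field: numeric codes stored as strings,
# with a stray 'U' for unknown
raw = pd.Series(['1', '2', 'U', '9'])

# errors='coerce' turns unparseable entries into NaN instead of raising
codes = pd.to_numeric(raw, errors='coerce')

# Map numeric codes to readable labels, sending the 'unknown' code 9 to NaN
labels = codes.replace({1: 'straight', 2: 'curve', 9: np.nan})
print(labels.tolist())
```

Coercing first means the subsequent `replace` dictionaries only need to handle numeric keys, since every non-numeric entry has already become NaN.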
for feat in ['RDWY_ALIGNMENT','VEH_MOVEMENT','VEH_TYPE','VEH_POSITION','VEH_ROLE']:
    bicycles[feat] = pd.to_numeric(bicycles[feat],errors='coerce')
for feat in ['PC_HDLGHT_IND','PC_HLMT_IND','PC_REAR_RFLTR_IND']:
    bicycles[feat] = bicycles[feat].replace({'Y':1,'N':0,'U':np.nan})
for feat in ['UNIT_NUM','IMPACT_POINT']:
    bicycles[feat] = bicycles[feat].replace({'U':np.nan,99:np.nan})
bicycles['GRADE'] = bicycles['GRADE'].replace({1:'level',2:'uphill',3:'downhill',
4:'bottom_hill',5:'top_hill',9:np.nan})
bicycles['IMPACT_SIDE'] = bicycles['IMPACT_POINT'].replace({0:'non_collision',13:'top',14:'undercarriage',
11:'front',12:'front',1:'front',
2:'front_right',3:'right',4:'rear_right',
5:'rear',6:'rear',7:'rear',
8:'rear_left',9:'left',10:'front_left',99:np.nan})
bicycles['RDWY_ALIGNMENT'] = bicycles['RDWY_ALIGNMENT'].replace({1:'straight',2:'curve',
3:'curve',4:'curve',9:np.nan})
bicycles['VEH_ROLE'] = bicycles['VEH_ROLE'].replace({0:'non_collision',1:'striking',2:'struck',3:'striking_struck'})
bicycles['VEH_TYPE'] = bicycles['VEH_TYPE'].replace({20:'bicycle',21:'other_pedalcycle'})
bicycles['VEH_POSITION'] = bicycles['VEH_POSITION'].replace({0:np.nan,1:'right_lane_curb',2:'right_lane',
3:'left_lane',4:'left_turn_lane',
5:'center_turn_lane',6:'other_forward_lane',
7:'oncoming_lane',8:'left_of_trafficway',
9:'right_of_trafficway',10:'HOV_lane',
11:'shoulder_right',12:'shoulder_left',
13:'one_lane_road',14:'acc_dec_lane',
98:'other',99:np.nan})
bicycles['VEH_MOVEMENT'] = bicycles['VEH_MOVEMENT'].replace({1:'straight',2:'slowing_or_stopping_in_lane',
3:'stopped_in_lane',4:'passing_vehicle',
5:'entering_lane',6:'parked',
7:'leaving_lane',8:'avoiding',
9:'turning_right_red',10:'turning_right',
11:'turning_left_red',12:'turning_left',
13:'u_turn',14:'backing',15:'changing_merging',
16:'curve_right',17:'curve_left',
18:'entering_lane',19:'leaving_lane',
98:'other',99:np.nan})
# One sample has missing UNIT_NUM - set it to the first available unit number
bicycles['UNIT_NUM'] = bicycles.UNIT_NUM.fillna(1)
# Two samples have missing VEH_ROLE, but they have IMPACT_POINT=='non_collision' - use that for VEH_ROLE
bicycles['VEH_ROLE'] = bicycles.VEH_ROLE.fillna('non_collision')
persons['INJ_SEVERITY'] = pd.to_numeric(persons['INJ_SEVERITY'],errors='coerce')
persons['AGE']=persons['AGE'].replace({99:np.nan})
for feat in ['SEX','TRANSPORTED','UNIT_NUM']:
    persons[feat] = persons[feat].replace({'Y':1,'N':0,'U':np.nan,'0':0,' ':np.nan,'R':np.nan,'9':np.nan,97:np.nan})
persons['INJ_SEVERITY']=persons['INJ_SEVERITY'].replace({0:'no_injury',1:'killed',2:'susp_serious_injury',
3:'susp_minor_injury',4:'possible_injury',
8:'unknown_injury',9:np.nan})
persons['PERSON_TYPE']=persons['PERSON_TYPE'].replace({1:'driver',2:'passenger',7:'pedestrian',
8:'other',9:np.nan})
persons['RESTRAINT_HELMET']=persons['RESTRAINT_HELMET'].replace({0:'no_restraint',1:'shoulder_belt',2:'lap_belt',
3:'lap_shoulder_belt',4:'child_seat',
5:'motorcycle_helmet',6:'bicycle_helmet',
10:'belt_improper',11:'child_seat_improper',
12:'helmet_improper',21:'child_seat',22:'child_seat',
23:'booster_seat',24:'child_restraint_unknown',
90:'unknown_restraint',98:'other',99:np.nan})
# Two persons have CRN==2012109815 and UNIT_NUM==1. Since the bicycle has UNIT_NUM==1, change UNIT_NUM to 2 for the person who appears to be the motorist
persons.at[35557,'UNIT_NUM']=2
for feat in ['WEATHER1','WEATHER2','TIME_OF_DAY']:
    crashes[feat] = pd.to_numeric(crashes[feat],errors='coerce')
for feat in ['DISPATCH_TM','ARRIVAL_TM','TIME_OF_DAY']:
    crashes[feat] = crashes[feat].replace({9999:np.nan})
crashes['HOUR_OF_DAY'] = crashes['HOUR_OF_DAY'].replace({99:np.nan,24:0})
crashes['COLLISION_TYPE'] = crashes['COLLISION_TYPE'].replace({0:'non_collision',1:'rear_end',
2:'head_on',3:'backing',4:'angle',
5:'sideswipe_same_dir',6:'sideswipe_opp_dir',
7:'hit_fixed_obj',8:'hit_ped',
9:'other',98:'other',99:np.nan})
crashes['ILLUMINATION'] = crashes['ILLUMINATION'].replace({1:'daylight',2:'dark_unlit',3:'dark_lit',4:'dusk',
5:'dawn',6:'dark_lit',8:'other',9:np.nan})
crashes['ROAD_CONDITION'] = crashes['ROAD_CONDITION'].replace({1:'dry',2:'ice_frost',3:'mud_dirt_gravel',
4:'oil',5:'sand',6:'slush',7:'snow',
8:'water',9:'wet',98:'other',0:np.nan,99:np.nan})
crashes['URBAN_RURAL'] = crashes['URBAN_RURAL'].replace({1:'rural',2:'urbanized',3:'urban'})
crashes['INTERSECT_TYPE'] = crashes['INTERSECT_TYPE'].replace({0:'midblock',1:'four_way',2:'T',3:'Y',
4:'circle',5:'multi_leg',6:'ramp_end',
7:'ramp_begin',8:'crossover',9:'rr_crossing',
10:'other',11:'L',12:'circle',13:'circle',99:np.nan})
crashes['LOCATION_TYPE'] = crashes['LOCATION_TYPE'].replace({0:'not_applicable', 99:np.nan,1:'underpass',
2:'ramp',3:'bridge',4:'tunnel',
5:'toll_booth',6:'cross_over_related',
7:'driveway_parking_lot', 8:'ramp_bridge'})
crashes['RELATION_TO_ROAD'] = crashes['RELATION_TO_ROAD'].replace({1:'on_roadway',2:'shoulder', 3:'median',
4:'roadside',5:'outside_trafficway',
6:'parking_lane',7:'int_ramp_hwy',9:np.nan})
crashes['TCD_TYPE'] = crashes['TCD_TYPE'].replace({0:'not_applicable',1:'flashing_traffic_signal',2:'traffic_signal',
3:'stop_sign',4:'yield_sign',5:'active_RR_controls',6:'passive_RR_controls',
7:'officer_or_flagman',8:'other',9:np.nan})
crashes['TCD_FUNC_CD'] = crashes['TCD_FUNC_CD'].replace({0:'no_controls',1:'not_functioning',2:'functioning_improperly',
3:'functioning_properly',4:'emergency_preemptive_signal',9:np.nan})
for feat in ['WEATHER1','WEATHER2']:
    crashes[feat] = pd.to_numeric(crashes[feat],errors='coerce')
# Replace counts features for buses, trucks, SUVs with binary features indicating presence
for feat in ['BUS','HEAVY_TRUCK','SMALL_TRUCK','SUV','VAN']:
    crashes[feat] = crashes[feat+'_COUNT'].apply(lambda x: 1 if x>0 else x)
    # crashes = crashes.drop(feat+'_COUNT',axis=1)
# Recode 99 as np.nan
for feat in ['WEATHER1','WEATHER2']:
    crashes[feat] = crashes[feat].replace({99:np.nan})
# fill empty WEATHER1 with WEATHER2 when possible
crashes['WEATHER1'] = crashes.WEATHER1.fillna(crashes.WEATHER2)
# if WEATHER1==WEATHER2, set WEATHER2 to be np.nan
crashes.loc[crashes.WEATHER1==crashes.WEATHER2,'WEATHER2']=np.nan
# If WEATHER1==3 (clear) and WEATHER2 is not np.nan, replace WEATHER1 with WEATHER2
crashes['WEATHER1'] = crashes['WEATHER1'].where((crashes.WEATHER1!=3)|(crashes.WEATHER2.isna()),crashes.WEATHER2)
# Rename and delete WEATHER2
crashes = crashes.rename(columns={'WEATHER1':'WEATHER'})
crashes = crashes.drop(columns='WEATHER2')
crashes['WEATHER'] = crashes['WEATHER'].replace({1:'blowing_sand_soil_dirt',2:'blowing_snow', 3:'clear',
4:'cloudy',5:'fog_smog_smoke',6:'freezing_rain',
7:'rain',8:'severe_crosswind',9:'sleet_hail',
10:'snow',98:'other'})
# Adjust incorrect DEC_LAT, DEC_LONG values
# This sample is missing the fractional part of its coordinates
crashes.at[24770,'DEC_LAT']=np.nan
crashes.at[24770,'DEC_LONG']=np.nan
# These samples' locations don't match their municipalities - replace with approximate locations based on roadway data
crashes.at[3087,'DEC_LAT']=40.0081
crashes.at[3087,'DEC_LONG']=-75.1923
crashes.at[24805,'DEC_LAT']=40.3322
crashes.at[24805,'DEC_LONG']=-75.9278
# Fill NaN lat,lon values with mean by municipality when possible, then fill remaining with mean by county when possible.
for coord in ['DEC_LAT','DEC_LONG']:
    for area in ['MUNICIPALITY','COUNTY']:
        crashes[coord] = crashes.groupby(area)[coord].transform(lambda x: x.fillna(x.mean()))
# Fill some missing HOUR_OF_DAY using TIME_OF_DAY, or DISPATCH_TM, or ARRIVAL_TM when possible
crashes['HOUR_OF_DAY'] = crashes.HOUR_OF_DAY.fillna(crashes.TIME_OF_DAY.floordiv(100))
crashes['HOUR_OF_DAY'] = crashes.HOUR_OF_DAY.fillna(crashes.DISPATCH_TM.floordiv(100))
crashes['HOUR_OF_DAY'] = crashes.HOUR_OF_DAY.fillna(crashes.ARRIVAL_TM.floordiv(100))
# Fill all five missing ILLUMINATION values with 'daylight'
crashes['ILLUMINATION'] = crashes.ILLUMINATION.fillna('daylight')
# Don't need ARRIVAL_TM, DISPATCH_TM anymore
crashes = crashes.drop(columns=['DISPATCH_TM','ARRIVAL_TM'])
# Fill missing SPEED_LIMIT with mode by county.
roadway['SPEED_LIMIT'] = roadway.groupby('RDWY_COUNTY',sort=False)['SPEED_LIMIT'].apply(lambda x: x.fillna(x.mode()[0])).astype('int')
# When several roadways share the same CRN, we don't know which one the cyclist was on,
# so use the minimum value among all roadway samples with matching CRN
crashes = crashes.merge(pd.DataFrame(roadway.groupby('CRN').min().SPEED_LIMIT),how='left',on='CRN')
del roadway
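The groupwise imputation above works because `GroupBy.transform` returns a series aligned with the original index, so the per-group means slot straight back into the missing positions. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame: one latitude is missing and should get its municipality's mean
df = pd.DataFrame({
    'MUNICIPALITY': ['A', 'A', 'A', 'B'],
    'DEC_LAT': [40.0, 41.0, np.nan, 39.0],
})

# transform preserves the original index, unlike aggregation, so the
# result can be assigned directly back onto the column
df['DEC_LAT'] = df.groupby('MUNICIPALITY')['DEC_LAT'] \
                  .transform(lambda x: x.fillna(x.mean()))
print(df['DEC_LAT'].tolist())  # the NaN becomes 40.5, the mean within group 'A'
```

Running the same pattern a second time grouped by COUNTY, as above, catches any samples whose entire municipality had missing coordinates.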
It will be useful later to have a dataframe whose samples are individual cyclists and which also includes vehicle and crash features. I accomplish this by merging several dataframes.
# Merge bicycle vehicle data onto persons and crashes
cols = ['CRN', 'UNIT_NUM','INJ_SEVERITY','PERSON_TYPE',
'AGE','SEX','RESTRAINT_HELMET']
cyclists = persons[cols].merge(bicycles,on=['CRN','UNIT_NUM'],how='left')
# Isolate cyclists by restricting to samples which inherit VEH_TYPE from bicycles
cyclists = cyclists[cyclists.VEH_TYPE.notnull()]
# Merge crash data onto cyclists
cols = ['CRN', 'COUNTY', 'MUNICIPALITY', 'BUS_COUNT','COMM_VEH_COUNT',
'HEAVY_TRUCK_COUNT', 'SMALL_TRUCK_COUNT', 'SUV_COUNT', 'VAN_COUNT',
'CRASH_MONTH', 'CRASH_YEAR', 'DAY_OF_WEEK', 'HOUR_OF_DAY',
'COLLISION_TYPE', 'ILLUMINATION', 'INTERSECT_TYPE', 'LOCATION_TYPE',
'RELATION_TO_ROAD', 'ROAD_CONDITION', 'TCD_TYPE', 'TCD_FUNC_CD',
'URBAN_RURAL', 'WEATHER', 'AGGRESSIVE_DRIVING', 'ANGLE_CRASH',
'CELL_PHONE', 'COMM_VEHICLE', 'CROSS_MEDIAN', 'CURVED_ROAD',
'CURVE_DVR_ERROR','DRUG_RELATED','ALCOHOL_RELATED',
'DISTRACTED', 'DRINKING_DRIVER', 'DRUGGED_DRIVER', 'FATIGUE_ASLEEP','HO_OPPDIR_SDSWP',
'ICY_ROAD', 'ILLUMINATION_DARK', 'IMPAIRED_DRIVER', 'INTERSECTION',
'LANE_DEPARTURE', 'NHTSA_AGG_DRIVING','NO_CLEARANCE','NON_INTERSECTION','REAR_END',
'RUNNING_RED_LT', 'RUNNING_STOP_SIGN', 'RURAL', 'SNOW_SLUSH_ROAD',
'SPEEDING', 'SPEEDING_RELATED', 'SUDDEN_DEER', 'TAILGATING', 'URBAN',
'WET_ROAD', 'WORK_ZONE', 'MATURE_DRIVER', 'YOUNG_DRIVER', 'BUS',
'HEAVY_TRUCK', 'SMALL_TRUCK', 'SUV', 'VAN', 'SPEED_LIMIT']
cyclists = cyclists.merge(crashes[cols],on=['CRN'],how='left').drop('VEH_TYPE',axis=1)
del cols, persons
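The cyclist-isolation trick above deserves a quick illustration: after a left merge, only persons matching a bicycle record inherit a non-null VEH_TYPE, so filtering on `notnull` keeps exactly the cyclists. A toy version:

```python
import pandas as pd

persons = pd.DataFrame({'CRN': [1, 1, 2], 'UNIT_NUM': [1, 2, 1],
                        'AGE': [30, 45, 22]})
# Only the person with (CRN=1, UNIT_NUM=1) corresponds to a bicycle
bicycles = pd.DataFrame({'CRN': [1], 'UNIT_NUM': [1], 'VEH_TYPE': ['bicycle']})

# Left merge: persons with no matching bicycle get NaN in VEH_TYPE...
merged = persons.merge(bicycles, on=['CRN', 'UNIT_NUM'], how='left')
# ...so filtering on notnull keeps exactly the cyclists
cyclists = merged[merged.VEH_TYPE.notnull()]
print(cyclists[['CRN', 'AGE']].to_dict('records'))  # [{'CRN': 1, 'AGE': 30}]
```

An equivalent alternative is `merge(..., indicator=True)` and filtering on `_merge == 'both'`; filtering on an inherited column, as here, avoids the extra column.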
Many fields have missing data which I would like to fill using medians or modes (either groupwise or global). Since I will later design the BikeSaferPA model to predict whether a cyclist suffers serious injury or fatality, I will wait and implement this median/mode imputation as part of a pipeline, to avoid data leakage from the test set into the training set. However, here's the plan:
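As a sketch of that leakage-safe plan (assuming scikit-learn; the feature lists here are illustrative placeholders, not the model's actual column selection):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature split for illustration
num_cols = ['AGE', 'SPEED_LIMIT']
cat_cols = ['WEATHER', 'ROAD_CONDITION']

preprocess = ColumnTransformer([
    # Medians/modes are learned from the training folds only during fit,
    # so nothing leaks from the test set into the imputation
    ('num', SimpleImputer(strategy='median'), num_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_cols),
])

model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
```

Wrapping imputation inside the pipeline means cross-validation refits the imputers on each training fold, which is exactly what keeps test-fold statistics out of the training process.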
# When ROAD_CONDITION is dry, it makes sense to fill missing WEATHER with 'clear'
cyclists.loc[(cyclists.ROAD_CONDITION=='dry')&(cyclists.WEATHER.isna()),'WEATHER'] = 'clear'
# When WEATHER is rain, it makes sense to fill missing ROAD_CONDITION with 'wet'
cyclists.loc[(cyclists.WEATHER=='rain')&(cyclists.ROAD_CONDITION.isna()),'ROAD_CONDITION'] = 'wet'
# These features have many NaN values - create a new 'unknown' category
for feature in ['RESTRAINT_HELMET', 'IMPACT_SIDE', 'GRADE', 'VEH_POSITION']:
    cyclists[feature] = cyclists[feature].fillna('unknown')
# Create injury severity flag
cyclists['SERIOUS_OR_FATALITY'] = cyclists['INJ_SEVERITY']\
.apply(lambda x: 1 if x in ['killed','susp_serious_injury'] else 0)
# Also bin the age values
cyclists['AGE_BINS'] = pd.cut(cyclists.AGE,bins=range(0,101,10)).astype('str')
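The target flag and age bins behave as follows on toy data (the flag is built here with `isin`, which is equivalent to the `apply` above and maps NaN severities to 0 as well):

```python
import pandas as pd

inj = pd.Series(['killed', 'no_injury', 'susp_serious_injury', 'possible_injury'])
# Binary target: 1 for the two severe outcomes, 0 otherwise
flag = inj.isin(['killed', 'susp_serious_injury']).astype(int)
print(flag.tolist())  # [1, 0, 1, 0]

ages = pd.Series([8, 25, 67])
# pd.cut with right-closed decade bins: age 25 lands in (20, 30]
bins = pd.cut(ages, bins=range(0, 101, 10)).astype('str')
print(bins.tolist())  # ['(0, 10]', '(20, 30]', '(60, 70]']
```

Note that the decade bins are right-closed by default, so a cyclist of exactly age 30 falls in (20, 30], not (30, 40].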
Finally, I write the processed dataframes to CSV files, which will end up in the same directory as this notebook.
cyclists.to_csv('cyclists.csv',index=False)
crashes.to_csv('crashes.csv',index=False)