In this notebook I construct BikeSaferPA, a model designed to predict whether a cyclist involved in a crash suffered serious injury or fatality. I begin by importing the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
from IPython.display import display
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth',None)
import scipy.stats as ss
import shap
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
I'll read in the dataset and create some additional binary features that I'll use in the models.
cyclists = pd.read_csv('cyclists.csv')
# Define a binary feature HILL to replace GRADE
cyclists['HILL'] = cyclists.GRADE.isin(['downhill','uphill','bottom_hill','top_hill']).astype(int)
# In the pipeline I'll impute missing SEX data using a groupwise mode (with dropna=False).
# mode() can't compare the strings 'F'/'M' with np.nan, which can cause issues when multiple modes exist,
# so I first recode SEX as a numeric FEMALE flag (among floats, 0 and 1 sort before np.nan)
cyclists['FEMALE'] = cyclists['SEX'].replace({'F':1,'M':0})
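To make the imputation strategy concrete, here is a rough sketch of groupwise mode imputation done outside the pipeline. I'm assuming the grouping columns are the 'group' features defined below (MUNICIPALITY, COUNTY, CRASH_MONTH); the pipeline's actual imputer may differ in its details.

```python
# Rough sketch of groupwise mode imputation for FEMALE (assumed grouping columns).
group_cols = ['MUNICIPALITY', 'COUNTY', 'CRASH_MONTH']

def impute_female_by_group_mode(df, group_cols=group_cols):
    out = df.copy()
    # mode(dropna=False) keeps NaN as a candidate; pandas sorts NaN last,
    # so .iloc[0] breaks ties in favor of a real value (0 or 1).
    group_mode = out.groupby(group_cols)['FEMALE'].transform(
        lambda s: s.mode(dropna=False).iloc[0])
    out['FEMALE'] = out['FEMALE'].fillna(group_mode)
    return out

# Example usage (illustration only; the actual imputation happens in the pipeline):
# cyclists = impute_female_by_group_mode(cyclists)
```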
I'll start with the following model features:
Categorical features: these I'll encode as one-hot vectors in the pipeline; their names appear under the 'cat' key of the feature dictionary below.
Numerical features: AGE, SPEED_LIMIT, CRASH_YEAR
Ordinal features: BUS_COUNT, COMM_VEH_COUNT, HEAVY_TRUCK_COUNT, SMALL_TRUCK_COUNT, SUV_COUNT, VAN_COUNT (note: these are nearly binary; they take values 0, 1, or 2, and very few samples have a value of 2)
Binary features: None of these have missing values except FEMALE. Some of these are self-explanatory, but for others I clarify here what they indicate.
Target feature: SERIOUS_OR_FATALITY - a binary feature. This is what I want to predict.
Throughout this notebook I'll use several different feature sets. I'll store them in a single dictionary. The first key:value pair will correspond to this full feature set.
# Feature subsets
feat_dict = {}
feat_dict['all'] = {'cyc':['DAY_OF_WEEK','HOUR_OF_DAY'],
'ord':['BUS_COUNT','HEAVY_TRUCK_COUNT','COMM_VEH_COUNT',
'SMALL_TRUCK_COUNT','SUV_COUNT','VAN_COUNT'],
'cat':['RESTRAINT_HELMET','VEH_ROLE',
'URBAN_RURAL','ILLUMINATION',
'ROAD_CONDITION','WEATHER',
'COLLISION_TYPE','IMPACT_SIDE',
'TCD_TYPE','TCD_FUNC_CD'],
'group':['MUNICIPALITY','COUNTY','CRASH_MONTH'],
'num':['AGE','SPEED_LIMIT','CRASH_YEAR'],
'bin':['FEMALE','HILL',
'NON_INTERSECTION','CURVED_ROAD',
'ALCOHOL_RELATED','CURVE_DVR_ERROR',
'DRINKING_DRIVER','DRUGGED_DRIVER','DRUG_RELATED',
'DISTRACTED','FATIGUE_ASLEEP',
'AGGRESSIVE_DRIVING','CELL_PHONE','LANE_DEPARTURE',
'NO_CLEARANCE','NHTSA_AGG_DRIVING','CROSS_MEDIAN',
'RUNNING_RED_LT','RUNNING_STOP_SIGN','TAILGATING',
'SPEEDING_RELATED',
'MATURE_DRIVER','YOUNG_DRIVER',
'SUDDEN_DEER','WORK_ZONE']
}
TARGET = 'SERIOUS_OR_FATALITY'
A few of the above features have missing values, which I will impute via the machine learning pipeline.
I separate features into several categories so that the pipeline function handles them properly. Note that the features in GROUP_FEATURES will be used in the missing data imputation process but won't be used as model features. Also, I'll do a train_test_split to set aside a holdout set (X_test, y_test) which I won't touch again until evaluating the final models at the end of the study.
df = cyclists[[feat for feat_type in feat_dict['all'] for feat in feat_dict['all'][feat_type]]+[TARGET]]
X = df.drop(TARGET,axis=1)
y= df[TARGET]
X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y,test_size=0.2,random_state=42)
Periodic encoding can be important for features that are cyclic in nature, so that the model can learn the relationship between the highest and lowest values (e.g. hour 0 is close to hour 23, chronologically). I examine several options for transforming the periodic features DAY_OF_WEEK and HOUR_OF_DAY:
One of these is periodic spline encoding, with n_splines = 12 for HOUR_OF_DAY and n_splines = 7 for DAY_OF_WEEK; this results in fewer features than one-hot encoding while allowing the model to be more expressive than sine-and-cosine encoding.

I shall use a collection of features in the 'cyclists' dataset to predict the target variable SERIOUS_OR_FATALITY, a binary variable which indicates whether the cyclist suffered serious injury or fatality as a result of the crash. For the discussion that follows, I refer to those with SERIOUS_OR_FATALITY==1 as the positive class and those with SERIOUS_OR_FATALITY==0 as the negative class.
This is an imbalanced classification problem, since only about 7.2% of samples are in the positive class. For such tasks, accuracy score (or equivalently, error rate) is not a reasonable metric to use on its own. In particular, a classifier that predicts SERIOUS_OR_FATALITY==0 for every sample would have an expected accuracy score of 92.8% - which seems quite respectable at first glance - without detecting any of the serious injuries or fatalities. There are some options for metrics that give a more nuanced view of the classifier's performance:
The confusion matrix, a 2x2 matrix which reports the number of TP (true positives), FP (false positives), TN (true negatives) and FN (false negatives).
The precision score, or positive predictive value: $\displaystyle \frac{TP}{TP+FP}$ which is between $0$ and $1$, where $1$ is best ($FP = 0$)
The recall score, or true positive rate: $\displaystyle \frac{TP}{TP+FN}$ which is between $0$ and $1$, where $1$ is best ($FN = 0$)
The specificity, or true negative rate: $\displaystyle \frac{TN}{TN+FP}$. This can be thought of as a recall score for the negative class, and can be computed in scikit-learn by passing pos_label=0 into recall_score.
The negative predictive value: $\displaystyle \frac{TN}{TN+FN}$. This can be thought of as a precision score for the negative class, and can be computed in scikit-learn by passing pos_label=0 into precision_score.
$F_1$-score, which is the harmonic mean of the precision and recall scores: $$F_1 = \frac{2}{(\text{precision})^{-1}+(\text{recall})^{-1}} = \frac{2 \cdot \text{precision}\cdot \text{recall}}{\text{precision}+\text{recall}} = \frac{2\cdot TP}{2\cdot TP+FP+FN}$$ $F_1$ is between $0$ and $1$, where $1$ is best (both precision and recall equal $1$) and $0$ is worst (precision is $0$ or recall is $0$). The $F_1$ score has two significant limitations: it treats precision and recall as equally important, and it focuses entirely on the positive class (true negatives never enter the formula).
Note that one can actually compute a separate $F_1$ score for each class, by treating each class in turn as the positive class. The scikit-learn "classification report" functionality reports an $F_1$ score for each class, and this is exactly what is happening in that case.
More generally, the $F_{\beta}$-score allows one to weight recall and precision differently by adjusting a parameter $\beta \geq 0$. This may be desirable if high precision is more important than high recall or vice versa. $$F_{\beta} = (1+\beta^2) \cdot \frac{\text{precision}\cdot \text{recall}}{\beta^2 \cdot \text{precision}+\text{recall}} = \frac{(1+\beta^2)\cdot TP}{(1+\beta^2)\cdot TP+FP+\beta^2\cdot FN}$$ Note that when $\beta=0$, $F_{\beta}$ equals the precision score, and as $\beta \rightarrow \infty$, $F_{\beta}$ approaches the recall score. When $\beta = 1$, we recover the usual $F_1$ score. We should view $\beta$ as the ratio of the perceived importance of recall to the importance of precision.
Another modification of the $F_1$ score that helps address class imbalance is the weighted average $F_1$ score. One can compute a $F_1$ score for each class separately (varying which class is treated as the positive class), and then take the weighted average of these $F_1$ scores, weighted by the total number of samples in each class. Note that in this case, it is possible to produce a weighted average $F_1$ score that does not lie between the precision and recall scores.
The ROC-AUC score is the area under the ROC (receiver operating characteristic) curve. The ROC curve is the plot of the true positive rate (i.e. the recall $TP/(TP+FN)$) vs. the false positive rate (i.e. $FP/(FP+TN)$) of a model at various values of the prediction probability threshold from $0$ to $1$. The ROC-AUC score of a random classifier (i.e. a classifier which is expected to be correct $50\%$ of the time) will tend towards $0.5$ as the number of samples increases. A perfectly correct classifier has ROC-AUC score equal to $1$, and a perfectly incorrect classifier has score equal to $0$. Intuitively, the ROC-AUC score can be interpreted as the probability that, given a randomly chosen positive sample and negative sample, the classifier will rank their prediction probabilities correctly. Generally speaking, ROC-AUC scores between $0.7$ and $0.8$ are acceptable, those between $0.8$ and $0.9$ are excellent, and those above $0.9$ are outstanding.
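As a quick illustration of these metrics, here is a toy example on made-up predictions (values chosen purely for demonstration):

```python
# Toy illustration of the metrics discussed above (made-up predictions).
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             fbeta_score, f1_score, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4, 0.2, 0.3])
y_pred = (y_prob >= 0.35).astype(int)   # an arbitrary decision threshold

print(confusion_matrix(y_true, y_pred))             # rows: true class, cols: predicted class
print(precision_score(y_true, y_pred))              # TP / (TP + FP)
print(recall_score(y_true, y_pred))                 # TP / (TP + FN)
print(recall_score(y_true, y_pred, pos_label=0))    # specificity (recall of the negative class)
print(f1_score(y_true, y_pred, average='weighted')) # weighted-average F_1
print(fbeta_score(y_true, y_pred, beta=3))          # recall weighted 3x as heavily as precision
print(roc_auc_score(y_true, y_prob))                # threshold-free ranking metric
```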
I will initially use the ROC-AUC score in the model selection and hyperparameter tuning processes, and then assess $F_1$ and $F_{\beta}$ scores for particular decision threshold choices for the best models.
I have streamlined my model selection and evaluation process by creating a custom ClassifierStudy class which contains methods to:
- build_pipeline
- fit_pipeline
- score_pipeline
- cv_score
- randomized_search
- shap_values and shap_plot

As described previously, I've included several options for how to encode the periodic features HOUR_OF_DAY and DAY_OF_WEEK, some of which express their periodic nature (sine-and-cosine encoding, periodic spline encoding) and others which don't (one-hot encoding, ordinal encoding, and leaving them out entirely). Furthermore, I have included hybrid methods which encode HOUR_OF_DAY using either trig or spline encoding, encode DAY_OF_WEEK as a binary variable IS_WEEKEND, and then use polynomial combinations of these.
I'll use repeated stratified cross-validation (with five splits and three repeats) to evaluate these methods and choose a winner.
from lib.study_classif import ClassifierStudy
cyc_results = {}
for method in [None,'ord','onehot','trig','spline','interact-trig','interact-spline']:
    clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',class_weight='balanced')
    study = ClassifierStudy(classifier = clf, X = X_train,y = y_train,features = feat_dict['all'])
    study.build_pipeline(cyc_method = method)
    cyc_results[method] = study.cv_score(print_mean_score = False, return_mean_score = True)
pd.DataFrame(cyc_results,index=['mean of cv scores']).sort_values(axis=1,by='mean of cv scores',ascending=False)
The results are quite similar, with periodic spline encoding giving a slight edge over the other methods. Going forward, we'll use periodic spline encoding for the periodic features HOUR_OF_DAY and DAY_OF_WEEK, which is the default method in the cv_score method of ClassifierStudy.
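For reference, here is a minimal sketch of what sine/cosine and periodic spline encodings of HOUR_OF_DAY can look like with plain scikit-learn; the SplineTransformer settings below are my assumption and may not match the custom pipeline's implementation exactly.

```python
# Minimal sketch of two periodic encodings of HOUR_OF_DAY (settings are assumptions).
import numpy as np
import pandas as pd
from sklearn.preprocessing import SplineTransformer

hours = pd.DataFrame({'HOUR_OF_DAY': np.arange(24)})

# Sine/cosine encoding: two features capturing the 24-hour cycle
hours['hour_sin'] = np.sin(2 * np.pi * hours['HOUR_OF_DAY'] / 24)
hours['hour_cos'] = np.cos(2 * np.pi * hours['HOUR_OF_DAY'] / 24)

# Periodic spline encoding: knots spanning one full 24-hour period
spline = SplineTransformer(degree=3,
                           knots=np.linspace(0, 24, 13).reshape(-1, 1),
                           extrapolation='periodic')
hour_splines = spline.fit_transform(hours[['HOUR_OF_DAY']])
print(hour_splines.shape)  # one column per periodic spline basis function
```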
One significant advantage of logistic regression is its interpretability. Recall that in logistic regression, the log of the odds ratio $ln\left( \frac{P(y=1)}{P(y=0)}\right)$ or 'log-odds' is modeled as a linear function of the input features. Therefore, the coefficient of a particular feature can be interpreted as the rate of change of the log-odds with respect to that feature, i.e. the change in log-odds induced by a 1-unit change in that feature.
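As a concrete illustration, exponentiating a coefficient converts it into an odds multiplier; the value 0.62 below is roughly the HILL coefficient that appears in a later coefficient table.

```python
# Interpreting a logistic regression coefficient as an odds ratio.
# 0.62 is roughly the HILL coefficient reported in a later table.
import numpy as np

coef_hill = 0.62
odds_multiplier = np.exp(coef_hill)
print(f'A crash on a hill multiplies the predicted odds of serious injury '
      f'or fatality by about {odds_multiplier:.2f}, holding other features fixed.')
```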
I begin by fitting a couple of baseline logistic regression models so that I can examine their coefficients in the trained model. LRStudy is a child class of ClassifierStudy which is intended for pipelines ending in an instance of sklearn.linear_model.LogisticRegression. This child class has an additional method, plot_coeff, which fits a model and displays a bar plot of the coefficient values associated with the various model features.
from lib.study_classif import LRStudy
A couple of notes on this choice of classifier:

- class_weight='balanced': this can help linear models perform better in the presence of imbalanced target classes. The imbalance here is around 93%-7%, which is not particularly extreme; with a more extreme class imbalance, I would consider minority class oversampling methods (e.g. random oversampling, SMOTE, ADASYN, etc.).
- While class_weight='balanced' often improves the performance of a logistic regression model in the presence of class imbalance, it will also generally result in poor estimates of individual class probabilities. Ordinary logistic regression produces well-calibrated class probabilities, but this isn't necessarily the case when class_weight='balanced' is used, because the loss function is modified.

clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',class_weight='balanced')
study = LRStudy(clf,X_train,y_train,feat_dict['all'])
study.build_pipeline()
study.plot_coeff(title_add = 'L2-regularization only')
Score on validation set: 0.72338032152522
The horizontal line drawn at y=0 is intended to help distinguish between features which have positive and negative effects on the log-odds. There are no features with zero coefficients, but many have very small coefficients. They don't have a significant influence on the model and L1 regularization will likely push them to zero.
I start with an out-of-the-box L1-regularized logistic regression classifier.
clf = LogisticRegression(max_iter=1000,solver='saga',penalty='l1',class_weight='balanced')
study = LRStudy(clf,X_train,y_train,feat_dict['all'])
study.build_pipeline()
study.plot_coeff(title_add = 'L1-regularization only')
Score on validation set: 0.7230131258396489
I see that some features indeed have zero coefficients, but some of the remaining nonzero coefficients are quite small. If I decrease the hyperparameter C (the inverse of the regularization strength), I can push more of them to zero. The previous run used the default value C=1.0; I'll now try C=0.1.
clf = LogisticRegression(max_iter=1000,solver='saga',penalty='l1',class_weight='balanced',C=0.1)
study = LRStudy(clf,X_train,y_train,feat_dict['all'])
study.build_pipeline()
study.plot_coeff(title_add = 'stronger L1-regularization')
Score on validation set: 0.7272700246189844
I've pushed lots more coefficients to zero, and also raised the validation score. In particular, the following have zero coefficients in the model:
In a model with L1-regularization at least this strong, it appears safe to remove the features in the first three bullet points above without performance loss. I will keep TAILGATING, RUNNING_STOP_SIGN, and CURVE_DVR_ERROR, however, since I want to see how they interact with AGGRESSIVE_DRIVING. Additionally, I will create a few salient binary features from TCD_TYPE and then drop TCD_TYPE and TCD_FUNC_CD as well.
I'll go ahead and make a dictionary of feature-type lists with the aforementioned features removed, and make sure the SAGA model tolerates dropping them.
for tcd_type in ['traffic_signal','flashing_traffic_signal','stop_sign']:
    cyclists[f'TCD_{tcd_type}'] = (cyclists.TCD_TYPE==tcd_type).astype(int)
# cyclists['TCD_functioning'] = (cyclists.TCD_FUNC_CD=='functioning_properly').astype(int)
feat_dict['all']['bin'] += [f'TCD_{tcd_type}' for tcd_type in ['traffic_signal','flashing_traffic_signal','stop_sign']]
df = cyclists[[feat for feat_type in feat_dict['all'] for feat in feat_dict['all'][feat_type]]+[TARGET]]
X = df.drop(TARGET,axis=1)
y= df[TARGET]
X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y,test_size=0.2,random_state=42)
feat_dict['lr_small'] = {}
drop_feat = ['BUS_COUNT','WEATHER','ROAD_CONDITION',
'TCD_TYPE','TCD_FUNC_CD',
'DISTRACTED','DRUGGED_DRIVER','DRINKING_DRIVER',
'FATIGUE_ASLEEP', 'CELL_PHONE',
'SUDDEN_DEER', 'WORK_ZONE']
for feat_type in feat_dict['all']:
    feat_dict['lr_small'][feat_type] = [feat for feat in feat_dict['all'][feat_type] if feat not in drop_feat]
for tcd_type in ['traffic_signal','flashing_traffic_signal','stop_sign']:
    feat_dict['all']['bin'].remove(f'TCD_{tcd_type}')
clf = LogisticRegression(max_iter=1000,solver='saga',penalty='l1',class_weight='balanced',C=0.1)
study = LRStudy(clf,X_train,y_train,features=feat_dict['lr_small'])
study.build_pipeline(cyc_method=None)
study.plot_coeff('stronger L1-regularization and fewer features')
Score on validation set: 0.729097968130307
clf = LogisticRegression(max_iter=1000,solver='saga',penalty='l1',class_weight='balanced',C=0.1)
study = LRStudy(clf,X_train,y_train,features=feat_dict['lr_small'])
study.build_pipeline()
study.plot_coeff('stronger L1-regularization and fewer features')
Score on validation set: 0.7285829299805233
Indeed, there is no loss in validation score! I'll take a closer look at the features with the largest coefficients, as they are the most impactful to the model. I'll list just those whose coefficient magnitude is greater than 0.25:
large_coeff = study.coeff[np.abs(study.coeff['coeff value'])>0.25].sort_values(by='coeff value',ascending=False)
large_coeff.style.background_gradient(axis=0,gmap=-large_coeff['coeff value'],cmap='RdBu')
| feature name | coeff value |
|---|---|
| SPEEDING_RELATED | 0.875896 |
| COMM_VEH_COUNT | 0.839971 |
| HEAVY_TRUCK_COUNT | 0.781875 |
| DRUG_RELATED | 0.764003 |
| HILL | 0.621782 |
| ALCOHOL_RELATED | 0.573973 |
| SMALL_TRUCK_COUNT | 0.453031 |
| RUNNING_RED_LT | 0.446169 |
| IMPACT_SIDE_front_left | 0.441555 |
| VEH_ROLE_striking_struck | 0.399137 |
| CURVED_ROAD | 0.393789 |
| DRINKING_DRIVER | 0.381857 |
| URBAN_RURAL_rural | 0.334569 |
| VAN_COUNT | 0.297721 |
| SUV_COUNT | 0.296763 |
| ILLUMINATION_dark_unlit | 0.281212 |
| ILLUMINATION_dawn | 0.278546 |
| COLLISION_TYPE_head_on | 0.271771 |
| FEMALE | -0.257571 |
| ILLUMINATION_dusk | -0.261101 |
| VEH_ROLE_striking | -0.266743 |
| URBAN_RURAL_urban | -0.350611 |
| COLLISION_TYPE_sideswipe_opp_dir | -0.370684 |
| AGGRESSIVE_DRIVING | -0.401452 |
| IMPACT_SIDE_front_right | -0.420935 |
| RESTRAINT_HELMET_unknown | -0.529896 |
| COLLISION_TYPE_sideswipe_same_dir | -0.612924 |
| NO_CLEARANCE | -0.741798 |
I'll establish some CV-scores for the two baseline models, and I'll go ahead and use the reduced feature set for both. Here I use five-fold cross-validation repeated three times (i.e. RepeatedStratifiedKFold with n_splits=5 and n_repeats=3.)
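For reference, the cv_score method presumably wraps a pattern like the following; this toy sketch uses only the numeric features so that it runs standalone, whereas the real pipeline preprocesses the full feature set.

```python
# Sketch of the repeated stratified CV scheme described above, using plain scikit-learn
# on the numeric features only. The custom cv_score method wraps the full pipeline instead.
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

toy_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler(),
                         LogisticRegression(max_iter=1000, class_weight='balanced'))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(toy_pipe, X_train[feat_dict['all']['num']], y_train,
                         cv=cv, scoring='roc_auc')
print(f'Mean CV roc_auc score (numeric features only): {scores.mean():.4f}')
```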
%%time
clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',class_weight='balanced',C=0.1)
study_nc = LRStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_nc.build_pipeline()
study_nc.cv_score()
Mean CV roc_auc score: 0.7372753688357127 CPU times: user 7.53 s, sys: 203 ms, total: 7.73 s Wall time: 5.71 s
%%time
clf = LogisticRegression(max_iter=1000,solver='saga',penalty='l1',class_weight='balanced',C=0.1)
study_saga = LRStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_saga.build_pipeline()
study_saga.cv_score()
Mean CV roc_auc score: 0.7360723698622664 CPU times: user 9.16 s, sys: 93.3 ms, total: 9.25 s Wall time: 9.26 s
The two models seem to perform comparably despite the different regularization methods. Since Newton-Cholesky provides a slightly better CV score, I'll proceed with only that one going forward.
Linear models like logistic regression can often perform well if the data is pre-processed with principal component analysis (PCA) to reduce the dimensionality and focus on the axes with the highest variance. However, the optimal number of PCA components depends on the dataset. I'll use randomized search cross-validation to attempt to find an optimal value of n_components.
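In plain scikit-learn terms, adding PCA amounts to inserting a step like the one sketched below ahead of the classifier (shown here on the numeric features only so it runs standalone; the custom build_pipeline(pca=True) applies PCA after its full preprocessing).

```python
# Sketch of PCA as a dimensionality-reduction step before logistic regression,
# applied to the numeric features only for simplicity.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pca_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),          # n_components is the parameter tuned below
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced')),
])
pca_pipe.fit(X_train[feat_dict['all']['num']], y_train)
print(pca_pipe.named_steps['pca'].explained_variance_ratio_)
```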
I'll check the number of features in the transformed X_train, so I know a maximum size for n_components.
study_saga.pipe[:-1].fit_transform(X_train).shape[1]
85
%%time
# Use the randomized_search function in order to try a range of n_components values
clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',class_weight='balanced',C=0.1)
study_nc_pca = LRStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_nc_pca.build_pipeline(pca=True)
study_nc_pca.randomized_search(params={},n_components = list(range(60,86)),n_iter=25)
| pca__n_components | mean cv score (roc_auc) |
|---|---|
| 76 | 0.738291 |
| 61 | 0.738240 |
| 69 | 0.738231 |
| 74 | 0.738227 |
| 79 | 0.738211 |
| 77 | 0.738211 |
| 84 | 0.738211 |
| 83 | 0.738211 |
| 80 | 0.738211 |
| 85 | 0.738211 |
CPU times: user 1.14 s, sys: 240 ms, total: 1.38 s Wall time: 16.5 s
PCA didn't seem to give much of an improvement. I won't use it going forward.
%%time
clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',class_weight='balanced',C=0.1)
study_nc = LRStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_nc.build_pipeline()
params = {'C':ss.loguniform(0.02,10)}
study_nc.randomized_search(params,refit=True)
| LR_clf__C | mean cv score (roc_auc) |
|---|---|
| 0.052738 | 0.738320 |
| 0.052730 | 0.738319 |
| 0.028694 | 0.738043 |
| 0.205070 | 0.737860 |
| 0.825641 | 0.737017 |
| 0.838342 | 0.737009 |
| 1.629659 | 0.736703 |
| 1.890861 | 0.736649 |
| 4.353248 | 0.736486 |
| 7.361722 | 0.736436 |
CPU times: user 888 ms, sys: 97.4 ms, total: 986 ms Wall time: 5.09 s
The SAGA solver is also compatible with the Elastic-Net penalty, which uses a mixture of L1 and L2 regularization. The mix is controlled by the parameter l1_ratio: l1_ratio=0 corresponds to pure L2 regularization, l1_ratio=1 to pure L1, and intermediate values blend the two.
For good measure, I'll try tuning hyperparameters on an Elastic-Net model. I tune both C and l1_ratio.
%%time
clf = LogisticRegression(max_iter=1000,solver='saga',penalty='elasticnet',class_weight='balanced')
study_saga_en =LRStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_saga_en.build_pipeline()
params = {'C':ss.loguniform(0.02,10),
'l1_ratio':ss.uniform(0,1)}
study_saga_en.randomized_search(params,n_iter=50,refit=True)
| LR_clf__C | LR_clf__l1_ratio | mean cv score (roc_auc) |
|---|---|---|
| 0.061911 | 0.183405 | 0.738193 |
| 0.205070 | 0.950714 | 0.738190 |
| 0.052738 | 0.155995 | 0.738182 |
| 0.122891 | 0.366362 | 0.738181 |
| 0.114621 | 0.542696 | 0.738178 |
| 0.138154 | 0.325183 | 0.738176 |
| 0.132486 | 0.524756 | 0.738159 |
| 0.138781 | 0.520068 | 0.738148 |
| 0.132792 | 0.097672 | 0.738131 |
| 0.156356 | 0.063558 | 0.738040 |
CPU times: user 3.06 s, sys: 435 ms, total: 3.49 s Wall time: 1min 5s
Performance of the best Elastic-Net classifier is about the same as that of the best model using the Newton-Cholesky solver, and Newton-Cholesky is much faster. Overall, the CV scores don't vary much across hyperparameter choices and there are apparently many good options. I'll proceed with the Newton-Cholesky solver and the best value of C found above.
I've used the area under the ROC curve as a metric to help select hyperparameters, but I haven't yet considered what classification threshold to use. I will consider the $F_{\beta}$ score.
Recall that the $F_{\beta}$ score is a generalization of the $F_1$ score which allows for either precision or recall to be prioritized more strongly over the other. One should view $\beta$ as the ratio of the perceived importance of recall to importance of precision.
In the task of predicting which cyclists will suffer serious injury or fatality in a crash based on various factors, it seems that recall should be more crucial than precision. A false positive corresponds to a cyclist who ended up with lesser injury than predicted, which is "good news"; a false negative corresponds to a cyclist suffering an adverse outcome which the model failed to predict.
The logistic regression model predicts the probability that a particular sample is in class 1 (i.e. suffers serious injury or death), and the model's class prediction is then determined by a decision threshold. A typical threshold is 0.5, but it can be adjusted in order to optimize the desired $F_{\beta}$ score.
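The find_best_threshold method used below presumably does something along the following lines; this is a sketch only, not the class's actual implementation.

```python
# Sketch of threshold selection by maximizing F_beta over a grid of thresholds.
import numpy as np
from sklearn.metrics import fbeta_score

def best_threshold(y_true, y_proba, beta=3.0):
    thresholds = np.linspace(0.01, 0.99, 99)
    scores = [fbeta_score(y_true, (y_proba >= t).astype(int), beta=beta)
              for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Example on toy values:
# thresh, score = best_threshold(np.array([0, 1, 1, 0]), np.array([0.2, 0.6, 0.4, 0.3]))
```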
lr_params = {'C': 0.05273751067944609}
clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',
class_weight='balanced',**lr_params)
study_lr =ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_lr.build_pipeline()
study_lr.fit_pipeline(split_first=True)
study_lr.predict_proba_pipeline()
study_lr.find_best_threshold(beta=1)
Threshold optimizing F_1 score: 0.662141377258788 Best F_1 score: 0.29865361077111385
precision recall f1-score support
neither seriously injured nor killed 0.95 0.90 0.93 3989
seriously injured or killed 0.24 0.39 0.30 312
accuracy 0.87 4301
macro avg 0.60 0.65 0.61 4301
weighted avg 0.90 0.87 0.88 4301
The largest $F_1$ score here is about 0.3, and notice how low the recall is for the positive class: the model only correctly predicts 122 out of 312 cyclists with serious injury or death. I'll now consider the $F_4$ score, which prioritizes recall four times as heavily as precision on the positive class.
study_lr.find_best_threshold(beta=4)
Threshold optimizing F_4 score: 0.2774721271485444 Best F_4 score: 0.5933574879227053
precision recall f1-score support
neither seriously injured nor killed 0.98 0.25 0.40 3989
seriously injured or killed 0.09 0.93 0.16 312
accuracy 0.30 4301
macro avg 0.53 0.59 0.28 4301
weighted avg 0.91 0.30 0.38 4301
This has swung too far the other way. Now there are nearly 3000 false positives! I'll consider $F_3$.
study_lr.find_best_threshold(beta=3)
Threshold optimizing F_3 score: 0.42397093830815596 Best F_3 score: 0.49366816913500755
precision recall f1-score support
neither seriously injured nor killed 0.97 0.59 0.74 3989
seriously injured or killed 0.12 0.74 0.21 312
accuracy 0.60 4301
macro avg 0.55 0.67 0.47 4301
weighted avg 0.91 0.60 0.70 4301
This strikes a good balance: $F_3$ prioritizes recall three times as heavily as precision on the positive class, and the model indeed correctly classifies 74% of cyclists with serious injury or death and 59% of those without.
This result likely depends heavily on the particular validation split used here. I'll take the mean of the best thresholds over a series of trials with differing random seeds in order to produce a more robust threshold choice.
%%time
n_trials = 100
thresh_list = []
for i in range(n_trials):
    clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',
                             class_weight='balanced',**lr_params)
    study_lr = ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'],random_state=42*i)
    study_lr.build_pipeline()
    study_lr.fit_pipeline(split_first=True)
    study_lr.predict_proba_pipeline()
    study_lr.find_best_threshold(beta=3,conf=False,report=False,print_result=False)
    thresh_list.append(study_lr.best_thresh)
lr_thresh = np.mean(thresh_list)
print(f'The average best threshold for F_3 over {n_trials} trials is {lr_thresh}')
The average best threshold for F_3 over 100 trials is 0.3943127118198778 CPU times: user 52.1 s, sys: 1.22 s, total: 53.4 s Wall time: 36.7 s
lr_thresh = 0.3943127118198778
I will compare the logistic regression results to gradient-boosted decision tree models: GradientBoostingClassifier, HistGradientBoostingClassifier, and LGBMClassifier.
All three of these models allow the use of early stopping, in which the algorithm sets aside a validation set at the start of training and checks model performance on that validation set after each iteration. If the validation score fails to improve for n_iter_no_change consecutive iterations, training is halted, effectively at the beginning of that waiting period, since the extra iterations brought no improvement. In general, letting a decision-tree-based model train for too many iterations can promote overfitting to the training data, and early stopping can be a valuable measure against this.
Each model exposes early stopping slightly differently:
- For GradientBoostingClassifier, early stopping is enabled if n_iter_no_change is passed, and disabled if n_iter_no_change = None.
- For HistGradientBoostingClassifier, it is controlled by the early_stopping parameter, and can be switched off with early_stopping=False.
- For LGBMClassifier, early stopping is handled inside the custom cv_score method and is activated as long as we set a value of early_stopping_round when initializing the model. The number of iterations used before early stopping triggered is easy to access, and we print the min, max, and mean of this number across folds.

One downside of early stopping is that because the validation set is set aside, the model is trained on slightly less data. Let's first see some baseline scores using early stopping.
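For reference, here is roughly how LightGBM early stopping is wired up directly, outside the custom cv_score wrapper; the validation split and settings below are illustrative assumptions, using the numeric features only so the snippet runs standalone.

```python
# Sketch of LightGBM early stopping outside the custom wrapper (illustrative settings).
from lightgbm import LGBMClassifier, early_stopping
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train[feat_dict['all']['num']], y_train,
    stratify=y_train, test_size=0.2, random_state=42)

lgbm = LGBMClassifier(n_estimators=2000, random_state=42, verbosity=-1)
lgbm.fit(X_tr, y_tr,
         eval_set=[(X_val, y_val)],
         eval_metric='auc',
         callbacks=[early_stopping(stopping_rounds=150, verbose=False)])
print('Iterations used before early stopping:', lgbm.best_iteration_)
```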
I'll compute some cross-validation scores for GradientBoostingClassifier, HistGradientBoostingClassifier, and LGBMClassifier, using both feature sets to see how they fare.
import time
clf_list = []
clf_list.append(
('GBC',GradientBoostingClassifier(
n_estimators = 2000, n_iter_no_change=50,random_state=42
))
)
clf_list.append(
('HGBC',HistGradientBoostingClassifier(
early_stopping=True,max_iter=2000,n_iter_no_change=50,random_state=42
))
)
clf_list.append(('LGBM',LGBMClassifier(
n_estimators = 300, random_state=42, verbosity=-1, early_stopping_round=150
))
)
cv_scores = {}
for feature_set in ['all','lr_small']:
    cv_scores[feature_set] = []
    for clf in clf_list:
        start = time.time()
        study = ClassifierStudy(clf[1],X_train,y_train,feat_dict[feature_set])
        study.build_pipeline()
        score = study.cv_score(print_mean_score = False, return_mean_score = True)
        cv_scores[feature_set].append(score)
        end = time.time()
        print(f"The time elapsed for {clf[0]} with feature set '{feature_set}' was {end-start:.2f} seconds")
pd.DataFrame(cv_scores,index = [clf[0] for clf in clf_list])\
.rename(columns={'all':'all features',
'lr_small':'small feature set'})
The time elapsed for GBC with feature set 'all' was 65.44 seconds The time elapsed for HGBC with feature set 'all' was 21.21 seconds Number of iterations used due to early stopping: Min:10.0...Max:76.0...Mean:40.1 The time elapsed for LGBM with feature set 'all' was 19.25 seconds The time elapsed for GBC with feature set 'lr_small' was 48.61 seconds The time elapsed for HGBC with feature set 'lr_small' was 18.08 seconds Number of iterations used due to early stopping: Min:17.0...Max:81.0...Mean:38.1 The time elapsed for LGBM with feature set 'lr_small' was 18.68 seconds
| classifier | all features | small feature set |
|---|---|---|
| GBC | 0.737861 | 0.739843 |
| HGBC | 0.734066 | 0.732187 |
| LGBM | 0.733911 | 0.734358 |
Let's do another baseline trial in which we don't use early stopping. For GBC and HGBC we'll use the default number of iterations (100) and for LGBMC we'll use 50 - in the ballpark suggested by the early stopping results of the previous trial.
import time
clf_list = []
clf_list.append(
('GBC',GradientBoostingClassifier(
random_state=42
))
)
clf_list.append(
('HGBC',HistGradientBoostingClassifier(
early_stopping=False,random_state=42
))
)
clf_list.append(('LGBM',LGBMClassifier(
n_estimators = 50, random_state=42, verbosity=-1
))
)
cv_scores = {}
for feature_set in ['all','lr_small']:
    cv_scores[feature_set] = []
    for clf in clf_list:
        start = time.time()
        study = ClassifierStudy(clf[1],X_train,y_train,feat_dict[feature_set])
        study.build_pipeline()
        score = study.cv_score(print_mean_score = False, return_mean_score = True)
        cv_scores[feature_set].append(score)
        end = time.time()
        print(f"The time elapsed for {clf[0]} with feature set '{feature_set}' was {end-start:.2f} seconds")
pd.DataFrame(cv_scores,index = [clf[0] for clf in clf_list])\
.rename(columns={'all':'all features',
'lr_small':'small feature set'})
The time elapsed for GBC with feature set 'all' was 45.66 seconds The time elapsed for HGBC with feature set 'all' was 20.06 seconds The time elapsed for LGBM with feature set 'all' was 8.00 seconds The time elapsed for GBC with feature set 'lr_small' was 38.27 seconds The time elapsed for HGBC with feature set 'lr_small' was 20.42 seconds The time elapsed for LGBM with feature set 'lr_small' was 8.07 seconds
| classifier | all features | small feature set |
|---|---|---|
| GBC | 0.739526 | 0.740716 |
| HGBC | 0.733804 | 0.733977 |
| LGBM | 0.739802 | 0.740515 |
I notice a few things:
We'll proceed to tune hyperparameters for all three models. For LGBMC, we won't use early stopping; instead we'll tune the n_estimators parameter, since this model trains so quickly that we can afford additional iterations of our randomized search. For GBC and HGBC, we'll stick with early stopping so that we don't have an additional parameter to tune.
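The custom randomized_search method presumably wraps scikit-learn's RandomizedSearchCV roughly as sketched below; this toy search uses only the numeric features so it runs standalone, and the clf__ prefix mirrors the pipeline step naming seen in the result tables.

```python
# Sketch of a randomized hyperparameter search with scikit-learn (toy version on
# numeric features only; the real search uses the full preprocessing pipeline).
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as ss

toy_pipe = Pipeline([('clf', LGBMClassifier(random_state=42, verbosity=-1))])
param_dist = {'clf__learning_rate': ss.loguniform(0.01, 0.5),
              'clf__n_estimators': ss.randint(20, 200)}
search = RandomizedSearchCV(toy_pipe, param_dist, n_iter=20, scoring='roc_auc',
                            cv=5, random_state=42, n_jobs=-1)
search.fit(X_train[feat_dict['all']['num']], y_train)
print(search.best_params_, search.best_score_)
```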
%%time
clf = LGBMClassifier(random_state=42,verbosity=-1)
study_lgb = ClassifierStudy(clf,X_train,y_train,feat_dict['all'])
study_lgb.build_pipeline()
params = {'learning_rate':ss.loguniform(0.01,0.5),
'n_estimators':ss.randint(20,200),
'max_depth':list(range(2,10))+[None],
'reg_alpha':ss.loguniform(1,12),
'reg_lambda':ss.loguniform(1,12),
'min_child_samples':ss.randint(20,200),
}
study_lgb.randomized_search(params, n_iter=200, refit=True)
| clf__learning_rate | clf__max_depth | clf__min_child_samples | clf__n_estimators | clf__reg_alpha | clf__reg_lambda | mean cv score (roc_auc) |
|---|---|---|---|---|---|---|
| 0.054095 | 4.000000 | 154 | 198 | 2.298391 | 1.413513 | 0.746978 |
| 0.361816 | 2.000000 | 93 | 76 | 3.029152 | 1.584134 | 0.746602 |
| 0.196691 | 2.000000 | 26 | 193 | 5.707852 | 1.198187 | 0.746079 |
| 0.091231 | 5.000000 | 152 | 139 | 1.058877 | 5.045954 | 0.745935 |
| 0.092152 | 3.000000 | 63 | 181 | 2.669554 | 10.000757 | 0.745788 |
| 0.096562 | 4.000000 | 130 | 170 | 1.374626 | 1.458581 | 0.745520 |
| 0.087464 | 7.000000 | 112 | 95 | 4.081847 | 1.483229 | 0.745419 |
| 0.082637 | 4.000000 | 100 | 128 | 1.468653 | 4.706400 | 0.745365 |
| 0.187748 | 3.000000 | 57 | 134 | 1.826710 | 1.952371 | 0.745265 |
| 0.358340 | 2.000000 | 48 | 82 | 1.993273 | 1.727925 | 0.745239 |
CPU times: user 10.6 s, sys: 1.51 s, total: 12.1 s Wall time: 3min 54s
%%time
clf = LGBMClassifier(random_state=42,verbosity=-1)
study_lgb_small = ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_lgb_small.build_pipeline()
params = {'learning_rate':ss.loguniform(0.05,0.3),
'n_estimators':ss.randint(50,200),
'max_depth':list(range(2,4))+[None],
'reg_alpha':ss.loguniform(1,5),
'reg_lambda':ss.loguniform(1,5),
'min_child_samples':ss.randint(20,100),
}
study_lgb_small.randomized_search(params, n_iter=200, refit=True)
| clf__learning_rate | clf__max_depth | clf__min_child_samples | clf__n_estimators | clf__reg_alpha | clf__reg_lambda | mean cv score (roc_auc) |
|---|---|---|---|---|---|---|
| 0.124370 | 3.000000 | 42 | 192 | 3.373709 | 1.445184 | 0.748165 |
| 0.179153 | 2.000000 | 75 | 130 | 1.536285 | 4.815313 | 0.746769 |
| 0.068086 | 3.000000 | 70 | 196 | 3.753246 | 1.742619 | 0.746754 |
| 0.196413 | 2.000000 | 93 | 169 | 4.883624 | 2.134325 | 0.746580 |
| 0.234433 | 2.000000 | 48 | 181 | 2.094448 | 3.877900 | 0.746553 |
| 0.087403 | 3.000000 | 73 | 155 | 2.493478 | 1.051688 | 0.746528 |
| 0.216218 | 2.000000 | 79 | 162 | 1.475868 | 1.161652 | 0.746509 |
| 0.190305 | 3.000000 | 97 | 166 | 3.242309 | 4.527639 | 0.746481 |
| 0.154016 | 2.000000 | 27 | 193 | 1.085140 | 4.165995 | 0.746453 |
| 0.156615 | 2.000000 | 31 | 192 | 3.051467 | 1.086586 | 0.746408 |
CPU times: user 8.32 s, sys: 1.07 s, total: 9.39 s Wall time: 3min 41s
%%time
clf = HistGradientBoostingClassifier(early_stopping=True,max_iter=2000,
n_iter_no_change=50,random_state=42)
study_hgb = ClassifierStudy(clf,X_train,y_train,feat_dict['all'])
study_hgb.build_pipeline()
params = {'learning_rate':ss.loguniform(0.005,0.2),
'max_depth':list(range(2,10))+[None],
'l2_regularization':ss.loguniform(1,12),
'min_samples_leaf':ss.randint(20,200),
}
study_hgb.randomized_search(params, n_iter=50, refit=True)
| clf__l2_regularization | clf__learning_rate | clf__max_depth | clf__min_samples_leaf | mean cv score (roc_auc) |
|---|---|---|---|---|
| 11.515739 | 0.021773 | 2.000000 | 143 | 0.743552 |
| 2.423873 | 0.141829 | 2.000000 | 140 | 0.742604 |
| 1.869471 | 0.031303 | 2.000000 | 120 | 0.742459 |
| 5.360337 | 0.075306 | 3.000000 | 170 | 0.742266 |
| 5.422195 | 0.026345 | 3.000000 | 151 | 0.742035 |
| 5.000058 | 0.005010 | nan | 22 | 0.741564 |
| 3.890447 | 0.009888 | 3.000000 | 63 | 0.741419 |
| 9.714206 | 0.115022 | 3.000000 | 52 | 0.741244 |
| 2.247857 | 0.011267 | 4.000000 | 166 | 0.740723 |
| 3.588824 | 0.044469 | 4.000000 | 70 | 0.740503 |
CPU times: user 46.2 s, sys: 14.6 s, total: 1min Wall time: 6min 12s
%%time
clf = HistGradientBoostingClassifier(early_stopping=True,max_iter=2000,
n_iter_no_change=50,random_state=42)
study_hgb_small = ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_hgb_small.build_pipeline()
params = {'learning_rate':ss.loguniform(0.005,0.2),
'max_depth':list(range(2,10))+[None],
'l2_regularization':ss.loguniform(1,12),
'min_samples_leaf':ss.randint(20,200),
}
study_hgb_small.randomized_search(params, n_iter=50, refit=True)
| clf__l2_regularization | clf__learning_rate | clf__max_depth | clf__min_samples_leaf | mean cv score (roc_auc) |
|---|---|---|---|---|
| 2.423873 | 0.141829 | 2.000000 | 140 | 0.743990 |
| 5.422195 | 0.026345 | 3.000000 | 151 | 0.743437 |
| 11.515739 | 0.021773 | 2.000000 | 143 | 0.743403 |
| 1.869471 | 0.031303 | 2.000000 | 120 | 0.742660 |
| 5.360337 | 0.075306 | 3.000000 | 170 | 0.742111 |
| 5.000058 | 0.005010 | nan | 22 | 0.742039 |
| 3.890447 | 0.009888 | 3.000000 | 63 | 0.741894 |
| 2.062002 | 0.047774 | 4.000000 | 127 | 0.741084 |
| 2.247857 | 0.011267 | 4.000000 | 166 | 0.741053 |
| 1.171092 | 0.015746 | 9.000000 | 56 | 0.741041 |
CPU times: user 7.59 s, sys: 1.67 s, total: 9.26 s Wall time: 4min 31s
%%time
clf = GradientBoostingClassifier(n_estimators = 2000, n_iter_no_change=50,random_state=42)
study_gb = ClassifierStudy(clf,X_train,y_train,feat_dict['all'])
study_gb.build_pipeline()
params = {'learning_rate':ss.loguniform(0.005,0.2),
'max_depth':list(range(2,10))+[None],
'subsample':ss.uniform(0.5,1),
'min_samples_leaf':ss.randint(20,200),
}
study_gb.randomized_search(params, n_iter = 50, refit=True)
| clf__learning_rate | clf__max_depth | clf__min_samples_leaf | clf__subsample | mean cv score (roc_auc) |
|---|---|---|---|---|
| 0.031303 | 2.000000 | 120 | 0.784840 | 0.747300 |
| 0.011267 | 4.000000 | 166 | 0.848666 | 0.745177 |
| 0.006046 | 6.000000 | 148 | 0.739562 | 0.744191 |
| 0.036629 | 5.000000 | 32 | 0.576980 | 0.743771 |
| 0.009835 | 7.000000 | 108 | 0.791229 | 0.743310 |
| 0.065118 | 4.000000 | 167 | 0.825959 | 0.742898 |
| 0.007780 | 9.000000 | 78 | 0.823203 | 0.741206 |
| 0.006738 | 9.000000 | 142 | 0.742160 | 0.741153 |
| 0.016542 | 8.000000 | 111 | 0.856298 | 0.741101 |
| 0.019907 | 9.000000 | 40 | 0.656019 | 0.740766 |
CPU times: user 13.2 s, sys: 757 ms, total: 14 s Wall time: 10min 32s
%%time
clf = GradientBoostingClassifier(n_estimators = 2000, n_iter_no_change=50,random_state=42)
study_gb_small = ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'])
study_gb_small.build_pipeline()
params = {'learning_rate':ss.loguniform(0.005,0.2),
'max_depth':list(range(2,10))+[None],
'subsample':ss.uniform(0.5,1),
'min_samples_leaf':ss.randint(20,200),
}
study_gb_small.randomized_search(params, n_iter = 50, refit=True)
| clf__learning_rate | clf__max_depth | clf__min_samples_leaf | clf__subsample | mean cv score (roc_auc) |
|---|---|---|---|---|
| 0.031303 | 2.000000 | 120 | 0.784840 | 0.747868 |
| 0.011267 | 4.000000 | 166 | 0.848666 | 0.745586 |
| 0.006046 | 6.000000 | 148 | 0.739562 | 0.744704 |
| 0.065118 | 4.000000 | 167 | 0.825959 | 0.744684 |
| 0.036629 | 5.000000 | 32 | 0.576980 | 0.743293 |
| 0.009835 | 7.000000 | 108 | 0.791229 | 0.743057 |
| 0.006738 | 9.000000 | 142 | 0.742160 | 0.742037 |
| 0.016542 | 8.000000 | 111 | 0.856298 | 0.741965 |
| 0.007780 | 9.000000 | 78 | 0.823203 | 0.741028 |
| 0.019907 | 9.000000 | 40 | 0.656019 | 0.740210 |
CPU times: user 10.8 s, sys: 429 ms, total: 11.3 s Wall time: 7min 11s
Here are the results of the hyperparameter tuning, in case you want to proceed without re-running the randomized search.
best_params = {'all': {'gb': {'learning_rate': 0.03130343103935811,
'max_depth': 2,
'min_samples_leaf': 120,
'subsample': 0.7848404943774676},
'hgb': {'l2_regularization': 11.515738930806396,
'learning_rate': 0.021772619713840338,
'max_depth': 2,
'min_samples_leaf': 143},
'lgb': {'learning_rate': 0.05409453242074271,
'max_depth': 4,
'min_child_samples': 154,
'n_estimators': 198,
'reg_alpha': 2.2983907150108998,
'reg_lambda': 1.413513006153221}},
'lr_small': {'gb': {'learning_rate': 0.03130343103935811,
'max_depth': 2,
'min_samples_leaf': 120,
'subsample': 0.7848404943774676},
'hgb': {'l2_regularization': 2.4238734679222236,
'learning_rate': 0.14182851952262965,
'max_depth': 2,
'min_samples_leaf': 140},
'lgb': {'learning_rate': 0.12436979645979512,
'max_depth': 3,
'min_child_samples': 42,
'n_estimators': 192,
'reg_alpha': 3.373708711638402,
'reg_lambda': 1.4451837004705437}}}
best_params = {}
best_params['all'] = {
'gb':study_gb.best_params,
'hgb':study_hgb.best_params,
'lgb':study_lgb.best_params
}
best_params['lr_small'] = {
'gb':study_gb_small.best_params,
'hgb':study_hgb_small.best_params,
'lgb':study_lgb_small.best_params
}
best_params
{'all': {'gb': {'learning_rate': 0.03130343103935811,
'max_depth': 2,
'min_samples_leaf': 120,
'subsample': 0.7848404943774676},
'hgb': {'l2_regularization': 11.515738930806396,
'learning_rate': 0.021772619713840338,
'max_depth': 2,
'min_samples_leaf': 143},
'lgb': {'learning_rate': 0.19669065259669294,
'max_depth': 2,
'min_child_samples': 26,
'n_estimators': 193,
'reg_alpha': 5.707852486890183,
'reg_lambda': 1.1981866373949588}},
'lr_small': {'gb': {'learning_rate': 0.03130343103935811,
'max_depth': 2,
'min_samples_leaf': 120,
'subsample': 0.7848404943774676},
'hgb': {'l2_regularization': 2.4238734679222236,
'learning_rate': 0.14182851952262965,
'max_depth': 2,
'min_samples_leaf': 140},
'lgb': {'learning_rate': 0.12656361513845585,
'max_depth': 3,
'min_child_samples': 36,
'n_estimators': 189,
'reg_alpha': 3.070386836762345,
'reg_lambda': 3.097528444873408}}}
tuned_scores = {}
tuned_scores['all'] = [
study_gb.best_score,
study_hgb.best_score,
study_lgb.best_score
]
tuned_scores['lr_small'] = [
study_gb_small.best_score,
study_hgb_small.best_score,
study_lgb_small.best_score
]
tuned_scores = pd.DataFrame(tuned_scores,index = ['GBC','HGBC','LGBMC'])\
.rename(columns={'all':'all features',
'lr_small':'small feature set'})
tuned_scores
| classifier | all features | small feature set |
|---|---|---|
| GBC | 0.747300 | 0.747868 |
| HGBC | 0.743552 | 0.743990 |
| LGBMC | 0.746978 | 0.748165 |
Just as I did with the logistic regression model, I'll calculate the best threshold choices over a series of train/test split trials, and then select the mean of those thresholds.
lgb_params = best_params['lr_small']['lgb']
%%time
n_trials = 100
thresh_list = []
for i in range(n_trials):
    clf = LGBMClassifier(random_state=42*i,**lgb_params,verbosity=-1)
    study_lgb = ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'],random_state=42*i)
    study_lgb.build_pipeline()
    study_lgb.fit_pipeline(split_first=True)
    study_lgb.predict_proba_pipeline()
    study_lgb.find_best_threshold(beta=3,conf=False,report=False,print_result=False)
    thresh_list.append(study_lgb.best_thresh)
lgb_thresh = np.mean(thresh_list)
print(f'The average best threshold for F_3 over {n_trials} trials is {lgb_thresh}')
The average best threshold for F_3 over 100 trials is 0.051350362708522494 CPU times: user 55.7 s, sys: 11.3 s, total: 1min 6s Wall time: 52.1 s
What explains this small threshold? Recall that in the logistic regression models I used the parameter class_weight='balanced', a measure that can help those models significantly in the presence of class imbalance. Given that the training set is fairly large and the class imbalance is not particularly extreme (roughly a 90%-10% split as opposed to a 99%-1% split or worse), tree ensemble models should still be able to learn the minority class relatively well. However, they will still be 'less confident' when predicting the minority class, i.e. the predicted probabilities will be scaled down and will require smaller thresholds for class distinction.

To address this anyway, we can use the scale_pos_weight hyperparameter of LGBMClassifier. For binary classification, it's recommended to set scale_pos_weight equal to the ratio of the negative class count to the positive class count.
%%time
scale_pos_weight = (len(y_train)-sum(y_train))/sum(y_train)
n_trials = 100
thresh_list = []
for i in range(n_trials):
    clf = LGBMClassifier(random_state=42*i,scale_pos_weight = scale_pos_weight,**lgb_params,verbosity=-1)
    study_lgb_bal = ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'],random_state=42*i)
    study_lgb_bal.build_pipeline()
    study_lgb_bal.fit_pipeline(split_first=True)
    study_lgb_bal.predict_proba_pipeline()
    study_lgb_bal.find_best_threshold(beta=3,conf=False,report=False,print_result=False)
    thresh_list.append(study_lgb_bal.best_thresh)
lgb_bal_thresh = np.mean(thresh_list)
print(f'The average best threshold for F_3 over {n_trials} trials is {lgb_bal_thresh}')
The average best threshold for F_3 over 100 trials is 0.3612220261352673 CPU times: user 56.1 s, sys: 11 s, total: 1min 7s Wall time: 52 s
lgb_thresh = 0.051350362708522494
lgb_bal_thresh = 0.3612220261352673
I'll choose between the best logistic regression classifier and the best balanced and unbalanced LightGBM classifiers, based on an array of their cross-validation scores according to various metrics.
def cv_scores_dict(clf,thresh):
    """
    Builds a pipeline ending with the input classifier clf
    and computes a mean CV score using a variety of metrics

    Parameters:
    -----------
    clf : sklearn compatible binary classifier
    thresh : float
        the classification threshold to use when
        computing y_pred from y_pred_proba

    Returns:
    --------
    scores_dict : dict
        keys are scoring metric keywords and values are
        the corresponding mean CV scores
    """
    study = ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'])
    study.build_pipeline()
    scores_dict = {}
    for scoring in ['roc_auc','fb','f1w','acc']:
        scores_dict[scoring] = [study.cv_score(scoring=scoring,beta=3,thresh=thresh,
                                               return_mean_score=True, print_mean_score=False)]
    return scores_dict
%%time
clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',
class_weight='balanced',**lr_params)
lr_cv = cv_scores_dict(clf,lr_thresh)
CPU times: user 3min 32s, sys: 16.7 s, total: 3min 48s Wall time: 31.8 s
%%time
clf = LGBMClassifier(random_state=42,**lgb_params,verbosity=-1)
lgb_cv = cv_scores_dict(clf,lgb_thresh)
CPU times: user 34 s, sys: 6.52 s, total: 40.5 s Wall time: 30.7 s
%%time
clf = LGBMClassifier(random_state=42,scale_pos_weight=scale_pos_weight,**lgb_params,verbosity=-1)
lgb_bal_cv = cv_scores_dict(clf,lgb_bal_thresh)
CPU times: user 33.4 s, sys: 6.36 s, total: 39.8 s Wall time: 30.7 s
# Define table styles
styles = [dict(selector="caption",
props=[("text-align", "center"),
("font-size", "100%"),
("color", 'black'),
("text-decoration","underline"),
("font-weight","bold")])]
cv_scores = pd.concat([pd.DataFrame(lr_cv,index=['Optimal LogisticRegression model']),
pd.DataFrame(lgb_cv,index=['Optimal LGBMClassifier model']),
pd.DataFrame(lgb_bal_cv,index=['Optimal LGBMClassifier model, balanced'])],axis=0)\
.rename(columns = {'roc_auc':'ROC AUC score','fb': 'F_beta score (beta=3)',
'f1w':'weighted average F_1 score','acc':'accuracy score'})
cv_scores.style.set_caption('Comparison of model CV scores (n_splits=5, n_repeats = 3)')\
.set_table_styles(styles)
| model | ROC AUC score | F_beta score (beta=3) | weighted average F_1 score | accuracy score |
|---|---|---|---|---|
| Optimal LogisticRegression model | 0.737163 | 0.500303 | 0.650726 | 0.549175 |
| Optimal LGBMClassifier model | 0.746698 | 0.502195 | 0.688262 | 0.592870 |
| Optimal LGBMClassifier model, balanced | 0.737929 | 0.500185 | 0.660113 | 0.559854 |
The gradient-boosted tree models perform better than the logistic regression model, and the version without class weighting performs noticeably better than the balanced one.

I select the LGBM model without class weighting as the BikeSaferPA model. However, in the next section I also evaluate the logistic regression model on the test set, for illustrative purposes.
I compute the AUC score for each of the best classifiers on the holdout test set, and display the confusion matrix and classification report that I obtain with the chosen thresholds.
%%time
def test_scores_dict(clf,thresh):
    """
    Builds a pipeline ending with the input classifier clf
    and scores it on the test set using a variety of metrics

    Parameters:
    -----------
    clf : sklearn compatible binary classifier
    thresh : float
        the classification threshold to use when
        computing y_pred from y_pred_proba

    Returns:
    --------
    (study, scores_dict) : tuple
        study : ClassifierStudy instance
            with pipeline built and fitted and predictions made
        scores_dict : dict
            keys are scoring metric keywords and values are
            corresponding model scores on the test set
    """
    study = ClassifierStudy(clf,X_train,y_train,feat_dict['lr_small'])
    study.build_pipeline()
    study.fit_pipeline()
    study.predict_proba_pipeline(X_test)
    scores_dict = {}
    for scoring in ['roc_auc','fb','f1w','acc']:
        scores_dict[scoring] = [study.score_pipeline(y_test,scoring=scoring,beta=3,
                                                     thresh=thresh,print_score=False)]
    return (study,scores_dict)
clf = LogisticRegression(max_iter=1000,solver='newton-cholesky',
class_weight='balanced',**lr_params)
study_lr,lr_test = test_scores_dict(clf,lr_thresh)
clf = LGBMClassifier(random_state=42,**lgb_params,verbosity=-1)
study_lgb,lgb_test = test_scores_dict(clf,lgb_thresh)
test_scores = pd.concat([pd.DataFrame(lr_test,index=['Optimal LogisticRegression model']),
pd.DataFrame(lgb_test,index=['Optimal LGBMClassifier model'])],axis=0)\
.rename(columns = {'roc_auc':'ROC AUC score','fb': 'F_beta score (beta=3)',
'f1w':'weighted average F_1 score','acc':'accuracy score'})
test_scores.style.set_caption('Comparison of model scores on holdout test set')\
.set_table_styles(styles)
CPU times: user 2.56 s, sys: 2.51 s, total: 5.06 s Wall time: 1.24 s
| model | ROC AUC score | F_beta score (beta=3) | weighted average F_1 score | accuracy score |
|---|---|---|---|---|
| Optimal LogisticRegression model | 0.713936 | 0.475655 | 0.635683 | 0.531523 |
| Optimal LGBMClassifier model | 0.718382 | 0.483844 | 0.682742 | 0.585829 |
fig, axs = plt.subplots(1,2,figsize=(8,4))
RocCurveDisplay.from_predictions(y_test,study_lr.y_predict_proba,
name='LR',ax=axs[0])
axs[0].set_title('ROC Curve for LR model',fontsize='medium')
RocCurveDisplay.from_predictions(y_test,study_lgb.y_predict_proba,
name='BikeSaferPA',ax=axs[1])
axs[1].set_title('ROC Curve for BikeSaferPA',fontsize='medium')
for ax in axs:
    ax.tick_params(axis='x', labelsize='x-small')
    ax.tick_params(axis='y', labelsize='x-small')
    ax.set_ylabel('True positive rate',fontsize='small')
    ax.set_xlabel('False positive rate',fontsize='small')
plt.tight_layout()
plt.show()
A classifier with an AUC score of 0.71-0.72 is generally considered "acceptable".
fig, axs = plt.subplots(1,2,figsize=(6,3))
ConfusionMatrixDisplay.from_predictions(y_test,(study_lr.y_predict_proba >= lr_thresh).astype(int),
ax=axs[0],colorbar=False)
axs[0].set_title('Confusion Matrix for LR model',fontsize='small')
ConfusionMatrixDisplay.from_predictions(y_test,(study_lgb.y_predict_proba >= lgb_thresh).astype(int),
ax=axs[1],colorbar=False)
axs[1].set_title('Confusion Matrix for BikeSaferPA',fontsize='small')
for ax in axs:
    ax.tick_params(axis='x', labelsize='x-small')
    ax.tick_params(axis='y', labelsize='x-small')
    ax.set_ylabel('True label',fontsize='small')
    ax.set_xlabel('Predicted label',fontsize='small')
plt.tight_layout()
plt.show()
Finally, the classification reports for each model:
study_lr.score_pipeline(y_test,scoring='classif_report',thresh=lr_thresh)
precision recall f1-score support
neither seriously injured nor killed 0.97 0.51 0.67 4988
seriously injured or killed 0.11 0.76 0.19 389
accuracy 0.53 5377
macro avg 0.54 0.64 0.43 5377
weighted avg 0.90 0.53 0.64 5377
study_lgb.score_pipeline(y_test,scoring='classif_report',thresh=lgb_thresh)
precision recall f1-score support
neither seriously injured nor killed 0.97 0.56 0.71 4988
seriously injured or killed 0.12 0.75 0.20 389
accuracy 0.58 5377
macro avg 0.54 0.66 0.46 5377
weighted avg 0.90 0.58 0.67 5377
Machine learning models can be difficult to interpret, and one excellent candidate for an interpretation method comes in the form of SHAP (SHapley Additive exPlanation) values. These values attribute to each feature and each sample the change in expected model prediction when conditioning on that feature, i.e. when adding that feature to the model. Note that when features are not independent or when the model is non-linear, the order in which features are added has an impact on the changes in expectation they generate. The SHAP value of a feature is really an average of its impact on the expected prediction, where the average is taken over all possible orders in which features can be added.
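Outside the custom shap_values and shap_plot wrappers, the computation typically looks like the sketch below, shown here with a quick LGBM model fit on the numeric features only so that it runs standalone; the real analysis uses the full encoded feature matrix.

```python
# Sketch of computing SHAP values directly with the shap library (toy LGBM model
# on numeric features only; the ClassifierStudy wrappers do the full version).
import shap
from lightgbm import LGBMClassifier

X_num = X_train[feat_dict['all']['num']]
toy_model = LGBMClassifier(random_state=42, verbosity=-1).fit(X_num, y_train)

explainer = shap.TreeExplainer(toy_model)
shap_values = explainer(X_num)  # a shap.Explanation; for a binary LGBM model this is
                                # typically 2-D (log-odds contributions), though some
                                # shap versions may require selecting a class
shap.plots.beeswarm(shap_values, max_display=10)
```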
I have built methods into ClassifierStudy to calculate and plot SHAP values. I'll compute and examine SHAP values for BikeSaferPA, the best LGBMClassifier model. Since spline feature importances would be difficult to interpret, I'll use a version of the model that doesn't use DAY_OF_WEEK or HOUR_OF_DAY, by setting cyc_method=None in the build_pipeline method.
Warning: due to use of deprecated types, e.g. 'np.bool', the latest version of the shap library is not compatible with newest versions of the numpy library. You may need to roll back to numpy version 1.23.0 for the following section to run properly.
!pip install numpy==1.23.0
Requirement already satisfied: numpy==1.23.0 in /Users/eamonn/mambaforge/envs/ds/lib/python3.10/site-packages (1.23.0)
Some remarks about the plots below:
Note that we use cyc_method=None, i.e. we leave out the features DAY_OF_WEEK and HOUR_OF_DAY entirely from this analysis. It would be difficult to interpret the importance of various periodic spline features associated to these.
%%time
clf = LGBMClassifier(random_state=42,**lgb_params,verbosity=-1)
study_lgb = ClassifierStudy(clf,X_train,y_train,features=feat_dict['lr_small'])
study_lgb.build_pipeline(cyc_method=None,num_ss=False)
study_lgb.shap_values(X_test)
study_lgb.shap_plot(max_display=20)
CPU times: user 8.55 s, sys: 171 ms, total: 8.73 s Wall time: 7.38 s
Based on the scatter plots below, I can conclude the following regarding BikeSaferPA's predicted probability that a cyclist suffers serious injury or fatality:
fig, axs = plt.subplots(1,3,figsize=(12,3))
for i,feat in enumerate(['AGE','SPEED_LIMIT','CRASH_YEAR']):
    shap.plots.scatter(study_lgb.shap_vals[:,feat],show=False,ax=axs[i],dot_size=4)
    axs[i].set_xlabel(feat, fontsize='x-small')
    axs[i].set_ylabel(f'SHAP value for {feat}',fontsize='x-small')
plt.tight_layout()
plt.show()
For all of the ordinal features of the form (vehicle type)_COUNT, the presence of such a vehicle pushes BikeSaferPA's predicted probability upwards, and its absence pushes it downward. Additionally, in all cases there is much more variance in the push strength when the vehicle type is present, which indicates interactions with other features in the dataset in those situations.
When the following factors are present, their features push the model towards predicting serious injury or fatality:
When these factors are present, their features push the model away from predicting serious injury or fatality:
Some features have very high variances in their SHAP values when that factor is present, e.g. HILL, CURVED_ROAD, ALCOHOL_RELATED, SPEEDING_RELATED, COMM_VEH_COUNT, indicating interactions.
For one-hot features arising from categorical features:
fig, axs = plt.subplots(1,2,figsize=(12,4))
for i,feat in enumerate(['AGE','SPEED_LIMIT']):
    shap.plots.scatter(study_lgb.shap_vals[:,feat],color=study_lgb.shap_vals[:,'RESTRAINT_HELMET_bicycle_helmet'],show=False,ax=axs[i],dot_size=4,alpha=0.5)
    axs[i].set_xlabel(feat, fontsize='x-small')
    axs[i].set_ylabel(f'SHAP value for {feat}',fontsize='x-small')
plt.tight_layout()
plt.show()
I developed two final candidates for BikeSaferPA, a classifier which predicts whether a cyclist suffered serious injury or fatality: a logistic regression model and gradient-boosted decision tree models. Model selection relied on the ROC-AUC score, and prediction thresholds were then chosen by analyzing $F_{\beta}$ scores.
I first selected features based on their log-odds coefficient values in a fitted LogisticRegression model (with purely L1 regularization, to promote sparsity of the coefficients). Specifically, I omitted some features which had zero or very small coefficients, suggesting they're not very important to the model. It's crucial here that I applied standard scaling to the numerical features ahead of time: if features are at vastly different scales, their log-odds coefficients can provide misleading information. This reduced set of features was used to construct both the logistic regression and gradient-boosted tree models.
I encoded features in several ways:
This encoding was automated via a custom model pipeline.
I examined several logistic regression and gradient boosted decision tree models with a wide variety of hyperparameter settings. After tuning hyperparameters to optimize ROC-AUC score via randomized search with five-fold cross validation, the two best models were:
The ROC-AUC score is computed based on how the model's predicted probabilities affect the true positive and false positive rates at all possible prediction thresholds, so optimizing the AUC doesn't on its own provide a choice for the best prediction threshold. I selected prediction thresholds for both models which optimized the $F_3$ score, a variant of the classical $F_1$ score which considers recall (of the positive class) as three times as important as precision.
Based on its performance with respect to ROC-AUC score, I selected the gradient boosted tree algorithm as the BikeSaferPA model!
When trained on the entire training set and scored on the holdout test set, BikeSaferPA attains an ROC-AUC score of around 0.72. Using the classification threshold values I selected in the parameter tuning phase, the model correctly classifies 75% of cyclists in the test set who suffered serious injury or fatality, and correctly classifies 56% of those who didn't. I set the threshold to optimize the $F_3$ score, a variant of the $F_1$ score which views recall as three times as important as precision; adjusting it further could produce variants which are better at identifying cyclists at risk of serious injury or death, but they would also accumulate more false positives. The end-user of BikeSaferPA should adjust its classification threshold to accommodate the needs of their particular use case.
Despite the large number of features, there are some challenges which limit performance:
Although this is a rich set of features, whether a cyclist suffered serious injury or fatality may depend heavily on factors of the collision and surrounds which are not expressed in the dataset, such as:
Many factors of a crash - such as inclement weather, darkness of the surroundings, visibility issues, roadway condition problems, impairment of a driver, vehicle speeding or other aggressive behavior, etc. - can vary significantly by degrees. Certainly the degrees matter in many cases, but they aren't expressed in the data.
The dataset has a TRAVEL_SPD column for vehicles, but I decided not to use it because so much data is missing:
I decided to use the combination of SPEEDING_RELATED and SPEED_LIMIT as an imperfect proxy for the actual travel speeds of vehicles. Both of these are important features in the models (as the SHAP analysis shows), but I believe the model could learn more from the actual travel speeds of all vehicles.
There were 536 multi-cyclist collisions containing at least one cyclist who suffered serious injury or fatality and at least one cyclist who didn't. Moreover, there were 373 multi-passenger bicycles/pedalcycles whose passengers didn't all have the same severity outcome. This leads to tuples of samples which have nearly duplicated input features but differing target values. Such samples can make it more difficult for a model to train effectively because they essentially add noise to the data. There are several ways of addressing this issue which I did not pursue.
I computed SHAP (SHapley Additive exPlanation) values on the test set. SHAP values are very reliable metrics for determining the importance of features to the model's predictions and help to explain the decisions made by models such as BikeSaferPA.
Note: the following statements should not be interpreted as implying causality; rather, they are statements about how conditioning on certain factors affects the expected prediction of BikeSaferPA!
Based on SHAP values, I can conclude the following about BikeSaferPA's predicted probability that a cyclist suffered serious injury or fatality:
Based on the results of BikeSaferPA's SHAP values, I recommend the following actions to be taken in an effort to reduce the incidence of serious cyclist injury and cyclist fatality (as well as cyclist crashes in general) in Pennsylvania: