Philadelphia Collision Investigation

Introduction

Vehicular accidents are a leading cause of injuries in the US. In 2018 motor-vehicle crashes accoundted for 39,404 deaths, 4.5 million medically consulted injuries and resulted in injury costs of $445.6 billion dollars. NTSC Injury Facts

According to the "Vision Zero" Plan put together by the NSC, and important aspect of reducing the severity of accidents is to design roadways in a way to prevent them.

An important aspect of the procedure to reduce the number of accidents is to identify not only the probelmatic areas, but also the additional factors that drive the severity of those accidents.

Identifying those problem areas and factors is an invaluable tool for both city planners and government officials. It also can prove to be a tool for insurance groups to push local governments into action. Nieghborhood groups can substanciate their claims for improvement in their local area. Finally Police and Emergency Response departments can better plan and hopefully reduce the amount of man hours and divert more resources to other causes when collisions are reduced.

Business Problem

The city of Philadelphia has been implementing the Vision Zero plant discussed above to improve upon the safety of its inhabitants. They are using data to make smart decisions for city planning and enformcement.

So the problem to be addressed in this report is can we identify the severity of collisions based on design and situational parameters to help the city of Philadelphia better design roads and traffic to reduce the severity of the vehicular collisions.

Data Understanding and Preparation

Data Sources

The dataset for this analysis is a the dataset used for Philadelphia's Vision Zero project. The data is derived form PennDOT's Crash Incident Dataset filtered to 2013-2017. PennDOT's full dataset can be accessed at their API, it does not contain data past 2017. The actual dataset for the city of Philadelphia can be found at the OpenDataPhilly portal. This project uses both the geojson and csv versions of the data.

Additionally for visualization and some spatial investigation this project makes use of the zipcodes boundaries obtained at OpenDataPhilly. We will also be using the metadata from the same website to idenfity some of the more useful features.

Data Preparation

The target value for identification and classification is the max_severity_level. This is explained by the chart below:

MAX_SEVERITY_LEVEL	Maximum injury severity level of the crash
0	Not injured
1	Killed
2	Major injury
3	Moderate injury
4	Minor injury
8	Injury/ Unknown Severity
9	Unknown

The data contains 95 features, which encompass temporal, spatial, collision results and other descriptions (intersection type, illumination, etc).

In order to determine what features/ columns to a count of the unique values per feature is performed. Features with low counts are also presented. From this approach it is clear which columns can be left out of the investigation (for example county and district).

Important features that provide varied information where identified by looking at unique counts of values. Combining the Vision Zero dataset with the complete datasets metadata also allowed to remove some unexplained features.

The features identified as additional to the Vision Zero are mostly descriptive of the effects of the collision. For example bicycle_death_count, vehicle_count. These were dropped from the data set investigated.

array(['belted_death_count', 'belted_maj_inj_count',
       'bicycle_death_count', 'bicycle_maj_inj_count',
       'mcycle_death_count', 'mcycle_maj_inj_count', 'sch_zone_ind',
       'unb_death_count', 'unb_maj_inj_count', 'unk_inj_deg_count',
       'unk_inj_per_count', 'vehicle_count'], dtype=object)

Columns dropped looking for not corresponding to metadata documentation

Furthermore as we are only looking at the situational conditions leading up to the collision, features related to the outcome were dropped, these mostly related to *_counts features.

Finally the incident feature as it appears to have inconsistent values for 2017.

crash_year	0	1
2013	2657	3991
2014	2624	3784
2015	2833	3898
2016	2898	4286
2017	6470	nan

The dataset contains records with Unknown max_severity_levels, these records were dropped as we are trying to investigate parameters that can lead to a reduction in severity levels.

A count of the dataset target variables leads us to identify that this is an imbalanced dataset.

Additionally the geolocated version of the data was spatially joined to the zipcode boundaries to determine the zipcode the accidents occured within to assist with vizualization.

Feature	Description
crash_month	Month for the collision
crash_year	Year for the collision
crn	Record Identifier (used as an index)
day_of_week	Weekday for the collision
hour_of_day	Hour for the collision
illumination	Illumination Descriptor
intersect_type	Intersection Descriptor
tcd_type	Traffic Control Device Descriptor
road_condition	Condition of road at time of collision
CODE	zipcode
geometry	Coordinates for the Collision

Additional data cleanup was done on data after the data exploration of it and will be identified at that point.

Data Exploration

The first step was to identify the seasonal components for the data to identify if the Vision Zero approach has had any effectiveness on collision reduction.

From these plots we determine the following:

There has not been an reduction on the total number of collisions a year, an increase occured in 2016
Summer months seem to increase the number of collisions
Weekends see an increase in collisions
Most collisions seem to occur during commute times to and from work 6-9 and 15-18.

Does this information look different for fatalities?

For the fatality plot there are some differences between years

There is an increase in the fall months, perhaps due to it being dark out.
More of them occur on weekday 1 (Sunday)
Late night/ early morning see most of the fatal collisions (could explain the increase on Sunday, early morning).

In order to have a assessment of the performance we also look at the percentages of each type of collisions in relation to the overall.

The overall trend over years and severity level seems to stay consistent.

We also take a look at illumination and its effects on collision.

	illumination
Daylight	20501
Dark - street lights	11346
Dusk	687
Dark - no street lights	397
Dawn	311
Dark-unknown roadway lighting	148
Other	51

In terms of percentages, there is a slight increase on collisions when the road, has illumination of Other and Dark-unknown roadway lighting. However the collision count for those is low, so those records where dropped as they provide incomplete information.

If we look at fatalities it matches what was seen previously. Most fatalities occur late at night

Illumination	Collisions
Dark - street lights	53%
Daylight	41%
Dark - no street lights	3%
Dusk	2%
Dawn	2%
Dark-unknown roadway lighting	0%

We also look at intersection type to help identify where most collisions are occurring.

Intersection Type	Collision Count
Four way intersection	14983
Mid-block	13600
T intersection	3342
Multi-leg intersection	740
Y intersection	557
On ramp	82
Traffic circle or Round About	47
Other	44
Off ramp	40
Railroad crossing	5

As common sense most collisions happen at Four way Intersections a significant amount also happen at Mid-block.

Since intersections contain traffic devices we also look at the tcd type vs the interect_type.

intersect_type	Not applicable	Flashing traffic signal	Traffic signal	Stop sign	Yield sign	Active RR crossing controls	Passive RR crossing controls	Police officer or flagman	Other Type TCD	Unknown
Mid-block	12773	6	691	72	8	0	0	5	44	1
Four way intersection	752	74	10605	3512	26	0	0	3	5	6
T intersection	1094	11	1184	1042	6	0	0	1	4	0
Y intersection	185	1	214	141	13	0	0	0	2	1
Traffic circle or Round About	9	0	32	2	4	0	0	0	0	0
Multi-leg intersection	13	3	686	38	0	0	0	0	0	0
On ramp	40	0	2	7	33	0	0	0	0	0
Off ramp	34	0	2	2	1	0	0	1	0	0
Railroad crossing	0	0	0	0	0	3	2	0	0	0
Other	29	0	2	12	1	0	0	0	0	0

As expected Midblock collisions do not have one, however it appears that most of the additional incidents occur at fourway intersections with a traffic signal. Personal experience leads me to believe that these might be related individuals running lights.

For road conditions the results are as follows:

Road Condition	Collision Count
Dry	26821
Wet	5665
Ice	237
Snow covered	224
Other	164
Ice Patches	159
Slush	132
Sand/ mud/ dirt/ oil/ or gravel	21
Water	18

A significant number of collisions occur under wet conditions, considering we would expect roads to remain in its majority dry.

However it does appear that both Water (low count of records) and slush have an effect on if the road collision results on a major injury or a fatality.

Finally taking a look at what can be gathered from spatial data. Analyzing the number of incidents we can determine which zipcodes contain the largest amount of collisions.

It is interesting to note that the zipcodes with the highest number of collisions are not downtown as I would have expected. The top 2 contain parts of Roosevelt Highway, the 3 one is located in between college campuses and the last 2 seem to have large avenues.

The zipcodes are quite general areas, if we generate a hexbin we can clearly see that the majority of the incidents occur on Roosevelt Blvd. With an additional concentration downtown.

If we just look at the individual locations that have the most collisions we can identify that the top 10 locations are mostly around Roosevelt Blvd but there are also other locations with at the downtown area of Race and Vine along with Alegheny and Broad

We can concentrate on the following three areas if we want to reduce the number of collisions.

Looking at only fatalities and major injuries once again the major accidents occur along Roosevelt Ave.

Modeling

Dataset Preparation

Now that we have a clear understanding of the data, we the crash year feature from our data, as we determined that it did not provide significant information. We also dropped the geographical information CODE and geometry as it is believed that the information contained in them can be surmised from other information in the data set.

Our data ends up as the following:

crash_month
False    31939
Name: crash_month, dtype: int64

day_of_week
False    31939
Name: day_of_week, dtype: int64

hour_of_day
False    31939
Name: hour_of_day, dtype: int64

illumination
False    31939
Name: illumination, dtype: int64

intersect_type
False    31939
Name: intersect_type, dtype: int64

max_severity_level
False    31939
Name: max_severity_level, dtype: int64

tcd_type
False    31939
Name: tcd_type, dtype: int64

road_condition
False    31939
Name: road_condition, dtype: int64

Since our all our data is categorical and we do not want to introduce any additional information from label encoding since there is not numerical value between tcd_type features. We perform one hot encoding across the data set.

This results in a dataset of 33,441 records and 78 features.

We perform a 70-30 train test split, and since we are dealing with an imbalanced dataset we perform SMOTE to generate synthetic records to balance the record. However since the first choice of classifying algorithms is a Decision Tree Classifier, the imbalance can have low impact.

4    12544
3    12544
2    12544
1    12544
0    12544
Name: max_severity_level, dtype: int64

SMOTE generated dataset

Model Creation

Several models where attempted and optimized for the multi-class classification problem.

Decision Trees
Random Forest Classifier
KNN Classifier
XGBOOST

Since this is a multi-class classification problem, the measurement of the classifiers isn't as straight forward as for binary classification problems.

This required looking at both the f1-scores, precision, recall and confusion matrices for each one of the attempted results. These models used the SMOTE sampled data as there was a slight increase on performance on the training dataset.

Decision Tree

From the training results we had the following metrics.

	Not injured	Minor injury	Killed	Major injury	Moderate injury	accuracy	macro avg	weighted avg
precision	0.4	0.57	0.28	0.04	0.22	0.5	0.3	0.45
recall	0.27	0.78	0.02	0.11	0.02	0.5	0.24	0.5
f1-score	0.32	0.66	0.04	0.06	0.04	0.5	0.22	0.45
support	6425	12544	261	709	3469	0.5	23408	23408

Confusion Matrix Results

	Not injured	Minor injury	Killed	Major injury	Moderate injury
Not injured	0.27	0.64	0	0.07	0.01
Minor injury	0.14	0.78	0	0.07	0.01
Killed	0.36	0.48	0.02	0.13	0
Major injury	0.24	0.63	0	0.11	0.02
Moderate injury	0.17	0.72	0	0.09	0.02

As we can observer from the results, a decision tree classifier doesn't perform well on this data. It is doing a good job in identifying Minor Injuries, but it is not identifying any other category properly.

Random Forest Classifier

From the training results the following are the results

	Not injured	Minor injury	Killed	Major injury	Moderate injury	accuracy	macro avg	weighted avg
precision	0.75	0.76	0.64	0.65	0.68	0.75	0.7	0.74
recall	0.67	0.88	0.26	0.38	0.5	0.75	0.54	0.75
f1-score	0.71	0.82	0.37	0.48	0.58	0.75	0.59	0.74
support	6425	12544	261	709	3469	0.75	23408	23408

Confusion Matrix Results

	Not injured	Minor injury	Killed	Major injury	Moderate injury
Not injured	0.67	0.28	0	0.01	0.04
Minor injury	0.07	0.88	0	0.01	0.04
Killed	0.24	0.41	0.26	0.01	0.08
Major injury	0.13	0.41	0	0.38	0.07
Moderate injury	0.11	0.38	0	0.01	0.5

This classifier performs significantly better in terms of generating insight into the data. However it presents a problem as it is misclassifying injuries as not injured at , specifically 25 percent of killed result as not injured.

KNN Classifier

From the training results we had the following metrics.

	Not injured	Minor injury	Killed	Major injury	Moderate injury	accuracy	macro avg	weighted avg
precision	0.51	0.7	0.4	0.46	0.42	0.61	0.5	0.6
recall	0.59	0.74	0.08	0.11	0.31	0.61	0.37	0.61
f1-score	0.55	0.72	0.14	0.18	0.36	0.61	0.39	0.59
support	6425	12544	261	709	3469	0.61	23408	23408

Confusion Matrix Results

	Not injured	Minor injury	Killed	Major injury	Moderate injury
Not injured	0.59	0.34	0	0	0.06
Minor injury	0.18	0.74	0	0	0.07
Killed	0.39	0.4	0.08	0.01	0.11
Major injury	0.31	0.47	0.01	0.11	0.11
Moderate injury	0.29	0.4	0	0	0.31

KNN performs substantially worse than our Random Forest Classifier on the training data.

XGBOOST

Using XGBOOST with a one vs rest classifier did not improve performance, and the recall by individual classes was not good.

	Not injured	Minor injury	Killed	Major injury	Moderate injury	accuracy	macro avg	weighted avg
precision	0.49	0.57	0	0.33	0.67	0.56	0.41	0.55
recall	0.25	0.91	0	0	0.01	0.56	0.23	0.56
f1-score	0.33	0.7	0	0	0.02	0.56	0.21	0.47
support	6425	12544	261	709	3469	0.56	23408	23408

Confusion Matrix Results

	Not injured	Minor injury	Moderate injury
Not injured	0.25	0.75	0
Minor injury	0.09	0.91	0
Killed	0.23	0.77	0
Major injury	0.15	0.84	0
Moderate injury	0.11	0.88	0.01

Evaluation/Results

From the results on the training data, the selected model was the one based on Random Forest Classifier. It obtained the highest recall foreach individual class which would mean it is better at identifying all results in that class. The accuracy of .75 is also not bad.

Comparing against the test data, the results do not inspire much confidence.

	Not injured	Minor injury	Killed	Major injury	Moderate injury	accuracy	macro avg	weighted avg
precision	0.34	0.56	0	0.01	0.15	0.47	0.21	0.41
recall	0.28	0.72	0	0	0.07	0.47	0.22	0.47
f1-score	0.31	0.63	0	0	0.1	0.47	0.21	0.43
support	2720	5357	130	325	1501	0.47	10033	10033

Confusion Matrix Results

	Not injured	Minor injury	Killed	Major injury	Moderate injury
Not injured	0.28	0.63	0	0.01	0.07
Minor injury	0.2	0.72	0	0.01	0.07
Killed	0.29	0.59	0	0.02	0.1
Major injury	0.23	0.69	0.01	0	0.07
Moderate injury	0.21	0.7	0	0.01	0.07

The results lead us to believe that the classifier is overfitting the test data, even using Cross Validation. This leads us to believe that the data does not have enough information to differentiate between Killed, Major, Moderate Injuries, vs Minor Injuries vs Not Injured. Additional features would need to be taken into account.

The figure below presents a geographical location of the collisions and if they succeded or not in identifying the proper collision severity (x- represent missclasifications).

However some information can be gathered from the model feature importances

Feature	Importance
hour_of_day	0.31
crash_month	0.28
day_of_week	0.2
intersect_type	0.07
tcd_type	0.05
road_condition	0.05
illumination	0.04

From the weights assigned, it is clear the model is prioritizing the identification of the classes by temporal qualities mostly time of day. Which makes sense based on the exploratory analysis which showd that fatal car crashes occur later at night while most accidents occur at peak commute hours.

Conclusion

In this project we had two goals, identify contributing factors that can cause different severity levels in collisions and build a model that will determine the severity of the accident.

While the models performed relatively well on the training set, it did not perform well on a test dataset leading to the belief that the features selected do not provided enough information to classify into multiple classes.

The model however could be further paired down into less categories or a binary injury vs no injury problem which could lead to better performance and help plan city flow better.

The exploratory data analysis allowed us to obtain various insights.

Slush can increase the severity of a collision and such campaigns addressing those issues could have a positive impact on fatality reductions
Campaigns dealing with improved driver safety late at night could decrease fatalities
Specific high frequency areas where identified and can be targeted for more indepth analysis.

Further Work

Further work to be done would be related to identifying better features for classification and pairing down the target variables into a binary classification project for serious/non serious collisions.

The geographic information should also be included into the data either by performing classification of the geography, using identifiable bins or other methods. This could lead to a better model and help identify locations that are dangerous.

Philadelphia Collision Investigation

Fernando Rodriguez

Fernando Rodriguez

Introduction

Business Problem

Data Understanding and Preparation

Data Sources

Data Preparation

Data Exploration

Modeling

Dataset Preparation

Model Creation

Decision Tree

Random Forest Classifier

KNN Classifier

XGBOOST

Evaluation/Results

Conclusion

Further Work

Automatic Markdown Post Generation from ESPN Fantasy Football API

Simple QT Application for data entry to MongoDB using Python and mongoengine