Philly PD Incident Data Exploration with GeoPandas
With the police reform at the forefront of the conversation for most Americans, I thought it would be interesting to look at what information regarding policing is freely available for my current city, Philadelphia.
Enter OpenDataPhilly. A great repository for information regarding Philadelphia.
The most informative datasets I found regarding policing in the city are the following:
The first dataset provides a listing of complaints against police officers linking them to their police district
The second dataset contains information and geographical location of crime incidents.
Finally the last provides the boundaries of each police district, allowing us to do some data aggregation over the district.
A great tool to do spatial analysis is Geopandas. It behaves similar to a pandas database but with some additional spatial features.
The first step is to load our different datasets into a geopandas df.
import geopandas
import pandas as pd
complaints=pd.read_csv('ppd_complaints.csv')
po_districts=geopandas.read_file("Boundaries_District.geojson")
disciplines=pd.read_csv('ppd_complaint_disciplines.csv')
inc=geopandas.read_file("../incidents_part1_part2.geojson")
To check our dataset we can plot it.
po_districts.plot()
In order to simplify the analysis we do some cleanup of the data
complaint_date=complaints.reset_index().set_index('complaint_id')['date_received'].to_dict()
disciplines['date_recieved']=disciplines.complaint_id.map(complaint_date)
disciplines['district_num']=pd.to_numeric(disciplines.po_assigned_unit.dropna().apply(lambda x: x.split(' ')[0][:2]),errors='coerce')
Now there are a couple of different aspects we would want to check for the data. The first would be to determine or get a metric of how may complaints each officer has. I would assume that pretty much every officer has some. But how many is too many? Are there any particular officers or precincts that have a significantly larger amount of than their peers.
We can look at the basic stats of the complaints dataset.
complaintsbyofficer=disciplines.groupby('officer_id').complaint_id.count().sort_values(ascending=False)
complaintsbyofficer.describe().to_frame().style.format("{:.2f}")
complaint_id | |
---|---|
count | 3032.00 |
mean | 2.66 |
std | 2.59 |
min | 1.00 |
25% | 1.00 |
50% | 2.00 |
75% | 3.00 |
max | 32.00 |
However this doesn't provide much information regarding the shape of the dataset and its outliers. Although one officer receiving 32 complaints seems high
One great way to get a quick overview of the outliers we can use a box plot. This is easy to do with seaborn.
import seaborne as sns
unit_officer=disciplines.groupby(['district_num','officer_id']).complaint_id.count().reset_index()
fig, ax = plt.subplots(1, 1,figsize=(35,10))
g=sns.boxplot(x="district_num", y="complaint_id",data=unit_officer[unit_officer['district_num'].isin(po_districts.DISTRICT_.unique())], ax=ax)
plt.xticks(rotation=90);
ax.tick_params(labelsize=24);
fig.suptitle('District Complaints/ Officer', fontsize=32);
ax.set_xlabel('District Number', fontsize=24);
ax.set_ylabel('Number of Complaints', fontsize=24);
From the chart it seems that district 14 and district 35 have a significantly amount of officers that receive more complaints than their peers.
While we are at it we can pull out the outliers per category using the following.
outlier_officers=[]
for c, g in unit_officer.reset_index().groupby('district_num'):
stats=boxplot_stats(g.complaint_id)
outliers = [y for stat in boxplot_stats(g['complaint_id']) for y in stat['fliers']]
outlier_officers=np.append(outlier_officers,list(g[g['complaint_id'].isin(outliers)].officer_id.values))
Now that we have the outliers is there an easily identifiable patter for the officers, for example race.
outlier_race=disciplines[disciplines['officer_id'].isin(outlier_officers)].groupby('officer_id')['po_race'].unique()
fig, ax = plt.subplots(1, 1,figsize=(20,10))
outlier_race.apply(lambda x:x[0]).value_counts().plot(kind='bar', ax=ax);
ax.tick_params(labelsize=24)
From the histogram it appears that a majority of the incidents come from police officers identified as white. However this could also just be because the majority of police officers are white.
fig, ax = plt.subplots(1, 1,figsize=(20,10))
disciplines.groupby('officer_id').po_race.unique().apply(lambda x:x[0]).value_counts().plot(kind='bar');
ax.tick_params(labelsize=24)
From that the histogram of the overall police officers receiving complaints we see that the outlier population seems to closely mirror the overall population so the race of the officer might not have a significant impact in whether or nor an officer receives a large amount of complaints.
Continuing with the racial information contained by the complaints we can check to see if that has an effect on the findings related to the complaints.
colors_tc=cm.Spectral(np.linspace(0, 1,len(disciplines.investigative_findings.unique()) ) )
fig, ax = plt.subplots(1, 1,figsize=(10,10))
disciplines.investigative_findings.value_counts().plot(kind='pie', ax=ax,textprops={'fontsize': 14},colors=colors_tc,wedgeprops={"edgecolor":"k",'linewidth': 1, 'antialiased': True})
Discarding the Pending investigations it looks like overall 20-25% of complaints have resulted in sustained findings.
race_results=disciplines.groupby('po_race').investigative_findings.value_counts().to_frame()
race_results.columns=['counts']
race_results=race_results.reset_index()
r_f=race_results[~race_results.investigative_findings.isin(['Pending'])]
cdict={}
for i,c in zip(disciplines.investigative_findings.unique(), colors_tc):
cdict[i]=c
def plt_pie(x,y,**kwargs):
cd=[cdict[l] for l in x]
plt.pie(y,labels=x, colors=cd,wedgeprops={"edgecolor":"k",'linewidth': 1, 'antialiased': True})
g=sns.FacetGrid(r_f,col="po_race", col_wrap=2, sharey=False,height=6)
g=(g.map(plt_pie, "investigative_findings", "counts",cdict=cdict))
pd.pivot_table(r_f,values='counts',index=['po_race'],columns=['investigative_findings']).apply(lambda x: x / x.sum(), axis=1).fillna(0).style.format("{:.2%}")
investigative_findings | No Sustained Findings | Not Applicable | Sustained Finding |
---|---|---|---|
po_race | |||
asian | 79.29% | 4.14% | 16.57% |
black | 76.20% | 3.07% | 20.73% |
indian | 80.00% | 20.00% | 0.00% |
latino | 74.45% | 2.35% | 23.20% |
other | 80.65% | 0.00% | 19.35% |
white | 77.30% | 2.43% | 20.26% |
It does not appear that race of the officer greatly affects the investigation results, although it does appear that Latino officers then to have more complaints sustained and officers of Asian descent have less.
In order to get a better idea of what this looks like spatially we can plot the values and show the amount of complaints by district.
In order to provide a nice basemap we use contextily and make sure we convert to the right coordinate system.
import contextily as ctx
comp_dict=disciplines.groupby('district_num').complaint_id.count().to_dict()
po_districts['comp_count']=pd.to_numeric(po_districts.DISTRICT_).apply(lambda x: comp_dict[x])
fig, ax = plt.subplots(1, 1,figsize=(30,30))
po_districts.to_crs(epsg=3857).plot(column='comp_count',cmap='YlOrRd',legend=True, ax=ax, alpha=0.5)
po_districts.to_crs(epsg=3857).boundary.plot(ax=ax, color='k')
ctx.add_basemap(ax, source=ctx.providers.Stamen.TonerLite)
for l,name in zip(po_districts.to_crs(epsg=3857).centroid,po_districts.DISTRICT_):
plt.text(l.x,l.y,name,ha='center', va='center',fontsize='32')
ax.set_axis_off()
ax.set_title('POLICE DISTRICT, COMPLAINT COUNTS',fontsize='56');
From this plot we can see that districts 14 and 35 (previously identified as having outlier officers) have a large amount of complaints.
However complaint counts by themselves don't necessarily provide us the full picture. After all there could be districts with a higher number of crime incidents leading to a higher number of complaints.
We can get a count of crime incidents per district relatively easy and use that to get a measurement of complaints per 1000 incidents. We can do so using spatial joins.
inc_in_z=geopandas.sjoin(inc,po_districts,how='inner',op='within')
inc_in_d=inc_in_z.groupby('DISTRICT_').objectid.count()
inc_in_d_dict=inc_in_d.to_dict()
po_districts['incidents']=po_districts['DISTRICT_'].apply(lambda x: inc_in_d_dict[x])
po_districts['ratio']=po_districts.apply(lambda x:(x.comp_count*1000)/x.incidents,axis=1)
ratio_dict=po_districts[['DISTRICT_','ratio']].set_index('DISTRICT_').to_dict()['ratio']
inc_in_d=inc_in_d.loc[lambda x: x.index != 77]
xmax=inc_in_d.max()
xmin=inc_in_d.min()
def scale_into_range(x,a,b,xmin,xmax):
res=(((b-a)*(x-xmin))/(xmax-xmin))+a
return res
inc_in_d_scale=inc_in_d.apply(lambda x: scale_into_range(x,1000,20000,xmin,xmax))
inc_in_d_scale_dict=inc_in_d_scale.to_dict()
po_districts_n=po_districts[po_districts['DISTRICT_']!=77].copy()
po_districts_n['Inc_Sc']=po_districts_n.DISTRICT_.map(inc_in_d_scale)
po_districts_n['comp_count']=pd.to_numeric(po_districts_n.DISTRICT_).apply(lambda x: comp_dict[x])
fig, ax = plt.subplots(1, 1,figsize=(30,30))
po_districts_n.to_crs(epsg=3857).plot(column='comp_count',cmap='Blues', ax=ax, alpha=0.5)
po_districts_n.to_crs(epsg=3857).boundary.plot(ax=ax, color='k')
ctx.add_basemap(ax, source=ctx.providers.Stamen.TonerLite)
plt.scatter(list(po_districts_n.to_crs(epsg=3857).centroid.x),list(po_districts_n.to_crs(epsg=3857).centroid.y),alpha=0.5,s=list(po_districts_n.Inc_Sc),c=list(po_districts_n.ratio),cmap='YlOrRd',edgecolor='k')
for l,name in zip(po_districts_n.to_crs(epsg=3857).centroid,po_districts_n.DISTRICT_):
mlab='District: '+str(name)+'\nIncidents: '+'{:,}'.format(inc_in_d_dict[name])+'\nComplaints: '+str(comp_dict[name])
plt.text(l.x,l.y,mlab,ha='center', va='center',fontsize='12')
axins = inset_axes(ax,
width="5%", # width = 5% of parent_bbox width
height="100%", # height : 50%
loc='lower left',
bbox_to_anchor=(-0.1, 0., 1, 1),
bbox_transform=ax.transAxes,
borderpad=0,
)
sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=po_districts_n.comp_count.min(), vmax=po_districts_n.comp_count.max()))
cbr = fig.colorbar(sm, cax=axins,alpha=0.5)
cbr.ax.tick_params(labelsize=18)
cbr.set_label(label='Complaint Counts',fontsize=18)
axins = inset_axes(ax,
width="5%", # width = 5% of parent_bbox width
height="100%", # height : 50%
loc='lower left',
bbox_to_anchor=(1.1, 0., 1, 1),
bbox_transform=ax.transAxes,
borderpad=0,
)
sm2 = plt.cm.ScalarMappable(cmap='YlOrRd', norm=plt.Normalize(vmin=po_districts_n.ratio.min(), vmax=po_districts_n.ratio.max()))
cbr2 = fig.colorbar(sm2, cax=axins,alpha=0.5)
cbr2.ax.tick_params(labelsize=18)
cbr2.set_label(label='Complaints per 1000 Incidents',fontsize=18)
plt.text(0.8, 0.1, 'Marker Size related to Number of Incidents\n District 77 (Airport) not included',fontsize=26, horizontalalignment='center',verticalalignment='center', transform=ax.transAxes,bbox={'ec':'k','fc':'w'})
ax.set_axis_off()
ax.set_title('POLICE DISTRICT, RATIO OF COMPLAINTS PER 1000 INCIDENTS',fontsize='56')
In this plot the circles correlate to number of incidents, the color of them related to the ratio of complaints to 1000 incidents. The color of the district correlated to number of complaints.
District 35 once again pops out as having a high ratio of complaints.
Perhaps this district has a large number of complaints that are found to not be valid. We can check that for all districts relatively easy.
dn_results=disciplines.groupby('district_num').investigative_findings.value_counts().to_frame()
dn_results.columns=['counts']
dn_results=dn_results.reset_index()
dn_f=dn_results[~dn_results.investigative_findings.isin(['Pending'])]
display(HTML('<h1>INVESTIGATION RESULTS PER DISTRICT</h1>'))
dn_perc=pd.pivot_table(dn_f,values='counts',index=['district_num'],columns=['investigative_findings']).apply(lambda x: x / x.sum(), axis=1)
dn_perc.loc[po_districts.DISTRICT_.unique()].sort_values('Sustained Finding',ascending=False).style.format("{:.2%}")
fig, ax = plt.subplots(1, 1,figsize=(20,10))
dn_perc.loc[po_districts.DISTRICT_.unique()].plot(kind='bar',figsize=(20,10),stacked=True,ax=ax);
ax.set_title('COMPLAINT INVESTIGATION RESULTS PER DISTRICT',fontsize='32');
ax.tick_params(labelsize=18)
Form the resulting bar chart it does not appear that District 35 (or anyother outside of 22) have a distinctly different distribution of complaint findings.
Is there anything unique about the districts with a high ratio of complaints per 1000 incidents? One specific aspect to look at is the racial makeup of each district.
We can use the ACS Census information to obtain that information per census block and join that to the district information.
First we query the API to get the state and county codes.
state_codes=requests.get('https://api.census.gov/data/2010/dec/sf1?get=NAME&for=state:*')
state_dict={}
for x in state_codes.json()[1:]:
state_dict[x[0]]=x[1]
state_code=state_dict['Pennsylvania']
county_codes=requests.get('https://api.census.gov/data/2010/dec/sf1?get=NAME&for=county:*&in=state:'+state_code)
county_dict={}
for x in county_codes.json()[1:]:
county_dict[x[0]]=x[2]
county_code=county_dict['Philadelphia County, Pennsylvania']
After we get both the state code and the county information we can request the population information form the api.
census_race=requests.get('https://api.census.gov/data/2018/acs/acs5?get=group(B02001)&for=block%20group:*&in=state:'+state_code+'%20'+'county:'+county_code)
census_race_df=pd.DataFrame(census_race.json()[1:],columns=census_race.json()[0])
#We do some data cleanup there, eventually creating a key for its block number
census_race_df['bkey']=census_race_df.apply(lambda x: x['tract']+'|'+x['block group'],axis=1)
From OpenDataPhilly we also download the census block geojson and joining it with the police district with the census block info.
c_bgroup=geopandas.read_file("Census_Block_Groups_2010.geojson")
c_bgroup['bkey']=c_bgroup.TRACTCE10+['|']+c_bgroup.BLKGRPCE10
cbgroups_in_pd=geopandas.sjoin(c_bgroup,po_districts,how='inner',op='intersects')
We combine the district geojson information with the race population counts.
cb_div=cbgroups_in_pd.join(census_race_df.drop(['Geography','Total','state','county','tract','block group'], axis=1),on='bkey',how='left')
We can convert that to percentages and combine some columns and create some dictionaries for mapping to our geopandas df.
district_race_perc=cb_div[['DISTRICT_']+race_col].groupby('DISTRICT_').sum().apply(lambda x: x / x.sum(), axis=1).dropna()
cats=['White alone','Black or African American alone','Asian alone']
district_race_perc['Other']=district_race_perc.drop(cats,axis=1).sum(axis=1)
white_dict=district_race_perc['White alone'].to_dict()
black_dict=district_race_perc['Black or African American alone'].to_dict()
asian_dict=district_race_perc['Asian alone'].to_dict()
other_dict=district_race_perc['Other'].to_dict()
Now we can update our previous Philly map to include the raccial information
po_districts_n['perc_white']=pd.to_numeric(po_districts_n.DISTRICT_).map(white_dict)
po_districts_n['perc_black']=pd.to_numeric(po_districts_n.DISTRICT_).map(black_dict)
po_districts_n['perc_asian']=pd.to_numeric(po_districts_n.DISTRICT_).map(asian_dict)
po_districts_n['perc_other']=pd.to_numeric(po_districts_n.DISTRICT_).map(other_dict)
fig, ax = plt.subplots(1, 1,figsize=(30,30))
po_districts_n.to_crs(epsg=3857).plot(column='perc_black',cmap='Greys', ax=ax, alpha=0.8)
po_districts_n.to_crs(epsg=3857).boundary.plot(ax=ax, color='k')
ctx.add_basemap(ax, source=ctx.providers.Stamen.TonerLite)
plt.scatter(list(po_districts_n.to_crs(epsg=3857).centroid.x),list(po_districts_n.to_crs(epsg=3857).centroid.y),alpha=0.5,s=list(po_districts_n.Inc_Sc),c=list(po_districts_n.ratio),cmap='YlOrRd',edgecolor='k')
for l,name in zip(po_districts_n.to_crs(epsg=3857).centroid,po_districts_n.DISTRICT_):
mlab='District: '+str(name)+'\nIncidents: '+'{:,}'.format(inc_in_d_dict[name])+'\nComplaints: '+str(comp_dict[name])
plt.text(l.x,l.y,mlab,ha='center', va='center',fontsize='12',bbox={'ec':'k','fc':'w'})
axins = inset_axes(ax,
width="5%", # width = 5% of parent_bbox width
height="100%", # height : 50%
loc='lower left',
bbox_to_anchor=(-0.1, 0., 1, 1),
bbox_transform=ax.transAxes,
borderpad=0,
)
sm = plt.cm.ScalarMappable(cmap='Greys', norm=plt.Normalize(vmin=0.0, vmax=1))
cbr = fig.colorbar(sm, cax=axins,alpha=0.5)
cbr.ax.tick_params(labelsize=18)
cbr.set_label(label='Percentage of population identifying as "Black alone"',fontsize=18)
axins = inset_axes(ax,
width="5%", # width = 5% of parent_bbox width
height="100%", # height : 50%
loc='lower left',
bbox_to_anchor=(1.01, 0., 1, 1),
bbox_transform=ax.transAxes,
borderpad=0,
)
sm2 = plt.cm.ScalarMappable(cmap='YlOrRd', norm=plt.Normalize(vmin=po_districts_n.ratio.min(), vmax=po_districts_n.ratio.max()))
cbr2 = fig.colorbar(sm2, cax=axins,alpha=0.8)
cbr2.ax.tick_params(labelsize=18)
cbr2.set_label(label='Complaints per 1000 Incidents',fontsize=18)
plt.text(0.8, 0.1, 'Marker Size related to Number of Incidents\n District 77 (Airport) not included',fontsize=26, horizontalalignment='center',verticalalignment='center', transform=ax.transAxes,bbox={'ec':'k','fc':'w'})
ax.set_axis_off()
ax.set_title('POLICE DISTRICT, RATIO OF COMPLAINTS PER 1000 INCIDENTS \n PERCENTAGE OF POPULATION IDENTIFYING AS BLACK',fontsize='32')
plt.savefig('PHL_MAP')
DISTRICT_ | perc_white | perc_black | perc_asian | perc_other | ratio | incidents | comp_count |
---|---|---|---|---|---|---|---|
35 | 9.47% | 72.70% | 6.75% | 11.07% | 4.24 | 162,935 | 691 |
14 | 18.56% | 71.66% | 1.53% | 8.25% | 3.67 | 148,235 | 544 |
19 | 9.10% | 83.63% | 2.18% | 5.10% | 2.74 | 172,064 | 472 |
18 | 27.47% | 55.67% | 8.59% | 8.27% | 2.70 | 139,114 | 376 |
39 | 17.27% | 72.65% | 2.60% | 7.49% | 2.48 | 124,017 | 308 |
16 | 24.80% | 58.36% | 8.34% | 8.49% | 2.41 | 93,365 | 225 |
25 | 27.44% | 36.08% | 3.20% | 33.29% | 2.35 | 180,793 | 424 |
22 | 16.45% | 73.79% | 3.47% | 6.28% | 2.25 | 188,417 | 424 |
5 | 77.79% | 11.94% | 2.59% | 7.68% | 2.17 | 38,290 | 83 |
12 | 9.41% | 79.83% | 3.60% | 7.17% | 2.12 | 161,608 | 343 |
8 | 73.46% | 12.05% | 6.15% | 8.34% | 2.00 | 87,432 | 175 |
17 | 43.36% | 42.48% | 6.97% | 7.19% | 1.93 | 88,161 | 170 |
15 | 48.66% | 27.28% | 6.04% | 18.03% | 1.91 | 221,014 | 423 |
7 | 69.76% | 8.96% | 13.20% | 8.09% | 1.85 | 52,539 | 97 |
24 | 50.09% | 18.11% | 3.02% | 28.77% | 1.78 | 202,544 | 361 |
1 | 49.57% | 33.19% | 11.54% | 5.69% | 1.71 | 56,095 | 96 |
6 | 62.55% | 17.34% | 11.86% | 8.25% | 1.61 | 120,764 | 194 |
2 | 38.41% | 31.33% | 13.25% | 17.01% | 1.58 | 138,937 | 219 |
26 | 52.46% | 23.23% | 5.21% | 19.10% | 1.44 | 107,303 | 154 |
9 | 66.21% | 14.37% | 10.82% | 8.60% | 1.41 | 105,538 | 149 |
3 | 66.54% | 8.44% | 16.32% | 8.69% | 1.18 | 137,634 | 162 |
It appears that there is a significant correlation between both the location of the neighborhood (north side Philly), racial composition of the district and the ratio of complaints to incidents. Mainly it does seem significant that the districts with a majority of black population seemed to have more complaints per each incident and should be further investigated to identify driving factors.
The Jupyter notebook used to run the analysis is located at my github.