Data Exploration with GeoPandas

With the police reform at the forefront of the conversation for most Americans, I thought it would be interesting to look at what information regarding policing is freely available for my current city, Philadelphia.

Enter OpenDataPhilly. A great repository for information regarding Philadelphia.

The most informative datasets I found regarding policing in the city are the following:

The first dataset provides a listing of complaints against police officers linking them to their police district

The second dataset contains information and geographical location of crime incidents.

Finally the last provides the boundaries of each police district, allowing us to do some data aggregation over the district.

A great tool to do spatial analysis is Geopandas. It behaves similar to a pandas database but with some additional spatial features.

The first step is to load our different datasets into a geopandas df.

import geopandas
import pandas as pd

complaints=pd.read_csv('ppd_complaints.csv')
po_districts=geopandas.read_file("Boundaries_District.geojson")
disciplines=pd.read_csv('ppd_complaint_disciplines.csv')
inc=geopandas.read_file("../incidents_part1_part2.geojson")

To check our dataset we can plot it.

po_districts.plot()

In order to simplify the analysis we do some cleanup of the data

complaint_date=complaints.reset_index().set_index('complaint_id')['date_received'].to_dict()

disciplines['date_recieved']=disciplines.complaint_id.map(complaint_date)

disciplines['district_num']=pd.to_numeric(disciplines.po_assigned_unit.dropna().apply(lambda x: x.split(' ')[0][:2]),errors='coerce')

Now there are a couple of different aspects we would want to check for the data. The first would be to determine or get a metric of how may complaints each officer has. I would assume that pretty much every officer has some. But how many is too many? Are there any particular officers or precincts that have a significantly larger amount of than their peers.

We can look at the basic stats of the complaints dataset.

complaintsbyofficer=disciplines.groupby('officer_id').complaint_id.count().sort_values(ascending=False)

complaintsbyofficer.describe().to_frame().style.format("{:.2f}")

	complaint_id
count	3032.00
mean	2.66
std	2.59
min	1.00
25%	1.00
50%	2.00
75%	3.00
max	32.00

However this doesn't provide much information regarding the shape of the dataset and its outliers. Although one officer receiving 32 complaints seems high

One great way to get a quick overview of the outliers we can use a box plot. This is easy to do with seaborn.

import seaborne as sns

unit_officer=disciplines.groupby(['district_num','officer_id']).complaint_id.count().reset_index()

fig, ax = plt.subplots(1, 1,figsize=(35,10))
g=sns.boxplot(x="district_num", y="complaint_id",data=unit_officer[unit_officer['district_num'].isin(po_districts.DISTRICT_.unique())], ax=ax)
plt.xticks(rotation=90);
ax.tick_params(labelsize=24);
fig.suptitle('District Complaints/ Officer', fontsize=32);
ax.set_xlabel('District Number', fontsize=24);
ax.set_ylabel('Number of Complaints', fontsize=24);

disctrict_officer

From the chart it seems that district 14 and district 35 have a significantly amount of officers that receive more complaints than their peers.

While we are at it we can pull out the outliers per category using the following.

outlier_officers=[]
for c, g in unit_officer.reset_index().groupby('district_num'):
    stats=boxplot_stats(g.complaint_id)
    outliers = [y for stat in boxplot_stats(g['complaint_id']) for y in stat['fliers']]
    outlier_officers=np.append(outlier_officers,list(g[g['complaint_id'].isin(outliers)].officer_id.values))

Now that we have the outliers is there an easily identifiable patter for the officers, for example race.

outlier_race=disciplines[disciplines['officer_id'].isin(outlier_officers)].groupby('officer_id')['po_race'].unique()

fig, ax = plt.subplots(1, 1,figsize=(20,10))
outlier_race.apply(lambda x:x[0]).value_counts().plot(kind='bar', ax=ax);
ax.tick_params(labelsize=24)

outlier_hist

From the histogram it appears that a majority of the incidents come from police officers identified as white. However this could also just be because the majority of police officers are white.

fig, ax = plt.subplots(1, 1,figsize=(20,10))
disciplines.groupby('officer_id').po_race.unique().apply(lambda x:x[0]).value_counts().plot(kind='bar');
ax.tick_params(labelsize=24)

officer_hist

From that the histogram of the overall police officers receiving complaints we see that the outlier population seems to closely mirror the overall population so the race of the officer might not have a significant impact in whether or nor an officer receives a large amount of complaints.

Continuing with the racial information contained by the complaints we can check to see if that has an effect on the findings related to the complaints.

colors_tc=cm.Spectral(np.linspace(0, 1,len(disciplines.investigative_findings.unique()) ) )

fig, ax = plt.subplots(1, 1,figsize=(10,10))
disciplines.investigative_findings.value_counts().plot(kind='pie', ax=ax,textprops={'fontsize': 14},colors=colors_tc,wedgeprops={"edgecolor":"k",'linewidth': 1, 'antialiased': True})

inv_res

Discarding the Pending investigations it looks like overall 20-25% of complaints have resulted in sustained findings.

race_results=disciplines.groupby('po_race').investigative_findings.value_counts().to_frame()
race_results.columns=['counts']
race_results=race_results.reset_index()
r_f=race_results[~race_results.investigative_findings.isin(['Pending'])]


cdict={}
for i,c in zip(disciplines.investigative_findings.unique(), colors_tc):
    cdict[i]=c

def plt_pie(x,y,**kwargs):
    cd=[cdict[l] for l in x]
    plt.pie(y,labels=x, colors=cd,wedgeprops={"edgecolor":"k",'linewidth': 1, 'antialiased': True})

g=sns.FacetGrid(r_f,col="po_race", col_wrap=2, sharey=False,height=6)

g=(g.map(plt_pie, "investigative_findings", "counts",cdict=cdict))

inv_res_race

pd.pivot_table(r_f,values='counts',index=['po_race'],columns=['investigative_findings']).apply(lambda x: x / x.sum(), axis=1).fillna(0).style.format("{:.2%}")

investigative_findings	No Sustained Findings	Not Applicable	Sustained Finding
po_race
asian	79.29%	4.14%	16.57%
black	76.20%	3.07%	20.73%
indian	80.00%	20.00%	0.00%
latino	74.45%	2.35%	23.20%
other	80.65%	0.00%	19.35%
white	77.30%	2.43%	20.26%

It does not appear that race of the officer greatly affects the investigation results, although it does appear that Latino officers then to have more complaints sustained and officers of Asian descent have less.

In order to get a better idea of what this looks like spatially we can plot the values and show the amount of complaints by district.
In order to provide a nice basemap we use contextily and make sure we convert to the right coordinate system.

import contextily as ctx

comp_dict=disciplines.groupby('district_num').complaint_id.count().to_dict()
po_districts['comp_count']=pd.to_numeric(po_districts.DISTRICT_).apply(lambda x: comp_dict[x])

fig, ax = plt.subplots(1, 1,figsize=(30,30))
po_districts.to_crs(epsg=3857).plot(column='comp_count',cmap='YlOrRd',legend=True, ax=ax, alpha=0.5)
po_districts.to_crs(epsg=3857).boundary.plot(ax=ax, color='k')
ctx.add_basemap(ax, source=ctx.providers.Stamen.TonerLite)

for l,name in zip(po_districts.to_crs(epsg=3857).centroid,po_districts.DISTRICT_):
    plt.text(l.x,l.y,name,ha='center', va='center',fontsize='32')

ax.set_axis_off()
ax.set_title('POLICE DISTRICT, COMPLAINT COUNTS',fontsize='56');

phl_district

From this plot we can see that districts 14 and 35 (previously identified as having outlier officers) have a large amount of complaints.

However complaint counts by themselves don't necessarily provide us the full picture. After all there could be districts with a higher number of crime incidents leading to a higher number of complaints.

We can get a count of crime incidents per district relatively easy and use that to get a measurement of complaints per 1000 incidents. We can do so using spatial joins.

inc_in_z=geopandas.sjoin(inc,po_districts,how='inner',op='within')

inc_in_d=inc_in_z.groupby('DISTRICT_').objectid.count()

inc_in_d_dict=inc_in_d.to_dict()

po_districts['incidents']=po_districts['DISTRICT_'].apply(lambda x: inc_in_d_dict[x])

po_districts['ratio']=po_districts.apply(lambda x:(x.comp_count*1000)/x.incidents,axis=1)

ratio_dict=po_districts[['DISTRICT_','ratio']].set_index('DISTRICT_').to_dict()['ratio']

inc_in_d=inc_in_d.loc[lambda x: x.index != 77]

xmax=inc_in_d.max()
xmin=inc_in_d.min()

def scale_into_range(x,a,b,xmin,xmax):
    res=(((b-a)*(x-xmin))/(xmax-xmin))+a
    return res

inc_in_d_scale=inc_in_d.apply(lambda x: scale_into_range(x,1000,20000,xmin,xmax))
inc_in_d_scale_dict=inc_in_d_scale.to_dict()

po_districts_n=po_districts[po_districts['DISTRICT_']!=77].copy()
po_districts_n['Inc_Sc']=po_districts_n.DISTRICT_.map(inc_in_d_scale)

po_districts_n['comp_count']=pd.to_numeric(po_districts_n.DISTRICT_).apply(lambda x: comp_dict[x])

fig, ax = plt.subplots(1, 1,figsize=(30,30))
po_districts_n.to_crs(epsg=3857).plot(column='comp_count',cmap='Blues', ax=ax, alpha=0.5)

po_districts_n.to_crs(epsg=3857).boundary.plot(ax=ax, color='k')
ctx.add_basemap(ax, source=ctx.providers.Stamen.TonerLite)
plt.scatter(list(po_districts_n.to_crs(epsg=3857).centroid.x),list(po_districts_n.to_crs(epsg=3857).centroid.y),alpha=0.5,s=list(po_districts_n.Inc_Sc),c=list(po_districts_n.ratio),cmap='YlOrRd',edgecolor='k')  
for l,name in zip(po_districts_n.to_crs(epsg=3857).centroid,po_districts_n.DISTRICT_):
    mlab='District: '+str(name)+'\nIncidents: '+'{:,}'.format(inc_in_d_dict[name])+'\nComplaints: '+str(comp_dict[name])
    plt.text(l.x,l.y,mlab,ha='center', va='center',fontsize='12')

axins = inset_axes(ax,
                   width="5%",  # width = 5% of parent_bbox width
                   height="100%",  # height : 50%
                   loc='lower left',
                   bbox_to_anchor=(-0.1, 0., 1, 1),
                   bbox_transform=ax.transAxes,
                   borderpad=0,
                   )

sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=po_districts_n.comp_count.min(), vmax=po_districts_n.comp_count.max()))
cbr = fig.colorbar(sm, cax=axins,alpha=0.5)
cbr.ax.tick_params(labelsize=18)
cbr.set_label(label='Complaint Counts',fontsize=18)

axins = inset_axes(ax,
                   width="5%",  # width = 5% of parent_bbox width
                   height="100%",  # height : 50%
                   loc='lower left',
                   bbox_to_anchor=(1.1, 0., 1, 1),
                   bbox_transform=ax.transAxes,
                   borderpad=0,
                   )

sm2 = plt.cm.ScalarMappable(cmap='YlOrRd', norm=plt.Normalize(vmin=po_districts_n.ratio.min(), vmax=po_districts_n.ratio.max()))
cbr2 = fig.colorbar(sm2, cax=axins,alpha=0.5)
cbr2.ax.tick_params(labelsize=18)
cbr2.set_label(label='Complaints per 1000 Incidents',fontsize=18)

plt.text(0.8, 0.1, 'Marker Size related to Number of Incidents\n District 77 (Airport) not included',fontsize=26, horizontalalignment='center',verticalalignment='center', transform=ax.transAxes,bbox={'ec':'k','fc':'w'})
ax.set_axis_off()
ax.set_title('POLICE DISTRICT, RATIO OF COMPLAINTS PER 1000 INCIDENTS',fontsize='56')

phl_map_1

In this plot the circles correlate to number of incidents, the color of them related to the ratio of complaints to 1000 incidents. The color of the district correlated to number of complaints.

District 35 once again pops out as having a high ratio of complaints.

Perhaps this district has a large number of complaints that are found to not be valid. We can check that for all districts relatively easy.

dn_results=disciplines.groupby('district_num').investigative_findings.value_counts().to_frame()
dn_results.columns=['counts']
dn_results=dn_results.reset_index()
dn_f=dn_results[~dn_results.investigative_findings.isin(['Pending'])]

display(HTML('<h1>INVESTIGATION RESULTS PER DISTRICT</h1>'))
dn_perc=pd.pivot_table(dn_f,values='counts',index=['district_num'],columns=['investigative_findings']).apply(lambda x: x / x.sum(), axis=1)
dn_perc.loc[po_districts.DISTRICT_.unique()].sort_values('Sustained Finding',ascending=False).style.format("{:.2%}")

fig, ax = plt.subplots(1, 1,figsize=(20,10))
dn_perc.loc[po_districts.DISTRICT_.unique()].plot(kind='bar',figsize=(20,10),stacked=True,ax=ax);
ax.set_title('COMPLAINT INVESTIGATION RESULTS PER DISTRICT',fontsize='32');
ax.tick_params(labelsize=18)

inv_res_district

Form the resulting bar chart it does not appear that District 35 (or anyother outside of 22) have a distinctly different distribution of complaint findings.

Is there anything unique about the districts with a high ratio of complaints per 1000 incidents? One specific aspect to look at is the racial makeup of each district.

We can use the ACS Census information to obtain that information per census block and join that to the district information.

First we query the API to get the state and county codes.

state_codes=requests.get('https://api.census.gov/data/2010/dec/sf1?get=NAME&for=state:*')

state_dict={}
for x in state_codes.json()[1:]:
    state_dict[x[0]]=x[1]
state_code=state_dict['Pennsylvania']

county_codes=requests.get('https://api.census.gov/data/2010/dec/sf1?get=NAME&for=county:*&in=state:'+state_code)

county_dict={}
for x in county_codes.json()[1:]:
    county_dict[x[0]]=x[2]
county_code=county_dict['Philadelphia County, Pennsylvania']

After we get both the state code and the county information we can request the population information form the api.

census_race=requests.get('https://api.census.gov/data/2018/acs/acs5?get=group(B02001)&for=block%20group:*&in=state:'+state_code+'%20'+'county:'+county_code)

census_race_df=pd.DataFrame(census_race.json()[1:],columns=census_race.json()[0])

#We do some data cleanup there, eventually creating a key for its block number

census_race_df['bkey']=census_race_df.apply(lambda x: x['tract']+'|'+x['block group'],axis=1)

From OpenDataPhilly we also download the census block geojson and joining it with the police district with the census block info.

c_bgroup=geopandas.read_file("Census_Block_Groups_2010.geojson")
c_bgroup['bkey']=c_bgroup.TRACTCE10+['|']+c_bgroup.BLKGRPCE10
cbgroups_in_pd=geopandas.sjoin(c_bgroup,po_districts,how='inner',op='intersects')

We combine the district geojson information with the race population counts.

cb_div=cbgroups_in_pd.join(census_race_df.drop(['Geography','Total','state','county','tract','block group'], axis=1),on='bkey',how='left')

We can convert that to percentages and combine some columns and create some dictionaries for mapping to our geopandas df.

district_race_perc=cb_div[['DISTRICT_']+race_col].groupby('DISTRICT_').sum().apply(lambda x: x / x.sum(), axis=1).dropna()

cats=['White alone','Black or African American alone','Asian alone']
district_race_perc['Other']=district_race_perc.drop(cats,axis=1).sum(axis=1)

white_dict=district_race_perc['White alone'].to_dict()
black_dict=district_race_perc['Black or African American alone'].to_dict()
asian_dict=district_race_perc['Asian alone'].to_dict()
other_dict=district_race_perc['Other'].to_dict()

Now we can update our previous Philly map to include the raccial information

po_districts_n['perc_white']=pd.to_numeric(po_districts_n.DISTRICT_).map(white_dict)
po_districts_n['perc_black']=pd.to_numeric(po_districts_n.DISTRICT_).map(black_dict)
po_districts_n['perc_asian']=pd.to_numeric(po_districts_n.DISTRICT_).map(asian_dict)
po_districts_n['perc_other']=pd.to_numeric(po_districts_n.DISTRICT_).map(other_dict)

fig, ax = plt.subplots(1, 1,figsize=(30,30))
po_districts_n.to_crs(epsg=3857).plot(column='perc_black',cmap='Greys', ax=ax, alpha=0.8)

po_districts_n.to_crs(epsg=3857).boundary.plot(ax=ax, color='k')
ctx.add_basemap(ax, source=ctx.providers.Stamen.TonerLite)
plt.scatter(list(po_districts_n.to_crs(epsg=3857).centroid.x),list(po_districts_n.to_crs(epsg=3857).centroid.y),alpha=0.5,s=list(po_districts_n.Inc_Sc),c=list(po_districts_n.ratio),cmap='YlOrRd',edgecolor='k')  
for l,name in zip(po_districts_n.to_crs(epsg=3857).centroid,po_districts_n.DISTRICT_):
    mlab='District: '+str(name)+'\nIncidents: '+'{:,}'.format(inc_in_d_dict[name])+'\nComplaints: '+str(comp_dict[name])
    plt.text(l.x,l.y,mlab,ha='center', va='center',fontsize='12',bbox={'ec':'k','fc':'w'})

axins = inset_axes(ax,
                   width="5%",  # width = 5% of parent_bbox width
                   height="100%",  # height : 50%
                   loc='lower left',
                   bbox_to_anchor=(-0.1, 0., 1, 1),
                   bbox_transform=ax.transAxes,
                   borderpad=0,
                   )

sm = plt.cm.ScalarMappable(cmap='Greys', norm=plt.Normalize(vmin=0.0, vmax=1))
cbr = fig.colorbar(sm, cax=axins,alpha=0.5)
cbr.ax.tick_params(labelsize=18)
cbr.set_label(label='Percentage of population identifying as "Black alone"',fontsize=18)

axins = inset_axes(ax,
                   width="5%",  # width = 5% of parent_bbox width
                   height="100%",  # height : 50%
                   loc='lower left',
                   bbox_to_anchor=(1.01, 0., 1, 1),
                   bbox_transform=ax.transAxes,
                   borderpad=0,
                   )

sm2 = plt.cm.ScalarMappable(cmap='YlOrRd', norm=plt.Normalize(vmin=po_districts_n.ratio.min(), vmax=po_districts_n.ratio.max()))
cbr2 = fig.colorbar(sm2, cax=axins,alpha=0.8)
cbr2.ax.tick_params(labelsize=18)
cbr2.set_label(label='Complaints per 1000 Incidents',fontsize=18)

plt.text(0.8, 0.1, 'Marker Size related to Number of Incidents\n District 77 (Airport) not included',fontsize=26, horizontalalignment='center',verticalalignment='center', transform=ax.transAxes,bbox={'ec':'k','fc':'w'})
ax.set_axis_off()
ax.set_title('POLICE DISTRICT, RATIO OF COMPLAINTS PER 1000 INCIDENTS \n PERCENTAGE OF POPULATION IDENTIFYING AS BLACK',fontsize='32')
plt.savefig('PHL_MAP')

phl_map_2

DISTRICT_	perc_white	perc_black	perc_asian	perc_other	ratio	incidents	comp_count
35	9.47%	72.70%	6.75%	11.07%	4.24	162,935	691
14	18.56%	71.66%	1.53%	8.25%	3.67	148,235	544
19	9.10%	83.63%	2.18%	5.10%	2.74	172,064	472
18	27.47%	55.67%	8.59%	8.27%	2.70	139,114	376
39	17.27%	72.65%	2.60%	7.49%	2.48	124,017	308
16	24.80%	58.36%	8.34%	8.49%	2.41	93,365	225
25	27.44%	36.08%	3.20%	33.29%	2.35	180,793	424
22	16.45%	73.79%	3.47%	6.28%	2.25	188,417	424
5	77.79%	11.94%	2.59%	7.68%	2.17	38,290	83
12	9.41%	79.83%	3.60%	7.17%	2.12	161,608	343
8	73.46%	12.05%	6.15%	8.34%	2.00	87,432	175
17	43.36%	42.48%	6.97%	7.19%	1.93	88,161	170
15	48.66%	27.28%	6.04%	18.03%	1.91	221,014	423
7	69.76%	8.96%	13.20%	8.09%	1.85	52,539	97
24	50.09%	18.11%	3.02%	28.77%	1.78	202,544	361
1	49.57%	33.19%	11.54%	5.69%	1.71	56,095	96
6	62.55%	17.34%	11.86%	8.25%	1.61	120,764	194
2	38.41%	31.33%	13.25%	17.01%	1.58	138,937	219
26	52.46%	23.23%	5.21%	19.10%	1.44	107,303	154
9	66.21%	14.37%	10.82%	8.60%	1.41	105,538	149
3	66.54%	8.44%	16.32%	8.69%	1.18	137,634	162

It appears that there is a significant correlation between both the location of the neighborhood (north side Philly), racial composition of the district and the ratio of complaints to incidents. Mainly it does seem significant that the districts with a majority of black population seemed to have more complaints per each incident and should be further investigated to identify driving factors.

The Jupyter notebook used to run the analysis is located at my github.

Philly PD Incident Data Exploration with GeoPandas

Fernando Rodriguez

Fernando Rodriguez

Managing a SharePoint site using Microsoft Graph API and Python

Building a Fitness Tracking Dashboard with Python. Pt 3 Flask and Deployment