Pandas tricks I wish I knew when I started

Pandas is great, I use it daily and its a godsend when coming from using massive excel spreadsheets. There are probably about a million beginner tutorials out there teaching you how to use pandas, however most of them seem to be about slicing, importing data from a csv or creating simple charts.

Now pandas user manual is fantastic and pretty much covers everything so definitely check it out but to be fair TLDR.

Below are a few that I have found aren't usually in beginner tutorials.

Dropping Columns

Most tutorials will usually just show you how you can select columns form your database. But sometimes I just one to keep all the columns except for one and I don't feel like typing out all the ones I want. Don't forget that this needs the axis=1 to define we are dropping columns. [1]

df.drop(['col1','col2'],axis=1)

  1. Pandas: drop ↩︎

Creating a timeseries

Pretty often I need to generate an array with a range of dates or months. There are different ways to do this but since most of the time I am already using pandas I prefer to use their method.[1]

#For days
pd.date_range(start='01-01-2020',end='12-31-2020',freq='D')

#For months
pd.date_range(start='01-01-2020',end='12-31-2020',freq='M')

  1. Pandas: date_range ↩︎

Creating a Dictionary

Quite often I find myself having to create a dictionary. Sometimes I need to do something along the lines of db.iterrows to create the dictionary. However most of the time just using the following will suffice. [1]

df.to_dict()

#The type of dictionary can de specified as such

df.to_dict('series')
df.to_dict('records')
df.to_dict('index')


  1. Pandas: to_dict ↩︎

Iterate Groups

For a long time I got away with using db.Col.unique and then iterating in a loop with filtering on that column. It worked, but it also not very efficient or clean looking.

I love pandas groupby method, use it all the time. However I don't love groupby objects. Pretty often I find myself having to operate on the groupby object. I prefer iterating over the groups rather than doing something like filtering by unique column values.[1]

gs=df.groupby('A')
for group_name, groutp in gs:
    print(group_name)
    #Perform whatever we want to do the group

  1. Pandas: groupby ↩︎

Grouping by Months, Days, Years

Groupby can also give us an easy way to get monthly, daily or yearly data.
If we have a datetime column for example we can easily get different frequencies using grouper [1].

df.groupby(pd.Grouper(key='date',freq='M')

  1. Pandas: Grouper ↩︎

Replacing values and cleaning up strings

Most of the time the data you are going to get is going to have quite a few missign values, somebody is going to get cute with whitespace everywhere, or decided that they needed a category that is only going to be used once and now you need to change it to other.

If we just need to get rid of the all the whitespace after our string in a column its pretty straight forward.

df.Col1.apply(lambda x: strip(x))

If we want to replace specific values inside a column we can just do [1]

df.replace({'Col1':{'A1':'A','A2':'A'},'Col2':{'B1':'B','B2','B'}})

If we wanted to get fancy we can also apply regex to our values to, for example, convert multiple whitespaces to underscores. So for example ' Col 1 tp Col_1 I mostly use this on items I know Im going to eventually use as columns. To apply to the values on a column I can to

import re
df.apply(lambda x: re.sub(r'\s+','_',x.strip())

# To apply to the columns
df.columns=[re.sub(r'\s+','_',x.strip() for x in df.columns]

  1. Pandas: replace ↩︎

Pulling values from a database

Another thing to love about pandas is how easy it is to pull data from a database. Most of the time I'm just pulling data from a SQL Server (Sqlite, MySQL or SQL Server 90% of the time). So theres not really need to get to fancy with it and pyodbc will suffice.[1]

import pyodbc
import pandas as pd

server="SQL_SERVER"
user="USER"
database="DB"
pwd="PASSWORD"
conn=pyodbc.connect('Driver={SQL Server};'
                      'Server='+server+';'
                      'Database='+database+';'
                      'UID='+user+';'
                      'PWD='+pwd)#+password_qvw)


pd.read_sql("SELECT * FROM TABLE", conn)

You can also use sqlalchemy and session or engine to read the table or pass a query

from sqlalchemy.orm import sessionmaker

from mysqlalchemymodels import Model

conn_string='mysql://{}:{}@{}:{}/{}?charset=utf8'.format(DB_USER, DB_PASS, DB_HOST, DB_PORT, DATABASE)

engine=create_engine(conn_string)
base.Base.metadata.create_all(engine)

db=pd.read_sql_table("table", con=engine)

#Using SQLAlchemy we can also use our classes, along with SQLAlchemy filtering and joining if so inclined

Session = sessionmaker(bind=engine)
session = Session()

db=pd.read_sql(session.query(Model).statement,session.bind)

  1. Pandas: read_sql ↩︎