User Tools

Site Tools


Creating a DataFrame

Create empty DataFrame

import pandas
index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A','B', 'C'])

Create DataFrame from pandas.Series

import pandas
s1 = pandas.Series([1,2,3], index=[0,1,2])
s2 = pandas.Series(['a','b','c'], index=[2,3,4])
df = pandas.DataFrame( {'column1':s1, 'column2':s2})  
# Note:
# pandas uses the index of both series to join the correct rows together:
print df
   column1 column2
0        1     NaN
1        2     NaN
2        3       a
3      NaN       b
4      NaN       c

Import data from CSV

import pandas
data = pandas.read_csv("data.csv", index_col="time", parse_dates=True) 

Parse index as timestamp:

import pandas
data = pandas.read_csv("data.csv", parse_dates=True) # tries to parse index as timestamp

Parse some columns as timestamp

import pandas
data = pandas.read_csv("data.csv", index_col="time", parse_dates=["from_time", "to_time"]) # list of ints or names, list of lists, or dict

Import data from DB

import as sql
import psycopg2
c = psycopg2.connect(host='localhost', database='big', user='sec', password='ret')
frame = sql.read_sql("select * from my_table", c)

Working with DataFrames

Multi-axis indexing

Starting in 0.11.0, more explicit location based indexing is supported. Pandas now supports three types of multi-axis indexing, by label (loc), by position (iloc) and mixed (ix):

df.loc['a'] # row with index a
df.time_stamp.iloc[0]  # first row, column "time_stamp"
df.time_stamp.iloc[-1] # last row, column "time_stamp"

ix supports mixed integer and label based access: it will try to use arguments as labels first, then fallback to integer positional access). The documentations warns that it is better to be explicit and use .iloc or .loc whenever possible.

Iterate over rows

for (index_value, series) in data_frame.iterrows():
   # do something

maybe a little bit more intuitive, but not so sure about the performance

for (idx, col1, col2) in zip(df.index, df.col1, df.col2):
    # do something

Iterate over columns

for (column_name, timeseries) in data_frame.iterkv():
   # do something

Add Column

# simple version
data['new_column'] = 0 
# complicated version (not sure if this has an advantage)
data['new_column'] = pandas.Series(numpy.zeros(len(data)), index=data.index)


Suppose you have a dataframe with an incomplete timeseries of a day, and you want to fill the missing values with nan:

    complete_day = pandas.date_range('2013-05-18', periods=96, freq='15min')


data_frame.sort(inplace=True, ascending=True) # sort by index


from pandas.tseries.resample import TimeGrouper
for name, group in groupby(TimeGrouper(freq='15min')):
    print name, group

Get row index

From SO

from numpy import where
import pandas
from datetime import datetime
index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A','B', 'C'])
ts = datetime.strptime("2013-03-01 01","%Y-%m-%d %H") # row index: 4
print where(df.index == ts)[0] # The value returned is an array since there could be more than one row with a particular index or value in a column.


Suppose you have a pandas DataFrame with a Datetimeindex and you want all entries in the hours 0-3

data[data.index.hour < 4]

For multiple criteria it is easier to define an explicit variable:

where = (data.speed > 10 ) |  ( data.speed < 90 )  
# dont forget the ()  or you get an error. Also note the bit-wise or-operator!

Modify Selection

This can be very counter-intuitive in pandas. The correct way to modify a DataFrame via boolean selection is to specify the field to be modifed FIRST and the boolean index afterwards.

data.ix[where, 'is_filtered'] = 1 # probably this is the recommended version? 
# Be aware that the version where the boolean index is specified first DOES NOT WORK
# data[where]['is_filtered'] = 1    # THIS IS WRONG
# because data[where] calls __getitem__ which creates a COPY for boolean-indexed-slices



from pandas.tseries.resample import TimeGrouper
def filter_func(grp):
    grp = grp.copy()  # otherwise the original dataframe will be modified!
    where = ( grp['speed'] < 10 ) | (grp['speed'] > 90 ) 
    grp.filtered[where] = 1
    return grp
filtered_data = data.groupby(TimeGrouper(freq='15min')).transform(filter_func)

Combining DataFrames

Combining Data with different intervals

The Series.asof Function is well suited to do that:

Example: Join every row with the 15min Median of the previous time interval

data = ....
median15 = data.resample("15min", how="median", label="right") # note: label="right" causes every record in data to be joined with the PREVIOUS median
data['last_median']  =  [median15.speed.asof(t) for t in data.index]


Details in Pandas doc

result = df1.join(df2, how="outer", lsuffix="_df1", rsuffix="_df2")


# add column
data['new_column'] = 0 
# add entries


from pandas import Series, date_range
from numpy.random import randn
rng1 = date_range('2011-04-01', periods=5, freq='H')
rng2 = date_range('2011-04-02', periods=7, freq='H')
df1 = Series(randn(len(rng1)), index=rng1)
df2 = Series(randn(len(rng2)), index=rng2)
# align
adf1, adf2 = df1.align(df2) # returns two aligned dataframes, default: outer join

Stack / Pivot


Suppose you have data like this:

time attribute value
12:00 speed 45.0
12:00 count 150

But you prefer to have data like this:

time speed count
12:00 45.0 150
from datetime import time 
import pandas
df = pandas.DataFrame(dict(timestamp=[time(12,0), time(12,0)], attribute=["speed","count"], value=[45.0, 150.0]))
df.pivot(index="timestamp", columns="attribute", values="value")

Output / Conversion

Pandas offers a lot of good output functionality. Some examples


to create a excel file with more sheets:

writer = pd.ExcelWriter('tmp.xls')
df1.to_excel(writer, sheet_name='sheet1')
df2.to_excel(writer, sheet_name='sheet2')

save dataframe into a database if_exists can be fail, replace or append

from import sql
import psycopg2
# database connection
c = psycopg2.connect(host='localhost', database='big', user='sec', password='ret')
# if_exists checks if the table exists and acts accordingly
#   fail: throw a ValueError if the table exists
#   replace: drop the table and create a new one
#   append: append the data to the table
sql.write_frame(sf, 'table_name', con, if_exists='append')

Convert from Unix Timestamp

import pandas as pd
import numpy as np
import time
# create a dataframe with unix timestamp
value = np.arange(100)
ts = time.time() + value
df = pd.DataFrame(dict(time=ts, value=value))
# convert unix timestamp to a valid numpy datetime
df['time'] = df.time.astype('M8[s]')

DataFrame to numpy recarray

The integrated method to_records creates a numpy recarray

recarray = data.to_records()

To select only specific columns

recarray = data[['col1','col2']].to_records(index=False)

DataFrame to matrix

ma = data.as_matrix(['col1', 'col2'])


Plot days on top of each other

data = ... #  contains data of several days
pylab.plot(data.index.time, data.value1)


Colouring can be done by selecting a colormap:

pandas.txt · Last modified: 2016/05/20 10:24 by mantis