Pandas - DataFrame creation and exploration¶

Pandas is Python Data Analysis library. Series and Dataframes are major data structures in Pandas. Pandas is built on top of NumPy arrays.

Series¶

Series is 1 dimensional data structure. It is similar to numpy array, but each data point has a label in the place of an index.

Create a series¶

In [1]:

Copied!

import pandas as pd
import numpy as np
import pandas as pd
import numpy as np

In [2]:

Copied!





#from a list
l1 = [1,2,3,4,5]
ser1 = pd.Series(data = l1)  #when you dont specify labels for index, it is autogenerated
ser1
#from a list
l1 = [1,2,3,4,5]
ser1 = pd.Series(data = l1)  #when you dont specify labels for index, it is autogenerated
ser1

Out[2]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [3]:

Copied!





#from a numpy array
arr1 = np.array(l1)
l2 = ['a', 'b', 'c','e', 'd']
ser2 = pd.Series(data=arr1, index=l2) #indices can of any data type, here string
ser2
#from a numpy array
arr1 = np.array(l1)
l2 = ['a', 'b', 'c','e', 'd']
ser2 = pd.Series(data=arr1, index=l2) #indices can of any data type, here string
ser2

Out[3]:

a    1
b    2
c    3
e    4
d    5
dtype: int32

In [4]:

Copied!





#from a dictionary
d1 = {'usa':1, 'india':2, 'germany':3, 'japan':'china', 'china':4}
ser3 = pd.Series(d1)
ser3
#from a dictionary
d1 = {'usa':1, 'india':2, 'germany':3, 'japan':'china', 'china':4}
ser3 = pd.Series(d1)
ser3

Out[4]:

china          4
germany        3
india          2
japan      china
usa            1
dtype: object

Thus Series can have different datatypes.

Operations on series¶

You can add, multiply and other numerical opertions on Series just like on numpy arrays.

In [5]:

Copied!

ser1a = pd.Series(l1)
ser1 + ser1a #each individual element with matching index/label is summed
ser1a = pd.Series(l1)
ser1 + ser1a #each individual element with matching index/label is summed

Out[5]:

0     2
1     4
2     6
3     8
4    10
dtype: int64

When labels dont match, it puts a nan. Thus when two series are added, you may or may not get the same number of elements

In [6]:

Copied!

ser1 + ser3
ser1 + ser3

Out[6]:

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
china      NaN
germany    NaN
india      NaN
japan      NaN
usa        NaN
dtype: object

DataFrames¶

Creating dataFrames¶

Pandas DataFrames are built on top of Series. It looks similar to a NumPy array, but has labels for both columns and rows.

In [2]:

Copied!

arr1 = np.random.rand(4,4)
arr1
arr1 = np.random.rand(4,4)
arr1

Out[2]:

array([[ 0.13430231,  0.62520675,  0.97098126,  0.71760504],
       [ 0.71376565,  0.77318189,  0.05968879,  0.45089927],
       [ 0.05898965,  0.90430093,  0.43148674,  0.08768307],
       [ 0.50989094,  0.50103695,  0.24427914,  0.76313491]])

In [3]:

Copied!





row_lables = ['Car1', 'Car2', 'Car3', 'Car4']
col_labels = ['reliability', 'cost', 'competition', 'halflife']

#create a dataframe
df1 = pd.DataFrame(data=arr1, index=row_lables, columns=col_labels)
df1
row_lables = ['Car1', 'Car2', 'Car3', 'Car4']
col_labels = ['reliability', 'cost', 'competition', 'halflife']

#create a dataframe
df1 = pd.DataFrame(data=arr1, index=row_lables, columns=col_labels)
df1

Out[3]:

	reliability	cost	competition	halflife
Car1	0.134302	0.625207	0.970981	0.717605
Car2	0.713766	0.773182	0.059689	0.450899
Car3	0.058990	0.904301	0.431487	0.087683
Car4	0.509891	0.501037	0.244279	0.763135

Slicing and dicing DataFrames¶

You can access DataFrames similar to Series and slice it similar to NumPy arrays

Access columns¶

DataFrameObj['column_name'] ==> returns a pandas.core.series.Series

In [15]:

Copied!

# Accessing a whole column
df1['reliability']
# Accessing a whole column
df1['reliability']

Out[15]:

Car1    0.894051
Car2    0.014127
Car3    0.351610
Car4    0.601929
Name: reliability, dtype: float64

In [16]:

Copied!

#can access as a property, but this is not advisable 
#since it can clobber builtin methods and properties
df1.reliability
#can access as a property, but this is not advisable 
#since it can clobber builtin methods and properties
df1.reliability

Out[16]:

Car1    0.894051
Car2    0.014127
Car3    0.351610
Car4    0.601929
Name: reliability, dtype: float64

Access rows¶

DataFrameobj.loc['row_label'] also returns a Series. Notice the .loc

In [18]:

Copied!

df1.loc['Car4']
df1.loc['Car4']

Out[18]:

reliability    0.601929
cost           0.854320
competition    0.391956
halflife       0.759363
Name: Car4, dtype: float64

In [19]:

Copied!

type(df1.loc['Car3'])
type(df1.loc['Car3'])

Out[19]:

pandas.core.series.Series

Accessing using index number¶

If you don't know the labels, but know the index like in an array, use iloc and pass the index number.

In [21]:

Copied!





#get first row, first col
val1 = df1.iloc[0,0]
print(val1)
print(type(val1))
#get first row, first col
val1 = df1.iloc[0,0]
print(val1)
print(type(val1))

0.894051123737
<class 'numpy.float64'>

In [23]:

Copied!

#get full first row
val2 = df1.iloc[0,:]
val2
#get full first row
val2 = df1.iloc[0,:]
val2

Out[23]:

reliability    0.894051
cost           0.849727
competition    0.538400
halflife       0.863986
Name: Car1, dtype: float64

In [24]:

Copied!

type(val2)
type(val2)

Out[24]:

pandas.core.series.Series

Dicing DataFrames¶

Dicing using labels ==> use DataFrameObj.loc[[row_labels],[col_labels]]

In [26]:

Copied!

#Get cost and competition of cars 2,3
df1.loc[['Car2', 'Car3'], ['cost', 'competition']]
#Get cost and competition of cars 2,3
df1.loc[['Car2', 'Car3'], ['cost', 'competition']]

Out[26]:

	cost	competition
Car2	0.935368	0.719570
Car3	0.659950	0.605077

With index number, dice using

DataFrameObj.iloc[[row_indices], [col_indices]]

In [27]:

Copied!

df1.iloc[[1,2], [1,2]]
df1.iloc[[1,2], [1,2]]

Out[27]:

	cost	competition
Car2	0.935368	0.719570
Car3	0.659950	0.605077

Conditional selection¶

When running a condition on a DataFrame, you are returned a Bool dataframe.

In [9]:

Copied!

df1
df1

Out[9]:

	reliability	cost	competition	halflife
Car1	0.776415	0.435083	0.236151	0.169087
Car2	0.790403	0.987459	0.370570	0.734146
Car3	0.884783	0.233803	0.691639	0.725398
Car4	0.693038	0.716824	0.766937	0.490821

In [10]:

Copied!

# find cars with reliability > 0.85
df1['reliability'] > 0.85
# find cars with reliability > 0.85
df1['reliability'] > 0.85

Out[10]:

Car1    False
Car2    False
Car3     True
Car4    False
Name: reliability, dtype: bool

In [11]:

Copied!

#to get the car select the data elements using the bool series
df1[df1['reliability'] > 0.85]
#to get the car select the data elements using the bool series
df1[df1['reliability'] > 0.85]

Out[11]:

	reliability	cost	competition	halflife
Car3	0.884783	0.233803	0.691639	0.725398

In [20]:

Copied!

#To get only the car name, which in this case is the index
df1[df1['reliability'] > 0.85].index[0]
#To get only the car name, which in this case is the index
df1[df1['reliability'] > 0.85].index[0]

Out[20]:

'Car3'

Chaining conditions¶

In a Pythonic way, you can chain conditions

df[df condition][selection][selection]

In [21]:

Copied!

#to get the actual value of reliablity for this car
df1[df1['reliability'] > 0.85]['reliability']
#to get the actual value of reliablity for this car
df1[df1['reliability'] > 0.85]['reliability']

Out[21]:

Car3    0.884783
Name: reliability, dtype: float64

In [22]:

Copied!

# get both reliability and cost
df1[df1['reliability'] > 0.85][['reliability', 'cost']]
# get both reliability and cost
df1[df1['reliability'] > 0.85][['reliability', 'cost']]

Out[22]:

	reliability	cost
Car3	0.884783	0.233803

Multiple conditions¶

You can select dataframe elements with multiple conditions. Note cannot use Python and , or. Instead use &, |

In [24]:

Copied!

#select cars that have reliability > 0.7 but competition less than 0.5
df1[(df1['reliability'] > 0.7) & (df1['competition'] < 0.5)]
#select cars that have reliability > 0.7 but competition less than 0.5
df1[(df1['reliability'] > 0.7) & (df1['competition'] < 0.5)]

Out[24]:

	reliability	cost	competition	halflife
Car1	0.776415	0.435083	0.236151	0.169087
Car2	0.790403	0.987459	0.370570	0.734146

In [26]:

Copied!

# select cars that have half life > 0.5 or competition < 0.4
df1[(df1['halflife'] > 0.5) | (df1['competition'] < 0.4)]
# select cars that have half life > 0.5 or competition < 0.4
df1[(df1['halflife'] > 0.5) | (df1['competition'] < 0.4)]

Out[26]:

	reliability	cost	competition	halflife
Car1	0.776415	0.435083	0.236151	0.169087
Car2	0.790403	0.987459	0.370570	0.734146
Car3	0.884783	0.233803	0.691639	0.725398

Operations on DataFrames¶

Adding new columns¶

Create new columns just like adding a kvp to a dictionary.

DataFrameObj['new_col'] = Series

In [4]:

Copied!

#add full life column
df1['full_life'] = df1['halflife'] * 2 #similar to array, series broadcast multiplication
df1
#add full life column
df1['full_life'] = df1['halflife'] * 2 #similar to array, series broadcast multiplication
df1

Out[4]:

	reliability	cost	competition	halflife	full_life
Car1	0.134302	0.625207	0.970981	0.717605	1.435210
Car2	0.713766	0.773182	0.059689	0.450899	0.901799
Car3	0.058990	0.904301	0.431487	0.087683	0.175366
Car4	0.509891	0.501037	0.244279	0.763135	1.526270

Dropping rows and columns¶

DataFrameObj.drop(label, axis, inplace=True / False)

Row labels are axis = 0 and columns are axis = 1

In [5]:

Copied!

df1.drop('full_life', axis=1, inplace=False)
df1.drop('full_life', axis=1, inplace=False)

Out[5]:

	reliability	cost	competition	halflife
Car1	0.134302	0.625207	0.970981	0.717605
Car2	0.713766	0.773182	0.059689	0.450899
Car3	0.058990	0.904301	0.431487	0.087683
Car4	0.509891	0.501037	0.244279	0.763135

In [6]:

Copied!

df1.drop('Car3') #all else is the default
df1.drop('Car3') #all else is the default

Out[6]:

	reliability	cost	competition	halflife	full_life
Car1	0.134302	0.625207	0.970981	0.717605	1.435210
Car2	0.713766	0.773182	0.059689	0.450899	0.901799
Car4	0.509891	0.501037	0.244279	0.763135	1.526270

Drop a row based on a condition.

In [7]:

Copied!

df1.drop(df1[df1['cost'] > 0.65].index, inplace=False)
df1.drop(df1[df1['cost'] > 0.65].index, inplace=False)

Out[7]:

	reliability	cost	competition	halflife	full_life
Car1	0.134302	0.625207	0.970981	0.717605	1.43521
Car4	0.509891	0.501037	0.244279	0.763135	1.52627

DataFrame Index¶

So far, Car1, Car2.. is the index for rows. If you would like to set a different column as an index, use set_index. If you want to make index as a column rather, and use numerals for index, use reset_index

Set index¶

In [30]:

Copied!

#set car names as index for the data frame
car_names = 'altima outback taurus mustang'.split()
car_names
#set car names as index for the data frame
car_names = 'altima outback taurus mustang'.split()
car_names

Out[30]:

['altima', 'outback', 'taurus', 'mustang']

In [36]:

Copied!

df1['car_names'] = car_names
df1
df1['car_names'] = car_names
df1

Out[36]:

	reliability	cost	competition	halflife	car_names
Car1	0.776415	0.435083	0.236151	0.169087	altima
Car2	0.790403	0.987459	0.370570	0.734146	outback
Car3	0.884783	0.233803	0.691639	0.725398	taurus
Car4	0.693038	0.716824	0.766937	0.490821	mustang

In [37]:

Copied!

df_new_index = df1.set_index(keys= df1['car_names'], inplace=False)
df_new_index
df_new_index = df1.set_index(keys= df1['car_names'], inplace=False)
df_new_index

Out[37]:

	reliability	cost	competition	halflife	car_names
car_names
altima	0.776415	0.435083	0.236151	0.169087	altima
outback	0.790403	0.987459	0.370570	0.734146	outback
taurus	0.884783	0.233803	0.691639	0.725398	taurus
mustang	0.693038	0.716824	0.766937	0.490821	mustang

Note, the old index is lost.

Rest index¶

In [38]:

Copied!

#reset df1 index to numerals and convert existing to a column
df1.reset_index()
#reset df1 index to numerals and convert existing to a column
df1.reset_index()

Out[38]:

	index	reliability	cost	competition	halflife	car_names
0	Car1	0.776415	0.435083	0.236151	0.169087	altima
1	Car2	0.790403	0.987459	0.370570	0.734146	outback
2	Car3	0.884783	0.233803	0.691639	0.725398	taurus
3	Car4	0.693038	0.716824	0.766937	0.490821	mustang