Pandas - DataFrame creation and exploration¶
Pandas is Python Data Analysis library. Series and Dataframes are major data structures in Pandas. Pandas is built on top of NumPy arrays.
import pandas as pd
import numpy as np
#from a list
l1 = [1,2,3,4,5]
ser1 = pd.Series(data = l1) #when you dont specify labels for index, it is autogenerated
ser1
0 1 1 2 2 3 3 4 4 5 dtype: int64
#from a numpy array
arr1 = np.array(l1)
l2 = ['a', 'b', 'c','e', 'd']
ser2 = pd.Series(data=arr1, index=l2) #indices can of any data type, here string
ser2
a 1 b 2 c 3 e 4 d 5 dtype: int32
#from a dictionary
d1 = {'usa':1, 'india':2, 'germany':3, 'japan':'china', 'china':4}
ser3 = pd.Series(d1)
ser3
china 4 germany 3 india 2 japan china usa 1 dtype: object
Thus Series can have different datatypes.
Operations on series¶
You can add, multiply and other numerical opertions on Series just like on numpy arrays.
ser1a = pd.Series(l1)
ser1 + ser1a #each individual element with matching index/label is summed
0 2 1 4 2 6 3 8 4 10 dtype: int64
When labels dont match, it puts a nan. Thus when two series are added, you may or may not get the same number of elements
ser1 + ser3
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN china NaN germany NaN india NaN japan NaN usa NaN dtype: object
arr1 = np.random.rand(4,4)
arr1
array([[ 0.13430231, 0.62520675, 0.97098126, 0.71760504], [ 0.71376565, 0.77318189, 0.05968879, 0.45089927], [ 0.05898965, 0.90430093, 0.43148674, 0.08768307], [ 0.50989094, 0.50103695, 0.24427914, 0.76313491]])
row_lables = ['Car1', 'Car2', 'Car3', 'Car4']
col_labels = ['reliability', 'cost', 'competition', 'halflife']
#create a dataframe
df1 = pd.DataFrame(data=arr1, index=row_lables, columns=col_labels)
df1
reliability | cost | competition | halflife | |
---|---|---|---|---|
Car1 | 0.134302 | 0.625207 | 0.970981 | 0.717605 |
Car2 | 0.713766 | 0.773182 | 0.059689 | 0.450899 |
Car3 | 0.058990 | 0.904301 | 0.431487 | 0.087683 |
Car4 | 0.509891 | 0.501037 | 0.244279 | 0.763135 |
# Accessing a whole column
df1['reliability']
Car1 0.894051 Car2 0.014127 Car3 0.351610 Car4 0.601929 Name: reliability, dtype: float64
#can access as a property, but this is not advisable
#since it can clobber builtin methods and properties
df1.reliability
Car1 0.894051 Car2 0.014127 Car3 0.351610 Car4 0.601929 Name: reliability, dtype: float64
Access rows¶
DataFrameobj.loc['row_label'] also returns a Series. Notice the .loc
df1.loc['Car4']
reliability 0.601929 cost 0.854320 competition 0.391956 halflife 0.759363 Name: Car4, dtype: float64
type(df1.loc['Car3'])
pandas.core.series.Series
Accessing using index number¶
If you don't know the labels, but know the index like in an array, use iloc
and pass the index number.
#get first row, first col
val1 = df1.iloc[0,0]
print(val1)
print(type(val1))
0.894051123737 <class 'numpy.float64'>
#get full first row
val2 = df1.iloc[0,:]
val2
reliability 0.894051 cost 0.849727 competition 0.538400 halflife 0.863986 Name: Car1, dtype: float64
type(val2)
pandas.core.series.Series
Dicing DataFrames¶
Dicing using labels ==> use DataFrameObj.loc[[row_labels],[col_labels]]
#Get cost and competition of cars 2,3
df1.loc[['Car2', 'Car3'], ['cost', 'competition']]
cost | competition | |
---|---|---|
Car2 | 0.935368 | 0.719570 |
Car3 | 0.659950 | 0.605077 |
With index number, dice using
DataFrameObj.iloc[[row_indices], [col_indices]]
df1.iloc[[1,2], [1,2]]
cost | competition | |
---|---|---|
Car2 | 0.935368 | 0.719570 |
Car3 | 0.659950 | 0.605077 |
Conditional selection¶
When running a condition on a DataFrame, you are returned a Bool dataframe.
df1
reliability | cost | competition | halflife | |
---|---|---|---|---|
Car1 | 0.776415 | 0.435083 | 0.236151 | 0.169087 |
Car2 | 0.790403 | 0.987459 | 0.370570 | 0.734146 |
Car3 | 0.884783 | 0.233803 | 0.691639 | 0.725398 |
Car4 | 0.693038 | 0.716824 | 0.766937 | 0.490821 |
# find cars with reliability > 0.85
df1['reliability'] > 0.85
Car1 False Car2 False Car3 True Car4 False Name: reliability, dtype: bool
#to get the car select the data elements using the bool series
df1[df1['reliability'] > 0.85]
reliability | cost | competition | halflife | |
---|---|---|---|---|
Car3 | 0.884783 | 0.233803 | 0.691639 | 0.725398 |
#To get only the car name, which in this case is the index
df1[df1['reliability'] > 0.85].index[0]
'Car3'
Chaining conditions¶
In a Pythonic way, you can chain conditions
df[df condition][selection][selection]
#to get the actual value of reliablity for this car
df1[df1['reliability'] > 0.85]['reliability']
Car3 0.884783 Name: reliability, dtype: float64
# get both reliability and cost
df1[df1['reliability'] > 0.85][['reliability', 'cost']]
reliability | cost | |
---|---|---|
Car3 | 0.884783 | 0.233803 |
Multiple conditions¶
You can select dataframe elements with multiple conditions. Note cannot use Python and
, or
. Instead use &
, |
#select cars that have reliability > 0.7 but competition less than 0.5
df1[(df1['reliability'] > 0.7) & (df1['competition'] < 0.5)]
reliability | cost | competition | halflife | |
---|---|---|---|---|
Car1 | 0.776415 | 0.435083 | 0.236151 | 0.169087 |
Car2 | 0.790403 | 0.987459 | 0.370570 | 0.734146 |
# select cars that have half life > 0.5 or competition < 0.4
df1[(df1['halflife'] > 0.5) | (df1['competition'] < 0.4)]
reliability | cost | competition | halflife | |
---|---|---|---|---|
Car1 | 0.776415 | 0.435083 | 0.236151 | 0.169087 |
Car2 | 0.790403 | 0.987459 | 0.370570 | 0.734146 |
Car3 | 0.884783 | 0.233803 | 0.691639 | 0.725398 |
#add full life column
df1['full_life'] = df1['halflife'] * 2 #similar to array, series broadcast multiplication
df1
reliability | cost | competition | halflife | full_life | |
---|---|---|---|---|---|
Car1 | 0.134302 | 0.625207 | 0.970981 | 0.717605 | 1.435210 |
Car2 | 0.713766 | 0.773182 | 0.059689 | 0.450899 | 0.901799 |
Car3 | 0.058990 | 0.904301 | 0.431487 | 0.087683 | 0.175366 |
Car4 | 0.509891 | 0.501037 | 0.244279 | 0.763135 | 1.526270 |
Dropping rows and columns¶
DataFrameObj.drop(label, axis, inplace=True / False)
Row labels are axis = 0
and columns are axis = 1
df1.drop('full_life', axis=1, inplace=False)
reliability | cost | competition | halflife | |
---|---|---|---|---|
Car1 | 0.134302 | 0.625207 | 0.970981 | 0.717605 |
Car2 | 0.713766 | 0.773182 | 0.059689 | 0.450899 |
Car3 | 0.058990 | 0.904301 | 0.431487 | 0.087683 |
Car4 | 0.509891 | 0.501037 | 0.244279 | 0.763135 |
df1.drop('Car3') #all else is the default
reliability | cost | competition | halflife | full_life | |
---|---|---|---|---|---|
Car1 | 0.134302 | 0.625207 | 0.970981 | 0.717605 | 1.435210 |
Car2 | 0.713766 | 0.773182 | 0.059689 | 0.450899 | 0.901799 |
Car4 | 0.509891 | 0.501037 | 0.244279 | 0.763135 | 1.526270 |
Drop a row based on a condition.
df1.drop(df1[df1['cost'] > 0.65].index, inplace=False)
reliability | cost | competition | halflife | full_life | |
---|---|---|---|---|---|
Car1 | 0.134302 | 0.625207 | 0.970981 | 0.717605 | 1.43521 |
Car4 | 0.509891 | 0.501037 | 0.244279 | 0.763135 | 1.52627 |
#set car names as index for the data frame
car_names = 'altima outback taurus mustang'.split()
car_names
['altima', 'outback', 'taurus', 'mustang']
df1['car_names'] = car_names
df1
reliability | cost | competition | halflife | car_names | |
---|---|---|---|---|---|
Car1 | 0.776415 | 0.435083 | 0.236151 | 0.169087 | altima |
Car2 | 0.790403 | 0.987459 | 0.370570 | 0.734146 | outback |
Car3 | 0.884783 | 0.233803 | 0.691639 | 0.725398 | taurus |
Car4 | 0.693038 | 0.716824 | 0.766937 | 0.490821 | mustang |
df_new_index = df1.set_index(keys= df1['car_names'], inplace=False)
df_new_index
reliability | cost | competition | halflife | car_names | |
---|---|---|---|---|---|
car_names | |||||
altima | 0.776415 | 0.435083 | 0.236151 | 0.169087 | altima |
outback | 0.790403 | 0.987459 | 0.370570 | 0.734146 | outback |
taurus | 0.884783 | 0.233803 | 0.691639 | 0.725398 | taurus |
mustang | 0.693038 | 0.716824 | 0.766937 | 0.490821 | mustang |
Note, the old index is lost.
Rest index¶
#reset df1 index to numerals and convert existing to a column
df1.reset_index()
index | reliability | cost | competition | halflife | car_names | |
---|---|---|---|---|---|---|
0 | Car1 | 0.776415 | 0.435083 | 0.236151 | 0.169087 | altima |
1 | Car2 | 0.790403 | 0.987459 | 0.370570 | 0.734146 | outback |
2 | Car3 | 0.884783 | 0.233803 | 0.691639 | 0.725398 | taurus |
3 | Car4 | 0.693038 | 0.716824 | 0.766937 | 0.490821 | mustang |