Pandas - Unique, IsNull, Time series¶
unique - find unique values¶
Find the unique values in a column of the dataset.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
comp_data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
comp_df = pd.DataFrame(comp_data)
comp_df
|   | Company | Person | Sales |
|---|---|---|---|
| 0 | GOOG | Sam | 200 |
| 1 | GOOG | Charlie | 120 |
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |
| 4 | FB | Carl | 243 |
| 5 | FB | Sarah | 350 |
# find unique company names
comp_df['Company'].unique()
array(['GOOG', 'MSFT', 'FB'], dtype=object)
nunique - find number of unique values¶
More efficient than computing the unique array and taking its length.
comp_df['Company'].nunique()
3
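As a sanity check, the longer route below returns the same count; nunique() simply skips building the intermediate array.
# equivalent, but materializes the full array of unique values first
len(comp_df['Company'].unique())  # returns 3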
value_counts - find unique values and number of occurrences¶
comp_df['Company'].value_counts()
GOOG    2
FB      2
MSFT    2
Name: Company, dtype: int64
apply - batch process column values¶
Calling apply() is similar to calling map() in Python: it applies an operation to every record of a selected column. For instance, to compute the squared sales, do the following:
comp_df['sq_sales'] = comp_df['Sales'].apply(lambda x:x*x)
comp_df
|   | Company | Person | Sales | sq_sales |
|---|---|---|---|---|
| 0 | GOOG | Sam | 200 | 40000 |
| 1 | GOOG | Charlie | 120 | 14400 |
| 2 | MSFT | Amy | 340 | 115600 |
| 3 | MSFT | Vanessa | 124 | 15376 |
| 4 | FB | Carl | 243 | 59049 |
| 5 | FB | Sarah | 350 | 122500 |
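To illustrate the similarity with map(), the same squared values can also be produced with Series.map; this is just an equivalent sketch of the apply() call above.
# element-wise map on a single column gives the same result as apply()
comp_df['Sales'].map(lambda x: x * x)  # same values as comp_df['sq_sales']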
We can also define a function and pass it to the apply() method. Such a function can accept values from one or more columns to calculate a new column.
def cuber(row):
    return row['Sales'] * row['sq_sales']

comp_df['cu_sales'] = comp_df.apply(cuber, axis=1)
# note - the function is passed as an object, not called
# note - axis must be set to 1 (apply row by row across columns) instead of the default 0
comp_df
|   | Company | Person | Sales | sq_sales | cu_sales |
|---|---|---|---|---|---|
| 0 | GOOG | Sam | 200 | 40000 | 8000000 |
| 1 | GOOG | Charlie | 120 | 14400 | 1728000 |
| 2 | MSFT | Amy | 340 | 115600 | 39304000 |
| 3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
| 4 | FB | Carl | 243 | 59049 | 14348907 |
| 5 | FB | Sarah | 350 | 122500 | 42875000 |
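As an aside, the same cu_sales values can be computed with a plain vectorized expression, which is usually faster than a row-wise apply(); this sketch is an alternative, not part of the original workflow.
# vectorized alternative to the row-wise apply above
comp_df['Sales'] * comp_df['sq_sales']  # same values as comp_df['cu_sales']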
sort_values - sorting rows¶
comp_df.sort_values('Sales')
|   | Company | Person | Sales | sq_sales | cu_sales |
|---|---|---|---|---|---|
| 1 | GOOG | Charlie | 120 | 14400 | 1728000 |
| 3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
| 0 | GOOG | Sam | 200 | 40000 | 8000000 |
| 4 | FB | Carl | 243 | 59049 | 14348907 |
| 2 | MSFT | Amy | 340 | 115600 | 39304000 |
| 5 | FB | Sarah | 350 | 122500 | 42875000 |
Note how the index remains attached to the original rows.
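If a fresh 0..n-1 index is preferred after sorting, one option (assuming pandas 1.0 or newer for the ignore_index flag) is shown below; reset_index(drop=True) is the equivalent on older versions.
# discard the original index while sorting (pandas >= 1.0)
comp_df.sort_values('Sales', ignore_index=True)
# equivalent on older pandas versions
comp_df.sort_values('Sales').reset_index(drop=True)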
#sorting along multiple columns
comp_df.sort_values(['Company','Sales'])
|   | Company | Person | Sales | sq_sales | cu_sales |
|---|---|---|---|---|---|
| 4 | FB | Carl | 243 | 59049 | 14348907 |
| 5 | FB | Sarah | 350 | 122500 | 42875000 |
| 1 | GOOG | Charlie | 120 | 14400 | 1728000 |
| 0 | GOOG | Sam | 200 | 40000 | 8000000 |
| 3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
| 2 | MSFT | Amy | 340 | 115600 | 39304000 |
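The sort direction can be controlled per column with the ascending argument, e.g. companies A-Z but sales high to low:
# ascending accepts a list that lines up with the sort columns
comp_df.sort_values(['Company', 'Sales'], ascending=[True, False])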
isnull - finding null values throughout the DataFrame¶
comp_df.isnull()
|   | Company | Person | Sales | sq_sales | cu_sales |
|---|---|---|---|---|---|
| 0 | False | False | False | False | False |
| 1 | False | False | False | False | False |
| 2 | False | False | False | False | False |
| 3 | False | False | False | False | False |
| 4 | False | False | False | False | False |
| 5 | False | False | False | False | False |
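Since the boolean frame above is rarely inspected by eye, a common follow-up is to chain sum() to count nulls per column:
# number of null cells in each column (all zeros for this frame)
comp_df.isnull().sum()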
Working with time series data¶
This section explains how to specify column data types while reading data, and how to define column converters that coerce values into the desired types.
registrant_df = pd.read_csv('./registrant.csv')
registrant_df.head()
|   | Unnamed: 0 | Registration Date | Country | Organization | Current customer? | What would you like to learn? |
|---|---|---|---|---|---|---|
| 0 | 0 | 11/08/2019 06:09 PM EST | Jamaica | The University of the West Indies | NaN | NaN |
| 1 | 1 | 11/08/2019 06:09 PM EST | Japan | iLand6 Co.,Ltd. | no | I am interested ArcGIS. |
| 2 | 2 | 11/08/2019 05:56 PM EST | Canada | Safe Software Inc | yes | data science workflos |
| 3 | 3 | 11/08/2019 05:51 PM EST | Canada | Le Groupe GeoInfo Inc | yes | general information |
| 4 | 4 | 11/08/2019 05:26 PM EST | Canada | Safe Software Inc. | NaN | NaN |
The Registration Date column should be of type datetime and the Current customer? column should be of type bool. However, are they?
registrant_df.dtypes
Unnamed: 0                        int64
Registration Date                object
Country                          object
Organization                     object
Current customer?                object
What would you like to learn?    object
dtype: object
Everything is a generic object. Let us re-read, this time knowing what their data types should be.
# define a function (lambda in this case) which will convert a column to bool depending on the
# value of the cell
convertor_fn = lambda x: x in ['Yes', 'yes', 'YES']
convertor_map = {'Current customer?': convertor_fn}
# re-read data
registrant_df2 = pd.read_csv('./registrant.csv',
parse_dates=['Registration Date'],
converters = convertor_map)
registrant_df2.dtypes
/Users/atma6951/anaconda3/lib/python3.7/site-packages/dateutil/parser/_parser.py:1206: UnknownTimezoneWarning: tzname EST identified but not understood. Pass `tzinfos` argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/Users/atma6951/anaconda3/lib/python3.7/site-packages/dateutil/parser/_parser.py:1206: UnknownTimezoneWarning: tzname EDT identified but not understood. Pass `tzinfos` argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
Unnamed: 0                                int64
Registration Date                datetime64[ns]
Country                                  object
Organization                             object
Current customer?                          bool
What would you like to learn?            object
dtype: object
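The UnknownTimezoneWarning above can be avoided by telling dateutil what EST and EDT mean. The sketch below is one possible approach, assuming the timestamps are US Eastern time and a pandas version that still accepts the date_parser argument; the names eastern, date_parser_fn and registrant_df3 are illustrative.
# map the EST/EDT abbreviations to a concrete timezone for dateutil
from dateutil import parser, tz

eastern = tz.gettz('America/New_York')
date_parser_fn = lambda s: parser.parse(s, tzinfos={'EST': eastern, 'EDT': eastern})

registrant_df3 = pd.read_csv('./registrant.csv',
                             parse_dates=['Registration Date'],
                             date_parser=date_parser_fn,
                             converters=convertor_map)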
Plotting time series¶
Now that Registration Date is a datetime, we can plot the number of registrants over time. But before that, we need to set it as the index.
registrant_df2.set_index('Registration Date', inplace=True)
registrant_df2.drop(axis=1, columns=['Unnamed: 0'], inplace=True) # drop bad column
registrant_df2.head()
| Registration Date | Country | Organization | Current customer? | What would you like to learn? |
|---|---|---|---|---|
| 2019-11-08 18:09:00 | Jamaica | The University of the West Indies | False | NaN |
| 2019-11-08 18:09:00 | Japan | iLand6 Co.,Ltd. | False | I am interested ArcGIS. |
| 2019-11-08 17:56:00 | Canada | Safe Software Inc | True | data science workflos |
| 2019-11-08 17:51:00 | Canada | Le Groupe GeoInfo Inc | True | general information |
| 2019-11-08 17:26:00 | Canada | Safe Software Inc. | False | NaN |
Sort dataframe by time¶
registrant_df2.sort_index(inplace=True)
Add a counter column to the dataframe¶
Note: it is important to add the counter only after sorting; otherwise the running count will not line up with time.
registrant_df2['registration_count'] = range(1, len(registrant_df2)+1) # goes from 1 to 284
plt.figure(figsize=(10,7))
registrant_df2['registration_count'].plot(kind='line')
plt.title('Number of registrants over time');
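As an alternative to the manual counter, a datetime index also supports resampling; the sketch below counts registrations per day and plots the cumulative total, which should trace the same overall curve as the plot above.
# count registrations per day, then accumulate for a running total
daily_counts = registrant_df2.resample('D').size()
daily_counts.cumsum().plot(kind='line', figsize=(10, 7))
plt.title('Cumulative registrations per day');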