Pandas - Unique, IsNull, Time series¶
unique - find unique values¶
Find the unique values in a column of the dataset.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
comp_data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
comp_df = pd.DataFrame(comp_data)
comp_df
|   | Company | Person | Sales |
|---|---|---|---|
| 0 | GOOG | Sam | 200 |
| 1 | GOOG | Charlie | 120 |
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |
| 4 | FB | Carl | 243 |
| 5 | FB | Sarah | 350 |
# find unique company names
comp_df['Company'].unique()
array(['GOOG', 'MSFT', 'FB'], dtype=object)
nunique - find number of unique values¶
More efficient than computing the unique array and taking its length.
comp_df['Company'].nunique()
3
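As a sanity check, the longer route below returns the same count; nunique() simply skips building the intermediate array.
# equivalent, but materializes the full array of unique values first
len(comp_df['Company'].unique())  # returns 3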
value_counts - find unique values and number of occurrences¶
comp_df['Company'].value_counts()
GOOG    2
FB      2
MSFT    2
Name: Company, dtype: int64
apply - batch process column values¶
Calling apply() is similar to calling map() in Python: it applies an operation to every record of a selected column. For instance, to compute the squared sales, do the following:
comp_df['sq_sales'] = comp_df['Sales'].apply(lambda x:x*x)
comp_df
|   | Company | Person | Sales | sq_sales |
|---|---|---|---|---|
| 0 | GOOG | Sam | 200 | 40000 |
| 1 | GOOG | Charlie | 120 | 14400 |
| 2 | MSFT | Amy | 340 | 115600 |
| 3 | MSFT | Vanessa | 124 | 15376 |
| 4 | FB | Carl | 243 | 59049 |
| 5 | FB | Sarah | 350 | 122500 |
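To illustrate the similarity with map(), the same squared values can also be produced with Series.map; this is just an equivalent sketch of the apply() call above.
# element-wise map on a single column gives the same result as apply()
comp_df['Sales'].map(lambda x: x * x)  # same values as comp_df['sq_sales']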
We can also define a function and pass it to the apply() method. Such a function can accept values from one or more columns to calculate a new column.
def cuber(row):
    return row['Sales'] * row['sq_sales']

comp_df['cu_sales'] = comp_df.apply(cuber, axis=1)
# note - the function is passed as an object, not called
# note - axis must be set to 1 (apply row by row across columns) instead of the default 0
comp_df
|   | Company | Person | Sales | sq_sales | cu_sales |
|---|---|---|---|---|---|
| 0 | GOOG | Sam | 200 | 40000 | 8000000 |
| 1 | GOOG | Charlie | 120 | 14400 | 1728000 |
| 2 | MSFT | Amy | 340 | 115600 | 39304000 |
| 3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
| 4 | FB | Carl | 243 | 59049 | 14348907 |
| 5 | FB | Sarah | 350 | 122500 | 42875000 |
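As an aside, the same cu_sales values can be computed with a plain vectorized expression, which is usually faster than a row-wise apply(); this sketch is an alternative, not part of the original workflow.
# vectorized alternative to the row-wise apply above
comp_df['Sales'] * comp_df['sq_sales']  # same values as comp_df['cu_sales']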
sort_values - sorting rows¶
comp_df.sort_values('Sales')
|   | Company | Person | Sales | sq_sales | cu_sales |
|---|---|---|---|---|---|
| 1 | GOOG | Charlie | 120 | 14400 | 1728000 |
| 3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
| 0 | GOOG | Sam | 200 | 40000 | 8000000 |
| 4 | FB | Carl | 243 | 59049 | 14348907 |
| 2 | MSFT | Amy | 340 | 115600 | 39304000 |
| 5 | FB | Sarah | 350 | 122500 | 42875000 |
Note how the index remains attached to the original rows.
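If a fresh 0..n-1 index is preferred after sorting, one option (assuming pandas 1.0 or newer for the ignore_index flag) is shown below; reset_index(drop=True) is the equivalent on older versions.
# discard the original index while sorting (pandas >= 1.0)
comp_df.sort_values('Sales', ignore_index=True)
# equivalent on older pandas versions
comp_df.sort_values('Sales').reset_index(drop=True)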
#sorting along multiple columns
comp_df.sort_values(['Company','Sales'])
|   | Company | Person | Sales | sq_sales | cu_sales |
|---|---|---|---|---|---|
| 4 | FB | Carl | 243 | 59049 | 14348907 |
| 5 | FB | Sarah | 350 | 122500 | 42875000 |
| 1 | GOOG | Charlie | 120 | 14400 | 1728000 |
| 0 | GOOG | Sam | 200 | 40000 | 8000000 |
| 3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
| 2 | MSFT | Amy | 340 | 115600 | 39304000 |
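The sort direction can be controlled per column with the ascending argument, e.g. companies A-Z but sales high to low:
# ascending accepts a list that lines up with the sort columns
comp_df.sort_values(['Company', 'Sales'], ascending=[True, False])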
isnull - finding null values throughout the DataFrame¶
comp_df.isnull()
|   | Company | Person | Sales | sq_sales | cu_sales |
|---|---|---|---|---|---|
| 0 | False | False | False | False | False |
| 1 | False | False | False | False | False |
| 2 | False | False | False | False | False |
| 3 | False | False | False | False | False |
| 4 | False | False | False | False | False |
| 5 | False | False | False | False | False |
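Since the boolean frame above is rarely inspected by eye, a common follow-up is to chain sum() to count nulls per column:
# number of null cells in each column (all zeros for this frame)
comp_df.isnull().sum()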
Working with time series data¶
This section explains how to specify column data types while reading data, and how to define column converters that coerce values into the desired types.
registrant_df = pd.read_csv('./registrant.csv')
registrant_df.head()
|   | Unnamed: 0 | Registration Date | Country | Organization | Current customer? | What would you like to learn? |
|---|---|---|---|---|---|---|
| 0 | 0 | 11/08/2019 06:09 PM EST | Jamaica | The University of the West Indies | NaN | NaN |
| 1 | 1 | 11/08/2019 06:09 PM EST | Japan | iLand6 Co.,Ltd. | no | I am interested ArcGIS. |
| 2 | 2 | 11/08/2019 05:56 PM EST | Canada | Safe Software Inc | yes | data science workflos |
| 3 | 3 | 11/08/2019 05:51 PM EST | Canada | Le Groupe GeoInfo Inc | yes | general information |
| 4 | 4 | 11/08/2019 05:26 PM EST | Canada | Safe Software Inc. | NaN | NaN |
The Registration Date column should be of type datetime and the Current customer? column should be of type bool. However, are they?
registrant_df.dtypes
Unnamed: 0                        int64
Registration Date                object
Country                          object
Organization                     object
Current customer?                object
What would you like to learn?    object
dtype: object
Everything is a generic object. Let us re-read, this time knowing what their data types should be.
# define a function (lambda in this case) which will convert a column to bool depending on the
# value of the cell
convertor_fn = lambda x: x in ['Yes', 'yes', 'YES']
convertor_map = {'Current customer?': convertor_fn}
# re-read data
registrant_df2 = pd.read_csv('./registrant.csv',
parse_dates=['Registration Date'],
converters = convertor_map)
registrant_df2.dtypes
/Users/atma6951/anaconda3/lib/python3.7/site-packages/dateutil/parser/_parser.py:1206: UnknownTimezoneWarning: tzname EST identified but not understood. Pass `tzinfos` argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/Users/atma6951/anaconda3/lib/python3.7/site-packages/dateutil/parser/_parser.py:1206: UnknownTimezoneWarning: tzname EDT identified but not understood. Pass `tzinfos` argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
Unnamed: 0                                int64
Registration Date                datetime64[ns]
Country                                  object
Organization                             object
Current customer?                          bool
What would you like to learn?            object
dtype: object
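The UnknownTimezoneWarning above can be avoided by telling dateutil what EST and EDT mean. The sketch below is one possible approach, assuming the timestamps are US Eastern time and a pandas version that still accepts the date_parser argument; the names eastern, date_parser_fn and registrant_df3 are illustrative.
# map the EST/EDT abbreviations to a concrete timezone for dateutil
from dateutil import parser, tz

eastern = tz.gettz('America/New_York')
date_parser_fn = lambda s: parser.parse(s, tzinfos={'EST': eastern, 'EDT': eastern})

registrant_df3 = pd.read_csv('./registrant.csv',
                             parse_dates=['Registration Date'],
                             date_parser=date_parser_fn,
                             converters=convertor_map)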
Plotting time series¶
Now that Registration Date is a datetime, we can plot the number of registrants over time. But before that, we need to set it as the index.
registrant_df2.set_index('Registration Date', inplace=True)
registrant_df2.drop(axis=1, columns=['Unnamed: 0'], inplace=True) # drop bad column
registrant_df2.head()
| Registration Date | Country | Organization | Current customer? | What would you like to learn? |
|---|---|---|---|---|
| 2019-11-08 18:09:00 | Jamaica | The University of the West Indies | False | NaN |
| 2019-11-08 18:09:00 | Japan | iLand6 Co.,Ltd. | False | I am interested ArcGIS. |
| 2019-11-08 17:56:00 | Canada | Safe Software Inc | True | data science workflos |
| 2019-11-08 17:51:00 | Canada | Le Groupe GeoInfo Inc | True | general information |
| 2019-11-08 17:26:00 | Canada | Safe Software Inc. | False | NaN |
Sort dataframe by time¶
registrant_df2.sort_index(inplace=True)
Add a counter column to the dataframe¶
Note: it is important to add the counter only after sorting; otherwise the running count will not line up with time.
registrant_df2['registration_count'] = range(1, len(registrant_df2)+1) # goes from 1 to 284
plt.figure(figsize=(10,7))
registrant_df2['registration_count'].plot(kind='line')
plt.title('Number of registrants over time');
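As an alternative to the manual counter, a datetime index also supports resampling; the sketch below counts registrations per day and plots the cumulative total, which should trace the same overall curve as the plot above.
# count registrations per day, then accumulate for a running total
daily_counts = registrant_df2.resample('D').size()
daily_counts.cumsum().plot(kind='line', figsize=(10, 7))
plt.title('Cumulative registrations per day');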