Data visualization with Pandas
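Pandas implements some high-level plotting functions using matplotlib. Note: pandas does not route its plots through seaborn. With older seaborn releases (before 0.8), importing seaborn applied seaborn's default style to matplotlib, which is why pandas plots looked better for the same data and commands. With current releases, call sns.set_theme() after the import to get the same effect.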
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
Read some data
In [2]:
df1 = pd.read_csv('/Users/atma6951/Documents/code/pychakras/pychakras/udemy_ml_bootcamp/Python-for-Data-Visualization/Pandas Built-in Data Viz/df1', index_col=0)
df2 = pd.read_csv('/Users/atma6951/Documents/code/pychakras/pychakras/udemy_ml_bootcamp/Python-for-Data-Visualization/Pandas Built-in Data Viz/df2')
In [3]:
df1.head()
Out[3]:
| | A | B | C | D |
|---|---|---|---|---|
| 2000-01-01 | 1.339091 | -0.163643 | -0.646443 | 1.041233 |
| 2000-01-02 | -0.774984 | 0.137034 | -0.882716 | -2.253382 |
| 2000-01-03 | -0.921037 | -0.482943 | -0.417100 | 0.478638 |
| 2000-01-04 | -1.738808 | -0.072973 | 0.056517 | 0.015085 |
| 2000-01-05 | -0.905980 | 1.778576 | 0.381918 | 0.291436 |
In [4]:
df2.head()
Out[4]:
| | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.039762 | 0.218517 | 0.103423 | 0.957904 |
| 1 | 0.937288 | 0.041567 | 0.899125 | 0.977680 |
| 2 | 0.780504 | 0.008948 | 0.557808 | 0.797510 |
| 3 | 0.672717 | 0.247870 | 0.264071 | 0.444358 |
| 4 | 0.053829 | 0.520124 | 0.552264 | 0.190008 |
3 ways of calling plot from a DataFrame
- `df.plot()`: call plot and specify the plot type, the X and Y columns, etc.
- `df.plot.hist()`: call plot in OO fashion. Only specify the X and Y columns, and the color or size columns.
- `df['column'].plot.plotname()`: call plot on a Series.

Types of plot that can be called: area, bar, line, scatter, box, hexbin, kde, etc.
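The three calling styles listed above can be compared side by side. This is a minimal sketch, not the notebook's own code: `demo` is a hypothetical stand-in for `df1` built from random numbers, and the `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1: 100 rows of standard-normal values
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))

# One axes per calling style, so the three results can be compared
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# 1. Generic call: the plot type is chosen with `kind`
demo.plot(kind="hist", bins=20, alpha=0.5, ax=axes[0])

# 2. Accessor call: the plot type is the method name
demo.plot.hist(bins=20, alpha=0.5, ax=axes[1])

# 3. Series call: plot a single column
demo["A"].plot.hist(bins=20, ax=axes[2])
```

The first two calls draw the same chart. The third draws only column `A`, because it is called on a Series rather than on the whole DataFrame.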
Ways of plotting histogram
In [8]:
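# y='A' selects the column to bin; x='A' would instead make A the index and bin the remaining columns
df1.plot(y='A', kind='hist')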
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1142ece80>
In [10]:
df1['A'].plot.hist(bins=30)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1169717f0>
Plotting a histogram of all numeric columns in the dataframe:
In [7]:
df1.hist()
Out[7]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11a872eb8>, <matplotlib.axes._subplots.AxesSubplot object at 0x11abd27f0>], [<matplotlib.axes._subplots.AxesSubplot object at 0x11abf7e80>, <matplotlib.axes._subplots.AxesSubplot object at 0x11ac27550>]], dtype=object)
In practice, you often have many more columns. You can tidy up the grid above by passing a layout and a figsize:
In [8]:
ax_list = df1.hist(bins=25, layout=(2,2), figsize=(7,7))
plt.tight_layout()
Plotting histogram of all columns and sharing axes
The chart above might make more sense if you shared the X as well as Y axes for different columns. This helps in comparing the distribution of values visually.
In [14]:
ax_list = df1.hist(bins=25, sharex=True, sharey=True, layout=(1,4), figsize=(15,4))
In [15]:
ax_list = df1.hist(bins=25, sharex=True, sharey=True, layout=(2,2), figsize=(8,8))
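To see what sharing axes changes, the sketch below compares the axis limits with and without `sharex`/`sharey`. The `demo` frame and its column names are hypothetical, chosen so the two columns have clearly different spreads. The `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical data: two columns with deliberately different spreads
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "narrow": rng.normal(0, 0.5, 500),
    "wide": rng.normal(0, 3.0, 500),
})

# Independent axes: each panel picks its own limits
free = demo.hist(bins=25, layout=(1, 2), figsize=(8, 3))

# Shared axes: every panel uses the same X and Y limits
shared = demo.hist(bins=25, sharex=True, sharey=True, layout=(1, 2), figsize=(8, 3))

free_xlims = [ax.get_xlim() for ax in free.ravel()]
shared_xlims = [ax.get_xlim() for ax in shared.ravel()]
shared_ylims = [ax.get_ylim() for ax in shared.ravel()]
print("independent X limits:", free_xlims)
print("shared X limits:     ", shared_xlims)
```

With shared axes, the narrow distribution shows up as a tall, thin spike next to the wide one, so the difference in spread is visible at a glance.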
In [13]:
plt.style.use('dark_background')
df2.plot.area()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x116b35f98>
Bar chart
Another matplotlib style is fivethirtyeight.
In [14]:
plt.style.use('fivethirtyeight')
df2.plot.bar()
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x116c92cc0>
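`plt.style.use` changes the style for every later plot in the session. A possible alternative, sketched below, is `plt.style.context`, which applies a style to a single block and then restores the previous settings. `demo` is a hypothetical stand-in for `df2`, and the `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2: 10 rows of uniform random values
rng = np.random.default_rng(2)
demo = pd.DataFrame(rng.random((10, 4)), columns=list("abcd"))

# Style names bundled with the installed matplotlib
print(plt.style.available)

default_face = plt.rcParams["axes.facecolor"]

# Apply the style to this plot only; settings are restored when the block exits
with plt.style.context("fivethirtyeight"):
    styled_face = plt.rcParams["axes.facecolor"]
    ax = demo.plot.bar()

restored_face = plt.rcParams["axes.facecolor"]
```

This avoids having to reset the style by hand before the next plot.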
Line plot
This is suited for time series data
In [18]:
# reset the style
plt.style.use('default')
# the index is used as the X axis by default, so x is not passed (recent pandas expects x to be a column label)
# figsize and `lw` (line width) are passed through to matplotlib
df1.plot.line(y='A', figsize=(12,2), lw=1)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x116f61f98>
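For time series data, the line plot works best when the index holds real dates rather than strings. The sketch below uses a hypothetical daily series named `ts` and overlays a 30-day rolling mean, which is an illustrative addition and not part of the notebook. When reading a CSV, passing `parse_dates=True` together with `index_col=0` asks `read_csv` to parse the index as dates. The `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical daily series: a random walk over one year
idx = pd.date_range("2000-01-01", periods=365, freq="D")
rng = np.random.default_rng(3)
ts = pd.DataFrame({"A": rng.standard_normal(365).cumsum()}, index=idx)

# The DatetimeIndex is used as the X axis automatically
ax = ts.plot.line(y="A", figsize=(12, 2), lw=1)

# Overlay a 30-day rolling mean on the same axes
ts["A"].rolling(30).mean().plot(ax=ax, lw=2, label="30-day mean")
ax.legend()
```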
Scatter plot
Use color (`c` with a colormap, `cmap`) or size (`s`) to visualize a third variable in your scatter plot.
In [21]:
df1.plot.scatter(x='A', y='B',c='C', cmap='coolwarm')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1176e09b0>
In [22]:
# you could specify the size as s='c', but the values in column c are small and the points come out tiny.
# scale them by 100 instead, which requires passing the actual Series data and not the column name
df2.plot.scatter(x='a',y='b', s=df2['c']*100)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x117440f60>
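Color and size can also be combined in one call to show a third and a fourth variable at once. This is a minimal sketch under assumed data: `demo` is a hypothetical stand-in for `df2`, and the scale factor of 200 is an arbitrary choice. The `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2: 50 rows of uniform random values
rng = np.random.default_rng(4)
demo = pd.DataFrame(rng.random((50, 4)), columns=list("abcd"))

# Color encodes column c; marker size encodes column d, scaled up to be visible
sizes = demo["d"] * 200
ax = demo.plot.scatter(x="a", y="b", c="c", cmap="coolwarm", s=sizes)
```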