4
Data Manipulation and Analysis with pandas
In this chapter, you will learn about the Python pandas library built upon NumPy, which provides data manipulation and analysis methods for structured data frames. e name pandas is derived from panel data, an econometrics term for multidimensional structured datasets, according to the Wikipedia page on pandas.
e pandas library contains two fundamental data structures to represent and
manipulate structured rectangular datasets with a variety of indexing options: Series and DataFrames. Both use the index data structure.
Most operations in the processing of financial data in Python are based upon DataFrames. A DataFrame is like an Excel worksheet a two-dimensional table that may contain multiple time series stored in columns. erefore, we recommend you execute all the examples in this chapter yourself in your environment to get practice with the syntax and to better know what is possible.
74 Data Manipulation and Analysis with pandas
In this chapter, we are going to cover the following topics:
Introducing pandas Series, pandas DataFrames, and pandas Indexes Learning essential operations on pandas DataFrames
Exploring file operations with pandas DataFrames
Technical requirements
e Python code used in this chapter is available in the Chapter04/pandas.ipynb notebook in the book’s code repository.
Introducing pandas Series, pandas DataFrames, and pandas Indexes
pandas Series, pandas DataFrames, and pandas Indexes are the fundamental pandas data structures.
pandas.Series
e pandas.Series data structure represents a one-dimensional series of homogenous values (integer values, string values, double values, and so on). Series are a type of list and can contain only a single list with an index. A Data Frame, on the other hand, is a collection of one or more series.
Let’s create a pandas.Series data structure:
import pandas as pd
ser1 = pd.Series(range(1, 6)); ser1
at series contains the index in the first column, and in the second column, the index’s corresponding values:
0 1
1 2
2 3
3 4
Introducing pandas Series, pandas DataFrames, and pandas Indexes 75
We can specify custom index names by specifying the index parameter:
ser2 = pd.Series(range(1, 6),
index=['a', 'b', 'c', 'd', 'e']); ser2
e output will look like the following:
a 1
b 2
c 3
d 4
e 5 dtype: int64
We can also create a series by specifying the index -> value mapping via a dictionary:
ser3 = pd.Series({ 'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 4.0, 'e': 5.0 });
ser3
e output is as follows:
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
dtype: float64
e pandas.Series.index attribute lets us access the index:
ser3.index
e index is of type pandas.Index:
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
e values of the series can be accessed using the pandas.Series.values attribute:
ser3.values
76 Data Manipulation and Analysis with pandas
e values are as follows:
array([ 1., 2., 3., 4., 5.])
We can assign the series a name by modifying the pandas.Series.name attribute:
ser3.name = 'Alphanumeric'; ser3
e output is as follows:
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
Name: Alphanumeric, dtype: float64
e preceding examples demonstrated numerous ways how to construct a pandas Series. Let’s learn about DataFrames, a data structure that may contain multiple Series.
pandas.DataFrame
e pandas.DataFrame data structure is a collection of multiple pandas.Series objects of possibly different types indexed by the same common Index object.
e majority of all statistical time series operations are performed on DataFrames and pandas.DataFrame is optimized for parallel super-fast processing of DataFrames, much faster than if the processing was done on separate series.
We can create a DataFrame from a dictionary, where the key is the column name and the value of that key contains the data for the corresponding series/column:
df1 = pd.DataFrame({'A': range(1,5,1),
'B': range(10,50,10),
'C': range(100, 500, 100)}); df1
e output is as follows:
A B C
Introducing pandas Series, pandas DataFrames, and pandas Indexes 77
2 3 30 300 3 4 40 400
We can also pass the index= parameter here to label the indices:
df2 = pd.DataFrame({'A': range(1,5,1),
'B': range(10,50,10),
'C': range(100, 500, 100)}, index=['a', 'b', 'c', 'd']); df2
is constructs the following DataFrame:
A B C
a 1 10 100 b 2 20 200 c 3 30 300 d 4 40 400
e pandas.DataFrame.columns attribute returns the names of the different columns:
df2.columns
e result is an Index object:
Index(['A', 'B', 'C'], dtype='object')
e indices can be accessed from the pandas.DataFrame.index attribute:
df2.index
at gives us this:
Index(['a', 'b', 'c', 'd'], dtype='object')
e DataFrame also contains the pandas.DataFrame.values attribute, which
returns the values contained in the columns:
df2.values
78 Data Manipulation and Analysis with pandas
e result is the following 2D array:
array([[ 1, 10, 100], [ 2, 20, 200], [ 3, 30, 300], [ 4, 40, 400]])
We can add a new column to the DataFrame with specified values and the same index with the following:
df2['D'] = range(1000,5000,1000); df2
e updated DataFrame is as follows:
A B C D
a 1 10 100 1000 b 2 20 200 2000 c 3 30 300 3000 d 4 40 400 4000
We can assign names to the DataFrame’s index and columns.
We can name the index by modifying the pandas.DataFrame.index.name attribute:
df2.index.name = 'lowercase'; df2
And that yields the following updated DataFrame:
A B C D lowercase
a 1 10 100 1000 b 2 20 200 2000 c 3 30 300 3000 d 4 40 400 4000
e columns can be renamed using the pandas.DataFrame.columns.name attribute:
df2.columns.name = 'uppercase'; df2
Introducing pandas Series, pandas DataFrames, and pandas Indexes 79
e new DataFrame is as follows:
uppercase A B C D lowercase
a 1 10 100 1000 b 2 20 200 2000 c 3 30 300 3000 d 4 40 400 4000
e preceding examples demonstrated how a DataFrame can be constructed. pandas.Index
Both the pandas.Series and pandas.DataFrame data structures utilize the pandas.Index data structure.
ere are many special types of Index objects:
Int64Index: Int64Index contains integer index values.
MultiIndex: MultiIndex contains indices that are tuples used in hierarchical indexing, which we will explore in this chapter.
DatetimeIndex: DatetimeIndex, which we have seen before, contains datetime index values for time series datasets.
We can create a pandas.Index object by doing the following:
ind2 = pd.Index(list(range(5))); ind2
e result is this:
Int64Index([0, 1, 2, 3, 4], dtype='int64')
Note
Index objects are immutable and thus cannot be modified in place.
Let’s see what happens if we try to modify an element in an Index object:
80 Data Manipulation and Analysis with pandas
We get the following output:
--------------------------------------------------------------- ------------
TypeError Traceback (most recent call last)
----> 1 ind2[0] = -1
...
TypeError: Index does not support mutable operations
Python warns us that we cannot manually modify the index object.
We have now learned how to construct series and DataFrames. Let’s explore the essential operations done on DataFrames.
Learning essential pandas.DataFrame operations
is section describes the essential operations done on DataFrames. Knowing they exist and how to use them will save you an enormous amount of time.
Indexing, selection, and filtering of DataFrames
pandas data structures are indexed by special Index objects (while numpy.ndarrays and Python list objects are only indexable by integers). e steps for this lesson are as follows:
- Let’s inspect the contents of the df2 DataFrame created earlier in the chapter: df2
e output is as follows:
uppercase A B C D lowercase
a 1 10 100 1000 b 2 20 200 2000 c 3 30 300 3000 d 4 40 400 4000
Learning essential pandas.DataFrame operations 81
- We can select the Series of values in column B by performing the following operation:
df2['B']
is yields the following Series:
lowercase
a 10
b 20
c 30
d 40
Name: B, dtype: int64
- We can select multiple columns by passing a list of column names (somewhat similar to what we saw with numpy.ndarrays):
df2[['A', 'C']]
is yields the following DataFrame with two columns:
uppercase A C
lowercase
a 1 100
b 2 200
c 3 300
d 4 400
- We can use Boolean selection with DataFrames by doing the following: df2[(df2['D'] > 1000) & (df2['D'] <= 3000)]
is selects the following rows, which satisfy the provided condition:
uppercase A B C D lowercase
b 2 20 200 2000
c 3 30 300 3000
- e pandas.DataFrame.loc[...] attribute lets us index rows instead of columns. e following selects the two rows c and d:
df2.loc[['c', 'd']]
82 Data Manipulation and Analysis with pandas
is yields the following subset DataFrame:
uppercase A B C D lowercase
c 3 30 300 3000 d 4 40 400 4000
- pandas DataFrames still support standard integer indexing through the pandas. DataFrame.iloc[...] attribute. We can select the first row by doing this:
df2.iloc[[0]]
is selects the following single-row DataFrame:
uppercase A B C D lowercase
a 1 10 100 1000
We can modify the DataFrame with an operation like this:
df2[df2['D'] == 2000] = 0; df2
is updates the DataFrame to this new DataFrame:
uppercase A B C D lowercase
a 1 10 100 1000 b 0 0 0 0
c 3 30 300 3000 d 4 40 400 4000
In this section, we have learned how to index, select, and filter DataFrames. In the next section, we will learn how to drop rows and columns.
Dropping rows and columns from a DataFrame
Dropping rows and columns from a DataFrame is a critical operation it not only helps save the computer’s memory but also ensures that the DataFrame contains only logically needed information. e steps are as follows:
- Let’s display the current DataFrame:
df2
Learning essential pandas.DataFrame operations 83
is DataFrame contains the following:
uppercase A B C D lowercase
a 1 10 100 1000 b 0 0 0 0
c 3 30 300 3000 d 4 40 400 4000
- To drop the row at index b, we use the pandas.DataFrame.drop(...) method: df2.drop('b')
is yields a new DataFrame without the row at index b:
uppercase A B C D
lowercase
a 1 10 100 1000
c 3 30 300 3000
d 4 40 400 4000
Let’s check whether the original DataFrame was changed:
df2
e output shows that it was not, that is, pandas.DataFrame.drop(...) is not in place by default:
uppercase A B C D lowercase
a 1 10 100 1000
b 0 0 0 0
c 3 30 300 3000
d 4 40 400 4000
- To modify the original DataFrame, we use the inplace= parameter:
df2.drop('b', inplace=True); df2
e new in-place modified DataFrame is as follows:
uppercase A B C D lowercase
84 Data Manipulation and Analysis with pandas
a 1 10 100 1000 c 3 30 300 3000 d 4 40 400 4000
- We can drop multiple rows as well: df2.drop(['a', 'd'])
is returns the following new DataFrame:
uppercase A B C D lowercase
c 3 30 300 3000
- To drop columns instead of rows, we specify the additional axis= parameter: df2.drop(['A', 'B'], axis=1)
is gives us this new DataFrame with two dropped columns:
uppercase C D
lowercase
a 100 1000
c 300 3000
d 400 4000
We have learned how to drop rows and columns in this section. In the next section, we will learn how to sort values and rand them.
Sorting values and ranking the values' order within a DataFrame
First, let’s create a DataFrame with integer row indices, integer column names, and random values:
import numpy as np
df = pd.DataFrame(np.random.randn(5,5),
index=np.random.randint(0, 100, size=5), columns=np.random.randint(0,100,size=5)); df
Learning essential pandas.DataFrame operations 85
e DataFrame contains the following data:
87 79 74 3 61
7 0.355482 -0.246812 -1.147618 -0.293973 -0.560168 52 1.748274 0.304760 -1.346894 -0.548461 0.457927 80 -0.043787 -0.680384 1.918261 1.080733 1.346146 29 0.237049 0.020492 1.212589 -0.462218 1.284134 0 -0.153209 0.995779 0.100585 -0.350576 0.776116
pandas.DataFrame.sort_index(...) sorts the DataFrame by index values:
df.sort_index()
e result is as follows:
87 79 74 3 61
0 -0.153209 0.995779 0.100585 -0.350576 0.776116 7 0.355482 -0.246812 -1.147618 -0.293973 -0.560168 29 0.237049 0.020492 1.212589 -0.462218 1.284134 52 1.748274 0.304760 -1.346894 -0.548461 0.457927 80 -0.043787 -0.680384 1.918261 1.080733 1.346146
We can also sort by column name values by specifying the axis parameter:
df.sort_index(axis=1)
is yields the following DataFrame with the columns arranged in order:
3 61 74 79 87
7 -0.293973 -0.560168 -1.147618 -0.246812 0.355482 52 -0.548461 0.457927 -1.346894 0.304760 1.748274 80 1.080733 1.346146 1.918261 -0.680384 -0.043787 29 -0.462218 1.284134 1.212589 0.020492 0.237049 0 -0.350576 0.776116 0.100585 0.995779 -0.153209
To sort the values in the DataFrame, we use the pandas.DataFrame.sort_ values(...) method, which takes a by= parameter specifying which column(s) to sort by:
df.sort_values(by=df.columns[0])
86 Data Manipulation and Analysis with pandas
is yields the following DataFrame sorted by the values in the first column:
87 79 74 3 61
0 -0.153209 0.995779 0.100585 -0.350576 0.776116 80 -0.043787 -0.680384 1.918261 1.080733 1.346146 29 0.237049 0.020492 1.212589 -0.462218 1.284134 7 0.355482 -0.246812 -1.147618 -0.293973 -0.560168 52 1.748274 0.304760 -1.346894 -0.548461 0.457927
e pandas.DataFrame.rank(...) method yields a DataFrame containing the rank/order of values in each column:
df.rank()
e output contains the rank (in ascending order) of values:
87 79 74 3 61 7 4.0 2.0 2.0 4.0 1.0 52 5.0 4.0 1.0 1.0 2.0 80 2.0 1.0 5.0 5.0 5.0 29 3.0 3.0 4.0 2.0 4.0 0 1.0 5.0 3.0 3.0 3.0
With this lesson completed, in the next section we will perform arithmetic operations on DataFrames.
Arithmetic operations on DataFrames
First, let’s create two DataFrames for our examples:
df1 = pd.DataFrame(np.random.randn(3,2),
index=['A', 'C', 'E'],
columns=['colA', 'colB']); df1
e df1 DataFrame contains the following:
colA colB
A 0.519105 -0.127284 C -0.840984 -0.495306 E -0.137020 0.987424
Learning essential pandas.DataFrame operations 87
Now we create the df2 DataFrame:
df2 = pd.DataFrame(np.random.randn(4,3),
index=['A', 'B', 'C', 'D'],
columns=['colA', 'colB', 'colC']); df2
is contains the following:
colA colB colC
A -0.718550 1.938035 0.220391 B -0.475095 0.238654 0.405642 C 0.299659 0.691165 -1.905837 D 0.282044 -2.287640 -0.551474
We can add the two DataFrames together. Note that they have different index values as
well as different columns:
df1 + df2
e output is a summation of elements if the index and column exists in both
DataFrames, otherwise it is NaN:
colA colB colC A -0.199445 1.810751 NaN B NaN NaN NaN C -0.541325 0.195859 NaN D NaN NaN NaN E NaN NaN NaN
We can use the pandas.DataFrame.add(...) method with fill_value= to a value to be used instead of NaN (in this case 0):
df1.add(df2, fill_value=0)
e output is as follows:
colA colB colC
A -0.199445 1.810751 0.220391 B -0.475095 0.238654 0.405642
88 Data Manipulation and Analysis with pandas
C -0.541325 0.195859 -1.905837 D 0.282044 -2.287640 -0.551474 E -0.137020 0.987424 NaN
We can perform arithmetic operations between DataFrames and Series as well:
df1 - df2[['colB']]
e output of this operation is the following (since the right-hand-side only had colB):
colA colB
A NaN -2.065319 B NaN NaN
C NaN -1.186471 D NaN NaN
E NaN NaN
Let’s now learn how to merge and combine multiple DataFrames into a single Dataframe.
Merging and combining multiple DataFrames into a single DataFrame
Let’s start by creating two DataFrames, df1 and df2:
df1.index.name = 'Index'; df1.columns.name = 'Columns'; df1 e df1 DataFrame has the following data:
Columns colA colB Index
A 0.519105 -0.127284 C -0.840984 -0.495306 E -0.137020 0.987424
Now we create df2:
df2.index.name = 'Index'; df2.columns.name = 'Columns'; df2 e df2 DataFrame has the following data:
Learning essential pandas.DataFrame operations 89
A -0.718550 1.938035 0.220391 B -0.475095 0.238654 0.405642 C 0.299659 0.691165 -1.905837 D 0.282044 -2.287640 -0.551474
e pandas.merge(...) method joins/merges two DataFrames. e left_index= and right_index= parameters indicate that the merge should be performed on Index values in both DataFrames:
pd.merge(df1, df2, left_index=True, right_index=True)
at yields the following merged DataFrame. e _x and _y suffixes are added to differentiate between le and right DataFrame columns with the same name:
Columns colA_x colB_x colA_y colB_y colC Index
A 0.519105 -0.127284 -0.718550 1.938035 0.220391 C -0.840984 -0.495306 0.299659 0.691165 -1.905837
We can specify custom suffixes with the suffixes= parameter:
pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2'))
e result is the following DataFrame with the suffixes we provided:
Columns colA_1 colB_1 colA_2 colB_2 colC Index
A 0.519105 -0.127284 -0.718550 1.938035 0.220391 C -0.840984 -0.495306 0.299659 0.691165 -1.905837
We can specify the behavior of the join (outer, inner, le , or right join) using the how= parameter:
pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2'), how='outer')
is yields the following DataFrame with NaNs for missing values:
Columns colA_1 colB_1 colA_2 colB_2 colC Index
A 0.519105 -0.127284 -0.718550 1.938035 0.220391
90 Data Manipulation and Analysis with pandas
B NaN NaN -0.475095 0.238654 0.405642 C -0.840984 -0.495306 0.299659 0.691165 -1.905837 D NaN NaN 0.282044 -2.287640 -0.551474 E -0.137020 0.987424 NaN NaN NaN
pandas DataFrames themselves have a pandas.DataFrame.merge(...) method that behaves the same way:
df1.merge(df2, left_index=True, right_index=True, suffixes=('_1', '_2'), how='outer')
is yields the following:
Columns colA_1 colB_1 colA_2 colB_2 colC Index
A 0.519105 -0.127284 -0.718550 1.938035 0.220391 B NaN NaN -0.475095 0.238654 0.405642 C -0.840984 -0.495306 0.299659 0.691165 -1.905837 D NaN NaN 0.282044 -2.287640 -0.551474 E -0.137020 0.987424 NaN NaN NaN
Another alternative is the pandas.DataFrame.join(...) method:
df1.join(df2, lsuffix='_1', rsuffix='_2')
And the output of the join (le join by default) is as follows:
Columns colA_1 colB_1 colA_2 colB_2 colC Index
A 0.519105 -0.127284 -0.718550 1.938035 0.220391 C -0.840984 -0.495306 0.299659 0.691165 -1.905837 E -0.137020 0.987424 NaN NaN NaN
e pandas.concat(...) method combines DataFrames by concatenating rows together:
pd.concat([df1, df2])
is yields the following concatenated DataFrame with NaNs for missing values:
colA colB colC Index
Learning essential pandas.DataFrame operations 91
A 0.519105 -0.127284 NaN
C -0.840984 -0.495306 NaN
E -0.137020 0.987424 NaN
A -0.718550 1.938035 0.220391 B -0.475095 0.238654 0.405642 C 0.299659 0.691165 -1.905837 D 0.282044 -2.287640 -0.551474
We can concatenate across columns by specifying the axis= parameter:
pd.concat([df1, df2], axis=1)
is yields the following DataFrame with additional columns from df2:
Columns colA colB colA colB colC
A 0.519105 -0.127284 -0.718550 1.938035 0.220391 B NaN NaN -0.475095 0.238654 0.405642 C -0.840984 -0.495306 0.299659 0.691165 -1.905837 D NaN NaN 0.282044 -2.287640 -0.551474 E -0.137020 0.987424 NaN NaN NaN
We will now look at hierarchical indexing.
Hierarchical indexing
So far, we have been dealing with Index objects that were a simple single value. Hierarchical indexing uses MultiIndex objects, which are tuples of multiple values per Index. is lets us create sub-DataFrames inside a single DataFrame.
Let’s create a MultiIndex DataFrame:
df = pd.DataFrame(np.random.randn(10, 2), index=[list('aaabbbccdd'),
[1, 2, 3, 1, 2, 3, 1, 2, 1, 2]], columns=['A', 'B']);
df
is is the layout of the MultiIndex DataFrame that uses hierarchical indexing:
92 Data Manipulation and Analysis with pandas
2 -0.409463 -1.103412 3 0.812444 -1.950786 b 1 -1.549981 0.947575 2 0.344725 -0.709320 3 1.384979 -0.716733 c 1 -0.319983 0.887631 2 -1.763973 1.601361
d 1 0.171177 -1.285323 2 -0.143279 0.020981
We can assign names to the MultiIndex object with the pandas.MultiIndex. names attribute it requires a list of names with the same dimension as the dimensions of the MultiIndex DataFrame (in this case, two elements):
df.index.names = ['alpha', 'numeric']; df is yields the following:
A B
alpha numeric
a 1 0.289379 -0.157919 2 -0.409463 -1.103412 3 0.812444 -1.950786 ...
e pandas.DataFrame.reset_index(...) method removes all indexing levels from a MultiIndex DataFrame by default, but can be used to remove one or more levels:
df.reset_index()
is leads to the following integer indexed DataFrame and the MultiIndex values are added as columns in this DataFrame:
alpha numeric A B
0 a 1 0.289379 -0.157919 1 a 2 -0.409463 -1.103412
2 a 3 0.812444 -1.950786 ...
Learning essential pandas.DataFrame operations 93
e pandas.DataFrame.unstack(...) method has similar behavior and pivots the inner level of indexing and converts them to columns:
df.unstack()
Let’s inspect the new DataFrame where the innermost indexing level [1, 2, 3] becomes columns:
A B
numeric 1 2 3 1 2 3
alpha
a 0.289379 -0.409463 0.812444 -0.157919 -1.103412 -1.950786
b -1.549981 0.344725 1.384979 0.947575 -0.709320 -0.716733
c -0.319983 -1.763973 NaN 0.887631 1.601361 NaN
d 0.171177 -0.143279 NaN
-1.285323 0.020981 NaN
e pandas.DataFrame.stack(...) method does the opposite of unstack(...):
df.stack()
e output DataFrame is the original DataFrame with hierarchical indexing:
alpha numeric
a 1 A 0.289379 B -0.157919 2 A -0.409463 B -1.103412 3 A 0.812444 B -1.950786 ...
94 Data Manipulation and Analysis with pandas
Let’s examine the structure of the MultiIndex DataFrame. Note that we first call pandas.DataFrame.stack(...) to convert the columns [A, B] into a third level of indexing in the MultiIndex DataFrame:
df.stack().index
is gives us a MultiIndex object with three levels of indexing:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3], ['A', 'B']],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['alpha', 'numeric', None]) Now we will learn how to group operations in DataFrames.
Grouping operations in DataFrames
Grouping operations in pandas generally follow the split-apply-combine process of operations:
- First, the data is split into groups based on one or more keys.
- en we apply necessary functions to these groups to compute the desired results.
- Finally, we combine them to build the transformed dataset.
us, grouping a single indexed DataFrame builds a hierarchical DataFrame. e steps are as follows:
- Let’s use the pandas.DataFrame.reset_index(…) method to remove all hierarchical indexing from our previous df DataFrame:
df = df.reset_index(); df
is returns the following DataFrame with integer indexing:
alpha numeric A B
0 a 1 -0.807285 0.170242 1 a 2 0.704596 1.568901 2 a 3 -1.417366 0.573896 3 b 1 1.110121 0.366712 ...
Learning essential pandas.DataFrame operations 95
- Let’s use the pandas.DataFrame.groupby(...) method to group the A and B columns by the alpha column:
grouped = df[['A','B']].groupby(df['alpha']); grouped
is yields the following DataFrameGroupBy object, which we can subsequently operate on:
<pandas.core.groupby.DataFrameGroupBy object at 0x7fd21f24cc18>
- We can use the DataFrameGroupBy.describe(...) method to collect summary descriptive statistics:
grouped.describe()
is yields the following output where statistics for A and B are generated but grouped by the alpha column:
A B
alpha
a count 3.000000 3.000000 mean -0.506685 0.771013
std 1.092452 0.719863
min -1.417366 0.170242
25% -1.112325 0.372069
50% -0.807285 0.573896
75% -0.051344 1.071398
max 0.704596 1.568901
...
- We can apply the pandas.DataFrame.unstack(...) method using the DataFrameGroupBy.apply(...) method, which accepts different functions and applies them to each group of the grouped object:
grouped.apply(pd.DataFrame.unstack)
is generates the following hierarchical DataFrame:
alpha
a A 0 -0.807285
1 0.704596
2 -1.417366
96 Data Manipulation and Analysis with pandas
B 0 0.170242 1 1.568901 2 0.573896 ...
dtype: float64
- ere also exists the DataFrameGroupBy.agg(...) method, which accepts functions and aggregates each column for each group using that method. e next example aggregates using the mean method:
grouped[['A', 'B']].agg('mean')
e output contains the mean for columns A and B grouped by values in alpha:
A B
alpha
a -0.506685 0.771013
b 0.670435 0.868550
c 0.455688 -0.497468
d -0.786246 0.107246
- A similar method is the DataFrameGroupBy.transform(...) method, with the only difference being that transform works on one column at a time and returns a sequence of values of the same length as the series, while apply can return any
type of result:
from scipy import stats
grouped[['A', 'B']].transform(stats.zscore)
is generates the Z score for columns A and B, which we explained in Chapter 2, Exploratory Data Analysis:
A B
0 -0.337002 -1.022126
1 1.357964 1.357493
2 -1.020962 -0.335367
3 0.610613 -0.567813
4 -1.410007 1.405598
5 0.799394 -0.837785
6 -1.000000 1.000000
Learning essential pandas.DataFrame operations 97
8 -1.000000 -1.000000 9 1.000000 1.000000
We will now learn how to transform values in DataFrames’ axis indices.
Transforming values in DataFrames' axis indices
Let’s first reinspect the df2 DataFrame that we will be using in these examples:
df2
is contains the following data:
Columns colA colB colC Index
A -2.071652 0.742857 0.632307 B 0.113046 -0.384360 0.414585 C 0.690674 1.511816 2.220732 D 0.184174 -1.069291 -0.994885
We can rename the Index labels using the pandas.DataFrame.index attribute as we saw before:
df2.index = ['Alpha', 'Beta', 'Gamma', 'Delta']; df2
is generates the following transformed DataFrame:
Columns colA colB colC Alpha -2.071652 0.742857 0.632307 Beta 0.113046 -0.384360 0.414585 Gamma 0.690674 1.511816 2.220732 Delta 0.184174 -1.069291 -0.994885
e pandas.Index.map(...) method applies functions to transform the Index. In the following example, the map function takes the first three characters of the name
and sets that as the new name:
df2.index = df2.index.map(lambda x : x[:3]); df2
98 Data Manipulation and Analysis with pandas
e output is as follows:
Columns colA colB colC
Alp -2.071652 0.742857 0.632307 Bet 0.113046 -0.384360 0.414585 Gam 0.690674 1.511816 2.220732 Del 0.184174 -1.069291 -0.994885
e pandas.DataFrame.rename(...) method lets us transform both Index names and column names and accepts a dictionary mapping from the old name to the new name:
df2.rename(index={'Alp': 0, 'Bet': 1, 'Gam': 2, 'Del': 3}, columns={'colA': 'A', 'colB': 'B', 'colC': 'C'})
e resulting DataFrame has new labels on both axes:
Columns A B C
0 -2.071652 0.742857 0.632307 1 0.113046 -0.384360 0.414585 2 0.690674 1.511816 2.220732 3 0.184174 -1.069291 -0.994885
With this lesson learned, we will learn how to handle missing data in DataFrames.
Handling missing data in DataFrames
Missing data is a common phenomenon in data science and can happen for multiple reasons for example, technical error, human error, market holiday.
Filtering out missing data
When dealing with missing data, the first option is to remove all observations with any missing data.
is code block modifies the df2 DataFrame using the pandas.DataFrame.at[...] attribute and sets some values to NaN:
for row, col in [('Bet', 'colA'), ('Bet', 'colB'),
('Bet', 'colC'), ('Del', 'colB'), ('Gam', 'colC')]: df2.at[row, col] = np.NaN
df2
Learning essential pandas.DataFrame operations 99
e modified DataFrame is as follows:
Columns colA colB colC
Alp -1.721523 -0.425150 1.425227 Bet NaN NaN NaN
Gam -0.408566 -1.121813 NaN
Del 0.361053 NaN 0.580435
e pandas.DataFrame.isnull(...) method finds missing values in a DataFrame:
df2.isnull()
e result is a DataFrame with True where values are missing and False otherwise:
Columns colA colB colC Alp False False False Bet True True True Gam False False True Del False True False
e pandas.DataFrame.notnull(...) method does the opposite (detects non-missing values):
df2.notnull()
e output is the following DataFrame:
Columns colA colB colC Alp True True True Bet False False False Gam True True False Del True False True
e pandas.DataFrame.dropna(...) method allows us to drop rows with missing values. e additional how= parameter controls which rows get dropped. To drop rows that have NaN for all fields, we do the following:
df2.dropna(how='all')
100 Data Manipulation and Analysis with pandas
e result is the following modified DataFrame with the Bet row removed since that was the only one with all NaN:
Columns colA colB colC
Alp -1.721523 -0.425150 1.425227 Gam -0.408566 -1.121813 NaN
Del 0.361053 NaN 0.580435
Setting how= to any removes rows with any NaN values:
df2.dropna(how='any')
is gives us the following DataFrame with all non-NaN values:
Columns colA colB colC
Alp -1.721523 -0.42515 1.425227
We will now look at how to fill in missing data. Filling in missing data
e second option when dealing with missing data is to fill in the missing values either with a value of our choice or using other valid values in the same column to duplicate/ extrapolate the missing values.
Let’s start by re-inspecting the df2 DataFrame:
df2
is yields the following DataFrame with some missing values:
Columns colA colB colC
Alp -1.721523 -0.425150 1.425227 Bet NaN NaN NaN
Gam -0.408566 -1.121813 NaN
Del 0.361053 NaN 0.580435
Now, let’s use the pandas.DataFrame.fillna(...) method with the method='backfill' and inplace=True arguments to use the backfill method to backward fill the missing values from the other values and change the DataFrame in place:
df2.fillna(method='backfill', inplace=True); df2
Learning essential pandas.DataFrame operations 101
e new DataFrame contains the following:
Columns colA colB colC
Alp -1.721523 -0.425150 1.425227 Bet -0.408566 -1.121813 0.580435 Gam -0.408566 -1.121813 0.580435 Del 0.361053 NaN 0.580435
e NaN value at (Del,colB) is because there were no observations a er that row, so backfill could not be performed. at can be fixed instead with forward fill.
The transformation of DataFrames with functions and mappings
pandas DataFrame values can also be modified by passing functions and dictionary mappings that operate on one or more data values and generate new transformed values.
Let’s modify the df2 DataFrame by adding a new column, Category, containing discrete text data:
df2['Category'] = ['HIGH', 'LOW', 'LOW', 'HIGH']; df2
e new DataFrame contains the following:
Columns colA colB colC Category Alp 1.017961 1.450681 -0.328989 HIGH
Bet -0.079838 -0.519025 1.460911 LOW
Gam -0.079838 -0.519025 1.460911 LOW
Del 0.359516 NaN 1.460911 HIGH
e pandas.Series.map(...) method accepts a dictionary containing a mapping from the old value to the new value and transforms the values. e following snippet changes the text values in Category to single characters:
df2['Category'] = df2['Category'].map({'HIGH': 'H', 'LOW': 'L'}); df2
e updated DataFrame is as follows:
Columns colA colB colC Category Alp 1.017961 1.450681 -0.328989 H
102 Data Manipulation and Analysis with pandas
Bet -0.079838 -0.519025 1.460911 L Gam -0.079838 -0.519025 1.460911 L Del 0.359516 NaN 1.460911 H
e pandas.DataFrame.applymap(...) method allows us to apply functions to data values in a DataFrame.
e following code applies the numpy.exp(...) method, which calculates the exponential:
df2.drop('Category', axis=1).applymap(np.exp)
e result is a DataFrame containing exponential values of the original DataFrame’s values (except the NaN value):
Columns colA colB colC
Alp 2.767545 4.266020 0.719651 Bet 0.923266 0.595101 4.309883 Gam 0.923266 0.595101 4.309883 Del 1.432636 NaN 4.309883
Now that we’ve learned how to transform DataFrames, we will see how to discretize and bucket values in DataFrames.
Discretization/bucketing of DataFrame values
e simplest way to achieve discretization is to create ranges of values and assign a single discrete label to all values that fall within a certain bucket.
First, let’s generate a random valued ndarray for our use:
arr = np.random.randn(10); arr
is contains the following:
array([ 1.88087339e-01, 7.94570445e-01, -5.97384701e-01, -3.01897668e+00, -5.42185315e-01, 1.10094663e+00, 1.16002554e+00, 1.51491444e-03, -2.21981570e+00, 1.11903929e+00])
Learning essential pandas.DataFrame operations 103
e pandas.cut(...) method can be used to discretize these values. e following code uses the bins= and labels=[...] arguments to bin the values into five discrete values with the labels provided:
cat = pd.cut(arr, bins=5, labels=['Very Low', 'Low', 'Med', 'High', 'Very High']); cat
We get the discrete values a er the transformation:
[High, Very High, Med, Very Low, Med, Very High, Very High, High, Very Low, Very High]
Categories (5, object): [Very Low < Low < Med < High < Very High]
e pandas.qcut(...) method is similar but uses quartiles to bin the continuous values to discrete values so that each category has the same amount of observations.
e following builds five discrete bins using the q= parameter:
qcat = pd.qcut(arr, q=5, labels=['Very Low', 'Low', 'Med', 'High', 'Very High']); qcat
And the quartile discretization yields the following categories:
[Med, High, Low, Very Low, Low, High, Very High, Med, Very Low, Very High]
Categories (5, object): [Very Low < Low < Med < High < Very High]
e following code block builds a pandas DataFrame consisting of the original continuous values as well as the categories generated from cut and qcut:
pd.DataFrame({'Value': arr, 'Category': cat, 'Quartile Category': qcat})
is DataFrame allows side-by-side comparison:
Category Quartile Category Value
0 High Med 0.188087 1 Very High High 0.794570 2 Med Low -0.597385
104 Data Manipulation and Analysis with pandas
3 Very Low Very Low -3.018977 4 Med Low -0.542185 5 Very High High 1.100947 6 Very High Very High 1.160026 7 High Med 0.001515 8 Very Low Very Low -2.219816 9 Very High Very High 1.119039
e pandas.Categorical.categories attribute provides us with the
bucket ranges:
pd.cut(arr, bins=5).categories
In this case, the buckets/range of values are as follows:
Index(['(-3.0232, -2.183]', '(-2.183, -1.347]', '(-1.347, -0.512]', '(-0.512, 0.324]',
'(0.324, 1.16]'],
dtype='object')
We can inspect the buckets for qcut as well:
pd.qcut(arr, q=5).categories
ey are slightly different from the previous buckets and they are shown as follows:
Index(['[-3.019, -0.922]', '(-0.922, -0.216]', '(-0.216, 0.431]', '(0.431, 1.105]',
'(1.105, 1.16]'],
dtype='object')
We will now look at permuting and sampling DataFrame values to generate new DataFrames.
Permuting and sampling DataFrame values to generate new DataFrames
Permuting available datasets to generate new datasets and sampling datasets to either sub-sample (reduce the number of observations) or super-sample (increase the number of observations) are common operations in statistical analysis.
Learning essential pandas.DataFrame operations 105
First, let’s generate a DataFrame of random values to work with:
df = pd.DataFrame(np.random.randn(10,5),
index=np.sort(np.random.randint(0, 100, size=10)), columns=list('ABCDE'));
df
e result is the following:
A B C D E 0 -0.564568 -0.188190 -1.678637 -0.128102 -1.880633 0 -0.465880 0.266342 0.950357 -0.867568 1.504719 29 0.589315 -0.968324 -0.432725 0.856653 -0.683398 ...
e numpy.random.permutation(...) method, when applied to a DataFrame, randomly shuffles along the Index axis and can be used to permute the rows in the dataset:
df.loc[np.random.permutation(df.index)]
is yields the following DataFrame with the rows randomly shuffled:
A B C D E 42 0.214554 1.108811 1.352568 0.238083 -1.090455 0 -0.564568 -0.188190 -1.678637 -0.128102 -1.880633 0 -0.465880 0.266342 0.950357 -0.867568 1.504719 62 -0.266102 0.831051 -0.164629 0.349047 1.874955 ...
We can use the numpy.random.randint(...) method to generate random integers within a certain range and then use the pandas.DataFrame.iloc[...] attribute to randomly sample with replacement (the same observation can be picked more than once) from our DataFrame.
e following code block picks out five rows randomly sampled with replacement:
df.iloc[np.random.randint(0, len(df), size=5)]
106 Data Manipulation and Analysis with pandas
is yields the following randomly sub-sampled DataFrame:
A B C D E 54 0.692757 -0.584690 -0.176656 0.728395 -0.434987 98 -0.517141 0.109758 -0.132029 0.614610 -0.235801 29 0.589315 -0.968324 -0.432725 0.856653 -0.683398 35 0.520140 0.143652 0.973510 0.440253 1.307126 62 -0.266102 0.831051 -0.164629 0.349047 1.874955
In the following section, we will look at exploring file operations with pandas. DataFrames.
Exploring file operations with pandas. DataFrames
pandas supports the persistence of DataFrames in both plain-text and binary formats. e common text formats are CSV and JSON files, the most used binary formats are Excel XLSX, HDF5, and pickle.
In this book, we focus on plain-text persistence.
CSV files
CSV files (comma-separated values files) are data-exchange standard files. Writing CSV files
Writing a pandas DataFrame to a CSV file is easily achievable using the pandas. DataFrame.to_csv(...) method. e header= parameter controls whether a header is written to the top of the file or not and the index= parameter controls whether the Index axis values are written to the file or not:
df.to_csv('df.csv', sep=',', header=True, index=True)
We can inspect the file written to disk using the following Linux command typed into the notebook. e ! character instructs the notebook to run a shell command:
Exploring file operations with pandas.DataFrames 107
e file contains the following lines:
,A,B,C,D,E
4,-0.6329164608486778,0.3733235944037599,0.8225354680198685,- 0.5171618315489593,0.5492241692404063
17,0.7664860447792711,0.8427366352142621,0.9621402130525599,- 0.41134468872009666,-0.9704305306626816
24,-0.22976016405853183,0.38081314413811984,- 1.526376189972014,0.07229102135441286,-0.3297356221604555
Reading CSV files
Reading a CSV file and building a pandas DataFrame from the data in it can be achieved using the pandas.read_csv(...) method. Here we will specify the character (although that is the default for read_csv), the index_col= parameter to specify
which column to treat as the Index of the DataFrame, and the nrows= parameter to
specify how many rows to read in:
pd.read_csv('df.csv', sep=',', index_col=0, nrows=5)
is builds the following DataFrame, which is the same DataFrame that was written to disk:
A B C D E 4 -0.632916 0.373324 0.822535 -0.517162 0.549224 17 0.766486 0.842737 0.962140 -0.411345 -0.970431 24 -0.229760 0.380813 -1.526376 0.072291 -0.329736 33 0.662259 -1.457732 -2.268573 0.332456 0.496143 33 0.335710 0.452842 -0.977736 0.677470 1.164602
We can also specify the chunksize= parameter, which reads in the specified number
of lines at a time, which can help when exploring very large datasets contained in very
large files:
pd.read_csv('df.csv', sep=',', index_col=0, chunksize=2)
at returns a pandas TextFileReader generator, which we can iterate through as
needed instead of loading the entire file at once:
<pandas.io.parsers.TextFileReader at 0x7fb4e9933a90>
108 Data Manipulation and Analysis with pandas
We can force the generator to finish evaluation by wrapping it in a list and observe the entire DataFrame loaded in chunks of two lines:
list(pd.read_csv('df.csv', sep=',', index_col=0, chunksize=2))
at gives us the following list of two-line blocks:
[ A B C D E 4 -0.632916 0.373324 0.822535 -0.517162 0.549224 17 0.766486 0.842737 0.962140 -0.411345 -0.970431, A B C D E 24 -0.229760 0.380813 -1.526376 0.072291 -0.329736 33 0.662259 -1.457732 -2.268573 0.332456 0.496143, ...
We will now look at how to explore file operations in JSON files.
JSON files
JSON files are based upon data structures identical to Python dictionaries. is makes JSON files very convenient for many purposes including representing DataFrames as well as representing configuration files.
e pandas.DataFrame.to_json(...) method conveniently writes a DataFrame to a JSON file on disk. Here we write only the first four rows:
df.iloc[:4].to_json('df.json') Let’s check out the JSON file written to disk:
!cat df.json
is gives us the following dictionary-style JSON file written to disk:
{"A":{"4":-0.6329164608,"17":0.7664860448,"24":- 0.2297601641,"33":0.6622594878},"B":{"4":0.3733235944, "17":0.8427366352,"24":0.3808131441,"33":-1.4577321521}, "C":{"4":0.822535468,"17":0.9621402131,"24":-1.52637619,"33":- 2.2685732447},"D":{"4":-0.5171618315,"17":-0.4113446887 ,"24":0.0722910214,"33":0.3324557226},"E":{"4":0.5492241692 ,"17":-0.9704305307,"24":-0.3297356222,"33":0.4961425281}}
Summary 109
Reading JSON files back into Pandas DataFrames is just as easy with the pandas.read_ json(...) method:
pd.read_json('df.json')
is gives us back the original four-row DataFrame that was written to disk:
A B C D E 4 -0.632916 0.373324 0.822535 -0.517162 0.549224 17 0.766486 0.842737 0.962140 -0.411345 -0.970431 24 -0.229760 0.380813 -1.526376 0.072291 -0.329736 33 0.662259 -1.457732 -2.268573 0.332456 0.496143
Congrats on successfully completing this lesson!
Summary
is chapter introduced us to the pandas library, upon which the majority, if not all, time- series operations in Python are done. We have learned how to create a DataFrame, how to alter it, and how to persist it.
Pandas DataFrames are principally for high-performance bulk data manipulation, selecting and reshaping data. ey are the Python version of Excel worksheets.
In the next chapter, we will investigate visualization in Python using Matplotlib.
'Finance' 카테고리의 다른 글
Statistical Estimation, Inference, and Prediction (0) | 2023.03.20 |
---|---|
Data Visualization Using Matplotlib (0) | 2023.03.20 |
High-Speed Scientific Computing Using (0) | 2023.03.20 |
Exploratory Data Analysis in Python (0) | 2023.03.20 |
3. High-Speed Scientific Computing UsingNumPy (0) | 2023.02.14 |
댓글