Pandas Tricks that Expedite Data Analysis Process

Category: IT Technology · Published: 1 month ago



Speed up your data analysis process with these simple tricks.

Pandas is a very powerful and versatile Python data analysis library that expedites the preprocessing steps of data science projects. It provides numerous functions and methods that are quite useful in data analysis.


As always, we start by importing NumPy and pandas.

import numpy as np
import pandas as pd

Let’s create a sample dataframe to work on. Pandas usually offers multiple ways to do a task, and creating a dataframe is no exception. One common method is to pass a dictionary whose keys are the column names and whose values are the column data.

values = np.random.randint(10, size=10)
years = np.arange(2010, 2020)
groups = ['A','A','B','A','B','B','C','A','C','C']
df = pd.DataFrame({'group': groups, 'year': years, 'value': values})
df

We also used NumPy to create the arrays that serve as column values. np.arange returns evenly spaced values within the specified interval, including the start and excluding the stop. np.random.randint returns random integers drawn from the specified range, with the number of values set by size.
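As a quick check of these two NumPy calls, the sketch below confirms the bounds they produce. The fixed seed is an assumption added here so the run is repeatable; the article's own example is unseeded.

```python
import numpy as np

# Seeding is an addition for reproducibility, not part of the article's code.
np.random.seed(0)

# np.arange(start, stop) includes start and excludes stop.
years = np.arange(2010, 2020)

# np.random.randint(high, size=n) draws n integers from [0, high).
values = np.random.randint(10, size=10)

print(years[0], years[-1], len(years))
print(values.min() >= 0, values.max() < 10, len(values))
```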

The dataframe contains yearly values for 3 different groups. We may only be interested in the yearly values, but in some cases we also need a cumulative sum. Pandas provides an easy-to-use function for this: cumsum.

df['cumsum'] = df['value'].cumsum()
df

We created a column named “cumsum” that contains the cumulative sum of the numbers in the value column. However, it does not take the groups into consideration. Such a running total may be useless in some cases because it mixes the groups together. There is a simple and convenient solution: apply the groupby function first.

df['cumsum'] = df[['value','group']].groupby('group').cumsum()
df
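To see what the grouped version computes, here is a minimal sketch on fixed, made-up values that can be checked by hand. The frame `demo` and the column names `cumsum_all` and `cumsum_group` are hypothetical. The sketch groups first and then selects the value column, which returns a Series aligned with the original rows.

```python
import pandas as pd

# Hypothetical fixed values so the result can be verified by hand.
demo = pd.DataFrame({
    'group': ['A', 'A', 'B', 'A', 'B'],
    'value': [1, 2, 3, 4, 5],
})

# Running total over the whole column, ignoring groups.
demo['cumsum_all'] = demo['value'].cumsum()

# Running total that restarts for each group; row order is unchanged.
demo['cumsum_group'] = demo.groupby('group')['value'].cumsum()

print(demo)
```

In the grouped column, the A rows accumulate 1, 3, 7 and the B rows accumulate 3, 8, each in its original row position.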

We first applied groupby on the “group” column and then the cumsum function. Now the values are summed up within each group. To make the dataframe look nicer, we may want to sort the rows by group instead of year so that the groups are visually separated.


df.sort_values(by='group').reset_index()

We applied the sort_values function and reset the index with the reset_index function. As we can see in the returned dataframe, the original index is kept as a column. We can eliminate it by setting the drop parameter of reset_index to True.

df = df.sort_values(by='group').reset_index(drop=True)
df
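The effect of the drop parameter can be seen in a small sketch with fixed values. The frame `demo` is hypothetical, and `kind='stable'` is an addition here (not in the article's code) so that rows keep their original relative order within each group.

```python
import pandas as pd

# Hypothetical frame with fixed values, unrelated to the article's random data.
demo = pd.DataFrame({
    'group': ['B', 'A', 'C', 'A'],
    'value': [10, 20, 30, 40],
})

# A stable sort keeps the original relative order of rows within each group.
sorted_demo = demo.sort_values(by='group', kind='stable')

# Default: the old index is preserved as a new column named 'index'.
kept = sorted_demo.reset_index()

# drop=True: the old index is discarded and replaced by 0..n-1.
dropped = sorted_demo.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```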

It looks better now. When we add a new column to a dataframe, it is appended at the end by default. However, pandas lets us place the new column at any position with the insert function.

new = np.random.randint(5, size=10)
df.insert(2, 'new_col', new)
df

We specified the position by passing an index as the first argument. This value must be an integer. Column indices start from zero, just like row indices. The second argument is the column name, and the third is the object holding the values, which can be a Series or an array-like object.
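Two details of insert are easy to miss: it modifies the dataframe in place and returns None, and it refuses a column name that already exists unless allow_duplicates=True is passed. A minimal sketch on a made-up frame (`demo` and `duplicate_rejected` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({'group': ['A', 'B'], 'year': [2010, 2011], 'value': [3, 7]})

# insert modifies the frame in place and returns None.
result = demo.insert(2, 'new_col', [0, 1])

print(result)
print(demo.columns.tolist())

# Inserting a name that already exists raises ValueError
# unless allow_duplicates=True is passed.
duplicate_rejected = False
try:
    demo.insert(0, 'value', [9, 9])
except ValueError as err:
    duplicate_rejected = True
    print('ValueError:', err)
```

Because insert returns None, writing df = df.insert(...) would overwrite the dataframe with None.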

Suppose we want to remove a column from a dataframe but also keep that column as a separate series. One way is to assign the column to a series and then use the drop function. A simpler way is to use the pop function.

value = df.pop('value')
df

With one line of code, we remove the value column from the dataframe and store it in a pandas series.
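For comparison, the sketch below runs the two-step approach and pop side by side on a made-up frame and checks that they end in the same state. All names (`demo`, `two_step`, `one_step`, `saved`, `popped`) are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({'group': ['A', 'B'], 'value': [3, 7]})

# Two-step version: copy the column out, then drop it.
two_step = demo.copy()
saved = two_step['value']
two_step = two_step.drop(columns='value')

# One-step version: pop returns the column as a Series and
# removes it from the frame in place.
one_step = demo.copy()
popped = one_step.pop('value')

print(type(popped).__name__)
print(one_step.columns.tolist())
print(popped.equals(saved))
```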

We sometimes need to filter a dataframe based on a condition or apply a mask to get certain values. One easy way to filter a dataframe is the query function. I will use the sample dataframe we have been working with. Let’s first insert the “value” column back:

df.insert(2, 'value', value)
df

The query function is very simple to use: it only requires the condition, written as a string.

df.query('value < new_col')

It returned the rows in which “value” is less than “new_col”. We can set more complex conditions and also use additional operators.

df.query('2*new_col > value')

We can also combine multiple conditions into one query.

df.query('2*new_col > value & cumsum < 15')
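A query string is shorthand for a boolean mask. The sketch below writes the same combined condition both ways on a made-up frame and checks that the results agree. The frame `demo` and its values are hypothetical, and the parentheses and the `and` keyword are added here only to make the grouping explicit.

```python
import pandas as pd

demo = pd.DataFrame({
    'value': [1, 6, 3, 9],
    'new_col': [2, 2, 4, 4],
    'cumsum': [1, 7, 10, 19],
})

# query version, with parentheses to make the grouping explicit.
by_query = demo.query('(2 * new_col > value) and (cumsum < 15)')

# Equivalent boolean mask. Bracket access is needed for 'cumsum'
# because demo.cumsum is the DataFrame method, not the column.
mask = (2 * demo['new_col'] > demo['value']) & (demo['cumsum'] < 15)
by_mask = demo[mask]

print(by_query.equals(by_mask))
```

In the mask version the parentheses are required, because Python's & binds more tightly than the comparison operators.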

There are many aggregation functions that we can use to calculate basic statistics on columns, such as mean, sum, and count. We can apply each of these functions to a column. However, in some cases we may need more than one statistic at once. For instance, both count and mean might be important. Instead of applying the functions separately, pandas offers the agg function to apply multiple aggregation functions in one call.
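A minimal sketch of agg, assuming we group on the group column and summarize the value column. The frame `demo` and its numbers are made up for illustration, with group C given a single extreme observation.

```python
import pandas as pd

demo = pd.DataFrame({
    'group': ['A', 'A', 'B', 'A', 'B', 'C'],
    'value': [2, 4, 10, 6, 20, 100],
})

# One row per group, one column per aggregation.
summary = demo.groupby('group')['value'].agg(['mean', 'count'])

print(summary)
```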


Seeing both mean and count together is more informative. We can easily spot groups that have extreme mean values but very few observations.
