Pandas Tricks that Expedite Data Analysis Process

Category: IT Technology · Published: 3 years ago

Speed up your data analysis process with these simple tricks.

Pandas is a very powerful and versatile Python data analysis library that expedites the preprocessing steps of data science projects. It provides numerous functions and methods that are quite useful in data analysis.

Photo by Daniel Cheung on Unsplash

As always, we start by importing NumPy and pandas.

import numpy as np
import pandas as pd

Let’s create a sample dataframe to work on. Pandas is a versatile library that usually offers multiple ways to do a task. Thus, there are many ways to create a dataframe. One common method is to pass a dictionary that includes columns as key-value pairs.

values = np.random.randint(10, size=10)
years = np.arange(2010, 2020)
groups = ['A', 'A', 'B', 'A', 'B', 'B', 'C', 'A', 'C', 'C']
df = pd.DataFrame({'group': groups, 'year': years, 'value': values})
df

We also used NumPy to create the arrays that serve as column values. np.arange returns evenly spaced values within the specified interval (the stop value is excluded). np.random.randint returns random integers based on the specified range and size, so the numbers will differ from run to run.
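Because the values above are random, results are hard to compare between runs. The following is a minimal sketch of the same dictionary-based construction with a seeded generator. The seed value 42 and the variable names rng and sample are assumptions made for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the random column is repeatable between runs.
rng = np.random.default_rng(42)

years = np.arange(2010, 2020)          # 2010 ... 2019; the stop value is excluded
values = rng.integers(0, 10, size=10)  # ten integers drawn from 0 to 9
groups = ['A', 'A', 'B', 'A', 'B', 'B', 'C', 'A', 'C', 'C']

# Each dictionary key becomes a column name; each value becomes that column's data.
sample = pd.DataFrame({'group': groups, 'year': years, 'value': values})
print(sample.shape)  # (10, 3)
```

All three inputs must have the same length, otherwise the DataFrame constructor raises an error.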

The dataframe contains yearly values for 3 different groups. Sometimes the yearly values are all we need, but there are cases in which we also need a cumulative sum. Pandas provides an easy-to-use function for this: cumsum.

df['cumsum'] = df['value'].cumsum()
df

We created a column named “cumsum” which contains the cumulative sum of the numbers in the value column. However, it does not take the groups into consideration. Such a running total is of limited use when we need to tell the groups apart. Don’t worry! There is a simple and convenient solution: we can apply the groupby function first.

df['cumsum'] = df.groupby('group')['value'].cumsum()
df

We first applied groupby on the “group” column and then the cumsum function. Now the values are summed up within each group. To make the dataframe easier to read, we may want to sort the rows by group instead of year so that the groups are visually separated.
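The difference between the plain and the grouped cumulative sum is easier to see on fixed numbers. This is a small sketch on a hypothetical five-row dataframe; the names demo, running_total, and group_total are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A'],
    'value': [1, 2, 3, 4, 5],
})

# Plain cumulative sum: runs over all rows regardless of group.
demo['running_total'] = demo['value'].cumsum()

# Grouped cumulative sum: the running total restarts for each group,
# and the result stays aligned with the original row order.
demo['group_total'] = demo.groupby('group')['value'].cumsum()

print(demo)
# running_total -> 1, 3, 6, 10, 15
# group_total   -> 1, 2, 4, 6, 9
```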

df.sort_values(by='group').reset_index()

We applied the sort_values function and reset the index with the reset_index function. As we can see in the returned dataframe, the original index is kept as a column. We can eliminate it by setting the drop parameter of reset_index to True.

df = df.sort_values(by='group').reset_index(drop=True)
df
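The effect of the drop parameter can be checked on a tiny example. This sketch uses a hypothetical three-row dataframe (demo) with no repeated group labels, so the sorted order is unambiguous.

```python
import pandas as pd

demo = pd.DataFrame({'group': ['B', 'A', 'C'], 'value': [10, 20, 30]})

# Default: the old index is preserved as a new column named "index".
kept = demo.sort_values(by='group').reset_index()
print(kept.columns.tolist())     # ['index', 'group', 'value']

# drop=True: the old index is discarded and replaced by 0, 1, 2, ...
dropped = demo.sort_values(by='group').reset_index(drop=True)
print(dropped.columns.tolist())  # ['group', 'value']
```

Neither call modifies demo itself; both return new dataframes, which is why the article assigns the result back to df.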

It looks better now. When we add a new column to a dataframe, it is placed at the end by default. However, pandas also lets us add the new column at any position using the insert function.

new = np.random.randint(5, size=10)
df.insert(2, 'new_col', new)
df

We specified the position by passing an index as the first argument. This value must be an integer. Column indices start from zero, just like row indices. The second argument is the column name and the third argument is the object that holds the values, which can be a Series or an array-like object.
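As a quick illustration of the three arguments, here is a minimal sketch on a hypothetical two-column dataframe (demo). It also shows that insert changes the dataframe in place and returns None, so its result should not be assigned back.

```python
import pandas as pd

demo = pd.DataFrame({'group': ['A', 'B'], 'value': [1, 2]})

# Position 1 places the new column between "group" and "value".
result = demo.insert(1, 'new_col', [7, 8])

print(demo.columns.tolist())  # ['group', 'new_col', 'value']
print(result)                 # None: the dataframe was modified in place
```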

Suppose we want to remove a column from a dataframe but also keep that column as a separate series. One way is to assign the column to a series and then use the drop function. A simpler way is to use the pop function.

value = df.pop('value')
df

With one line of code, we remove the value column from the dataframe and store it in a pandas series.
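To make the comparison concrete, the sketch below runs both approaches side by side on a hypothetical two-column dataframe. The names two_step, one_step, saved, and popped are illustrative only.

```python
import pandas as pd

two_step = pd.DataFrame({'group': ['A', 'B'], 'value': [1, 2]})
one_step = two_step.copy()

# Two-step approach: keep a reference to the column, then drop it.
saved = two_step['value']
two_step = two_step.drop(columns='value')

# pop: removes the column in place and returns it as a Series.
popped = one_step.pop('value')

print(one_step.columns.tolist())  # ['group']
print(popped.equals(saved))       # True
```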

We sometimes need to filter a dataframe based on a condition or apply a mask to get certain values. One easy way to filter a dataframe is the query function. I will use the sample dataframe we have been using. Let’s first insert the “value” column back:

df.insert(2, 'value', value)
df

The query function is very simple to use: it only requires the condition, written as a string.

df.query('value < new_col')

It returned the rows in which “value” is less than “new_col”. We can set more complex conditions and also use additional operators.

df.query('2*new_col > value')

We can also combine multiple conditions into one query.

df.query('2*new_col > value & cumsum < 15')
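For readers more familiar with boolean masks, the following sketch shows that a combined query selects the same rows as the equivalent mask. It runs on a hypothetical four-row dataframe (demo), and the column is named total rather than cumsum purely for this illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'value':   [1, 5, 3, 8],
    'new_col': [2, 2, 4, 3],
    'total':   [1, 6, 9, 17],
})

# Condition written as a string; column names are used directly.
by_query = demo.query('2 * new_col > value and total < 15')

# The same filter as a boolean mask; each comparison needs its own parentheses.
by_mask = demo[(2 * demo['new_col'] > demo['value']) & (demo['total'] < 15)]

print(by_query.index.tolist())   # [0, 2]
print(by_query.equals(by_mask))  # True
```

The query version avoids repeating the dataframe name and the extra parentheses, which tends to help readability as conditions grow.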

There are many aggregation functions that we can use to calculate basic statistics on columns, such as mean, sum, count and so on. We can apply each of these functions to a column. However, in some cases, we may need to check more than one type of statistic. For instance, both count and mean might be important. Instead of applying the functions separately, we can use the agg function, which applies multiple aggregation functions at once.

df[['group','value']].groupby('group').agg(['mean','count'])

Seeing the mean and the count together is more informative. We can easily detect groups that have extreme mean values but a very low number of observations.
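Here is a small sketch of that idea on made-up numbers, where group C has an extreme mean that rests on a single observation. The dataframe demo and its values are hypothetical. Selecting the value column before calling agg is a choice made here to get flat column names; the article's version aggregates a two-column dataframe instead.

```python
import pandas as pd

demo = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B', 'C'],
    'value': [2, 4, 1, 2, 3, 50],
})

# One row per group, one column per aggregation function.
summary = demo.groupby('group')['value'].agg(['mean', 'count'])
print(summary)
# A: mean 3.0,  count 2
# B: mean 2.0,  count 3
# C: mean 50.0, count 1  (extreme mean backed by a single row)
```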

