Valuable Data Analysis with Pandas Value Counts

栏目: IT技术 · 发布时间: 5年前

内容简介:In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or no

The data

In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or not the patient showed up to their appointment.

It can be downloaded here .

In the code below I have imported the data and the libraries that I will be using throughout the article.

import pandas as pdimport matplotlib.pyplot as plt
%matplotlib inlinedata = pd.read_csv('KaggleV2-May-2016.csv')
data.head()
The first few rows of the Medical Appointments No-Show data set from Kaggle.com

Basic counts

The value_counts() function can be used in the following way to get a count of unique values for one column in the data set. The code below gives a count of each value in the Gender column.

data['Gender'].value_counts()

To sort values in ascending or descending order we can use the sort argument. In the code below I have added sort=True to display the counts in the Age column in descending order.

data['Age'].value_counts(sort=True)

Combine with groupby()

The value_counts function can be combined with other Panadas functions for richer analysis techniques. One example is to combine with the groupby() function. In the below example I am counting values in the Gender column and applying groupby() to further understand the number of no-shows in each group.

data['No-show'].groupby(data['Gender']).value_counts(sort=True)

Normalize

In the above example displaying the absolute values does not easily enable us to understand the differences between the two groups. A better solution would be to show the relative frequencies of the unique values in each group.

We can add the normalize argument to value_counts() to display the values in this way.

data['No-show'].groupby(data['Gender']).value_counts(normalize=True)

Binning

For columns where there are a large number of unique values the output of the value_counts() function is not always particularly useful. A good example of this would be the Age column which we displayed value counts for earlier in this post.

Fortunately value_counts() has a bins argument. This parameter allows us to specificy the number of bins (or groups we want to split the data into) as an integer. In the example below I have added bins=5 to split the Age counts into 5 groups. We now have a count of values in each of these bins.

data['Age'].value_counts(bins=5)

Once again showing absolute numbers is not particularly useful so let’s add the normalize=True argument as well. Now we have a useful piece of analysis.

data['Age'].value_counts(bins=5, normalize=True)

Combine with nlargest()

There are other columns in our data set which have a large number of unique values where binning is still not going to provide us with a useful piece of analysis. A good example of this would be the Neighbourhood column.

If we simply run value_counts() against this we get an output that is not particularly insightful.

data['Neighbourhood'].value_counts(sort=True)

A better way to display this might be to view the top 10 neighbourhoods. We can do this by combining with another Pandas function called nlargest() as shown below.

data['Neighbourhood'].value_counts(sort=True).nlargest(10)

We can also use nsmallest() to display the bottom 10 neighbourhoods which might also prove useful.

data['Neighbourhood'].value_counts(sort=True).nsmallest(10)

Plotting

Another handy combination is the Pandas plotting functionality together with value_counts(). Having the ability to display the analyses we get from value_counts() as visualisations can make it far easier to view trends and patterns.

We can display all of the above examples and more with most plot types available in the Pandas library. A full list of available options can be found here .

Let’s look a few examples.

We can use a bar plot to view the top 10 neighbourhoods.

data['Neighbourhood'].value_counts(sort=True).nlargest(10).plot.bar()

We can make a pie chart to better visualise the Gender column.

data['Gender'].value_counts().plot.pie()

以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

打造Facebook

打造Facebook

王淮、祝文让 / 印刷工业出版社 / 2013-2-1 / 39.80元

《打造Facebook》新书发布会,王淮与读者面对面,活动链接:http://www.douban.com/event/18166913/ 这本书的书名——《打造Facebook:亲历Facebook爆发的5年》很嚣张,谁有资格可以说这句话呢,当然,扎克伯格最有资格,但他不会亲自来告诉你,至少从目前的情况来看,近几年都不大可能。而且,这不是一个人的公司。里面的每一人,尤其是工程师,既是公司文......一起来看看 《打造Facebook》 这本书的介绍吧!

XML 在线格式化
XML 在线格式化

在线 XML 格式化压缩工具

HEX HSV 转换工具
HEX HSV 转换工具

HEX HSV 互换工具