Accelerate your Exploratory Data Analysis with Pandas Profiling



Exploratory Data Analysis is tedious. Automate the process and generate detailed interactive reports with a single line of code using Pandas-Profiling.

Sukanta Roy

Apr 19 · 8 min read

When starting a new data science project, the first step after getting your hands on the data set is to understand it. We achieve this by performing Exploratory Data Analysis (EDA). This includes finding out the data type of each variable, the distribution of the target variable, the number of distinct values for each predictor variable, whether there are any duplicate or missing values in the data set, and so on.

If you have ever done EDA on any data set (and I assume you have, as you are reading this article), I don’t need to tell you how time-consuming this process can be. And if you have been a part of many data science projects (be it in your job or through personal projects), you know how repetitive it can be. But with the open-source library Pandas-profiling, that doesn’t have to be the case anymore.

What is Pandas-Profiling?

Pandas-profiling is an open-source library that can generate beautiful interactive reports for any data set with just a single line of code. Sounds interesting? Let’s take a look at the documentation to get a better understanding of what it does.

Pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics — if relevant for the column type — are presented in an interactive HTML report:

Type inference: detect the types of columns in a data frame.
Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, inter-quartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Correlations: highlighting of highly correlated variables (Spearman, Pearson and Kendall matrices)
Missing values: matrix, count, heatmap and dendrogram of missing values
Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
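To make the list above more concrete, here is a rough sketch of how a handful of those per-column statistics could be computed by hand with plain pandas. This is not how pandas-profiling computes them internally. The sample values and the summarize helper are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values; the None stands in for a missing entry.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, None, 54.0, 2.0, 27.0], name="Age")


def summarize(col):
    """Return a few of the per-column statistics listed above."""
    # quantile() ignores missing values by default
    q1, median, q3 = col.quantile([0.25, 0.5, 0.75])
    return {
        "dtype": str(col.dtype),
        "distinct": int(col.nunique()),
        "missing": int(col.isna().sum()),
        "min": col.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": col.max(),
        "range": col.max() - col.min(),
        "iqr": q3 - q1,
        "mean": col.mean(),
        "std": col.std(),
        # median absolute deviation, computed manually
        "mad": (col - col.median()).abs().median(),
        "cv": col.std() / col.mean(),
        "skewness": col.skew(),
        "kurtosis": col.kurt(),
    }


summary = summarize(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Writing this once per column and per project is exactly the repetitive work the report is meant to take off your hands.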

Now that we know what pandas-profiling is all about, let’s see how to install it and use it in a Jupyter Notebook or in Google Colab in the following section.

Install Pandas-profiling:

Using pip

You can install pandas-profiling very easily using the pip package manager with the following command:

pip install pandas-profiling[notebook,html]

Alternatively, you could install the latest version directly from GitHub:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using Conda

If you are using conda, you can install it with the following command:

conda install -c conda-forge pandas-profiling

Installation in Google Colab

Google Colab comes with Pandas-profiling pre-installed, but unfortunately it is an older version (v1.4). If you are following this article or the GitHub documentation, the code will not run on Google Colab unless you install the latest version of the library (v2.6).

To do that, you need to first uninstall the existing library and then install the latest one as follows:

# To uninstall (-y skips the confirmation prompt)
!pip uninstall -y pandas-profiling

Now to install, we need to run the pip install command.

!pip install pandas-profiling[notebook,html]
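To confirm which version you ended up with, one option is to ask Python's package metadata. The snippet below is a small sketch, assuming Python 3.8 or newer (where importlib.metadata is part of the standard library). The installed_version helper is a made-up name, not part of pandas-profiling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if the package is not installed.
print("pandas-profiling:", installed_version("pandas-profiling"))
```

In a notebook you may need to restart the runtime after reinstalling before the new version is picked up by imports.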

Generate Reports:

Now that we are done with the prerequisites, let’s get into the fun part: analyzing a data set.

The data set I will be using for this example is the Titanic data set.

Load the libraries:

import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

Import the data

file = cache_file("titanic.csv",
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
data = pd.read_csv(file)

Generate report:

To generate the report, run the following code in the notebook.

profile = ProfileReport(data, title="Titanic Dataset", html={'style': {'full_width': True}}, sort="None")

That’s it. With a single line of code you have generated a detailed profile report. Now let us see the results by including the report in the notebook.

Include the report in Notebook as IFrame

profile.to_notebook_iframe()

This will include the interactive report as an HTML iframe in the notebook.

Saving the report

Save the report as an HTML file using the following code:

profile.to_file(output_file="your_report.html")

Or obtain the data as JSON using:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file(output_file="your_report.json")

The Results:

Now that we know how to generate reports using pandas-profiling, let’s look at the results.

Overview:

pandas_profiling creates a very descriptive overview of the data set by calculating the total number of missing cells and duplicate rows, along with the number of distinct values, missing values, and zeros for each predictor variable. It also flags variables that have high cardinality or missing values in the warnings section of the report.

Besides all these, it generates a detailed analysis for each variable. I will go through some of them in this article. To see the full report with all the code, find the Colab link at the end of the article.
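For a sense of what goes into that overview, the same counts can be approximated with plain pandas. This is only a sketch on a tiny made-up frame that borrows a few Titanic column names. It is not the library's own implementation.

```python
import pandas as pd

# Tiny hypothetical frame; rows 0 and 4 are identical on purpose.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Age": [22.0, 38.0, None, 35.0, 22.0],
    "Cabin": [None, "C85", None, "C123", None],
})

overview = {
    "variables": df.shape[1],
    "observations": df.shape[0],
    "missing_cells": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
}

distinct = df.nunique()        # distinct non-missing values per column
missing = df.isna().sum()      # missing values per column
zeros = (df.select_dtypes("number") == 0).sum()  # zeros, numeric columns only

print(overview)
print(distinct, missing, zeros, sep="\n\n")
```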

Class distribution:

Numerical Features:

For the numerical features, besides detailed statistics such as the mean, standard deviation, minimum, maximum, and interquartile range (IQR), it also plots a histogram and lists the common and extreme values.
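The "common" and "extreme" values are easy to picture with a few lines of pandas. The fares below are made-up numbers, and this is only an approximation of what the report shows.

```python
import pandas as pd

# Hypothetical fare values for illustration.
fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.05, 51.86, 8.05], name="Fare")

common = fares.value_counts().head(3)  # most frequent values first
smallest = fares.nsmallest(3)          # extreme values at the low end
largest = fares.nlargest(3)            # extreme values at the high end

print(common, smallest, largest, sep="\n\n")
```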

Categorical Features:

Similar to the numerical features, for categorical features it calculates common values, lengths, characters etc.
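A rough pandas equivalent for categorical and text columns, again on made-up values rather than the real data set:

```python
import pandas as pd

# Hypothetical values borrowing two Titanic column names.
embarked = pd.Series(["S", "C", "S", "Q", "S", None, "C"], name="Embarked")
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], name="Name")

counts = embarked.value_counts()  # missing values are excluded by default
lengths = names.str.len()         # number of characters in each string

print(counts)
print(lengths)
```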

Interactions:

One of the most interesting parts of the report is the interactions and correlations sections. In the interactions section, the pandas_profiling library automatically generates interaction plots for every pair of variables. You can view the plot for any pair by selecting the two variables from the two headers (in this example, I have selected PassengerId and Age).

Correlation Matrix:

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5" is less than the average weight of people 5'6", and their average weight is less than that of people 5'7", etc. Correlation can tell you just how much of the variation in people’s weights is related to their heights.

The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an “inverse” correlation).

When it comes to generating a correlation matrix for all the numerical features, the pandas_profiling library gives us all the popular options to choose from, including Pearson’s r, Spearman’s ρ, etc.

Now that we know the advantages of using pandas_profiling, it is also useful to note the disadvantage of this library.
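To see where a coefficient like r comes from, here is a minimal pure-Python sketch of Pearson’s r, with Spearman’s ρ computed as Pearson’s r on the ranks. The heights and weights are invented numbers, and the simple ranks helper assumes there are no tied values. The report itself relies on library routines rather than anything like this.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho as Pearson's r on the ranks (no tie handling)."""
    return pearson_r(ranks(xs), ranks(ys))


# Made-up heights (cm) and weights (kg).
heights = [160, 165, 170, 175, 180]
weights = [55, 62, 60, 72, 80]

r = pearson_r(heights, weights)
rho = spearman_rho(heights, weights)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Both values come out strongly positive but below 1, which matches the intuition above: taller people tend to be heavier, but not perfectly so.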

Disadvantage:

The main disadvantage of pandas-profiling is its performance on large data sets. As the size of the data grows, the time needed to generate the report increases substantially.

One way to solve this problem is to generate the profile report for a part of the data set. But while doing this, it is very important to make sure that the data is randomly sampled so that it is representative of all the data we have. We can do this by:

from pandas_profiling import ProfileReport

# Generate report for 10000 data points (capped at the row count, since
# sample() raises an error when n is larger than the data set)
profile = ProfileReport(data.sample(n=min(10000, len(data))), title="Titanic Data set", html={'style': {'full_width': True}}, sort="None")

# Save to file
profile.to_file(output_file='10000datapoints.html')
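If you want the sampled report to be reproducible, it can help to fix the random seed and guard against asking for more rows than exist. The sketch below is one way to do that. The sample_rows helper and its seed parameter are made-up names, while random_state is a standard argument of DataFrame.sample.

```python
import pandas as pd


def sample_rows(df, n, seed=0):
    """Randomly sample up to n rows without replacement, reproducibly."""
    n = min(n, len(df))  # sample() raises ValueError if n exceeds the row count
    return df.sample(n=n, random_state=seed)


# Hypothetical stand-in for a real data set.
df = pd.DataFrame({"value": range(100)})
small = sample_rows(df, 10)
capped = sample_rows(df, 10_000)

print(len(small), len(capped))
```

The sampled frame can then be passed to ProfileReport in place of the full data set.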

Alternatively, if you insist on getting the report for the whole data set, you can use the minimal mode. In minimal mode, a simplified report is generated with less information than the full one, but it can be produced relatively quickly for a large data set. The code for this is given below:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file(output_file="output.html")

Conclusion:

Now that you know what pandas-profiling is and how to use it, I hope it will save you a ton of time that you can spend on more advanced analysis specific to the problem at hand.

If you want to get the full report with working code, you can take a look at the following notebook. And if you would like to read some of my other articles then you can find the links below.

Demo

Demo on Titanic Data set

colab.research.google.com

Pandas-Profiling GitHub repo:

pandas-profiling/pandas-profiling

Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for…

github.com

If you loved this article, you may also like some of my other articles.

The Trap of tutorials and online courses

How tutorials and online courses can create an illusion of competence, and how not to fall into this trap

towardsdatascience.com

Machine Learning Case Study: A data-driven approach to predict the success of bank telemarketing

Predicting whether a customer will subscribe a term deposit or not given customer relationship data

towardsdatascience.com

What is ACM ICPC and how to prepare for it (the beginner’s guide)

What is ACM ICPC?

codeburst.io

About Me:

Hi, I am Sukanta Roy: a software developer, an aspiring machine learning engineer, a former Google Summer of Code 2018 student, and a huge psychology buff. If any of these things interest you, you can follow me on Medium or connect with me on LinkedIn.
