A Practical Guide on Data Visualization

Category: IT Technology · Published: 6 years ago



One picture is worth a thousand words

We live in the era of big data. We can collect large amounts of data, which lets us infer meaningful results and make informed business decisions. However, as the amount of data increases, it gets trickier to analyze and explore. This is where visualizations come in: used efficiently and appropriately, they are great tools for exploratory data analysis. Visualizations also help deliver a message to your audience or inform them about your findings. There is no one-size-fits-all visualization method, so different tasks require different kinds of visualizations. In this post, we will cover how to create basic plots and use them efficiently.

We need a sample dataframe to work on. In this post, we will use two different datasets, both of which are available on Kaggle. The first is the Telco customer churn dataset and the other is the US cars dataset.

import pandas as pd
import numpy as np

# Load the Telco customer churn data and check its dimensions
df = pd.read_csv("Projects/churn_prediction/Telco-Customer-Churn.csv")
df.shape
(7043, 21)

The dataset includes 21 columns. The “Churn” column indicates whether a customer has churned (i.e. left the company), and the remaining columns include information about the customer or the products that the customer has.

Note: There are many tools and software packages for creating great visualizations. In this post, I will use two of the most common ones, matplotlib and seaborn. Feel free to use any package as long as you get what you want.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid")
%matplotlib inline

The %matplotlib inline command renders figures in the notebook so we can see them right away.

Before we start creating visualizations, I would like to emphasize a point. The main goal of visualizing data is to explore and analyze the data, or to interpret results and findings. Of course, we need to pay attention to how the figures look and try to create appealing figures. However, very beautiful visualizations without any informative power are useless in data analysis. Let’s start with this point in mind.

The main subject of this dataset is customer churn, so it makes sense to start by checking how this variable looks:

plt.figure(figsize=(8,5))
# Pass the column as x= ; recent seaborn versions no longer accept it positionally
sns.countplot(x='Churn', data=df)

We created a figure object with a specified size using matplotlib, then added a count plot using seaborn. The figure shows that customers who stayed clearly outnumber those who churned, which suggests the company is fairly good at keeping its customers.
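A count plot is a picture of value counts, so the same information can be read off as numbers. The snippet below is a minimal sketch on a small invented dataframe (the `toy` name and its values are assumptions for illustration, not the real Telco data). It shows how `value_counts` gives both the bar heights and the churn share.

```python
import pandas as pd

# Hypothetical toy data standing in for the real "Churn" column.
toy = pd.DataFrame({"Churn": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"]})

# Absolute counts per class: these are the bar heights of a count plot.
counts = toy["Churn"].value_counts()

# Fractions per class: the "Yes" entry is the churn rate.
shares = toy["Churn"].value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

Running the same two calls on `df["Churn"]` would put an actual number on how low the churn rate is, rather than judging it from bar heights alone.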

This figure is plain and simple. Let’s add some informative power to it. We can see how churn changes depending on the “SeniorCitizen” and “gender” columns:

sns.catplot(x='Churn', hue='SeniorCitizen', 
 col='gender', kind='count', 
 height=4, aspect=1, data=df)

Gender does not seem to change the churn rate, but there is a difference between senior and non-senior citizens: senior citizens are more likely to churn. We can expand our analysis by trying other columns in the same way.
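A comparison like this can also be checked numerically by computing the churn rate per group. The following is a rough sketch on invented values (the `toy` dataframe is hypothetical and was built so that the rates differ by seniority but not by gender, mirroring the pattern described above). It assumes churn is stored as "Yes"/"No" strings.

```python
import pandas as pd

# Hypothetical toy data; column names follow the dataset, values are invented.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male"],
    "SeniorCitizen": [0, 1, 0, 1, 0, 1, 0, 1],
    "Churn": ["No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# Boolean series: True where the customer churned.
churned = toy["Churn"].eq("Yes")

# The mean of a boolean series is the fraction of True values,
# so the group-wise mean is the churn rate per group.
rate_by_senior = churned.groupby(toy["SeniorCitizen"]).mean()
rate_by_gender = churned.groupby(toy["gender"]).mean()

print(rate_by_senior)
print(rate_by_gender)
```

Rates are easier to compare than raw counts when the groups have very different sizes, which is often the case for a flag such as SeniorCitizen.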

Another way to explore data is to check the distributions of variables, which give us an idea about the spread and density. Let’s check the “tenure” and “MonthlyCharges” features.

fig, axs = plt.subplots(ncols=2, figsize=(10,6))
# Note: distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the replacement
sns.distplot(df.tenure, ax=axs[0]).set_title("Distribution of Tenure")
sns.distplot(df.MonthlyCharges, ax=axs[1]).set_title("Distribution of MonthlyCharges")

We created a figure object with two subplots, then created distribution plots using seaborn. We also added titles using set_title.

The tenure variable indicates how long, in months, a customer has been with the company. Most customers are either pretty new or have been customers for a long time. The MonthlyCharges variable exhibits an unusual distribution, but the highest density is clearly at the lowest amounts.
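The shape of a distribution plot comes from counting how many values fall into each bin. As a rough sketch, the snippet below bins a handful of invented tenure values into yearly intervals with `np.histogram`. Both the values and the yearly bin edges are assumptions chosen for illustration, with most values placed at the two ends to mimic the "new or long-standing" pattern described above.

```python
import numpy as np

# Hypothetical tenure values in months (invented, not taken from the dataset).
tenure = np.array([1, 2, 2, 3, 5, 30, 35, 68, 70, 71, 72, 72])

# Yearly bin edges from 0 to 72 months. Every bin is half-open [a, b)
# except the last one, which also includes its right edge.
edges = [0, 12, 24, 36, 48, 60, 72]
counts, edges_out = np.histogram(tenure, bins=edges)

for left, right, n in zip(edges_out[:-1], edges_out[1:], counts):
    print(f"{int(left):>2}-{int(right):<2} months: {n}")
```

High counts in the first and last bins with little in between is what a two-peaked distribution looks like in numbers.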

Another way to get an idea of the dispersion of data is a boxplot.

plt.figure(figsize=(10,6))
sns.boxplot(x="Contract", y="MonthlyCharges", data=df)

The line in the box represents the median. The lower and upper edges of the box show the first and third quartiles, respectively, so a tall box indicates that the values are more spread out. What we can understand from this plot:

  • Short-term contracts have a smaller price range
  • As the contract period increases, monthly charges tend to decrease
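The quantities a boxplot draws (median, first and third quartiles, and the box height, i.e. the interquartile range) can be computed directly per group. This is a minimal sketch on invented numbers: the `toy` dataframe, its two contract labels, and its charges are assumptions for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 5 + ["Two year"] * 5,
    "MonthlyCharges": [60, 65, 70, 75, 80, 20, 40, 60, 80, 100],
})

# First quartile, median, and third quartile of MonthlyCharges per contract type.
q = (
    toy.groupby("Contract")["MonthlyCharges"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()  # one row per contract type, one column per quantile
)

# Interquartile range: the height of the box in a boxplot.
q["IQR"] = q[0.75] - q[0.25]

print(q)
```

A larger IQR corresponds to a taller box, and comparing the 0.5 column across rows corresponds to comparing the median lines between boxes.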
