A Practical Guide on Data Visualization

栏目: IT技术 · 发布时间: 3年前

内容简介:One picture is worth a thousand wordsWe live in the era of big data. We can collect lots of data which allows to infer meaningful results and make informed business decisions. However, as the amount of data increases, it gets trickier to analyze and explor

A Practical Guide on Data Visualization

One picture is worth a thousand words

We live in the era of big data. We can collect lots of data which allows to infer meaningful results and make informed business decisions. However, as the amount of data increases, it gets trickier to analyze and explore the data. There comes in the power of visualizations which are great tools in exploratory data analysis when used efficiently and appropriately. Visualizations also help to deliver a message to your audience or inform them about your findings. There is no one-fits-all kind of visualization method so certain tasks require different kind of visualizations. In this post, we will cover how to create basic plots and efficiently use them.

We need a sample dataframe to work on. In this post, we will use two different datasets both of which are available on kaggle. First one is telco customer churn dataset and the other one is US cars dataset.

import pandas as pd
import numpy as npdf = pd.read_csv("Projects/churn_prediction/Telco-Customer-Churn.csv")df.shape
(7043, 21)

The dataset includes 21 columns. “Churn” column indicates whether a customer has churned (i.e. left the company) and remaining columns include information about the customer or the products that customer have.

Note: There are many tools and software packages to create great visualizations. In this post, I will use two of the most common ones which are matplotlib and seaborn. Feel free to use any package as long as you get what you want.

import matplotlib.pyplot as plt
import seaborn as snssns.set(style="darkgrid")%matplotlib inline

%matplotlib inline command allows to render the figures in the notebook so we can see them instantly.

Before starting on creating visualizations, I would like to emphasize a point. The main goal of visualizing data is to explore and analyze the data or interpret the results and findings. Ofcourse, we need to pay attention to how the figures look and try to create appealing figures. However, very beautiful visualizations without any informative power are useless in data analysis. Let’s start with keeping this point in mind.

The main object of this dataset is customer churn. So, it is better to check how this variable looks:

plt.figure(figsize=(8,5))sns.countplot('Churn', data=df)

We created a figure object with a specified size with matplotlib backend. Then, added a countplot using seaborn. This figure obviously tells us the company is good at keeping its customers because churn rate is actually low.

This figure is plain and simple. Let’s add some informative power to it. We can see how churn changes depending on “SeniorCitizen” and “gender” columns:

sns.catplot('Churn', hue='SeniorCitizen', 
 col='gender', kind='count', 
 height=4, aspect=1, data=df)

Gender seems to be not changing the churn rate but there is a difference between senior and non-senior citizens. Senior citizens are more likely to churn. We can expand our analysis by trying other columns in this way.

Another way to explore data is to check the distributions of variables which give us an idea about the spread and density. Let’s check it on “tenure” and “MonthlyCharges” features.

fig, axs = plt.subplots(ncols=2, figsize=(10,6))sns.distplot(df.tenure, ax=axs[0]).set_title("Distribution of Tenure")sns.distplot(df.MonthlyCharges, ax=axs[1]).set_title("Distribution of MonthlyCharges")

We created the figure object with two subplots. Then, created distribution plots using seaborn. We also added titles using set_title :

Tenure variable indicates how long a customer has been a customer in months. Most of the customers are pretty new or have been a customer for a long time. MonthlyCharges variable exhibits a strange distribution but the high density is visible on the lowest amount.

Another way to have an idea about the dispersion of data is boxplot .

plt.figure(figsize=(10,6))sns.boxplot(x="Contract", y="MonthlyCharges", data=df)

The line in the box represents the median. The lower and upper edges of the boxes show first and third quantile, respectively. So, tall boxes indicates the values are more spread out. What we can understand from this plot:

  • Short-term contracts have smaller price range
  • As the contract period increases, monthly charges tend to decrease

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

HTML5与CSS3基础教程(第8版)

HTML5与CSS3基础教程(第8版)

[美] Elizabeth Castro、[美] Bruce Hyslop / 望以文 / 人民邮电出版社 / 2014-5 / 69.00元

本书是风靡全球的HTML和CSS入门教程的最新版,至第6版累积销量已超过100万册,被翻译为十多种语言,长期雄踞亚马逊书店计算机图书排行榜榜首。 第8版秉承作者直观透彻、循序渐进、基础知识与案例实践紧密结合的讲授特色,采用独特的双栏图文并排方式,手把手指导读者从零开始轻松入门。相较第7版,全书2/3以上的内容进行了更新,全面反映了HTML5和CSS3的最新特色,细致阐述了响应式Web设计与移......一起来看看 《HTML5与CSS3基础教程(第8版)》 这本书的介绍吧!

CSS 压缩/解压工具
CSS 压缩/解压工具

在线压缩/解压 CSS 代码

RGB转16进制工具
RGB转16进制工具

RGB HEX 互转工具