A Practical Guide on Data Visualization

Category: IT Technology · Published: 4 years ago


One picture is worth a thousand words

We live in the era of big data. We can collect large amounts of data, which allows us to infer meaningful results and make informed business decisions. However, as the amount of data increases, it gets trickier to analyze and explore. This is where visualizations come in: they are great tools for exploratory data analysis when used efficiently and appropriately. Visualizations also help you deliver a message to your audience or inform them about your findings. There is no one-size-fits-all visualization method, so different tasks require different kinds of visualizations. In this post, we will cover how to create basic plots and use them efficiently.

We need a sample dataframe to work on. In this post, we will use two different datasets, both of which are available on Kaggle. The first is the Telco customer churn dataset and the other is the US cars dataset.

import pandas as pd
import numpy as np

df = pd.read_csv("Projects/churn_prediction/Telco-Customer-Churn.csv")
df.shape
(7043, 21)

The dataset includes 21 columns. The “Churn” column indicates whether a customer has churned (i.e. left the company), and the remaining columns include information about the customer or the products that the customer has.

Note: There are many tools and software packages for creating great visualizations. In this post, I will use two of the most common ones, matplotlib and seaborn. Feel free to use any package as long as you get what you want.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid")
%matplotlib inline

The %matplotlib inline command renders the figures in the notebook so we can see them instantly.

Before we start creating visualizations, I would like to emphasize a point. The main goal of visualizing data is to explore and analyze the data or to interpret the results and findings. Of course, we need to pay attention to how the figures look and try to create appealing figures. However, very beautiful visualizations without any informative power are useless in data analysis. Let’s start with this point in mind.

The main subject of this dataset is customer churn, so it is a good idea to start by checking how this variable looks:

plt.figure(figsize=(8,5))
# Pass the column as x=...; recent seaborn versions no longer accept it positionally
sns.countplot(x='Churn', data=df)

We created a figure object with a specified size using matplotlib and then added a count plot using seaborn. The figure shows that noticeably fewer customers churned than stayed, which suggests the company is good at keeping its customers.
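A count plot shows absolute counts. To put a number on the churn rate, the class proportions can be computed directly. The sketch below uses a tiny made-up dataframe (the values are illustrative and not taken from the Telco dataset) that only shares the "Churn" column name with the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Telco dataframe
toy = pd.DataFrame({"Churn": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"]})

# normalize=True returns the fraction of rows per class instead of raw counts
churn_share = toy["Churn"].value_counts(normalize=True)
print(churn_share)
```

Running the same call on the real dataframe (df["Churn"]) should give the churn rate that the bar heights only hint at.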

This figure is plain and simple. Let’s add some informative power to it. We can see how churn changes depending on the “SeniorCitizen” and “gender” columns:

# One count plot per gender, with bars split by SeniorCitizen
sns.catplot(x='Churn', hue='SeniorCitizen',
            col='gender', kind='count',
            height=4, aspect=1, data=df)

Gender does not seem to change the churn rate, but there is a difference between senior and non-senior citizens: senior citizens are more likely to churn. We can expand our analysis by trying other columns in the same way.
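The comparison read off the plot can also be checked numerically by grouping on the same two columns. This is a minimal sketch on invented rows; the column names match the Telco dataset, while the helper column "churned" and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration only
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Female", "Male", "Female", "Male"],
    "SeniorCitizen": [0, 1, 0, 1, 0, 1, 1, 0],
    "Churn": ["No", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
})

# Encode Yes/No as 1/0 so that the group mean equals the churn rate
toy["churned"] = (toy["Churn"] == "Yes").astype(int)

# Churn rate for every gender / SeniorCitizen combination
rate_by_group = toy.groupby(["gender", "SeniorCitizen"])["churned"].mean()
print(rate_by_group)
```

Swapping other column names into the groupby list extends the analysis to more features.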

Another way to explore data is to check the distributions of variables, which gives us an idea about their spread and density. Let’s check the “tenure” and “MonthlyCharges” features.

fig, axs = plt.subplots(ncols=2, figsize=(10,6))

# Note: distplot is deprecated in newer seaborn releases in favor of histplot / displot
sns.distplot(df.tenure, ax=axs[0]).set_title("Distribution of Tenure")
sns.distplot(df.MonthlyCharges, ax=axs[1]).set_title("Distribution of MonthlyCharges")

We created the figure object with two subplots and then created distribution plots using seaborn. We also added titles using set_title.
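The subplot mechanics are the same without seaborn. The following sketch uses plain matplotlib histograms on invented numbers to show how each axes object in the returned array is addressed and titled; the data and the Agg backend choice are assumptions so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical values, not taken from the Telco dataset
tenure = [1, 2, 3, 1, 70, 72, 68, 35, 2, 71]
charges = [20, 25, 70, 80, 95, 100, 60, 55, 20, 90]

# ncols=2 returns an array with one Axes object per subplot
fig, axs = plt.subplots(ncols=2, figsize=(10, 6))

axs[0].hist(tenure, bins=5)
axs[0].set_title("Distribution of Tenure")

axs[1].hist(charges, bins=5)
axs[1].set_title("Distribution of MonthlyCharges")
```

In a notebook with %matplotlib inline, the backend line can be dropped and the figure is displayed directly.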

The tenure variable indicates how long a customer has been with the company, in months. Most customers are either pretty new or have been customers for a long time. The MonthlyCharges variable exhibits an irregular distribution, but the highest density is visible at the lowest amounts.
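A distribution plot is essentially a histogram with a smoothed density curve on top, so the same shape can be inspected as binned counts. The sketch below bins an invented tenure array into three ranges; the values and the bin edges are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical tenure values in months, not taken from the Telco dataset
tenure = np.array([1, 2, 3, 1, 70, 72, 68, 35, 2, 71])

# Bins: up to 1 year, 1 to 5 years, 5 to 6 years (the last edge is inclusive)
counts, edges = np.histogram(tenure, bins=[0, 12, 60, 72])
print(counts)
```

Large counts in the first and last bins with few in between correspond to the two-peaked shape described above.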

Another way to get an idea about the dispersion of data is a box plot.

plt.figure(figsize=(10,6))
sns.boxplot(x="Contract", y="MonthlyCharges", data=df)

The line in the box represents the median. The lower and upper edges of the box show the first and third quartiles, respectively, so a tall box indicates that the values are more spread out. What we can understand from this plot:

Short-term contracts have a smaller price range
As the contract period increases, monthly charges tend to decrease
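The quantities a box plot draws (median, first and third quartiles) can be computed per group to back up a visual reading. This is a rough sketch on invented charges; the column names follow the dataset, while the numbers and the "summary" table layout are assumptions for illustration.

```python
import pandas as pd

# Hypothetical charges per contract type; values are invented
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 5 + ["Two year"] * 5,
    "MonthlyCharges": [60, 65, 70, 75, 80, 20, 40, 60, 80, 100],
})

grouped = toy.groupby("Contract")["MonthlyCharges"]

# q1 and q3 are the box edges, median is the line inside the box
summary = pd.DataFrame({
    "q1": grouped.quantile(0.25),
    "median": grouped.median(),
    "q3": grouped.quantile(0.75),
})

# Interquartile range: the height of the box
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

A larger iqr value corresponds to a taller box, and comparing the median column across rows shows how typical charges shift between contract types.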
