A Practical Guide on Data Visualization

Category: IT Technology · Published: 4 years ago

One picture is worth a thousand words

We live in the era of big data. We can collect large amounts of data, which allows us to infer meaningful results and make informed business decisions. However, as the amount of data increases, it gets trickier to analyze and explore. This is where visualizations come in: they are great tools for exploratory data analysis when used efficiently and appropriately. Visualizations also help deliver a message to your audience or inform them about your findings. There is no one-size-fits-all visualization method, so different tasks require different kinds of visualizations. In this post, we will cover how to create basic plots and use them efficiently.

We need a sample dataframe to work on. In this post, we will use two different datasets, both of which are available on Kaggle. The first is the Telco customer churn dataset and the other is the US cars dataset.

import pandas as pd
import numpy as np

df = pd.read_csv("Projects/churn_prediction/Telco-Customer-Churn.csv")
df.shape
(7043, 21)

The dataset includes 21 columns. The “Churn” column indicates whether a customer has churned (i.e. left the company), and the remaining columns contain information about the customer or the products the customer has.

Note: There are many tools and software packages for creating great visualizations. In this post, I will use two of the most common ones, matplotlib and seaborn. Feel free to use any package as long as you get what you want.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid")
%matplotlib inline

The %matplotlib inline command renders figures in the notebook so we can see them instantly.
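
Outside a notebook the inline magic is not available, so a figure has to be shown or saved explicitly. The following is a minimal sketch, assuming a plain Python script with no display attached. The "Agg" backend, the made-up bar heights and the in-memory buffer are illustrative choices, not part of the original workflow.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Made-up counts, purely for illustration (not taken from the churn dataset)
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(["No", "Yes"], [3, 1])
ax.set_title("Illustrative bar chart")

# Render the figure as PNG into memory; a file path would work the same way
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)  # free the figure once it has been rendered
```

In an interactive session with a display, calling plt.show() instead of savefig would open the figure in a window.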

Before we start creating visualizations, I would like to emphasize one point. The main goal of visualizing data is to explore and analyze the data or to interpret results and findings. Of course, we need to pay attention to how the figures look and try to make them appealing. However, beautiful visualizations without any informative power are useless in data analysis. Let’s start with this point in mind.

The main subject of this dataset is customer churn, so it is a good idea to check how this variable looks:

plt.figure(figsize=(8,5))
# pass the column as x=; recent seaborn versions require keyword arguments here
sns.countplot(x='Churn', data=df)

We created a figure object of a specified size with matplotlib and then added a count plot using seaborn. The figure shows that the company is good at keeping its customers: customers who stayed clearly outnumber those who churned.
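
The visual impression from a count plot can be backed with numbers. The sketch below uses a small hypothetical dataframe named toy as a stand-in for df, since the real file is not loaded here; the values are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the "Churn" column; these values are not real data
toy = pd.DataFrame({"Churn": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"]})

# Absolute counts per class (what the count plot draws as bar heights)
counts = toy["Churn"].value_counts()

# Relative shares; the share of "Yes" is the churn rate
shares = toy["Churn"].value_counts(normalize=True)
churn_rate = shares["Yes"]

print(counts)
print(f"Churn rate: {churn_rate:.1%}")
```

Running the same two value_counts calls on df["Churn"] would give the actual figures for the dataset.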

This figure is plain and simple. Let’s add some informative power to it. We can see how churn changes depending on the “SeniorCitizen” and “gender” columns:

sns.catplot(x='Churn', hue='SeniorCitizen',
            col='gender', kind='count',
            height=4, aspect=1, data=df)

Gender does not seem to change the churn rate, but there is a difference between senior and non-senior citizens: senior citizens are more likely to churn. We can expand our analysis by trying other columns in the same way.
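
A comparison like this can also be expressed as a table of churn rates per group. The following is a rough sketch on a hypothetical toy dataframe that reuses the column names of the dataset. The rows are invented so that seniors churn more often and gender makes no difference; they are not results from the real data.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names; values are invented
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male"],
    "SeniorCitizen": [0, 0, 1, 1, 0, 0, 1, 1],
    "Churn": ["No", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

# Encode churn as 0/1 so that the group mean equals the churn rate
toy["churned"] = toy["Churn"].eq("Yes").astype(int)

# Rows: gender, columns: SeniorCitizen, cells: churn rate in that group
rates = toy.groupby(["gender", "SeniorCitizen"])["churned"].mean().unstack()
print(rates)
```

Swapping in other columns for gender or SeniorCitizen gives the same kind of table for other groupings.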

Another way to explore data is to check the distributions of variables, which gives us an idea about their spread and density. Let’s check the “tenure” and “MonthlyCharges” features.

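fig, axs = plt.subplots(ncols=2, figsize=(10,6))
# Note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the suggested replacement
sns.distplot(df.tenure, ax=axs[0]).set_title("Distribution of Tenure")
sns.distplot(df.MonthlyCharges, ax=axs[1]).set_title("Distribution of MonthlyCharges")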

We created a figure object with two subplots and then created distribution plots using seaborn. We also added titles using set_title.

The tenure variable indicates how long, in months, a customer has been with the company. Most customers are either fairly new or have been customers for a long time. The MonthlyCharges variable exhibits an unusual distribution, but the highest density is clearly at the lowest amounts.
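
A distribution with peaks at both ends, like the one described for tenure, can also be checked by binning the values and counting. The sketch below uses invented tenure values and yearly bin edges. Both are assumptions for illustration and do not come from the dataset.

```python
import numpy as np

# Invented tenure values in months, clustered at both ends on purpose
tenure = np.array([1, 2, 2, 3, 5, 30, 40, 68, 70, 71, 72, 72])

# Yearly bins: [0, 12), [12, 24), ..., [60, 72] (the last bin includes 72)
edges = [0, 12, 24, 36, 48, 60, 72]
counts, _ = np.histogram(tenure, bins=edges)

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low:2d}-{high:2d} months: {n}")
```

High counts in the first and last bins with little in between correspond to the two peaks a distribution plot would show.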

Another way to get an idea of the dispersion of the data is a boxplot.

plt.figure(figsize=(10,6))
sns.boxplot(x="Contract", y="MonthlyCharges", data=df)

The line in the box represents the median. The lower and upper edges of the box show the first and third quartiles, respectively, so taller boxes indicate that the values are more spread out. What we can understand from this plot:

  • Short-term contracts have a smaller price range
  • As the contract period increases, monthly charges tend to decrease
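
The quantities a boxplot draws (median, first and third quartiles) can be computed directly, which makes it easier to compare groups. The following is a minimal sketch on a hypothetical toy dataframe. The charges are invented so that the short-term group has the narrower box, and they are not values from the dataset.

```python
import pandas as pd

# Hypothetical charges per contract type; values are invented for illustration
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "MonthlyCharges": [60.0, 70.0, 80.0, 90.0, 20.0, 40.0, 80.0, 100.0],
})

# First quartile, median and third quartile per contract type
quartiles = (
    toy.groupby("Contract")["MonthlyCharges"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# The interquartile range is the height of the box in a boxplot
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

A larger IQR means a taller box, and a lower value in the 0.5 column means a lower median line.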
