A Practical Guide on Data Visualization

Category: IT Technology · Published: 5 years ago

One picture is worth a thousand words

We live in the era of big data. We can collect large amounts of data, which lets us infer meaningful results and make informed business decisions. However, as the amount of data increases, it becomes trickier to analyze and explore. This is where visualizations come in: used efficiently and appropriately, they are great tools for exploratory data analysis. Visualizations also help to deliver a message to your audience or inform them about your findings. There is no one-size-fits-all visualization method, so different tasks require different kinds of visualizations. In this post, we will cover how to create basic plots and use them efficiently.

We need a sample dataframe to work on. In this post, we will use two different datasets, both of which are available on Kaggle. The first is the Telco customer churn dataset and the other is the US cars dataset.

import pandas as pd
import numpy as np

df = pd.read_csv("Projects/churn_prediction/Telco-Customer-Churn.csv")
df.shape
(7043, 21)

The dataset includes 21 columns. The “Churn” column indicates whether a customer has churned (i.e. left the company), and the remaining columns include information about the customer or the products that the customer has.

Note: There are many tools and software packages to create great visualizations. In this post, I will use two of the most common ones which are matplotlib and seaborn. Feel free to use any package as long as you get what you want.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid")
%matplotlib inline

The %matplotlib inline command renders figures directly in the notebook, so we can see them instantly. It is an IPython magic command, so it only works in a Jupyter or IPython session.
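Outside a notebook the magic command is not available. As a minimal sketch (not part of the original workflow), the following draws a figure with a non-interactive backend and writes it to an in-memory PNG. The bar heights are made-up numbers, and the choice of the Agg backend is an assumption for running without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display required
import matplotlib.pyplot as plt

# Made-up counts, used only to have something to draw.
labels = ["No", "Yes"]
heights = [6, 2]

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(labels, heights)
ax.set_title("Churn counts (toy data)")

# Save to an in-memory buffer instead of showing the figure inline.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a script with a display available, calling plt.show() instead of saving would be the more usual choice.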

Before we start creating visualizations, I would like to emphasize one point. The main goal of visualizing data is to explore and analyze the data or to interpret results and findings. Of course, we need to pay attention to how the figures look and try to create appealing figures. However, beautiful visualizations without any informative power are useless in data analysis. Let’s start with this point in mind.

The main subject of this dataset is customer churn, so it makes sense to first check how this variable looks:

plt.figure(figsize=(8,5))
sns.countplot(x='Churn', data=df)

We created a figure object with a specified size using matplotlib, then added a count plot using seaborn. The figure shows that customers who churned are clearly outnumbered by those who stayed, so the churn rate is relatively low.
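A count plot shows absolute counts, so it can help to compute the proportions as well. The following is a minimal sketch on a tiny hypothetical dataframe (the values are made up and are not taken from the Telco dataset); with the real data, the same calls would be applied to df["Churn"].

```python
import pandas as pd

# Hypothetical stand-in for the real data: only a "Churn" column, made-up values.
toy = pd.DataFrame({"Churn": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"]})

# Absolute counts: the numbers a count plot draws as bar heights.
churn_counts = toy["Churn"].value_counts()

# Relative frequencies make the churn rate explicit.
churn_share = toy["Churn"].value_counts(normalize=True)

print(churn_counts)
print(churn_share.round(3))
```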

This figure is plain and simple. Let’s add some informative power to it. We can see how churn changes depending on “SeniorCitizen” and “gender” columns:

sns.catplot(x='Churn', hue='SeniorCitizen',
            col='gender', kind='count',
            height=4, aspect=1, data=df)

Gender does not seem to affect the churn rate, but there is a difference between senior and non-senior citizens: senior citizens are more likely to churn. We can expand the analysis by trying other columns in the same way.
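The comparison the plot suggests can also be checked numerically. As a rough sketch, assuming the column names used above and a small made-up dataframe (the values do not come from the Telco dataset), churn rates per group can be computed with groupby. The helper column "churned" is introduced here only for illustration.

```python
import pandas as pd

# Made-up rows that mimic the three columns used in the plot.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male",
               "Female", "Male", "Female", "Male"],
    "SeniorCitizen": [0, 0, 1, 1, 0, 0, 1, 1],
    "Churn": ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"],
})

# Encode Yes/No as 1/0 so that the group mean equals the churn rate.
toy["churned"] = (toy["Churn"] == "Yes").astype(int)

# Churn rate for non-senior (0) and senior (1) customers.
rate_by_senior = toy.groupby("SeniorCitizen")["churned"].mean()

# Churn rate for every gender / SeniorCitizen combination, as a small table.
rate_by_group = (
    toy.groupby(["gender", "SeniorCitizen"])["churned"].mean().unstack()
)

print(rate_by_senior)
print(rate_by_group)
```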

Another way to explore data is to check the distributions of variables, which gives us an idea about their spread and density. Let’s check this for the “tenure” and “MonthlyCharges” features.

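fig, axs = plt.subplots(ncols=2, figsize=(10,6))

# Note: distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the suggested replacement.
sns.distplot(df.tenure, ax=axs[0]).set_title("Distribution of Tenure")
sns.distplot(df.MonthlyCharges, ax=axs[1]).set_title("Distribution of MonthlyCharges")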

We created a figure object with two subplots, then created distribution plots using seaborn. We also added titles using set_title.

The tenure variable indicates how many months a customer has been with the company. Most customers are either fairly new or have been customers for a long time. The MonthlyCharges variable has an irregular distribution, but a high density is visible at the lowest amounts.
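The shape of a distribution can also be read from binned counts. Here is a minimal sketch using made-up tenure values (in months, not taken from the dataset) that are deliberately concentrated at both ends; the bin count and range are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up tenure values in months: many new and many long-standing customers.
tenure = pd.Series([1, 2, 2, 3, 5, 30, 35, 68, 70, 71, 72, 72], name="tenure")

# Count values in three equal-width bins covering 0 to 72 months.
counts, edges = np.histogram(tenure, bins=3, range=(0, 72))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:>3.0f} months: {count} customers")

# Summary statistics give a complementary view of spread and center.
print(tenure.describe())
```

Few values in the middle bin and many in the outer bins correspond to the two-peaked shape described above.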

Another way to get an idea of the dispersion of the data is a box plot.

plt.figure(figsize=(10,6))
sns.boxplot(x="Contract", y="MonthlyCharges", data=df)

The line in the box represents the median. The lower and upper edges of the box show the first and third quartiles, respectively, so taller boxes indicate that the values are more spread out. What we can understand from this plot:

  • Short-term contracts have a smaller price range
  • As the contract period increases, monthly charges tend to decrease
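The quantities a box plot draws (median and quartiles) can be computed directly. The sketch below uses a small made-up dataframe with the same column names; the charges are invented for illustration and do not come from the Telco dataset.

```python
import pandas as pd

# Invented monthly charges for two contract types.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "MonthlyCharges": [50.0, 55.0, 60.0, 65.0, 20.0, 40.0, 60.0, 80.0],
})

# First quartile, median and third quartile per contract type:
# the lower edge, middle line and upper edge of each box.
quartiles = (
    toy.groupby("Contract")["MonthlyCharges"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range: the height of each box.
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]

print(quartiles)
```

A larger IQR corresponds to a taller box, and comparing the 0.5 column across rows corresponds to comparing the median lines.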
