Valuable Data Analysis with Pandas Value Counts

Category: IT Technology · Published: 4 years ago


The data

In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or not the patient showed up to their appointment.

It can be downloaded here.

In the code below I have imported the data and the libraries that I will be using throughout the article.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('KaggleV2-May-2016.csv')
data.head()
The first few rows of the Medical Appointments No-Show data set from Kaggle.com

Basic counts

The value_counts() function can be used in the following way to get a count of unique values for one column in the data set. The code below gives a count of each value in the Gender column.

data['Gender'].value_counts()
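
As a minimal sketch of what this returns, the snippet below uses a small made-up Series in place of the Gender column (the values are invented for illustration and are not taken from the Kaggle file). The result is itself a Series whose index holds the unique values and whose values hold the counts, most frequent first.

```python
import pandas as pd

# Made-up stand-in for the Gender column; not the real Kaggle data.
gender = pd.Series(["F", "M", "F", "F", "M"], name="Gender")

# One row per unique value, ordered from most to least frequent.
counts = gender.value_counts()
print(counts)

# Because the result is a Series, individual counts can be looked up by label.
female_count = counts["F"]
most_common = counts.index[0]
```

Because the output is an ordinary Series, it can be indexed, filtered or passed on to other pandas methods like any other Series.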

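By default, value_counts() sorts its result by count in descending order, which is what the sort argument controls (sort=True is the default). In the code below I have passed sort=True explicitly to display the counts in the Age column with the most frequent ages first. To list the least frequent values first, pass ascending=True instead.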

data['Age'].value_counts(sort=True)
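
The ordering options can be compared side by side in a short sketch. The ages below are invented for illustration, and the variable names are my own.

```python
import pandas as pd

# Made-up ages; not the real Kaggle data.
age = pd.Series([30, 45, 30, 62, 45, 30, 18], name="Age")

# Default behaviour (sort=True): most frequent value first.
most_frequent_first = age.value_counts()

# ascending=True reverses the order: least frequent value first.
least_frequent_first = age.value_counts(ascending=True)

# To order by the age itself rather than by its count, sort the index afterwards.
by_age = age.value_counts().sort_index()

print(most_frequent_first)
print(least_frequent_first)
print(by_age)
```

Sorting by index is often the more readable choice for a numeric column such as Age, since the rows then follow the natural order of the values.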

Combine with groupby()

The value_counts() function can be combined with other pandas functions for richer analysis. One example is to combine it with the groupby() function. In the example below I am counting the values in the No-show column within each Gender group to further understand the number of no-shows in each group.

data['No-show'].groupby(data['Gender']).value_counts(sort=True)
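
To make the shape of the grouped result concrete, here is a small sketch on a made-up DataFrame (the rows are invented, only the column names match the data set). It uses the equivalent spelling df.groupby(...)[column] and then unstack() to turn the result into a small table, which is an optional extra step and not part of the original example.

```python
import pandas as pd

# Made-up rows that reuse the column names of the data set; not the real data.
df = pd.DataFrame({
    "Gender":  ["F", "F", "F", "F", "M", "M", "M"],
    "No-show": ["No", "No", "No", "Yes", "No", "Yes", "Yes"],
})

# Count No-show values separately within each Gender group.
# The result is a Series indexed by (Gender, No-show) pairs.
counts = df.groupby("Gender")["No-show"].value_counts()
print(counts)

# Optional: pivot the inner index level into columns for a compact table.
table = counts.unstack(fill_value=0)
print(table)
```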

Normalize

In the above example, displaying the absolute values does not make it easy to compare the two groups, particularly when the groups differ in size. A better solution would be to show the relative frequencies of the unique values in each group.

We can add the normalize argument to value_counts() to display the values in this way.

data['No-show'].groupby(data['Gender']).value_counts(normalize=True)
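
The sketch below shows the effect on a made-up DataFrame (the rows are invented for illustration). With normalize=True the counts become proportions, and they are computed within each group, so the values for each gender add up to 1.

```python
import pandas as pd

# Made-up rows that reuse the column names of the data set; not the real data.
df = pd.DataFrame({
    "Gender":  ["F", "F", "F", "F", "M", "M", "M"],
    "No-show": ["No", "No", "No", "Yes", "No", "Yes", "Yes"],
})

# Proportion of each No-show value within each Gender group.
rates = df.groupby("Gender")["No-show"].value_counts(normalize=True)
print(rates)

# Sanity check: the proportions inside each group sum to 1.
group_totals = rates.groupby(level=0).sum()
print(group_totals)
```

Multiplying the result by 100 gives percentages if those are easier to read.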

Binning

For columns where there are a large number of unique values the output of the value_counts() function is not always particularly useful. A good example of this would be the Age column which we displayed value counts for earlier in this post.

Fortunately, value_counts() has a bins argument. This parameter allows us to specify, as an integer, the number of equal-width bins (or groups) we want to split the data into. In the example below I have added bins=5 to split the Age counts into 5 groups. We now have a count of values in each of these bins.

data['Age'].value_counts(bins=5)

Once again, showing absolute numbers is not particularly useful, so let’s add the normalize=True argument as well. Now we have a useful piece of analysis.

data['Age'].value_counts(bins=5, normalize=True)
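
Here is a small sketch of both binning calls on made-up ages (the values are invented for illustration). I have also passed sort=False, which is my own addition, so that the bins are listed in age order rather than by count.

```python
import pandas as pd

# Made-up ages; not the real Kaggle data.
age = pd.Series([2, 8, 15, 23, 37, 41, 58, 64, 79, 100], name="Age")

# Split the observed range (2 to 100) into 5 equal-width intervals and count
# how many values fall into each. sort=False keeps the intervals in order.
binned = age.value_counts(bins=5, sort=False)
print(binned)

# The same bins, expressed as a share of all rows.
shares = age.value_counts(bins=5, normalize=True, sort=False)
print(shares)
```

Note that the bins have equal width, not equal size, so a skewed column can still leave most rows in one or two bins.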

Combine with nlargest()

There are other columns in our data set which have a large number of unique values where binning is still not going to provide us with a useful piece of analysis. A good example of this would be the Neighbourhood column.

If we simply run value_counts() against this we get an output that is not particularly insightful.

data['Neighbourhood'].value_counts(sort=True)

A better way to display this might be to view the top 10 neighbourhoods. We can do this by combining with another Pandas function called nlargest() as shown below.

data['Neighbourhood'].value_counts(sort=True).nlargest(10)

We can also use nsmallest() to display the bottom 10 neighbourhoods which might also prove useful.

data['Neighbourhood'].value_counts(sort=True).nsmallest(10)
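
A minimal sketch of the top-N and bottom-N pattern, using single-letter placeholder names instead of real neighbourhoods (the data is invented for illustration):

```python
import pandas as pd

# Placeholder neighbourhood labels "A" to "E"; not the real Kaggle data.
hood = pd.Series(list("AAAAABBBBCCCDDE"), name="Neighbourhood")

counts = hood.value_counts()

# The three most common and the two least common neighbourhoods.
top3 = counts.nlargest(3)
bottom2 = counts.nsmallest(2)

print(top3)
print(bottom2)

# value_counts() already sorts in descending order, so head(3) returns the
# same rows as nlargest(3) here, where there are no tied counts.
same_as_head = top3.equals(counts.head(3))
```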

Plotting

Another handy combination is the Pandas plotting functionality together with value_counts(). Having the ability to display the analyses we get from value_counts() as visualisations can make it far easier to view trends and patterns.

We can display all of the above examples and more with most plot types available in the Pandas library. A full list of available options can be found here.

Let’s look at a few examples.

We can use a bar plot to view the top 10 neighbourhoods.

data['Neighbourhood'].value_counts(sort=True).nlargest(10).plot.bar()

We can make a pie chart to better visualise the Gender column.

data['Gender'].value_counts().plot.pie()
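
The plot methods return a Matplotlib Axes object, which can be used to label the chart. The sketch below assumes Matplotlib is installed and uses placeholder neighbourhood names on invented data. The Agg backend line is only there so the snippet runs outside a notebook and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Placeholder neighbourhood labels "A" to "E"; not the real Kaggle data.
hood = pd.Series(list("AAAAABBBBCCCDDE"), name="Neighbourhood")

top3 = hood.value_counts().nlargest(3)

# plot.bar() draws one bar per row of the Series and returns the Axes.
ax = top3.plot.bar()
ax.set_xlabel("Neighbourhood")
ax.set_ylabel("Number of appointments")
ax.set_title("Top 3 neighbourhoods by appointment count")

# Read the bar heights back from the Axes to confirm what was drawn.
bar_heights = [patch.get_height() for patch in ax.patches]
```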
