Valuable Data Analysis with Pandas Value Counts

Category: IT Technology · Published: 4 years ago


The data

In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or not the patient showed up to their appointment.

It can be downloaded from Kaggle.

In the code below I have imported the data and the libraries that I will be using throughout the article.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('KaggleV2-May-2016.csv')
data.head()

The first few rows of the Medical Appointments No-Show data set from Kaggle.com

Basic counts

The value_counts() method returns the number of times each unique value occurs in a column of the data set. The code below gives a count of each value in the Gender column.

data['Gender'].value_counts()
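
To make the shape of the result concrete, here is a minimal sketch on a small made-up Series. The values are hypothetical stand-ins for the Gender column, not the Kaggle data. It also shows the dropna argument, which controls whether missing values are counted.

```python
import pandas as pd

# Hypothetical stand-in for the Gender column (not the Kaggle data).
gender = pd.Series(['F', 'M', 'F', 'F', None, 'M'], name='Gender')

# The result is a Series indexed by the unique values.
# Missing values are excluded by default.
counts = gender.value_counts()

# dropna=False keeps missing values as their own entry.
counts_with_missing = gender.value_counts(dropna=False)

print(counts)
print(counts_with_missing)
```

Comparing the two totals is a quick way to see how many values are missing in a column.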

The sort argument controls whether the results are ordered by count. It defaults to True, which lists the most frequent values first, so the code below spells out the default behaviour for the Age column. To list the least frequent values first, add ascending=True.

data['Age'].value_counts(sort=True)
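
The different orderings can be sketched on a small made-up Series. The ages below are hypothetical and only serve to show which entry comes first in each case.

```python
import pandas as pd

# Hypothetical ages (not the Kaggle data).
age = pd.Series([30, 45, 30, 62, 45, 30], name='Age')

# Default behaviour: sorted by count, most frequent first.
most_common_first = age.value_counts()

# ascending=True reverses the order: least frequent first.
least_common_first = age.value_counts(ascending=True)

# sort_index() orders the result by the age itself instead of by count.
by_age = age.value_counts().sort_index()

print(most_common_first)
print(least_common_first)
print(by_age)
```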

Combine with groupby()

The value_counts() method can be combined with other Pandas functions for richer analysis. One example is to combine it with groupby(). In the example below I am counting values in the No-show column and grouping by Gender to understand the number of no-shows in each group.

data['No-show'].groupby(data['Gender']).value_counts(sort=True)
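
The grouped result is a Series with a two-level index (Gender, then No-show). As a sketch on a made-up DataFrame with the same two column names (the rows are hypothetical), unstack() can pivot it into a small table that is easier to read.

```python
import pandas as pd

# Hypothetical rows using the same column names as the article's data set.
data = pd.DataFrame({
    'Gender':  ['F', 'F', 'F', 'M', 'M', 'F', 'M', 'F'],
    'No-show': ['No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
})

# Same pattern as in the article: count No-show values within each Gender.
counts = data['No-show'].groupby(data['Gender']).value_counts()

# Pivot the inner index level into columns: one row per Gender,
# one column per No-show value.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```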

Normalize

In the above example, the absolute counts make it hard to compare the two groups because the groups differ in size. A better solution is to show the relative frequency of each unique value within each group.

We can add the normalize argument to value_counts() to display the values in this way.

data['No-show'].groupby(data['Gender']).value_counts(normalize=True)
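
One point worth checking is that, when combined with groupby(), the proportions are computed within each group rather than across the whole column. The sketch below uses a made-up DataFrame (hypothetical rows, same column names) to confirm that each group's proportions sum to 1.

```python
import pandas as pd

# Hypothetical rows using the same column names as the article's data set.
data = pd.DataFrame({
    'Gender':  ['F', 'F', 'F', 'M', 'M', 'F', 'M', 'F'],
    'No-show': ['No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
})

# Relative frequency of each No-show value within each Gender group.
rates = data['No-show'].groupby(data['Gender']).value_counts(normalize=True)

# Summing over the outer index level gives one total per group.
group_totals = rates.groupby(level=0).sum()

print(rates)
print(group_totals)
```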

Binning

For columns where there are a large number of unique values the output of the value_counts() function is not always particularly useful. A good example of this would be the Age column which we displayed value counts for earlier in this post.

Fortunately, value_counts() has a bins argument. It lets us specify, as an integer, the number of equal-width bins (groups) to split the data into, and it only works with numeric data. In the example below I have added bins=5 to split the Age values into 5 groups. We now have a count of values in each of these bins.

data['Age'].value_counts(bins=5)

Once again, showing absolute numbers is not particularly useful, so let’s add the normalize=True argument as well. Each bin now shows the proportion of all records that fall into it.

data['Age'].value_counts(bins=5, normalize=True)
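
Two variations may be useful here, sketched on made-up ages. Passing sort=False keeps the bins in interval order rather than ordering them by count. If equal-width bins are not meaningful, pd.cut() can assign custom bins before counting. The bin edges and labels below are hypothetical choices for illustration only.

```python
import pandas as pd

# Hypothetical ages (not the Kaggle data).
age = pd.Series([2, 15, 23, 37, 41, 58, 64, 79, 85, 33], name='Age')

# Five equal-width bins, kept in interval order instead of sorted by count.
equal_width = age.value_counts(bins=5, sort=False)

# The same bins as proportions of all records.
shares = age.value_counts(bins=5, normalize=True, sort=False)

# Custom bins: the edges and labels are illustrative assumptions.
groups = pd.cut(
    age,
    bins=[0, 18, 40, 65, 120],
    labels=['child', 'young adult', 'middle-aged', 'senior'],
)
custom = groups.value_counts().sort_index()

print(equal_width)
print(shares)
print(custom)
```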

Combine with nlargest()

There are other columns in our data set which have a large number of unique values where binning is still not going to provide us with a useful piece of analysis. A good example of this would be the Neighbourhood column.

If we simply run value_counts() against this we get an output that is not particularly insightful.

data['Neighbourhood'].value_counts(sort=True)

A better way to display this might be to view the top 10 neighbourhoods. We can do this by combining with another Pandas function called nlargest() as shown below.

data['Neighbourhood'].value_counts(sort=True).nlargest(10)

We can also use nsmallest() to display the bottom 10 neighbourhoods which might also prove useful.

data['Neighbourhood'].value_counts(sort=True).nsmallest(10)
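
As a small sketch on made-up neighbourhood labels (hypothetical values), the example below shows nlargest() and nsmallest() applied to the counts. Because value_counts() already sorts in descending order by default, head() returns the same top entries in this case. nlargest() makes the intent explicit and does not rely on the sort order.

```python
import pandas as pd

# Hypothetical neighbourhood labels (not the Kaggle data).
hood = pd.Series(['A', 'B', 'A', 'C', 'A', 'B', 'D'], name='Neighbourhood')

counts = hood.value_counts()

top2 = counts.nlargest(2)      # the two most frequent neighbourhoods
head2 = counts.head(2)         # same entries here, since counts is sorted
bottom2 = counts.nsmallest(2)  # the two least frequent neighbourhoods

print(top2)
print(bottom2)
```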

Plotting

Another handy combination is the Pandas plotting functionality together with value_counts(). Having the ability to display the analyses we get from value_counts() as visualisations can make it far easier to view trends and patterns.

We can display all of the above examples and more with most plot types available in the Pandas library. A full list of available options can be found in the Pandas documentation.

Let’s look at a few examples.

We can use a bar plot to view the top 10 neighbourhoods.

data['Neighbourhood'].value_counts(sort=True).nlargest(10).plot.bar()

We can make a pie chart to better visualise the Gender column.

data['Gender'].value_counts().plot.pie()
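
The plot call returns a Matplotlib Axes object, so labels and a title can be added afterwards. The following is a minimal sketch on made-up neighbourhood labels. The data, the axis labels and the title are assumptions for illustration, and the Agg backend is selected only so that the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical neighbourhood labels (not the Kaggle data).
hood = pd.Series(['A', 'B', 'A', 'C', 'A', 'B', 'D'], name='Neighbourhood')

# Relative frequencies of the three most common neighbourhoods.
shares = hood.value_counts(normalize=True).nlargest(3)

# Horizontal bars; sorting ascending puts the largest bar at the top.
ax = shares.sort_values().plot.barh()
ax.set_xlabel('Share of appointments')
ax.set_ylabel('Neighbourhood')
ax.set_title('Top neighbourhoods by share of appointments')
plt.tight_layout()
```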
