Valuable Data Analysis with Pandas Value Counts

栏目: IT技术 · 发布时间: 6年前

内容简介:In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or no

The data

In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or not the patient showed up to their appointment.

It can be downloaded here .

In the code below I have imported the data and the libraries that I will be using throughout the article.

import pandas as pdimport matplotlib.pyplot as plt
%matplotlib inlinedata = pd.read_csv('KaggleV2-May-2016.csv')
data.head()
The first few rows of the Medical Appointments No-Show data set from Kaggle.com

Basic counts

The value_counts() function can be used in the following way to get a count of unique values for one column in the data set. The code below gives a count of each value in the Gender column.

data['Gender'].value_counts()

To sort values in ascending or descending order we can use the sort argument. In the code below I have added sort=True to display the counts in the Age column in descending order.

data['Age'].value_counts(sort=True)

Combine with groupby()

The value_counts function can be combined with other Panadas functions for richer analysis techniques. One example is to combine with the groupby() function. In the below example I am counting values in the Gender column and applying groupby() to further understand the number of no-shows in each group.

data['No-show'].groupby(data['Gender']).value_counts(sort=True)

Normalize

In the above example displaying the absolute values does not easily enable us to understand the differences between the two groups. A better solution would be to show the relative frequencies of the unique values in each group.

We can add the normalize argument to value_counts() to display the values in this way.

data['No-show'].groupby(data['Gender']).value_counts(normalize=True)

Binning

For columns where there are a large number of unique values the output of the value_counts() function is not always particularly useful. A good example of this would be the Age column which we displayed value counts for earlier in this post.

Fortunately value_counts() has a bins argument. This parameter allows us to specificy the number of bins (or groups we want to split the data into) as an integer. In the example below I have added bins=5 to split the Age counts into 5 groups. We now have a count of values in each of these bins.

data['Age'].value_counts(bins=5)

Once again showing absolute numbers is not particularly useful so let’s add the normalize=True argument as well. Now we have a useful piece of analysis.

data['Age'].value_counts(bins=5, normalize=True)

Combine with nlargest()

There are other columns in our data set which have a large number of unique values where binning is still not going to provide us with a useful piece of analysis. A good example of this would be the Neighbourhood column.

If we simply run value_counts() against this we get an output that is not particularly insightful.

data['Neighbourhood'].value_counts(sort=True)

A better way to display this might be to view the top 10 neighbourhoods. We can do this by combining with another Pandas function called nlargest() as shown below.

data['Neighbourhood'].value_counts(sort=True).nlargest(10)

We can also use nsmallest() to display the bottom 10 neighbourhoods which might also prove useful.

data['Neighbourhood'].value_counts(sort=True).nsmallest(10)

Plotting

Another handy combination is the Pandas plotting functionality together with value_counts(). Having the ability to display the analyses we get from value_counts() as visualisations can make it far easier to view trends and patterns.

We can display all of the above examples and more with most plot types available in the Pandas library. A full list of available options can be found here .

Let’s look a few examples.

We can use a bar plot to view the top 10 neighbourhoods.

data['Neighbourhood'].value_counts(sort=True).nlargest(10).plot.bar()

We can make a pie chart to better visualise the Gender column.

data['Gender'].value_counts().plot.pie()

以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Building Web Reputation Systems

Building Web Reputation Systems

Randy Farmer、Bryce Glass / Yahoo Press / 2010 / GBP 31.99

What do Amazon's product reviews, eBay's feedback score system, Slashdot's Karma System, and Xbox Live's Achievements have in common? They're all examples of successful reputation systems that enable ......一起来看看 《Building Web Reputation Systems》 这本书的介绍吧!

图片转BASE64编码
图片转BASE64编码

在线图片转Base64编码工具

随机密码生成器
随机密码生成器

多种字符组合密码

URL 编码/解码
URL 编码/解码

URL 编码/解码