The data
In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or not the patient showed up to their appointment.
It can be downloaded here.
In the code below I have imported the data and the libraries that I will be using throughout the article.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('KaggleV2-May-2016.csv')
data.head()
Basic counts
The value_counts() function counts how many times each unique value occurs in one column of the data set. The code below gives a count of each value in the Gender column.
data['Gender'].value_counts()
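As a minimal sketch of what this call returns, the snippet below runs value_counts() on a small made-up frame. The toy rows are hypothetical stand-ins for the Kaggle file, not values taken from it. It also shows the dropna argument, which the article does not use, to make clear that missing values are left out of the counts by default.

```python
import pandas as pd

# Hypothetical toy data standing in for the Kaggle appointments file.
toy = pd.DataFrame({"Gender": ["F", "M", "F", "F", "M", None]})

# Returns a Series: unique values in the index, their counts as the values.
# Missing values are excluded by default.
counts = toy["Gender"].value_counts()
print(counts)

# dropna=False keeps missing values as their own category.
counts_with_na = toy["Gender"].value_counts(dropna=False)
print(counts_with_na)
```

Because the result is an ordinary Series, an individual count can be looked up by label, for example counts["F"].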
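The order of the output is controlled by the sort and ascending arguments. With sort=True, which is the default, the counts are sorted by frequency in descending order; adding ascending=True reverses this. In the code below I have written sort=True explicitly to display the counts in the Age column in descending order.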
data['Age'].value_counts(sort=True)
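The ordering options can be compared side by side on a few made-up ages (the values below are hypothetical, not from the data set). This is only a sketch; sort_index() is an extra step not used in the article, shown here because ordering by the age itself is often what is wanted for a numeric column.

```python
import pandas as pd

# Hypothetical ages, chosen so that each value has a different count.
ages = pd.Series([30, 45, 30, 62, 30, 45], name="Age")

# Default behaviour: sorted by count, most frequent value first.
desc = ages.value_counts()
print(desc)

# ascending=True: least frequent value first.
asc = ages.value_counts(ascending=True)
print(asc)

# sort_index() orders by the values themselves (here, by age).
by_age = ages.value_counts().sort_index()
print(by_age)
```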
Combine with groupby()
The value_counts() function can be combined with other Pandas functions for richer analysis. One example is to combine it with the groupby() function. In the example below I am counting values in the No-show column and applying groupby() on the Gender column to further understand the number of no-shows in each group.
data['No-show'].groupby(data['Gender']).value_counts(sort=True)
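To make the shape of the grouped result easier to see, here is a small sketch on made-up rows (hypothetical data, same column names as the article). It uses the equivalent spelling groupby("Gender")["No-show"], and adds an unstack() step, which is not part of the article, to pivot the result into a table.

```python
import pandas as pd

# Hypothetical toy rows using the article's column names.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "F"],
    "No-show": ["No", "Yes", "No", "No", "Yes", "No"],
})

# Same idea as the article's call, written from the DataFrame side.
grouped = toy.groupby("Gender")["No-show"].value_counts()
print(grouped)

# The article's spelling gives the same counts.
alt = toy["No-show"].groupby(toy["Gender"]).value_counts()

# The result has a two-level index (Gender, No-show);
# unstack() moves the inner level into columns for a side-by-side table.
table = grouped.unstack(fill_value=0)
print(table)
```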
Normalize
In the above example, the absolute counts make it hard to compare the two groups, because the groups differ in size. A better solution would be to show the relative frequencies of the unique values in each group.
We can add the normalize argument to value_counts() to display the values in this way.
data['No-show'].groupby(data['Gender']).value_counts(normalize=True)
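A quick way to check how normalize=True behaves with groupby() is to try it on a few made-up rows (hypothetical data below). In this sketch the proportions are computed within each Gender group, so each group sums to 1 rather than the whole column.

```python
import pandas as pd

# Hypothetical toy rows: 4 appointments for F, 2 for M.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M"],
    "No-show": ["No", "No", "No", "Yes", "No", "Yes"],
})

# Relative frequency of each No-show value inside each Gender group.
rates = toy.groupby("Gender")["No-show"].value_counts(normalize=True)
print(rates)

# Summing per group confirms the proportions are per-group, not global.
group_totals = rates.groupby(level="Gender").sum()
print(group_totals)
```

Multiplying the result by 100 gives percentages if those read better in a report.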
Binning
For columns where there are a large number of unique values the output of the value_counts() function is not always particularly useful. A good example of this would be the Age column which we displayed value counts for earlier in this post.
Fortunately value_counts() has a bins argument, which works for numeric columns. This parameter allows us to specify the number of equal-width bins (the groups we want to split the data into) as an integer. In the example below I have added bins=5 to split the Age counts into 5 groups. We now have a count of values in each of these bins.
data['Age'].value_counts(bins=5)
Once again showing absolute numbers is not particularly useful, so let’s add the normalize=True argument as well. Now we have a useful piece of analysis.
data['Age'].value_counts(bins=5, normalize=True)
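The sketch below shows binning on ten made-up ages (hypothetical values). The first part uses bins=5 as in the article. The second part is an assumption about what a reader might want next: hand-picked age bands built with pd.cut(), which is not used in the article, followed by value_counts(normalize=True). The band edges and labels are illustrative only.

```python
import pandas as pd

# Hypothetical ages spanning 2 to 100.
ages = pd.Series([2, 15, 23, 37, 41, 58, 64, 79, 90, 100], name="Age")

# bins=5 splits the observed range (2 to 100) into 5 equal-width intervals.
binned = ages.value_counts(bins=5)
print(binned.sort_index())

# Hand-picked bands via pd.cut; right=False makes each band include its
# lower edge and exclude its upper edge, e.g. [18, 40).
labels = ["0-17", "18-39", "40-64", "65+"]
groups = pd.cut(ages, bins=[0, 18, 40, 65, 120], right=False, labels=labels)

# Share of appointments falling in each band.
shares = groups.value_counts(normalize=True)
print(shares)
```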
Combine with nlargest()
There are other columns in our data set which have a large number of unique values where binning is still not going to provide us with a useful piece of analysis. A good example of this would be the Neighbourhood column.
If we simply run value_counts() against this we get an output that is not particularly insightful.
data['Neighbourhood'].value_counts(sort=True)
A better way to display this might be to view the top 10 neighbourhoods. We can do this by combining with another Pandas function called nlargest() as shown below.
data['Neighbourhood'].value_counts(sort=True).nlargest(10)
We can also use nsmallest() to display the bottom 10 neighbourhoods, which might also prove useful.
data['Neighbourhood'].value_counts(sort=True).nsmallest(10)
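Here is a small sketch of both calls on made-up data, using single letters as hypothetical neighbourhood names. It also checks that head() on the already-sorted counts picks the same top entries in this tie-free example.

```python
import pandas as pd

# Hypothetical neighbourhood labels: A appears 5 times, B 4, C 3, D 2, E 1.
hoods = pd.Series(list("AAAAABBBBCCCDDE"), name="Neighbourhood")

counts = hoods.value_counts()

# The three most common neighbourhoods, largest count first.
top3 = counts.nlargest(3)
print(top3)

# The two least common neighbourhoods, smallest count first.
bottom2 = counts.nsmallest(2)
print(bottom2)

# value_counts() is already sorted descending, so head(3) selects the
# same labels here (there are no tied counts in this toy data).
same_as_head = list(counts.head(3).index) == list(top3.index)
print(same_as_head)
```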
Plotting
Another handy combination is the Pandas plotting functionality together with value_counts(). Having the ability to display the analyses we get from value_counts() as visualisations can make it far easier to view trends and patterns.
We can display all of the above examples and more with most plot types available in the Pandas library. A full list of available options can be found here.
Let’s look at a few examples.
We can use a bar plot to view the top 10 neighbourhoods.
data['Neighbourhood'].value_counts(sort=True).nlargest(10).plot.bar()
We can make a pie chart to better visualise the Gender column.
data['Gender'].value_counts().plot.pie()
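For use outside a notebook, the same kind of bar plot can be sketched as a standalone script. Everything below is an assumption for illustration: the data is made up, the Agg backend is chosen only so the script runs without a display, and the axis labels and title are hypothetical wording. In a notebook with %matplotlib inline the backend line is not needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for script use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical neighbourhood labels: A appears 5 times, B 4, C 3, D 2, E 1.
hoods = pd.Series(list("AAAAABBBBCCCDDE"), name="Neighbourhood")
top3 = hoods.value_counts().nlargest(3)

# Draw onto an explicit Axes so labels and title can be set afterwards.
fig, ax = plt.subplots()
top3.plot.bar(ax=ax)
ax.set_xlabel("Neighbourhood")
ax.set_ylabel("Number of appointments")
ax.set_title("Top 3 neighbourhoods by appointment count")
fig.tight_layout()

# One bar is drawn per neighbourhood, with height equal to its count.
bar_heights = [patch.get_height() for patch in ax.patches]
print(bar_heights)
```

Passing an explicit ax makes it straightforward to add labels or to save the figure with fig.savefig() when no notebook is rendering it.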