Pandas Tricks that Expedite Data Analysis Process

Category: IT Technology · Published: 5 years ago


Speed up your data analysis process with these simple tricks.

Pandas is a very powerful and versatile Python data analysis library that expedites the preprocessing steps of data science projects. It provides numerous functions and methods that are quite useful in data analysis.

Photo by Daniel Cheung on Unsplash

As always, we start by importing NumPy and pandas.

import numpy as np
import pandas as pd

Let’s create a sample dataframe to work on. Pandas is a versatile library that usually offers multiple ways to do a task. Thus, there are many ways to create a dataframe. One common method is to pass a dictionary that includes columns as key-value pairs.

values = np.random.randint(10, size=10)
years = np.arange(2010, 2020)
groups = ['A','A','B','A','B','B','C','A','C','C']
df = pd.DataFrame({'group': groups, 'year': years, 'value': values})
df

We also used NumPy to create the arrays that serve as column values. np.arange returns evenly spaced values within the specified interval (here 2010 through 2019, since the stop value is excluded). np.random.randint returns random integers based on the specified range and size (here ten integers between 0 and 9).
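Because the values are random, the numbers will differ on every run. If reproducible output is needed, one option is a seeded generator. The sketch below is a minimal variant of the sample dataframe; the seed value 42 and the use of np.random.default_rng are our own choices, not part of the original example.

```python
import numpy as np
import pandas as pd

# A seeded generator gives the same "random" values on every run.
# The seed (42) is an arbitrary choice for illustration.
rng = np.random.default_rng(42)

values = rng.integers(10, size=10)   # ten integers from 0 to 9
years = np.arange(2010, 2020)        # 2010 through 2019, stop value excluded
groups = ['A', 'A', 'B', 'A', 'B', 'B', 'C', 'A', 'C', 'C']

df = pd.DataFrame({'group': groups, 'year': years, 'value': values})
print(df)
```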

The dataframe contains yearly values for 3 different groups. We may only be interested in the yearly values, but in some cases we also need a cumulative sum. Pandas provides an easy-to-use function for this: cumsum.

df['cumsum'] = df['value'].cumsum()
df

We created a column named “cumsum” that contains the cumulative sum of the numbers in the value column. However, it does not take the groups into consideration. Such cumulative values may be useless in some cases because we cannot distinguish between groups. Don’t worry! There is a very simple and convenient solution: we can apply the groupby function first.

df['cumsum'] = df.groupby('group')['value'].cumsum()
df

We first applied groupby on the “group” column and then the cumsum function. Now the values are summed up within each group. To make the dataframe look nicer, we may want to sort the rows by group instead of year so that we can visually separate the groups.
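To see the difference between the two cumulative sums with numbers that can be checked by hand, here is a small sketch. The values are fixed and made up for illustration, and the column names cumsum_all and cumsum_by_group are hypothetical.

```python
import pandas as pd

# Small hand-checkable data (made up for illustration).
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'A', 'B'],
    'value': [1, 2, 3, 4, 5],
})

# Running total over the whole column, ignoring groups.
df['cumsum_all'] = df['value'].cumsum()

# Running total that restarts for each group; the result keeps
# the original row order and index, so it can be assigned directly.
df['cumsum_by_group'] = df.groupby('group')['value'].cumsum()

print(df)
```

In the grouped column, group A accumulates 1, 3, 7 and group B accumulates 3, 8, each independently of the other.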

df.sort_values(by='group').reset_index()

We applied the sort_values function and reset the index with the reset_index function. As we can see in the returned dataframe, the original index is kept as a column. We can eliminate it by setting the drop parameter of reset_index to True.
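The effect of the drop parameter can be checked on a tiny frame. This is a minimal sketch with made-up data; the variable names kept and dropped are ours.

```python
import pandas as pd

df = pd.DataFrame({'group': ['B', 'A', 'C'], 'value': [1, 2, 3]})

# Default: the old index is preserved as a new column named "index".
kept = df.sort_values(by='group').reset_index()

# drop=True: the old index is discarded and a fresh 0..n-1 index is used.
dropped = df.sort_values(by='group').reset_index(drop=True)

print(kept)
print(dropped)
```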

df = df.sort_values(by='group').reset_index(drop=True)
df

It looks better now. When we add a new column to a dataframe, it is placed at the end by default. However, pandas offers the option to add the new column at any position using the insert function.

new = np.random.randint(5, size=10)
df.insert(2, 'new_col', new)
df

We specified the position by passing an index as the first argument. This value must be an integer. Column indices start from zero, just like row indices. The second argument is the column name, and the third is the object holding the values, which can be a Series or an array-like object.
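Two details of insert are worth seeing in action: it modifies the dataframe in place, and by default it refuses a column name that already exists. The sketch below uses made-up data, and the flag duplicate_rejected is a hypothetical name used only to record the outcome.

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'B'], 'year': [2010, 2011], 'value': [3, 7]})

# Insert at position 2 (third column); the dataframe is changed in place.
df.insert(2, 'new_col', [10, 20])
print(df.columns.tolist())

# Inserting the same name again raises ValueError by default.
try:
    df.insert(0, 'new_col', [0, 0])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```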

Suppose we want to remove a column from a dataframe but also keep that column as a separate series. One way is to assign the column to a series and then use the drop function. A simpler way is to use the pop function.

value = df.pop('value')
df

With one line of code, we remove the value column from the dataframe and store it in a pandas series.
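For comparison, the two approaches described above can be placed side by side. This is a small sketch on made-up data; the names two_step, one_step, value_a and value_b are ours.

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'B'], 'value': [3, 7]})

# Two-step approach: save the column, then drop it.
two_step = df.copy()
value_a = two_step['value']
two_step = two_step.drop(columns='value')

# One-step approach: pop removes the column in place and returns it.
one_step = df.copy()
value_b = one_step.pop('value')

print(value_a.equals(value_b))
print(one_step.columns.tolist())
```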

We sometimes need to filter a dataframe based on a condition or apply a mask to get certain values. One easy way to filter a dataframe is the query function. I will use the sample dataframe we have been using. Let’s first insert the “value” column back:

df.insert(2, 'value', value)
df

The query function is very simple to use: it only requires the condition, passed as a string.

df.query('value < new_col')

It returned the rows in which “value” is less than “new_col”. We can set more complex conditions and also use additional operators.

df.query('2*new_col > value')

We can also combine multiple conditions into one query.

df.query('2*new_col > value & cumsum < 15')
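A query string is a compact way of writing a boolean mask. The sketch below runs the combined condition on made-up data and compares it with the equivalent mask; the variable names by_query and by_mask are ours. Note that the mask version needs parentheses around each comparison, whereas inside query the & operator is given the precedence of and.

```python
import pandas as pd

# Made-up data with the same column names as the running example.
df = pd.DataFrame({
    'value':   [1, 5, 3, 9],
    'new_col': [2, 3, 1, 5],
    'cumsum':  [1, 6, 9, 18],
})

# Filtering with a query string.
by_query = df.query('2*new_col > value & cumsum < 15')

# The same filter written as an explicit boolean mask.
by_mask = df[(2 * df['new_col'] > df['value']) & (df['cumsum'] < 15)]

print(by_query)
print(by_query.equals(by_mask))
```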

There are many aggregation functions that we can use to calculate basic statistics on columns, such as mean, sum, count and so on. We can apply each of these functions to a column. However, in some cases we may need to check more than one statistic. For instance, both count and mean might be important. Instead of applying the functions separately, we can use the agg function that pandas offers to apply multiple aggregation functions at once.

df[['group','value']].groupby('group').agg(['mean','count'])

It makes more sense to see both mean and count together. We can easily detect outlier groups that have extreme mean values but a very low number of observations.
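When the result columns need readable names, agg also accepts named aggregations as keyword arguments. The sketch below uses made-up data, and the output names avg_value and n_obs are hypothetical; it shows the list form next to the named form.

```python
import pandas as pd

# Made-up data: group B has a high mean but only one observation.
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'C', 'C', 'C'],
    'value': [2, 4, 9, 1, 2, 3],
})

# List form: one result column per aggregation function.
stats = df.groupby('group')['value'].agg(['mean', 'count'])

# Named form: keyword = (source column, aggregation function).
named = df.groupby('group').agg(
    avg_value=('value', 'mean'),
    n_obs=('value', 'count'),
)

print(stats)
print(named)
```

Here group B stands out with the highest mean, but the count shows it rests on a single observation.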

