Pandas Tricks that Expedite Data Analysis Process


Speed up your data analysis process with these simple tricks.

Pandas is a very powerful and versatile Python data analysis library that expedites the preprocessing steps of data science projects. It provides numerous functions and methods that are quite useful in data analysis.

Photo by Daniel Cheung on Unsplash

As always, we start by importing NumPy and pandas.

import numpy as np
import pandas as pd

Let’s create a sample dataframe to work on. Pandas is a versatile library that usually offers multiple ways to do a task. Thus, there are many ways to create a dataframe. One common method is to pass a dictionary that includes columns as key-value pairs.

values = np.random.randint(10, size=10)
years = np.arange(2010, 2020)
groups = ['A','A','B','A','B','B','C','A','C','C']
df = pd.DataFrame({'group':groups, 'year':years, 'value':values})
df

We also used NumPy to create the arrays used as column values. np.arange returns evenly spaced values within a specified interval, and np.random.randint returns random integers based on the specified range and size.
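As noted above, there are many ways to create a dataframe. As one alternative sketch (with made-up sample values, not the article's random data), the same structure can be built from a list of row dictionaries instead of a dictionary of columns:

```python
import pandas as pd

# Alternative construction: a list of row dictionaries,
# one dict per row instead of one dict per column.
rows = [
    {'group': 'A', 'year': 2010, 'value': 7},
    {'group': 'A', 'year': 2011, 'value': 3},
    {'group': 'B', 'year': 2012, 'value': 5},
]
df_alt = pd.DataFrame(rows)
print(df_alt.columns.tolist())  # ['group', 'year', 'value']
```

Pandas infers the column names from the dictionary keys, which is convenient when data arrives as records (e.g. from JSON).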

The dataframe contains yearly values for 3 different groups. We may only be interested in the yearly values, but there are cases in which we also need a cumulative sum. Pandas provides an easy-to-use function to calculate it: cumsum.

df['cumsum'] = df['value'].cumsum()
df

We created a column named “cumsum” which contains the cumulative sum of the numbers in the value column. However, it does not take the groups into consideration. Such cumulative values may be useless in some cases because we cannot distinguish between groups. Don’t worry! There is a very simple and convenient solution for this issue: the groupby function.

df['cumsum'] = df[['value','group']].groupby('group').cumsum()
df

We first applied groupby on the “group” column and then the cumsum function. Now the values are summed up within each group. To make the dataframe look nicer, we may want to sort the values by group instead of year, so that we can visually separate the groups.
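An equivalent, slightly more idiomatic spelling is to select the column after groupby rather than before it. A minimal self-contained sketch with made-up sample values:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [1, 2, 3, 4],
})

# Select the column after groupby, then take the cumulative sum;
# the result aligns with the original row index.
df['cumsum'] = df.groupby('group')['value'].cumsum()
print(df['cumsum'].tolist())  # [1, 3, 3, 7]
```

Group A accumulates 1, 3 and group B accumulates 3, 7, each restarting from its own first row.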

df.sort_values(by='group').reset_index()

We applied the sort_values function and reset the index with the reset_index function. As we can see in the returned dataframe, the original index is kept as a column. We can eliminate it by setting the drop parameter of reset_index to True.

df = df.sort_values(by='group').reset_index(drop=True)
df

It looks better now. When we add a new column to a dataframe, it is added at the end by default. However, pandas offers the option to insert the new column at any position using the insert function.

new = np.random.randint(5, size=10)
df.insert(2, 'new_col', new)
df

We specified the position by passing an index as the first argument. This value must be an integer; column indices start from zero, just like row indices. The second argument is the column name, and the third argument is the values, which can be a Series or an array-like object.

Suppose we want to remove a column from a dataframe but also keep that column as a separate series. One way is to assign the column to a series and then use the drop function. A simpler way is to use the pop function.

value = df.pop('value')
df

With one line of code, we remove the value column from the dataframe and store it in a pandas series.
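For comparison, the two-step alternative mentioned above would look roughly like this; the sample data here is made up and only illustrates that both routes end with the same series and the same remaining columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'value': [10, 20]})

# Two-step version: copy the column out, then drop it.
value = df['value'].copy()
df = df.drop(columns='value')

# pop does both in one call on another copy of the same data.
df2 = pd.DataFrame({'a': [1, 2], 'value': [10, 20]})
value2 = df2.pop('value')

print(value.equals(value2))   # True
print('value' in df.columns)  # False
```

Note that pop mutates the dataframe in place, while drop (by default) returns a new dataframe.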

We sometimes need to filter a dataframe based on a condition or apply a mask to get certain values. One easy way to filter a dataframe is the query function. I will use the sample dataframe we have been working with. Let’s first insert the “value” column back:

df.insert(2, 'value', value)
df

It is very simple to use the query function, which only requires the condition.

df.query('value < new_col')

It returned the rows in which “value” is less than “new_col”. We can set more complex conditions and also use additional operators.

df.query('2*new_col > value')

We can also combine multiple conditions into one query.

df.query('2*new_col > value & cumsum < 15')
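Query strings can also reference Python variables with the @ prefix, which keeps thresholds out of string literals. A minimal sketch with made-up data (not the article's dataframe):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 5, 8], 'new_col': [2, 2, 2]})

threshold = 4
# '@' lets the query string refer to a local Python variable.
result = df.query('value > @threshold')
print(result['value'].tolist())  # [5, 8]
```

This is handy when the same condition is reused with different cutoffs.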

There are many aggregation functions that we can use to calculate basic statistics on columns, such as mean, sum, count, and so on. We can apply each of these functions to a column separately. However, in some cases we may need more than one statistic; for instance, both count and mean might be important. Instead of applying the functions one by one, pandas offers the agg function to apply multiple aggregation functions at once.

df[['group','value']].groupby('group').agg(['mean','count'])

It makes more sense to see both the mean and the count. We can easily detect outlier groups that have extreme mean values but a very low number of observations.
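Passing a list of function names, as above, produces hierarchical column labels. Since pandas 0.25, agg also supports named aggregation, which yields flat, readable column names; a sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B'],
    'value': [2, 4, 10],
})

# Named aggregation: output_column=(input_column, function)
stats = df.groupby('group').agg(
    mean_value=('value', 'mean'),
    count_value=('value', 'count'),
)
print(stats.loc['A', 'mean_value'])   # 3.0
print(stats.loc['B', 'count_value'])  # 1
```

The output columns are named exactly by the keyword arguments, which avoids the MultiIndex columns that a plain list of functions produces.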

