4 Less Known Pandas Functions That Can Make Your Work Easier


Learn Pandas for Data Science


Supercharge your data science projects

Apr 22 · 5 min read

Photo by Jérémy Stenuit on Unsplash

Many data scientists have been using Python as their programming language of choice. As an open-source language, Python has gained considerable popularity by providing a variety of data science-related libraries. Particularly, the pandas library is arguably the most prevalent toolbox among Python-based data scientists.

The pandas library is so well developed that it provides a very large collection of functions for various operations. The drawback of such a powerful toolbox is that some useful functions remain little known to beginners. In this article, I would like to share four such functions.

1. The where() Function

For most datasets that we work with, we have to convert some of the data to get it into an analyzable format. The where() function is useful for replacing the values that don’t satisfy a condition. Let’s consider the following example of its usage. As with all data manipulation steps, we first need to import pandas and numpy.

In the above figure, we created a Series and applied the where() function. Specifically, the typical call takes the form where(cond, other). The cond argument evaluates to boolean values: where they’re True, the original values are kept, and where they’re False, the value specified by the other argument is used instead. In our case, any values below 1000 were kept, while the ones that were equal to or greater than 1000 were set to 1000.
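The original figure is not reproduced here, so the following is only a minimal sketch of the kind of call described above. The Series values are made up for illustration and are not the author’s data.

```python
import pandas as pd

# Hypothetical values; the figure's original data is not available.
s = pd.Series([250, 800, 1000, 1500, 4200])

# Where the condition (s < 1000) is True, the original value is kept;
# where it is False, the value given as `other` (1000) is used instead.
capped = s.where(s < 1000, 1000)

print(capped.tolist())  # [250, 800, 1000, 1000, 1000]
```

Note that where() returns a new Series and leaves s unchanged.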

This function works not only with Series but also with DataFrame. Let’s see a similar usage with a DataFrame. In the example below, all odd numbers in the DataFrame df0 are incremented by 1, while the even values are kept.
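As a rough sketch of the DataFrame case just described (the column names and values below are assumptions, since the original figure is missing), the other argument can itself be a DataFrame of the same shape:

```python
import pandas as pd

# Hypothetical contents for df0; only the name df0 comes from the article.
df0 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Even values satisfy the condition and are kept;
# odd values are replaced by the matching entry of df0 + 1.
df1 = df0.where(df0 % 2 == 0, df0 + 1)

print(df1)
```

Here every cell in df1 ends up even: column a becomes 2, 2, 4 and column b becomes 4, 6, 6.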

2. The pivot_table() Function

Unlike the where() function, the pivot_table() function is only available for DataFrame. It creates a spreadsheet-style pivot table, which makes it a great tool for summarizing, analyzing, and presenting data in a straightforward manner. Its power is best shown with a more realistic example.

In the above figure, we created a DataFrame that consisted of salary and bonus records together with the employees’ gender and department information. We then created a pivot table using the pivot_table() function. Specifically, we passed the salary and bonus columns as the values argument, the department as the index argument, the gender as the columns argument, and [np.mean, np.median, np.amax] as the aggfunc argument.

In the output, you can see a pivot table showing the mean, median, and maximum values of the salary and bonus variables in a 2 (gender) by 2 (department) layout. Some interesting observations include that in Department A, women have higher salaries than men, while the pattern is the opposite in Department B. In both departments, women and men have similar bonuses.
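Since the figure with the original records is not available, here is a minimal sketch of a comparable call. All numbers are invented for illustration and will not reproduce the article’s output. The sketch also passes the aggregations as the strings "mean", "median", and "max" rather than the NumPy functions named above; that is a substitution on my part, chosen because string names are accepted by pivot_table() and avoid deprecation warnings in recent pandas versions.

```python
import pandas as pd

# Hypothetical records; column names follow the article's description.
records = pd.DataFrame({
    "department": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "salary": [5200, 4800, 5400, 5000, 4500, 4900, 4700, 5100],
    "bonus": [300, 310, 320, 290, 250, 260, 270, 240],
})

# Rows are departments, columns are genders, and each of the three
# aggregations is applied to both the salary and bonus columns.
summary = records.pivot_table(
    values=["salary", "bonus"],
    index="department",
    columns="gender",
    aggfunc=["mean", "median", "max"],
)

print(summary)
```

The resulting columns form a three-level index (aggregation, variable, gender), so a single cell can be read with, for example, summary.loc["A", ("mean", "salary", "F")].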

3. The qcut() Function

When we have a dataset that involves ordinal measures, it sometimes makes more sense to create categorical quantiles to identify possible patterns instead of examining these ordinal measures parametrically. Theoretically, we can calculate the quantile cutoffs ourselves and map the data using these cutoffs to create the new categorical variable.

However, this operation is easy to perform with the qcut() function, which discretizes a variable into equal-sized buckets (e.g., quartiles or deciles) based on rank. Let’s see how this function works with the following example.

In the above figure, we created a DataFrame with 3 columns. We were interested in generating quartiles for the var2 column, so we set the q argument to 4 (it can be 10 if you want deciles). We also passed a list of labels to name these quartiles.
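A minimal sketch of such a call follows. Only the column name var2 and the value q=4 come from the text; the other column names, the data, and the label strings are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data; only the column name var2 is taken from the article.
df = pd.DataFrame({
    "var1": list("abcdefgh"),
    "var2": [3, 18, 7, 42, 25, 11, 36, 50],
    "var3": [1.5, 2.0, 0.5, 3.5, 2.5, 1.0, 3.0, 4.0],
})

# q=4 splits var2 into four rank-based buckets of (roughly) equal size;
# labels names the buckets from lowest to highest.
df["var2_quartile"] = pd.qcut(df["var2"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(df[["var2", "var2_quartile"]])
```

With eight distinct values, each quartile receives exactly two rows. The new column has a categorical dtype, so the labels keep their Q1 to Q4 ordering.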

4. The melt() Function

Depending on the tools that data scientists use, some prefer the “wide” format (e.g., one subject one row with multiple variables), while some others prefer the “long” format (e.g., one subject multiple rows with one variable). Thus, it’s not uncommon that we need to do data transformation between these formats.

Unlike the T attribute, which transposes the DataFrame entirely, the melt() function is particularly useful for converting data from the wide to the long format. Let’s see how it works with the following example.

In the above figure, we created a DataFrame in a wide format. Specifically, we have two measures, taken before and after the medicine. We then used the melt() function to produce a long-format DataFrame. We specified the SubjectID as the id_vars, the two measures as the value_vars, and renamed the columns to be more meaningful.
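The wide-to-long conversion described above can be sketched as follows. SubjectID, id_vars, and value_vars come from the text; the measure column names, the values, and the use of var_name and value_name for the renaming are assumptions, since the original figure is not available.

```python
import pandas as pd

# Hypothetical wide-format data: one row per subject, two measures.
wide = pd.DataFrame({
    "SubjectID": [101, 102, 103],
    "before_med": [140, 152, 138],
    "after_med": [128, 141, 135],
})

# Each (subject, measure) pair becomes its own row in the long format.
# var_name and value_name set the names of the two generated columns.
long = wide.melt(
    id_vars="SubjectID",
    value_vars=["before_med", "after_med"],
    var_name="timepoint",
    value_name="measure",
)

print(long)
```

The three-row wide table turns into six rows: all before_med rows come first, followed by the after_med rows.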

Before You Go

There are many more functions in pandas that we can explore. In this article, we just learned four functions that some of us don’t know too well, but they can be very useful in our daily data manipulation work.

I hope that you enjoyed reading this piece. You can find the code on GitHub.

About the Author

I write blogs about Python and data processing and analysis. Just in case you’ve missed some of my earlier blogs, here are the links to some articles that are relevant to the current one.

30 Simple Tricks to Level Up Your Python Coding

Better Python

medium.com

Understand map() function to manipulate pandas Series

Learn the fundamentals of using the map() function to convert the data to the desired format

towardsdatascience.com

A Cheat Sheet on Generating Random Numbers in NumPy

See the most commonly used functions on generating random numbers in NumPy.

towardsdatascience.com
