Learn Pandas for Data Science
4 Lesser-Known Pandas Functions That Can Make Your Work Easier
Supercharge your data science projects
Apr 22 · 5 min read
Many data scientists use Python as their programming language of choice. As an open-source language, Python has gained considerable popularity by providing a variety of data science-related libraries. In particular, the pandas library is arguably the most widely used toolbox among Python-based data scientists.
I have to say that the pandas library is so well developed that it provides a very large collection of functions for various operations. However, the drawback of this powerful toolbox is that some useful functions can be less known to beginners. In this article, I would like to share four such functions.
1. The where() Function
Most of the time, we have to convert the dataset that we’re working with into an analyzable format. The where() function is useful for replacing the values that don’t satisfy a condition. Let’s consider the following example of its usage. As with all data manipulation steps, we first need to import pandas and numpy.
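The original figure isn’t reproduced here, so the snippet below is only a minimal sketch of the idea. The Series values and the names s and capped are made up for illustration; the only detail taken from the article is that values below 1000 are kept and everything else becomes 1000.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen so that some fall on both sides of 1000
s = pd.Series(np.array([250, 999, 1000, 1800, 4200]))

# Keep values where the condition is True; use 1000 where it is False
capped = s.where(s < 1000, 1000)

print(capped.tolist())  # [250, 999, 1000, 1000, 1000]
```

Note that where() returns a new Series and leaves s unchanged.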
In the example above, we created a Series and applied the where() function. Specifically, the call signature of this function is where(condition, other). In this call, the condition argument evaluates to boolean values: where they’re True, the original values are kept, and where they’re False, the value specified by the other argument is used. In our case, any values below 1000 were kept, while the ones that were equal to or greater than 1000 were set to 1000.
This function can be used not only with Series but also with DataFrame objects. Let’s see a similar usage with a DataFrame. In the example below, all the odd numbers in the DataFrame df0 are incremented by 1, while the even values are kept.
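Here is a rough sketch of that DataFrame case. The contents of df0 (the integers 1 through 9) and its column names are assumptions, since the original figure isn’t available; only the odd-plus-one rule comes from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 DataFrame holding the integers 1..9
df0 = pd.DataFrame(np.arange(1, 10).reshape(3, 3), columns=["a", "b", "c"])

# Keep even values (condition True); replace odd values with value + 1
df1 = df0.where(df0 % 2 == 0, df0 + 1)

print(df1.values.tolist())  # [[2, 2, 4], [4, 6, 6], [8, 8, 10]]
```

Here the other argument is itself a DataFrame (df0 + 1), so each replacement value is taken from the matching position.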
2. The pivot_table() Function
Unlike the where() function, the pivot_table() function is only available to DataFrame. This function creates a spreadsheet-style pivot table, which makes it a great tool to summarize, analyze, and present data in a straightforward manner. Its power is best shown with a more realistic example.
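The sketch below stands in for the original figure. All of the records are invented for illustration, as are the lowercase column names and the variable names df and pivot. One deliberate difference: the aggregations are passed as the strings "mean", "median", and "max" instead of NumPy functions, because recent pandas releases may emit a warning when NumPy callables are passed to aggfunc.

```python
import pandas as pd

# Hypothetical employee records (not the article's data)
df = pd.DataFrame({
    "department": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "salary": [6000, 5000, 6400, 5200, 4800, 5600, 5000, 5800],
    "bonus": [500, 520, 540, 500, 400, 420, 440, 400],
})

# Rows are departments, columns are genders, and each cell is aggregated
# three ways for both salary and bonus
pivot = df.pivot_table(
    values=["salary", "bonus"],
    index="department",
    columns="gender",
    aggfunc=["mean", "median", "max"],
)

print(pivot)
# The columns form a three-level index: (aggregation, variable, gender)
print(pivot.loc["A", ("mean", "salary", "F")])  # 6200.0
```

With two departments, two genders, two variables, and three aggregations, the result has 2 rows and 12 columns.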
In the example above, we created a DataFrame that consisted of salary and bonus records together with the employees’ gender and department information. We then created a pivot table using the pivot_table() function. Specifically, we passed the salary and bonus columns to the values argument, the department column to the index argument, the gender column to the columns argument, and [np.mean, np.median, np.amax] to the aggfunc argument.
In the output, you can see that the pivot table gives us 2 (department) by 2 (gender) tables of the mean, median, and maximum values for the salary and bonus variables. Some interesting observations: in Department A, women have higher salaries than men, while the pattern is reversed in Department B. In both departments, women and men have similar bonuses.
3. The qcut() Function
When we have a dataset that involves ordinal measures, it sometimes makes more sense to create categorical quantiles to identify possible patterns instead of examining these ordinal measures parametrically. Theoretically, we can calculate the quantile cutoffs ourselves and map the data using these cutoffs to create the new categorical variable.
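To make the manual route concrete, here is one possible sketch: compute the cutoffs with quantile() and then bin the data with cut(). The values and the names values, cutoffs, and manual are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical measurements
values = pd.Series([12, 45, 7, 88, 33, 61, 5, 70])

# Step 1: compute the quartile cutoffs ourselves (min, 25%, 50%, 75%, max)
cutoffs = values.quantile([0, 0.25, 0.5, 0.75, 1.0]).tolist()

# Step 2: map each value to its bin; include_lowest keeps the minimum in Q1
manual = pd.cut(values, bins=cutoffs, labels=["Q1", "Q2", "Q3", "Q4"],
                include_lowest=True)

print(manual.tolist())  # ['Q2', 'Q3', 'Q1', 'Q4', 'Q2', 'Q3', 'Q1', 'Q4']
```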
However, this operation is much easier with the qcut() function, which discretizes a variable into equal-sized buckets (e.g., quartiles and deciles) based on rank or sample quantiles. Let’s see how this function works with the following example.
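As a stand-in for the original figure, here is a small sketch. The column names var1 and var3, all of the values, the label names, and the new column name var2_quartile are assumptions; only the var2 column name and q=4 come from the article.

```python
import pandas as pd

# Hypothetical DataFrame with 3 columns
df = pd.DataFrame({
    "var1": range(1, 9),
    "var2": [12, 45, 7, 88, 33, 61, 5, 70],
    "var3": list("abcdefgh"),
})

# Split var2 into 4 equal-sized buckets and name each bucket
df["var2_quartile"] = pd.qcut(df["var2"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(df)
print(df["var2_quartile"].value_counts().sort_index().tolist())  # [2, 2, 2, 2]
```

The result is a categorical column, and with these 8 values each quartile receives exactly 2 rows.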
In the example above, we created a DataFrame with 3 columns. We were interested in generating quartiles for the var2 column, so we set the q argument to 4 (it can be 10 if you want deciles). We also passed a list to the labels argument to name these quartiles.
4. The melt() Function
Depending on the tools that data scientists use, some prefer the “wide” format (e.g., one subject one row with multiple variables), while some others prefer the “long” format (e.g., one subject multiple rows with one variable). Thus, it’s not uncommon that we need to do data transformation between these formats.
Unlike the T attribute, which transposes the DataFrame entirely, the melt() function is particularly useful for converting data from the wide to the long format. Let’s see how it works with the following example.
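The snippet below is a minimal sketch in place of the original figure. The measurement values, the column names before_med and after_med, and the new column names time and measure are assumptions; SubjectID is the only name taken from the article.

```python
import pandas as pd

# Hypothetical wide-format data: one row per subject
wide_df = pd.DataFrame({
    "SubjectID": [1, 2, 3],
    "before_med": [120, 135, 128],
    "after_med": [115, 130, 121],
})

# Convert to long format: one row per subject per measurement
long_df = wide_df.melt(
    id_vars="SubjectID",
    value_vars=["before_med", "after_med"],
    var_name="time",       # holds the former column names
    value_name="measure",  # holds the former cell values
)

print(long_df)
```

The 3 rows by 2 measures in the wide table become 6 rows in the long table, with the before_med rows listed first.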
In the example above, we created a DataFrame in a wide format. Specifically, we have two measures, taken before and after taking the medicine. We then used the melt() function to produce a long-format DataFrame. We specified SubjectID as the id_vars, the two measures as the value_vars, and renamed the columns to be more meaningful.
Before You Go
There are many more functions in pandas that we can explore. In this article, we just learned four functions that some of us don’t know too well, but they can be very useful in our daily data manipulation work.
I hope that you enjoyed reading this piece. You can find the code on GitHub.
About the Author
I write blogs about Python and data processing and analysis. Just in case you’ve missed some of my earlier blogs, here are the links to some articles that are relevant to the current one.
30 Simple Tricks to Level Up Your Python Coding