import palmerpenguins
import tidypolars as tp
from tidypolars import col
Let’s use tidypolars more
Let’s use tidypolars more
In this blogpost I’m going to show how to perform the same task with pandas, the most popular library for data analysis in Python, and tidypolars, a new library to do data analysis with tabular data inspired on the tidyverse.
The task consists of computing a variable transformation that relies on grouped aggregations. In particular, we will be computing the standardized version of a numeric variable by group.
I’m going to work with the famous palmer penguins dataset. In Python this can be loaded very easily thanks to the palmerpenguins
library.
= palmerpenguins.load_penguins()
data data
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
339 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
340 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
341 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
342 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |
343 | Chinstrap | Dream | 50.2 | 18.7 | 198.0 | 3775.0 | female | 2009 |
344 rows × 8 columns
Data wrangling with pandas
Before seeing an example using tidypolars I’m going to perform some basic data wrangling with pandas. One of the first things one usually do with a data frame is exploring its first rows visually. The .head()
method comes very handy.
data.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
I’m going to work with the species
, sex
, and body_mass_g
columns only. To subset the dataframe we only need to pass the name of these columns as a list within brackets.
= data[["species", "sex", "body_mass_g"]].reset_index(drop=True)
data data.head()
species | sex | body_mass_g | |
---|---|---|---|
0 | Adelie | male | 3750.0 |
1 | Adelie | female | 3800.0 |
2 | Adelie | female | 3250.0 |
3 | Adelie | NaN | NaN |
4 | Adelie | female | 3450.0 |
I add the .reset_index(drop=True)
to avoid some SettingWithCopyWarning
s later. I also want to drop any observations with missing values so I use .dropna()
to do that.
= data.dropna() data
Suppose now I want to standardize the variable body_mass_g
. Pandas vectorized operations make it extremely easy. One can save the result in a new column in a very intuitive way as well.
"body_mass_z"] = (data["body_mass_g"] - data["body_mass_g"].mean()) / data["body_mass_g"].std()
data[ data.head()
species | sex | body_mass_g | body_mass_z | |
---|---|---|---|---|
0 | Adelie | male | 3750.0 | -0.567621 |
1 | Adelie | female | 3800.0 | -0.505525 |
2 | Adelie | female | 3250.0 | -1.188572 |
4 | Adelie | female | 3450.0 | -0.940192 |
5 | Adelie | male | 3650.0 | -0.691811 |
We could consider it is more appropiate to standardize considering the species
variable. We still perform the same operation than above, but we do it within each group.
The way to perform such operations in pandas is to use the .groupby()
method. Then one can select the desired column and compute the aggregation.
For example, to compute the mean body mass by species we can do the following.
"species")["body_mass_g"].mean() data.groupby(
species
Adelie 3706.164384
Chinstrap 3733.088235
Gentoo 5092.436975
Name: body_mass_g, dtype: float64
We obtained a pandas Series with three values, the mean for each species. If we want to obtain a Series of the same length than the original data, it is, for each row the mean of the species the observation belongs to, we can use .transform()
.
"species")["body_mass_g"].transform("mean") data.groupby(
0 3706.164384
1 3706.164384
2 3706.164384
4 3706.164384
5 3706.164384
...
339 3733.088235
340 3733.088235
341 3733.088235
342 3733.088235
343 3733.088235
Name: body_mass_g, Length: 333, dtype: float64
The same can be done with other transformations such as the standard deviation.
"species")["body_mass_g"].transform("std") data.groupby(
0 458.620135
1 458.620135
2 458.620135
4 458.620135
5 458.620135
...
339 384.335081
340 384.335081
341 384.335081
342 384.335081
343 384.335081
Name: body_mass_g, Length: 333, dtype: float64
Now, putting all the pieces together, we can compute the standardized body mass by species.
"body_mass_z"] = (
data["body_mass_g"] - data.groupby("species")["body_mass_g"].transform("mean"))
(data[/ data.groupby("species")["body_mass_g"].transform("std")
) data.head()
species | sex | body_mass_g | body_mass_z | |
---|---|---|---|---|
0 | Adelie | male | 3750.0 | 0.095582 |
1 | Adelie | female | 3800.0 | 0.204604 |
2 | Adelie | female | 3250.0 | -0.994645 |
4 | Adelie | female | 3450.0 | -0.558555 |
5 | Adelie | male | 3650.0 | -0.122464 |
It’s also possible to add more variables to the groups. For example, this is how we can perform the same compution considering groups given by species and sex.
"body_mass_z"] = (
data["body_mass_g"] - data.groupby(["species", "sex"])["body_mass_g"].transform("mean"))
(data[/ data.groupby(["species", "sex"])["body_mass_g"].transform("std")
) data.head()
species | sex | body_mass_g | body_mass_z | |
---|---|---|---|---|
0 | Adelie | male | 3750.0 | -0.846261 |
1 | Adelie | female | 3800.0 | 1.600580 |
2 | Adelie | female | 3250.0 | -0.441145 |
4 | Adelie | female | 3450.0 | 0.301301 |
5 | Adelie | male | 3650.0 | -1.134602 |
And finally, I can sort by the standardized body mass in ascending order.
"body_mass_z") data.sort_values(
species | sex | body_mass_g | body_mass_z | |
---|---|---|---|---|
314 | Chinstrap | female | 2700.0 | -2.899080 |
192 | Gentoo | female | 3950.0 | -2.591611 |
195 | Gentoo | male | 4750.0 | -2.346530 |
298 | Chinstrap | female | 2900.0 | -2.198147 |
119 | Adelie | male | 3325.0 | -2.071711 |
... | ... | ... | ... | ... |
114 | Adelie | female | 3900.0 | 1.971803 |
109 | Adelie | male | 4775.0 | 2.109234 |
284 | Chinstrap | female | 4150.0 | 2.182685 |
313 | Chinstrap | male | 4800.0 | 2.377631 |
169 | Gentoo | male | 6300.0 | 2.603039 |
333 rows × 4 columns
Data wrangling with tidypolars
Now it’s time to see the same operations performed with tidypolars. Tidypolars is inspired on the tidyverse
, a set of packages following a consistent design phillosophy that has revolutionated the way we do data science in R and other languages as well.
Its description says > tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users.
Tidypolars does not rely on any pandas data structure because it’s built on top of polars, not pandas. In addition, it works with a new data frame structure called Tibble
, borrowing its name from the tibble
in the R package of the same name.
We can convert a pandas DataFrame to a tibble with the .from_pandas()
function.
= tp.from_pandas(palmerpenguins.load_penguins())
tibble type(tibble)
tidypolars.tibble.Tibble
We still have a .head()
method that prints the first rows. The representation is very similar to a pandas data frame.
tibble.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | str | i64 |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181.0 | 3750.0 | "male" | 2007 |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186.0 | 3800.0 | "female" | 2007 |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195.0 | 3250.0 | "female" | 2007 |
"Adelie" | "Torgersen" | null | null | null | null | null | 2007 |
"Adelie" | "Torgersen" | 36.7 | 19.3 | 193.0 | 3450.0 | "female" | 2007 |
To filter rows we can use the .filter()
method. In the next chunk of code we use col("*")
to select all columns, tp.is_not_null()
to flag observations with non-null values, and .filter()
to use those booleans to actually perform the filtering.
Update: See this issue where the main author of tidypolars lets me know that we can use the .drop_null()
method instead.
tibble.drop_null()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | str | i64 |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181.0 | 3750.0 | "male" | 2007 |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186.0 | 3800.0 | "female" | 2007 |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195.0 | 3250.0 | "female" | 2007 |
"Adelie" | "Torgersen" | 36.7 | 19.3 | 193.0 | 3450.0 | "female" | 2007 |
"Adelie" | "Torgersen" | 39.3 | 20.6 | 190.0 | 3650.0 | "male" | 2007 |
"Adelie" | "Torgersen" | 38.9 | 17.8 | 181.0 | 3625.0 | "female" | 2007 |
"Adelie" | "Torgersen" | 39.2 | 19.6 | 195.0 | 4675.0 | "male" | 2007 |
"Adelie" | "Torgersen" | 41.1 | 17.6 | 182.0 | 3200.0 | "female" | 2007 |
"Adelie" | "Torgersen" | 38.6 | 21.2 | 191.0 | 3800.0 | "male" | 2007 |
"Adelie" | "Torgersen" | 34.6 | 21.1 | 198.0 | 4400.0 | "male" | 2007 |
"Adelie" | "Torgersen" | 36.6 | 17.8 | 185.0 | 3700.0 | "female" | 2007 |
"Adelie" | "Torgersen" | 38.7 | 19.0 | 195.0 | 3450.0 | "female" | 2007 |
... | ... | ... | ... | ... | ... | ... | ... |
"Chinstrap" | "Dream" | 45.2 | 16.6 | 191.0 | 3250.0 | "female" | 2009 |
"Chinstrap" | "Dream" | 49.3 | 19.9 | 203.0 | 4050.0 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.2 | 18.8 | 202.0 | 3800.0 | "male" | 2009 |
"Chinstrap" | "Dream" | 45.6 | 19.4 | 194.0 | 3525.0 | "female" | 2009 |
"Chinstrap" | "Dream" | 51.9 | 19.5 | 206.0 | 3950.0 | "male" | 2009 |
"Chinstrap" | "Dream" | 46.8 | 16.5 | 189.0 | 3650.0 | "female" | 2009 |
"Chinstrap" | "Dream" | 45.7 | 17.0 | 195.0 | 3650.0 | "female" | 2009 |
"Chinstrap" | "Dream" | 55.8 | 19.8 | 207.0 | 4000.0 | "male" | 2009 |
"Chinstrap" | "Dream" | 43.5 | 18.1 | 202.0 | 3400.0 | "female" | 2009 |
"Chinstrap" | "Dream" | 49.6 | 18.2 | 193.0 | 3775.0 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.8 | 19.0 | 210.0 | 4100.0 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.2 | 18.7 | 198.0 | 3775.0 | "female" | 2009 |
It’s very easy to select columns with the .select()
method.
"species", "sex", "body_mass_g") tibble.select(
species | sex | body_mass_g |
---|---|---|
str | str | f64 |
"Adelie" | "male" | 3750.0 |
"Adelie" | "female" | 3800.0 |
"Adelie" | "female" | 3250.0 |
"Adelie" | null | null |
"Adelie" | "female" | 3450.0 |
"Adelie" | "male" | 3650.0 |
"Adelie" | "female" | 3625.0 |
"Adelie" | "male" | 4675.0 |
"Adelie" | null | 3475.0 |
"Adelie" | null | 4250.0 |
"Adelie" | null | 3300.0 |
"Adelie" | null | 3700.0 |
... | ... | ... |
"Chinstrap" | "female" | 3250.0 |
"Chinstrap" | "male" | 4050.0 |
"Chinstrap" | "male" | 3800.0 |
"Chinstrap" | "female" | 3525.0 |
"Chinstrap" | "male" | 3950.0 |
"Chinstrap" | "female" | 3650.0 |
"Chinstrap" | "female" | 3650.0 |
"Chinstrap" | "male" | 4000.0 |
"Chinstrap" | "female" | 3400.0 |
"Chinstrap" | "male" | 3775.0 |
"Chinstrap" | "male" | 4100.0 |
"Chinstrap" | "female" | 3775.0 |
What’s even better, we can chain these operations. This is where one can start seeing how powerful this approach is.
(
tibble
.drop_null()"species", "sex", "body_mass_g")
.select( )
species | sex | body_mass_g |
---|---|---|
str | str | f64 |
"Adelie" | "male" | 3750.0 |
"Adelie" | "female" | 3800.0 |
"Adelie" | "female" | 3250.0 |
"Adelie" | "female" | 3450.0 |
"Adelie" | "male" | 3650.0 |
"Adelie" | "female" | 3625.0 |
"Adelie" | "male" | 4675.0 |
"Adelie" | "female" | 3200.0 |
"Adelie" | "male" | 3800.0 |
"Adelie" | "male" | 4400.0 |
"Adelie" | "female" | 3700.0 |
"Adelie" | "female" | 3450.0 |
... | ... | ... |
"Chinstrap" | "female" | 3250.0 |
"Chinstrap" | "male" | 4050.0 |
"Chinstrap" | "male" | 3800.0 |
"Chinstrap" | "female" | 3525.0 |
"Chinstrap" | "male" | 3950.0 |
"Chinstrap" | "female" | 3650.0 |
"Chinstrap" | "female" | 3650.0 |
"Chinstrap" | "male" | 4000.0 |
"Chinstrap" | "female" | 3400.0 |
"Chinstrap" | "male" | 3775.0 |
"Chinstrap" | "male" | 4100.0 |
"Chinstrap" | "female" | 3775.0 |
We then have the .summarise()
method to compute summaries (or aggregations) by groups. Note we use functions available in the tidypolars namespace such as tp.mean()
.
(
tibble
.drop_null()"species", "sex", "body_mass_g")
.select(
.summarise(=tp.mean("body_mass_g"),
body_mass_mean="species"
by
) )
species | body_mass_mean |
---|---|
str | f64 |
"Adelie" | 3706.164384 |
"Gentoo" | 5092.436975 |
"Chinstrap" | 3733.088235 |
If we want to get a behavior similar to .groupby()
and .transform()
in pandas, we can use another verb, mutate()
.
(
tibble
.drop_null()"species", "sex", "body_mass_g")
.select(
.mutate(=tp.mean("body_mass_g"),
body_mass_mean="species"
by
) )
species | sex | body_mass_g | body_mass_mean |
---|---|---|---|
str | str | f64 | f64 |
"Gentoo" | "female" | 4500.0 | 5092.436975 |
"Gentoo" | "male" | 5700.0 | 5092.436975 |
"Gentoo" | "female" | 4450.0 | 5092.436975 |
"Gentoo" | "male" | 5700.0 | 5092.436975 |
"Gentoo" | "male" | 5400.0 | 5092.436975 |
"Gentoo" | "female" | 4550.0 | 5092.436975 |
"Gentoo" | "female" | 4800.0 | 5092.436975 |
"Gentoo" | "male" | 5200.0 | 5092.436975 |
"Gentoo" | "female" | 4400.0 | 5092.436975 |
"Gentoo" | "male" | 5150.0 | 5092.436975 |
"Gentoo" | "female" | 4650.0 | 5092.436975 |
"Gentoo" | "male" | 5550.0 | 5092.436975 |
... | ... | ... | ... |
"Adelie" | "female" | 3400.0 | 3706.164384 |
"Adelie" | "male" | 3475.0 | 3706.164384 |
"Adelie" | "female" | 3050.0 | 3706.164384 |
"Adelie" | "male" | 3725.0 | 3706.164384 |
"Adelie" | "female" | 3000.0 | 3706.164384 |
"Adelie" | "male" | 3650.0 | 3706.164384 |
"Adelie" | "male" | 4250.0 | 3706.164384 |
"Adelie" | "female" | 3475.0 | 3706.164384 |
"Adelie" | "female" | 3450.0 | 3706.164384 |
"Adelie" | "male" | 3750.0 | 3706.164384 |
"Adelie" | "female" | 3700.0 | 3706.164384 |
"Adelie" | "male" | 4000.0 | 3706.164384 |
This verb is very powerful. We can compute more complex expressions, such as the one involved in the standardization of a variable.
(
tibble
.drop_null()"species", "sex", "body_mass_g")
.select(
.mutate(=(col("body_mass_g") - tp.mean("body_mass_g")) / tp.sd("body_mass_g"),
body_mass_z="species"
by
) )
species | sex | body_mass_g | body_mass_z |
---|---|---|---|
str | str | f64 | f64 |
"Gentoo" | "female" | 4500.0 | -1.181386 |
"Gentoo" | "male" | 5700.0 | 1.211549 |
"Gentoo" | "female" | 4450.0 | -1.281092 |
"Gentoo" | "male" | 5700.0 | 1.211549 |
"Gentoo" | "male" | 5400.0 | 0.613315 |
"Gentoo" | "female" | 4550.0 | -1.0816 |
"Gentoo" | "female" | 4800.0 | -0.583152 |
"Gentoo" | "male" | 5200.0 | 0.214493 |
"Gentoo" | "female" | 4400.0 | -1.380797 |
"Gentoo" | "male" | 5150.0 | 0.114787 |
"Gentoo" | "female" | 4650.0 | -0.882269 |
"Gentoo" | "male" | 5550.0 | 0.912432 |
... | ... | ... | ... |
"Adelie" | "female" | 3400.0 | -0.667577 |
"Adelie" | "male" | 3475.0 | -0.504043 |
"Adelie" | "female" | 3050.0 | -1.430736 |
"Adelie" | "male" | 3725.0 | 0.041 |
"Adelie" | "female" | 3000.0 | -1.539759 |
"Adelie" | "male" | 3650.0 | -0.122464 |
"Adelie" | "male" | 4250.0 | 1.185808 |
"Adelie" | "female" | 3475.0 | -0.504043 |
"Adelie" | "female" | 3450.0 | -0.558555 |
"Adelie" | "male" | 3750.0 | 0.095582 |
"Adelie" | "female" | 3700.0 | -0.013441 |
"Adelie" | "male" | 4000.0 | 0.640695 |
And finally, we can use the .arrange()
verb to sort observations by the standardized body mass in ascending order.
(
tibble
.drop_null()"species", "sex", "body_mass_g")
.select(
.mutate(=(col("body_mass_g") - tp.mean("body_mass_g")) / tp.sd("body_mass_g"),
body_mass_z="species"
by
)"body_mass_z")
.arrange( )
species | sex | body_mass_g | body_mass_z |
---|---|---|---|
str | str | f64 | f64 |
"Chinstrap" | "female" | 2700.0 | -2.687988 |
"Gentoo" | "female" | 3950.0 | -2.278148 |
"Chinstrap" | "female" | 2900.0 | -2.167609 |
"Gentoo" | "female" | 4150.0 | -1.879326 |
"Adelie" | "female" | 2850.0 | -1.866827 |
"Adelie" | "female" | 2850.0 | -1.866827 |
"Gentoo" | "female" | 4200.0 | -1.7796 |
"Gentoo" | "female" | 4200.0 | -1.7796 |
"Gentoo" | "female" | 4200.0 | -1.7796 |
"Adelie" | "female" | 2900.0 | -1.757804 |
"Adelie" | "female" | 2900.0 | -1.757804 |
"Adelie" | "female" | 2900.0 | -1.757804 |
... | ... | ... | ... |
"Gentoo" | "male" | 6050.0 | 1.909489 |
"Adelie" | "male" | 4600.0 | 1.948967 |
"Adelie" | "male" | 4600.0 | 1.948967 |
"Chinstrap" | "male" | 4500.0 | 1.995425 |
"Adelie" | "male" | 4650.0 | 2.0579 |
"Adelie" | "male" | 4675.0 | 2.112501 |
"Chinstrap" | "male" | 4550.0 | 2.1255 |
"Adelie" | "male" | 4700.0 | 2.167013 |
"Adelie" | "male" | 4725.0 | 2.221524 |
"Adelie" | "male" | 4775.0 | 2.330547 |
"Gentoo" | "male" | 6300.0 | 2.408017 |
"Chinstrap" | "male" | 4800.0 | 2.775994 |
We can make it even clearer if we wrap the standardization operation within a function.
def standardize(name):
return (col(name) - tp.mean(name)) / tp.sd(name)
(
tibble
.drop_null()"species", "sex", "body_mass_g")
.select(
.mutate(=standardize("body_mass_g"),
body_mass_z="species"
by
)"body_mass_z")
.arrange( )
species | sex | body_mass_g | body_mass_z |
---|---|---|---|
str | str | f64 | f64 |
"Chinstrap" | "female" | 2700.0 | -2.687988 |
"Gentoo" | "female" | 3950.0 | -2.278148 |
"Chinstrap" | "female" | 2900.0 | -2.167609 |
"Gentoo" | "female" | 4150.0 | -1.879326 |
"Adelie" | "female" | 2850.0 | -1.866827 |
"Adelie" | "female" | 2850.0 | -1.866827 |
"Gentoo" | "female" | 4200.0 | -1.7796 |
"Gentoo" | "female" | 4200.0 | -1.7796 |
"Gentoo" | "female" | 4200.0 | -1.7796 |
"Adelie" | "female" | 2900.0 | -1.757804 |
"Adelie" | "female" | 2900.0 | -1.757804 |
"Adelie" | "female" | 2900.0 | -1.757804 |
... | ... | ... | ... |
"Gentoo" | "male" | 6050.0 | 1.909489 |
"Adelie" | "male" | 4600.0 | 1.948967 |
"Adelie" | "male" | 4600.0 | 1.948967 |
"Chinstrap" | "male" | 4500.0 | 1.995425 |
"Adelie" | "male" | 4650.0 | 2.0579 |
"Adelie" | "male" | 4675.0 | 2.112501 |
"Chinstrap" | "male" | 4550.0 | 2.1255 |
"Adelie" | "male" | 4700.0 | 2.167013 |
"Adelie" | "male" | 4725.0 | 2.221524 |
"Adelie" | "male" | 4775.0 | 2.330547 |
"Gentoo" | "male" | 6300.0 | 2.408017 |
"Chinstrap" | "male" | 4800.0 | 2.775994 |
If we want to carry on the .mutate()
operation grouped by more than one variable, we can simply pass the names in a list.
Comparison
Let’s see both approaches in action so we can better appreciate the differences.
With pandas
= palmerpenguins.load_penguins()
data = data[["species", "sex", "body_mass_g"]].dropna().reset_index(drop=True)
data "body_mass_z"] = (
data["body_mass_g"] - data.groupby(["species", "sex"])["body_mass_g"].transform("mean"))
(data[/ data.groupby(["species", "sex"])["body_mass_g"].transform("std")
)"body_mass_z") data.sort_values(
species | sex | body_mass_g | body_mass_z | |
---|---|---|---|---|
303 | Chinstrap | female | 2700.0 | -2.899080 |
185 | Gentoo | female | 3950.0 | -2.591611 |
188 | Gentoo | male | 4750.0 | -2.346530 |
287 | Chinstrap | female | 2900.0 | -2.198147 |
113 | Adelie | male | 3325.0 | -2.071711 |
... | ... | ... | ... | ... |
108 | Adelie | female | 3900.0 | 1.971803 |
103 | Adelie | male | 4775.0 | 2.109234 |
273 | Chinstrap | female | 4150.0 | 2.182685 |
302 | Chinstrap | male | 4800.0 | 2.377631 |
163 | Gentoo | male | 6300.0 | 2.603039 |
333 rows × 4 columns
With tidypolars
def standardize(name):
return (col(name) - tp.mean(name)) / tp.sd(name)
= tp.from_pandas(palmerpenguins.load_penguins())
tibble
(
tibble
.drop_null()"species", "sex", "body_mass_g")
.select(
.mutate(=standardize("body_mass_g"),
body_mass_z=["species", "sex"]
by
)"body_mass_z")
.arrange( )
species | sex | body_mass_g | body_mass_z |
---|---|---|---|
str | str | f64 | f64 |
"Chinstrap" | "female" | 2700.0 | -2.899 |
"Gentoo" | "female" | 3950.0 | -2.591611 |
"Gentoo" | "male" | 4750.0 | -2.3465 |
"Chinstrap" | "female" | 2900.0 | -2.198147 |
"Adelie" | "male" | 3325.0 | -2.071711 |
"Adelie" | "female" | 2850.0 | -1.926035 |
"Adelie" | "female" | 2850.0 | -1.926035 |
"Chinstrap" | "male" | 3250.0 | -1.902511 |
"Gentoo" | "female" | 4150.0 | -1.881329 |
"Gentoo" | "male" | 4925.0 | -1.787708 |
"Adelie" | "male" | 3425.0 | -1.7833 |
"Chinstrap" | "male" | 3300.0 | -1.764442 |
... | ... | ... | ... |
"Adelie" | "male" | 4650.0 | 1.748808 |
"Gentoo" | "male" | 6050.0 | 1.804721 |
"Adelie" | "male" | 4675.0 | 1.820893 |
"Gentoo" | "female" | 5200.0 | 1.847652 |
"Gentoo" | "female" | 5200.0 | 1.847652 |
"Adelie" | "male" | 4700.0 | 1.892979 |
"Adelie" | "male" | 4725.0 | 1.965064 |
"Adelie" | "female" | 3900.0 | 1.971803 |
"Adelie" | "male" | 4775.0 | 2.109234 |
"Chinstrap" | "female" | 4150.0 | 2.182685 |
"Chinstrap" | "male" | 4800.0 | 2.377631 |
"Gentoo" | "male" | 6300.0 | 2.603039 |
Conclusion
Pandas is great, so many people love it, and it will be the most used tool to work with tabular data in Python for a long time. In my case, it made me struggle a lot when I started to do data analysis in Python after working several years with R. Later, and fortunately, I began to understand how it worked and I became better at it.
However, I never felt as comfortable with pandas as I do with dplyr
syntax. Maybe it’s my R background? I don’t know. I just know that one of the things I enjoy the most in R is how easy and clear is to compose data manipulation operations with dplyr
and friends. And having that expressiveness in Python is fantastic.