Transforming Data using Polars
In this chapter, we'll look at how to transform data using Polars in both Python and Rust.
Polars is a "blazing fast DataFrame library" available in both Python and Rust. It is similar to pandas, although has fewer capabilities, however, it supports a wide-variety of common transformation tasks.
It is several times faster than pandas, and is a good choice for data transformation tasks.
The Polars documentation is a great resource for getting started, and the API docs have even more details on syntax.
Let's look at some key differences between the syntax in Python and Rust.
Python
import os

import polars as pl

script_path = os.path.dirname(os.path.realpath(__file__))
bird_path = os.path.join(script_path, "../../../lib/PFW_2016_2020_public.csv")
codes_path = os.path.join(script_path, "../../../lib/species_code.csv")

# read only the columns we need, renaming them to lowercase on the way in
columns = [
    "LATITUDE",
    "LONGITUDE",
    "SUBNATIONAL1_CODE",
    "Month",
    "Day",
    "Year",
    "SPECIES_CODE",
    "HOW_MANY",
    "VALID",
]

birds = pl.read_csv(
    bird_path,
    columns=columns,
    new_columns=[s.lower() for s in columns],
)

# keep just the species code and common name, with lowercase column names
codes = pl.read_csv(codes_path).select(
    [
        pl.col("SPECIES_CODE").alias("species_code"),
        pl.col("PRIMARY_COM_NAME").alias("species_name"),
    ]
)

# filter to valid sightings, aggregate per region and species, sort, and join on names
birds_df = (
    birds.select(
        pl.col(
            [
                "latitude",
                "longitude",
                "subnational1_code",
                "month",
                "day",
                "year",
                "species_code",
                "how_many",
                "valid",
            ]
        )
    )
    .filter(pl.col("valid") == 1)
    .groupby(["subnational1_code", "species_code"])
    .agg(
        [
            pl.sum("how_many").alias("total_species"),
            pl.count("how_many").alias("total_sightings"),
        ]
    )
    .sort("total_species", descending=True)
    .join(codes, on="species_code", how="inner")
)

print(birds_df)
The Python code is very concise: columns can be selected as a list of strings, the sort function takes a simple descending argument, and the general API is very straightforward.
I've also included an attempt at the same logic in pandas. While it is largely similar, there are a few differences, for example in how we filter for valid results.
import os

import pandas as pd

script_path = os.path.dirname(os.path.realpath(__file__))
bird_path = os.path.join(script_path, "../../../lib/PFW_2016_2020_public.csv")
codes_path = os.path.join(script_path, "../../../lib/species_code.csv")

# adding usecols reduces memory usage and runtime from 13s to 7s
birds = pd.read_csv(
    bird_path,
    usecols=[
        "LATITUDE",
        "LONGITUDE",
        "SUBNATIONAL1_CODE",
        "Month",
        "Day",
        "Year",
        "SPECIES_CODE",
        "HOW_MANY",
        "VALID",
    ],
).rename(columns=lambda x: x.lower())

codes = pd.read_csv(codes_path)[["SPECIES_CODE", "PRIMARY_COM_NAME"]].rename(
    columns={"SPECIES_CODE": "species_code", "PRIMARY_COM_NAME": "species_name"}
)

birds = birds[
    [
        "latitude",
        "longitude",
        "subnational1_code",
        "month",
        "day",
        "year",
        "species_code",
        "how_many",
        "valid",
    ]
]

# unlike Polars, filtering for valid rows is done with boolean indexing
birds = birds[birds["valid"] == 1]

# aggregate per region and species, then sort by total birds counted
birds = (
    birds.groupby(["subnational1_code", "species_code"])
    .agg(total_species=("how_many", "sum"), total_sightings=("how_many", "count"))
    .reset_index()
    .sort_values("total_species", ascending=False)
)

birds = pd.merge(birds, codes, on="species_code", how="inner")

print(birds)
Now let's compare the above to Rust code.
Rust
use polars::prelude::*;
use std::env;

fn main() {
    let current_dir = env::current_dir().expect("Failed to get current directory");
    let bird_path = current_dir.join("../lib/PFW_2016_2020_public.csv");
    let codes_path = current_dir.join("../lib/species_code.csv");

    // the columns to read from the bird sightings CSV
    let cols = vec![
        "LATITUDE".into(),
        "LONGITUDE".into(),
        "SUBNATIONAL1_CODE".into(),
        "Month".into(),
        "Day".into(),
        "Year".into(),
        "SPECIES_CODE".into(),
        "HOW_MANY".into(),
        "VALID".into(),
    ];

    // read only the selected columns, then switch to the lazy API
    let birds_df = CsvReader::from_path(bird_path)
        .expect("Failed to read CSV file")
        .has_header(true)
        .with_columns(Some(cols.clone()))
        .finish()
        .unwrap()
        .lazy();

    let mut codes_df = CsvReader::from_path(codes_path)
        .expect("Failed to read CSV file")
        .infer_schema(None)
        .has_header(true)
        .finish()
        .unwrap();

    // keep just the species code and common name, with lowercase column names
    codes_df = codes_df
        .clone()
        .lazy()
        .select([
            col("SPECIES_CODE").alias("species_code"),
            col("PRIMARY_COM_NAME").alias("species_name"),
        ])
        .collect()
        .unwrap();

    // lowercase the column names, filter to valid sightings, aggregate, and sort
    let birds_df = birds_df
        .rename(cols.clone(), cols.into_iter().map(|x| x.to_lowercase()))
        .filter(col("valid").eq(lit(1)))
        .groupby(["subnational1_code", "species_code"])
        .agg(&[
            col("how_many").sum().alias("total_species"),
            col("how_many").count().alias("total_sightings"),
        ])
        .sort(
            "total_species",
            SortOptions {
                descending: true,
                nulls_last: false,
                multithreaded: true,
            },
        )
        .collect()
        .unwrap();

    // join the aggregates to the species names
    let joined = birds_df
        .join(
            &codes_df,
            ["species_code"],
            ["species_code"],
            JoinType::Inner,
            None,
        )
        .unwrap();

    println!("{}", joined);
}
In Rust, the code is roughly 75% longer and the syntax is more verbose. There are a lot of unwrap calls to handle errors, although in a real application many of these could be replaced with the ? operator, as sketched below. The sort function takes a SortOptions struct rather than a simple descending argument, which adds a bit more verbosity. Overall, though, the API is very similar.
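To make the error-handling point concrete, here is a minimal sketch of how the CSV-reading step could propagate errors with ? instead of unwrap and expect. It assumes a Polars version where CsvReader::from_path and the PolarsResult alias are available (matching the example above); load_codes is a hypothetical helper name, not something from the chapter's code.

use polars::prelude::*;

// Hypothetical helper: read the species-code CSV, select and rename two columns,
// and bubble any Polars error up to the caller via `?`.
fn load_codes(path: &str) -> PolarsResult<DataFrame> {
    CsvReader::from_path(path)?
        .has_header(true)
        .finish()?
        .lazy()
        .select([
            col("SPECIES_CODE").alias("species_code"),
            col("PRIMARY_COM_NAME").alias("species_name"),
        ])
        .collect()
}

fn main() -> PolarsResult<()> {
    let codes_df = load_codes("../lib/species_code.csv")?;
    println!("{}", codes_df);
    Ok(())
}

Returning PolarsResult<()> from main lets any read, parse, or join error surface as an ordinary error value instead of a panic.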
Benchmarks
Let's look at some benchmarks for Polars in both Python and Rust, as well as the similar code in pandas.
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
../wxrs/target/release/ch5 | 473.9 ± 5.8 | 461.3 | 480.7 | 1.00 |
python ../wxpy/wxpy/ch5/ch5.py | 764.8 ± 26.8 | 732.9 | 815.2 | 1.61 ± 0.06 |
python ../wxpy/wxpy/ch5/ch5_pandas.py | 5644.0 ± 39.2 | 5584.9 | 5710.0 | 11.91 ± 0.17 |
The Rust version is again the fastest. The Python Polars code is about 1.6x slower than the Rust code, but the pandas code is exceptionally slow, taking over 5 seconds to complete while both Polars versions finish in under a second.
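As an aside, the timings above measure whole program runs, Python interpreter start-up included. For a rough in-process look at just the transformation step in the Rust version, a small helper around std::time::Instant works; this is an illustrative sketch and not how the table above was produced.

use std::time::Instant;

// Illustrative helper: run a closure, print the elapsed time in milliseconds,
// and hand back whatever the closure returned.
fn time_it<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("{label}: {:.1} ms", start.elapsed().as_secs_f64() * 1000.0);
    result
}

// usage (hypothetical): let joined = time_it("aggregate + join", || build_joined_df());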