Transforming Data using Polars

In this chapter, we'll look at how to transform data using Polars in both Python and Rust.

Polars is a "blazing fast DataFrame library" available in both Python and Rust. It is similar to pandas and, although it has fewer capabilities, it supports a wide variety of common transformation tasks.

In the benchmark at the end of this chapter it runs about seven times faster than pandas, which makes it a good choice for data transformation tasks.

The Polars documentation is a great resource for getting started, and the API docs have even more details on syntax.

Let's look at some key differences between the syntax in Python and Rust.

Python

import os

import polars as pl

script_path = os.path.dirname(os.path.realpath(__file__))
bird_path = os.path.join(script_path, "../../../lib/PFW_2016_2020_public.csv")
codes_path = os.path.join(script_path, "../../../lib/species_code.csv")

columns = [
    "LATITUDE",
    "LONGITUDE",
    "SUBNATIONAL1_CODE",
    "Month",
    "Day",
    "Year",
    "SPECIES_CODE",
    "HOW_MANY",
    "VALID",
]

birds = pl.read_csv(
    bird_path,
    columns=columns,
    new_columns=[s.lower() for s in columns],
)

codes = pl.read_csv(codes_path).select(
    [
        pl.col("SPECIES_CODE").alias("species_code"),
        pl.col("PRIMARY_COM_NAME").alias("species_name"),
    ]
)

birds_df = (
    birds.select(
        pl.col(
            [
                "latitude",
                "longitude",
                "subnational1_code",
                "month",
                "day",
                "year",
                "species_code",
                "how_many",
                "valid",
            ]
        )
    )
    .filter(pl.col("valid") == 1)
    .groupby(["subnational1_code", "species_code"])
    .agg(
        [
            pl.sum("how_many").alias("total_species"),
            pl.count("how_many").alias("total_sightings"),
        ]
    )
    .sort("total_species", descending=True)
    .join(codes, on="species_code", how="inner")
)

print(birds_df)

The Python code is very concise: columns can be selected with a list of strings, the sort function takes a simple descending argument, and the general API is straightforward.

I've also included an attempt at the same pipeline in pandas. It is largely similar, with a few differences: for example, valid rows are selected with a boolean mask rather than a filter expression.

import os

import pandas as pd

script_path = os.path.dirname(os.path.realpath(__file__))
bird_path = os.path.join(script_path, "../../../lib/PFW_2016_2020_public.csv")
codes_path = os.path.join(script_path, "../../../lib/species_code.csv")

# adding usecols reduced memory usage and cut runtime from 13s to 7s
birds = pd.read_csv(
    bird_path,
    usecols=[
        "LATITUDE",
        "LONGITUDE",
        "SUBNATIONAL1_CODE",
        "Month",
        "Day",
        "Year",
        "SPECIES_CODE",
        "HOW_MANY",
        "VALID",
    ],
).rename(columns=lambda x: x.lower())

codes = pd.read_csv(codes_path)[["SPECIES_CODE", "PRIMARY_COM_NAME"]].rename(
    columns={"SPECIES_CODE": "species_code", "PRIMARY_COM_NAME": "species_name"}
)

birds = birds[
    [
        "latitude",
        "longitude",
        "subnational1_code",
        "month",
        "day",
        "year",
        "species_code",
        "how_many",
        "valid",
    ]
]

birds = birds[birds["valid"] == 1]
birds = (
    birds.groupby(["subnational1_code", "species_code"])
    .agg(total_species=("how_many", "sum"), total_sightings=("how_many", "count"))
    .reset_index()
    .sort_values("total_species", ascending=False)
)

birds = pd.merge(birds, codes, on="species_code", how="inner")


print(birds)
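
The filtering difference is easier to see on its own. In this small sketch (the data is made up; only the `valid` and `how_many` column names come from the script), pandas first builds a boolean Series and then uses it to index the frame, where Polars passes an expression to `filter`. The `query` call is shown only as an alternative spelling, not something the script above uses.

```python
import pandas as pd

# Hypothetical toy rows for illustration.
birds = pd.DataFrame(
    {
        "species_code": ["amerob", "amerob", "blujay"],
        "how_many": [2, 10, 4],
        "valid": [1, 0, 1],
    }
)

# Step 1: comparing a column to a value yields a boolean Series (the mask).
mask = birds["valid"] == 1

# Step 2: indexing with the mask keeps only the rows where it is True.
valid_birds = birds[mask]

# An alternative spelling of the same filter using a query string.
valid_birds_query = birds.query("valid == 1")

print(valid_birds)
```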

Now let's compare the above to Rust code.

Rust

use polars::prelude::*;
use std::env;

fn main() {
    let current_dir = env::current_dir().expect("Failed to get current directory");
    let bird_path = current_dir.join("../lib/PFW_2016_2020_public.csv");
    let codes_path = current_dir.join("../lib/species_code.csv");

    let cols = vec![
        "LATITUDE".into(),
        "LONGITUDE".into(),
        "SUBNATIONAL1_CODE".into(),
        "Month".into(),
        "Day".into(),
        "Year".into(),
        "SPECIES_CODE".into(),
        "HOW_MANY".into(),
        "VALID".into(),
    ];

    let birds_df = CsvReader::from_path(bird_path)
        .expect("Failed to read CSV file")
        .has_header(true)
        .with_columns(Some(cols.clone()))
        .finish()
        .unwrap()
        .lazy();

    let mut codes_df = CsvReader::from_path(codes_path)
        .expect("Failed to read CSV file")
        .infer_schema(None)
        .has_header(true)
        .finish()
        .unwrap();

    codes_df = codes_df
        .clone()
        .lazy()
        .select([
            col("SPECIES_CODE").alias("species_code"),
            col("PRIMARY_COM_NAME").alias("species_name"),
        ])
        .collect()
        .unwrap();

    let birds_df = birds_df
        .rename(cols.clone(), cols.into_iter().map(|x| x.to_lowercase()))
        .filter(col("valid").eq(lit(1)))
        .groupby(["subnational1_code", "species_code"])
        .agg(&[
            col("how_many").sum().alias("total_species"),
            col("how_many").count().alias("total_sightings"),
        ])
        .sort(
            "total_species",
            SortOptions {
                descending: true,
                nulls_last: false,
                multithreaded: true,
            },
        )
        .collect()
        .unwrap();

    let joined = birds_df
        .join(
            &codes_df,
            ["species_code"],
            ["species_code"],
            JoinType::Inner,
            None,
        )
        .unwrap();

    println!("{}", joined);
}

In Rust, the code is 75% longer and the syntax is more verbose. There are a lot of unwrap calls, which panic on an error rather than handling it; in a real application, many of these could be replaced with ? to propagate the error to the caller.

The sort function takes a SortOptions struct, which is a bit more verbose. Overall, the API is very similar.

Benchmarks

Let's look at some benchmarks for Polars in both Python and Rust, as well as similar code in pandas.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
| ../wxrs/target/release/ch5 | 473.9 ± 5.8 | 461.3 | 480.7 | 1.00 |
| python ../wxpy/wxpy/ch5/ch5.py | 764.8 ± 26.8 | 732.9 | 815.2 | 1.61 ± 0.06 |
| python ../wxpy/wxpy/ch5/ch5_pandas.py | 5644.0 ± 39.2 | 5584.9 | 5710.0 | 11.91 ± 0.17 |

The Rust version is the fastest again. The Python Polars code is about 1.6x slower than the Rust code, but the pandas code is far slower still: at over 5.6 seconds it takes roughly 12x as long as the Rust version, while both Polars versions finish in under 1 second.
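
The Relative column is simply each mean divided by the fastest mean. The short sketch below reproduces that arithmetic from the mean runtimes reported above. The dictionary keys are labels chosen here for readability, not names used by the benchmark tool.

```python
# Mean runtimes in milliseconds, copied from the benchmark results above.
mean_ms = {
    "rust-polars": 473.9,
    "python-polars": 764.8,
    "python-pandas": 5644.0,
}

# The fastest command is the baseline with a relative value of 1.00.
baseline = min(mean_ms.values())

# Relative slowdown of each command versus the baseline.
relative = {name: round(ms / baseline, 2) for name, ms in mean_ms.items()}

# How much slower pandas is than Polars when both are called from Python.
pandas_vs_python_polars = round(
    mean_ms["python-pandas"] / mean_ms["python-polars"], 2
)

print(relative)
print(pandas_vs_python_polars)
```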