Chapter 1 - Introduction

This book is available for free online at rustfordata.com You can find the source code for book in ./rust4data-book This book is very early, so expect code to change quite a bit.

Feel free to open an issue with any bugs, mistakes, or requests you may have.

What is this book about?

This series of posts is about using Rust for data engineering tasks for people who are already familiar with Python and are curious about Rust. It will not cover every aspect of Rust, or of Python. Instead, it aims to give practical examples of how common engineering tasks done in Python might be done in Rust, along with representative benchmarks.

We will cover a variety of topics, including:

  • Getting data from an API
  • Scraping a website
  • Parsing data and using structs
  • Data Transformation
  • Writing Data
  • Other topics as I feel like it

This book is not an introduction to either Rust or Python. There are many great resources to both out there. If you are not familiar with Python, the official Python Tutorial is a great starting point.

Should I use Rust for Data Engineering?

Probably not. Rust is a great language, it is fun, it is pleasant to use, and it is fast. But choosing a language for a project is more than choosing a language that is fun. There are cautionary tales about using Rust at a startup, and I think they are worth reading.

There are many reasons why you might not want to use Rust for data engineering. The first is that Rust is not as mature as Python. There are many libraries that are missing. For example, as of this writing, there are no Snowflake libraries for querying data in that warehouse. Most people do not know Rust, and it is harder to hire and harder to train people.

There may be good reasons to use Rust for data engineering however. When it comes to cost and performance, Rust is clearly faster than Python for many types of tasks. Memory usage is also much lower, which can be important when you are constrained by small IoT devices, for example.

I can't tell you when to use Rust and when to use Python, but I do believe that by understanding both languages, their merits and pitfalls, you will be better positioned to make that decision for yourself.

Why Should I Learn Rust?

Because it is fun to learn new things. I can't promise you that anything you learn here will ever have a material impact on your life or career. But if you enjoy learning and tinkering, then you might want to tinker with this. If you are like me, and you like learning for learning's sake, then you will enjoy this experience too. I learned vim and lua not because it was useful, but because I was curious about it. I did end up benfiting from it, but I never approached it from a purely utilitarian perspective. There are better ways to spend your time if your goal is purely career advancement.

But, if you are curious about Rust, and if you like to have fun, then I think you will be pleasantly surprised by what Rust has to offer.

Prerequisites

Installing Rust and Python

You will need Rust and Python installed to follow along with the examples here. There are many ways to setup an environment, especially in Python, I've listed the one I like here.

For Rust, you can install Rust by going to rustup.rs.

For Python, I recommend using pyenv along with pyenv-virtualenv. With both installed, here is my typical workflow:

# Create a folder
mkdir -p ~/projects/somepyproj
cd ~/projects/somepyproj

# Install a recent version of Python
pyenv install 3.10

# Create a virtualenv for the project named proj310
pyenv virtualenv 3.10 proj310
pyenv local proj310

# Confirm pyenv is working
pyenv version
> proj310 (set by /Users/username/projects/somepyproj/.python-version)```

Installing the Code

The code for this book can be found here: https://github.com/PedramNavid/rust-for-data

git clone git@github.com:PedramNavid/rust-for-data.git

Fetching from an API

One of the simplest examples to start with is fetching data from an API endpoint. This is often the beginning of many data pipeline journeys.

In our first case, we will use the OpenWeatherMap API to fetch the current weather in a configurable location by providing a latitude and longitude on the command line.

You will need to sign up for a free account to get an API key, once you've signed up, create an API key.

Starting a Project

One of the first differences between Rust and Python you will experience is through initializing a project.

Rust

In Rust, this is as simple as running

# Create the project
cargo init wxrs

# Add a dependency
cd wxrs
cargo add reqwest --features blocking

This will create a new directory called wxrs with a Hello World example.

It will also add the reqwest crate to our dependencies, similar to pip install. Unlike pip install though, this will also update Cargo.toml with our dependency, and create a Cargo.lock file that will lock the reqwest crate to a specific version.

The --features flag is used to express optional compilation features. Reqwest has several options, described in the crate's documentation.

We will use the blocking feature, which will allow us to use the blocking API. It gives us a simpler interface to reqwest instead of futures that require an async runtime. We will eventually use async to show the power of Rust's fearless concurrency.

In Python, we have to manually maintain dependencies and create lockfiles through updating setup.py or pyproject.toml, or using a tool like pipenv or poetry.

We are not using any specific features, but Python too allows optional features, for example pip install snowflake-connector-python[pandas].

# Cargo.toml
[package]
name = "wxrs"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
chrono = "0.4.26"
polars = { version = "0.30.0", features = ["lazy"] }
reqwest = { version = "0.11.18", features = ["blocking", "json"] }
serde = { version = "1.0.164", features = ["derive"] }
serde_json = "1.0.97"
tempfile = "3.6.0"

Python

In Python, we create the directory manually, create a virtual environment, hand-write a pyproject.toml file, name our dependencies, and then install the Python package locally.

# Python venv stuff
pyenv virtualenv wxpy
pyenv shell wxpy
pip install --upgrade pip build

# Create the project directory and pyproject.toml
mkdir -p wxpy/wxpy
cd wxpy
vim pyproject.toml
# pyproject.toml
[project]
name = "wxpy"
version = "0.0.1"
dependencies = [
    "requests",
    'importlib-metadata; python_version<"3.8"',
    "polars",
    "pandas"
]

pip install -e .

Now, admittedly we can skip all of the above steps, create a random file anywhere we want and run it with python myfile.py, but the goal here is to build a more stable distribution that can be packaged, shared, and tested.

Fetching the Weather

Now that we have a project, let's fetch the weather. To fetch from an API, we will use the requests package in Python and the reqwest package in Rust.

We will read the API key from the environment variable, and get the latitude and longitude from the command line arguments.

Given that I'm in California, it only makes sense to start with the Air Pollution API.

Python

In Python, we'll create a folder for this chapter to keep code organized, and then run that file directly.

mkdir wxpy/wxpy/ch3
# wxpy/wxpy/ch3/fetch_api.py
import os
import sys
import requests

API_KEY = os.getenv("OWM_APPID")


def get_air_pollution(lat, lon):
    url = f"http://api.openweathermap.org/data/2.5/air_pollution?lat={lat}&lon={lon}&appid={API_KEY}"
    body = requests.get(url).text
    return body


if __name__ == "__main__":
    usage = f"Usage: python {__file__} <lat> <lon>"

    if not API_KEY:
        print("Please set OWM_APPID environment variable")
        sys.exit(1)

    if len(sys.argv) != 3:
        print(usage)
        sys.exit(1)

    lat = sys.argv[1]
    lon = sys.argv[2]
    body = get_air_pollution(lat, lon)
    print(body)

Rust

In Rust, usually we'll have a main.rs file that runs our code, with additional code imported as modules from other files. There's a great convention for package layouts in Rust.

But since we want to execute our code directly, and we'll have multiple binaries, we'll create a bin folder and save the code for ch3 there.

mkdir wxrs/src/bin
// wxrs/src/bin/ch3.rs
pub fn get_air_pollution(lat: f32, lon: f32) -> String {
    let api_key = std::env::var("OWM_APPID").expect(
        "Environment Variable OWM_APPID not set. Please set it to your
    OpenWeatherMap API key. https://home.openweathermap.org/api_keys",
    );

    let url = format!(
        "http://api.openweathermap.org/data/2.5/air_pollution?lat={}&lon={}&appid={}",
        lat, lon, api_key
    );
    reqwest::blocking::get(url)
        .expect("request failed")
        .text()
        .expect("body failed")
}

pub fn main() {
    let usage = format!("Usage: {} [lat] [lon]", std::env::args().next().unwrap());

    let lat = std::env::args()
        .nth(1)
        .expect(&usage)
        .parse::<f32>()
        .expect(&usage);

    let lon = std::env::args()
        .nth(2)
        .expect(&usage)
        .parse::<f32>()
        .expect(&usage);

    let body = get_air_pollution(lat, lon);
    println!("{}", body);
}

Running the Program

Running the program is simple in both languages. We'll provide the latitude and longitude of beautiful Fairfax, CA, birthplace of mountain biking, and nestled in the foothills of Mount Tamalpais.

Google gives the coordinates as 37.9871 and -122.5889

Python

In Python, we can use -m to run the module directly.

# in wxpy/wxpy
# export OPENWEATHERMAP_API_KEY=your-api-key
python -m wxpy.ch3.fetch_api 37.9871 -122.5889

> {"coord":{"lon":-122.5889,"lat":37.9871},"list":[{"main":{"aqi":2},"components":{"co":178.58,"no":0.1,"no2":0.47,"o3":70.81,"so2":0.64,"pm2_5":2.58,"pm10":4.18,"nh3":0},"dt":1687221287}]}

Rust

In Rust, we must first compile the program before running it. If we run cargo build Rust will create a binary for us in ./target/debug/wxrs

We can also compile and run it with one command cargo run

When using cargo build Rust will a debug version of our application in ./target/debug for both the main.rs file which will be named wxrs as well for any files located in src/bin, such as ch3.rs

# in wxrs/
cargo build
./target/debug/ch3 37.9871 -122.5889
> {"coord":{"lon":-122.5889,"lat":37.9871},"list":[{"main":{"aqi":2},"components":{"co":178.58,"no":0.1,"no2":0.47,"o3":70.81,"so2":0.64,"pm2_5":2.58,"pm10":4.18,"nh3":0},"dt":1687221453}]}

# or
cargo run --bin ch3 37.9871 -122.5889
> {"coord":{"lon":-122.5889,"lat":37.9871},"list":[{"main":{"aqi":2},"components":{"co":178.58,"no":0.1,"no2":0.47,"o3":70.81,"so2":0.64,"pm2_5":2.58,"pm10":4.18,"nh3":0},"dt":1687221453}]}

Discussion

Looking at both programs, we can see a fairly similar approach to solving this problem.

Both programs use an external library or crate (not-so-coincidentally named request/reqwest).

In both programs, we've created a function that takes a latitude and longitude, fetches the results from an API and returns the results as text. We'll cover handling structured data from JSON soon.

Types

One obvious difference is that in Rust, we declare the types of the lat and lon arguments, and in Python we do not. The trouble with talking about types is that it inevitably leads to a discussion of memory, which can devolve into a conversation around null pointer references, which we will largely avoid until the next chapter, but here's a light introduction.

In the Rust code, we've very explicitly defined the types for our function:

#![allow(unused)]
fn main() {
pub fn get_air_pollution(lat: f32, lon: f32) -> String {
}

Both lat and lon are f32 or 32-bit floats. These are floating-point numbers that take exactly 32-bits of memory. The compiler knows exactly how much space to reserve for these values: 32-bits, or 4-bytes.

Given that lat and lon doesn't require much precision beyond a few decimals, f32 seems like the best choice for our code. We could even opt for greater precision by using a 64-bit float or f64 in Rust which would take 8 bytes of memory.

Because we know exactly how much memory we need for these variables, Rust is able to store these values on the heap.

In Python, we don't know how much memory lat and lon need until runtime because Python will accept anything in this function.


def get_air_pollution(lat, lon):

We could pass it a string, numbers, another function, or even None.

>>> def join_two(a, b):
...     return f"a+b={a}+{b}"
...
>>> join_two(1,2)
'a+b=1+2'
>>> join_two(None, None)
'a+b=None+None'
>>> join_two(join_two, join_two)
'a+b=<function join_two at 0x7f8f7f7de980>+<function join_two at 0x7f8f7f7de980>'
>>> join_two(join_two, join_two(join_two, join_two))
'a+b=<function join_two at 0x7f8f7f7de980>+a+b=<function join_two at 0x7f8f7f7de980>+<function join_
two at 0x7f8f7f7de980>'

Even the url line will not fail, because in Python duck-typing allows us great flexibility in what we do with variables. We can pass numbers into an f-string for concatenation just as easily as we can pass characters.

Python will allocate these values on the heap, and it turns out that Python allocates about 24 bytes for each float there. The actual values are stored in a private heap.

Now, the difference between 24 bytes and 8 bytes is trivial for an application such as this, and even on the most memory-constrained devices it's not worth noting. But it's important to know that heap allocation is slower, and even small applications may iterate over millions of values. Small differences can add up.

You might then ask yourself: what about mypy? Doesn't that give us typing? Mypy is a static type checker, but it doesn't change the underlying compilation of Python code. It can provide hints as to what you expect the types to be, but it doesn't change how memory is allocated.

Handling Errors

Another subtle but important difference is the handling of errors. In Python, errors are handled as exceptions that are caught. Knowing when to catch an exception is mostly an art. It's difficult to know which functions throw exceptions, what exceptions to expect, and when to deal with them.

In Rust, errors are handled as values that are returned. This is a much more explicit approach, and it's easier to know what errors to expect and how to handle them. In fact, if a function returns a Result type, the compiler will force you to handle the error. This is a huge benefit to Rust, and it's one of the reasons why Rust is so reliable.

Let's take a closer look at an exception we haven't caught yet. If we run the Python program with invalid arguments, we get a ValueError exception.

python -m wxpy.ch3.fetch_api nice birds
> ❯ python -m wxpy.ch3.fetch_api nice birds
{"cod":"400","message":"wrong latitude"}
./target/debug/ch3 nice birds

> thread 'main' panicked at 'Usage: ./target/debug/ch3 [lat] [lon]: ParseFloatError { kind: Invalid }', src/main.rs:24:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

What happened here?

In Python, we didn't check that our input is valid, and so the application send incorrect values to the API, which returned an error message. Fortunately, we get a million API requests for free a month, so this one doesn't cost us much.

In Rust, the application panicked because it could not parse the inputs we provided as a float. On line 23:24 we call parse on the arguments, and we expect them to be floats.

#![allow(unused)]
fn main() {
    let lat = std::env::args()
        .nth(1)
        .expect(&usage)
        .parse::<f32>()
        .expect(&usage);
}

the expect method tells Rust that if parse failed to convert the input, then the application must panic. We output the usage message and exit.

You will see expect and its cousin unwrap used frequently in Rust. They are useful for debugging, but they are not the best way to handle errors. We'll cover error handling in more detail soon.

Benchmarks

Let me preface this by saying speed isn't everything. No doubt someone familiar in Python will spend far more time learning Rust than they might ever save by running a slightly more optimized program. But it is nice to get a sense of

Let's use hyperfine to benchmark the two programs. We'll run each program 10 times and take the average. Before we benchmark the Rust application, we'll compile it using --release which builds a release rather than a debug version and it should provide us with a faster application.

cargo build --release
# in wxpy
hyperfine --warmup 3 --min-runs 10 \
    'python -m wxpy.ch3.fetch_api 37.9871 -122.5889' \
    './target/release/ch3 37.9871 -122.5889' \
    --export-markdown ../benchmarks/ch3_fetch_api.md
CommandMean [ms]Min [ms]Max [ms]Relative
../wxrs/target/release/ch3 30 -140177.0 ± 6.5166.2189.91.00
python ../wxpy/wxpy/ch3/fetch_api.py 30 -140320.2 ± 46.8261.2379.71.81 ± 0.27

On my system, the Python application took an average of 335ms to complete, while the Rust application was 1.7x faster at 198ms. Memory consumption was also lower in Rust, with the Python application using 26MB vs only 10MB in Rust.

Again, this is a trivial application with trivial requirements and performance is not a key factor in deciding what language to build. But as we build more intensive applications we'll keep an eye on memory and performance to see how the gap changes.

Summary

In this chapter we've built a simple application that fetches data from an API and returns the results. We've seen how Rust and Python differ in their approach to handling errors and types, and we've seen how Rust can be faster and more memory efficient than Python.

Serializing Data

In the last chapter we fetched data from the OpenWeather API in order to get Air Pollution data. The astute observe will have noticed that we parsed the response as pure text, although the response was in JSON format.

The goal of this chapter is to walk through how we would take raw data and serialize it into a structured data format, such as JSON.

We'll dive into theory in a little but let's start with practice.

Serialization

Serialization is the process of taking data and encoding it into a known format that can later be retrieved. There are many ways to encode data, but largely these are broken into human-readable and binary formats.

CSVs, JSON, XML, and YAML are all human-readable serialization formats. Conversely, many binary formats exist, such as Parquet, Avro, and Protcol Buffers. Binary formats trade reduced readability for improved performance and size.

In the end, any data that needs to be persisted outside of a computer's memory requires some type of serialization.

Let's look at how serialization varies across both Rust and Python.

Python

In Python, we can serialize nearly any arbitrary data structure to JSON using the json module.

In [1]: import json

In [2]: my_obj = [{'a': 1, 'b': None}, "foo", "bar", ("baz", "baz")]

In [3]: json.dumps(my_obj)
Out[3]: '[{"a": 1, "b": null}, "foo", "bar", ["baz", "baz"]]'

Here's the updated project code that serializes the response from the OpenWeather API.

import os
import sys
import requests

API_KEY = os.getenv("OWM_APPID")


def get_air_pollution(lat, lon):
    url = f"http://api.openweathermap.org/data/2.5/air_pollution?lat={lat}&lon={lon}&appid={API_KEY}"
    body = requests.get(url).json()
    return body

def parse_air_pollution(body):
    aqi = body["list"][0]["main"]["aqi"]
    components = body["list"][0]["components"]
    return (aqi, components)

if __name__ == "__main__":
    usage = f"Usage: python {__file__} <lat> <lon>"

    if not API_KEY:
        print("Please set OWM_APPID environment variable")
        sys.exit(1)

    if len(sys.argv) != 3:
        print(usage)
        sys.exit(1)

    lat = sys.argv[1]
    lon = sys.argv[2]
    body = get_air_pollution(lat, lon)
    aqi, components = parse_air_pollution(body)

    print(f"Air Quality Index: {aqi}")
    print("Components:")
    for k, v in components.items():
        print(f"  {k}: {v}")

There are a few key things to note here.

first, we're assuming the request was successful and that there is a json response body, and that it can parse correctly. if any of these assumptions are incorrect an exception will be raised, and we have no obvious way of knowing what these exceptions are or which method might raise one.

def parse_air_pollution(body):
    aqi = body["list"][0]["main"]["aqi"]
    components = body["list"][0]["components"]
    return (aqi, components)

When parsing the response, we slice into the response body to get various components. We're explicitly fetching keys from a dictionary under the assumption that the payload is properly formed. There are safer dictionary methods to use, such as .get() which will return None if the key is missing rather than an exception, but in our case an Exception is warranted since we can't do anything with the data if it's missing.

We also haven't explicitly typed the response from the API. This is something we can do with mypy or other tools like pydantic, but the Python interpret itself has no type-guarantees.

Let's look at how we might do this in Rust.

Rust

In Rust, we'll need to install the serde crate as well as the json feature for reqwest.

cargo add serde --features derive
cargo add serde_json
cargo add reqwest --features json

Because Rust is a typed language, we will define the struct that represents the data we expect. The API response looks like the following:

{
    "coord": {
        "lon": -122.5889,
        "lat": 37.9871
    },
    "list": [
        {
            "main": {
                "aqi": 2
            },
            "components": {
                "co": 168.56,
                "no": 0.14,
                "no2": 0.75,
                "o3": 80.11,
                "so2": 0.7,
                "pm2_5": 3.48,
                "pm10": 5.58,
                "nh3": 0
            },
            "dt": 1687308878
        }
    ]
}

We can define a struct that represents this data as follows:

#![allow(unused)]
fn main() {
use serde::{Deserialize, Serialize};
#[derive(Debug, Serialize, Deserialize)]
pub struct AirPollution {
    pub coord: Coord,
    pub list: Vec<List>,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Coord {
    pub lon: f32,
    pub lat: f32,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct List {
    pub main: Main,
    pub components: Components,
    pub dt: usize,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Main {
    pub aqi: u8,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Components {
    pub co: f32,
    pub no: f32,
    pub no2: f32,
    pub o3: f32,
    pub so2: f32,
    pub pm2_5: f32,
    pub pm10: f32,
    pub nh3: f32,
}
}

As you can see, the struct mirrors the underlying JSON structure. The serde crate gives us a lot of flexibility here, in particular the section on Attributes and the Examples are worth spending some time on.

The reqwest crate also provides a json method that will automatically deserialize the response body into a struct.

#![allow(unused)]
fn main() {
pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution {
    let api_key = std::env::var("OWM_APPID").expect(
        "Environment Variable OWM_APPID not set. Please set it to your
    OpenWeatherMap API key. https://home.openweathermap.org/api_keys",
    );

    let url = format!(
        "http://api.openweathermap.org/data/2.5/air_pollution?lat={}&lon={}&appid={}",
        lat, lon, api_key
    );

    reqwest::blocking::get(url)
        .expect("request failed")
        .json()
        .expect("json failed")
}

Our function now returns an AirPollution struct, instead of a String, and reqwest's json method will automatically deserialize the response body to the correct type.

Rust uses type inference to reduce the amount of syntax required. While function parameters and signatures always require types, local variables can usually be inferred by the compiler.

Let's look at how returning a typed Struct changes how we interact with the data

#![allow(unused)]
fn main() {
pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) {
    let main = &body.list[0].main;
    let components = &body.list[0].components;
    (main, components)
}
}

We can access the underlying fields in the struct directly. Unlike a Python dictionary, the compiler will ensure that the fields we're accessing exist.

If we add a missing field, for example:

#![allow(unused)]
fn main() {
let foo = &body.list[0].foo;
}

And run cargo check we'll get the following error:


error[E0609]: no field `foo` on type `List`
  --> src/bin/ch4.rs:65:29
   |
65 |     let foo = &body.list[0].foo;
   |                             ^^^ unknown field
   |
   = note: available fields are: `main`, `components`, `dt`

For more information about this error, try `rustc --explain E0609`.
error: could not compile `wxrs` (bin "ch4") due to previous error

Compare to Python where we'd only get a run-time error if we tried to access a missing field, unless we opt-in to type hints using mypy.

Here's the full Rust code for reference


use serde::{Deserialize, Serialize};
#[derive(Debug, Serialize, Deserialize)]
pub struct AirPollution {
    pub coord: Coord,
    pub list: Vec<List>,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Coord {
    pub lon: f32,
    pub lat: f32,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct List {
    pub main: Main,
    pub components: Components,
    pub dt: usize,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Main {
    pub aqi: u8,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Components {
    pub co: f32,
    pub no: f32,
    pub no2: f32,
    pub o3: f32,
    pub so2: f32,
    pub pm2_5: f32,
    pub pm10: f32,
    pub nh3: f32,
}

pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution {
    let api_key = std::env::var("OWM_APPID").expect(
        "Environment Variable OWM_APPID not set. Please set it to your
    OpenWeatherMap API key. https://home.openweathermap.org/api_keys",
    );

    let url = format!(
        "http://api.openweathermap.org/data/2.5/air_pollution?lat={}&lon={}&appid={}",
        lat, lon, api_key
    );

    reqwest::blocking::get(url)
        .expect("request failed")
        .json()
        .expect("json failed")
}

pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) {
    let main = &body.list[0].main;
    let components = &body.list[0].components;
    (main, components)
}

pub fn main() {
    let usage = format!("Usage: {} [lat] [lon]", std::env::args().next().unwrap());

    let lat = std::env::args()
        .nth(1)
        .expect(&usage)
        .parse::<f32>()
        .expect(&usage);

    let lon = std::env::args()
        .nth(2)
        .expect(&usage)
        .parse::<f32>()
        .expect(&usage);

    let body = get_air_pollution(lat, lon);
    let (main, components) = parse_air_pollution(&body);

    println!("Air Quality Index: {}", main.aqi);

    println!("Carbon Monoxide: {} μg/m³", components.co);
    println!("Nitrogen Monoxide: {} μg/m³", components.no);
    println!("Nitrogen Dioxide: {} μg/m³", components.no2);
    println!("Ozone: {} μg/m³", components.o3);
    println!("Sulfur Dioxide: {} μg/m³", components.so2);
    println!("Particulate Matter < 2.5 μm: {} μg/m³", components.pm2_5);
    println!("Particulate Matter < 10 μm: {} μg/m³", components.pm10);
    println!("Ammonia: {} μg/m³", components.nh3);
}

Serialization Formats

Something worth mentioning about the Rust serde crate is that it does not come with any built-in serialization formats. Instead, it provides a framework for serialization. We installed serde_json but there are many other formats available, such as serde_yaml and serde_avro.

Why Bother?

You might be wondering why we'd go through the trouble of defining a struct and serializing the response body into a struct. In Python, we avoid the boilerplate, we access fields directly, we can throw a little type-hinting at our code, we get to use # type: ignore freely, and if our application crashes, well, we'll just fix it and run it again.

You are absolutely right! This is all true. However, any seasoned Python programmer is also aware of all the ways that poorly typed code can go wrong.

If you've ever created a compute-intensive application that operates on many gigabytes of data, you've probably run into a situation where you've had to re-run the application because it crashed. Type-safety helps prevent these types of issues, but types also provide another nice benefit: improved performance.

The compiler can optimize code based on the types it knows about. In Python, we can use type-hints to help the compiler, but ultimately the Python interpreter is still dynamically resolving types at runtime. In Rust, the compiler knows the types at compile-time and can optimize prior to running.

What's that little & doing?

Ah, yes, the &. Now we are getting into the heart of Rust. Let's look at the code for parsing air pollution again:

#![allow(unused)]
fn main() {
pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) {
    let main = &body.list[0].main;
    let components = &body.list[0].components;
    (main, components)
}
}

parse_air_pollution is a function that takes a reference to an AirPollution struct. The & is the syntax for creating a reference. In Rust, references are a way of passing a value to a function without transferring ownership of the value. This is a key concept in Rust, and it's what allows Rust to guarantee memory safety.

In Python, values are passed around using counters. Every time you use a variable, Python's Garbage Collector keeps track of how many times it's been used. Whenever a function that used a reference exists, the counter is decremented. A Garbage Collector occasionally runs and cleans up all unused references.

In Rust, there is no garbage collector. Instead, the compiler keeps track of the lifetime of every variable. When a variable goes out of scope, the compiler will automatically free the memory associated with the variable.

This means that you cannot use a variable after transferring ownership. For a deeper dive into the concept of ownership, read the Rust Book.

For example, if we tried print the value of body after assigning it, the compiler would give us an error:

#![allow(unused)]
fn main() {
fn parse_air(body: AirPollution) {
    let foo = body;
    println!("{:?}", body);
}
}
error[E0382]: borrow of moved value: `body`
  --> src/bin/ch4.rs:71:22
   |
69 | fn parse_air(body: AirPollution) {
   |              ---- move occurs because `body` has type `AirPollution`, which does not implement the `Copy` trait
70 |     let foo = body;
   |               ---- value moved here
71 |     println!("{:?}", body);
   |                      ^^^^ value borrowed here after move

It's beyond the scope of this post to explain all the details of ownership and references, but it's important to understand that Rust's compiler is keeping track of the lifetime of every variable, and will not allow you to use a variable after it's been moved.

Instead, we can use a reference to a variable. This keeps the underlying data in the same place in memory, but allows us to pass it to a function as a reference to the original value.

#![allow(unused)]
fn main() {
fn parse_air(body: &AirPollution) {
    let foo = body;
    println!("{:?}", body);
}
}

This has some really nice benefits when it comes to processing large amounts of data, as data engineers tend to do.

In Python, it's not always clear when data is being copied, moved, or referenced. In Rust, copying code is very explicit. If we didn't want to borrow a reference in the code above, we could also copy.

#![allow(unused)]
fn main() {
fn parse_air(body: AirPollution) {
    let foo = body.clone();
    println!("{:?}", body);
}
}

For the above code to work, we would also need to implement the Clone trait for the AirPollution struct and all of its fields:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Deserialize)]
pub struct AirPollution {
...

}

Understanding ownership, references, and borrowing can be an uphill battle for new Rust programmers who are used to dynamically-typed languages, but with time and patience, it will come to you too.

Performance

To benchmark our code, we're going to change our code to fetch an entire forecast rather than a single day, increasing the payload from 0.5kb to about 13kb.

In Python, we change the url and then iterate over every element in the list provided.

def get_air_pollution(lat, lon):
    url = f"http://api.openweathermap.org/data/2.5/air_pollution/forecast?lat={lat}&lon={lon}&appid={API_KEY}"
    body = requests.get(url).json()
    return body


def parse_air_pollution(body):
    res = []
    print(body)
    for row in body["list"]:
        res.append((row["main"]["aqi"], row["components"], row["dt"]))
    return res


def print_air_pollution(main, components, dt):
    print("---")
    print(f"Air pollution forecast for {dt}")
    print(f"Air quality index: {main}")
    print("Components:")
    for k, v in components.items():
        print(f"  {k}: {v}")

In Rust, we also change the url and use the common iter().map().collect() pattern.

#![allow(unused)]
fn main() {
pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution {
    let api_key = std::env::var("OWM_APPID").expect(
        "Environment Variable OWM_APPID not set. Please set it to your
    OpenWeatherMap API key. https://home.openweathermap.org/api_keys",
    );

    let url = format!(
        "http://api.openweathermap.org/data/2.5/air_pollution/forecast?lat={}&lon={}&appid={}",
        lat, lon, api_key
    );

    reqwest::blocking::get(url)
        .expect("request failed")
        .json()
        .expect("json failed")
}

pub fn parse_air_pollution(body: AirPollution) -> Vec<(Main, Components, usize)> {
    body.list
        .iter()
        .map(|x| (x.main, x.components, x.dt))
        .collect()
}
}

Here are the results of the benchmarks:

Again, we see a 1.7x improvement in performance, or about 58% faster.

Offline Benchmarks

Benchmarking against a network connection can be a bit iffy. It also makes it hard to test larger and larger payloads, so we'll create a large payload file and use that for an offline benchmark.

I've created a 9mb JSON file that mirrors the payload from the OpenWeather API, and created offline versions of the Rust and Python code to read from a local file. The code for both can be found in the sample repository under wxpy/wxpy/ch4/serialized_offline_benchmark.py and wxrs/src/bin/ch4_offline_benchmark.rs.

Here are the results of the offline benchmarks:

CommandMean [ms]Min [ms]Max [ms]Relative
../wxrs/target/release/ch4_offline_benchmark47.0 ± 0.646.550.61.00
python ../wxpy/wxpy/ch4/serialized_offline_benchmark.py 30 -140248.0 ± 22.4210.8278.25.27 ± 0.48

Rust is now running twice as fast as Python for these larger payloads.

Transforming Data using Polars

In this chapter, we'll look at how to transform data using Polars in both Python and Rust.

Polars is a "blazing fast DataFrame library" available in both Python and Rust. It is similar to pandas, although has fewer capabilities, however, it supports a wide-variety of common transformation tasks.

It is several times faster than pandas, and is a good choice for data transformation tasks.

The Polars documentation is a great resource for getting started, and the API docs have even more details on syntax.

Let's look at some key differences between the syntax in Python and Rust.

Python

import os

import polars as pl

script_path = os.path.dirname(os.path.realpath(__file__))
bird_path = os.path.join(script_path, "../../../lib/PFW_2016_2020_public.csv")
codes_path = os.path.join(script_path, "../../../lib/species_code.csv")

columns = [
    "LATITUDE",
    "LONGITUDE",
    "SUBNATIONAL1_CODE",
    "Month",
    "Day",
    "Year",
    "SPECIES_CODE",
    "HOW_MANY",
    "VALID",
]

birds = pl.read_csv(
    bird_path,
    columns=columns,
    new_columns=[s.lower() for s in columns],
)

codes = pl.read_csv(codes_path).select(
    [
        pl.col("SPECIES_CODE").alias("species_code"),
        pl.col("PRIMARY_COM_NAME").alias("species_name"),
    ]
)

birds_df = (
    birds.select(
        pl.col(
            [
                "latitude",
                "longitude",
                "subnational1_code",
                "month",
                "day",
                "year",
                "species_code",
                "how_many",
                "valid",
            ]
        )
    )
    .filter(pl.col("valid") == 1)
    .groupby(["subnational1_code", "species_code"])
    .agg(
        [
            pl.sum("how_many").alias("total_species"),
            pl.count("how_many").alias("total_sightings"),
        ]
    )
    .sort("total_species", descending=True)
    .join(codes, on="species_code", how="inner")
)

print(birds_df)

The Python code very concise, columns can be selected as a list of strings, the sort function takes a simple descending argument, and the general API is very straightforward.

I've also included an attempt at the same logic in pandas. While largely similar, there are a few differences, for example, in how we filter for valid results.

import os

import pandas as pd

script_path = os.path.dirname(os.path.realpath(__file__))
bird_path = os.path.join(script_path, "../../../lib/PFW_2016_2020_public.csv")
codes_path = os.path.join(script_path, "../../../lib/species_code.csv")

# adding usecols reducing memory usage and runtime from 13s to 7s
birds = pd.read_csv(
    bird_path,
    usecols=[
        "LATITUDE",
        "LONGITUDE",
        "SUBNATIONAL1_CODE",
        "Month",
        "Day",
        "Year",
        "SPECIES_CODE",
        "HOW_MANY",
        "VALID",
    ],
).rename(columns=lambda x: x.lower())

codes = pd.read_csv(codes_path)[["SPECIES_CODE", "PRIMARY_COM_NAME"]].rename(
    columns={"SPECIES_CODE": "species_code", "PRIMARY_COM_NAME": "species_name"}
)

birds = birds[
    [
        "latitude",
        "longitude",
        "subnational1_code",
        "month",
        "day",
        "year",
        "species_code",
        "how_many",
        "valid",
    ]
]

birds = birds[birds["valid"] == 1]
birds = (
    birds.groupby(["subnational1_code", "species_code"])
    .agg(total_species=("how_many", "sum"), total_sightings=("how_many", "count"))
    .reset_index()
    .sort_values("total_species", ascending=False)
)

birds = pd.merge(birds, codes, on="species_code", how="inner")


print(birds)

Now let's compare the above to Rust code.

Rust

use polars::prelude::*;
use std::env;

fn main() {
    let current_dir = env::current_dir().expect("Failed to get current directory");
    let bird_path = current_dir.join("../lib/PFW_2016_2020_public.csv");
    let codes_path = current_dir.join("../lib/species_code.csv");

    let cols = vec![
        "LATITUDE".into(),
        "LONGITUDE".into(),
        "SUBNATIONAL1_CODE".into(),
        "Month".into(),
        "Day".into(),
        "Year".into(),
        "SPECIES_CODE".into(),
        "HOW_MANY".into(),
        "VALID".into(),
    ];

    let birds_df = CsvReader::from_path(bird_path)
        .expect("Failed to read CSV file")
        .has_header(true)
        .with_columns(Some(cols.clone()))
        .finish()
        .unwrap()
        .lazy();

    let mut codes_df = CsvReader::from_path(codes_path)
        .expect("Failed to read CSV file")
        .infer_schema(None)
        .has_header(true)
        .finish()
        .unwrap();

    codes_df = codes_df
        .clone()
        .lazy()
        .select([
            col("SPECIES_CODE").alias("species_code"),
            col("PRIMARY_COM_NAME").alias("species_name"),
        ])
        .collect()
        .unwrap();

    let birds_df = birds_df
        .rename(cols.clone(), cols.into_iter().map(|x| x.to_lowercase()))
        .filter(col("valid").eq(lit(1)))
        .groupby(["subnational1_code", "species_code"])
        .agg(&[
            col("how_many").sum().alias("total_species"),
            col("how_many").count().alias("total_sightings"),
        ])
        .sort(
            "total_species",
            SortOptions {
                descending: true,
                nulls_last: false,
                multithreaded: true,
            },
        )
        .collect()
        .unwrap();

    let joined = birds_df
        .join(
            &codes_df,
            ["species_code"],
            ["species_code"],
            JoinType::Inner,
            None,
        )
        .unwrap();

    println!("{}", joined);
}

In Rust, the code 75% longer and the syntax is more verbose. There are a lot of unwrap calls to handle errors, although some of these could be replaced with ? in a real application.

The sort function takes a SortOptions struct, which is a bit more verbose. Overall, the API is very similar.

Benchmarks

Let's look at some benchmarks for polars in both Python and Rust, as well as similar code in Pandas.

CommandMean [ms]Min [ms]Max [ms]Relative
../wxrs/target/release/ch5473.9 ± 5.8461.3480.71.00
python ../wxpy/wxpy/ch5/ch5.py764.8 ± 26.8732.9815.21.61 ± 0.06
python ../wxpy/wxpy/ch5/ch5_pandas.py5644.0 ± 39.25584.95710.011.91 ± 0.17

The Rust version is the fastest again. The Python-polars code is 1.6x slower than the Rust code, but the Pandas code is exceptionally slow, taking over 5 seconds to complete while both Polars versions take less than 1 second.

Concurrent Programming

One of Rust's major goals as a language is to enable fearless concurrency. So much so that an entire chapter of the Rust Book is devoted to it.

In Python, concurrency is possible however we are impacted by the GIL.

What's really fascinating (to me, anyways) is how decisions about how memory is managed in both languages has a direct impact on how concurrency is handled.

Before we dig into concurrency, let's take a step back and talk about memory.

Memory

Every programming language stores objects in memory. Whether it's variables, functions, or other data, we store these in memory to allow fast access to them when we need them.

How languages manage memory defines the flavor and performance characteristics of the language.

The GIL and Python's Memory Management

In Python, the infamous Global Interpreter Lock (or GIL) exists because objects in Python are reference counted. This means that every object has a counter associated with it that is incremented as it is referenced and decremented as it is removed from scope. When an object has 0 references, it is cleared from memory, freeing up space.

To prevent two different threads from accessing or releasing the same reference to an object, the GIL is used to prevent multiple threads from accessing the same object. This has the effect of serializing access to objects in memory and effectively making CPU-bound Python code single-threaded.

To work around these limitations, CPU-bound Python code has to rely on threads, which has its own set of limitations and overhead costs.

About the Author

This Rust for Data book was created by me, Pedram Navid.

You can find me on Twitter @pdrmnvd

and on LinkedIn @pedramnavid

and on GitHub @pedramnavid

and on Substack @databased.

and on my website pedramnavid.com.