Chapter 1 - Introduction
This book is available for free online at rustfordata.com
You can find the source code for book in ./rust4data-book
This book is very early, so expect code to change quite a bit.
Feel free to open an issue with any bugs, mistakes, or requests you may have.
What is this book about?
This series of posts is about using Rust for data engineering tasks for people who are already familiar with Python and are curious about Rust. It will not cover every aspect of Rust, or of Python. Instead, it aims to give practical examples of how common engineering tasks done in Python might be done in Rust, along with representative benchmarks.
We will cover a variety of topics, including:
- Getting data from an API
- Scraping a website
- Parsing data and using structs
- Data Transformation
- Writing Data
- Other topics as I feel like it
This book is not an introduction to either Rust or Python. There are many great resources to both out there. If you are not familiar with Python, the official Python Tutorial is a great starting point.
Should I use Rust for Data Engineering?
Probably not. Rust is a great language, it is fun, it is pleasant to use, and it is fast. But choosing a language for a project is more than choosing a language that is fun. There are cautionary tales about using Rust at a startup, and I think they are worth reading.
There are many reasons why you might not want to use Rust for data engineering. The first is that Rust is not as mature as Python. There are many libraries that are missing. For example, as of this writing, there are no Snowflake libraries for querying data in that warehouse. Most people do not know Rust, and it is harder to hire and harder to train people.
There may be good reasons to use Rust for data engineering however. When it comes to cost and performance, Rust is clearly faster than Python for many types of tasks. Memory usage is also much lower, which can be important when you are constrained by small IoT devices, for example.
I can't tell you when to use Rust and when to use Python, but I do believe that by understanding both languages, their merits and pitfalls, you will be better positioned to make that decision for yourself.
Why Should I Learn Rust?
Because it is fun to learn new things. I can't promise you that anything you learn here will ever have a material impact on your life or career. But if you enjoy learning and tinkering, then you might want to tinker with this. If you are like me, and you like learning for learning's sake, then you will enjoy this experience too. I learned vim and lua not because it was useful, but because I was curious about it. I did end up benfiting from it, but I never approached it from a purely utilitarian perspective. There are better ways to spend your time if your goal is purely career advancement.
But, if you are curious about Rust, and if you like to have fun, then I think you will be pleasantly surprised by what Rust has to offer.
Prerequisites
Installing Rust and Python
You will need Rust and Python installed to follow along with the examples here. There are many ways to setup an environment, especially in Python, I've listed the one I like here.
For Rust, you can install Rust by going to rustup.rs.
For Python, I recommend using pyenv along with pyenv-virtualenv. With both installed, here is my typical workflow:
# Create a folder
mkdir -p ~/projects/somepyproj
cd ~/projects/somepyproj
# Install a recent version of Python
pyenv install 3.10
# Create a virtualenv for the project named proj310
pyenv virtualenv 3.10 proj310
pyenv local proj310
# Confirm pyenv is working
pyenv version
> proj310 (set by /Users/username/projects/somepyproj/.python-version)```
Installing the Code
The code for this book can be found here: https://github.com/PedramNavid/rust-for-data
git clone git@github.com:PedramNavid/rust-for-data.git
Fetching from an API
One of the simplest examples to start with is fetching data from an API endpoint. This is often the beginning of many data pipeline journeys.
In our first case, we will use the OpenWeatherMap API to fetch the current weather in a configurable location by providing a latitude and longitude on the command line.
You will need to sign up for a free account to get an API key, once you've signed up, create an API key.
Starting a Project
One of the first differences between Rust and Python you will experience is through initializing a project.
Rust
In Rust, this is as simple as running
# Create the project
cargo init wxrs
# Add a dependency
cd wxrs
cargo add reqwest --features blocking
This will create a new directory called wxrs
with a Hello World example.
It will also add the reqwest
crate to our dependencies, similar to pip install
.
Unlike pip install
though, this will also update Cargo.toml
with our dependency,
and create a Cargo.lock
file that will lock the reqwest
crate to a specific version.
The --features
flag is used to express optional compilation features. Reqwest
has several options, described in the crate's documentation.
We will use the blocking
feature, which will allow us to use the blocking API.
It gives us a simpler interface to reqwest instead of futures that require
an async runtime. We will eventually use async to show the power of Rust's
fearless concurrency.
In Python, we have to manually maintain dependencies and create lockfiles through
updating setup.py
or pyproject.toml
, or using a tool like pipenv
or poetry
.
We are not using any specific features, but Python too allows optional features,
for example pip install snowflake-connector-python[pandas]
.
# Cargo.toml
[package]
name = "wxrs"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
chrono = "0.4.26"
polars = { version = "0.30.0", features = ["lazy"] }
reqwest = { version = "0.11.18", features = ["blocking", "json"] }
serde = { version = "1.0.164", features = ["derive"] }
serde_json = "1.0.97"
tempfile = "3.6.0"
Python
In Python, we create the directory manually, create a virtual environment, hand-write a pyproject.toml file, name our dependencies, and then install the Python package locally.
# Python venv stuff
pyenv virtualenv wxpy
pyenv shell wxpy
pip install --upgrade pip build
# Create the project directory and pyproject.toml
mkdir -p wxpy/wxpy
cd wxpy
vim pyproject.toml
# pyproject.toml
[project]
name = "wxpy"
version = "0.0.1"
dependencies = [
"requests",
'importlib-metadata; python_version<"3.8"',
"polars",
"pandas"
]
pip install -e .
Now, admittedly we can skip all of the above steps, create a random file
anywhere we want and run it with python myfile.py
, but the goal here is to
build a more stable distribution that can be packaged, shared, and tested.
Fetching the Weather
Now that we have a project, let's fetch the weather. To fetch from an API,
we will use the requests
package in Python and the reqwest
package in Rust.
We will read the API key from the environment variable, and get the latitude and longitude from the command line arguments.
Given that I'm in California, it only makes sense to start with the Air Pollution API.
Python
In Python, we'll create a folder for this chapter to keep code organized, and then run that file directly.
mkdir wxpy/wxpy/ch3
# wxpy/wxpy/ch3/fetch_api.py
import os
import sys
import requests
API_KEY = os.getenv("OWM_APPID")
def get_air_pollution(lat, lon):
url = f"http://api.openweathermap.org/data/2.5/air_pollution?lat={lat}&lon={lon}&appid={API_KEY}"
body = requests.get(url).text
return body
if __name__ == "__main__":
usage = f"Usage: python {__file__} <lat> <lon>"
if not API_KEY:
print("Please set OWM_APPID environment variable")
sys.exit(1)
if len(sys.argv) != 3:
print(usage)
sys.exit(1)
lat = sys.argv[1]
lon = sys.argv[2]
body = get_air_pollution(lat, lon)
print(body)
Rust
In Rust, usually we'll have a main.rs
file that runs our code, with
additional code imported as modules from other files. There's a great
convention for package layouts in Rust.
But since we want to execute our code directly, and we'll have multiple
binaries, we'll create a bin
folder and save the code for ch3
there.
mkdir wxrs/src/bin
// wxrs/src/bin/ch3.rs pub fn get_air_pollution(lat: f32, lon: f32) -> String { let api_key = std::env::var("OWM_APPID").expect( "Environment Variable OWM_APPID not set. Please set it to your OpenWeatherMap API key. https://home.openweathermap.org/api_keys", ); let url = format!( "http://api.openweathermap.org/data/2.5/air_pollution?lat={}&lon={}&appid={}", lat, lon, api_key ); reqwest::blocking::get(url) .expect("request failed") .text() .expect("body failed") } pub fn main() { let usage = format!("Usage: {} [lat] [lon]", std::env::args().next().unwrap()); let lat = std::env::args() .nth(1) .expect(&usage) .parse::<f32>() .expect(&usage); let lon = std::env::args() .nth(2) .expect(&usage) .parse::<f32>() .expect(&usage); let body = get_air_pollution(lat, lon); println!("{}", body); }
Running the Program
Running the program is simple in both languages. We'll provide the latitude and longitude of beautiful Fairfax, CA, birthplace of mountain biking, and nestled in the foothills of Mount Tamalpais.
Google gives the coordinates as 37.9871
and -122.5889
Python
In Python, we can use -m
to run the module directly.
# in wxpy/wxpy
# export OPENWEATHERMAP_API_KEY=your-api-key
python -m wxpy.ch3.fetch_api 37.9871 -122.5889
> {"coord":{"lon":-122.5889,"lat":37.9871},"list":[{"main":{"aqi":2},"components":{"co":178.58,"no":0.1,"no2":0.47,"o3":70.81,"so2":0.64,"pm2_5":2.58,"pm10":4.18,"nh3":0},"dt":1687221287}]}
Rust
In Rust, we must first compile the program before running it. If we run
cargo build
Rust will create a binary for us in ./target/debug/wxrs
We can also compile and run it with one command cargo run
When using cargo build
Rust will a debug version of our application in
./target/debug
for both the main.rs
file which will be named wxrs
as
well for any files located in src/bin
, such as ch3.rs
# in wxrs/
cargo build
./target/debug/ch3 37.9871 -122.5889
> {"coord":{"lon":-122.5889,"lat":37.9871},"list":[{"main":{"aqi":2},"components":{"co":178.58,"no":0.1,"no2":0.47,"o3":70.81,"so2":0.64,"pm2_5":2.58,"pm10":4.18,"nh3":0},"dt":1687221453}]}
# or
cargo run --bin ch3 37.9871 -122.5889
> {"coord":{"lon":-122.5889,"lat":37.9871},"list":[{"main":{"aqi":2},"components":{"co":178.58,"no":0.1,"no2":0.47,"o3":70.81,"so2":0.64,"pm2_5":2.58,"pm10":4.18,"nh3":0},"dt":1687221453}]}
Discussion
Looking at both programs, we can see a fairly similar approach to solving this problem.
Both programs use an external library or crate (not-so-coincidentally named request/reqwest).
In both programs, we've created a function that takes a latitude and longitude, fetches the results from an API and returns the results as text. We'll cover handling structured data from JSON soon.
Types
One obvious difference is that in Rust, we declare the types of the lat and lon arguments, and in Python we do not. The trouble with talking about types is that it inevitably leads to a discussion of memory, which can devolve into a conversation around null pointer references, which we will largely avoid until the next chapter, but here's a light introduction.
In the Rust code, we've very explicitly defined the types for our function:
#![allow(unused)] fn main() { pub fn get_air_pollution(lat: f32, lon: f32) -> String { }
Both lat
and lon
are f32
or 32-bit floats. These are floating-point numbers
that take exactly 32-bits of memory. The compiler knows exactly how much space
to reserve for these values: 32-bits, or 4-bytes.
Given that lat
and lon
doesn't require much precision beyond a few
decimals, f32
seems like the best choice for our code. We could even opt for
greater precision by using a 64-bit float or f64
in Rust which would take 8
bytes of memory.
Because we know exactly how much memory we need for these variables, Rust is able to store these values on the heap.
In Python, we don't know how much memory lat and lon need until runtime because Python will accept anything in this function.
def get_air_pollution(lat, lon):
We could pass it a string, numbers, another function, or even None
.
>>> def join_two(a, b):
... return f"a+b={a}+{b}"
...
>>> join_two(1,2)
'a+b=1+2'
>>> join_two(None, None)
'a+b=None+None'
>>> join_two(join_two, join_two)
'a+b=<function join_two at 0x7f8f7f7de980>+<function join_two at 0x7f8f7f7de980>'
>>> join_two(join_two, join_two(join_two, join_two))
'a+b=<function join_two at 0x7f8f7f7de980>+a+b=<function join_two at 0x7f8f7f7de980>+<function join_
two at 0x7f8f7f7de980>'
Even the url
line will not fail, because in Python duck-typing allows
us great flexibility in what we do with variables. We can pass numbers into an
f-string for concatenation just as easily as we can pass characters.
Python will allocate these values on the heap, and it turns out that Python allocates about 24 bytes for each float there. The actual values are stored in a private heap.
Now, the difference between 24 bytes and 8 bytes is trivial for an application such as this, and even on the most memory-constrained devices it's not worth noting. But it's important to know that heap allocation is slower, and even small applications may iterate over millions of values. Small differences can add up.
You might then ask yourself: what about mypy? Doesn't that give us typing? Mypy is a static type checker, but it doesn't change the underlying compilation of Python code. It can provide hints as to what you expect the types to be, but it doesn't change how memory is allocated.
Handling Errors
Another subtle but important difference is the handling of errors. In Python, errors are handled as exceptions that are caught. Knowing when to catch an exception is mostly an art. It's difficult to know which functions throw exceptions, what exceptions to expect, and when to deal with them.
In Rust, errors are handled as values that are returned. This is a much more
explicit approach, and it's easier to know what errors to expect and how to
handle them. In fact, if a function returns a Result
type, the compiler will
force you to handle the error. This is a huge benefit to Rust, and it's one of
the reasons why Rust is so reliable.
Let's take a closer look at an exception we haven't caught yet. If we run the
Python program with invalid arguments, we get a ValueError
exception.
python -m wxpy.ch3.fetch_api nice birds
> ❯ python -m wxpy.ch3.fetch_api nice birds
{"cod":"400","message":"wrong latitude"}
./target/debug/ch3 nice birds
> thread 'main' panicked at 'Usage: ./target/debug/ch3 [lat] [lon]: ParseFloatError { kind: Invalid }', src/main.rs:24:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
What happened here?
In Python, we didn't check that our input is valid, and so the application send incorrect values to the API, which returned an error message. Fortunately, we get a million API requests for free a month, so this one doesn't cost us much.
In Rust, the application panicked because it could not parse the inputs we provided
as a float. On line 23:24 we call parse
on the arguments, and we expect them
to be floats.
#![allow(unused)] fn main() { let lat = std::env::args() .nth(1) .expect(&usage) .parse::<f32>() .expect(&usage); }
the expect
method tells Rust that if parse
failed to convert the input,
then the application must panic. We output the usage
message and exit.
You will see expect
and its cousin unwrap
used frequently in Rust. They are
useful for debugging, but they are not the best way to handle errors. We'll
cover error handling in more detail soon.
Benchmarks
Let me preface this by saying speed isn't everything. No doubt someone familiar in Python will spend far more time learning Rust than they might ever save by running a slightly more optimized program. But it is nice to get a sense of
Let's use hyperfine
to benchmark the two programs. We'll run each program
10 times and take the average. Before we benchmark the Rust application,
we'll compile it using --release
which builds a release rather than a
debug version and it should provide us with a faster application.
cargo build --release
# in wxpy
hyperfine --warmup 3 --min-runs 10 \
'python -m wxpy.ch3.fetch_api 37.9871 -122.5889' \
'./target/release/ch3 37.9871 -122.5889' \
--export-markdown ../benchmarks/ch3_fetch_api.md
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
../wxrs/target/release/ch3 30 -140 | 177.0 ± 6.5 | 166.2 | 189.9 | 1.00 |
python ../wxpy/wxpy/ch3/fetch_api.py 30 -140 | 320.2 ± 46.8 | 261.2 | 379.7 | 1.81 ± 0.27 |
On my system, the Python application took an average of 335ms to complete, while the Rust application was 1.7x faster at 198ms. Memory consumption was also lower in Rust, with the Python application using 26MB vs only 10MB in Rust.
Again, this is a trivial application with trivial requirements and performance is not a key factor in deciding what language to build. But as we build more intensive applications we'll keep an eye on memory and performance to see how the gap changes.
Summary
In this chapter we've built a simple application that fetches data from an API and returns the results. We've seen how Rust and Python differ in their approach to handling errors and types, and we've seen how Rust can be faster and more memory efficient than Python.
Serializing Data
In the last chapter we fetched data from the OpenWeather API in order to get Air Pollution data. The astute observe will have noticed that we parsed the response as pure text, although the response was in JSON format.
The goal of this chapter is to walk through how we would take raw data and serialize it into a structured data format, such as JSON.
We'll dive into theory in a little but let's start with practice.
Serialization
Serialization is the process of taking data and encoding it into a known format that can later be retrieved. There are many ways to encode data, but largely these are broken into human-readable and binary formats.
CSVs, JSON, XML, and YAML are all human-readable serialization formats. Conversely, many binary formats exist, such as Parquet, Avro, and Protcol Buffers. Binary formats trade reduced readability for improved performance and size.
In the end, any data that needs to be persisted outside of a computer's memory requires some type of serialization.
Let's look at how serialization varies across both Rust and Python.
Python
In Python, we can serialize nearly any arbitrary data structure to JSON
using the json
module.
In [1]: import json
In [2]: my_obj = [{'a': 1, 'b': None}, "foo", "bar", ("baz", "baz")]
In [3]: json.dumps(my_obj)
Out[3]: '[{"a": 1, "b": null}, "foo", "bar", ["baz", "baz"]]'
Here's the updated project code that serializes the response from the OpenWeather API.
import os
import sys
import requests
API_KEY = os.getenv("OWM_APPID")
def get_air_pollution(lat, lon):
url = f"http://api.openweathermap.org/data/2.5/air_pollution?lat={lat}&lon={lon}&appid={API_KEY}"
body = requests.get(url).json()
return body
def parse_air_pollution(body):
aqi = body["list"][0]["main"]["aqi"]
components = body["list"][0]["components"]
return (aqi, components)
if __name__ == "__main__":
usage = f"Usage: python {__file__} <lat> <lon>"
if not API_KEY:
print("Please set OWM_APPID environment variable")
sys.exit(1)
if len(sys.argv) != 3:
print(usage)
sys.exit(1)
lat = sys.argv[1]
lon = sys.argv[2]
body = get_air_pollution(lat, lon)
aqi, components = parse_air_pollution(body)
print(f"Air Quality Index: {aqi}")
print("Components:")
for k, v in components.items():
print(f" {k}: {v}")
There are a few key things to note here.
first, we're assuming the request was successful and that there is a json response body, and that it can parse correctly. if any of these assumptions are incorrect an exception will be raised, and we have no obvious way of knowing what these exceptions are or which method might raise one.
def parse_air_pollution(body):
aqi = body["list"][0]["main"]["aqi"]
components = body["list"][0]["components"]
return (aqi, components)
When parsing the response, we slice into the response body to get
various components. We're explicitly fetching keys from a dictionary under
the assumption that the payload is properly formed. There are safer dictionary
methods to use, such as .get()
which will return None
if the key is missing
rather than an exception, but in our case an Exception is warranted since we
can't do anything with the data if it's missing.
We also haven't explicitly typed the response from the API. This is something
we can do with mypy
or other tools like pydantic
, but the Python interpret
itself has no type-guarantees.
Let's look at how we might do this in Rust.
Rust
In Rust, we'll need to install the serde
crate as well as the json
feature
for reqwest
.
cargo add serde --features derive
cargo add serde_json
cargo add reqwest --features json
Because Rust is a typed language, we will define the struct that represents the data we expect. The API response looks like the following:
{
"coord": {
"lon": -122.5889,
"lat": 37.9871
},
"list": [
{
"main": {
"aqi": 2
},
"components": {
"co": 168.56,
"no": 0.14,
"no2": 0.75,
"o3": 80.11,
"so2": 0.7,
"pm2_5": 3.48,
"pm10": 5.58,
"nh3": 0
},
"dt": 1687308878
}
]
}
We can define a struct that represents this data as follows:
#![allow(unused)] fn main() { use serde::{Deserialize, Serialize}; #[derive(Debug, Serialize, Deserialize)] pub struct AirPollution { pub coord: Coord, pub list: Vec<List>, } #[derive(Debug, Serialize, Deserialize)] pub struct Coord { pub lon: f32, pub lat: f32, } #[derive(Debug, Serialize, Deserialize)] pub struct List { pub main: Main, pub components: Components, pub dt: usize, } #[derive(Debug, Serialize, Deserialize)] pub struct Main { pub aqi: u8, } #[derive(Debug, Serialize, Deserialize)] pub struct Components { pub co: f32, pub no: f32, pub no2: f32, pub o3: f32, pub so2: f32, pub pm2_5: f32, pub pm10: f32, pub nh3: f32, } }
As you can see, the struct mirrors the underlying JSON structure. The serde
crate gives us a lot of flexibility here, in particular the section on
Attributes and the
Examples are worth spending some time on.
The reqwest
crate also provides a json
method that will automatically
deserialize the response body into a struct.
#![allow(unused)] fn main() { pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution { let api_key = std::env::var("OWM_APPID").expect( "Environment Variable OWM_APPID not set. Please set it to your OpenWeatherMap API key. https://home.openweathermap.org/api_keys", ); let url = format!( "http://api.openweathermap.org/data/2.5/air_pollution?lat={}&lon={}&appid={}", lat, lon, api_key ); reqwest::blocking::get(url) .expect("request failed") .json() .expect("json failed") }
Our function now returns an AirPollution
struct, instead of a String, and
reqwest
's json
method will automatically deserialize the response body
to the correct type.
Rust uses type inference to reduce the amount of syntax required. While function parameters and signatures always require types, local variables can usually be inferred by the compiler.
Let's look at how returning a typed Struct changes how we interact with the data
#![allow(unused)] fn main() { pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) { let main = &body.list[0].main; let components = &body.list[0].components; (main, components) } }
We can access the underlying fields in the struct directly. Unlike a Python dictionary, the compiler will ensure that the fields we're accessing exist.
If we add a missing field, for example:
#![allow(unused)] fn main() { let foo = &body.list[0].foo; }
And run cargo check
we'll get the following error:
error[E0609]: no field `foo` on type `List`
--> src/bin/ch4.rs:65:29
|
65 | let foo = &body.list[0].foo;
| ^^^ unknown field
|
= note: available fields are: `main`, `components`, `dt`
For more information about this error, try `rustc --explain E0609`.
error: could not compile `wxrs` (bin "ch4") due to previous error
Compare to Python where we'd only get a run-time error if we tried to
access a missing field, unless we opt-in to type hints using mypy
.
Here's the full Rust code for reference
use serde::{Deserialize, Serialize}; #[derive(Debug, Serialize, Deserialize)] pub struct AirPollution { pub coord: Coord, pub list: Vec<List>, } #[derive(Debug, Serialize, Deserialize)] pub struct Coord { pub lon: f32, pub lat: f32, } #[derive(Debug, Serialize, Deserialize)] pub struct List { pub main: Main, pub components: Components, pub dt: usize, } #[derive(Debug, Serialize, Deserialize)] pub struct Main { pub aqi: u8, } #[derive(Debug, Serialize, Deserialize)] pub struct Components { pub co: f32, pub no: f32, pub no2: f32, pub o3: f32, pub so2: f32, pub pm2_5: f32, pub pm10: f32, pub nh3: f32, } pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution { let api_key = std::env::var("OWM_APPID").expect( "Environment Variable OWM_APPID not set. Please set it to your OpenWeatherMap API key. https://home.openweathermap.org/api_keys", ); let url = format!( "http://api.openweathermap.org/data/2.5/air_pollution?lat={}&lon={}&appid={}", lat, lon, api_key ); reqwest::blocking::get(url) .expect("request failed") .json() .expect("json failed") } pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) { let main = &body.list[0].main; let components = &body.list[0].components; (main, components) } pub fn main() { let usage = format!("Usage: {} [lat] [lon]", std::env::args().next().unwrap()); let lat = std::env::args() .nth(1) .expect(&usage) .parse::<f32>() .expect(&usage); let lon = std::env::args() .nth(2) .expect(&usage) .parse::<f32>() .expect(&usage); let body = get_air_pollution(lat, lon); let (main, components) = parse_air_pollution(&body); println!("Air Quality Index: {}", main.aqi); println!("Carbon Monoxide: {} μg/m³", components.co); println!("Nitrogen Monoxide: {} μg/m³", components.no); println!("Nitrogen Dioxide: {} μg/m³", components.no2); println!("Ozone: {} μg/m³", components.o3); println!("Sulfur Dioxide: {} μg/m³", components.so2); println!("Particulate Matter < 2.5 μm: {} μg/m³", components.pm2_5); println!("Particulate Matter < 10 μm: {} μg/m³", components.pm10); println!("Ammonia: {} μg/m³", components.nh3); }
Serialization Formats
Something worth mentioning about the Rust serde
crate is that it does not
come with any built-in serialization formats. Instead, it provides a framework
for serialization. We installed serde_json
but there are many other formats
available, such as serde_yaml
and serde_avro
.
Why Bother?
You might be wondering why we'd go through the trouble of defining a struct
and serializing the response body into a struct. In Python, we avoid the
boilerplate, we access fields directly, we can throw a little type-hinting at
our code, we get to use # type: ignore
freely, and if our application crashes,
well, we'll just fix it and run it again.
You are absolutely right! This is all true. However, any seasoned Python programmer is also aware of all the ways that poorly typed code can go wrong.
If you've ever created a compute-intensive application that operates on many gigabytes of data, you've probably run into a situation where you've had to re-run the application because it crashed. Type-safety helps prevent these types of issues, but types also provide another nice benefit: improved performance.
The compiler can optimize code based on the types it knows about. In Python, we can use type-hints to help the compiler, but ultimately the Python interpreter is still dynamically resolving types at runtime. In Rust, the compiler knows the types at compile-time and can optimize prior to running.
What's that little & doing?
Ah, yes, the &
. Now we are getting into the heart of Rust. Let's look
at the code for parsing air pollution again:
#![allow(unused)] fn main() { pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) { let main = &body.list[0].main; let components = &body.list[0].components; (main, components) } }
parse_air_pollution
is a function that takes a reference to an AirPollution
struct. The &
is the syntax for creating a reference. In Rust, references
are a way of passing a value to a function without transferring ownership of
the value. This is a key concept in Rust, and it's what allows Rust to
guarantee memory safety.
In Python, values are passed around using counters. Every time you use a variable, Python's Garbage Collector keeps track of how many times it's been used. Whenever a function that used a reference exists, the counter is decremented. A Garbage Collector occasionally runs and cleans up all unused references.
In Rust, there is no garbage collector. Instead, the compiler keeps track of the lifetime of every variable. When a variable goes out of scope, the compiler will automatically free the memory associated with the variable.
This means that you cannot use a variable after transferring ownership. For a deeper dive into the concept of ownership, read the Rust Book.
For example, if we tried print the value of body after assigning it, the compiler would give us an error:
#![allow(unused)] fn main() { fn parse_air(body: AirPollution) { let foo = body; println!("{:?}", body); } }
error[E0382]: borrow of moved value: `body`
--> src/bin/ch4.rs:71:22
|
69 | fn parse_air(body: AirPollution) {
| ---- move occurs because `body` has type `AirPollution`, which does not implement the `Copy` trait
70 | let foo = body;
| ---- value moved here
71 | println!("{:?}", body);
| ^^^^ value borrowed here after move
It's beyond the scope of this post to explain all the details of ownership and references, but it's important to understand that Rust's compiler is keeping track of the lifetime of every variable, and will not allow you to use a variable after it's been moved.
Instead, we can use a reference to a variable. This keeps the underlying data in the same place in memory, but allows us to pass it to a function as a reference to the original value.
#![allow(unused)] fn main() { fn parse_air(body: &AirPollution) { let foo = body; println!("{:?}", body); } }
This has some really nice benefits when it comes to processing large amounts of data, as data engineers tend to do.
In Python, it's not always clear when data is being copied, moved, or referenced. In Rust, copying code is very explicit. If we didn't want to borrow a reference in the code above, we could also copy.
#![allow(unused)] fn main() { fn parse_air(body: AirPollution) { let foo = body.clone(); println!("{:?}", body); } }
For the above code to work, we would also need to implement the Clone
trait
for the AirPollution
struct and all of its fields:
#![allow(unused)] fn main() { #[derive(Debug, Clone, Deserialize)] pub struct AirPollution { ... }
Understanding ownership, references, and borrowing can be an uphill battle for new Rust programmers who are used to dynamically-typed languages, but with time and patience, it will come to you too.
Performance
To benchmark our code, we're going to change our code to fetch an entire forecast rather than a single day, increasing the payload from 0.5kb to about 13kb.
In Python, we change the url and then iterate over every element in the list provided.
def get_air_pollution(lat, lon):
url = f"http://api.openweathermap.org/data/2.5/air_pollution/forecast?lat={lat}&lon={lon}&appid={API_KEY}"
body = requests.get(url).json()
return body
def parse_air_pollution(body):
res = []
print(body)
for row in body["list"]:
res.append((row["main"]["aqi"], row["components"], row["dt"]))
return res
def print_air_pollution(main, components, dt):
print("---")
print(f"Air pollution forecast for {dt}")
print(f"Air quality index: {main}")
print("Components:")
for k, v in components.items():
print(f" {k}: {v}")
In Rust, we also change the url and use the common iter().map().collect()
pattern.
#![allow(unused)] fn main() { pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution { let api_key = std::env::var("OWM_APPID").expect( "Environment Variable OWM_APPID not set. Please set it to your OpenWeatherMap API key. https://home.openweathermap.org/api_keys", ); let url = format!( "http://api.openweathermap.org/data/2.5/air_pollution/forecast?lat={}&lon={}&appid={}", lat, lon, api_key ); reqwest::blocking::get(url) .expect("request failed") .json() .expect("json failed") } pub fn parse_air_pollution(body: AirPollution) -> Vec<(Main, Components, usize)> { body.list .iter() .map(|x| (x.main, x.components, x.dt)) .collect() } }
Here are the results of the benchmarks:
Again, we see a 1.7x improvement in performance, or about 58% faster.
Offline Benchmarks
Benchmarking against a network connection can be a bit iffy. It also makes it hard to test larger and larger payloads, so we'll create a large payload file and use that for an offline benchmark.
I've created a 9mb JSON file that mirrors the payload from the OpenWeather
API, and created offline versions of the Rust and Python code to read from
a local file. The code for both can be found in the sample repository under
wxpy/wxpy/ch4/serialized_offline_benchmark.py
and wxrs/src/bin/ch4_offline_benchmark.rs
.
Here are the results of the offline benchmarks:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
../wxrs/target/release/ch4_offline_benchmark | 47.0 ± 0.6 | 46.5 | 50.6 | 1.00 |
python ../wxpy/wxpy/ch4/serialized_offline_benchmark.py 30 -140 | 248.0 ± 22.4 | 210.8 | 278.2 | 5.27 ± 0.48 |
Rust is now running twice as fast as Python for these larger payloads.
Transforming Data using Polars
In this chapter, we'll look at how to transform data using Polars in both Python and Rust.
Polars is a "blazing fast DataFrame library" available in both Python and Rust. It is similar to pandas, although has fewer capabilities, however, it supports a wide-variety of common transformation tasks.
It is several times faster than pandas, and is a good choice for data transformation tasks.
The Polars documentation is a great resource for getting started, and the API docs have even more details on syntax.
Let's look at some key differences between the syntax in Python and Rust.
Python
import os
import polars as pl
script_path = os.path.dirname(os.path.realpath(__file__))
bird_path = os.path.join(script_path, "../../../lib/PFW_2016_2020_public.csv")
codes_path = os.path.join(script_path, "../../../lib/species_code.csv")
columns = [
"LATITUDE",
"LONGITUDE",
"SUBNATIONAL1_CODE",
"Month",
"Day",
"Year",
"SPECIES_CODE",
"HOW_MANY",
"VALID",
]
birds = pl.read_csv(
bird_path,
columns=columns,
new_columns=[s.lower() for s in columns],
)
codes = pl.read_csv(codes_path).select(
[
pl.col("SPECIES_CODE").alias("species_code"),
pl.col("PRIMARY_COM_NAME").alias("species_name"),
]
)
birds_df = (
birds.select(
pl.col(
[
"latitude",
"longitude",
"subnational1_code",
"month",
"day",
"year",
"species_code",
"how_many",
"valid",
]
)
)
.filter(pl.col("valid") == 1)
.groupby(["subnational1_code", "species_code"])
.agg(
[
pl.sum("how_many").alias("total_species"),
pl.count("how_many").alias("total_sightings"),
]
)
.sort("total_species", descending=True)
.join(codes, on="species_code", how="inner")
)
print(birds_df)
The Python code very concise, columns can be selected as a list of strings,
the sort
function takes a simple descending
argument, and the general
API is very straightforward.
I've also included an attempt at the same logic in pandas. While largely similar, there are a few differences, for example, in how we filter for valid results.
import os
import pandas as pd
script_path = os.path.dirname(os.path.realpath(__file__))
bird_path = os.path.join(script_path, "../../../lib/PFW_2016_2020_public.csv")
codes_path = os.path.join(script_path, "../../../lib/species_code.csv")
# adding usecols reducing memory usage and runtime from 13s to 7s
birds = pd.read_csv(
bird_path,
usecols=[
"LATITUDE",
"LONGITUDE",
"SUBNATIONAL1_CODE",
"Month",
"Day",
"Year",
"SPECIES_CODE",
"HOW_MANY",
"VALID",
],
).rename(columns=lambda x: x.lower())
codes = pd.read_csv(codes_path)[["SPECIES_CODE", "PRIMARY_COM_NAME"]].rename(
columns={"SPECIES_CODE": "species_code", "PRIMARY_COM_NAME": "species_name"}
)
birds = birds[
[
"latitude",
"longitude",
"subnational1_code",
"month",
"day",
"year",
"species_code",
"how_many",
"valid",
]
]
birds = birds[birds["valid"] == 1]
birds = (
birds.groupby(["subnational1_code", "species_code"])
.agg(total_species=("how_many", "sum"), total_sightings=("how_many", "count"))
.reset_index()
.sort_values("total_species", ascending=False)
)
birds = pd.merge(birds, codes, on="species_code", how="inner")
print(birds)
Now let's compare the above to Rust code.
Rust
use polars::prelude::*; use std::env; fn main() { let current_dir = env::current_dir().expect("Failed to get current directory"); let bird_path = current_dir.join("../lib/PFW_2016_2020_public.csv"); let codes_path = current_dir.join("../lib/species_code.csv"); let cols = vec![ "LATITUDE".into(), "LONGITUDE".into(), "SUBNATIONAL1_CODE".into(), "Month".into(), "Day".into(), "Year".into(), "SPECIES_CODE".into(), "HOW_MANY".into(), "VALID".into(), ]; let birds_df = CsvReader::from_path(bird_path) .expect("Failed to read CSV file") .has_header(true) .with_columns(Some(cols.clone())) .finish() .unwrap() .lazy(); let mut codes_df = CsvReader::from_path(codes_path) .expect("Failed to read CSV file") .infer_schema(None) .has_header(true) .finish() .unwrap(); codes_df = codes_df .clone() .lazy() .select([ col("SPECIES_CODE").alias("species_code"), col("PRIMARY_COM_NAME").alias("species_name"), ]) .collect() .unwrap(); let birds_df = birds_df .rename(cols.clone(), cols.into_iter().map(|x| x.to_lowercase())) .filter(col("valid").eq(lit(1))) .groupby(["subnational1_code", "species_code"]) .agg(&[ col("how_many").sum().alias("total_species"), col("how_many").count().alias("total_sightings"), ]) .sort( "total_species", SortOptions { descending: true, nulls_last: false, multithreaded: true, }, ) .collect() .unwrap(); let joined = birds_df .join( &codes_df, ["species_code"], ["species_code"], JoinType::Inner, None, ) .unwrap(); println!("{}", joined); }
In Rust, the code 75% longer and the syntax is more verbose. There are a lot of
unwrap
calls to handle errors, although some of these could be replaced with
?
in a real application.
The sort
function takes a SortOptions
struct, which is a bit more verbose.
Overall, the API is very similar.
Benchmarks
Let's look at some benchmarks for polars in both Python and Rust, as well as similar code in Pandas.
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
../wxrs/target/release/ch5 | 473.9 ± 5.8 | 461.3 | 480.7 | 1.00 |
python ../wxpy/wxpy/ch5/ch5.py | 764.8 ± 26.8 | 732.9 | 815.2 | 1.61 ± 0.06 |
python ../wxpy/wxpy/ch5/ch5_pandas.py | 5644.0 ± 39.2 | 5584.9 | 5710.0 | 11.91 ± 0.17 |
The Rust version is the fastest again. The Python-polars code is 1.6x slower than the Rust code, but the Pandas code is exceptionally slow, taking over 5 seconds to complete while both Polars versions take less than 1 second.
Concurrent Programming
One of Rust's major goals as a language is to enable fearless concurrency. So much so that an entire chapter of the Rust Book is devoted to it.
In Python, concurrency is possible however we are impacted by the GIL.
What's really fascinating (to me, anyways) is how decisions about how memory is managed in both languages has a direct impact on how concurrency is handled.
Before we dig into concurrency, let's take a step back and talk about memory.
Memory
Every programming language stores objects in memory. Whether it's variables, functions, or other data, we store these in memory to allow fast access to them when we need them.
How languages manage memory defines the flavor and performance characteristics of the language.
The GIL and Python's Memory Management
In Python, the infamous Global Interpreter Lock (or GIL) exists because objects in Python are reference counted. This means that every object has a counter associated with it that is incremented as it is referenced and decremented as it is removed from scope. When an object has 0 references, it is cleared from memory, freeing up space.
To prevent two different threads from accessing or releasing the same reference to an object, the GIL is used to prevent multiple threads from accessing the same object. This has the effect of serializing access to objects in memory and effectively making CPU-bound Python code single-threaded.
To work around these limitations, CPU-bound Python code has to rely on threads, which has its own set of limitations and overhead costs.
About the Author
This Rust for Data book was created by me, Pedram Navid.
You can find me on Twitter @pdrmnvd
and on LinkedIn @pedramnavid
and on GitHub @pedramnavid
and on Substack @databased.
and on my website pedramnavid.com.