Serializing Data

In the last chapter we fetched data from the OpenWeather API in order to get Air Pollution data. The astute observe will have noticed that we parsed the response as pure text, although the response was in JSON format.

The goal of this chapter is to walk through how we would take raw data and serialize it into a structured data format, such as JSON.

We'll dive into theory in a little but let's start with practice.

Serialization

Serialization is the process of taking data and encoding it into a known format that can later be retrieved. There are many ways to encode data, but largely these are broken into human-readable and binary formats.

CSVs, JSON, XML, and YAML are all human-readable serialization formats. Conversely, many binary formats exist, such as Parquet, Avro, and Protcol Buffers. Binary formats trade reduced readability for improved performance and size.

In the end, any data that needs to be persisted outside of a computer's memory requires some type of serialization.

Let's look at how serialization varies across both Rust and Python.

Python

In Python, we can serialize nearly any arbitrary data structure to JSON using the json module.

In [1]: import json

In [2]: my_obj = [{'a': 1, 'b': None}, "foo", "bar", ("baz", "baz")]

In [3]: json.dumps(my_obj)
Out[3]: '[{"a": 1, "b": null}, "foo", "bar", ["baz", "baz"]]'

Here's the updated project code that serializes the response from the OpenWeather API.

import os
import sys
import requests

API_KEY = os.getenv("OWM_APPID")


def get_air_pollution(lat, lon):
    url = f"http://api.openweathermap.org/data/2.5/air_pollution?lat={lat}&lon={lon}&appid={API_KEY}"
    body = requests.get(url).json()
    return body

def parse_air_pollution(body):
    aqi = body["list"][0]["main"]["aqi"]
    components = body["list"][0]["components"]
    return (aqi, components)

if __name__ == "__main__":
    usage = f"Usage: python {__file__} <lat> <lon>"

    if not API_KEY:
        print("Please set OWM_APPID environment variable")
        sys.exit(1)

    if len(sys.argv) != 3:
        print(usage)
        sys.exit(1)

    lat = sys.argv[1]
    lon = sys.argv[2]
    body = get_air_pollution(lat, lon)
    aqi, components = parse_air_pollution(body)

    print(f"Air Quality Index: {aqi}")
    print("Components:")
    for k, v in components.items():
        print(f"  {k}: {v}")

There are a few key things to note here.

first, we're assuming the request was successful and that there is a json response body, and that it can parse correctly. if any of these assumptions are incorrect an exception will be raised, and we have no obvious way of knowing what these exceptions are or which method might raise one.

def parse_air_pollution(body):
    aqi = body["list"][0]["main"]["aqi"]
    components = body["list"][0]["components"]
    return (aqi, components)

When parsing the response, we slice into the response body to get various components. We're explicitly fetching keys from a dictionary under the assumption that the payload is properly formed. There are safer dictionary methods to use, such as .get() which will return None if the key is missing rather than an exception, but in our case an Exception is warranted since we can't do anything with the data if it's missing.

We also haven't explicitly typed the response from the API. This is something we can do with mypy or other tools like pydantic, but the Python interpret itself has no type-guarantees.

Let's look at how we might do this in Rust.

Rust

In Rust, we'll need to install the serde crate as well as the json feature for reqwest.

cargo add serde --features derive
cargo add serde_json
cargo add reqwest --features json

Because Rust is a typed language, we will define the struct that represents the data we expect. The API response looks like the following:

{
    "coord": {
        "lon": -122.5889,
        "lat": 37.9871
    },
    "list": [
        {
            "main": {
                "aqi": 2
            },
            "components": {
                "co": 168.56,
                "no": 0.14,
                "no2": 0.75,
                "o3": 80.11,
                "so2": 0.7,
                "pm2_5": 3.48,
                "pm10": 5.58,
                "nh3": 0
            },
            "dt": 1687308878
        }
    ]
}

We can define a struct that represents this data as follows:

#![allow(unused)]
fn main() {
use serde::{Deserialize, Serialize};
#[derive(Debug, Serialize, Deserialize)]
pub struct AirPollution {
    pub coord: Coord,
    pub list: Vec<List>,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Coord {
    pub lon: f32,
    pub lat: f32,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct List {
    pub main: Main,
    pub components: Components,
    pub dt: usize,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Main {
    pub aqi: u8,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Components {
    pub co: f32,
    pub no: f32,
    pub no2: f32,
    pub o3: f32,
    pub so2: f32,
    pub pm2_5: f32,
    pub pm10: f32,
    pub nh3: f32,
}
}

As you can see, the struct mirrors the underlying JSON structure. The serde crate gives us a lot of flexibility here, in particular the section on Attributes and the Examples are worth spending some time on.

The reqwest crate also provides a json method that will automatically deserialize the response body into a struct.

#![allow(unused)]
fn main() {
pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution {
    let api_key = std::env::var("OWM_APPID").expect(
        "Environment Variable OWM_APPID not set. Please set it to your
    OpenWeatherMap API key. https://home.openweathermap.org/api_keys",
    );

    let url = format!(
        "http://api.openweathermap.org/data/2.5/air_pollution?lat={}&lon={}&appid={}",
        lat, lon, api_key
    );

    reqwest::blocking::get(url)
        .expect("request failed")
        .json()
        .expect("json failed")
}

Our function now returns an AirPollution struct, instead of a String, and reqwest's json method will automatically deserialize the response body to the correct type.

Rust uses type inference to reduce the amount of syntax required. While function parameters and signatures always require types, local variables can usually be inferred by the compiler.

Let's look at how returning a typed Struct changes how we interact with the data

#![allow(unused)]
fn main() {
pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) {
    let main = &body.list[0].main;
    let components = &body.list[0].components;
    (main, components)
}
}

We can access the underlying fields in the struct directly. Unlike a Python dictionary, the compiler will ensure that the fields we're accessing exist.

If we add a missing field, for example:

#![allow(unused)]
fn main() {
let foo = &body.list[0].foo;
}

And run cargo check we'll get the following error:


error[E0609]: no field `foo` on type `List`
  --> src/bin/ch4.rs:65:29
   |
65 |     let foo = &body.list[0].foo;
   |                             ^^^ unknown field
   |
   = note: available fields are: `main`, `components`, `dt`

For more information about this error, try `rustc --explain E0609`.
error: could not compile `wxrs` (bin "ch4") due to previous error

Compare to Python where we'd only get a run-time error if we tried to access a missing field, unless we opt-in to type hints using mypy.

Here's the full Rust code for reference


use serde::{Deserialize, Serialize};
#[derive(Debug, Serialize, Deserialize)]
pub struct AirPollution {
    pub coord: Coord,
    pub list: Vec<List>,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Coord {
    pub lon: f32,
    pub lat: f32,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct List {
    pub main: Main,
    pub components: Components,
    pub dt: usize,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Main {
    pub aqi: u8,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Components {
    pub co: f32,
    pub no: f32,
    pub no2: f32,
    pub o3: f32,
    pub so2: f32,
    pub pm2_5: f32,
    pub pm10: f32,
    pub nh3: f32,
}

pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution {
    let api_key = std::env::var("OWM_APPID").expect(
        "Environment Variable OWM_APPID not set. Please set it to your
    OpenWeatherMap API key. https://home.openweathermap.org/api_keys",
    );

    let url = format!(
        "http://api.openweathermap.org/data/2.5/air_pollution?lat={}&lon={}&appid={}",
        lat, lon, api_key
    );

    reqwest::blocking::get(url)
        .expect("request failed")
        .json()
        .expect("json failed")
}

pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) {
    let main = &body.list[0].main;
    let components = &body.list[0].components;
    (main, components)
}

pub fn main() {
    let usage = format!("Usage: {} [lat] [lon]", std::env::args().next().unwrap());

    let lat = std::env::args()
        .nth(1)
        .expect(&usage)
        .parse::<f32>()
        .expect(&usage);

    let lon = std::env::args()
        .nth(2)
        .expect(&usage)
        .parse::<f32>()
        .expect(&usage);

    let body = get_air_pollution(lat, lon);
    let (main, components) = parse_air_pollution(&body);

    println!("Air Quality Index: {}", main.aqi);

    println!("Carbon Monoxide: {} μg/m³", components.co);
    println!("Nitrogen Monoxide: {} μg/m³", components.no);
    println!("Nitrogen Dioxide: {} μg/m³", components.no2);
    println!("Ozone: {} μg/m³", components.o3);
    println!("Sulfur Dioxide: {} μg/m³", components.so2);
    println!("Particulate Matter < 2.5 μm: {} μg/m³", components.pm2_5);
    println!("Particulate Matter < 10 μm: {} μg/m³", components.pm10);
    println!("Ammonia: {} μg/m³", components.nh3);
}

Serialization Formats

Something worth mentioning about the Rust serde crate is that it does not come with any built-in serialization formats. Instead, it provides a framework for serialization. We installed serde_json but there are many other formats available, such as serde_yaml and serde_avro.

Why Bother?

You might be wondering why we'd go through the trouble of defining a struct and serializing the response body into a struct. In Python, we avoid the boilerplate, we access fields directly, we can throw a little type-hinting at our code, we get to use # type: ignore freely, and if our application crashes, well, we'll just fix it and run it again.

You are absolutely right! This is all true. However, any seasoned Python programmer is also aware of all the ways that poorly typed code can go wrong.

If you've ever created a compute-intensive application that operates on many gigabytes of data, you've probably run into a situation where you've had to re-run the application because it crashed. Type-safety helps prevent these types of issues, but types also provide another nice benefit: improved performance.

The compiler can optimize code based on the types it knows about. In Python, we can use type-hints to help the compiler, but ultimately the Python interpreter is still dynamically resolving types at runtime. In Rust, the compiler knows the types at compile-time and can optimize prior to running.

What's that little & doing?

Ah, yes, the &. Now we are getting into the heart of Rust. Let's look at the code for parsing air pollution again:

#![allow(unused)]
fn main() {
pub fn parse_air_pollution(body: &AirPollution) -> (&Main, &Components) {
    let main = &body.list[0].main;
    let components = &body.list[0].components;
    (main, components)
}
}

parse_air_pollution is a function that takes a reference to an AirPollution struct. The & is the syntax for creating a reference. In Rust, references are a way of passing a value to a function without transferring ownership of the value. This is a key concept in Rust, and it's what allows Rust to guarantee memory safety.

In Python, values are passed around using counters. Every time you use a variable, Python's Garbage Collector keeps track of how many times it's been used. Whenever a function that used a reference exists, the counter is decremented. A Garbage Collector occasionally runs and cleans up all unused references.

In Rust, there is no garbage collector. Instead, the compiler keeps track of the lifetime of every variable. When a variable goes out of scope, the compiler will automatically free the memory associated with the variable.

This means that you cannot use a variable after transferring ownership. For a deeper dive into the concept of ownership, read the Rust Book.

For example, if we tried print the value of body after assigning it, the compiler would give us an error:

#![allow(unused)]
fn main() {
fn parse_air(body: AirPollution) {
    let foo = body;
    println!("{:?}", body);
}
}

error[E0382]: borrow of moved value: `body`
  --> src/bin/ch4.rs:71:22
   |
69 | fn parse_air(body: AirPollution) {
   |              ---- move occurs because `body` has type `AirPollution`, which does not implement the `Copy` trait
70 |     let foo = body;
   |               ---- value moved here
71 |     println!("{:?}", body);
   |                      ^^^^ value borrowed here after move

It's beyond the scope of this post to explain all the details of ownership and references, but it's important to understand that Rust's compiler is keeping track of the lifetime of every variable, and will not allow you to use a variable after it's been moved.

Instead, we can use a reference to a variable. This keeps the underlying data in the same place in memory, but allows us to pass it to a function as a reference to the original value.

#![allow(unused)]
fn main() {
fn parse_air(body: &AirPollution) {
    let foo = body;
    println!("{:?}", body);
}
}

This has some really nice benefits when it comes to processing large amounts of data, as data engineers tend to do.

In Python, it's not always clear when data is being copied, moved, or referenced. In Rust, copying code is very explicit. If we didn't want to borrow a reference in the code above, we could also copy.

#![allow(unused)]
fn main() {
fn parse_air(body: AirPollution) {
    let foo = body.clone();
    println!("{:?}", body);
}
}

For the above code to work, we would also need to implement the Clone trait for the AirPollution struct and all of its fields:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Deserialize)]
pub struct AirPollution {
...

}

Understanding ownership, references, and borrowing can be an uphill battle for new Rust programmers who are used to dynamically-typed languages, but with time and patience, it will come to you too.

Performance

To benchmark our code, we're going to change our code to fetch an entire forecast rather than a single day, increasing the payload from 0.5kb to about 13kb.

In Python, we change the url and then iterate over every element in the list provided.

def get_air_pollution(lat, lon):
    url = f"http://api.openweathermap.org/data/2.5/air_pollution/forecast?lat={lat}&lon={lon}&appid={API_KEY}"
    body = requests.get(url).json()
    return body


def parse_air_pollution(body):
    res = []
    print(body)
    for row in body["list"]:
        res.append((row["main"]["aqi"], row["components"], row["dt"]))
    return res


def print_air_pollution(main, components, dt):
    print("---")
    print(f"Air pollution forecast for {dt}")
    print(f"Air quality index: {main}")
    print("Components:")
    for k, v in components.items():
        print(f"  {k}: {v}")

In Rust, we also change the url and use the common iter().map().collect() pattern.

#![allow(unused)]
fn main() {
pub fn get_air_pollution(lat: f32, lon: f32) -> AirPollution {
    let api_key = std::env::var("OWM_APPID").expect(
        "Environment Variable OWM_APPID not set. Please set it to your
    OpenWeatherMap API key. https://home.openweathermap.org/api_keys",
    );

    let url = format!(
        "http://api.openweathermap.org/data/2.5/air_pollution/forecast?lat={}&lon={}&appid={}",
        lat, lon, api_key
    );

    reqwest::blocking::get(url)
        .expect("request failed")
        .json()
        .expect("json failed")
}

pub fn parse_air_pollution(body: AirPollution) -> Vec<(Main, Components, usize)> {
    body.list
        .iter()
        .map(|x| (x.main, x.components, x.dt))
        .collect()
}
}

Here are the results of the benchmarks:

Again, we see a 1.7x improvement in performance, or about 58% faster.

Offline Benchmarks

Benchmarking against a network connection can be a bit iffy. It also makes it hard to test larger and larger payloads, so we'll create a large payload file and use that for an offline benchmark.

I've created a 9mb JSON file that mirrors the payload from the OpenWeather API, and created offline versions of the Rust and Python code to read from a local file. The code for both can be found in the sample repository under wxpy/wxpy/ch4/serialized_offline_benchmark.py and wxrs/src/bin/ch4_offline_benchmark.rs.

Here are the results of the offline benchmarks:

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`../wxrs/target/release/ch4_offline_benchmark`	47.0 ± 0.6	46.5	50.6	1.00
`python ../wxpy/wxpy/ch4/serialized_offline_benchmark.py 30 -140`	248.0 ± 22.4	210.8	278.2	5.27 ± 0.48

Rust is now running twice as fast as Python for these larger payloads.

Rust For Data