Lab 3: Parsing Frames

Due February 28, 11:59pm

As stated on the course page, one of the goals of the course is to build a chat server, and that starts now. We will spend the remaining labs building up small components that we will piece together at the end.

Networking applications such as chat servers rely on being able to communicate between computers, and this happens by sending bytes between them, which are just streams of 1s and 0s. By themselves, bytes are meaningless, but if both computers agree on a protocol for how the bytes are structured, then they can derive meaning from them and communicate. An analogy to this is how making monkey noises isn't an effective mode of communication, but speaking a shared language is.

In this lab, we'll be (partially) implementing a protocol which will allow us to parse unstructured bytes into flexible higher-level structures that will be much easier to work with later on.

Since bytes are unstructured, however, there are many things that might go wrong during parsing. It might tell us to read in the next integer represented in ASCII digits, but then we find a non-digit character. Or it might tell us to read the next 100 bytes as an array of bytes, but there's only 50. As such, one of the focuses of this lab is to practice writing robust, fallible code with useful error types.

Grading Rubric

  • Code is formatted properly using cargo fmt: 5%
  • Code passes cargo clippy without warnings: 10%
  • Code passes our tests: 60%
  • Responses in questionnaire.md: 25%

Learning Objectives

  • Practice defining and using enum types.
  • Write fallible code with elegant error handling.

Table of Contents

Defining the Frame type

We want to communicate with structured messages, and will do this via the RESP protocol. In short, RESP is an easily parsable, human readable message format. RESP allows for sending various kinds of messages called frames, which can be one of the following variants:

  • Simple UTF-8 strings (String)
  • Error messages as UTF-8 strings (String)
  • 64-bit signed integers (i64)
  • Bulk strings, which are arbitrary bytes (Vec<u8>)
  • Vectors of other frames (Vec<Frame>)

Since frames can take on any one of these variants, it's the perfect candidate for declaring as an enum, which is the first task. Begin by opening your project and filling in the variants for the Frame type, one for each bullet point above. Make sure that each variant can carry the data associated with it, e.g. you should have a Simple(String) variant.

Filename: src/frame.rs

pub enum Frame {
    Simple(String),
    // todo: fill in variants
}

The RESP protocol

Since we're building a basic networking application from the ground up, we're going to write the parsing code that will convert raw bytes into Frame objects that we can use later on.

This section will tell you how each variant is encoded in bytes.

  • RESP Simple Strings:
    • Leading byte is "+", followed by a string that cannot contain a \r or \n character, and terminated by \r\n.
    • Example: "I love Rust 🦀" becomes "+I love Rust 🦀\r\n".
  • RESP Errors:
    • Same as simple strings, but the leading byte is "-".
    • Example: "NOT FOUND" becomes "-NOT FOUND\r\n".
  • RESP Integers:
    • Leading byte is ":", followed by a \r\n-terminated string representing an integer.
    • Example: 123 becomes ":123\r\n".
  • RESP Bulk Strings:
    • Leading byte is "$", followed by the number of bytes composing the string in ASCII digits (a prefixed length), followed by \r\n, followed by the actual bytes, followed by a final \r\n.
    • Example: "Æ’oo" becomes "$4\r\nÆ’oo\r\n", since "Æ’" is encoded in 2 bytes and "oo" is 2 bytes, one for each "o". Yes, characters can be variable length encoded. No, we won't be diving into this rabbit hole.
  • RESP Arrays:
    • Leading byte is "*", followed by the number of frames in the array in ASCII digits, followed by \r\n, followed by the encodings of the frames within the array.
    • Example: ["hello", 100] becomes "*2\r\n+hello\r\n:100\r\n".

When you start writing your Frame::check method (defined below), you should be consulting this section so you know how to properly parse data from the bytes.

Implementing methods for the Frame type

The main functionality we'll be implementing in this lab is parsing Frame values from a slice of bytes (&[u8]), as described by the protocol. Instead of reading straight from a &[u8], however, we'll read from a &mut Cursor<&[u8]> instead. Let's break down this type signature.

The &mut part is self explanatory- we're getting a mutable reference to a Cursor<&[u8]>. The Cursor part is a type provided by the standard library as std::io::Cursor, and it looks like this:

struct Cursor<T> {
    inner: T,
    pos: u64,
}

So when we have Cursor<&[u8]>, we're essentially just pairing a byte slice, &[u8], with a position, u64. This means that it can keep track of how much we've read so far without changing anything about the underlying buffer. Don't worry too much about the syntax or the name of the type for now, we'll be exploring it more next week.

One of the benefits of Rust is that you can view source code from the documentation. If you're curious about how Cursor works, you can see its documentation here, and even see its source code by pressing the orange source button on the right side.

Next, we need a way to take a &mut Cursor<&[u8]>, read from it, and return a Frame to us (or some kind of error). One thing to know about communicating over a network is that even if a chunk of bytes is sent all at once, the receiving computer might not get them all at the same time.

To combat this, we'll create two methods, Frame::check and Frame::decode, where the former checks that a Frame can be parsed without actually parsing it, and the ladder actually parses the bytes and returns a Frame. To see why we might want this, consider the following sequence of bytes (with whitespace added for readability):

*3\r\n
+GET\r\n
+Potato\r\n

This is a frame that is an array of three inner frames, but only two are present so far. If we eagerly create a Frame::Array with an empty Vec and push two Frame::Simples to it, we'll find out that we don't yet have data and return an error, meaning our String and Vec allocations will have been for nothing.

Thus, having distinct Frame::check and Frame::decode methods will allow us to cheaply check that there is enough data without having to perform potentially wasted heap allocations if there turns out not to be.

Let's briefly walk through how these methods should be implemented.

Frame::check

Frame::check ensures that the bytes are encoded as a valid frame as described below. First, it checks that the first byte matches one of the expected bytes (+, -, :, $, *), and performs more checks depending on the kind:

  • Simple strings and errors have a terminating \r\n, and have been checked as valid UTF-8. To get you started, you could implement this branch as just

    let slice: &[u8] = line(src)?;
    let _validated: &str = from_utf8(slice)?;
    // validated, don't need to do anything else now

    This should work because you should already have use std::str::from_utf8; at the top.

  • Integers can be parsed as an i64 and have a terminating \r\n.

  • Bulk strings have a valid prefix length and that there are that many bytes remaining, followed by \r\n.

  • Arrays have a valid prefix length, n, and that n frames afterwards can also be checked with Frame::check. This can be accomplished by looping n times and recursively calling Frame::check in each iteration. For example, if we have this sequence of bytes in our &mut Cursor<&[u8]> (the ^ is where the cursor position is)

    *2\r\n+Hello\r\n+world\r\n
    ^
    

    Then calling Frame::check on it will first detect that it's an array of 2 elements, while advancing the cursor forward past the first line:

    *2\r\n+Hello\r\n+world\r\n
          ^
    

    It'll then loop twice, where the first iteration will check and advance past the first inner frame (the "Hello" simple string):

    *2\r\n+Hello\r\n+world\r\n
                    ^
    

    and the second iteration will check and advance past the second and last inner frame (the "world" simple string):

    *2\r\n+Hello\r\n+world\r\n
                              ^
    

    If any of the checks on the inner n frames returns an Err, propogate it up with the ? operator.

    Note that because we're using a Cursor, each call to Frame::check will advance the position of the cursor, so we don't have to manually track where in the buffer we are.

On success, it returns Ok(()), which just tells us "hey, this operation succeeded but I don't have anything to give you." On failure, it returns Err(ParseError). The ParseError type, which will be defined by you, is explained more here.

Frame::decode

Frame::decode is nearly identical to Frame::check, except it creates the Frame objects as it goes instead of just throwing out bytes after validation.

For example, if you branch for simple frames in encode looked like this:

let slice: &[u8] = line(src)?;
let _validated: &str = from_utf8(slice)?;

Then you would want this instead in the decode method:

let slice: &[u8] = line(src)?;
let validated: &str = from_utf8(slice)?;
Ok(Frame::Simple(validated.to_string()))

You could also be clever and write it in one big expression if you feel comfortable with that too:

Ok(Frame::Simple(from_utf8(line(src)?)?.to_string()))

Another change would be that instead of just checking frames in a loop in the array variant, you'd actually want to be pushing the successfully parsed frames to a Vec as you go.

Starter code

The outline for these functions is provided in the same file:

Filename: src/frame.rs

impl Frame {
    pub fn check(src: &mut Cursor<&[u8]>) -> Result<(), ParseError> {
        todo!("implement Frame::check")
    }

    pub fn decode(src: &mut Cursor<&[u8]>) -> Result<Self, ParseError> {
        todo!("implement Frame::decode")
    }
}

It is your task to implement these functions according to the RESP protocol and with proper error handling.

Speaking of error handling...

The ParseError type

When writing up the Frame::check and Frame::decode methods, there are several things that can go awry. For example, one of the cursor helper functions might return a CursorError, or the first byte of a frame might not be one of the expected bytes. In these cases, you'll want to add a variant to the ParseError type defined at the bottom of src/error.rs.

To get you started, two possible things that might go wrong are one of the cursor functions might return a CursorError, or the first byte of a frame might not be one of the expected bytes as shown in the last branch above.

pub enum ParseError {
    Cursor(CursorError),
    FrameType(u8),
    // add more
}

For a refresher on errors as enums and how to make your errors work with the ? operator, read back on our lecture notes in 3.1. Enums.

Cursor helper functions

For help working with the Cursor type, we've created a small library with utility functions which you can see in Cargo.toml if you're interested.

Near the top of src/frame.rs, you should see the following:

use cursor::{byte, integer, line, size, slice};

To learn how to use these, navigate inside your project directory and enter the following command:

cargo doc --open

This will generate documentation for your Rust project in the form of HTML, JS, and CSS, and then open the corresponding files as a web page on your default browser. Under the Crates section on the left-hand side, click on cursor, and you should be brought to the documentation which will explain how each method works.

Get started with a match expression

We've seen how match expressions can be used to extract the value of an enum, but they can also be used like switch statements in C/C++.

Here's an example that you can run for yourself:

fn main() {
    // We can specify a byte by prefixing a char literal with `b`
    // Try changing the byte!
    let byte: u8 = b'+';

    match byte {
        b'+' => {
            // all good!
            println!("Plus!");
        }
        b'-' => {
            // all good!
            println!("Minus!");
        }
        x => {
            // hey, this isn't right and we can't proceed properly from here
            println!("None of the expected bytes matched, found: {}", x);
        }
    }
}

You may find this useful when determining which kind of frame is encoded based off the first byte read.

If you're ever at a point in code where you're thinking "hey, this isn't right and we can't proceed properly from here", than it's probably an error and you should add a variant for the particular scenario on the ParseError type so you can return it.

Helpful conversions

Converting &[u8] to Vec<u8>

You can use the .to_vec() method to create an owned Vec<u8> from a borrowed slice, &[u8].

#![allow(unused)]
fn main() {
let line: &[u8] = &[1, 2, 3];
let bulk: Vec<u8> = line.to_vec();
}

Converting &[u8] to String

You can use the std::str::from_utf8 function to get from &[u8] to &str (without any copying!), and then use .to_string() to get from &str to String.

fn main() -> Result<(), std::str::Utf8Error> {
use std::str::from_utf8;

let line: &[u8] = &[240, 159, 166, 128];
let borrowed_string: &str = from_utf8(line)?;
let owned_string: String = borrowed_string.to_string();

println!("Secret message: {}", owned_string);
Ok(())
}

Testing your code

In lab 1, we had the following at the top of our lib.rs file:

#[cfg(test)]
mod tests;

This created a tests module inside of our project, which was accessible in tests.rs.

For this lab, however, we'll be writing tests at the bottom of src/frame.rs. Near the bottom, you'll find the following:

Filename: src/frame.rs

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_check() {
        // Use the following syntax to get your `Cursor<&[u8]>` buffers.
        let mut src = Cursor::new(":100\r\n".as_bytes());

        // `.is_ok()` returns true if `Result` is the `Ok` variant.
        assert!(Frame::check(&mut src).is_ok());
    }
}

This is essentially equivalent to the above (with a sample test though), but the contents of the tests module can go in the block here instead of in another file. Write some tests to check that your code works!

One of the reasons we might want to write tests in the same file is if we're building up a larger framework and don't want test files everywhere, but want to keep our tests close to what we're testing for clarity.

Questionnaire

The last part of this assignment is to fill our several short response questions. Since we'll be reusing this repository for the rest of the course, you'll find the questionnaire in questionnaire/lab3.md this time. In future labs, we'll push other questionnaires to this directory, so be sure to not add other files or move files around.

Feedback

Please make sure that each member fills out this short feedback form individually.

Submitting

Once you're finished, be sure to verify that:

  • cargo fmt has been run
  • cargo clippy passes
  • cargo test passes

Then you can push your changes with the usual: git add ., git commit -m "your message", and git push.

However, there's one final step: Git tags.

In the industry, tags are how software is officially released. Essentially, it creates an immutable copy of the repository and all the contents and makes it fancy so people can download and use it. We'll be using it to denote stages of completion on this project, so one for each of the remaining labs.

The benefit of this is that when you create a git tag, we can look at it to see your repository at that moment in time. This means that if you're working on lab 4 and your repository is in an incomplete state that potentially doesn't compile, we can still look back at the point where you created a tag for this lab and see your working code from that point in time. It will also help us know exactly what code you want to submit- that is, if you need an extension, we'll know not to grade your code until you create a git tag.

So, when you're sure you're completely finished with the lab, start by creating a tag:

git tag lab3

And then push it to your repository:

git push origin lab3

Congratulations on finishing!