Kanidm[7]

July 12

Monday TODO list:

  • Figure out why on_close isn’t called.
  • Log span open/closing.

I started off the morning digging into the source code for Layered, which provides the majority of the functionality behind my subscriber. I found that in the case where the inner subscriber fails to close, on_close isn’t called. Furthermore, according to its source code, the Registry type (the underlying subscriber in my Layered type) only fails to close if there’s more than one reference to the span. However, all the references should be gone by the time the program successfully terminates; isn’t that exactly how Rc avoids leaking memory? So why was on_close never called at any point during the program?

My last resort was to go directly into the copy of tracing’s source code downloaded on my laptop and try to manually find where everything went wrong with panic debugging directly in Layered’s try_close implementation. I made my changes, ran cargo clean for a fresh compile, and ran my tests again and… no panic. Huh? I tried the same thing directly inside Registry’s try_close implementation, which was as low level as I could get, and still no panic. This indicated that I was misunderstanding when try_close should be called, because my code was never even attempting to close either of my spans.

More searching uncovered the issue, which was completely my fault. In my subscriber type that wrapped a Layered object, I had failed to realize that try_close has a default implementation that ignores the hook entirely, and I wasn’t overriding it. Fixing this was incredibly simple, and everything worked afterwards.
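
The shape of the fix looks roughly like this: a simplified sketch rather than my exact code, where MySubscriber and inner are stand-in names and every hook simply forwards to the inner Layered subscriber.

use tracing::{span, Event, Metadata, Subscriber};

struct MySubscriber<S> {
    inner: S, // e.g. a Layered<MyLayer, Registry>
}

impl<S: Subscriber> Subscriber for MySubscriber<S> {
    fn enabled(&self, metadata: &Metadata<'_>) -> bool {
        self.inner.enabled(metadata)
    }

    fn new_span(&self, attrs: &span::Attributes<'_>) -> span::Id {
        self.inner.new_span(attrs)
    }

    fn record(&self, id: &span::Id, values: &span::Record<'_>) {
        self.inner.record(id, values)
    }

    fn record_follows_from(&self, id: &span::Id, follows: &span::Id) {
        self.inner.record_follows_from(id, follows)
    }

    fn event(&self, event: &Event<'_>) {
        self.inner.event(event)
    }

    fn enter(&self, id: &span::Id) {
        self.inner.enter(id)
    }

    fn exit(&self, id: &span::Id) {
        self.inner.exit(id)
    }

    // The crucial part: without this explicit override, the trait's default
    // try_close never reaches the inner subscriber, so on_close never fires.
    fn try_close(&self, id: span::Id) -> bool {
        self.inner.try_close(id)
    }
}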

After I got on_close to work, I shifted gears towards inserting open/close logs with included timings.

To start, I went back to last week’s post and found where I linked the macro that tracing-subscriber uses to create events. I inlined the macro in order to gain a better understanding of how it worked. One thing that was still bothering me was that I didn’t know how to make the main message get directed to record_debug instead of record_str, but more digging through the docs revealed that I needed to store the value as a field::Value type, which fmt::Arguments implements. I got curious, began looking through the source code for fmt::Arguments, and came across something surprising: somewhere inside, it has an internal struct called ArgumentV1. I immediately thought back to one of William’s PRs where there was a type called DbValueV1 (I think). At the time, I thought having a version suffix was just a William (or OpenSUSE) thing, but now I know that it’s a pretty widespread practice (or a fun coincidence). Neat!

Emotional rollercoaster incoming:

I took a short break, and realized: Why couldn’t I just implement my own ZST types called Begin and End like the following:

use std::fmt;
use tracing::field::{self, Field, Visit};

struct Begin;

// (Spoiler: this impl isn't actually allowed; see below.)
impl field::Value for Begin {
    fn record(&self, key: &Field, visitor: &mut dyn Visit) {
        visitor.record_debug(key, &"begin" as &dyn fmt::Debug)
    }
}

I thought this was a great idea, but unfortunately field::Value requires that an internal private trait, crate::sealed::Sealed, also be implemented, completely preventing me from doing this. As I was about to go into a rage about how annoying the authors were for stopping me from doing the most basic things for (seemingly) no reason, I decided to dig a little deeper, telling myself there must be a good reason for it.

And in fact, there was. The library comes with a handy DebugValue type which acts as a thin wrapper around an fmt::Debug type and already implements field::Value. Perfect!

…except it had no constructor implementation and its fields were private. Hello??

It was only after 10 minutes of confusion that I decided to cmd+F through the source code for DebugValue to see if it was returned by any functions, and found the free function field::debug, which acts as its constructor. Of course I had to discover all of this by digging through the source code, because the docs only covered the high-level stuff. Lovely.

The good news was that this resulted in fewer lines of code. Creating a variable that I could use as the value in an Event’s field-value pair took only three lines:

#[derive(Debug)]
struct Begin;
let begin = field::debug(Begin);

And it worked! Hooray!

But I wasn’t done there. A few hours later, I found there was an analogous DisplayValue type and field::display function, allowing me to make my code even shorter:

let begin = field::display("Begin");

I also spent around an hour trying to find an idiomatic way to generate logs from within the Layer hooks, but I ended up just essentially copying what tracing-subscriber does internally with a macro.

July 13

Tuesday TODO list:

  • Make a SubscriberBuilder for my custom type.
  • Implement configuring JSON/Pretty, writer, and what span events to log.

Implementing a builder for my subscriber seemed like a great idea at first, but I soon realized it was trivial because it had so few configuration options, so I reverted my short-lived changes.

I also looked into the AuditScope type in kanidmd and implemented wrapper macros associated with each of the LogTag variants that it uses. Some examples are shown here:

admin_error!("Something bad that's admin related");
request_info!("Request ID: {}", 1111);
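
For illustration, here’s a hedged sketch of roughly what one of these wrappers could look like; the log_tag field name is just a stand-in, not necessarily what kanidmd actually records:

#[macro_export]
macro_rules! admin_error {
    ($($arg:tt)*) => {
        // Forward everything to tracing, attaching the tag as a field.
        tracing::error!(log_tag = "admin.error", $($arg)*)
    };
}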

I then shifted my focus towards adding the ability to log when a span idles over an await call, but I came across a frustrating setback: tracing::instrument enters and exits the scope on each poll, meaning it can do this many times over a single await call. This led to the logs being spammed with [await] traces.

Jul 07 22:26:32.728 TRACE second: tracing_tests::tests: [await] | time.idle=14875
Jul 07 22:26:32.929 TRACE second: tracing_tests::tests: [await] | time.idle=200175500
Jul 07 22:26:32.980 TRACE second: tracing_tests::tests: [await] | time.idle=51447000
Jul 07 22:26:32.980 TRACE second: tracing_tests::tests: [await] | time.idle=12583
Jul 07 22:26:33.181 TRACE second: tracing_tests::tests: [await] | time.idle=201165375
Jul 07 22:26:33.230 TRACE second: tracing_tests::tests: [await] | time.idle=48163334

If the #[instrument] macro were able to exit and enter just once over await calls, this would have worked great. Since I had no way to implicitly mark exactly the beginning and end of an await, I decided to scrap this idea.

Then, I added some emojis to the log messages, because I saw that AuditScope was doing that and it looked cute.

I also wondered: “would it be better if JSON logs were chronological and pretty logs were clumped into spans?” I didn’t know the answer to that so I called it a day there.

July 14

Wednesday TODO list:

  • Change the internal buffer to store logs in struct representations, instead of formatting when the event happens.
  • Start integrating within Kanidmd.

I had sent my progress to William on Tuesday, and he was able to get back to me overnight with feedback. His main suggestion was to delay formatting until the root span ended, so that we could send the log information to another thread for analyzing and formatting. I was able to quickly implement this change to store log information in an internal struct representation, MyEvent. In addition to making my code more logical, I was able to remove the MyBuffer type by replacing it with Vec<MyEvent>.
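
For reference, here is the kind of thing an event struct like this might hold, as a rough sketch (the exact fields are illustrative, not my real MyEvent):

use std::time::SystemTime;

struct MyEvent {
    timestamp: SystemTime,
    level: tracing::Level,
    // Field-value pairs captured from the event via a field visitor.
    fields: Vec<(String, String)>,
}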

It did take me about an hour to rewrite the formatter to use the vector-of-event-structs representation. A positive, though, is that we only have to check the format type (JSON/Pretty) once, and then we know every event should be formatted that way.

Furthermore, delaying formatting allowed me to clean up MyLayer::log_event quite a bit! Here’s the original:

fn log_event(&self, event: &Event, ctx: &Context<Registry>) {
    let span = match ctx.event_span(event) {
        Some(span) => span,
        _ => return,
    };

    let mut extensions = span.extensions_mut();

    let buf = extensions
        .get_mut::<MyBuffer>()
        .expect("Log buffer not found, this is a bug");

    self.fmt
        .format_event(buf, event, ctx)
        .expect("Write failed");
}

And here’s the updated version:

fn log_event(&self, event: &Event, ctx: &Context<Registry>) {
    if let Some(span) = ctx.event_span(event) {
        span.extensions_mut()
            .get_mut::<Vec<MyEvent>>()
            .expect("Log buffer not found, this is a bug")
            .push(MyEvent::new(event, ctx));
    }
}

The final step was to send the logs across threads for formatting, so my first stop for ideas on how to accomplish this was AuditScope. However, I instantly got sidetracked in the source code. I saw that it reserved space in the log buffer depending on the level of logs it was detecting, and decided to implement that too (in only six lines of code!).

I also realized that, in order to avoid hogging the stderr guard, we should format everything into a String buffer first, and then grab the stderr guard and dump it all at once, so that we hold the guard for the shortest amount of time possible. This took me about 15 minutes to implement.
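
A minimal sketch of that “format first, lock late” idea, with illustrative names rather than my actual code:

use std::io::Write;

fn flush_logs(formatted: String) {
    // All of the formatting work is already done, so the stderr lock below is
    // held only for the duration of a single write.
    let stderr = std::io::stderr();
    let mut guard = stderr.lock();
    guard
        .write_all(formatted.as_bytes())
        .expect("Failed to write logs to stderr");
}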

I also realized that I could calculate, before doing any work, exactly how much space the entire buffer will take up after formatting. Implementing this took almost an hour, but I got it to compute a tight upper bound on the length of the buffer! This is pretty pointless for short logs, but instances that create massive log batches will benefit greatly, because there will be zero reallocations of huge string buffers. However, it was a huge pain in the butt, because I felt like I was back to string formatting in C with all the char counting. :(
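
Conceptually it boils down to something like this, an illustrative sketch with stand-in types rather than the real formatter:

// Stand-in for an unformatted, buffered event.
struct PendingEvent {
    message: String,
    level: &'static str, // e.g. "TRACE", "ERROR"
}

// A (hypothetical) tight upper bound on the formatted length of one event:
// timestamp + space + level + space + message + newline.
fn formatted_len_upper_bound(event: &PendingEvent) -> usize {
    const TIMESTAMP_LEN: usize = "Jul 14 00:00:00.000".len();
    TIMESTAMP_LEN + 1 + event.level.len() + 1 + event.message.len() + 1
}

fn format_all(events: &[PendingEvent]) -> String {
    let capacity: usize = events.iter().map(formatted_len_upper_bound).sum();
    // One allocation up front means zero reallocations while formatting.
    let mut buf = String::with_capacity(capacity);
    for event in events {
        buf.push_str("Jul 14 00:00:00.000 "); // real timestamp omitted here
        buf.push_str(event.level);
        buf.push(' ');
        buf.push_str(&event.message);
        buf.push('\n');
    }
    buf
}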

July 15

Thursday TODO list:

  • Apply suggestion to turn raw data into information that tells a story.

Thursday started off with reviewing a round of feedback on tracing-tests. William gave me a lot of suggestions for how to improve my subscriber, and I began to feel like I was just creating an AuditScope clone. However, I think tracing still provides a lot that’s better, like the fact that it significantly reduces the boilerplate of having to explicitly pass a logging object around. We can just create one global subscriber at the very beginning and then forget about it.
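
That “set it once and forget it” pattern is just tracing’s global default subscriber; here’s a minimal sketch, using tracing-subscriber’s stock Registry as a stand-in for my own subscriber:

use tracing_subscriber::Registry;

fn main() {
    // Install one subscriber for the whole program...
    tracing::subscriber::set_global_default(Registry::default())
        .expect("a global subscriber has already been set");

    // ...and from here on, nothing needs to carry a logging object around.
    tracing::info!("logging is now globally available");
}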

Something that really stuck with me was the idea of transforming raw data into information that tells a story. It seemed so obvious in retrospect, but logs are meant to tell a story, and you can be a lot more creative and helpful than simply writing what happened.

In order to structure the logs to tell a story, I decided to write out an example of what I hoped the logs would eventually look like. Here’s what I came up with:

Jul 15 00:00:00.000 📍 TRACE    try_from_entry_ro [ 1ms | 28.209% / 100.000% ]
Jul 15 00:00:00.000 📍 TRACE    ┕━ server::internal_search [ 895µs | 6.412% / 71.791% ]
Jul 15 00:00:00.000 💬 INFO        ┝━ [request.info]: Some request info...
Jul 15 00:00:00.000 📍 TRACE       ┕━ server::search [ 815µs | 0.538% / 65.379% ] 
Jul 15 00:00:00.000 📍 TRACE          ┝━ be::search [ 535µs | 0.836% / 42.931% ]
Jul 15 00:00:00.000 📍 TRACE          │  ┝━ be::search -> filter2idl [ 268µs | 18.743% / 21.500% ]
Jul 15 00:00:00.000 📍 TRACE          │  │  ┝━ be::idl_arc_sqlite::get_idl [ 27µs | 2.221% ]
Jul 15 00:00:00.000 💬 INFO           │  │  │  ┕━ [filter.info]: Some filter info...
Jul 15 00:00:00.000 📍 TRACE          │  │  ┕━ be::idl_arc_sqlite::get_idl [ 6µs | 0.536% ]
Jul 15 00:00:00.000 🚨 ERROR          │  │     ┝━ [admin.error]: Oopsies, an admin error occurred :)
Jul 15 00:00:00.000 🐛 DEBUG          │  │     ┕━ []: An untagged debug log
Jul 15 00:00:00.000 📍 TRACE          │  ┕━ be::idl_arc_sqlite::get_identry [ 256µs | 20.595% ]
Jul 15 00:00:00.000 🔐 CRITICAL       │     ┝━ [security.critical] A security critical log
Jul 15 00:00:00.000 🔓 ACCESS         │     ┕━ [security.access] A security access log
Jul 15 00:00:00.000 📍 TRACE          ┕━ server::search<filter_resolve> [ 273µs | 21.910% ]

Everything about the design was very intentional. Timestamp and level are kept to the left to form a relatively one-dimensional column with basic information. The tree on the right has the spans and events.

Spans start with their name, making it easy to follow the execution timeline, and are followed by the relevant timings. The first percentage is the share of the root span’s total load that happens directly in that span. Spans with child spans get a second percentage, describing how much of the total work happens while that span is open. This lets readers easily determine whether the contents directly in a span are expensive, or whether the span simply has expensive child spans.

Events start with the full tag name, if they have one. This means that readers can look at the tree and know the type and severity without having to look back at the left column.

I realized my old version was extremely one-dimensional and stagnant in comparison to the redesign, which felt alive and actually told a story. I think the emojis also helped.

I also made the executive decision to enable timing by default, because it made everything simpler. What if a parent span opts for timing, the child doesn’t, and the grandchild does? That would be complicated, and frankly this wasn’t worth my time to worry about.

July 16

Friday TODO list:

  • Write post-processing %load algorithm.
  • Implement fancy formatting.

Spent the morning writing my post-processing algorithm. It essentially traverses a tree of events/spans and transforms the spans in it to carry information about their durations in relation to both the root and their children. While working on it, I realized that I’ve definitely reached the point where, when I’m not fighting trait bounds or initially unintuitive library design choices, I’m in a nice harmony with the compiler. I’m never fighting the borrow checker or getting type errors anymore.

Designing the algorithm was a relatively straightforward and unproblematic process, leaving very little to write about here. In my new implementation, I represented logs as an enum type, MyLogs, that could either be an event or a span containing a Vec<MyLogs>, making it a recursive type. I took more inspiration from AuditScope by creating my own MyProcessedSpan and MyProcessedLogs types that are returned from MyLogs::process. Since events didn’t need processing, I left the MyEvent type unprocessed.
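
To make that concrete, here is a hedged sketch of the shapes involved. The field names and the exact percentage math are my illustration of the idea, not the real implementation, and MyEvent is reduced to a stand-in:

struct MyEvent {
    message: String, // stand-in; the real event struct holds more than this
}

enum MyLogs {
    Event(MyEvent),
    Span {
        name: String,
        duration_ns: u64,
        children: Vec<MyLogs>, // recursion: a span owns everything nested in it
    },
}

enum MyProcessedLogs {
    Event(MyEvent),
    Span(MyProcessedSpan),
}

struct MyProcessedSpan {
    name: String,
    duration_ns: u64,
    percent_total: f64,  // this span's whole duration as a share of the root's
    percent_direct: f64, // time spent directly in this span, excluding children
    children: Vec<MyProcessedLogs>,
}

impl MyLogs {
    fn process(self, root_duration_ns: u64) -> MyProcessedLogs {
        match self {
            MyLogs::Event(event) => MyProcessedLogs::Event(event),
            MyLogs::Span { name, duration_ns, children } => {
                // Total time spent in direct child spans.
                let child_time: u64 = children
                    .iter()
                    .map(|log| match log {
                        MyLogs::Span { duration_ns, .. } => *duration_ns,
                        MyLogs::Event(_) => 0,
                    })
                    .sum();

                MyProcessedLogs::Span(MyProcessedSpan {
                    name,
                    duration_ns,
                    percent_total: 100.0 * duration_ns as f64 / root_duration_ns as f64,
                    percent_direct: 100.0 * duration_ns.saturating_sub(child_time) as f64
                        / root_duration_ns as f64,
                    children: children
                        .into_iter()
                        .map(|log| log.process(root_duration_ns))
                        .collect(),
                })
            }
        }
    }
}

Calling process on the root span with its own duration then gives the root a total of 100%, which lines up with the example output from Thursday.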

After I finished that (“finished” - I didn’t test it at all, oops), I transitioned to working on the formatter. I anticipated that this would be a lot harder because I wanted my fancy tree diagrams, but I actually got it working on the first try!

During my implementation of the formatter, I came across something really cool. I knew the @ token existed in Rust, but I had no idea what it was for. A few days back, I saw it somewhere in a match expression. For some reason I got distracted and decided to look into it a little more, and found that it can be used to match against a pattern while also binding the matched value to a variable. Here’s an example:

let size = match Some(8) {
    // `x @ 0..=4` matches the range while also binding the value to `x`
    Some(x @ 0..=4) => format!("small ({})", x),
    Some(x) if 5 <= x && x < 10 => format!("medium ({})", x),
    Some(x) => format!("large ({})", x),
    None => panic!("no number"),
};

This seems silly at first, because if guards already exist. However, I found a really great use for it in the formatter.

In order to make the fancy indentation work, I created an enum called Fill containing variants for each way the whitespace in the tree can be filled in.

enum Fill {
    Void, // "   "
    Line, // "│  "
    Fork, // "┝━ "
    Turn, // "┕━ "
}

The formatter maintains a vector of Fill variants to know what to draw before each log line. When a span is being formatted, it is either the last child of its parent or it isn’t. We can tell it’s the last child because the vector will end in the Turn variant, indicating that the bar should not extend further downwards. Otherwise, it will end in the Fork variant, indicating that the bar should continue down. It is then the span’s responsibility to prepare the vector of Fills for its own children, meaning it has to change that last item accordingly.
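
Drawing the prefix for a line is then just a matter of walking that vector and emitting each Fill’s characters; a quick illustrative sketch, not my exact formatter:

fn write_prefix(buf: &mut String, indent: &[Fill]) {
    for fill in indent {
        buf.push_str(match fill {
            Fill::Void => "   ",
            Fill::Line => "│  ",
            Fill::Fork => "┝━ ",
            Fill::Turn => "┕━ ",
        });
    }
}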

Changing that last item turned out to be the perfect use case for @:

match indent.last_mut() {
    Some(f @ Fill::Turn) => *f = Fill::Void,
    Some(f @ Fill::Fork) => *f = Fill::Line,
    _ => {}
}

Because the @ operator lets you match and bind at the same time, I could match on and bind to a &mut Fill simultaneously, concisely expressing “if the last item is a Turn, make it a Void, and if it’s a Fork, make it a Line.”

Testing this showed that everything surprisingly worked on the first try! I spent the rest of the day refactoring, and finally pushed changes to my little test repo.

Weekly Reflections

This week was amazing, both because I got to write a lot of code, and because I got to work with a lot of William’s feedback. Although I wasn’t unhappy in interface jail, it was definitely a great feeling to return to my comfort zone of just writing my own thing without having to constantly refer to docs, and to get to see real progress. I also know I grew a ton this week, because each day I felt like I understood so much more than I had the previous day. For example, I was initially happy to format all my logs the moment they occurred. The next day, I realized that was a terrible idea and implemented delayed formatting on a separate thread. A few days later, William helped me realize that simply delaying formatting was still not good enough, and that I should do a processing step to transform the logs from raw data into information that tells a story. This is the type of advice that I think really only comes from experienced mentors who have been around the block a couple of times and have seen what works and what doesn’t, as it never occurred to me (or the writers of tracing’s documentation) that raw logs aren’t the most useful.

Written on July 18, 2021