Wednesday 28th June 2023
A while back I read a blog post by Dr. Drang about ChatGPT, which briefly discussed a way to generate text from a source input that, while random, had something of the feel of the original. This idea resonated with me, and after I read the original Computer Recreations article Dr. Drang’s post mentioned, I decided to try my hand at writing a program in Rust to do the same thing, based on the description of the algorithm in that article.
Here’s an example of the kind of output it can generate:
tranquillity; and fearful of inquire of Elizabeth for so many steps. He added, with an embargo on every day. And we all kindness to me. But not on any other talked less eccentric, than he liked her that he would not say that she chooses that of genius nor taste between friend’s dislike of meeting for their meeting, most agree with other? He declare for the man who have the same stated time. Accept me.”
“I think, rather talked off, and as Dawson does not rather a confidence, and the general topics, Meryton relations with Kent?”
A short time to calculate, with it every charming manner which not alarmed, madam, of course renewal of himself, Elizabeth looked at table, less interesting them in the place him!” thoughts to Netherfield, and his colossal cousin Elizabeth found a rapid and were traced them in every female intended her.
But Elizabeth was of course, therefore, I am sure (and I do not propitious, I suppose, is one of your retrospection between him in good-humour was not like a bal
A bit abrupt at the start and end, and the quote marks and brackets aren’t matched properly, but otherwise quite Austen-esque.
So why Rust? Well, mainly because I just really like the language. The software that runs this blog is written in it! It’s a great language that combines a lot of the ease of use and expressiveness of something like Python with the raw performance of C++. Moreover, the correctness and safety guarantees are very confidence-inspiring.
One of the links in the Dr. Drang post goes back to a much earlier post, from 2011, where he talks about writing a Perl script to do the text generation. He mentions a couple of times that performance might be an issue, especially on older (than 2011) computers, and this further reinforced my confidence in choosing Rust: it was very likely to be faster than Perl (not that I know any Perl).
First pass: using Rust’s standard library
This worked… fine. I began testing using just the first chapter of Moby Dick, which is only 12KB, and the results were encouraging: even in debug mode and flushing each character to STDOUT as it was generated, the whole output took less than a second. But, when you think about it, one whole second just to generate 1000 random characters from a mere 12KB of input is actually not very fast at all for a modern computer. Of course, this is in debug mode; switching to release mode made things significantly quicker, as expected, enough so that I put aside my performance concerns for now.
Speed bumps: Unicode
The algorithm works by looking for substring matches in the input text, and changing the substring for each new character added to the output. On the face of it this sounds simple, and were everything plain ASCII, it would be, because one byte equals one character, so indexing into a string is trivial. Alas, in the real world it’s not that simple: most text these days is probably encoded as UTF-8, which allows for multi-byte characters, so indexing into a string requires using the .chars() iterator rather than just, for example, input[start..end]. Plus, as the Rust documentation on char notes, “‘character’ isn’t a well-defined concept in Unicode”, with issues around normalisation, graphemes versus code points, combinatorial emoji, and a bunch more things; that’s a whole Pandora’s box that I’m largely ignoring in the interests of maintaining sanity. Feed the program sensible text and you’ll get sensible output. It’ll probably get very confused by things like “👨👩👦👦” in your text, but it’ll handle “tête-à-tête” just fine.
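To make the idea concrete, here’s a minimal sketch of that kind of generator, written from the description above rather than taken from my actual program (the function name, its signature, and the details are my own invention, and a real version would seed the window from a random spot and use a proper RNG):

```rust
use std::collections::BTreeMap;

// A sketch of the algorithm: keep a window of the last `order - 1` output
// characters, tally which character follows every occurrence of that window
// in the input, then pick the next character in proportion to those tallies.
// `pick(total)` stands in for a random number generator returning 0..total.
fn travesty(input: &str, order: usize, length: usize, mut pick: impl FnMut(usize) -> usize) -> String {
    let chars: Vec<char> = input.chars().collect();
    // Seed the output with the start of the input (a real version might
    // choose a random starting point instead).
    let mut output: Vec<char> = chars[..order - 1].to_vec();
    while output.len() < length {
        let window = &output[output.len() - (order - 1)..];
        // Frequency table of the characters that follow the window.
        let mut freq: BTreeMap<char, usize> = BTreeMap::new();
        for w in chars.windows(order) {
            if &w[..order - 1] == window {
                *freq.entry(w[order - 1]).or_insert(0) += 1;
            }
        }
        let total: usize = freq.values().sum();
        if total == 0 {
            break; // the window only ever appears at the very end of the input
        }
        // Weighted choice: walk the table until the tallies exceed `pick`.
        let mut n = pick(total);
        let next = freq
            .into_iter()
            .find(|&(_, count)| {
                if n < count { true } else { n -= count; false }
            })
            .map(|(c, _)| c)
            .unwrap();
        output.push(next);
    }
    output.into_iter().collect()
}

fn main() {
    // With a deterministic `pick`, a repetitive input reproduces itself.
    println!("{}", travesty("ababab", 2, 6, |_| 0));
}
```

Note that working with a `Vec<char>` up front, as here, is one way of sidestepping the byte-versus-character indexing problem, at the cost of extra memory.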
Abandoned path 1: word-based tokens
One of the things the Games::Dissociate library that Dr. Drang’s Perl script uses can do is use words, rather than characters, as the tokens for substring building. I added this to my program, and while it worked, it involved a lot of duplicated code that wasn’t quite similar enough for me to (be bothered to) write a generic algorithm operating on either words or characters. Moreover, the output wasn’t really that different from simple character-based operation, so I eventually just removed it. I may revisit the idea at some point, though.
Abandoned path 2: multithreading?!
Once I’d done a bit of refactoring to get the text generation part of the program into its own module, leaving the main program to handle command-line argument parsing etc., I turned my attention back to performance. As I said, it was okay, but I felt like it should be a lot better.
One idea I had was to use the rayon crate’s parallel iterators: basically splitting up the substring matching part (the part with the biggest performance impact) across multiple threads and/or CPU cores. Unfortunately this was problematic for two reasons that are fairly obvious in hindsight.
First of all, Rust is very concerned with making sure only one mutable reference can be held to a thing at a time. For each substring match found in the input, a single frequency table needs to be updated, and due to the way Rayon hides all the messy thread stuff behind a nice easy-to-use abstraction, it can’t allow its threads to have mutable access to the frequency table.
As a workaround, I rewrote that part of the code in a more functional style, where each thread builds its own frequency table, and once they’re all done, the tables are merged into one main table. This took quite some time, and I believe it would have worked fine if not for one teensy problem: Rayon’s par_match_indices() function does not accept a substring as its pattern to use for searching, unlike the Rust standard library’s match_indices() function. It does, however, accept &[char] (as does match_indices()), so I thought “great, I’ll just send it a slice of chars instead of a &str”, but the results were nowhere near what I expected.
It turns out, when you give either function a &[char], it matches on any of the characters in the slice, not all of them. I don’t think there’s a way around this; even if you pass a function or closure as the pattern, it can only accept a single character as a parameter, so there’s no way to do full substring matching.
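Here’s a small demonstration of the difference, using just the standard library:

```rust
fn main() {
    let text = "banana bread";
    // A &str pattern matches the whole substring:
    let whole: Vec<_> = text.match_indices("an").collect();
    println!("{:?}", whole); // → [(1, "an"), (3, "an")]
    // A &[char] pattern matches ANY single character from the slice:
    let any: Vec<_> = text.match_indices(&['a', 'n'][..]).collect();
    println!("{:?}", any); // → [(1, "a"), (2, "n"), (3, "a"), (4, "n"), (5, "a"), (10, "a")]
}
```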
BurntSushi to the rescue!
I’m sure the Rust standard library’s match_indices() is very nicely optimised, but it does have one big limitation: it has to deal with Rust strings, which are guaranteed to be valid UTF-8. This introduces a lot of overhead, as BurntSushi, aka Andrew Gallant, explains. He’s the guy behind ripgrep, a super-fast alternative to grep, and also, more relevantly for my purposes, the author of the memchr crate, which provides blazingly fast string searching by working on raw bytes, rather than dealing with &strs and all the overhead they entail, and also by making use of SIMD and AVX CPU instructions.
How fast? Well, the sample at the top of this post took about 800ms using the Rust standard library’s match_indices(), but a mere 5ms using find_iter(). A 730KB input, a 1000-character output, almost instantly. Much better.
And that’s more or less where I’ll leave it now, I think. There’s probably some minor refactoring I could do, but for now I’m happy with it. I may eventually extract the generator module into its own crate, so it can easily be used in other programs, but only if I feel there’s demand for it.
The source for the program is available on GitHub.