Friday 27th October 2023
Some months ago I wrote Haywright, a filler text generator that took an input text and produced output that was random, but had the feel of the input. As I mentioned in a recent post, at one point during development I tried implementing word-based tokens, but abandoned it as it felt like a lot of unnecessary code duplication.
Well, ever since then the idea has remained in my mind, stewing away, and so I decided to take another crack at it. I started from a clean slate, removing all the old character-based generator code, and spent a bit more time thinking about how best to use Rust’s excellent type system to ensure correctness. I think the solution I came up with is actually quite nice, and is based on the description of the algorithm used by the Games::Dissociate Perl library.
I also tried to make the same algorithm work for character-based tokens, using Rust’s enums for a nice generic solution, but it turns out that algorithm only really works for word-based tokens; Brian Hayes’ algorithm works much better for character-based tokens.
So, one copy-paste later to re-add the previously excised code, and some tidying up, and we arrive at the present program, which can work using either token type. The output from each is subtly different, but still surprisingly effective. The character-based version currently produce results with slightly better punctuation and capitalisation, but I think it should be fairly straightforward to pass over the word-based output and fix these things.
One other thing about the word-based solution is that it works much better on long input texts, because they tend to have more instances of matching token sequences. Using Moby Dick as the input, if we restrict it to only the first chapter, the algorithm usually only manages to find 10 – 20 matches of two-word sequences for a 1,000-character length output, so most of the output is just a random sampling of word pairs from the input. Using the entire text of Moby Dick, on the other hand, results in 35 – 50 matches, so there are many more longer chains of words that fit together plausibly.
Just for funsies, here’s a couple of examples using my previous Haywright blog post as the input text, first using a 5-character token, and then using a 2-word token
hich is one mainly be updated, enough so that Dr. Drang’s Perl script to do full substring in the real with Kent?” A short time. Accept a single characters from 2011) computer Recreation does have the Rust document parsing match found in other a confidence can do is use for now I’m happy with Kent?” A short time, and AVX CPU cores. Unfortunately that she characters from 2011) computer Recreation a &[char] (as does have the original Computer. Of course, that parsing UTF-8, which allow its own frequency table accept a single character added to) writing for the idea I had some minor refactoring building, rather than Perl (not that I expected, and that he would do, but it can be held to be fast alternative to calculate, so indexing into its pattern to use the tokens One idea at somethings like the descript to do full substring a Perl (not that I expected, and once the original Computer. Of course, than he like Python with the original Computer Recreation back to a string match_indices() functio
why Rust? Well, the Rust to due to in the real world operation, so character as for searching, a workaround On the face of near what to handle I read the original Computer Recreations UTF-8, which things the while random, that’s a standard library This worked… A while its threads would not do not real world to be faster than a super-fast the ease put aside a modern more relevantly This idea it has all of a great therefore, I added this example, input[start..end]. the raw it was very likely of chars text these one mutable reasons that rayon crate’s but only on any other talked bothered to) very confidence-inspiring. it, one rayon crate’s expected, enough other? He the output. On the description of all kindness output wasn’t from a mere 12KB And we is written for now I’m happy just to at the start and a parameter, Drang post that combines aren’t matched Andrew Gallant, crate’s parallel be a was not a nice them in the place was to release mode the rescue! deal with mode and