Elf M. Sternberg 9f0158be7e Some documentation.		2022-12-29 11:38:13 -08:00
bench	Make prepare_pattern more Rust-like.	2022-11-24 12:56:13 -08:00
docs/patprep	The C and Rust versions are now comparable.	2022-11-13 12:33:34 -08:00
src	Adding an Errors block.	2022-11-27 10:37:44 -08:00
Cargo.toml	Intermediate progress: Squozen	2022-11-26 16:49:25 -08:00
README.md	Some documentation.	2022-12-29 11:38:13 -08:00

README.md

Squozen

This crate contains a library for reading the Squozen database format, the original format used to store the database for the Unix locate command.

The format

It's important to remember that the Squozen format was formalized in 1983. At the time, the Unix filesystem handled only the characters between 0x00 and 0x7f (0-127), filesystems were much smaller, and the conventions of short paths such as /usr and /etc and eight-character-dot-three-character filenames were in force. As such, the use of bigrams, the use of the topmost bit as a sentinel, and the assumption that each path would deviate from its predecessor by fewer than 14 characters were sensible.

The Squozen format consists of a 256-byte block that holds the 128 most common bigrams (two-letter sequences, two bytes each) that appear in the database, followed by a stream that has the following characteristics:

A leading byte. If the byte is the RS symbol (Record Separator, ASCII 30), the next two bytes represent a 16-bit number; if not, the byte itself is treated as an 8-bit integer. This integer is the number of characters from the preceding read (it must be zero on the first read!) that can be re-used in the current iteration. The starting pointer for inserts is moved to that position.

As the byte stream is read, if the uppermost bit of a byte is set, the remaining 7 bits are treated as an index into the bigram table, and the two bytes of that bigram are inserted into the result buffer; otherwise the byte itself is inserted into the result buffer. This read terminates when a byte equal to or less than the RS symbol is encountered. That byte is preserved as the leading byte of the next read.
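The two steps above can be sketched in Rust as follows. This is a minimal illustration written from the description in this README, not the code in src: the name `decode_record` is hypothetical, the whole stream is assumed to be in memory already, and the byte order of the 16-bit count (big-endian here) is an assumption, since the text above does not state it.

```rust
/// ASCII Record Separator: announces a 16-bit prefix length.
const RS: u8 = 30;

/// Decode the record whose leading byte sits at `stream[pos]`.
///
/// `path` must still hold the previous result (empty before the first
/// record). On success it holds the new path, and the returned value is
/// the position of the next record's leading byte. Returns `None` when
/// the stream is exhausted or the record is malformed.
fn decode_record(
    bigrams: &[u8; 256],
    stream: &[u8],
    mut pos: usize,
    path: &mut Vec<u8>,
) -> Option<usize> {
    let lead = *stream.get(pos)?;
    pos += 1;

    // How many bytes of the previous path are re-used.
    let keep = if lead == RS {
        let hi = *stream.get(pos)?;
        let lo = *stream.get(pos + 1)?;
        pos += 2;
        // ASSUMPTION: big-endian; the README does not give the byte order.
        u16::from_be_bytes([hi, lo]) as usize
    } else {
        lead as usize
    };
    if keep > path.len() {
        return None; // cannot re-use more than the previous read produced
    }
    path.truncate(keep);

    // Append bytes until the next leading byte (anything <= RS).
    while let Some(&b) = stream.get(pos) {
        if b <= RS {
            break; // left in place for the next call
        }
        if b & 0x80 != 0 {
            // High bit set: the low 7 bits select one of 128 bigrams.
            let i = ((b & 0x7f) as usize) * 2;
            path.extend_from_slice(&bigrams[i..i + 2]);
        } else {
            path.push(b);
        }
        pos += 1;
    }
    Some(pos)
}
```

Calling this in a loop, feeding each returned position back in as `pos`, walks the whole stream while re-using a single `Vec` for every path.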

Analysis

Before reading the database, a substring of the pattern requested for matching is extracted. This substring contains no wildcards or other glob special characters.

When a result buffer is produced, the analysis starts from the end of the result buffer, comparing it to the prepared pattern. If it finds the last character of the prepared pattern, it starts a backward comparison, stopping only when the comparison fails or we reach the beginning of the prepared pattern.

If the comparison fails, the analysis resumes until we run out of string to compare. If it succeeds, we compare the whole pattern to the string using fnmatch(), and print the result if that comparison succeeds.

If the comparison fails, we also record that it did, and establish that analyzing any part of the path prior to that marker is unnecessary, since the comparison failed to find anything within it.

Analysis continues until the stream is exhausted.
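The backward scan described above might look roughly like the sketch below. The function name and the `floor` parameter are hypothetical: `floor` stands in for the "marker" below which an earlier failed comparison makes further scanning unnecessary, and the final fnmatch() check on a hit is left to the caller.

```rust
/// Scan `path` from its end for `literal` (the prepared, glob-free
/// substring). Only candidates whose last byte lies at index `floor`
/// or later are tried.
fn literal_found_from_end(path: &[u8], literal: &[u8], floor: usize) -> bool {
    if literal.is_empty() {
        return true; // nothing to pre-filter on
    }
    let last = literal[literal.len() - 1];

    // `end` is one past the byte tested against the literal's last byte.
    let mut end = path.len();
    while end > floor && end >= literal.len() {
        if path[end - 1] == last {
            // Found the last character: compare the rest backwards.
            let start = end - literal.len();
            if path[start..end].iter().rev().eq(literal.iter().rev()) {
                return true;
            }
        }
        end -= 1;
    }
    false
}
```

Only paths that pass this cheap test need the comparatively expensive full glob match.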

The C implementation

The C implementation wraps this entirely in one function, re-using string buffers and pointers with abandon. The C compilers of 1983 had no notion of variable shadowing or scoping; they just did what you told them to without ever questioning your decisions. The function is only 39 lines long and does all the work in that space. It also never uses more than 2KB of memory, and 130 bytes of that are the copyright notice!

The Rust implementation

Prepare Pattern

The Rust implementation is a little more modern, and uses a bit more memory of course, but it's the same algorithm. The pattern preparer, which extracts the "non-glob" portion of the pattern to make base comparisons faster, returns a ''Vec'' rather than re-using a global array of 128 bytes. I also identified three places in the original code that performed the same action: "from a starting point, scan backwards until this function is satisfied or the array is exhausted." I've abstracted that out into a local function that takes the predicate as a closure.
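As an illustration of that abstraction, here is a simplified sketch. It is not the crate's actual prepare_pattern: the names `scan_back`, `is_glob`, and `last_literal_run` are hypothetical, the set of glob characters is an assumption, and only the last literal run of the pattern is extracted.

```rust
/// From `start`, walk backwards and return the index of the first byte
/// that satisfies `pred`, or `None` if the slice is exhausted first.
fn scan_back<F>(bytes: &[u8], start: usize, pred: F) -> Option<usize>
where
    F: Fn(u8) -> bool,
{
    let last = bytes.len().checked_sub(1)?; // empty slice: nothing to scan
    (0..=start.min(last)).rev().find(|&i| pred(bytes[i]))
}

/// ASSUMPTION: these four bytes are the glob specials of interest.
fn is_glob(b: u8) -> bool {
    matches!(b, b'*' | b'?' | b'[' | b']')
}

/// Return the last run of non-glob bytes in `pattern` as an owned Vec.
fn last_literal_run(pattern: &[u8]) -> Vec<u8> {
    let last = match pattern.len().checked_sub(1) {
        Some(i) => i,
        None => return Vec::new(),
    };
    // Skip trailing glob characters to find where the literal ends.
    let end = match scan_back(pattern, last, |b| !is_glob(b)) {
        Some(i) => i,
        None => return Vec::new(),
    };
    // Continue backwards to the glob character preceding the literal.
    let start = match scan_back(pattern, end, is_glob) {
        Some(i) => i + 1,
        None => 0,
    };
    pattern[start..=end].to_vec()
}
```

Passing the predicate as a closure lets the same helper serve both "skip the glob characters" and "find the start of the literal".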

Squozen

The implementation of Squozen itself is broken up into three phases. When the Squozen struct is instantiated with the path to the database, the bigram table is read in, but only the bigram table and the path are stored.

The implementation of the Squozen database has two methods: ''paths()'' and ''matches(pattern)''. ''paths()'' opens the database and returns an iterator, which begins reading it, uncompressing it according to the algorithm described above, and returning a read-only reference to an internal byte slice where the uncompressed results are stored.

''matches(pattern)'' performs the pattern preparation and then wraps ''paths()'', returning only those strings that match the pattern, again using the read-only reference to the internal array presented by ''paths()''.

Each invocation of ''paths()'' (or the ''matches(pattern)'' wrapper) creates a new instance of the file reader, which is buffered, as well as the return slice. Multiple threads can be reading different instances at the same time, although I imagine some filesystem thrashing is likely if you have more than two or three. The scanner is pretty fast!