Added documentation to the Squozen README describing the decompressor algorithm.

2022-11-30 08:20:04 -08:00 · 2022-11-30 08:20:04 -08:00 · c7ba092293
parent e624aea369
commit c7ba092293
1 changed files with 98 additions and 0 deletions
--- a/crates/squozen/README.md
+++ b/crates/squozen/README.md
@@ -0,0 +1,98 @@
+# Squozen
+
+This crate contains a library for *reading* the Squozen database format, the
+original format used to store the database for the Unix `locate` command.
+
+## The format
+
+It's important to remember that the Squozen format was formalized in 1983. At
+the time, the Unix filesystem handled only the characters between 0x00 and 0x7f
+(0-127), filesystems were much smaller, and the conventions of short paths such
+as `/usr` and `/etc` and eight-character-dot-three-character filenames were in
+force. As such, the use of bigrams, the use of the topmost bit as a sentinel,
+and the assumption that each path would deviate from its prior by fewer than 14
+characters were sensible.
+
+The Squozen format consists of a 256-byte block that holds the 128 most
+common bigrams (two-letter sequences) that appear in the database, followed by a
+stream that has the following characteristics:
+
+A leading byte. If the byte is the RS symbol (Record Separator, ASCII 30), the
+next two bytes represent a 16-bit number; if not, the byte itself is treated
+as an 8-bit integer. This integer represents the number of characters from the
+preceding read (and must be zero if this is the first read!) that can be re-used
+in the current iteration. The starting pointer for inserts is moved to that
+position.
+
+As the byte stream is read, each byte is examined. If its uppermost bit is set,
+the remaining 7 bits are treated as an index into the bigram table, and the two
+bytes of that entry are inserted into the result buffer. Otherwise the character
+itself is inserted into the result buffer. This read terminates when a character
+equal to or less than the RS symbol is encountered. That character is preserved
+as the leading byte of the next read.
+
+## Analysis
+
+Before reading the database, a substring of the pattern requested for matching
+is extracted. This substring contains no wildcards or other glob special
+characters.
+
+When a result buffer is produced, the analysis starts from the *end* of the
+result buffer, comparing it to the prepared pattern. If it finds the last
+character of the prepared pattern, it starts a backward comparison, stopping only
+when the comparison either fails or reaches the beginning of the prepared pattern.
+
+If the comparison fails, the analysis resumes until we run out of string
+to compare. If it succeeds, we compare the whole pattern to the string using
+`fnmatch()`, and print the result if that comparison succeeds.
+
+If the comparison fails, we also record that it did, and establish that
+analyzing any part of the path *prior* to that marker is unnecessary, since the
+comparison failed to find anything within it.
+
+Analysis continues until the stream is exhausted.
+
+## The C implementation
+
+The C implementation wraps this entirely in one function, re-using string
+buffers and pointers with abandon. The C compilers of 1983 had no notion of
+variable shadowing or scoping; they just did what you told them to without ever
+questioning your decisions. The function is only 39 lines long and does all the
+work in that space. It also never uses more than 2KB of memory, and 130 bytes of
+that are the copyright notice!
+
+## The Rust implementation
+
+### Prepare Pattern
+
+The Rust implementation is a little more modern, and uses a bit more memory of
+course, but it's the same algorithm. The pattern preparer, which extracts the
+"non-glob" portion of the pattern to make base comparisons faster, returns a
+`Vec` rather than re-using a global array of 128 bytes. I also identified
+three places in the original code that performed the same action: "from a
+starting point, scan backwards until this function is satisfied or the array is
+exhausted." I've abstracted that out into a local function that takes the
+predicate as a closure.
+
+### Squozen
+
+The implementation of Squozen itself is broken up into three phases.  When the
+Squozen struct is instantiated with the path to the database, the bigram table
+is read in, but only the bigram table and the path are stored.
+
+The implementation of the Squozen database has two methods: `paths()` and
+`matches(pattern)`. `paths()` opens the database and returns an iterator,
+which begins reading it, uncompressing it according to the algorithm described
+above, and returning a read-only reference to an internal byte slice where the
+uncompressed results are stored.
+
+`matches(pattern)` performs the pattern preparation and then wraps
+`paths()`, returning only those strings that match the pattern, again using
+the read-only reference to the internal array presented by `paths()`.
+
+Each invocation of `paths()` (or the `matches(pattern)` wrapper) creates a
+new instance of the file reader, which is buffered, as well as the return slice.
+Multiple threads can be reading different instances at the same time, although I
+imagine some filesystem thrashing is likely if you have more than two or three.
+The scanner is pretty fast!
+
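As a rough illustration of the decompression loop described under "The format" in the README above, here is a minimal Rust sketch. It is not the crate's actual code: the function name `decode_paths`, the little-endian reading of the 16-bit count, and the layout of the bigram table as 128 consecutive two-byte pairs are all assumptions made for the example. It also takes the leading integer literally as the number of bytes to re-use, as the text describes, and does not attempt any further adjustment a real database might need.

```rust
/// ASCII Record Separator, used as the "a 16-bit count follows" marker.
const RS: u8 = 30;

/// Decode every path from `data`, the stream that follows the bigram table.
///
/// Assumptions (hypothetical, for illustration only):
/// - `bigrams` holds 128 entries of two bytes each, stored back to back.
/// - The 16-bit count after RS is little-endian.
/// - The leading integer is the number of bytes kept from the previous path.
fn decode_paths(bigrams: &[u8; 256], data: &[u8]) -> Vec<Vec<u8>> {
    let mut paths = Vec::new();
    let mut path: Vec<u8> = Vec::new();
    let mut pos = 0;

    while pos < data.len() {
        // Leading byte: either the reuse count itself, or RS followed by
        // a two-byte count.
        let lead = data[pos];
        pos += 1;
        let reuse = if lead == RS {
            if pos + 2 > data.len() {
                break; // truncated stream
            }
            let n = u16::from_le_bytes([data[pos], data[pos + 1]]) as usize;
            pos += 2;
            n
        } else {
            lead as usize
        };

        // Keep the shared prefix of the previous path; new bytes go after it.
        path.truncate(reuse);

        // Body: stop at (but do not consume) any byte <= RS, which becomes
        // the leading byte of the next read.
        while pos < data.len() && data[pos] > RS {
            let b = data[pos];
            pos += 1;
            if b & 0x80 != 0 {
                // High bit set: the low 7 bits index the bigram table.
                let i = ((b & 0x7f) as usize) * 2;
                path.push(bigrams[i]);
                path.push(bigrams[i + 1]);
            } else {
                path.push(b);
            }
        }

        paths.push(path.clone());
    }

    paths
}
```

Under these assumptions, given a table whose first two entries are `us` and `r/`, the stream `00 '/' 80 81 'b' 'i' 'n' 05 'l' 'i' 'b'` would decode to `/usr/bin` followed by `/usr/lib`. Cloning each path keeps the sketch simple; the README describes the real iterator as handing out a read-only reference to an internal buffer instead.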