Some documentation.

2022-12-29 11:38:13 -08:00 · 2022-12-29 11:38:13 -08:00 · 9f0158be7e
parent e624aea369
commit 9f0158be7e
2 changed files with 114 additions and 0 deletions
--- a/TODO.md
+++ b/TODO.md
@ -0,0 +1,16 @@
 # TODOs:
 - Implement the Squozen decompression algorithm
 - Implement the Squozen compression algorithm
 - Implement the Merging (mlocate) decompression algorithm
 - Implement the Merging (mlocate) compression algorithm
 - Implement the Posting (plocate) decompression algorithm
 - Implement the Posting (plocate) compression algorithm
 - Implement the root binary, providing a universal API for all of these
  functions where possible. Make including/excluding them a configuration in
  Cargo.toml.
 - Implement the updater function, stealing from the fast-find (fd & rg) walkers
  as needed.
 - Implement a real-time server.
 - Provide a C API to the libraries.
--- a/crates/squozen/README.md
+++ b/crates/squozen/README.md
@ -0,0 +1,98 @@
 # Squozen
 This crate contains a library for *reading* the Squozen database format, the
 original format used to store the database for the Unix `locate` command.
 ## The format
 It's important to remember that the Squozen format was formalized in 1983. At
 the time, the Unix filesystem handled only the characters between 0x00 and 0x7f
 (0-127), filesystems were much smaller, and the conventions of short paths such
 as `/usr` and `/etc` and eight-character-dot-three-character filenames were in
 force. As such, the use of bigrams, the topmost bit as a sentinel, and the
 assumption that each path would deviate from its predecessor by fewer than 14
 characters were sensible.
 The Squozen format consists of a 256-byte block that holds the 128 most
 common bigrams (two-character sequences, two bytes each) that appear in the
 database, followed by a stream that has the following characteristics:
 A leading byte. If the byte is the RS symbol (Record Separator, ASCII 30), the
 next two bytes represent a 16-bit number; if not, the byte itself is treated
 as an 8-bit integer. This integer represents the number of characters from the
 preceding read (and must be zero if this is the first read!) that can be re-used
 in the current iteration. The starting pointer for inserts is moved to that
 position.
 As the byte stream is read, if the uppermost bit of a byte is set, the
 remaining 7 bits are treated as an index into the bigram table, and the two
 bytes of that bigram are inserted into the result buffer; otherwise the byte
 itself is inserted into the result buffer. This read terminates when a byte
 equal to or less than the RS symbol is encountered. That byte is preserved as
 the leading byte of the next read.
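 As an illustration only, the stream described above could be decoded like this.
 The sketch takes the description literally: the leading integer is used as an
 absolute count of bytes to keep from the previous path, and the 16-bit form is
 assumed to be big-endian, which the text above does not specify. The name
 `decode_paths` and the in-memory input are hypothetical, not this crate's API.

```rust
/// ASCII Record Separator: marks a 16-bit count and bounds the terminator bytes.
const RS: u8 = 30;

/// Decode every path from `data`: a 256-byte bigram table followed by the stream.
/// Hypothetical helper for illustration; it trusts its input and does no validation.
fn decode_paths(data: &[u8]) -> Vec<Vec<u8>> {
    let mut paths = Vec::new();
    if data.len() < 256 {
        return paths;
    }
    let (bigrams, stream) = data.split_at(256);
    let mut path: Vec<u8> = Vec::new();
    let mut i = 0;

    while i < stream.len() {
        // Leading byte: either the count itself or RS followed by a 16-bit count.
        let lead = stream[i];
        i += 1;
        let keep = if lead == RS {
            if i + 2 > stream.len() {
                break; // truncated count, stop decoding
            }
            // Assumption: big-endian byte order.
            let n = u16::from_be_bytes([stream[i], stream[i + 1]]) as usize;
            i += 2;
            n
        } else {
            lead as usize
        };

        // Re-use `keep` bytes of the previous path and append after them.
        path.truncate(keep);

        // Body: runs until a byte <= RS, which is left in place as the next leading byte.
        while i < stream.len() && stream[i] > RS {
            let b = stream[i];
            i += 1;
            if b & 0x80 != 0 {
                // High bit set: the low 7 bits index a two-byte bigram.
                let idx = usize::from(b & 0x7f) * 2;
                path.extend_from_slice(&bigrams[idx..idx + 2]);
            } else {
                path.push(b);
            }
        }
        paths.push(path.clone());
    }
    paths
}
```

 Cloning each path keeps the sketch simple; a streaming reader would hand out a
 borrowed slice of the single buffer instead.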
 ## Analysis
 Before reading the database, a substring of the pattern requested for matching
 is extracted. This substring contains no wildcards or other glob special
 characters.
 When a result buffer is produced, the analysis starts from the *end* of the
 result buffer, comparing it to the prepared pattern. If it finds the last
 character of the prepared pattern, it starts a backward comparison, stopping
 only when the comparison fails or we reach the beginning of the prepared
 pattern. If the comparison fails, the scan resumes until we run out of string
 to compare. If it succeeds, we compare the whole pattern to the string using
 `fnmatch()`, and print the result if that comparison succeeds.
 If the prepared pattern is not found anywhere in a path, we also record that
 it wasn't. The next path re-uses a prefix of this one, so analyzing any part
 of the next path *prior* to the end of that shared prefix is unnecessary: the
 search already failed to find anything within it.
 Analysis continues until the stream is exhausted.
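 The backward scan and the "skip the shared prefix" shortcut can be sketched as
 follows. This is an illustration under assumptions, not the crate's code: the
 function names are hypothetical, the shared prefix is recomputed by comparing
 neighbouring paths rather than taken from the decoder's count, and the final
 `fnmatch()` step is passed in as a closure because the standard library has no
 equivalent.

```rust
/// True if `literal` occurs in `path` with its last byte at index >= `cutoff`.
/// Candidate end positions are visited from the end of the path backwards.
fn literal_ends_at_or_after(path: &[u8], literal: &[u8], cutoff: usize) -> bool {
    let last = match literal.last() {
        Some(&b) => b,
        None => return true, // no literal part: every path goes to the full match
    };
    // An occurrence cannot end before index `literal.len() - 1`.
    let lowest_end = cutoff.max(literal.len() - 1);
    (lowest_end..path.len())
        .rev()
        .any(|end| path[end] == last && path[..=end].ends_with(literal))
}

/// Number of leading bytes two paths have in common.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Keep the paths that contain `literal` and also pass `full_match`.
fn filter_paths<'a>(
    paths: &[&'a [u8]],
    literal: &[u8],
    full_match: impl Fn(&[u8]) -> bool,
) -> Vec<&'a [u8]> {
    let mut hits = Vec::new();
    let mut prev: &[u8] = &[];
    let mut prev_had_literal = true; // forces a full scan of the first path

    for &path in paths {
        // If the previous path held no occurrence, the prefix shared with it
        // cannot hold one either, so only ends past that prefix are checked.
        let cutoff = if prev_had_literal {
            0
        } else {
            common_prefix_len(prev, path)
        };
        prev_had_literal = literal_ends_at_or_after(path, literal, cutoff);
        if prev_had_literal && full_match(path) {
            hits.push(path);
        }
        prev = path;
    }
    hits
}
```

 The cutoff only limits where an occurrence may *end*; the occurrence itself may
 still begin inside the shared prefix.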
 ## The C implementation
 The C implementation wraps this entirely in one function, re-using string
 buffers and pointers with abandon. The C compilers of 1983 had no notion of
 variable shadowing or scoping; they just did what you told them to without ever
 questioning your decisions. It's only 39 lines long and does all the work in
 that space. It also never uses more than 2KB of memory, and 130 bytes of that
 are the copyright notice!
 ## The Rust implementation
 ### Prepare Pattern
 The Rust implementation is a little more modern, and uses a bit more memory of
 course, but it's the same algorithm. The pattern preparer, which extracts the
 "non-glob" portion of the pattern to make base comparisons faster, returns a
 `Vec` rather than re-using a global array of 128 bytes. I also identified
 three places in the original code that performed the same action: "from a
 starting point, scan backwards until this function is satisfied or the array is
 exhausted." I've abstracted that out into a local function that takes the
 predicate as a closure.
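 A minimal sketch of that shape, with hypothetical names (`scan_back`,
 `prepare_pattern`) and two simplifying assumptions that are mine rather than
 the crate's: only `*` and `?` are treated as wildcards, and any pattern
 containing a bracket expression or a backslash escape yields an empty literal,
 meaning "no prefilter, send every path to the full match".

```rust
/// Scan backwards from `start` (exclusive) and return the index of the first
/// byte that satisfies `pred`, or `None` if the slice is exhausted.
fn scan_back(bytes: &[u8], start: usize, pred: impl Fn(u8) -> bool) -> Option<usize> {
    let start = start.min(bytes.len());
    bytes[..start].iter().rposition(|&b| pred(b))
}

/// Assumption: only `*` and `?` count as wildcards in this sketch.
fn is_wildcard(b: u8) -> bool {
    matches!(b, b'*' | b'?')
}

/// Extract the last run of literal bytes from a glob pattern.
/// An empty result means there is nothing usable as a prefilter.
fn prepare_pattern(pattern: &[u8]) -> Vec<u8> {
    // Bracket expressions and escapes are not modelled here, so bail out
    // instead of returning a literal that a matching path might not contain.
    if pattern.iter().any(|&b| b == b'[' || b == b'\\') {
        return Vec::new();
    }
    // Skip trailing wildcards to find the last literal byte.
    let last = match scan_back(pattern, pattern.len(), |b| !is_wildcard(b)) {
        Some(i) => i,
        None => return Vec::new(), // the pattern is all wildcards
    };
    // Walk back to the wildcard that precedes the literal run, if any.
    let first = scan_back(pattern, last, is_wildcard).map_or(0, |i| i + 1);
    pattern[first..=last].to_vec()
}
```

 Both scans go through the same helper and differ only in the predicate, which
 is the point of passing it as a closure.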
 ### Squozen
 The implementation of Squozen itself is broken up into three phases.  When the
 Squozen struct is instantiated with the path to the database, the bigram table
 is read in, but only the bigram table and the path are stored.
 The implementation of the Squozen database has two methods: `paths()` and
 `matches(pattern)`. `paths()` opens the database and returns an iterator,
 which begins reading it, uncompressing it according to the algorithm described
 above, and returning a read-only reference to an internal byte slice where the
 uncompressed results are stored.
 `matches(pattern)` performs the pattern preparation and then wraps
 `paths()`, returning only those strings that match the pattern, again using
 the read-only reference to the internal array presented by `paths()`.
 Each invocation of `paths()` (or the `matches(pattern)` wrapper) creates a
 new instance of the file reader, which is buffered, as well as the return slice.
 Multiple threads can be reading different instances at the same time, although I
 imagine some filesystem thrashing is likely if you have more than two or three.
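 To make that shape concrete, here is a rough sketch, not the crate's actual
 code. The `PathReader` type, the `open` and `next_path` names, and the
 big-endian 16-bit count are all assumptions. It also uses a lending-style
 `next_path()` method instead of `std::iter::Iterator`, because the standard
 trait cannot return a slice borrowed from the reader's own buffer.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};

const RS: u8 = 30; // ASCII Record Separator

/// Keeps only the database path and the bigram table.
pub struct Squozen {
    path: PathBuf,
    bigrams: [u8; 256],
}

/// One independent pass over the database, with its own file handle and buffer.
pub struct PathReader {
    reader: BufReader<File>,
    bigrams: [u8; 256],
    buf: Vec<u8>,
    pending: Option<u8>, // terminator kept from the previous read
}

impl Squozen {
    pub fn open(path: impl AsRef<Path>) -> io::Result<Self> {
        let mut bigrams = [0u8; 256];
        File::open(&path)?.read_exact(&mut bigrams)?;
        Ok(Self { path: path.as_ref().to_path_buf(), bigrams })
    }

    /// Every call opens the file again, so readers never share state.
    pub fn paths(&self) -> io::Result<PathReader> {
        let mut file = File::open(&self.path)?;
        file.seek(SeekFrom::Start(256))?; // skip the bigram table
        Ok(PathReader {
            reader: BufReader::new(file),
            bigrams: self.bigrams,
            buf: Vec::new(),
            pending: None,
        })
    }
}

impl PathReader {
    fn read_byte(&mut self) -> io::Result<Option<u8>> {
        let mut b = [0u8; 1];
        match self.reader.read_exact(&mut b) {
            Ok(()) => Ok(Some(b[0])),
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(None),
            Err(e) => Err(e),
        }
    }

    /// Decode the next path into the internal buffer and lend it out.
    pub fn next_path(&mut self) -> io::Result<Option<&[u8]>> {
        let lead = match self.pending.take() {
            Some(b) => b,
            None => match self.read_byte()? {
                Some(b) => b,
                None => return Ok(None),
            },
        };
        let keep = if lead == RS {
            let mut n = [0u8; 2];
            self.reader.read_exact(&mut n)?;
            u16::from_be_bytes(n) as usize // assumption: big-endian
        } else {
            lead as usize
        };
        self.buf.truncate(keep);
        while let Some(b) = self.read_byte()? {
            if b <= RS {
                self.pending = Some(b); // becomes the next leading byte
                break;
            }
            if b & 0x80 != 0 {
                let i = usize::from(b & 0x7f) * 2;
                self.buf.extend_from_slice(&self.bigrams[i..i + 2]);
            } else {
                self.buf.push(b);
            }
        }
        Ok(Some(self.buf.as_slice()))
    }
}
```

 Because a `PathReader` owns its file handle and buffer, each thread can call
 `paths()` on a shared `&Squozen` and read at its own pace.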
 The scanner is pretty fast!