From c7ba09229371a67009d152a1ae4e2108307e5bda Mon Sep 17 00:00:00 2001
From: "Elf M. Sternberg"
Date: Wed, 30 Nov 2022 08:20:04 -0800
Subject: [PATCH 1/2] Added documentation to the Squozen README describing the
 decompressor algorithm.

---
 crates/squozen/README.md | 98 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 crates/squozen/README.md

diff --git a/crates/squozen/README.md b/crates/squozen/README.md
new file mode 100644
index 0000000..1d15b3c
--- /dev/null
+++ b/crates/squozen/README.md
@@ -0,0 +1,98 @@
+# Squozen
+
+This crate contains a library for *reading* the Squozen database format, the
+original format used to store the database for the Unix `locate` command.
+
+## The format
+
+It's important to remember that the Squozen format was formalized in 1983. At
+the time, the Unix filesystem handled only the characters between 0x00 and
+0x7f (0-127), filesystems were much smaller, and the conventions of short
+paths such as `/usr` and `/etc` and eight-character-dot-three-character
+filenames were in force. As such, the use of bigrams, the topmost bit as a
+sentinel, and the assumption that each path would deviate from its
+predecessor by fewer than 14 characters were all sensible.
+
+The Squozen format consists of a 256-byte block that holds the 128 most
+common bigrams (two-letter sequences) appearing in the database, followed by
+a stream with the following characteristics:
+
+A leading byte. If the byte is the RS symbol (Record Separator, ASCII 30),
+the next two bytes represent a 16-bit number; if not, the byte itself is
+treated as an 8-bit integer. This integer represents the number of characters
+from the preceding read (and must be zero if this is the first read!) that
+can be re-used in the current iteration. The starting pointer for inserts is
+moved to that position.
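
The leading-byte rule might be sketched like this (a hypothetical helper, not the crate's actual API; the big-endian byte order of the 16-bit count is an assumption):

```rust
// Sketch of the leading-byte rule described above. Illustrative only;
// the byte order of the 16-bit escaped count is assumed, not specified here.
const RS: u8 = 30; // ASCII Record Separator

/// Returns the prefix-reuse count and how many bytes of `input` it consumed,
/// or None if the stream is truncated.
fn read_prefix_count(input: &[u8]) -> Option<(u16, usize)> {
    let first = *input.first()?;
    if first == RS {
        // RS escape: the next two bytes hold a 16-bit count.
        Some((u16::from_be_bytes([*input.get(1)?, *input.get(2)?]), 3))
    } else {
        // Otherwise the byte itself is the 8-bit count.
        Some((u16::from(first), 1))
    }
}

fn main() {
    // A plain byte encodes the count directly...
    assert_eq!(read_prefix_count(&[5, b'a']), Some((5, 1)));
    // ...while RS escapes to a two-byte count.
    assert_eq!(read_prefix_count(&[RS, 0x01, 0x00]), Some((256, 3)));
}
```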
+
+As the byte stream is read, if the uppermost bit of a byte is set, the
+remaining 7 bits are treated as a lookup into the bigram table, and the two
+bytes found there are inserted into the result buffer; otherwise the
+character itself is inserted into the result buffer. The read terminates when
+a character equal to or less than the RS symbol is encountered. That
+character is preserved for the next read.
+
+## Analysis
+
+Before reading the database, a substring of the pattern requested for
+matching is extracted. This substring contains no wildcards or other glob
+special characters.
+
+When a result buffer is produced, the analysis starts from the *end* of the
+result buffer, comparing it to the prepared pattern. If it finds the last
+character of the prepared pattern, it starts a backward comparison, stopping
+only when the comparison fails or the beginning of the prepared pattern is
+reached.
+
+If the comparison fails, the analysis resumes until we run out of string to
+compare. If it succeeds, we compare the whole pattern to the string using
+`fnmatch()`, and print the result if that comparison succeeds.
+
+When a comparison fails, we also record where it failed, establishing that
+analyzing any part of the path *prior* to that marker is unnecessary, since
+the comparison has already failed to find anything within it.
+
+Analysis continues until the stream is exhausted.
+
+## The C implementation
+
+The C implementation wraps all of this in a single function, re-using string
+buffers and pointers with abandon. The C compilers of 1983 had no notion of
+variable shadowing or scoping; they just did what you told them to without
+ever questioning your decisions. The function is only 39 lines long and does
+all the work in that space. It also never uses more than 2KB of memory, and
+130 bytes of that are the copyright notice!
+
+## The Rust implementation
+
+### Prepare Pattern
+
+The Rust implementation is a little more modern and, of course, uses a bit
+more memory, but it's the same algorithm.
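
As a rough illustration of the substring extraction described under Analysis (hypothetical code with invented names, not the crate's actual preparer):

```rust
// Hypothetical sketch: pull the longest wildcard-free run out of a glob
// pattern, as described in the Analysis section. The set of glob special
// characters shown here is an assumption.
fn non_glob_substring(pattern: &str) -> &str {
    pattern
        .split(|c: char| matches!(c, '*' | '?' | '[' | ']'))
        .max_by_key(|part| part.len())
        .unwrap_or("")
}

fn main() {
    assert_eq!(non_glob_substring("*.rs"), ".rs");
    assert_eq!(non_glob_substring("src/*/main.?"), "/main.");
    assert_eq!(non_glob_substring("plain"), "plain");
}
```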
+The pattern preparer, which extracts the "non-glob" portion of the pattern to
+make base comparisons faster, returns a `Vec` rather than re-using a global
+array of 128 bytes. I also identified three places in the original code that
+performed the same action: "from a starting point, scan backwards until this
+function is satisfied or the array is exhausted." I've abstracted that out
+into a local function that takes the predicate as a closure.
+
+### Squozen
+
+The implementation of Squozen itself is broken up into three phases. When the
+Squozen struct is instantiated with the path to the database, the bigram
+table is read in, but only the bigram table and the path are stored.
+
+The implementation of the Squozen database has two methods: `paths()` and
+`matches(pattern)`. `paths()` opens the database and returns an iterator,
+which begins reading it, uncompressing it according to the algorithm
+described above, and returning a read-only reference to an internal byte
+slice where the uncompressed results are stored.
+
+`matches(pattern)` performs the pattern preparation and then wraps `paths()`,
+returning only those strings that match the pattern, again using the
+read-only reference to the internal array presented by `paths()`.
+
+Each invocation of `paths()` (or the `matches(pattern)` wrapper) creates a
+new instance of the file reader, which is buffered, as well as the return
+slice. Multiple threads can read different instances at the same time,
+although I imagine some filesystem thrashing is likely if you have more than
+two or three. The scanner is pretty fast!

From 73bca38bb9d45af967ec38f44f980046ce768727 Mon Sep 17 00:00:00 2001
From: "Elf M. Sternberg"
Date: Wed, 30 Nov 2022 08:27:49 -0800
Subject: [PATCH 2/2] Added a TODO.
---
 TODO.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
 create mode 100644 TODO.md

diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..774660b
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,16 @@
+# TODOs:
+
+- Implement the Squozen decompression algorithm
+- Implement the Squozen compression algorithm
+- Implement the Merging (mlocate) decompression algorithm
+- Implement the Merging (mlocate) compression algorithm
+- Implement the Posting (plocate) decompression algorithm
+- Implement the Posting (plocate) compression algorithm
+- Implement the root binary, providing a universal API for all of these
+  functions where possible. Make including/excluding them a configuration in
+  Cargo.toml.
+- Implement the updater function, stealing from the fast-find (fd & rg)
+  walkers as needed.
+- Implement a real-time server.
+- Provide a C API to the libraries.