Added documentation to the Squozen README describing the decompressor algorithm.

Elf M. Sternberg 2022-11-30 08:20:04 -08:00
parent e624aea369
commit c7ba092293
1 changed files with 98 additions and 0 deletions

crates/squozen/README.md Normal file

# Squozen
This crate contains a library for *reading* the Squozen database format, the
original format used to store the database for the Unix `locate` command.
## The format
It's important to remember that the Squozen format was formalized in 1983. At
the time, the Unix filesystem handled only the characters between 0x00 and 0x7f
(0-127), filesystems were much smaller, and the conventions of short paths such
as `/usr` and `/etc` and eight-character-dot-three-character filenames were in
force. As such, the use of bigrams, the topmost bit as a sentinel, and the
assumption that each path would deviate from its predecessor by fewer than 14
characters were sensible.
The Squozen format consists of a 256-byte block that holds the 128 most
common bigrams (two-letter sequences) that appear in the database, followed by a
stream that has the following characteristics:
A leading byte. If the byte is the RS symbol (Record Separator, ASCII 30), the
next two bytes represent a 16-bit number; if not, the byte itself is treated
as an 8-bit integer. This integer represents the number of characters from the
preceding read (and must be zero if this is the first read!) that can be re-used
in the current iteration. The starting pointer for inserts is moved to that
position.
As the byte stream is read, if the uppermost bit of a byte is set, the
remaining 7 bits are treated as an index into the bigram table, and the two
bytes of that entry are inserted into the result buffer; otherwise the character
itself is inserted into the result buffer. The read terminates when a character
less than or equal to the RS symbol is encountered. That character is preserved
for the next read, where it serves as the leading byte.
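The two steps above can be sketched in a few lines of Rust. This is a minimal illustration that follows the prose in this section, not the original C source or this crate's actual code: the name `decode_all` is made up for the example, the whole stream is assumed to be in memory, and the 16-bit count is assumed to be big-endian, since the text does not say.

```rust
/// ASCII Record Separator; as a leading byte it announces a 16-bit count.
const RS: u8 = 30;

/// Decode every path in `stream`, following the prose description above.
/// `bigrams` is the 256-byte table: entry `i` lives at bytes `2*i` and `2*i + 1`.
/// Assumption: the 16-bit count is stored big-endian.
fn decode_all(bigrams: &[u8; 256], stream: &[u8]) -> Vec<Vec<u8>> {
    let mut paths = Vec::new();
    let mut buf: Vec<u8> = Vec::new();
    let mut pos = 0;
    while pos < stream.len() {
        // Leading byte: how many characters of the previous path are re-used.
        let lead = stream[pos];
        pos += 1;
        let keep = if lead == RS {
            if pos + 2 > stream.len() {
                break; // truncated stream
            }
            let n = u16::from_be_bytes([stream[pos], stream[pos + 1]]) as usize;
            pos += 2;
            n
        } else {
            lead as usize
        };
        // Move the insert position; the first read keeps nothing.
        buf.truncate(keep);
        // Body: runs until a byte less than or equal to RS, which is left in
        // place so it becomes the leading byte of the next read.
        while pos < stream.len() && stream[pos] > RS {
            let b = stream[pos];
            pos += 1;
            if b & 0x80 != 0 {
                // Top bit set: the low 7 bits index the bigram table.
                let i = 2 * (b & 0x7f) as usize;
                buf.extend_from_slice(&bigrams[i..i + 2]);
            } else {
                buf.push(b);
            }
        }
        paths.push(buf.clone());
    }
    paths
}
```

Cloning the buffer for every path keeps the sketch short; a real reader would more likely hand out a borrowed slice and overwrite it on the next read.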
## Analysis
Before reading the database, a substring of the pattern requested for matching
is extracted. This substring contains no wildcards or other glob special
characters.
When a result buffer is produced, the analysis starts from the *end* of the
results buffer, comparing it to the prepared pattern. If it finds the last
character of the prepared pattern, it starts a backward analysis, stopping only
when comparison either fails or we reach the beginning of the prepared pattern.
If the comparison fails, the analysis resumes again until we run out of string
to compare. If it succeeds, we compare the whole pattern to the string using
`fnmatch()`, and print the result if that comparison succeeds.
If the comparison fails, we also record that it did, and establish that
analyzing any part of the path *prior* to that marker is unnecessary, since the
comparison failed to find anything within it.
Analysis continues until the stream is exhausted.
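The backward scan can be sketched as follows. The function name `tail_contains` and the `stop` parameter are hypothetical; `stop` stands in for the recorded marker, on the assumption that it bounds where the last character of the prepared pattern may sit. The final `fnmatch()` check is left out, since the Rust standard library has no equivalent.

```rust
/// Returns true if `needle` occurs in `path` with its final byte at index
/// `stop` or later. The scan runs from the end of `path` toward the front.
fn tail_contains(path: &[u8], needle: &[u8], stop: usize) -> bool {
    if needle.is_empty() {
        return true;
    }
    let last = needle[needle.len() - 1];
    // `end` is one past the candidate position of the needle's final byte.
    let mut end = path.len();
    while end > stop {
        if path[end - 1] == last && end >= needle.len() {
            let start = end - needle.len();
            // Compare backwards, from the last byte of the needle to the first.
            if path[start..end].iter().rev().eq(needle.iter().rev()) {
                return true;
            }
        }
        end -= 1;
    }
    false
}
```

With `stop` set to 0 the whole path is examined, so `tail_contains(b"/usr/bin/locate", b"bin", 0)` is true. A larger `stop` skips the front of the path that an earlier, failed comparison already covered, so the same call with `stop` set to 9 is false.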
## The C implementation
The C implementation wraps this entirely in one function, re-using string
buffers and pointers with abandon. The 1983 C compilers had no notion of
variable shadowing or scoping; they just did what you told them to without ever
questioning your decisions. It's only 39 lines long and does all the work in
that space. It also never uses more than 2KB of memory, and 130 bytes of that
are the copyright notice!
## The Rust implementation
### Prepare Pattern
The Rust implementation is a little more modern, and uses a bit more memory of
course, but it's the same algorithm. The pattern preparer, which extracts the
"non-glob" portion of the pattern to make base comparisons faster, returns a
`Vec` rather than re-using a global array of 128 bytes. I also identified
three places in the original code that performed the same action: "from a
starting point, scan backwards until this function is satisfied or the array is
exhausted." I've abstracted that out into a local function that takes the
predicate as a closure.
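A rough sketch of that shape, with made-up names (`scan_back`, `is_glob`, `prepare_pattern`) and a deliberately simplified idea of what counts as a glob character. It assumes the preparer keeps the last run of non-glob bytes; the crate's own preparer may treat bracket expressions and empty results differently.

```rust
/// Walk backwards from `start` (inclusive) and return the first index whose
/// byte satisfies `pred`, or `None` once the front of the slice is passed.
fn scan_back(bytes: &[u8], start: usize, pred: impl Fn(u8) -> bool) -> Option<usize> {
    let last = bytes.len().checked_sub(1)?; // empty slice: nothing to scan
    (0..=start.min(last)).rev().find(|&i| pred(bytes[i]))
}

/// Assumption: only these four bytes are treated as glob characters.
fn is_glob(b: u8) -> bool {
    matches!(b, b'*' | b'?' | b'[' | b']')
}

/// Extract the last run of non-glob bytes from `pattern`.
fn prepare_pattern(pattern: &[u8]) -> Vec<u8> {
    // Skip trailing glob characters to find where the literal run ends.
    let end = match scan_back(pattern, pattern.len().saturating_sub(1), |b| !is_glob(b)) {
        Some(i) => i,
        None => return Vec::new(), // nothing but glob characters
    };
    // The run starts just after the nearest glob character before `end`.
    let start = scan_back(pattern, end, is_glob).map_or(0, |i| i + 1);
    pattern[start..=end].to_vec()
}
```

Here `prepare_pattern(b"/usr/*/bin*")` yields `/bin`, and both scans go through the same `scan_back` helper with a different closure.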
### Squozen
The implementation of Squozen itself is broken up into three phases. When the
Squozen struct is instantiated with the path to the database, the bigram table
is read in, but only the bigram table and the path are stored.
The implementation of the Squozen database has two methods: `paths()` and
`matches(pattern)`. `paths()` opens the database and returns an iterator,
which begins reading it, uncompressing it according to the algorithm described
above, and returning a read-only reference to an internal byte slice where the
uncompressed results are stored.
`matches(pattern)` performs the pattern preparation and then wraps
`paths()`, returning only those strings that match the pattern, again using
the read-only reference to the internal array presented by `paths()`.
Each invocation of `paths()` (or the `matches(pattern)` wrapper) creates a
new instance of the file reader, which is buffered, as well as the return slice.
Multiple threads can be reading different instances at the same time, although I
imagine some filesystem thrashing is likely if you have more than two or three.
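To illustrate the "borrowed slice, one reader per call" design, here is a hypothetical stand-in, not the crate's real API: `PathReader` and `next_path` are invented names, the reader is generic over any `Read` source rather than tied to a file, and the 16-bit count is again assumed to be big-endian.

```rust
use std::io::{BufReader, Read};

const RS: u8 = 30; // ASCII Record Separator

/// Hypothetical reader: owns its buffered source and its result buffer, so
/// several readers can run side by side without sharing any state.
struct PathReader<R: Read> {
    bigrams: [u8; 256],
    input: BufReader<R>,
    buf: Vec<u8>,
    pending: Option<u8>, // terminator kept from the previous read
}

impl<R: Read> PathReader<R> {
    fn new(bigrams: [u8; 256], source: R) -> Self {
        PathReader { bigrams, input: BufReader::new(source), buf: Vec::new(), pending: None }
    }

    /// Read one byte, or `None` at end of input (I/O errors also end the read).
    fn read_byte(&mut self) -> Option<u8> {
        let mut b = [0u8; 1];
        self.input.read_exact(&mut b).ok().map(|_| b[0])
    }

    /// Decode the next path and lend out the internal buffer. The slice is
    /// only valid until the next call, which overwrites it in place.
    fn next_path(&mut self) -> Option<&[u8]> {
        let lead = match self.pending.take() {
            Some(b) => b,
            None => self.read_byte()?,
        };
        let keep = if lead == RS {
            let hi = self.read_byte()?;
            let lo = self.read_byte()?;
            u16::from_be_bytes([hi, lo]) as usize
        } else {
            lead as usize
        };
        self.buf.truncate(keep);
        while let Some(b) = self.read_byte() {
            if b <= RS {
                self.pending = Some(b); // leading byte of the next path
                break;
            }
            if b & 0x80 != 0 {
                let i = 2 * (b & 0x7f) as usize;
                self.buf.extend_from_slice(&self.bigrams[i..i + 2]);
            } else {
                self.buf.push(b);
            }
        }
        Some(self.buf.as_slice())
    }
}
```

Because the returned slice borrows from the reader, the compiler rejects any attempt to hold on to a path across the next call. A caller that wants to keep a result has to copy it, which is the cost of not allocating on every read.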
The scanner is pretty fast!