Added documentation to the Squozen README describing the decompressor algorithm.
parent e624aea369, commit c7ba092293
# Squozen

This crate contains a library for *reading* the Squozen database format, the
original format used to store the database for the Unix `locate` command.

## The format

It's important to remember that the Squozen format was formalized in 1983. At
the time, the Unix filesystem handled only the characters between 0x00 and 0x7f
(0-127), filesystems were much smaller, and the conventions of short paths such
as `/usr` and `/etc` and of eight-character-dot-three-character filenames were
in force. As such, the use of bigrams, the use of the topmost bit as a sentinel,
and the expectation that each path would deviate from its prior by fewer than 14
characters were all sensible.

The Squozen format consists of a 256-byte block that holds the 128 most common
bigrams (two-letter sequences) that appear in the database, followed by a
stream that has the following characteristics:

A leading byte. If the byte is the RS symbol (Record Separator, ASCII 30), the
next two bytes represent a 16-bit number; if not, the byte itself is treated as
an 8-bit integer. This integer represents the number of characters from the
preceding read (and must be zero if this is the first read!) that can be re-used
in the current iteration. The starting pointer for inserts is moved to that
position.

As the rest of the byte stream is read, if the uppermost bit of a byte is set,
the remaining 7 bits are treated as a lookup into the bigram table, the two
bytes of which are inserted into the result buffer; otherwise, the character
encountered is inserted into the result buffer as-is. This read terminates when
a character equal to or less than the RS symbol is encountered. That character
is preserved for the next read, where it becomes the leading byte.
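The read described above can be sketched in Rust as follows. This is a minimal
illustration of the prose, not the crate's actual code: the names `RS` and
`decode_record` are made up for the example, the whole stream is assumed to be
in memory already, and the byte order of the 16-bit count (big-endian here) is
an assumption, since the description does not specify it.

```rust
/// ASCII Record Separator; as a leading byte it announces a 16-bit count.
const RS: u8 = 30;

/// Decode one read starting at `pos`. On entry `path` still holds the
/// previous result; on exit it holds the new one. Returns the position of
/// the next read's leading byte, or `None` if the stream is exhausted or
/// truncated.
fn decode_record(
    bigrams: &[u8; 256],
    stream: &[u8],
    pos: usize,
    path: &mut Vec<u8>,
) -> Option<usize> {
    let mut pos = pos;
    let lead = *stream.get(pos)?;
    pos += 1;

    // Number of bytes of the previous result to keep.
    let reuse = if lead == RS {
        // ASSUMPTION: the 16-bit count is stored big-endian.
        let hi = *stream.get(pos)? as usize;
        let lo = *stream.get(pos + 1)? as usize;
        pos += 2;
        (hi << 8) | lo
    } else {
        lead as usize
    };

    // Move the insert point: everything after the shared prefix is dropped.
    path.truncate(reuse);

    while let Some(&b) = stream.get(pos) {
        if b <= RS {
            // Terminator: leave it in place as the next read's leading byte.
            break;
        }
        if b & 0x80 != 0 {
            // High bit set: the low 7 bits select a two-byte bigram.
            let i = ((b & 0x7f) as usize) * 2;
            path.extend_from_slice(&bigrams[i..i + 2]);
        } else {
            path.push(b);
        }
        pos += 1;
    }
    Some(pos)
}
```

Calling `decode_record` repeatedly, feeding each returned position back in as
`pos` and keeping the same `path` buffer, walks the whole stream; the buffer
holds one complete path after each call.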
## Analysis

Before reading the database, a substring of the pattern requested for matching
is extracted. This substring contains no wildcards or other glob special
characters.

When a result buffer is produced, the analysis starts from the *end* of the
result buffer, comparing it to the prepared pattern. If it finds the last
character of the prepared pattern, it starts a backward comparison, stopping
only when the comparison fails or we reach the beginning of the prepared
pattern.

If the comparison fails, the scan resumes until we run out of string to
compare. If it succeeds, we compare the whole pattern to the string using
`fnmatch()`, and print the result if that comparison succeeds.

If the comparison fails, we also record that it did, and establish that
analyzing any part of the path *prior* to that marker is unnecessary, since the
comparison failed to find anything within it.

Analysis continues until the stream is exhausted.
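As a rough illustration of the backward scan, here is a minimal Rust sketch.
The function name `literal_found_backward` and the `stop` parameter are
illustrative and not taken from the crate; `stop` is one reading of the
"marker" described above, namely an index at or below which earlier analysis
already found nothing. The final `fnmatch()` check on the whole pattern is left
out.

```rust
/// Scan `path` from its end for `literal` (the prepared, glob-free pattern).
/// Candidates that end at or before `stop` are skipped, on the assumption
/// that this part of the path was already analyzed without success.
fn literal_found_backward(path: &[u8], literal: &[u8], stop: usize) -> bool {
    let n = literal.len();
    if n == 0 {
        // An empty literal cannot rule anything out.
        return true;
    }
    let last = literal[n - 1];

    // `end` is the exclusive end index of the current candidate.
    let mut end = path.len();
    while end > stop && end >= n {
        // Cheap test first: is the literal's last byte at this position?
        if path[end - 1] == last {
            // Compare backward until a mismatch or the literal's beginning.
            let mut k = 1;
            while k < n && path[end - 1 - k] == literal[n - 1 - k] {
                k += 1;
            }
            if k == n {
                return true;
            }
        }
        end -= 1;
    }
    false
}
```

With the path `/usr/bin/locate` and the prepared pattern `bin`, the scan
succeeds when `stop` is 0 and fails when `stop` is 8, because the only
occurrence of `bin` lies entirely inside the region already ruled out.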
## The C implementation

The C implementation wraps this entirely in one function, re-using string
buffers and pointers with abandon. The C compilers of 1983 had no notion of
variable shadowing or scoping; they just did what you told them to without ever
questioning your decisions. The function is only 39 lines long and does all the
work in that space. It also never uses more than 2KB of memory, and 130 bytes of
that are the copyright notice!

## The Rust implementation

### Prepare Pattern

The Rust implementation is a little more modern, and uses a bit more memory of
course, but it's the same algorithm. The pattern preparer, which extracts the
"non-glob" portion of the pattern to make base comparisons faster, returns a
`Vec` rather than re-using a global array of 128 bytes. I also identified three
places in the original code that performed the same action: "from a starting
point, scan backwards until this function is satisfied or the array is
exhausted." I've abstracted that out into a local function that takes the
predicate as a closure.
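A minimal sketch of what such a preparer could look like follows. It is not the
crate's code: the names `GLOB_CHARS`, `is_glob`, `scan_back_until`, and
`prepare_pattern` are hypothetical, the set of glob characters is assumed to be
`*?[]`, backslash escapes are ignored, and the sketch assumes the literal run
nearest the end of the pattern is the one to extract.

```rust
/// Bytes treated as glob metacharacters in this sketch (an assumption).
const GLOB_CHARS: &[u8] = b"*?[]";

fn is_glob(b: u8) -> bool {
    GLOB_CHARS.contains(&b)
}

/// From `start`, step backward until `done` is satisfied by the byte just
/// before the cursor, or until the slice is exhausted. Returns the cursor.
fn scan_back_until(bytes: &[u8], start: usize, done: impl Fn(u8) -> bool) -> usize {
    let mut i = start;
    while i > 0 && !done(bytes[i - 1]) {
        i -= 1;
    }
    i
}

/// Extract the last glob-free run of `pattern` as an owned `Vec<u8>`.
fn prepare_pattern(pattern: &[u8]) -> Vec<u8> {
    // Skip glob characters at the tail of the pattern.
    let end = scan_back_until(pattern, pattern.len(), |b| !is_glob(b));
    // Then walk back over literal bytes until the previous glob character.
    let start = scan_back_until(pattern, end, is_glob);
    pattern[start..end].to_vec()
}
```

For the pattern `*/bin/loc*` this yields `/bin/loc`, and a pattern made only of
glob characters yields an empty `Vec`. Passing the stop condition as a closure
lets the same helper serve both steps.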
### Squozen

The implementation of Squozen itself is broken up into three phases. When the
Squozen struct is instantiated with the path to the database, the bigram table
is read in, but only the bigram table and the path are stored.

The implementation of the Squozen database has two methods: `paths()` and
`matches(pattern)`. `paths()` opens the database and returns an iterator, which
reads the database, uncompresses it according to the algorithm described above,
and returns a read-only reference to an internal byte slice where the
uncompressed results are stored.

`matches(pattern)` performs the pattern preparation and then wraps `paths()`,
returning only those strings that match the pattern, again using the read-only
reference to the internal array presented by `paths()`.

Each invocation of `paths()` (or the `matches(pattern)` wrapper) creates a new
instance of the file reader, which is buffered, as well as the return slice.
Multiple threads can be reading different instances at the same time, although I
imagine some filesystem thrashing is likely if you have more than two or three.
The scanner is pretty fast!