Added documentation to the Squozen README describing the decompressor algorithm.
# Squozen

This crate contains a library for *reading* the Squozen database format, the
original format used to store the database for the Unix `locate` command.

## The format

It's important to remember that the Squozen format was formalized in 1983. At
the time, the Unix filesystem handled only the characters between 0x00 and 0x7f
(0-127), filesystems were much smaller, and the conventions of short paths such
as `/usr` and `/etc` and eight-character-dot-three-character filenames were in
force. As such, the use of bigrams, the topmost bit as a sentinel, and the
assumption that each path would deviate from its prior by less than 14
characters were sensible.

The Squozen format consists of a 256-byte block that holds the 128 most
common bigrams (two-letter sequences) that appear in the database, followed by a
stream of records, each of which has the following characteristics:

A leading byte. If the byte is the RS symbol (Record Separator, ASCII 30), the
next two bytes represent a 16-bit number; if not, the byte itself is treated
as an 8-bit integer. This integer represents the number of characters from the
preceding read that can be re-used in the current iteration (it must be zero
if this is the first read!). The starting pointer for inserts is moved to that
position.
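
As a minimal sketch of the leading-byte rule above: the function name
`read_reuse_count` is hypothetical, and the byte order of the 16-bit value is
an assumption (big-endian here), since this README does not state it. The
count is used exactly as described above, with no further adjustment.

```rust
use std::io::{self, Read};

/// ASCII Record Separator, the marker for a 16-bit count.
const RS: u8 = 30;

/// Given the already-consumed leading byte, returns how many bytes of the
/// previous path are re-used. Hypothetical helper, not the crate's API.
fn read_reuse_count<R: Read>(lead: u8, input: &mut R) -> io::Result<usize> {
    if lead == RS {
        // The count did not fit in one byte, so two more bytes follow.
        let mut wide = [0u8; 2];
        input.read_exact(&mut wide)?;
        // Assumption: big-endian. Swap in `from_le_bytes` if the database
        // was written the other way around.
        Ok(usize::from(u16::from_be_bytes(wide)))
    } else {
        // Any other leading byte is the count itself.
        Ok(usize::from(lead))
    }
}
```

The caller would then truncate its result buffer to the returned length before
appending the rest of the record.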

As the rest of the byte stream is read, each byte is examined. If its uppermost
bit is set, the remaining 7 bits are treated as a lookup into the bigram table,
and the two bytes of that entry are inserted into the result buffer; otherwise
the character encountered is inserted into the result buffer. This read
terminates when a character equal to or less than the RS symbol is encountered.
That character is preserved as the leading byte of the next read.
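
The expansion step can be sketched as follows. The name `expand_record` and
its signature are hypothetical; the sketch assumes the bigram table is already
in memory as 256 bytes and that the re-use count has already been decoded.

```rust
/// ASCII Record Separator; any byte at or below it ends the current record.
const RS: u8 = 30;

/// Expands one record into `path`, keeping its first `reuse` bytes.
/// Returns the byte that ended the record (the leading byte of the next
/// record), or `None` if the input ran out. Hypothetical helper.
fn expand_record(
    bigrams: &[u8; 256],
    input: &mut impl Iterator<Item = u8>,
    reuse: usize,
    path: &mut Vec<u8>,
) -> Option<u8> {
    // Keep the shared prefix of the previous path, drop the rest.
    path.truncate(reuse);
    for byte in input {
        if byte <= RS {
            // Hand the terminator back so the caller can use it as the
            // leading byte of the next record.
            return Some(byte);
        }
        if byte & 0x80 != 0 {
            // High bit set: the low 7 bits index a two-byte bigram entry.
            let index = usize::from(byte & 0x7f) * 2;
            path.extend_from_slice(&bigrams[index..index + 2]);
        } else {
            // Plain 7-bit character.
            path.push(byte);
        }
    }
    None
}
```

With a table whose first entry is `us`, a previous path of `/etc`, a re-use
count of 1 and the bytes `0x80 'r'`, the buffer ends up holding `/usr`.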

## Analysis

Before reading the database, a substring of the requested pattern is extracted
for matching. This substring contains no wildcards or other glob special
characters.
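
One way to extract such a substring is sketched below. It is not the crate's
preparer: the name `literal_run` is hypothetical, picking the *last* literal
run is an assumption, only `*`, `?` and `[...]` are treated as special, and
backslash escapes and a `]` placed first inside a class are not handled.

```rust
/// Returns the last run of bytes in `pattern` that contains no glob
/// metacharacters, or an empty slice if there is none.
/// Hypothetical helper illustrating the idea only.
fn literal_run(pattern: &[u8]) -> &[u8] {
    let mut best: &[u8] = &[];
    let mut start = 0; // start of the literal run currently being scanned
    let mut in_class = false; // true while inside a `[...]` class
    for (i, &b) in pattern.iter().enumerate() {
        // Everything inside a class is special, as are `*`, `?` and `[`.
        let special = in_class || matches!(b, b'*' | b'?' | b'[');
        if special {
            if i > start {
                best = &pattern[start..i];
            }
            start = i + 1;
        }
        match b {
            b'[' if !in_class => in_class = true,
            b']' if in_class => in_class = false,
            _ => {}
        }
    }
    // A literal run may extend to the end of the pattern.
    if pattern.len() > start {
        best = &pattern[start..];
    }
    best
}
```

For `*.txt` this yields `.txt`, and for `foo[bc]` it yields `foo`; the bytes
inside the class are never mistaken for literal text.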

When a result buffer is produced, the analysis starts from the *end* of the
result buffer, comparing it to the prepared pattern. If it finds the last
character of the prepared pattern, it starts a backward comparison, stopping
only when the comparison either fails or reaches the beginning of the prepared
pattern.

If the comparison fails, the analysis resumes until we run out of string to
compare. If it succeeds, we compare the whole pattern to the string using
`fnmatch()`, and print the result if that comparison succeeds.

If the comparison fails, we also record that it did, marking how far the
analysis got. Analyzing any part of the path *prior* to that marker is
unnecessary, since the comparison already failed to find anything within it.
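
The backward scan can be sketched like this. The name `find_backward` and the
`cutoff` parameter are hypothetical; `cutoff` is my reading of the marker
described above (the lowest index at which a match may still end), and an
empty prepared pattern is treated as "no match" purely for simplicity.

```rust
/// Scans `path` from its end toward `cutoff`, looking for `literal`.
/// Returns the index where the match starts, or `None`.
/// Hypothetical helper, not the crate's implementation.
fn find_backward(path: &[u8], literal: &[u8], cutoff: usize) -> Option<usize> {
    // Split off the last byte of the prepared pattern: it is the cheap
    // first check at every candidate position.
    let (&last, head) = literal.split_last()?;
    for end in (cutoff..path.len()).rev() {
        if path[end] != last || end < head.len() {
            continue;
        }
        let start = end - head.len();
        // Compare the remaining bytes right to left, as the text describes.
        if head.iter().rev().eq(path[start..end].iter().rev()) {
            return Some(start);
        }
    }
    None
}
```

A caller would run the full `fnmatch()`-style comparison only when this
returns `Some`, and would pass a non-zero `cutoff` when an earlier scan has
already ruled out the re-used prefix.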

Analysis continues until the stream is exhausted.

## The C implementation

The C implementation wraps this entirely in one function, re-using string
buffers and pointers with abandon. The C compilers of 1983 had no notion of
variable shadowing or scoping; they just did what you told them to without ever
questioning your decisions. The function is only 39 lines long and does all the
work in that space. It also never uses more than 2KB of memory, and 130 bytes
of that are the copyright notice!

## The Rust implementation

### Prepare Pattern

The Rust implementation is a little more modern, and uses a bit more memory of
course, but it's the same algorithm. The pattern preparer, which extracts the
"non-glob" portion of the pattern to make base comparisons faster, returns a
`Vec` rather than re-using a global array of 128 bytes. I also identified
three places in the original code that performed the same action: "from a
starting point, scan backwards until this function is satisfied or the array is
exhausted." I've abstracted that out into a local function that takes the
predicate as a closure.
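
A helper of that shape might look like the sketch below. The name `scan_back`
and its exact signature are assumptions made for illustration; the crate's own
local function is not reproduced here.

```rust
/// Starting at `start` (clamped to the last index), walks toward index 0 and
/// returns the first index whose byte satisfies the predicate, or `None` if
/// the array is exhausted first. Hypothetical helper.
fn scan_back<F>(bytes: &[u8], start: usize, mut satisfied: F) -> Option<usize>
where
    F: FnMut(u8) -> bool,
{
    // `checked_sub` returns `None` for an empty slice, ending the scan early.
    let mut i = start.min(bytes.len().checked_sub(1)?);
    loop {
        if satisfied(bytes[i]) {
            return Some(i);
        }
        // Stepping below index 0 means the array is exhausted.
        i = i.checked_sub(1)?;
    }
}
```

Passing the predicate as a closure lets one loop serve different stopping
conditions, for example `scan_back(pattern, end, |b| b == b'.')`.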

### Squozen

The implementation of Squozen itself is broken up into three phases. When the
Squozen struct is instantiated with the path to the database, the bigram table
is read in, but only the bigram table and the path are stored.

The Squozen database has two methods: `paths()` and `matches(pattern)`.
`paths()` opens the database and returns an iterator, which reads it,
uncompresses it according to the algorithm described above, and returns a
read-only reference to an internal byte slice where the uncompressed results
are stored.
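
To illustrate why handing out a reference to an internal buffer works well
here, the sketch below decodes from an in-memory database and re-uses one
`Vec` for every path. `PathReader` and `next_path` are hypothetical names, not
the crate's API; the sketch uses a plain method instead of an iterator, reads
from a byte slice instead of a buffered file, and assumes a big-endian 16-bit
count.

```rust
const RS: u8 = 30;

/// Hypothetical in-memory reader; the crate's real types are not shown here.
struct PathReader<'a> {
    bigrams: &'a [u8], // 256 bytes: 128 two-byte entries
    stream: &'a [u8],  // compressed records that follow the table
    pos: usize,        // next unread byte in `stream`
    path: Vec<u8>,     // result buffer, re-used between calls
}

impl<'a> PathReader<'a> {
    fn new(database: &'a [u8]) -> Option<Self> {
        if database.len() < 256 {
            return None;
        }
        let (bigrams, stream) = database.split_at(256);
        Some(Self { bigrams, stream, pos: 0, path: Vec::new() })
    }

    /// Decodes the next path and lends a read-only view of the buffer.
    fn next_path(&mut self) -> Option<&[u8]> {
        let lead = *self.stream.get(self.pos)?;
        self.pos += 1;
        let reuse = if lead == RS {
            let wide = self.stream.get(self.pos..self.pos + 2)?;
            // Assumption: big-endian 16-bit count.
            let count = u16::from_be_bytes([wide[0], wide[1]]);
            self.pos += 2;
            usize::from(count)
        } else {
            usize::from(lead)
        };
        self.path.truncate(reuse);
        while let Some(&byte) = self.stream.get(self.pos) {
            if byte <= RS {
                break; // left in place: it leads the next record
            }
            self.pos += 1;
            if byte & 0x80 != 0 {
                let i = usize::from(byte & 0x7f) * 2;
                self.path.extend_from_slice(&self.bigrams[i..i + 2]);
            } else {
                self.path.push(byte);
            }
        }
        Some(self.path.as_slice())
    }
}
```

Because each path shares a prefix with the one before it, lending the buffer
avoids allocating a new string per entry; a caller that wants to keep a path
copies it out before asking for the next one.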

`matches(pattern)` performs the pattern preparation and then wraps
`paths()`, returning only those strings that match the pattern, again using
the read-only reference to the internal array presented by `paths()`.

Each invocation of `paths()` (or the `matches(pattern)` wrapper) creates a
new instance of the file reader, which is buffered, as well as the return slice.
Multiple threads can be reading different instances at the same time, although I
imagine some filesystem thrashing is likely if you have more than two or three.
The scanner is pretty fast!