Compare commits
No commits in common. "73bca38bb9d45af967ec38f44f980046ce768727" and "e624aea36932f322179ddc90b8f5f73450aad093" have entirely different histories.
TODO.md
@@ -1,16 +0,0 @@
# TODOs:

- Implement the Squozen decompression algorithm
- Implement the Squozen compression algorithm
- Implement the Merging (mlocate) decompression algorithm
- Implement the Merging (mlocate) compression algorithm
- Implement the Posting (plocate) decompression algorithm
- Implement the Posting (plocate) compression algorithm
- Implement the root binary, providing a universal API for all of these
  functions where possible. Make including/excluding them a configuration in
  Cargo.toml.
- Implement the updater function, stealing from the fast-find (fd & rg) walkers
  as needed.
- Implement a real-time server.
- Provide a C API to the libraries.

@@ -1,98 +0,0 @@
# Squozen

This crate contains a library for *reading* the Squozen database format, the
original format used to store the database for the Unix `locate` command.

## The format

It's important to remember that the Squozen format was formalized in 1983. At
the time, the Unix filesystem handled only the characters between 0x00 and 0x7f
(0-127), filesystems were much smaller, and the conventions of short paths such
as `/usr` and `/etc` and eight-character-dot-three-character filenames were in
force. As such, the use of bigrams, the use of the topmost bit as a sentinel,
and the assumption that each path would deviate from its predecessor by fewer
than 14 characters were sensible.

The Squozen format consists of a 256-byte block that holds the 128 most common
bigrams (two-character sequences) that appear in the database, followed by a
stream that has the following characteristics:

Each read begins with a leading byte. If that byte is the RS symbol (Record
Separator, ASCII 30), the next two bytes represent a 16-bit number; if not, the
byte itself is treated as an 8-bit integer. This integer is the number of
characters from the preceding read that can be re-used in the current one (it
must be zero for the first read!). The starting pointer for inserts is moved to
that position.

The rest of the read is processed byte by byte. If the uppermost bit is set,
the remaining 7 bits are treated as an index into the bigram table, and the two
bytes of that entry are appended to the result buffer; otherwise the byte
itself is appended. The read terminates when a byte less than or equal to the
RS symbol is encountered. That byte is preserved, because it is the leading
byte of the next read.

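As an illustration, the decoding loop described above can be sketched in Rust
roughly as follows. This is a minimal sketch that follows the prose literally,
not the crate's actual code: the function name `decode_paths`, the in-memory
`stream` slice, and the big-endian byte order of the 16-bit count are all
assumptions made for the example.

```rust
/// ASCII Record Separator. Per the description above, it flags a 16-bit count.
const RS: u8 = 30;

/// Decode every path from `stream`, the bytes that follow the bigram table.
/// Entry `i` of `bigrams` occupies bytes `2 * i` and `2 * i + 1`.
/// (Hypothetical helper; names and signature are illustrative only.)
pub fn decode_paths(bigrams: &[u8; 256], stream: &[u8]) -> Vec<Vec<u8>> {
    let mut paths = Vec::new();
    let mut buf: Vec<u8> = Vec::new();
    let mut pos = 0;

    while pos < stream.len() {
        // Leading byte: the count itself, or RS followed by a 16-bit count.
        let lead = stream[pos];
        pos += 1;
        let reuse = if lead == RS {
            if pos + 2 > stream.len() {
                break; // truncated stream: stop rather than guess
            }
            // ASSUMPTION: big-endian; the text above does not say.
            let n = u16::from_be_bytes([stream[pos], stream[pos + 1]]) as usize;
            pos += 2;
            n
        } else {
            lead as usize
        };

        // Keep `reuse` bytes of the previous path and append after them.
        buf.truncate(reuse);

        // Bytes greater than RS belong to this path; anything else ends it
        // and is left in place as the next leading byte.
        while pos < stream.len() && stream[pos] > RS {
            let b = stream[pos];
            pos += 1;
            if b & 0x80 != 0 {
                // Top bit set: the low 7 bits index the bigram table.
                let i = (b & 0x7f) as usize * 2;
                buf.extend_from_slice(&bigrams[i..i + 2]);
            } else {
                buf.push(b);
            }
        }
        paths.push(buf.clone());
    }
    paths
}
```

Cloning every path into a `Vec<Vec<u8>>` is only for clarity here; a streaming
reader would hand out a reference to `buf` instead.
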
## Analysis

Before reading the database, a substring of the requested pattern is extracted.
This substring contains no wildcards or other glob special characters.

When a result buffer is produced, the analysis starts from the *end* of the
buffer and compares it to the prepared pattern. When it finds the last
character of the prepared pattern, it starts comparing backward, stopping only
when the comparison fails or it reaches the beginning of the prepared pattern.

If the comparison fails, the scan resumes from the next position and continues
until we run out of string to compare. If it succeeds, we compare the whole
pattern to the string using `fnmatch()`, and print the result if that
comparison succeeds.

If no match is found in a path, we record that as well. The scan of the next
path can then stop once it reaches the portion re-used from this one, since the
comparison already failed to find anything within it.

Analysis continues until the stream is exhausted.

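The backward scan can be sketched like this. It is an illustrative sketch
rather than the crate's code: `matches_from_end` and its `cutoff` parameter
(the lowest index at which a match is allowed to end) are names invented for
the example, and the follow-up `fnmatch()` check on the full pattern is left
out.

```rust
/// Returns true if `needle` occurs in `path` with its last byte at an index
/// greater than or equal to `cutoff`. An empty needle matches everything.
/// (Hypothetical helper; not part of any published API.)
pub fn matches_from_end(path: &[u8], needle: &[u8], cutoff: usize) -> bool {
    let Some((&last, head)) = needle.split_last() else {
        return true;
    };

    // Walk candidate end positions from the end of the path back to the cutoff.
    for end in (cutoff..path.len()).rev() {
        if path[end] != last {
            continue;
        }
        // Too close to the start to hold the rest of the needle, and every
        // remaining candidate is closer still.
        if end < head.len() {
            break;
        }
        // Compare the remaining bytes backward, as the text describes.
        let start = end - head.len();
        if path[start..end].iter().rev().eq(head.iter().rev()) {
            return true;
        }
    }
    false
}
```

Passing `0` as `cutoff` searches the whole path. Passing the re-used prefix
length skips end positions that were already covered while scanning the
previous path.
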
## The C implementation

The C implementation wraps this entirely in one function, re-using string
buffers and pointers with abandon. The C compilers of 1983 had no notion of
variable shadowing or scoping; they just did what you told them to without ever
questioning your decisions. The function is only 39 lines long and does all the
work in that space. It also never uses more than 2KB of memory, and 130 bytes
of that are the copyright notice!

## The Rust implementation

### Prepare Pattern

The Rust implementation is a little more modern, and uses a bit more memory of
course, but it's the same algorithm. The pattern preparer, which extracts the
"non-glob" portion of the pattern to make base comparisons faster, returns a
`Vec` rather than re-using a global array of 128 bytes. I also identified
three places in the original code that performed the same action: "from a
starting point, scan backwards until this function is satisfied or the array is
exhausted." I've abstracted that out into a local function that takes the
predicate as a closure.

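A rough sketch of that idea follows. The names `scan_back`, `is_wildcard`, and
`prepare_pattern` are hypothetical, as is the choice to use the last literal
run of the pattern. The sketch is also deliberately conservative: for patterns
containing bracket expressions or backslash escapes it returns an empty `Vec`
(no pre-filter at all) instead of trying to parse them, which is an assumption
of this example and not a statement about the real implementation.

```rust
/// Scan backward from `start` (clamped to the last index) and return the
/// first index whose byte satisfies `pred`, or `None` if the array is
/// exhausted first.
pub fn scan_back(bytes: &[u8], start: usize, pred: impl Fn(u8) -> bool) -> Option<usize> {
    if bytes.is_empty() {
        return None;
    }
    let start = start.min(bytes.len() - 1);
    (0..=start).rev().find(|&i| pred(bytes[i]))
}

fn is_wildcard(b: u8) -> bool {
    b == b'*' || b == b'?'
}

/// Extract the last run of literal bytes from a glob pattern.
pub fn prepare_pattern(pattern: &[u8]) -> Vec<u8> {
    // ASSUMPTION: bail out on `[` and `\` so the result is always a
    // substring that every matching path must contain.
    if pattern.is_empty() || pattern.iter().any(|&b| b == b'[' || b == b'\\') {
        return Vec::new();
    }
    // Use 1: find the last literal byte.
    let Some(end) = scan_back(pattern, pattern.len() - 1, |b| !is_wildcard(b)) else {
        return Vec::new();
    };
    // Use 2: find the wildcard that precedes the literal run, if any.
    let start = scan_back(pattern, end, is_wildcard).map_or(0, |i| i + 1);
    pattern[start..=end].to_vec()
}
```

For example, under these assumptions `prepare_pattern(b"/usr/*/bin")` yields
`/bin`, and a pattern made only of wildcards yields an empty `Vec`.
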
### Squozen

The implementation of Squozen itself is broken up into three phases. When the
Squozen struct is instantiated with the path to the database, the bigram table
is read in, but only the bigram table and the path are stored.

The Squozen struct has two methods: `paths()` and `matches(pattern)`. `paths()`
opens the database and returns an iterator, which reads the database,
decompresses it according to the algorithm described above, and returns a
read-only reference to an internal byte slice where the decompressed results
are stored.

`matches(pattern)` performs the pattern preparation and then wraps `paths()`,
returning only those strings that match the pattern, again using the read-only
reference to the internal array presented by `paths()`.

Each invocation of `paths()` (or the `matches(pattern)` wrapper) creates a new
instance of the file reader, which is buffered, as well as a new return slice.
Multiple threads can read different instances at the same time, although I
imagine some filesystem thrashing is likely if you have more than two or three.
The scanner is pretty fast!

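To make the "read-only reference to an internal byte slice" idea concrete,
here is one way such a reader could look. Everything in it is an assumption
made for illustration: the type name `PathReader`, the `next_path` method, the
generic `R: Read` source (used here so the example does not need a file on
disk), and the big-endian 16-bit count. Because each returned slice borrows
from the reader, this sketch exposes a method instead of implementing
`std::iter::Iterator`.

```rust
use std::io::{self, BufReader, Read};

const RS: u8 = 30;

/// Hypothetical streaming reader; one instance per scan of the database.
pub struct PathReader<R: Read> {
    bigrams: [u8; 256],
    input: io::Bytes<BufReader<R>>,
    buf: Vec<u8>,
    /// Terminator of the previous read, i.e. the next leading byte.
    pending: Option<u8>,
}

impl<R: Read> PathReader<R> {
    /// Reads the 256-byte bigram table and keeps the rest as a byte stream.
    pub fn new(source: R) -> io::Result<Self> {
        let mut reader = BufReader::new(source);
        let mut bigrams = [0u8; 256];
        reader.read_exact(&mut bigrams)?;
        Ok(Self { bigrams, input: reader.bytes(), buf: Vec::new(), pending: None })
    }

    fn next_byte(&mut self) -> io::Result<Option<u8>> {
        self.input.next().transpose()
    }

    /// Decodes the next path into the internal buffer and lends it out.
    /// Returns `Ok(None)` once the stream is exhausted.
    pub fn next_path(&mut self) -> io::Result<Option<&[u8]>> {
        let lead = match self.pending.take() {
            Some(b) => b,
            None => match self.next_byte()? {
                Some(b) => b,
                None => return Ok(None),
            },
        };
        let reuse = if lead == RS {
            // ASSUMPTION: big-endian 16-bit count.
            match (self.next_byte()?, self.next_byte()?) {
                (Some(hi), Some(lo)) => u16::from_be_bytes([hi, lo]) as usize,
                _ => return Ok(None), // truncated stream
            }
        } else {
            lead as usize
        };
        self.buf.truncate(reuse);
        while let Some(b) = self.next_byte()? {
            if b <= RS {
                self.pending = Some(b); // keep it for the next call
                break;
            }
            if b & 0x80 != 0 {
                let i = (b & 0x7f) as usize * 2;
                self.buf.extend_from_slice(&self.bigrams[i..i + 2]);
            } else {
                self.buf.push(b);
            }
        }
        Ok(Some(self.buf.as_slice()))
    }
}
```

With a real database, `PathReader::new(std::fs::File::open(path)?)` would give
each caller (or thread) its own buffered reader and its own result buffer,
which is the property the paragraphs above describe.
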