Some documentation.
parent e624aea369
commit 9f0158be7e
@@ -0,0 +1,16 @@
# TODOs:

- Implement the Squozen decompression algorithm
- Implement the Squozen compression algorithm
- Implement the Merging (mlocate) decompression algorithm
- Implement the Merging (mlocate) compression algorithm
- Implement the Posting (plocate) decompression algorithm
- Implement the Posting (plocate) compression algorithm
- Implement the root binary, providing a universal API for all of these
  functions where possible. Make including/excluding them a configuration in
  Cargo.toml.
- Implement the updater function, stealing from the fast-find (fd & rg) walkers
  as needed.
- Implement a real-time server.
- Provide a C API to the libraries.

@@ -0,0 +1,98 @@
# Squozen

This crate contains a library for *reading* the Squozen database format, the
original format used to store the database for the Unix `locate` command.

## The format

It's important to remember that the Squozen format was formalized in 1983. At
the time, the Unix filesystem handled only the characters between 0x00 and 0x7f
(0-127), filesystems were much smaller, and the conventions of short paths such
as `/usr` and `/etc` and of eight-character-dot-three-character filenames were
in force. As such, the use of bigrams, the use of the topmost bit as a sentinel,
and the assumption that each path would deviate from its prior by less than 14
characters were all sensible.

The Squozen format consists of a 256-byte block holding the 128 most common
bigrams (two-letter sequences) that appear in the database, followed by a
stream that has the following characteristics:

Each entry starts with a leading byte. If the byte is the RS symbol (Record
Separator, ASCII 30), the next two bytes represent a 16-bit number; if not, the
byte itself is treated as an 8-bit integer. This integer represents the number
of characters from the preceding read (and must be zero if this is the first
read!) that can be re-used in the current iteration. The starting pointer for
inserts is moved to that position.

As the byte stream is read, if the uppermost bit of a byte is set, the remaining
7 bits are treated as a lookup into the bigram table, and the two bytes of that
bigram are inserted into the result buffer; otherwise the character encountered
is inserted into the result buffer as-is. The read terminates when a character
equal to or less than the RS symbol is encountered. That character is preserved
as the leading byte of the next read.

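
The decoding rules above can be sketched in a few lines of Rust. This is an
illustration of the description as written, not the crate's actual code: the
names `Bigrams`, `from_header` and `decode_all` are hypothetical, the bigram
pairs are assumed to be stored back to back in the header, and the 16-bit count
is assumed to be little-endian. It also takes the count at face value, as the
text does; a real decoder may need to adjust it.

```rust
/// Record Separator (ASCII 30): flags a 16-bit count and ends a run.
const RS: u8 = 30;

/// The 128 bigrams from the 256-byte header, as two-byte pairs.
pub struct Bigrams([[u8; 2]; 128]);

impl Bigrams {
    /// ASSUMPTION: pairs are stored back to back (first, second, first, ...).
    pub fn from_header(header: &[u8; 256]) -> Self {
        let mut table = [[0u8; 2]; 128];
        for (slot, pair) in table.iter_mut().zip(header.chunks_exact(2)) {
            *slot = [pair[0], pair[1]];
        }
        Bigrams(table)
    }
}

/// Decodes every path in `stream` (the bytes that follow the header).
pub fn decode_all(bigrams: &Bigrams, stream: &[u8]) -> Vec<Vec<u8>> {
    let mut paths = Vec::new();
    let mut path: Vec<u8> = Vec::new();
    let mut pos = 0;

    while pos < stream.len() {
        // Leading byte: either the count itself, or RS plus a 16-bit count.
        let lead = stream[pos];
        pos += 1;
        let reuse = if lead == RS {
            if pos + 2 > stream.len() {
                break; // truncated stream
            }
            // ASSUMPTION: little-endian byte order.
            let n = u16::from_le_bytes([stream[pos], stream[pos + 1]]);
            pos += 2;
            n as usize
        } else {
            lead as usize
        };

        // Keep `reuse` bytes of the previous path; inserts start there.
        path.truncate(reuse);

        // Read until a byte <= RS shows up. That byte is left in place,
        // because it is the leading byte of the next entry.
        while pos < stream.len() && stream[pos] > RS {
            let b = stream[pos];
            pos += 1;
            if b & 0x80 != 0 {
                // High bit set: the low 7 bits index the bigram table.
                path.extend_from_slice(&bigrams.0[(b & 0x7f) as usize]);
            } else {
                path.push(b);
            }
        }
        paths.push(path.clone());
    }
    paths
}
```

With bigram 0 set to `us` and bigram 1 set to `r/`, the stream
`0 '/' 0x80 0x81 'b' 'i' 'n' 5 'l' 'i' 'b'` decodes to `/usr/bin` followed by
`/usr/lib`. The sketch clones each path for simplicity; handing out a reference
to the shared buffer, as described later in this document, avoids that copy.
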
## Analysis

Before reading the database, a substring of the pattern requested for matching
is extracted. This substring contains no wildcards or other special glob
characters.

When a result buffer is produced, the analysis starts from the *end* of the
result buffer, comparing it to the prepared pattern. If it finds the last
character of the prepared pattern, it starts a backward comparison, stopping
only when the comparison fails or we reach the beginning of the prepared
pattern.

If the comparison fails, the scan resumes from where it left off until we run
out of string to compare. If it succeeds, we compare the whole pattern to the
string using `fnmatch()`, and print the result if that comparison succeeds.

If the comparison fails, we also record that it did. The prefix a path shares
with its predecessor then acts as a marker: analyzing any part of the path
*prior* to that marker is unnecessary, since the comparison already failed to
find anything within it.

Analysis continues until the stream is exhausted.

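
As a rough sketch of the backward scan described above: the function name
`contains_literal_backward` and the `cutoff` parameter are hypothetical, with
`cutoff` standing in for the marker below which scanning is skipped. The full
`fnmatch()`-style glob check that would follow a hit is deliberately left out.

```rust
/// Scans `path` from its end toward `cutoff`, looking for `literal`.
///
/// Returns true if `literal` occurs in `path` with its last byte at an
/// index >= `cutoff`. Positions before `cutoff` are never tried as an
/// end point, on the assumption that they were already rejected.
pub fn contains_literal_backward(path: &[u8], literal: &[u8], cutoff: usize) -> bool {
    // Split off the last byte of the literal; it is the quick first check.
    let (last, head) = match literal.split_last() {
        Some((last, head)) => (*last, head),
        None => return true, // an empty literal matches anywhere
    };

    for end in (cutoff..path.len()).rev() {
        if path[end] != last {
            continue;
        }
        if end < head.len() {
            continue; // not enough bytes to the left for the rest
        }
        let start = end - head.len();
        // Compare the remaining bytes right to left, like the original scan.
        if path[start..end].iter().rev().eq(head.iter().rev()) {
            return true;
        }
    }
    false
}
```

For example, searching `/usr/lib/x` for `lib` with a cutoff of 0 succeeds, but
with a cutoff of 8 it fails, because the only occurrence ends at index 7 and
lies entirely in the part that is skipped.
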
## The C implementation

The C implementation wraps this entirely in one function, re-using string
buffers and pointers with abandon. The C compilers of 1983 had no notion of
variable shadowing or scoping; they just did what you told them to without ever
questioning your decisions. The function is only 39 lines long and does all the
work in that space. It also never uses more than 2KB of memory, and 130 bytes of
that are the copyright notice!

## The Rust implementation

### Prepare Pattern

The Rust implementation is a little more modern, and of course uses a bit more
memory, but it's the same algorithm. The pattern preparer, which extracts the
"non-glob" portion of the pattern to make base comparisons faster, returns a
`Vec` rather than re-using a global array of 128 bytes. I also identified
three places in the original code that performed the same action: "from a
starting point, scan backwards until this function is satisfied or the array is
exhausted." I've abstracted that out into a local function that takes the
predicate as a closure.

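
To make the closure idea concrete, here is one way such a preparer could look.
It is a sketch under assumptions, not the crate's code: `scan_back`,
`is_glob` and `prepare_pattern` are hypothetical names, the glob characters are
assumed to be `*`, `?`, `[` and `]`, backslash escapes are ignored, and an
empty `Vec` is returned when the pattern has no literal part.

```rust
/// ASSUMPTION: these are the only glob metacharacters.
fn is_glob(b: u8) -> bool {
    matches!(b, b'*' | b'?' | b'[' | b']')
}

/// From `from` (exclusive), walk backward until `stop` accepts a byte or
/// the slice is exhausted. Returns the index just after the accepted
/// byte, or 0 if nothing was accepted.
fn scan_back(bytes: &[u8], from: usize, stop: impl Fn(u8) -> bool) -> usize {
    let mut i = from.min(bytes.len());
    while i > 0 && !stop(bytes[i - 1]) {
        i -= 1;
    }
    i
}

/// Extracts the last run of literal (non-glob) bytes from `pattern`.
pub fn prepare_pattern(pattern: &str) -> Vec<u8> {
    let bytes = pattern.as_bytes();

    // Skip trailing metacharacters, jumping over whole `[...]` ranges.
    let mut end = bytes.len();
    loop {
        match end.checked_sub(1).map(|i| bytes[i]) {
            Some(b']') => {
                // Land on the matching '[' (or on 0 if there is none).
                end = scan_back(bytes, end - 1, |b| b == b'[').saturating_sub(1);
            }
            Some(b'*') | Some(b'?') => end -= 1,
            _ => break,
        }
    }

    // Collect the literal run that ends at `end`.
    let start = scan_back(bytes, end, is_glob);
    bytes[start..end].to_vec()
}
```

Under these assumptions `prepare_pattern("*.rs")` yields `.rs`, and
`prepare_pattern("/usr/*/lib*")` yields `/lib`. Passing the stop condition as a
closure lets the same backward walk serve both the bracket skip and the literal
collection.
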
### Squozen

The implementation of Squozen itself is broken up into three phases. When the
Squozen struct is instantiated with the path to the database, the bigram table
is read in, but only the bigram table and the path are stored.

The Squozen struct has two methods: `paths()` and `matches(pattern)`. `paths()`
opens the database and returns an iterator, which reads the stream,
uncompresses it according to the algorithm described above, and returns a
read-only reference to an internal byte slice where the uncompressed results
are stored.

`matches(pattern)` performs the pattern preparation and then wraps `paths()`,
returning only those strings that match the pattern, again using the read-only
reference to the internal array presented by `paths()`.

Each invocation of `paths()` (or the `matches(pattern)` wrapper) creates a new
instance of the file reader, which is buffered, as well as a new return slice.
Multiple threads can read different instances at the same time, although I
imagine some filesystem thrashing is likely if you have more than two or three.
The scanner is pretty fast!
