REFACTOR Moving around some code.

This places the parser in its own submodule so that we can be ready
for the next two or three phases of textual analysis.  Right now we
only scan for deliberate references, but the plan is to also scan
for explicit but incidental references, and then to go further and
run full tf-idf on the source.
Elf M. Sternberg 2020-11-12 15:14:42 -08:00
parent 3068f18f0c
commit 1c948e41f3
4 changed files with 89 additions and 11 deletions


@ -1,5 +1,5 @@
mod errors;
-mod reference_parser;
+mod parser;
mod store;
mod structs;


@ -0,0 +1,59 @@
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at http://mozilla.org/MPL/2.0/.
//! # Storage layer for Notesmachine
//!
//! This library implements the core functionality of Notesmachine and
//! describes that functionality to a storage layer. There's a bit of
//! intermingling in here which can't be helped, although it may make
//! sense in the future to separate the decomposition of the note
//! content into a higher layer.
//!
//! Notesmachine storage consists of two kinds of items: Notes and Kasten (boxes).
//! This distinction is somewhat arbitrary, as structurally these two
//! items are stored in the same table.
//!
//! - Boxes have titles (and date metadata)
//! - Notes have content and a type (and date metadata)
//! - Notes are stored in boxes
//! - Notes are positioned with respect to other notes.
//!   - There are two positions:
//!     - Siblings, creating lists
//!     - Children, creating trees like this one
//! - Notes may have references (pointers) to other boxes
//! - Notes may be moved around
//! - Notes may be deleted
//! - Boxes may be deleted
//! - When a box is renamed, every reference to that box is auto-edited to
//!   reflect the change. If a box is renamed to match an existing box, the
//!   notes in both boxes are merged.
//!
//! Note-to-note relationships form trees, and are kept in a SQL database of
//! (`parent_id`, `child_id`, `position`, `relationship_type`). The
//! `position` is a monotonic index on the parent (that is, every pair
//! (`parent_id`, `position`) must be unique). The `relationship_type` is
//! an enum and can specify that the relationship is *original*,
//! *embedding*, or *referencing*. An embedded or referenced note may be
//! read/write or read-only with respect to the original, but there is only
//! one original note at any time.
//!
//! Note-to-box relationships form a graph, and are kept in the SQL database
//! as a collection of *edges* from the note to the box (and naturally
//! vice-versa).
//!
//! - Decision: When an original note is deleted, do all references and
//! embeddings also get deleted, or is the oldest one elevated to be a new
//! "original"? Or is that something the user may choose?
//!
//! - Decision: Should the merging issue be handled at this layer, or would
//! it make sense to move this to a higher layer, and only provide the
//! hooks for it here?
//!
mod references;

use references::{build_page_titles, find_links};

pub(crate) fn build_references(content: &str) -> Vec<String> {
    build_page_titles(&find_links(content))
}
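
The note-to-note relationship table described in the module documentation above could be modeled roughly as follows. This is only a sketch under stated assumptions: the type names `RelationshipType` and `NoteRelationship`, the `i64` column types, and the helper `positions_are_unique` are hypothetical and do not appear in this commit. It illustrates the stated invariant that every (`parent_id`, `position`) pair must be unique.

```rust
use std::collections::HashSet;

/// Hypothetical mirror of the `relationship_type` column (name assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelationshipType {
    Original,
    Embedding,
    Referencing,
}

/// Hypothetical row of the
/// (`parent_id`, `child_id`, `position`, `relationship_type`) table.
/// The integer types are an assumption, not taken from the schema.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NoteRelationship {
    pub parent_id: i64,
    pub child_id: i64,
    pub position: i64,
    pub relationship_type: RelationshipType,
}

/// Returns true when every (`parent_id`, `position`) pair occurs only once,
/// which is the uniqueness rule the documentation describes.
pub fn positions_are_unique(rows: &[NoteRelationship]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false on a duplicate, which stops `all` early.
    rows.iter().all(|row| seen.insert((row.parent_id, row.position)))
}
```

In the real store this rule would more likely be enforced by a uniqueness constraint in the SQL schema; the in-memory check only makes the rule concrete.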


@ -4,7 +4,7 @@ use lazy_static::lazy_static;
use regex::bytes::Regex as BytesRegex;
use regex::Regex;
-pub struct Finder(pub Vec<String>);
+struct Finder(pub Vec<String>);
impl Finder {
pub fn new() -> Self {
@ -24,7 +24,7 @@ impl Finder {
}
}
-fn find_links(document: &str) -> Vec<String> {
+pub(super) fn find_links(document: &str) -> Vec<String> {
     let arena = Arena::new();
     let mut finder = Finder::new();
     let root = parse_document(&arena, document, &ComrakOptions::default());
@ -50,25 +50,48 @@ fn find_links(document: &str) -> Vec<String> {
finder.0
}
// This function is for the camel and snake case handlers.
fn recase(title: &str) -> String {
    lazy_static! {
        // Take every word that has a pattern of a capital letter
        // followed by a lower case, and put a space between the
        // capital and anything that precedes it.
        // TODO: Make Unicode aware.
        static ref RE_PASS1: Regex = Regex::new(r"(?P<s>.)(?P<n>[A-Z][a-z]+)").unwrap();
        // Take every instance of a lower case letter or number,
        // followed by a capital letter, and put a space between them.
        // TODO: Make Unicode aware. [[:lower:]] is an ASCII-ism.
        static ref RE_PASS2: Regex = Regex::new(r"(?P<s>[[:lower:]]|\d)(?P<n>[[:upper:]])").unwrap();
-        static ref RE_PASS4: Regex = Regex::new(r"(?P<s>[a-z])(?P<n>\d)").unwrap();
        // Take every instance of a word suffixed by a number and put
        // a space between them.
        // TODO: Make Unicode aware. [[:lower:]] is an ASCII-ism.
+        static ref RE_PASS4: Regex = Regex::new(r"(?P<s>[[:lower:]])(?P<n>\d)").unwrap();
        // Take every run of one or more of the symbols listed, and
        // replace it with a space. This function is Unicode-irrelevant,
        // although there is a list of symbols in the backreference parser
        // that may disagree.
        // TODO: Examine backreference parser and determine if this is
        // sufficient.
        static ref RE_PASS3: Regex = Regex::new(r"(:|_|-| )+").unwrap();
    }
    // This should panic if misused, so... :-)
    let pass = title.to_string();
    let pass = pass.strip_prefix("#").unwrap();
    let pass = RE_PASS1.replace_all(&pass, "$s $n");
    let pass = RE_PASS4.replace_all(&pass, "$s $n");
    let pass = RE_PASS2.replace_all(&pass, "$s $n");
    RE_PASS3.replace_all(&pass, " ").trim().to_string()
}
-fn build_page_titles(references: &[String]) -> Vec<String> {
+pub(super) fn build_page_titles(references: &[String]) -> Vec<String> {
     references
         .iter()
         .filter_map(|s| match s.chars().next() {
@ -81,10 +104,6 @@ fn build_page_titles(references: &[String]) -> Vec<String> {
.collect()
}
-pub(crate) fn build_references(content: &str) -> Vec<String> {
-    build_page_titles(&find_links(content))
-}
#[cfg(test)]
mod tests {
use super::*;


@ -52,7 +52,7 @@
//!
#![allow(clippy::len_zero)]
use crate::errors::NoteStoreError;
-use crate::reference_parser::build_references;
+use crate::parser::build_references;
use crate::store::private::*;
use crate::structs::*;
use sqlx::sqlite::SqlitePool;
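
The visibility changes in this commit (`pub(super)` on `find_links` and `build_page_titles`, `pub(crate)` on `build_references`) can be illustrated with a self-contained module layout. This is a sketch, not the project's code: the function bodies are placeholders (the real `find_links` walks a Markdown AST), `titles_for` is a hypothetical caller, and the `store` module uses `super::` where the real file uses `crate::` only so the snippet compiles wherever it is pasted.

```rust
mod parser {
    mod references {
        // Placeholder body: keeps whitespace-separated tokens that start with '#'.
        // `pub(super)` makes this visible to `parser` but not to the rest of the crate.
        pub(super) fn find_links(document: &str) -> Vec<String> {
            document
                .split_whitespace()
                .filter(|w| w.starts_with('#') && w.len() > 1)
                .map(str::to_string)
                .collect()
        }

        // Placeholder body: strips the leading '#' from each reference.
        pub(super) fn build_page_titles(references: &[String]) -> Vec<String> {
            references
                .iter()
                .filter_map(|s| s.strip_prefix('#'))
                .map(str::to_string)
                .collect()
        }
    }

    use references::{build_page_titles, find_links};

    // The only entry point the rest of the crate can reach.
    pub(crate) fn build_references(content: &str) -> Vec<String> {
        build_page_titles(&find_links(content))
    }
}

mod store {
    use super::parser::build_references;

    // Hypothetical caller standing in for the store layer.
    pub(crate) fn titles_for(content: &str) -> Vec<String> {
        build_references(content)
    }
}
```

With this layout, `store` can call `parser::build_references`, but a direct call to `parser::references::find_links` from `store` would fail to compile, which is the encapsulation the refactor sets up for later analysis phases.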