Rethinking Anki's search syntax

At the moment, the syntax for searching cards has quite a few inconsistencies and undocumented behaviour.

Examples
  1. Single \ matches nothing but doesn’t throw an error, either.
  2. \\* matches literal \* (instead of \ and then anything).
  3. % is treated like a wildcard.
  4. ( can be matched anywhere, but ) only in quotes.
  5. If unquoted, \"text\" throws an error.
  6. a"b", "a""b", "a"(b), (a)(b) are all valid, but a(b) is not.
  7. tag:"a c" doesn’t match anything, but tag:a_c and tag:"a c*" (apart from matching the expected) match cards with tags a and c unless they have a tag b.
  8. : can’t be matched, either. A funny workaround is *:*:* .

I would like to redress these issues, but to do so there needs to be a fixed, consistent syntax without ambiguities. To achieve that, I want to put forward the following list of clarifications and changes (numbered only for ease of reference):

  1. \ is always an escape character. An undefined escape sequence is an error (see the sketch after this list).
  2. The defined escape sequences are: \\, \", \:, \*, \_, \(, \). They are required to match the corresponding special character.
  3. % is not a special character.
  4. All of the escape sequences may be used outside quotes.
  5. Escaping ( and ) inside quotes is optional.
  6. Escaping : is optional if it is preceded (not necessarily immediately) by another unescaped : in the containing string.
  7. Text preceded by certain keywords (re:) may deviate from all of the above, but escaping " is obligatory. Ergo, quoted text is always terminated by a " preceded by an even number (including 0) of \s.
  8. tag: doesn’t work across tag boundaries (spaces).
  9. A string starting with : is an error because the empty string can neither be a keyword nor a field name.
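
As a first building block, points 1 and 2 could be checked with a scan like the following. This is just a minimal sketch under the rules above; has_invalid_escape is an illustrative name, not existing Anki code:

fn has_invalid_escape(txt: &str) -> bool {
    let mut chars = txt.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                // the escape sequences defined in point 2
                Some('\\' | '"' | ':' | '*' | '_' | '(' | ')') => (),
                // a trailing \ or an undefined escape sequence
                _ => return true,
            }
        }
    }
    false
}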

What do you think? Will implementing this break anything (that is not broken yet)? Or have I missed something that should also be included in this list? All feedback is welcome.

By the way, I’ve tried to be as permissive as possible because I know some users may depend on the status quo. Personally, though, I think Anki would benefit from a stricter rule set (in the search context and in general) because it would allow for more helpful feedback to the user when something goes wrong. An example would be enforcing whitespace between search strings.


A note on point 8: it would be possible to split the input on whitespace and thus regard a string like "tag:a b" as shorthand for tag:a tag:b. This would probably match the intention of a user who enters such a search expression.
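
A rough sketch of that expansion, as a hypothetical helper operating on the part after tag: :

fn expand_tag_shorthand(arg: &str) -> String {
    // "a b" becomes "tag:a tag:b"
    arg.split_whitespace()
        .map(|t| format!("tag:{}", t))
        .collect::<Vec<_>>()
        .join(" ")
}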

I’m a bit confused by point 2. As you know, we’re dealing with two different things:

  • escaping Anki’s search syntax itself
  • escaping characters in the text passed to the regex or globbing engine

* is a regex wildcard, but _ is a single-character glob. If you want to treat % as normal (and automatically escape it so the globbing code treats it as a normal character), wouldn’t it be consistent to treat _ normally as well, and make the period escapable instead?

I guess we’d basically be making * and . shortcuts for \* and \. respectively. But wouldn’t that mean any other regex character would need double escaping, making it a bit of a footgun? “.*\[foo\]” to match “.*[foo]”, for example.
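
For instance (an illustrative check with the regex crate, not actual Anki code), this is what the engine would see for the quoted input, and the escaped brackets are what make it match a literal [foo]:

fn main() {
    use regex::Regex;

    // the pattern the regex engine receives for “.*\[foo\]”
    let re = Regex::new(r".*\[foo\]").unwrap();
    assert!(re.is_match(".*[foo]"));
}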

I’m still a bit concerned that enforcing these changes will complicate the parser and break things for existing users, but I’d be happy to be proved wrong on this :slight_smile:

My suggestions above refer solely to the user input. So when I write “% is not a special character.”, this means that a typed-in % will match literally and doesn’t have to (in fact, must not) be escaped.
The parser will then handle that character depending on the context, because it may be used as part of a RegEx or in an SQL LIKE comparison. That, of course, is not a change I propose, but how it already works.

As for the wildcards, I’ve just stuck with the manual, i.e.:

Anki    SQL    RegEx
*       %      .*
_       _      .

Since % isn’t mentioned there, I thought it was the natural choice to stop treating it like a wildcard and leave the rest as it is. After all, changing Anki’s wildcards would mean a huge readjustment for the users and basically no benefit codewise.
What is more, in my opinion _ and * are pretty good choices because they are probably less likely to appear in flashcard text than . or %.
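
For the regex side, the same table could be applied with something along these lines. This is purely illustrative (it ignores the escape sequences handled in the snippet further down) and not the actual code:

fn wildcards_to_regex(txt: &str) -> String {
    let mut out = String::new();
    for c in txt.chars() {
        match c {
            '*' => out.push_str(".*"), // Anki * -> RegEx .*
            '_' => out.push('.'),      // Anki _ -> RegEx .
            // regex metacharacters in the text must match literally
            c if r"\.+*?()|[]{}^$#&-~".contains(c) => {
                out.push('\\');
                out.push(c);
            }
            c => out.push(c),
        }
    }
    out
}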

Now, if the user input is already a RegEx (prefixed by re:), this is where point 7 comes in. To avoid our very own backslash plague, we only enforce escaping " and regard the input as a raw RegEx otherwise. This works because " isn’t a special character in RegEx. Lucky us!
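
Finding that terminating quote then boils down to a simple scan, roughly like this (a sketch; closing_quote is a made-up helper name):

/// Return the index of the first '"' preceded by an even number of
/// backslashes, i.e. the quote that terminates the string (point 7).
fn closing_quote(txt: &str) -> Option<usize> {
    let mut backslashes = 0;
    for (i, c) in txt.char_indices() {
        match c {
            '\\' => backslashes += 1,
            '"' if backslashes % 2 == 0 => return Some(i),
            _ => backslashes = 0,
        }
    }
    None
}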

I hope I’ve answered your questions. Maybe the following snippet from what I’ve written so far will clear things up.
The function will be called on an unqualified text string or the right-hand side of a field search, for example; generally speaking, on anything that the SQL writer will use after the LIKE operator.
If you’re wondering, it has a counterpart called unescape_to_re and I was struggling to find a suitable function name. :sweat_smile:

Snippet
use std::borrow::Cow;

use lazy_static::lazy_static;
use regex::{Captures, Regex};

// ParseResult, ParseError and is_invalid_escape are defined elsewhere in the module.

/// Handle escaped characters and convert Anki wildcards to SQL wildcards.
/// Return error if there is an undefined escape sequence.
fn unescape_to_glob(txt: &str) -> ParseResult<Cow<str>> {
    if is_invalid_escape(txt) {
        Err(ParseError {})
    } else {
        // escape sequences and unescaped special characters which need conversion
        lazy_static! {
            static ref RE: Regex = Regex::new(r"(\\.|[*%])").unwrap();
        }
        Ok(RE.replace_all(txt, |caps: &Captures| {
            match caps.get(0).unwrap().as_str() {
                r"\\" => r"\\",  // escaped \ stays escaped for SQL
                "\\\"" => "\"",  // " needs no escaping in the LIKE pattern
                r"\:" => ":",    // neither does :
                r"\*" => "*",    // escaped * becomes a literal *
                r"\_" => r"\_",  // escaped _ stays escaped (_ is a SQL wildcard)
                r"\(" => "(",
                r"\)" => ")",
                "*" => "%",      // Anki wildcard -> SQL wildcard
                "%" => r"\%",    // literal % must be escaped for SQL
                _ => unreachable!(),
            }
        }))
    }
}
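
For illustration, here is how the conversion would behave on two hypothetical inputs (both pass is_invalid_escape):

assert_eq!(unescape_to_glob(r"50% of\: a*b").unwrap(), r"50\% of: a%b");
assert_eq!(unescape_to_glob(r"\*literal\* x_y").unwrap(), r"*literal* x_y");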

Thanks for elaborating; that clears things up a bit and it’s clear you’ve thought this through.

See below for the branch. No more “major” changes should be necessary.
On a completely unrelated note, may I ask which tools or setup you are using for debugging the Rust code, @dae?

The editor tooling is still very rough around the edges, but I’ve found Rust Analyzer in VSCode to be the least-bad option at the moment. It has an option to run ‘cargo check’ at startup, which should fix code completion missing from generated files.

I haven’t bothered to look into debuggers, and just use dbg!() when trying to understand why something’s failing.
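
For example:

fn main() {
    let txt = r"a\*b";
    // prints something like: [src/main.rs:4] txt.len() = 4
    dbg!(txt.len());
}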


I’ve just gone over the code - I don’t see any major problems with your approach, but have left some comments on the commit. Feel free to submit it as a draft PR - that will run the unit tests, and allow me to make multiple comments without notifying you multiple times.


Thanks a lot for the feedback!
At the moment, there’s a bug which makes the unit tests fail.
I’ll make a PR once I’ve eradicated it.