Search using nc: does not find Polish diacritic letter L or Croatian diacritic letter D

sprvlcn · January 1, 2024, 1:36am

This might be a problem with some library rather than Anki itself. How is nc: implemented?

Languages tested so far:

Slovakian
Hungarian
Czech
Latvian
Polish
Lithuanian
Croatian
Estonian

Diacritic letters for the above languages, respectively, from Wikipedia:

áäčďéíľňóôŕšťúýž
áéóöőü
áčďéěíňóřšťúůýž
āčēģīķļņšūž
ąćęłńóśźż
ąčęėįšųūž
čćđljnjšž
šžõäöü

Searches:

nc:aacdeilnoorstuyz
nc:aeooou
nc:acdeeinorstuuyz
nc:acegiklnsuz
- nc:acelnoszz
- nc:ace_noszz
nc:aceeisuuz
- nc:ccdljnjsz
- nc:cc_ljnjsz
nc:szoaou

(I included lj and nj for Croatian because Unicode has a single-character version of both of these, even though they are usually just written with l+j and n+j.)

dae · January 1, 2024, 3:13am

github.com/ankitects/anki/blob/8f77e5198b2e0a3019fa143e621a80ae62961f20/rslib/src/text.rs#L372

    }
}

pub(crate) fn ensure_string_in_nfc(s: &mut String) {
    if !is_nfc(s) {
        *s = s.chars().nfc().collect()
    }
}

/// Convert provided string to NFKD form and strip combining characters.
pub(crate) fn without_combining(s: &str) -> Cow<str> {
    // if the string is already normalized
    if matches!(is_nfkd_quick(s.chars()), IsNormalized::Yes) {
        // and no combining characters found, return unchanged
        if !s.chars().any(is_combining_mark) {
            return s.into();
        }
    }

    // we need to create a new string without the combining marks
    s.chars()
sprvlcn · January 1, 2024, 6:06am

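It looks like Unicode normalization only helps with marks such as acute, grave, circumflex, macron, caron, breve, cedilla, etc., where the letter decomposes into a base letter plus a combining character. Barred letters such as ł and đ have no decomposition in Unicode, so they come through every normalization form unchanged.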

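from unicodedata import normalize

# Show each normalization form; code points above 255 are printed as numbers.
for s in ("Lech Wałęsa", "Novak Đoković"):
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        n = normalize(form, s)
        print([ord(c) if ord(c) > 255 else c for c in n])
    print()  # blank line between the two names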

['L', 'e', 'c', 'h', ' ', 'W', 'a', 322, 281, 's', 'a']
['L', 'e', 'c', 'h', ' ', 'W', 'a', 322, 'e', 808, 's', 'a']
['L', 'e', 'c', 'h', ' ', 'W', 'a', 322, 281, 's', 'a']
['L', 'e', 'c', 'h', ' ', 'W', 'a', 322, 'e', 808, 's', 'a']

['N', 'o', 'v', 'a', 'k', ' ', 272, 'o', 'k', 'o', 'v', 'i', 263]
['N', 'o', 'v', 'a', 'k', ' ', 272, 'o', 'k', 'o', 'v', 'i', 'c', 769]
['N', 'o', 'v', 'a', 'k', ' ', 272, 'o', 'k', 'o', 'v', 'i', 263]
['N', 'o', 'v', 'a', 'k', ' ', 272, 'o', 'k', 'o', 'v', 'i', 'c', 769]
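
The same result can be read directly from the Unicode character database. The sketch below uses unicodedata.decomposition, which returns the decomposition mapping as a string of hex code points, or an empty string when the character has none. The choice of sample letters is mine.

```python
import unicodedata

# Letters with an ogonek or acute decompose; the barred letters do not.
for ch in "ęćłđŁĐ":
    mapping = unicodedata.decomposition(ch)
    print(f"U+{ord(ch):04X} {ch}: {mapping or 'no decomposition'}")
```

An empty mapping means no normalization form can separate the bar from the letter, so stripping combining marks afterwards has nothing to remove.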

One approach is simply to keep an exhaustive translation list for the letters that normalization cannot handle, like the one WordPress has in PHP.
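
A minimal sketch of that idea in Python, assuming the table is applied after the usual NFKD-and-strip step. The name fold_diacritics and the table EXTRA_FOLDS are hypothetical, and the table is deliberately incomplete: it lists only a few letters that have no Unicode decomposition. It is not taken from WordPress or from Anki.

```python
import unicodedata

# Hypothetical, incomplete table of letters that have no Unicode decomposition.
EXTRA_FOLDS = str.maketrans({
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "ø": "o", "Ø": "O",
    "ħ": "h", "Ħ": "H",
})


def fold_diacritics(s: str) -> str:
    """Strip combining marks, then map the remaining barred letters by table."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )
    return stripped.translate(EXTRA_FOLDS)


print(fold_diacritics("Lech Wałęsa"))
print(fold_diacritics("Novak Đoković"))
```

Running normalization first keeps the table small, since it only needs entries for the letters normalization leaves behind.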

Unfortunately I have zero experience with Rust, so I don’t think I could formulate a PR.

dae · January 3, 2024, 3:49am

I’ve logged this as issue #2926 in ankitects/anki on GitHub: “Our no-combining search does not ignore all diacritics”.

system · February 2, 2024, 3:50am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
