Search using nc: does not find the Polish letter ł or the Croatian letter đ

This might be a problem with some library rather than Anki itself. How is nc: implemented?

Languages tested so far:

  • Slovak
  • Hungarian
  • Czech
  • Latvian
  • Polish
  • Lithuanian
  • Croatian
  • Estonian

Letters with diacritics for each of the above languages, in the same order, from Wikipedia:

  • áäčďéíľňóôŕšťúýž
  • áéóöőü
  • áčďéěíňóřšťúůýž
  • āčēģīķļņšūž
  • ąćęłńóśźż
  • ąčęėįšųūž
  • čćđljnjšž
  • šžõäöü

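Searches, in the same order (for Polish and Croatian, the second variant replaces the problem letter with the _ wildcard):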

  • nc:aacdeilnoorstuyz
  • nc:aeooou
  • nc:acdeeinorstuuyz
  • nc:acegiklnsuz
    • nc:acelnoszz
    • nc:ace_noszz
  • nc:aceeisuuz
    • nc:ccdljnjsz
    • nc:cc_ljnjsz
  • nc:szoaou

(I included lj and nj for Croatian because Unicode has a single-character version of both of these, even though they are usually just written with l+j and n+j.)
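As a quick check of the single-character digraphs, here is a minimal Python sketch. It only shows how Python's unicodedata treats U+01C9 and U+01CC; I don't know which normalization form nc: uses, and the helper name forms is just made up for illustration.

```python
import unicodedata

def forms(ch: str) -> tuple[str, str]:
    """Return the NFD and NFKD forms of a single character (illustrative helper)."""
    return unicodedata.normalize("NFD", ch), unicodedata.normalize("NFKD", ch)

for ch in ("\u01c9", "\u01cc"):  # single-character lj and nj
    nfd, nfkd = forms(ch)
    # Canonical decomposition (NFD) leaves the digraph alone;
    # compatibility decomposition (NFKD) splits it into two plain letters.
    print(unicodedata.name(ch), [hex(ord(c)) for c in nfd], list(nfkd))
```

So these two only turn into plain l+j and n+j if a compatibility form (NFKC or NFKD) is applied.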

It looks like Unicode normalization only helps with marks such as acute, grave, circumflex, macron, caron, breve, cedilla, etc., which decompose into a base letter plus a combining character. It does not help with barred letters such as ł and đ, which have no such decomposition.

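from unicodedata import normalize

# Show each name under all four normalization forms.
# Code points above 255 are printed as numbers so they are easy to spot.
for s in ("Lech Wałęsa", "Novak Đoković"):
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        n = normalize(form, s)
        print([ord(c) if ord(c) > 255 else c for c in n])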

['L', 'e', 'c', 'h', ' ', 'W', 'a', 322, 281, 's', 'a']
['L', 'e', 'c', 'h', ' ', 'W', 'a', 322, 'e', 808, 's', 'a']
['L', 'e', 'c', 'h', ' ', 'W', 'a', 322, 281, 's', 'a']
['L', 'e', 'c', 'h', ' ', 'W', 'a', 322, 'e', 808, 's', 'a']

['N', 'o', 'v', 'a', 'k', ' ', 272, 'o', 'k', 'o', 'v', 'i', 263]
['N', 'o', 'v', 'a', 'k', ' ', 272, 'o', 'k', 'o', 'v', 'i', 'c', 769]
['N', 'o', 'v', 'a', 'k', ' ', 272, 'o', 'k', 'o', 'v', 'i', 263]
['N', 'o', 'v', 'a', 'k', ' ', 272, 'o', 'k', 'o', 'v', 'i', 'c', 769]
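To make the consequence concrete, here is a rough sketch of the usual "decompose, then drop combining marks" approach in Python. This is my assumption about how a no-combining search might work in general, not Anki's actual (Rust) implementation, and strip_combining is a name I made up.

```python
import unicodedata

def strip_combining(text: str) -> str:
    """Decompose to NFD, then drop every character with a nonzero combining class."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_combining("Lech Wałęsa"))    # ę loses its ogonek, ł is left as-is
print(strip_combining("Novak Đoković"))  # ć loses its acute, Đ is left as-is
```

Since 322 (ł) and 272 (Đ) never split into a base letter plus a mark in the output above, there is nothing for this kind of filter to remove.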

One approach is simply to keep an exhaustive translation list for the letters that normalization misses, as WordPress does in PHP.
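A minimal sketch of that idea in Python, assuming the table is applied on top of combining-mark stripping. The table here (EXTRA_FOLDS) is hypothetical and deliberately tiny, covering only the two letters from this thread; it is not the WordPress list and not what Anki does.

```python
import unicodedata

# Hypothetical, incomplete table: letters with no canonical decomposition.
EXTRA_FOLDS = str.maketrans({
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
})

def fold_diacritics(text: str) -> str:
    """Strip combining marks via NFD, then map the leftover barred letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.translate(EXTRA_FOLDS)

print(fold_diacritics("Lech Wałęsa"))
print(fold_diacritics("Novak Đoković"))
print(fold_diacritics("ąćęłńóśźż"))
```

A real list would need many more entries, which is presumably why the WordPress one is so long.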

Unfortunately I have zero experience with Rust, so I don’t think I could formulate a PR.

I’ve logged this on GitHub as issue #2926 in ankitects/anki, “Our no-combining search does not ignore all diacritics”.