As of Anki 2.1.54, when sorting by “Sort Field”, letters with diacritics, such as umlauts, don’t end up where users would expect them. This is because the comparison is a simple collate nocase in SQLite. The nocase collation only folds ASCII letters and compares every other character by its encoded value, which causes Ü (U+00DC) to precede ä (U+00E4), even though a precedes u.
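To make the behaviour concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the helper name `sort_with_nocase` are made up for illustration; this is not Anki's actual query.

```python
import sqlite3


def sort_with_nocase(words):
    """Return `words` ordered by SQLite's built-in NOCASE collation."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical single-column table standing in for the sort field.
    conn.execute("CREATE TABLE notes (sfld TEXT)")
    conn.executemany("INSERT INTO notes VALUES (?)", [(w,) for w in words])
    rows = conn.execute(
        "SELECT sfld FROM notes ORDER BY sfld COLLATE NOCASE"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Given apple, Uhr, zebra, Über and ähnlich, this returns apple, Uhr, zebra, Über, ähnlich: the ASCII words are ordered case-insensitively, while both umlaut words land after z, with Ü before ä.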
It doesn’t seem easy to do much better than this in SQLite. We could provide our own collation function, but I don’t like that idea very much, as we would then have to maintain some C code (I think). Alternatively, we could order by a string obtained by normalizing the “Sort Field” string in SQL, but I don’t believe that would be performant enough.
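For discussion, here is a rough sketch of what the two options could look like, written against Python's sqlite3 module rather than Anki's real backend. Everything here is an assumption for illustration: the collation name `unicase`, the `sort_key` column, and the placeholder normalization (plain `str.casefold`). The second function is a variation on the normalization idea that computes the key once at write time instead of in SQL at query time.

```python
import sqlite3


def sort_key(text):
    # Placeholder normalization (assumption): full Unicode case folding only.
    return text.casefold()


def compare(a, b):
    # Collation callbacks must return a negative, zero, or positive integer.
    ka, kb = sort_key(a), sort_key(b)
    return (ka > kb) - (ka < kb)


def order_with_collation(words):
    """Option 1: register a custom collation and use it in ORDER BY."""
    conn = sqlite3.connect(":memory:")
    conn.create_collation("unicase", compare)
    conn.execute("CREATE TABLE notes (sfld TEXT)")
    conn.executemany("INSERT INTO notes VALUES (?)", [(w,) for w in words])
    rows = conn.execute(
        "SELECT sfld FROM notes ORDER BY sfld COLLATE unicase"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


def order_with_key_column(words):
    """Option 2 (variation): store a precomputed key and sort on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (sfld TEXT, sort_key TEXT)")
    conn.execute("CREATE INDEX idx_notes_sort_key ON notes (sort_key)")
    conn.executemany(
        "INSERT INTO notes VALUES (?, ?)", [(w, sort_key(w)) for w in words]
    )
    rows = conn.execute("SELECT sfld FROM notes ORDER BY sort_key").fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Both functions order Über, ähnlich, Ärger as ähnlich, Ärger, Über. The first calls back into application code for every comparison. The second costs an extra column that has to be kept in sync whenever the sort field changes. I have not measured either.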
I think we can get to a fairly “language-agnostic” alphabetical sorting order that would be better than the one we currently have. To give an example, German has at least two standardized transformations that produce “more” alphabetical sorting orders:
- DIN 5007 Variant 1 (“Dictionary order”)
  - ä = a
  - ö = o
  - ü = u
  - ß = ss
- DIN 5007 Variant 2 (“Phone book order”)
  - ä = ae
  - ö = oe
  - ü = ue
  - ß = ss
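The two variants listed above can be sketched as sort-key functions in Python. The names `DIN_5007_1`, `DIN_5007_2` and `din_key` are my own, and lowercasing before the substitution is an assumption on my part, not something taken from the standard.

```python
import unicodedata

# Substitution tables for the two variants listed above.
DIN_5007_1 = {
    "\u00e4": "a",   # ä
    "\u00f6": "o",   # ö
    "\u00fc": "u",   # ü
    "\u00df": "ss",  # ß
}
DIN_5007_2 = {
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
    "\u00df": "ss",  # ß
}


def din_key(text, table):
    """Build a sort key by applying one of the substitution tables."""
    # NFC so that "a" + combining diaeresis is treated like precomposed "ä".
    lowered = unicodedata.normalize("NFC", text).lower()
    return "".join(table.get(ch, ch) for ch in lowered)


def din_sorted(words, table):
    return sorted(words, key=lambda w: din_key(w, table))
```

The variants give different results: for Ofen, Öl, Ozean, Variant 1 yields Ofen, Öl, Ozean (Öl is keyed as "ol"), while Variant 2 yields Öl, Ofen, Ozean (Öl is keyed as "oel").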
Some discussion and research are needed to figure out what this language-agnostic transformation would look like. It doesn’t sound like a novel problem, so we can probably get some inspiration from existing open-source software. We would still need to figure out how to implement this in a performant way.
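As one candidate to start that discussion (not a proposal for the final transformation), a key could be built from Unicode normalization alone: decompose each character, drop the combining marks, then case-fold. The function name `agnostic_key` is hypothetical. In effect this generalizes DIN 5007 Variant 1 to other scripts, and it does nothing for letters such as ø or ł, which have no canonical decomposition.

```python
import unicodedata


def agnostic_key(text):
    """Sort key that ignores case and combining diacritics."""
    # NFD splits e.g. "ä" into "a" followed by U+0308 (combining diaeresis).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and non-zero for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(); it also maps "ß" to "ss".
    return stripped.casefold()


def agnostic_sorted(words):
    return sorted(words, key=agnostic_key)
```

With this key, zebra, Über, ähnlich, apple, Uhr sorts as ähnlich, apple, Über, Uhr, zebra, independent of any locale setting.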
Preemptively, I strongly recommend against any solution that depends on the system locale. Anki is very often used to learn other languages, so I believe the system locale will often not be the one users want their decks sorted by.