Inconsistent unicode normalization

epistularum · August 27, 2022, 2:51pm

Anki still normalizes notes in exported apkg files, even when normalization has been disabled with the command below.

mw.col.conf["normalize_note_text"] = False

I’m not sure whether this is the intended behavior, but it’s extremely confusing when your whole deck suddenly changes after an export. I spent a whole day combing through my exported deck trying to understand what was going on before I finally figured it out.

BlackBeans · August 27, 2022, 7:26pm

I’m not sure how you would even notice that. I don’t know exactly what Anki does, but Unicode normalization is supposed to be unnoticeable unless you compare the raw byte sequences. Display and behavior in searches and the like should not change.

epistularum · August 28, 2022, 5:34am

You would think that is what Unicode normalization does, but it’s really not the case, or at least that’s not how Anki handles it.

The Anki manual makes it seem like an edge case, but it’s really not:

if you are studying certain material like archaic Japanese symbols, the normalization process can end up converting them to a more modern equivalent.

What this essentially means is that a bunch of 旧字 (and others) will be converted to their 新字 forms.
Some examples, taken from the 旧字体 forms of 常用 kanji:

喝 - 喝
嘆 - 嘆
器 - 器
塚 - 塚
塀 - 塀
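
For illustration, the conversion can be reproduced outside Anki with Python's standard `unicodedata` module. This is only a sketch, not Anki's own code: it assumes the 旧字体 forms in the list above were entered as CJK compatibility ideographs (written here as explicit escapes such as U+FA36, since the two forms can be hard to tell apart on screen) and that the normalization form applied is NFC.

```python
import unicodedata

# Assumed inputs: compatibility code points for the five 旧字体 forms listed above.
OLD_FORMS = ["\uFA36", "\uFA37", "\uFA38", "\uFA10", "\uFA39"]

def nfc_mapping(chars):
    """Map each character to the result of NFC-normalizing it."""
    return {ch: unicodedata.normalize("NFC", ch) for ch in chars}

for old, new in nfc_mapping(OLD_FORMS).items():
    status = "changed" if old != new else "unchanged"
    print(f"U+{ord(old):04X} -> U+{ord(new):04X} ({status})")
```

Printing code points instead of glyphs makes the change visible even when the font draws both forms the same way.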

Calling them “archaic” is, I believe, simply inaccurate. Even after the introduction of the 常用 and 人名用 lists, these “old” forms (correct from a linguistic point of view, aka 正字) are still in use and still permitted by the Japanese government. This is stated explicitly in the first and second paragraphs of the preface to the 常用漢字表 published by the Agency for Cultural Affairs of the Japanese government.

1 この表は，法令，公用文書，新聞，雑誌，放送など，一般の社会生活において，現代の国語を書き表す場合の漢字使用の目安を示すものである。

“This table shows the standard kanji usage for writing modern Japanese in general social situations such as laws and regulations, official documents, newspapers, magazines, and broadcasts.”

2 この表は，科学，技術，芸術その他の各種専門分野や個々人の表記にまで及ぼそうとするものではない。ただし，専門分野の語であっても，一般の社会生活と密接に関連する語の表記については，この表を参考とすることが望ましい。

“This table does not attempt to extend to the notation of scientific, technical, artistic and other various specialties and individual persons. However, it is advisable to refer to this table for the notation of words closely related to general social life, even in one’s specific fields of specialty.”

It is important to distinguish between the government-imposed standard for kanji usage, which applies in “general social situations such as laws and regulations, official documents, newspapers, magazines, and broadcasts”, and the linguistically correct forms of kanji, aka 正字 (often 旧字), or other forms.

Not only does the term “archaic” have absolutely no legal basis, it also has no basis in real-world usage. I can quite literally step out of my house and find an “archaic form” within a minute at most. It is impossible not to come across one of those supposedly “archaic” kanji every. single. day.

Regardless, here is a short list of characters that are officially included in the 人名用漢字表 but are converted by Anki:

渚
猪
祐
禎
靖
蓮

I’ve already had this same discussion on the old forums with a moderator, but I’ll reiterate my opinion. This default behavior makes no sense for a language-learning app, even more so for one whose name and main focus are Japanese.

Rumo · August 28, 2022, 6:30am

Even if the name is Japanese, Anki tries to be subject-agnostic. Most language learners are not learning Japanese and most Japanese learners focus on the 常用 kanji, I dare say. You’ll always have to make compromises.

But more to the point, which Anki version and which exporter are you using? The old or the new one?

epistularum · August 28, 2022, 7:57am

I totally understand, I’m just pointing out the irony.

Version 2.1.54 (b6a7760c)
Python 3.9.10 Qt 6.3.1 PyQt 6.3.1

I tested both the new and old exporter with the same results.

Rumo · August 28, 2022, 8:38am

I can’t reproduce this with either exporter. I’ve done the following:

Disable normalization.
Add note with 神.
Export into apkg.
Delete note.
Import apkg.

The reimported note still has 神, compared to its normalisation 神, on it. This seems to prove that neither importer nor exporter have performed any normalisation.
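
Because the two forms can look identical on screen, a check like the one in the steps above is easier to trust when it compares code points rather than glyphs. The helper below is a hypothetical sketch (the function names are made up for this example and are not part of Anki); it assumes the compatibility form of 神 is U+FA19 and the unified form is U+795E.

```python
from itertools import zip_longest

def codepoints(text):
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

def first_difference(before, after):
    """Return (index, before_cp, after_cp) for the first mismatch, or None if identical."""
    pairs = zip_longest(codepoints(before), codepoints(after))
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return index, a, b
    return None

exported = "\uFA19"     # field text before export (assumed compatibility form)
reimported = "\uFA19"   # field text as read back after import

# None here would mean the round trip left the text untouched.
print(first_difference(exported, reimported))
# Comparing against the unified form shows what a normalized result would look like.
print(first_difference(exported, "\u795E"))
```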

epistularum · August 28, 2022, 8:48am

I understand what is happening now: this is a user error on my part.

I thought that disabling Unicode normalization would disable it for all profiles, but I was wrong. This is a per-profile setting. All my tests were done in a separate testing profile where normalization was still enabled.
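
A minimal sketch of why the test profile behaved differently, assuming each profile keeps its own collection configuration. Plain dictionaries stand in for each profile's `mw.col.conf`; the helper name and the default of `True` for a missing key are assumptions for illustration, not confirmed Anki behavior.

```python
KEY = "normalize_note_text"

def normalization_enabled(conf):
    """Read the flag from a conf-like mapping; a missing key counts as enabled (assumption)."""
    return bool(conf.get(KEY, True))

# Stand-ins for the configuration of two separate profiles.
main_profile_conf = {KEY: False}  # the command from the first post was run here
test_profile_conf = {}            # never touched, so the assumed default applies

for name, conf in [("main", main_profile_conf), ("test", test_profile_conf)]:
    print(name, "profile normalizes notes:", normalization_enabled(conf))
```

In practice this means the command from the first post has to be run once with each profile open.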

I am deeply sorry for the time I made you waste, and I sincerely appreciate the support provided.

BlackBeans · August 28, 2022, 8:52am

Just FYI (since you seem interested): even though the wording of Anki’s manual may be a bit poor here, it’s not Anki that decides that the two forms of 喝 are equivalent. They are defined as equivalent (canonically equivalent, in Unicode’s terms) by an international standard, the Unicode Standard published by the Unicode Consortium. All this to say that there is a “legal” basis for this transformation, and it’s not because some kanji are considered archaic. It’s because they are considered equivalent in usage, and because the canonical representative of this equivalence class of characters was chosen to be the more modern one (despite the other version still being in use).

Anki refers to these kanji as archaic simply because they are older than their modern counterparts.
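
The mapping described above can be inspected directly in the Unicode Character Database, here through Python's `unicodedata` module. As a sketch: a decomposition mapping with no `<...>` tag is a canonical one and is applied by every normalization form, while a tagged one such as `<compat>` is applied only by NFKC/NFKD. U+FA36 is assumed to be the compatibility form of 喝 discussed here; the ligature U+FB01 is included only as a contrast, and the helper name is made up for the example.

```python
import unicodedata

def equivalence_kind(ch):
    """Classify a character's decomposition mapping as canonical, compatibility, or none."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    return "compatibility" if mapping.startswith("<") else "canonical"

for ch in ["\uFA36", "\uFB01", "\u559D"]:
    nfc_changes = unicodedata.normalize("NFC", ch) != ch
    nfkc_changes = unicodedata.normalize("NFKC", ch) != ch
    print(f"U+{ord(ch):04X}: {equivalence_kind(ch)}, "
          f"NFC changes it: {nfc_changes}, NFKC changes it: {nfkc_changes}")
```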

Rumo · August 28, 2022, 8:54am

No worries, glad it’s sorted out.

epistularum · August 28, 2022, 9:57am

There are a lot of issues with the Unicode standard when it comes to kanji, especially regarding Han unification. Some choices made by the Unicode standard stem from a technical inability to encode all characters, and thus do not reflect the linguistic reality. The argument of equivalence is merely a line drawn in the sand when it comes to Unicode. The vast majority of kanji have multiple pure equivalents in the form of 俗字、略字、誤字、本字、古字, and so on, all of them pointing to a single correct form, the 親字 (aka 正字). It’s not as if the Unicode Consortium combines every single variant into the 親字, and we know that not all characters are represented. This means that it’s basically up to them to pick and choose what should be combined or not.

As for the legal basis, I do not believe that the Unicode Consortium constitutes a legal entity. I’m just pointing out the legality as part of my argument, since even some 人名用 kanji get converted. Even if the Unicode Consortium were a legal entity, it wouldn’t mean that they would do what is linguistically correct. We’ve seen how crooked the 常用 implementation is, with magical standards like characters that are half 正字 and half 俗字 (e.g. 餅 (正字 𩙿 with 俗字 并 (instead of 幷))) or even characters that aren’t simplified at all, like 牙.

Concerning the specific usage of “archaic”: I am not a native English speaker, but I believe the connotation is of something from a different age that should be or has been replaced and only lives on in a minor form. This is not the case for 旧字, as specifically explained in the 常用漢字表. The 旧字 are meant to remain in use alongside the 新字.

My point is that, in the end, the default Anki configuration assumes that those characters shouldn’t be learned, which is far from the truth. This is reinforced by the lack of a menu option to easily disable the feature: it is basically hidden from the vast majority of people, even advanced users.

dae · August 30, 2022, 12:29pm

You make reasonable points, and I’d personally have been happier if NFC/NFD normalization were decoupled from Han unification. That said, judging by the relatively few reports we get about this, it does not seem to affect that many people. I worry that if this is exposed as a visible option, users will accidentally toggle it, then generate support requests when their searches don’t work on some platforms that happen to use a different encoding for text input. Any decks they share will propagate the problem further.
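
The search problem mentioned here can be sketched with plain Python strings. The example assumes a note stored in decomposed form (NFD) and a query typed in composed form (NFC); the sample word and the helper name are made up for illustration, and this is not how Anki's search is implemented.

```python
import unicodedata

def contains(haystack, needle, normalize=False):
    """Substring test, optionally NFC-normalizing both sides first."""
    if normalize:
        haystack = unicodedata.normalize("NFC", haystack)
        needle = unicodedata.normalize("NFC", needle)
    return needle in haystack

stored = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (decomposed)
query = "caf\u00E9"    # single precomposed character (composed)

print(contains(stored, query))                  # raw code point comparison
print(contains(stored, query, normalize=True))  # comparison after normalizing both sides
```

Both strings render the same way, so a user whose unnormalized notes came from a platform with a different input encoding would have little to go on when a search comes back empty.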
