Python checksum != rust checksum?

minshall · March 10, 2021, 3:40pm

hi. i seem to have a case (ten, actually) where ti.importNotes doesn’t detect a duplicate note. the cards in question are in Farsi, but only account for 10 out of a group of 1800 notes in Farsi. i’ll post field 0 of one here:

'<div align="center">مٌتِأَهِل</div>'

the python code calculates a checksum of 3012355254, whereas the note on disk (if i’m doing the sql right) seems to have 2856611645.

i’d love verification, or a Bronx cheer. i didn’t do things like check that all the non-problem notes have the same checksum via python and (i guess) rust, etc.

dae · March 11, 2021, 1:59am

The former value is before unicode normalization:

>>> from anki.utils import fieldChecksum
... from unicodedata import normalize
... text = '<div align="center">مٌتِأَهِل</div>'
... print(fieldChecksum(text))
... print(fieldChecksum(normalize("NFC", text)))

3012355254
2856611645

If you have checksums in your collection created from an older Anki version that was not ensuring the text was normalized first, a database check should fix them up.
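
A rough, self-contained sketch of why the two numbers differ. `field_checksum` below is a hypothetical stand-in (first 8 hex digits of a SHA-1 over the UTF-8 bytes), not Anki's actual `fieldChecksum`, so the values it produces are not expected to match the ones above. The point is only that one word, written as two different code point sequences, hashes differently until both are normalized.

```python
from hashlib import sha1
from unicodedata import normalize


def field_checksum(text):
    """Hypothetical stand-in: 32-bit int from the first 8 hex digits of SHA-1."""
    return int(sha1(text.encode("utf-8")).hexdigest()[:8], 16)


# The same word in two encodings:
#   decomposed: ... ALEF (U+0627) + FATHA (U+064E) + HAMZA ABOVE (U+0654) ...
#   composed:   ... ALEF WITH HAMZA ABOVE (U+0623) + FATHA (U+064E) ...
decomposed = "\u0645\u064c\u062a\u0650\u0627\u064e\u0654\u0647\u0650\u0644"
composed = "\u0645\u064c\u062a\u0650\u0623\u064e\u0647\u0650\u0644"

# Raw strings differ, so their checksums differ.
raw_match = field_checksum(decomposed) == field_checksum(composed)

# After NFC normalization both spell the word the same way.
nfc_match = field_checksum(normalize("NFC", decomposed)) == field_checksum(composed)

print(raw_match, nfc_match)
```

Normalizing on both sides of any comparison (checksum or plain string equality) is what makes the duplicate check independent of how the text was typed.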

minshall · March 11, 2021, 2:54am

hi. i did a database check, but the symptoms don’t change. the former value is what i see NoteImporter.importNotes (syntax?) getting. i don’t see it doing any ‘normalize’.

            fld0 = n.fields[fld0idx]
            csum = fieldChecksum(fld0)

(and, a simple find . ... grep doesn’t turn up any obvious candidate.)

the --version argument requires a VERSION? :) ah, history.

bash apollo2 (master): {49768} anki --version VERSION
Anki version '2.1.35'

dae · March 13, 2021, 12:25am

Sorry, I assumed you had the opposite problem. The importing code is due for a rewrite soon, but I’ve pushed a fix to git that should work around it for now.

minshall · March 13, 2021, 12:27pm

Damien,

thanks. that didn’t work. my sense is maybe you made the fix on the wrong end, i.e., in the rust code, rather than in NoteImporter::importNotes; i thought you were going to put a normalize() for the parameter to the fieldChecksum() call there.

i’ve been using the arch linux version. converting to the git version, there are random printf’s (println’s, i guess) that are new; i’m guessing these are just debugging printf’s while you’re developing, but, in case they indicate that my code or something else is tickling your code badly, they include:

begin: None
ended, undo steps count now 0
clearing undo+study due to insert or replace into notes values (?,?,?,?,?,?,?,?,?,?,?)
clearing undo+study due to update notes set mod = ?, usn = ?, flds = ?, tags = ? where id = ? and (flds != ? or tags != ?)
clearing undo+study due to update cards set type = 2, queue = 2, ivl = ?, due = ?, factor = ?, reps = ?, lapses = ? where nid = ? and ord = ?

btw, i was very pleased to run into consts.py, especially the card and queue type constants. thanks!

dae · March 15, 2021, 1:01am

When running from git you’re running development code, so it may have debug messages or bugs in it - it’s not intended to be used as a daily driver.

It surprises me that the change did not fix things for you: ensure fields normalized before checksumming · ankitects/anki@1ab085d · GitHub

Could you poke into it a bit to figure out what is not matching?

@ArthurMilchior deserves the credit for the queue and type consts in consts.py

minshall · March 16, 2021, 3:57pm

sure. is there a way to look at checksums on the web or in IOS app? i’m sure it’s not necessary, but might be convenient.

minshall · March 17, 2021, 7:15am

hi. the below seems to fix my problem. which, if i understand what normalize() might do, could make sense. the checksum issues were presumably a red herring – sorry about that.

2 “buts”:

1. in the middle of my debugging, i seem to have made my collection unhappy. it asks me to do “check database”, which i do, repeatedly. then, it suggests “please force a full sync in the preferences screen”. i do this (well, i do “force changes”, which i assume is the right thing?). but, when i sync, it does not ask for a direction; it seems to just do a standard “delta” sync. and, i stay in the bad state. (in my debugging notes, this started way before i hit on the below change.)
2. caveat emptor: i have no idea what the more global effect of this change might be on everything else i or anyone else might want to do.

diff --git a/pylib/anki/importing/noteimp.py b/pylib/anki/importing/noteimp.py
index 8865c2674..f6f60f9e7 100644
--- a/pylib/anki/importing/noteimp.py
+++ b/pylib/anki/importing/noteimp.py
@@ -16,6 +16,7 @@ from anki.utils import (
     splitFields,
     timestampID,
 )
+from unicodedata import normalize
 
 # Stores a list of fields, tags and deck
 ######################################################################
@@ -170,7 +171,7 @@ class NoteImporter(Importer):
                 for id in csums[csum]:
                     flds = self.col.db.scalar("select flds from notes where id = ?", id)
                     sflds = splitFields(flds)
-                    if fld0 == sflds[0]:
+                    if normalize("NFC", fld0) == normalize("NFC", sflds[0]):
                         # duplicate
                         found = True
                         if self.importMode == UPDATE_MODE:

dae · March 17, 2021, 12:24pm

You found a bug - I’d recommend using check database to make sure everything is ok, then updating to the latest git and uploading your collection. I’ve also pushed a change which should hopefully address your issue - we shouldn’t need to normalize the local collection as it should already be normalized. Please let me know how you go.

minshall · March 18, 2021, 6:26am

thanks. i can now import and re-import and i don’t seem to see the behavior i saw before. but, since running my mods probably “cleaned” the DB, i can’t be sure. (i’m not good about dealing with backup databases.)

however, i’m still in the situation where “force changes” doesn’t. when i launch anki, i get the “please check”, which i do, but then, on next synch (quitting anki, say), i get the “please check” again. and, when i try tools:preferences:network:force changes, then do a synch, it doesn’t ask for a direction, just does its normal thing (which fails with a “please check…”).

i’m guessing this is either user error, or something random introduced in your current development code. any ideas?

dae · March 18, 2021, 7:28am

I pushed a fix for the ‘force changes’ issue to the main branch yesterday, but didn’t get around to pushing the fix to the ‘refresh’ branch until earlier today - are you sure you’re running the latest git? You should see the sync button turn red after forcing a full sync and refreshing the deck list, and you should be able to choose to upload/download when syncing.

minshall · March 18, 2021, 4:51pm

sorry, i’m an unreliable informant. (i was having failures in the pip install which took me a long time to “fix” – by removing a bunch of dependencies it had added to ~/.local/lib/...; when i fixed it, the output reduced considerably, i was very excited, and i didn’t notice it reminding me to add --force-reinstall.)

so, now i’ve figured that out. now, with the latest git, force changes works, the sync button turns red, etc.

i went back two revs (to where you reverted the rust, i guess, checksum change), and i get the problem i had had before. moved back to the current revision. and, that problem has gone away.

but, i still see an anomaly, where instead of my code trying to delete ten cards, noteimp.py reports, each time, with the same set of notes, that it has “updated” ten cards. i’m guessing the same ten.

i’ll look more at that and report back.

minshall · March 18, 2021, 5:32pm

so, in noteimp.addUpdates(), i replaced rows with

[x for x in rows if x[2].find("moteahel") != -1]

(moteahel being the content of two, i guess, of the notes i was having problems with before.)

and looked at changes2 at the bottom. it was 2 (rather than, 10, or 0, say). which i take to mean that, not surprisingly, the ten are the ten from before.

if i decode the sql correctly

self.col.db.executemany("""update notes set mod = ?, usn = ?, flds = ?, tags = ? where id = ? and (flds != ? or tags != ?)""", rows,)

sql or something is comparing the flds field to what it has in the database and, in these ten cases, “for whatever reason”, finding a mismatch, replacing those records, but in such a way that the same update finds the same mismatches next time. ad infinitum. (the tags field is empty.)
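
One way to picture that loop, as a sketch only. It assumes (a guess, not something confirmed in this thread) that some layer normalizes text to NFC on its way to disk, while the `flds != ?` comparison still sees the un-normalized argument. The table is a two-column toy, not Anki's schema, and `update_note` is a made-up helper.

```python
import sqlite3
from unicodedata import normalize

RAW = "\u0627\u064e\u0654"  # alef + fatha + hamza above: not in NFC form


def update_note(db, note_id, flds, normalize_compare):
    """Mimic the conditional update; return the number of rows changed."""
    # Assumption: whatever writes to disk stores the NFC form of the text.
    stored = normalize("NFC", flds)
    # The comparison value is either the raw argument or its NFC form.
    compared = stored if normalize_compare else flds
    cur = db.execute(
        "update notes set flds = ? where id = ? and flds != ?",
        (stored, note_id, compared),
    )
    return cur.rowcount


db = sqlite3.connect(":memory:")
db.execute("create table notes (id integer primary key, flds text)")
db.execute("insert into notes values (1, ?)", (normalize("NFC", RAW),))

# Comparing against the raw text: the row "changes" on every single run.
looping = [update_note(db, 1, RAW, normalize_compare=False) for _ in range(3)]

# Comparing against the NFC text: nothing to update once the row is stored.
stable = [update_note(db, 1, RAW, normalize_compare=True) for _ in range(3)]

print(looping, stable)  # [1, 1, 1] [0, 0, 0]
```

Under that assumption the update count never drops to zero, which would match seeing the same ten notes reported as “updated” on every import.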

i think i’ve exhausted my abilities. ideas?

for the record:

bash apollo2 (main): {52214} git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
bash apollo2 (main): {52216} git log | head -1
commit a90a6ab3cde92d2f407697307b14cca6f1ff92c3

cheers.

minshall · March 19, 2021, 2:36am

ah, i do have something to add. i realized i can “ask” sql what is actually in the database.

at the beginning of addUpdates(), i do:

(Pdb) self.col.db.execute("select flds from notes where csum = ?", 2856611645)
[['<div align="center">مٌتِأَهِل</div>\x1f<div align="center">moteahel <p></div><p><div><div align="left">evlenmiş</div>']]
(Pdb) y = [x for x in rows if x[2].find("moteahel") != -1]
(Pdb) y[0][2]
'<div align="center">مٌتِأَهِل</div>\x1f<div align="center">moteahel <p></div><p><div><div align="left">evlenmiş</div>'

then, od -x. the first is the value returned from the database; the second is what is in the argument to addUpdates():

bash apollo2 (master): {52122} od -x
مٌتِأَهِل
0000000 85d9 8cd9 aad8 90d9 a3d8 8ed9 87d9 90d9
0000020 84d9 000a
0000023
bash apollo2 (master): {52123} od -x
مٌتِأَهِل
0000000 85d9 8cd9 aad8 90d9 a7d8 8ed9 94d9 87d9
0000020 90d9 84d9 000a
0000025

then, back in addUpdates():

(Pdb) z = self.col.db.execute("select flds from notes where csum = ?", 2856611645)
(Pdb) z
[['<div align="center">مٌتِأَهِل</div>\x1f<div align="center">moteahel <p></div><p><div><div align="left">evlenmiş</div>']]
(Pdb) y = [x for x in rows if x[2].find("moteahel") != -1]
(Pdb) zz = z[0][0].split("\x1f")[0]
(Pdb) zz
'<div align="center">مٌتِأَهِل</div>'
(Pdb) y
[[1616121457, -1, '<div align="center">مٌتِأَهِل</div>\x1f<div align="center">moteahel <p></div><p><div><div align="left">evlenmiş</div>', '', 1616085188287, '<div align="center">مٌتِأَهِل</div>\x1f<div align="center">moteahel <p></div><p><div><div align="left">evlenmiş</div>', ''], [1616121457, -1, '<div align="center">مٌتِأَهِل</div><p><div align="right">anlam</div>\x1f<div align="center">evlenmiş <p></div><p><div><div align="left">moteahel</div>', '', 1616085188288, '<div align="center">مٌتِأَهِل</div><p><div align="right">anlam</div>\x1f<div align="center">evlenmiş <p></div><p><div><div align="left">moteahel</div>', '']]
(Pdb) yy = y[0][2].split("\x1f")[0]
(Pdb) yy
'<div align="center">مٌتِأَهِل</div>'
(Pdb) fieldChecksum(zz)
2856611645
(Pdb) fieldChecksum(yy)
3012355254
(Pdb) fieldChecksum(unicodedata.normalize('NFC', zz))
2856611645
(Pdb) fieldChecksum(unicodedata.normalize('NFC', yy))
2856611645

so, the data we are trying to add isn’t stable under normalize(). and
someone (anki? python sql? sql?) is normalizing before writing to
the disk (or something).
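
For anyone who wants to see the difference at the code point level, here is a small sketch. The two hex strings are transcribed from the od -x dumps above (byte-swapped back into file order, trailing newline dropped); `describe` is just an illustrative helper.

```python
from unicodedata import name, normalize

# Bytes from the two od -x dumps, in file order, without the trailing 0x0a.
stored = bytes.fromhex("d985d98cd8aad990d8a3d98ed987d990d984").decode("utf-8")
incoming = bytes.fromhex("d985d98cd8aad990d8a7d98ed994d987d990d984").decode("utf-8")


def describe(text):
    """List each code point as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {name(ch)}" for ch in text]


for label, text in (("stored", stored), ("incoming", incoming)):
    print(label, "-", len(text), "code points")
    for line in describe(text):
        print("   ", line)

# The incoming text only equals the stored text after NFC normalization.
print(incoming == stored, normalize("NFC", incoming) == stored)
```

The stored text uses the precomposed ALEF WITH HAMZA ABOVE (U+0623), while the incoming text spells it as ALEF (U+0627) followed by FATHA and a combining HAMZA ABOVE (U+0654); NFC folds the second form into the first.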

cheers.

dae · March 19, 2021, 1:51pm

Would you be able to share a 1 card apkg file and/or text file that I can reproduce the problem with?

minshall · March 19, 2021, 3:11pm

ah, it won’t let me upload a .html file. but, here are the two lines. does this work?

<div align="center">مٌتِأَهِل</div> <div align="center">moteahel <p></div><p><div><div align="left">evlenmiş</div>
<div align="center">مٌتِأَهِل</div><p><div align="right">anlam</div>        <div align="center">evlenmiş <p></div><p><div><div align="left">moteahel</div>

dae · March 22, 2021, 12:57am

Please try again with the latest git

minshall · March 23, 2021, 3:15am

./scripts/build failed:

---8<---8<--- Start of log, file at /home/minshall/.cache/bazel/_bazel_minshall/1acd5802e8914f711950b8c4027150fc/bazel-workers/worker-2-TypeScriptCompile.log ---8<---8<---
internal/modules/cjs/loader.js:797
    throw err;
    ^

Error: Cannot find module './perf_trace'

(let me know if you want more of the error output.)

minshall · March 23, 2021, 3:24am

but, then, i went back

git checkout 727399604160187c7edca642927f766ee216b698

and it built and installed just fine, and, thanks!! – seems to work! (i haven’t yet looked at the produced decks, or anything, but i assume that is fine.)

dae · March 23, 2021, 10:51am

Mind trying the latest git again? If you get the same problem, please try ‘bazel shutdown’, and/or removing the ts/node_modules folder. Does that resolve the issue?
