Feasibility of collecting user-submitted datasets with card content included, to further improve FSRS parameters

nestorski · November 14, 2024, 10:38am

While reading the benchmark page on @Expertium’s blog, the following section caught my attention:

Consider the content of the cards: text, sound, and images. It would require adding another machine learning algorithm (or even several algorithms) just for text/audio/image recognition, and we wouldn’t be able to train it since Dae (the main Anki dev) can’t give us a dataset that has all of the content of cards. That is against Anki’s privacy policy, only scheduling data is available publicly.

It’s clear Damien won’t be able to provide a dataset with card content included, for obvious reasons, but what if users were asked to voluntarily send it themselves? Could that help gather the critical mass of data necessary for training?

Expertium · November 15, 2024, 8:20am

@dae @L.M.Sherlock

hzhgifk · November 15, 2024, 10:45am

I don’t see any blocker aside from the use of copyrighted content in the training process, though I’m not entirely sure. The option would also need to be easy to discover.

sorata · November 15, 2024, 4:45pm

Where will the data be sent? Or are you saying AnkiWeb should have an opt-in for sharing card data? Don’t you think it’s a privacy issue if someone unknowingly opts into that?

nestorski · November 15, 2024, 5:00pm

To any server chosen by Expertium/L.M.Sherlock/dae. As to how, either through a form or through an option inside AnkiWeb/Anki/FSRS4Anki Helper.

In that case it would be off by default, and a warning would be displayed to ensure the user is fully aware before proceeding. If that’s not enough, users could be required to check consent boxes as well.

nestorski · November 17, 2024, 1:42pm

@dae, what do you think? If this is feasible, I’m more than happy to offer my help. Take your time if you need to research the matter before responding; I just wanted to make sure this thread didn’t get lost in your notifications.

dae · November 17, 2024, 2:11pm

I’m skeptical it will yield useful results, and I don’t have the time to update AnkiWeb to handle this. As users would have to opt in anyway, I suggest you implement it as an add-on that uploads the data somewhere instead.

nestorski · November 17, 2024, 2:51pm

Thank you for your feedback, dae, it’s appreciated. I’m about as skeptical, but I think it’s worth a try. If @L.M.Sherlock and @Expertium have no provider or plan of their own, I’ll look for appropriate providers to handle the data and then work on writing the add-on.

sound · November 17, 2024, 4:24pm

Yes, GDPR really does require that something like this not be enabled by default. As someone who adds/mines content from private conversations, I would find it quite problematic if that content were exported by default.

vaibhav · November 18, 2024, 5:57pm

It looks like someone has already built a scheduler that takes the texts on flashcards into account and it seems to work better than FSRS.

Paper: https://aclanthology.org/2024.emnlp-main.784.pdf

GitHub: Pinafore/karl-flashcards-web-app (the backend and web frontend for the KAR³L flashcard app)

DerIshmaelite · November 18, 2024, 6:08pm

Do you mean that, through text analysis, it takes into account the interference that the text of one card could have on the retention of other cards? I suppose this does not work with various languages (like German). @L.M.Sherlock, you might want to see this.

Expertium · November 18, 2024, 11:19pm

Ah, KAR3L. Yeah, I know about that one. According to this paper, it’s about as accurate as FSRS v4.

L.M.Sherlock · November 19, 2024, 12:23am

I knew about it. I even reposted the tweet.

system · December 19, 2024, 12:24am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
