Feasibility of collecting user-submitted datasets with card content included, to further improve FSRS parameters

While I was reading the benchmark page on @Expertium’s blog, the following section caught my attention:

Consider the content of the cards: text, sound, and images. It would require adding another machine learning algorithm (or even several algorithms) just for text/audio/image recognition, and we wouldn’t be able to train it since Dae (the main Anki dev) can’t give us a dataset that has all of the content of cards. That is against Anki’s privacy policy, only scheduling data is available publicly.

It’s clear Damien won’t be able to provide the dataset with card content included, for obvious reasons, but what if users were asked to voluntarily send it themselves? Could that help gather the critical mass of data necessary for training?

@dae @L.M.Sherlock

I don’t see any blocker aside from the use of copyrighted content in the training process, though I’m not entirely sure about that. Making it discoverable is necessary.

Where will the data be sent? Or are you saying AnkiWeb should have an opt-in for sharing card data? Don’t you think it’s a privacy issue if someone unknowingly opts into that?

To any server chosen by Expertium/L.M.Sherlock/dae. As to how, either through a form or through an option inside AnkiWeb/Anki/fsrs4anki helper.

In that case it would be off by default, and a warning would be displayed to ensure the user is fully aware before proceeding. If that’s not enough, users could be required to check consent boxes as well.
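
As a rough sketch of what that gating could look like inside an add-on: the class, field, and function names below are made up for illustration and are not part of Anki, AnkiWeb, or the fsrs4anki helper. The only point is that nothing is uploaded unless every explicit step has been completed.

```python
from dataclasses import dataclass


@dataclass
class ConsentState:
    """Hypothetical per-user settings; every field defaults to 'no consent'."""

    sharing_enabled: bool = False        # the option itself, off by default
    warning_acknowledged: bool = False   # user has read and dismissed the warning
    consent_boxes_checked: bool = False  # user has ticked the consent checkboxes


def may_upload(state: ConsentState) -> bool:
    """Return True only if every explicit consent step has been completed."""
    return (
        state.sharing_enabled
        and state.warning_acknowledged
        and state.consent_boxes_checked
    )


# A fresh install must never upload anything.
default_state = ConsentState()
# A user who enabled the option and completed both confirmation steps.
full_consent = ConsentState(
    sharing_enabled=True,
    warning_acknowledged=True,
    consent_boxes_checked=True,
)
# Enabling the option alone is not sufficient.
partial_consent = ConsentState(sharing_enabled=True)
```

Keeping the check in one small function would make it easy to call right before any upload code runs, wherever the data ends up being sent.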

@dae what do you think? If this is feasible I’m more than happy to offer my help. Take your time if you need to research the matter before responding, just wanted to make sure this thread didn’t get lost in your notifications.

I’m skeptical it will yield useful results, and I don’t have the time to update AnkiWeb to handle this. As users would have to opt in anyway, I suggest you implement it as an add-on that uploads the data somewhere instead.

Thank you for your feedback, dae, it’s appreciated. I’m about as skeptical, but I think it’s worth a try. If @L.M.Sherlock and @Expertium have no provider or plan of their own, I’ll look for appropriate providers to handle the data and then work on writing the add-on.

Yes, GDPR really does require that something like this not be enabled by default. As someone who adds/mines content from private conversations, I would find it quite problematic if those were just exported by default :slight_smile:

It looks like someone has already built a scheduler that takes the texts on flashcards into account and it seems to work better than FSRS.

Paper: https://aclanthology.org/2024.emnlp-main.784.pdf

GitHub: Pinafore/karl-flashcards-web-app (the backend and web frontend for the KAR³L flashcard app)

Do you mean that, through text analysis, it takes into account the interference that the text of some cards could have on the retention of other cards? I suppose this does not work with various languages (like German). @L.M.Sherlock, you might want to see this.

Ah, KAR³L. Yeah, I know about that one. According to this paper, it’s about as accurate as FSRS v4.

I knew about it. I even reposted the tweet.
