Hey everyone,
We want to build an open dataset of review histories that includes the flashcard content, collected on a voluntary, opt-in basis.
This would greatly help research on spaced repetition schedulers, especially content-aware ones. Modern general-purpose review schedulers like FSRS treat cards as independent of each other. Content-aware schedulers could not only predict memory strength more accurately but also enable new user experiences: for example, they could let users add multiple restatements of the same question without explicitly declaring their relatedness, or prevent questions that reveal each other’s answers from appearing in the same review session.
Some possible approaches:
- Adding an opt-in feature directly to Anki
- Developing a plugin that lets users contribute their data
- Recruiting medical students to export and share their decks (shared decks have fewer privacy concerns)
AnkiWeb’s privacy policy allows the use of review histories for research (which led to the anki-revlogs-10k dataset), but not card content. Content-aware schedulers need both, so we want data collection to happen voluntarily, with users explicitly opting in.
Our goal is a public dataset that anyone can use for research and for building better schedulers. We’re gauging interest and exploring whether an opt-in feature in Anki could work.
This is currently a side project by:
- Giacomo Randazzo (me): built a spaced repetition system called Rember (recently stopped working on it full-time) and wrote my master’s thesis on memory models for SRS.
- Aidan Campbell: recently graduated from Dartmouth College with a bachelor’s in Mathematics; currently working as a quantitative research assistant at the Dartmouth Institute for Health Policy and Clinical Practice while applying to medical school this cycle.
We’re both technical and would handle the implementation if there’s interest.
Here’s a similar dataset, but with a simpler scenario—memorizing English vocabulary. Feel free to check it out if you’re interested.
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VAGUL0
MaiMemo Open-Source Spaced Repetition Memory Behavior Dataset
Introduction
To advance research in the field of memory, MaiMemo (Momemo) Vocabulary App released an open-source dataset of 220 million memory behavior records in early 2022: Replication Data for: A Stochastic Shortest Path Algorithm for Optimizing Spaced Repetition Scheduling.
Data Source
Online learning users of the MaiMemo Vocabulary App between December 1, 2021, and December 31, 2021.
Collection Method
Each review of a word by a learner generates one memory behavior record. Key fields include word ID, learner ID, timestamp, and feedback.
After completing daily learning tasks, all memory behavior data for the day is uploaded to the server in log form.
On the server, a log synchronization system structures the learners’ logs and writes them into a database.
Records are grouped by learner and word, and the intervals between reviews are calculated. Feedback and intervals are then concatenated in chronological order to form feedback sequences and interval sequences.
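The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not MaiMemo’s actual pipeline; the record fields and feedback labels are assumptions.

```python
from collections import defaultdict

# Assumed record shape: (learner_id, word, timestamp_in_days, feedback).
records = [
    ("u1", "apple", 0, "forget"),
    ("u1", "apple", 1, "remember"),
    ("u1", "apple", 4, "remember"),
    ("u2", "berry", 0, "forget"),
    ("u2", "berry", 2, "forget"),
]

def build_sequences(records):
    """Group records by (learner, word), sort each group chronologically,
    and emit a feedback sequence plus an interval sequence, where each
    interval is the gap (in days) since the previous review."""
    groups = defaultdict(list)
    for learner, word, ts, feedback in records:
        groups[(learner, word)].append((ts, feedback))
    sequences = {}
    for key, reviews in groups.items():
        reviews.sort()  # chronological order by timestamp
        feedbacks = [fb for _, fb in reviews]
        intervals = [0] + [b[0] - a[0] for a, b in zip(reviews, reviews[1:])]
        sequences[key] = (feedbacks, intervals)
    return sequences

seqs = build_sequences(records)
print(seqs[("u1", "apple")])  # (['forget', 'remember', 'remember'], [0, 1, 3])
```

Dropping absolute timestamps and keeping only intervals, as the preprocessing notes below describe, is what makes the released sequences anonymization-friendly while preserving the spacing information a scheduler needs.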
Data Preprocessing
- Anonymized learner IDs
- Replaced word IDs with word spellings
- Removed timestamps, retaining only review intervals
- Excluded data where the first feedback was “Remember” or “Uncertain”
- Excluded sequences containing feedback marked as “Familiar” or “Vague”
- Excluded data where reviews did not follow the algorithm’s scheduled time
- Excluded words containing special characters in their spellings
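Filters like the ones listed above could be applied per sequence along these lines. This is a hedged sketch: the feedback labels and the exact filtering rules are assumptions based on the list, not MaiMemo’s published code.

```python
import re

def keep_sequence(word, feedbacks):
    """Return True if a (word, feedback-sequence) pair passes filters
    modeled on the preprocessing list above. Labels are assumed."""
    # Exclude data where the first feedback was "remember" or "uncertain".
    if feedbacks and feedbacks[0] in {"remember", "uncertain"}:
        return False
    # Exclude sequences containing "familiar" or "vague" feedback.
    if any(fb in {"familiar", "vague"} for fb in feedbacks):
        return False
    # Exclude words whose spelling contains special characters.
    if not re.fullmatch(r"[A-Za-z]+", word):
        return False
    return True

print(keep_sequence("apple", ["forget", "remember"]))    # True
print(keep_sequence("apple", ["remember", "remember"]))  # False: first feedback
print(keep_sequence("e-mail", ["forget"]))               # False: special character
```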
That sounds like an interesting project.
Since some decks are protected by copyright, wouldn’t users be unable to share the content? For example, if a user creates a deck based on a textbook, the copyright belongs to the textbook’s authors, and the user doesn’t have the right to share it. Most users aren’t familiar with copyright or fair use, so they’re likely to share content they shouldn’t.
Also, won’t collecting card content alongside review histories make the files too large? A typical Anki user’s data probably ranges from tens of MB to about 100 MB, excluding media files; e.g. 80 MB × 10K users = 800 GB. And some users use the first field as an identifier rather than for learning, or store long texts such as cloze notes.
To avoid these issues, it seems to me that restricting collection to specific decks is needed, e.g. collecting only data matching the AnKing deck or popular language-learning decks, or only notes whose first field is exactly a word.
@j.huang Thanks for the pointer! I’d like to collect data beyond language learning. There are already content-aware schedulers for narrower domains, like MathAcademy or several language learning apps (see comments in the HN thread for my blog post). A general-purpose dataset would enable broader research and cross-domain models.
@Shigeyuki Good points. Copyright is a concern. We’d need basic filters to detect possible violations, and we’ll have to clarify the legal aspects before collecting any data. Data size seems manageable technically. Starting with large public decks like AnKing might be the cleanest path, avoiding most privacy and copyright issues.