Consider the whole review record in a single day when assigning difficulty

Currently, FSRS's difficulty assignment only depends on the first response of the day for a card. For example, let's say I got through the first day of three cards in the three different ways below (assume I have 2 learning steps; 1=Again, 2=Hard, 3=Good):
1->3->3
1->1->1->3->3
1->1->1->3->1->1->1->1->3->2->3
Obviously the patterns indicate very different difficulties for the three cards (the 1st card is likely not too hard, while the 3rd is insane), but currently FSRS assigns the same difficulty to all of them.

The solution is to consider the whole review record of a day when assigning difficulty. As a simple example, the initial difficulty after the first day could be determined by:

D0(G, n) = w4 - (G - 3) * w5 + n * w_new

where n is the number of times "Again" was pressed on the first day, and w_new is a new parameter to be fitted. Similar considerations could be applied to later reviews too.
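A minimal sketch of what this could look like (the [1, 10] clamping range and the parameter values, including the name `w_new`, are my own illustrative choices, not fitted or part of FSRS):

```python
# Sketch of the proposed D0(G, n); parameter values are illustrative only.

def initial_difficulty(grade, n_again, w4=5.0, w5=1.0, w_new=0.5):
    """D0(G, n) = w4 - (G - 3) * w5 + n * w_new

    grade:   the first rating of the day (1=Again, 2=Hard, 3=Good, 4=Easy)
    n_again: how many times "Again" was pressed during the first day
    """
    d0 = w4 - (grade - 3) * w5 + n_again * w_new
    return min(max(d0, 1.0), 10.0)  # clamp to the usual internal difficulty range

# The three example cards above all share the first rating 1, but now get
# different difficulties because they differ in the number of "Again"s:
for seq in ([1, 3, 3], [1, 1, 1, 3, 3], [1, 1, 1, 3, 1, 1, 1, 1, 3, 2, 3]):
    print(seq, "->", initial_difficulty(seq[0], seq.count(1)))
```

With these toy parameters the three cards get difficulties of 7.5, 8.5, and 10 (clamped), instead of one shared value.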

In my opinion, the biggest model-wise issue with FSRS is that the difficulty estimate is delayed. E.g. a card may deserve D=100% from the first day, but with the current FSRS model it is more likely to start at 70% and only reach 100% after several reviews. A similar issue exists for abrupt difficulty changes during review, e.g. due to interference. This would not only hurt accuracy, but (from my ungrounded guess) could also lead to a suboptimal schedule: to compensate for the underestimated difficulty at the beginning, the model would penalize Hard heavily and barely reward Good, leading to the "ease hell" known from SM2 (indeed, I once had w7=0.0000 on a deck, which means pressing Good would not change difficulty at all…). The change proposed here would likely make difficulty estimates less delayed and thus mitigate the problem.
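For concreteness, here is a minimal sketch of the per-review difficulty update that produces this behavior; I'm assuming the FSRS-4.5-style formula and parameter indexing (exact clamping and numbering vary across versions):

```python
# Sketch of the current per-review difficulty update (FSRS-4.5-style),
# showing the degenerate w7 = 0 case described above.

def next_difficulty(d, grade, w6=0.8, w7=0.0, d0_easy=3.0):
    d_new = d - w6 * (grade - 3)             # Hard raises D, Easy lowers it
    d_new = w7 * d0_easy + (1 - w7) * d_new  # mean reversion toward D0(Easy)
    return min(max(d_new, 1.0), 10.0)

# With w7 = 0, pressing Good (grade 3) never moves difficulty,
# no matter how many reviews happen:
d = 7.0
for _ in range(5):
    d = next_difficulty(d, grade=3)
    print(d)  # 7.0 every time
```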


We tried it; it barely affects RMSE/log loss. @L.M.Sherlock you could benchmark this once you're done re-running the benchmark.

Thanks for the info. One quick question: did you take the total number of reviews / total "Again" responses into account when doing the RMSE binning? Otherwise I think it's cheating in the same way you illustrated here: The Metric · open-spaced-repetition/fsrs4anki Wiki · GitHub

e.g. I learn 100 new cards today, 50 being 1->3->3 and 50 being 1->1->1->3->1->1->1->1->3->2->3. In reality, maybe the first 50 cards have a 90% chance of recall tomorrow, while the second 50 have a 20% chance. The current model, when perfectly calibrated, would predict 55% for all 100 cards, which is technically correct (RMSE = 0) but not what we want.
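To make that concrete, a toy calculation (illustrative numbers only, not the benchmark's actual binning code):

```python
# How a single calibration bin can report zero error despite large
# per-card miscalibration.

true_p = [0.9] * 50 + [0.2] * 50  # true recall probabilities of the 100 cards
pred = 0.55                       # one shared prediction, day-one "Again"s ignored

# A bin compares the average prediction with the average outcome:
avg_true = sum(true_p) / len(true_p)  # (0.9 * 50 + 0.2 * 50) / 100 = 0.55
bin_error = abs(pred - avg_true)      # 0.0 -> the bin looks perfectly calibrated

# ...while the per-card error is large:
per_card_rmse = (sum((pred - p) ** 2 for p in true_p) / len(true_p)) ** 0.5
print(bin_error, round(per_card_rmse, 2))  # 0.0 vs 0.35
```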

Only one review per day is taken into account, and the same is true for binning.

OK, I guess that answers why no RMSE change was observed (as I explained above). It would be great to use metrics that reflect multiple reviews in a single day (e.g. the total number of "Again"s so far / on the first day / in the last review) in the RMSE binning and see if the result changes. If the result still doesn't change, then maybe I'm just overthinking it.