FSRS >= 5 is a significant regression for a minority of users who press Again a lot on new and learning cards

I am now sure that the switch from FSRS v4 to v5 or v6, while an improvement on average, has unfortunately been a significant regression for a minority of users.

This regression affects users who use FSRS for the initial learning phase and therefore click “Again” very often on some cards.

As a result, the difficulty of these (not all!) cards increases sharply and cannot be significantly lowered later on by pressing Easy.

And I’m sure that FSRS v6 can’t “get used to” this learning style through optimization, because I’ve simulated it with the Anki FSRS Visualizer.

And this high difficulty is definitely wrong, because both my daughter and I always have a 100% retention rate within these far-too-short intervals, and not the 85% set as the desired retention (DR).

There are workarounds, like either resetting the cards or doing short-term reviews only in custom study sessions, but that’s something we Anki geeks do. I would like to recommend FSRS to many friends, but this is too complicated for average users (e.g., my 11-year-old daughter); they only want to press one of the four grading buttons. I find it extremely important that FSRS can handle this learning style without workarounds.

I’m sure that because of this problem, many average users find Anki better without FSRS. I think that’s a real shame, because I’m sure FSRS could otherwise be very helpful to many average users as well.

@expertium is working on the problem. Thanks! Still, it was important for me to share this, because I’ve unfortunately read many claims on the forum that you just need to give FSRS more data through more reviews and the problem will solve itself. Unfortunately, that is not the case for FSRS-6.

The reset workaround is repeatedly described here on the forum as incorrect usage, with the argument that it deprives FSRS of the data it needs to optimize itself and make the problem disappear. As I said, I’m sure that, for one, this doesn’t work, and, for another, L. M. Sherlock himself recommends the reset workaround in one of the threads listed below.

Further info:

3 Likes

I could try adding extra parameters to difficulty in FSRS-7 so that difficulty is updated a little differently for same-day reviews, but since most changes to D don’t improve the metrics, I doubt that this one will improve the metrics either.
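For illustration, here is a minimal sketch of what such a change could look like, based on the difficulty update as documented for FSRS-5/6 (linear damping toward the ceiling of 10, plus mean reversion toward D0(4)). The `w_same_day` factor is purely hypothetical and not part of any released version:

```python
import math

def d0(w: list[float], grade: int) -> float:
    # Initial difficulty after the first rating (FSRS-5/6 form):
    # D0(G) = w4 - e^(w5 * (G - 1)) + 1
    return w[4] - math.exp(w[5] * (grade - 1)) + 1

def next_d(w: list[float], d: float, grade: int,
           same_day: bool = False, w_same_day: float = 1.0) -> float:
    # Base update: delta_D = -w6 * (G - 3), linearly damped so that
    # difficulty moves less as it approaches the ceiling of 10.
    delta_d = -w[6] * (grade - 3)
    if same_day:
        # HYPOTHETICAL extra parameter: damp (or amplify) the update
        # for same-day reviews. Not part of FSRS-5/6.
        delta_d *= w_same_day
    d_new = d + delta_d * (10 - d) / 9
    # Mean reversion toward the initial difficulty of an "Easy" first rating.
    d_new = w[7] * d0(w, 4) + (1 - w[7]) * d_new
    return min(max(d_new, 1.0), 10.0)
```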

3 Likes

Thank you for your attention and answer! :smiling_face:

In one of the threads mentioned, it has been suggested to once again ignore short-term reviews. In particular, the difficulty would only be calculated once the cards are no longer in new or learn. That would solve this problem.

Somewhere in those threads it says that taking the reviews in new and learn into account improves the predictions a little. But a small average improvement isn’t worth keeping this regression.

the difficulty would only be calculated once the cards are no longer in new or learn

FSRS doesn’t use stuff like “New” or “Learn” internally, only interval lengths and grades. I’ll try using a new parameter, and I’ll also try not updating D at all if the interval is short; though I doubt that the former will result in an improvement, and I really doubt that the latter will.
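The second idea (not updating D at all if the interval is short) would be an even smaller change. Reusing the hypothetical `next_d` sketch from above, with an arbitrary one-day cutoff:

```python
def next_d_gated(w: list[float], d: float, grade: int,
                 elapsed_days: float) -> float:
    # HYPOTHETICAL variant: leave difficulty untouched for short
    # intervals, so rapid "Again" presses can't push D toward 10.
    if elapsed_days < 1.0:
        return d
    return next_d(w, d, grade)
```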

2 Likes

There is a problem with just looking at the average RMSE/logloss. If some change benefits very few people but doesn’t affect most users at all, the effect on the average metrics will be negligible. But the effect on the scheduling of those users would be tremendous. Even if they make up 0.1% of the total users, a very large number of people will benefit, considering the huge userbase of Anki.

A change with similar justification was made in the past too.

The fix makes FSRS-5 slightly worse than before. But it actually improves the metrics on @brishtibheja's affected deck.

~ Fix/consider short-term params when clipping PLS by L-M-Sherlock · Pull Request #150 · open-spaced-repetition/fsrs-optimizer · GitHub

We can’t do that all the time (and I’m not sure if it should’ve been done back then either) because if we keep making the average worse, then we are doing the exact opposite of improving the algorithm.
Also, I am almost certain that that change was statistically significant. Even a change of 0.0002 (in logloss) on 1000 users can be statistically significant; I just ran the Wilcoxon test on two results from two slightly different versions of FSRS-7. So a change of 0.0005 on 10k users is pretty much guaranteed to be statistically significant.
In other words, that was a regression.

Output of my code:

Mean of the baseline=0.3491
Mean of the comparison=0.3493
Percentage of users who would be better off using the baseline: 57.6%

Comparison is worse
Wilcoxon signed-rank test=1.9810494200752556e-07
p<0.001
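For anyone who wants to run the same kind of check on their own per-user results, here is a minimal sketch. It assumes two index-aligned arrays with one logloss value per user, and uses SciPy's `wilcoxon`, which is the test referred to above:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare(baseline: np.ndarray, comparison: np.ndarray) -> None:
    # baseline/comparison: per-user logloss, index-aligned so that
    # entry i refers to the same user in both arrays.
    print(f"Mean of the baseline={baseline.mean():.4f}")
    print(f"Mean of the comparison={comparison.mean():.4f}")
    pct = (comparison > baseline).mean() * 100
    print(f"Percentage of users who would be better off using the baseline: {pct:.1f}%")
    # Paired, non-parametric test on the per-user differences.
    _, p_value = wilcoxon(baseline, comparison)
    print(f"Wilcoxon signed-rank test p-value={p_value}")
```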
2 Likes

Thank you very much!

  1. Do you know if the dataset you are using for testing contains a meaningful number of users who have some cards that are, for them, very difficult to learn at first? E.g., learning vocabulary is quite easy compared to studying long lists.

  2. I would even say that it would be worth it, even if it results in slightly worse predictions on average, provided it improves things a lot for the described minority of cards of a minority of users. The worst possible user experience is also a parameter that must be considered. Just to make it very clear, here is an extreme example: imagine a car that is perfect for most people, but a minority of users with a certain behaviour are instantly killed.

BTW:

I’m concerned that my questions might come across as doubting the developers’ competence, which is absolutely not the case. I have tremendous respect and gratitude for the developers, and I fully recognize that they have vastly more knowledge and expertise about these problems than I do.

My approach of asking skeptical questions is intended to help identify things that might have been overlooked—not because I think the developers haven’t thought things through, but because sometimes an outside perspective can spot something that gets missed. Of course, most of the time I end up asking about things that the team has already thoroughly considered :(.

My hope is that occasionally I might contribute a useful insight or perspective. However, if I’m creating too much noise and not enough signal, please don’t hesitate to let me know. I would much rather scale back my questions than be disruptive. My goal is to be helpful, not to get in the way of the important work you’re doing.

  1. Idk. The data is anonymized (only scheduling info, no text or media), btw. Identifying users who struggle initially could probably be done, but I’m not sure how exactly.
  2. I see your point, but as I said, if we keep doing the whole “let’s allow the average to become worse…and let’s do it again…and let’s do it again” thing, we will be doing the exact opposite of improving FSRS.
2 Likes

That depends on how you define improvement. Imagine an intervention which would on average lengthen the lifespan of humans, but a minority with a certain trait would only live to 30 years old. This could be an improvement by one definition, but not by another.

Every FSRS version has been an improvement for some people (>50%) but not for others. If we only made changes that improve the metrics for 100.0% of users, no new versions of FSRS would be released.

EDIT: out of curiosity, I just compared FSRS-6-recency (the latest version, plus recent reviews are assigned larger weights when calculating logloss, which improves results a little) with FSRS v1, which nobody other than Jarrett himself used. FSRS v1 is better for 0.8% of users, interestingly.
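The exact recency weighting isn't spelled out in this thread, so the linear ramp in the sketch below is only an assumption; it just illustrates the idea of the same binary logloss with later reviews weighted more heavily:

```python
import numpy as np

def recency_weighted_logloss(y: np.ndarray, p: np.ndarray) -> float:
    # y: 1 if the review was recalled, 0 otherwise, in chronological order.
    # p: predicted recall probability for each review.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    # ASSUMPTION: oldest review gets weight 1, the most recent weight 2;
    # the actual FSRS-6-recency scheme may differ.
    weights = np.linspace(1.0, 2.0, len(y))
    return float(np.average(losses, weights=weights))
```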

3 Likes

Wow, thank you, very interesting! There must be a way to merge these formulas with extra parameter(s) in order to let the optimizer find the best parameters for the current deck!?
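One concrete (and purely hypothetical) way to read this suggestion: give the optimizer a single extra mixing weight and let it interpolate between the predictions of two variants per preset. Nothing like this exists in any FSRS version; the sketch below only illustrates the idea:

```python
import math

def blended_retrievability(p_old: float, p_new: float, w_mix: float) -> float:
    # HYPOTHETICAL: interpolate between the recall probabilities predicted
    # by two model variants; the optimizer would fit w_mix per preset.
    gate = 1.0 / (1.0 + math.exp(-w_mix))  # squash to (0, 1)
    return gate * p_old + (1.0 - gate) * p_new
```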

True, 100% cannot be the goal. But not making things significantly worse for even 5% of users in exchange for a small average improvement is important.

I think the test dataset carries too much weight, because it might mainly contain Anki early adopters, who could behave quite differently from the far more numerous ordinary people who would use Anki and FSRS if it were more fool-proof.

But use what instead of the test dataset? I say: also add plausible parameters which do NOT improve the predictions for the test dataset, because we cannot know whether they improve the optimization for the many people for whom the dataset is not representative.

The users were selected mostly randomly, other than “have at least 5000 reviews in total”. The dataset should be representative of the “general population” of Anki users.

3 Likes

Thank you very much for the info. Seeing how complex Anki is, I am sure that the average Anki user today is quite a geek (me included). I think it is very important to make Anki more attractive for non-geeks, aka fool-proof, to achieve network effects. Of course, I cannot know whether the test dataset is perfectly fine for Anki’s future broad userbase, but still, because we can never know for sure, I say let’s allow some plausible extra parameters, even if they don’t improve the predictions for the test data on average; with today’s available computing power, we can afford it. At least for me on AnkiDroid, the optimization is very fast.

If the extra parameters don’t improve predictions, how should we decide when to add them?

2 Likes

Why was 50% chosen, btw, instead of something like 95%?

It wasn’t really “chosen”. Historically, we weren’t looking at the ratio of users for whom FSRS got better/worse; we were looking at average logloss/RMSE (mostly logloss now). Looking at ratios is something we started doing much later.
Actually, now that I think about it, I don’t think we ever explicitly made a decision to release or not release a new version of FSRS based on that ratio, so I guess my wording was misleading.

Looking at ratios now, each new version was typically an improvement for ~80-85% of users. The table below is based on logloss.


(FSRS-6-recency isn’t a separate version, but I still added it as such)
Also, FSRS-7 won’t be an improvement on non-same-day reviews; there it will be roughly 50:50 with FSRS-6. It will be a big improvement on same-day reviews, though.

4 Likes

Why not implement simple solutions until we figure out how to leave everything to FSRS?

e.g., with a toggle option if needed:

  1. Bring back the use of learning steps (no effect from FSRS until the card is in the review state)
  2. Auto-reset after a card changes state from learning to review
  3. Add 0.0250 to the difficulty parameter; this has worked well for me so far, using Anki 25.02

I believe it can be done through custom scheduling to let FSRS only affect review cards, as I said here.

Regarding parameter 8: we could make it so that it can’t go below some non-zero value, if it doesn’t make metrics worse.
Regarding not letting FSRS affect cards in the “learn” stage: no, definitely no. This will make metrics worse, and it will also make FSRS depend on Anki’s silly distinction, which not only makes 0 sense but will also make it harder to implement FSRS in other apps, since they will have to adopt Anki’s silly “learn”/“review” distinction to use FSRS.
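Sketched in code, the parameter-8 idea is just a floor enforced during optimization. The 0.01 below is an arbitrary placeholder, not a proposed value, and the index follows the thread's numbering:

```python
def clamp_w8(w: list[float], floor: float = 0.01) -> list[float]:
    # HYPOTHETICAL: keep parameter 8 from collapsing to zero by
    # enforcing a non-zero lower bound after each optimizer step.
    # Adjust the index if your parameter list is numbered differently.
    w = list(w)
    w[8] = max(w[8], floor)
    return w
```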

3 Likes

Okay, adding plausible parameters that don’t improve things for even a single user in the test data would probably be too extreme. But what if you add plausible parameters that improve the predictions for even just a fairly small group in the test data? Say, a parameter that is neutral for the vast majority of users but improves the predictions for a small group.

2 Likes

Sure, if I can identify a particular change that has a tiny effect on the average logloss, but has a very large positive effect on a small minority of users. If I ever find myself in that oddly specific situation, sure.

4 Likes