Clarify what optimal retention means

Such a function already exists and is called “Maximum answer seconds”.


No, it’s not better. Again, as I already said, it’s better to use real data from the user because it will actually reflect this specific user’s habits. As noisy as this data is, it is still more relevant for the user than some pre-defined values.
A few hours ago, I suggested to LMSherlock that we use the median. It should help mitigate the effect of outliers.

My suggested minimum retention is below 0.75.

Well, it’s called minimum recommended retention for a reason. You can set your desired retention higher than that, just not lower.

You can’t just say “But it’s real data, so it has to be better” - that’s not how data science works… I do hope it’s just a communication issue and that this optimizer is actually built on sound statistics, sigh.

You can set your desired retention higher than that, just not lower.

… If the 0.75 is based on faulty/noisy data and a correct optimizer would suggest 0.8, using 0.75 would be exactly the thing you warn against… That’s the point… (And in some industrial settings, this difference is a matter of safety, so it won’t hurt to understand it)


You can’t just say “But it’s real data, so it has to be better” - that’s not how data science works

It’s exactly how it works with the optimizer. We found that even with as few as 8 reviews, it’s possible to optimize parameters to provide a better fit than the default parameters. In Anki 24.04.2 beta, there is no “hard” limit anymore, and the limit is now determined based on a complex rule. There is also a new, third type of action that is neither “full optimization” nor “keep the defaults”: a partial optimization where only the first 4 parameters change and the rest are set to default.
I’m saying all of this to make a point: yes, you can do better than a “one size fits all” approach even with limited data.
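To make that three-way decision concrete, here is a rough Python sketch. The cutoffs below are invented purely for illustration; the actual rule in Anki 24.04.2 beta is more involved and is not reproduced here.

```python
def choose_parameter_action(n_reviews: int) -> str:
    """Hypothetical illustration of the three possible actions.

    The thresholds are made up for the example; Anki's real rule is more complex.
    """
    if n_reviews < 8:
        # Too little data: even a partial fit is unlikely to beat the defaults.
        return "keep default parameters"
    elif n_reviews < 1000:
        # Partial optimization: only the first 4 parameters are fitted,
        # the rest stay at their default values.
        return "partial optimization"
    else:
        # Enough data to fit all parameters.
        return "full optimization"
```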
That being said, with the optimizer, things are much, much easier because we can evaluate how well the predicted probability of recall matches reality. We can compare the output of the algorithm to the ground truth. That is impossible to do when calculating optimal retention, there is no ground truth; so we have to rely purely on common sense when designing this feature.
If you want to convince me or LMSherlock that using the same values of review times for every user is better than relying on each user’s data, you’ll have to come up with a really good argument, something better than “that data is noisy and using the median instead of the mean is not good enough”. Maybe we could make it so that initially some default values for time are used, but then, as the number of reviews increases, the default values smoothly transition to user-specific values; that is feasible, albeit complicated.
If you want to convince me or LMSherlock that time should not be factored in at all when running the simulations for calculating minimum retention, and that only the number of reviews should be used, I don’t think that will ever happen. That’s like “But air friction data is noisy, so let’s just discard it and assume that air friction is negligible when building our spaceship” level.
I’m pretty sure that both of us will end up thinking “That guy is an idiot” by the end of this discussion, so I’d rather not continue it.


Just one more thing: Deck Options - Anki Manual.

I forgot about this yesterday. By default, if the answer time exceeds 60 seconds, it will be stored as, well, 60 seconds. So unless the user sets the max. answer time to 9999 or something, they won’t have too crazy outliers. In your case, even if you walked away without closing the review screen, you wouldn’t have review times measured in hours, unless you set your max. answer time limit extremely high.
Still, using the median should be better than the mean.
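In code terms, the recorded time is simply clamped to the deck’s limit. A minimal sketch, assuming the default 60-second cap (`max_answer_secs` is my own name for the setting, not Anki’s):

```python
def recorded_answer_time(elapsed_secs: float, max_answer_secs: float = 60.0) -> float:
    # Whatever actually happened at the review screen, the stored time
    # can never exceed the deck's "Maximum answer seconds" setting.
    return min(elapsed_secs, max_answer_secs)
```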


I’m pretty sure that both of us will end up thinking “That guy is an idiot” by the end of this discussion, so I’d rather not continue it.

Absolutely agree.

But you should still make sure in your analysis that you actually check whether “air friction” is negligible and don’t just assume it.
I.e. prove that there is no sensitivity issue when review times vary wildly between 4 and 60 seconds.
If you need help from an expert in control theory and are willing to step down from your high horse, I am willing to help with your “complex” little rules, if necessary.

Just came back to this thread after doing some field analysis, by the way.
I have 5 different “modi operandi” with different review times for the same cards.
Basically, from “trying not to die as a pedestrian while still getting some reviews in” to “full focus, Anki only”.

I am willing to help with your “complex” little rules, if necessary.

Maybe you misunderstood me. I was talking about the optimizer and the minimum number of reviews. That stuff is already figured out, and implemented in Anki 24.04.2 beta.

The median has been implemented: Feat/use median in calculating recall cost, forget cost and learn cost by L-M-Sherlock · Pull Request #109 · open-spaced-repetition/fsrs-optimizer · GitHub
All reviews where time=0 ms are filtered out, all reviews where time>20 minutes are filtered out, and then the median time for Again/Hard/Good/Easy is calculated. I believe this is sufficiently robust.
However, this will only make it into Anki in the next release; I don’t know when that will happen.
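For anyone curious what that amounts to, here is a minimal Python sketch of the same idea; the data layout and names are assumptions of mine, not the actual fsrs-optimizer code:

```python
from statistics import median

def per_button_median_times(reviews):
    """reviews: iterable of (button, time_ms) pairs, button being one of
    "Again", "Hard", "Good", "Easy".  Returns the median answer time per
    button after dropping 0 ms reviews and reviews longer than 20 minutes."""
    kept = [(b, t) for b, t in reviews if 0 < t <= 20 * 60 * 1000]
    return {
        button: median(t for b, t in kept if b == button)
        for button in ("Again", "Hard", "Good", "Easy")
        if any(b == button for b, _ in kept)
    }
```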

That’s part of the reason I almost always set my Desired Retention a few percentage points higher than what that feature tells me.

You can now try the new version of “Compute minimum recommended retention (experimental)”, with the median instead of the mean, if you install Anki 24.06 beta.
EDIT: here’s the Github issue where I discuss implementing another outlier filter, feel free to participate: A better outlier filter for "Compute minimum recommended retention" · Issue #112 · open-spaced-repetition/fsrs-optimizer · GitHub

Hey, I know you want to reduce complexity, but an option for using the mean would be nice. I set my maximum time to a reasonable number and don’t believe I’d have outliers. The only problem is, as I told you before, that the average card may take around 10 secs to answer (calculated from the median), but there can still be many cards that take way more time than that.

@L.M.Sherlock sorry about the ping, do you have thoughts on this? For example, think of Kanji/Hanzi writing decks. 雨 and 安 won’t take even 10 secs for me to write down and answer, but 茨城 (Ibaraki, a place) would take a lot of time. I believe in these cases the times are not symmetrically distributed; they’d form a curve around two different points. My concern is: what if 44 per cent of my cards have times like 20-25 secs but the rest have 8-10 secs (on pressing Good, of course, because with Again I’m just instantly failing it)? The median won’t truly reflect the times I take, IMO. Would love to hear what you have to say. Are my concerns misplaced?

Allowing the user to choose between the mean and median is definitely introducing unnecessary complexity.
Your concern about a distribution with two modes could be valid, but in that case I don’t know what we can do.

If providing users with more options is too much, then wouldn’t using just the mean suffice? It’s an advanced option. Users can always use maximum answer seconds so as to not have outrageous outliers.

The median is more robust to outliers. The mean would require extra outlier filters.
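A quick numeric illustration of that point (plain Python with made-up times):

```python
from statistics import mean, median

times = [5.0, 6.0, 7.0, 8.0, 9.0]   # answer times in seconds
with_outlier = times + [300.0]      # one review left open for 5 minutes

print(mean(times), median(times))                # 7.0 7.0
print(mean(with_outlier), median(with_outlier))  # ~55.8 7.5 - the mean jumps, the median barely moves
```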

I don’t know if you still care, but recently LMSherlock and I made another change. Previously, we switched from using the mean review time to using the median review time, for the sake of robustness. Now we also use this:

t_smooth = a * t_user + (1 − a) * t_default

Here t_smooth is a smoothed review time per button (Again/Hard/Good/Easy), t_user is the median review time per button of this particular user, and t_default is the median review time per button across all 20k users in our dataset. a is a weighting coefficient that depends on the number of reviews. The value of a is different for each button, since the number of times the user pressed “Good” is going to be different from the number of times the user pressed “Again”. As the number of reviews increases, a approaches 1, and t_smooth approaches t_user.

It’s probably not immediately obvious what this does, so let me explain. When the number of reviews is low, the median review time is too uncertain (in the “large error bars” sense), so it would be wiser to rely on the default value instead. When the number of reviews is large, it’s better to rely on user-specific data than default values. And instead of using a hard cutoff (for example, 400 reviews or 1000 reviews), it’s possible to use this approach for a smooth and gradual transition from the defaults to user-specific values as the number of reviews increases.
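As a sketch of the idea in Python: the exact form of the weighting coefficient isn’t quoted above, so the `n / (n + k)` below is only an assumption with the described qualitative behaviour, and the name `k` and its value are mine.

```python
def smoothed_time(t_user: float, t_default: float, n_reviews: int, k: float = 100.0) -> float:
    """Blend the user's median time with the default median time.

    a -> 0 when there are few reviews (lean on the default),
    a -> 1 as reviews accumulate (lean on the user's own data).
    """
    a = n_reviews / (n_reviews + k)   # assumed form, not the actual one
    return a * t_user + (1.0 - a) * t_default
```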

This way, anyone can use “Compute minimum recommended retention”, regardless of how many reviews they have. This change will most likely make it into the next major release of Anki.


This topic is basically just “Me giving updates on Compute minimum recommended retention (CMRR)” at this point.

Anyway, Anki 24.11 is out: Release 24.11 · ankitects/anki · GitHub
It has FSRS-5, which uses data from same-day reviews to refine its predictions, and CMRR now takes into account the time spent on same-day reviews, which was previously unused. The number of simulations used to calculate the final value of desired retention has also been increased to further improve accuracy. Last but not least, the range of output values has been extended from 0.75-0.95 to 0.70-0.95.

The “experimental” part of the name has been removed.

I was also wondering: the feature advises setting a “Minimum Recommended Retention”, which is compared to the Desired Retention. But there are a lot of different factors that can still make the True Retention lower than the Desired Retention (e.g. a lot of cards with low stability, so their “Target R” is way lower than the actual Desired Retention when they’re prompted, and, globally speaking, more cards are failed than expected).

To take my example: my previous Desired Retention was 80%, and my true retention was around 75-78%, with RMSE ~4.50% on FSRS-4.

Now, I switched to FSRS-5, and after optimization my RMSE is 3.74%. My Desired Retention is set to 0.75 since that is what “Compute Minimum Recommended Retention” advised, but my True Retention is now somewhere around 71% and dropping (65-70% these past days). One of the biggest reasons is that I increased my new cards per day to 40. Which means, even though the FSRS model matches my history very well, I do indeed fail way more than the Desired Retention (but that’s perfectly normal since a huge proportion of my cards have a Target R of 30-50% when I get them).

So, better RMSE, lower Desired Retention, but True Retention way lower than the “sweet spot”.

So I was considering setting my Desired Retention “high enough” so that my True Retention can be on par with the “optimal” one, since I would expect the simulator to assume that the Desired Retention equals the average retention.

But I might be wrong, since my True Retention is lower for a different reason than the model being wrong. Also, increasing Desired Retention from .75 to .8 would not change that much in terms of retention for such low-stability cards.