Replace CMRR with workload-vs-DR graph (+more)

Ok, so what would the graph look like? If DR is not on the x-axis, I’m afraid it won’t be intuitive

  • Workload (time or reviews) on the y-axis and number of cards remembered on the x-axis.
  • The user should choose the point they find to be the best compromise.
  • Hovering over a point shows the value of DR in a tooltip, along with the exact values of workload and knowledge at that point.
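As a rough illustration, here is a toy Python sketch of the data such a plot would be built from: DR is the parameter, knowledge goes on the x-axis, workload on the y-axis, and DR itself would only surface in the tooltip. `toy_simulate` and all of its numbers are invented placeholders, not FSRS output.

```python
# Toy data generator for the proposed workload-vs-knowledge plot.
# Each point corresponds to one simulated DR; DR is not on either axis.

def toy_simulate(dr, deck_size=10000):
    # Hypothetical stand-in for the real simulator: workload rises steeply
    # as DR approaches 1, knowledge scales roughly with deck_size * DR.
    workload = 50.0 / (1.0 - dr)   # reviews per day (toy model)
    knowledge = deck_size * dr     # expected cards remembered (toy model)
    return workload, knowledge

points = []
for dr in (0.70, 0.75, 0.80, 0.85, 0.90, 0.95):
    workload, knowledge = toy_simulate(dr)
    points.append({"dr": dr, "x": knowledge, "y": workload})

for p in points:
    print(f"DR={p['dr']:.2f}  knowledge={p['x']:.0f} cards  workload={p['y']:.1f} reviews/day")
```

A real implementation would hand each point to the plotting layer and attach DR plus the exact (workload, knowledge) pair to the tooltip.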

That means that the x-axis shows something that is not an Anki setting. That’s not intuitive. If the user has to click/hover to find out the DR value, it will create confusion.

EDIT: remember, this is a graph specifically to replace Compute Minimum Recommended Retention. It should be helpful for figuring out a good value of desired retention.

Well, I realised that the problem of workload not increasing once the review limit is reached (which I mentioned previously, too) makes it impossible to plot an interpretable graph with a low review limit. And since my argument about new cards/day holds only in the setting of a low review limit, it is not practical anyway.

So, let’s do a workload-vs-DR graph. But allow the user to plot multiple graphs on top of each other and display the total knowledge in a tooltip, so that they can compare different values of new cards/day.

I think “memorised” is a term we can reuse here. It’s used in the sim graphs.

Here you go.


I spoke with A_Blokee about this briefly and he pointed me to this thread. I’ve spent a lot of time analyzing this specific problem of how to calculate the optimal DR for students, and I think I have a lot of solutions for this specific problem.

I’d also like to apologize that I’m really unfamiliar with both Rust and TypeScript, otherwise I would have just fixed the code myself. I have looked a little into the code and found some places to make corrections. My apologies for asking your team to do more work instead of doing it myself.

I also apologize that I’m going to make a very long post that’s basically hijacking this thread, but I think this is probably the place to post all of this information and the conclusions of my analysis over the past few weeks.

First, as of right now, the “Help Me Decide (Experimental)” feature in the latest beta (25.08b4 (d13c117e)) is using an incorrect formula to display the “Time / Memorized Ratio”. It is clear that each data point on that graph is displaying a value equivalent to (t_end - t_start) / (sum(R)_end). However, this is not the correct formula to use. The correct formula is (sum(R)_end - sum(R)_start) / (t_end - t_start). The reasons are long and complicated and involve multivariate calculus, but that is the correct formula. An alternate formula that should give a similar curve is (sum(RS)_end - sum(RS)_start) / (t_end - t_start). The reasoning behind this second equation is much longer, and it’s not so much a “correct” formula as “probably good enough” and “probably should match” (probably). If anybody would like an explanation of why the first formula is correct, and why the second is probably-close-to-correct and a sanity check against improper simulations, I can explain it, but this post is already long enough.

If both the ΔR/Δt and Δ(RS)/Δt graphs give the same (or similar) optimal DR, and it’s in the part of the calibration curve for which FSRS-6 is highly accurate for that user (i.e. >50%), and if the simulation is even remotely accurate in terms of predicting how much time each type of review takes, then that is the optimal DR for that user and that preset which maximizes knowledge per unit of time in the app, period. Any a priori assumption that it needs to be above 70% is not necessarily founded. In actuality, I strongly suspect that a number somewhere between 60% and 75% is optimal for most users.

Because the optimal DR is calculable, and the limits of the optimal DR calculation are known, there’s no reason to even show this graph to the user in the first place. (Although I personally do enjoy it, the average user does not enjoy looking at simulation results as much as I do.) It can just be calculated for the user, and they can just click “Optimize DR”, possibly with a very big scary warning if the value is somehow lower than what would be reasonable for a typical user. Perhaps allow the user to view the graph if they really want to. (I would personally like to!) However, if it is still to exist in the UI, then it would be nice if its domain were expanded from 70%-99% to 1%-99%, so that the user can actually see the optimal value.

Regarding (sum(R)_end - sum(R)_start) / (t_end - t_start), I agree that this is the correct way to calculate how much has been memorized. @A_Blokee I suggest you implement this. Sum(R) at the start is not 0 if the user already has existing cards.
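A minimal sketch of this ratio, assuming the simulator exposes cumulative sum(R) and elapsed days at two checkpoints (the function and variable names are illustrative, not Anki’s actual API):

```python
def knowledge_gain_rate(sum_r_start, sum_r_end, t_start, t_end):
    """(sum(R)_end - sum(R)_start) / (t_end - t_start), in cards of knowledge per day.

    Note: the numerator can be negative (e.g. when simulating a DR lower
    than the user's current one), which is one reason the time delta
    belongs in the denominator.
    """
    dt = t_end - t_start
    if dt <= 0:
        raise ValueError("simulation interval must be positive")
    return (sum_r_end - sum_r_start) / dt

# Existing collection: sum(R) at the start is 1200, not 0.
rate = knowledge_gain_rate(sum_r_start=1200.0, sum_r_end=1500.0,
                           t_start=0.0, t_end=365.0)
```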

Regarding DR<70%: it doesn’t matter whether it is optimal if nobody wants to use it because of overly long intervals.
Also, maximizing total knowledge leads to worse time efficiency (knowledge/studying time) and vice versa; there is a trade-off*. I believe it’s better to let users decide on their own what kind of balance they want.

Yeah, but I think we should put effort into making the UI more user-friendly rather than just giving the user one number. We’ve tried this approach before, and, well, here we are.
We could display optimal (however you define it) DR on the graph as a suggestion, though.

Also, in the future there will be a new mode (or scheduling policy, whatever you wanna call it): adaptive DR for every card.

*given a fixed deck size and a fixed number of new cards/day

Here it is, where the ratio is (sum(R)_end - sum(R)_start) / (t_end - t_start)

Flip it: (t_end - t_start) / (sum(R)_end - sum(R)_start). Then the Y axis will be readable, and seconds as units will make sense

That has issues with negative numbers / near 0 values

Wait what
How is (sum(R)_end - sum(R)_start) negative?

because the DR is lower than my current one

Oh, ok. That makes things more complicated :sweat_smile:


What’s wrong with long intervals? Isn’t that the point? Less studying for more knowledge.

maximizing total knowledge leads to worse time efficiency (knowledge/studying time)

While this is… a reasonable rule of thumb, it is… not 100% accurate 100% of the time. Maximizing ΔR/Δt will necessarily always give the best time efficiency to increase your knowledge over the period of time of the simulation.

The problem is that with rather short simulations, it is possible to use strategies that maximize time efficiency in the short-term at the expense of long-term sustainability. e.g. in one given day, the #1 way to maximize ΔR/Δt is to just turn DR down to 1% and throw in a huge number of new cards. However, this is not sustainable because you are not renewing the cards when they drop down to low DR.

In general, the best long-term strategy should be the sum of all of the best short-term strategies, but that principle breaks down here.

This brings up the second equation I gave above:
(sum(RS)_end - sum(RS)_start) / (t_end - t_start).

This is just a very crude approximation of calculating the integral of the forgetting curve for a given card from a given point in time up through infinity. While very crude, it probably is good enough and saves us from having to actually compute the integral of the forgetting curve of a given card.

If you plot two curves, one for Δ(Σ(R)) and one for Δ(Σ(RS)), and the simulation length is sufficiently long, they should give the same optimal DR, or something very close to it. (I don’t really want to actually calculate the integral of the forgetting curve for a given card, but that is one possible change.) This will prevent the user from ever engaging in methods which sacrifice long-term sustainability for short-term increases in R. It also functions as a sanity check to make sure the parameters of the simulation have both short-term and long-term goals aligned.

You’ve already seen the issues with ΔR being 0 and/or negative and hence the need to invert the graph. Unfortunately, this is just a limitation of the mathematics, but that is by design and correct. We are saved by the fact that time progresses linearly in a single direction, so Δt is always safe in the denominator. It does have the added benefit that “higher efficiency is higher”. (Also, units are now memorization/unit time.)

saves us from having to actually compute the integral of the forgetting curve of a given card.

We can do that

While this is… a reasonable rule of thumb, it is… not 100% accurate 100% of the time. Maximizing ΔR/Δt will necessarily always give the best time efficiency to increase your knowledge over the period of time of the simulation.

Here’s some data based on simulations.
This is with default FSRS-6 parameters, default review times, and assuming that FSRS is perfectly accurate. Of course, the real scheduling benchmark will remove these caveats and assumptions.
SSP-MMC is basically adaptive DR for every card.

S_max refers to the assumption in SSP-MMC that once stability reaches some high enough value, the card will never be forgotten, so R=100%. Here it was set to 10 years. In the final benchmark I’ll set it to 25 years.

My main point is that there is always a trade-off. Whichever scheduling policy you use—fixed intervals, fixed DR or SSP-MMC—you can give up some total knowledge in exchange for some time efficiency, and vice versa.

In that case, let I be the integral of the forgetting curve of a given card from time t to infinity (and hence with units of memorability × time). Maximizing Δ(Σ(I))/Δt should align with Δ(Σ(R))/Δt for a sufficiently long simulation.
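One caveat worth flagging: for a power-law forgetting curve, the integral to infinity actually diverges when the decay exponent is ≤ 1 (as it is with typical FSRS parameters), so in practice I has to be computed up to a finite horizon, much like the S_max cutoff mentioned earlier. Here is a hedged sketch; the curve shape and constants are the FSRS-4.5 defaults, used purely for illustration:

```python
def power_r(t, s, factor=19/81, decay=0.5):
    # FSRS-4.5-style power forgetting curve: R(t) = (1 + factor*t/s)^(-decay).
    return (1.0 + factor * t / s) ** (-decay)

def integral_i(t0, s, horizon, steps=100_000):
    # I ~= integral of R from t0 to horizon via the trapezoidal rule;
    # units: memorability x days. The finite horizon stands in for
    # "infinity", which is necessary because the true integral to
    # infinity diverges whenever decay <= 1.
    h = (horizon - t0) / steps
    total = 0.5 * (power_r(t0, s) + power_r(horizon, s))
    for k in range(1, steps):
        total += power_r(t0 + k * h, s)
    return total * h
```

For decay = 0.5 the antiderivative also exists in closed form, (2s/factor)·(sqrt(1 + factor·t/s) − 1), which is a cheap sanity check on the numerical version.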

This is an extremely nuanced point, but there is a minor yet important detail here that I think is worth mentioning: for the flat DR policy, there is some number which will maximize ΔR/Δt for long simulations. This is, necessarily, the optimal DR to maximize knowledge per unit of time doing reps. For each user and each preset, it will be some different value. We’ll call it DR_opt.
When you are above DR_opt, it is exactly as you say. You can decrease your DR, and you will sacrifice total retention in favor of efficiency. However, when you are below DR_opt, that no longer holds. When you go below DR_opt, you lose both retention and efficiency. There is no trade-off anymore, it is now simply a losing strategy. So it’s not that there is always a trade-off.
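To make the DR_opt picture concrete, here is a toy grid search over flat-DR simulation results. All the numbers are invented placeholders, not real simulator output, but they show the shape of the argument: efficiency rises up to DR_opt and falls on both sides of it, while total knowledge keeps rising with DR.

```python
# Toy simulator output: DR -> (knowledge gained, hours studied).
# Invented numbers for illustration only.
runs = {
    0.50: (4200.0, 90.0),
    0.60: (5100.0, 100.0),
    0.70: (5600.0, 108.0),
    0.80: (5900.0, 125.0),
    0.90: (6100.0, 170.0),
}

efficiency = {dr: knowledge / hours for dr, (knowledge, hours) in runs.items()}
dr_opt = max(efficiency, key=efficiency.get)

# Above DR_opt: more knowledge, less efficiency (a genuine trade-off).
# Below DR_opt: less knowledge AND less efficiency (a losing strategy).
```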

There are a large number of users who don’t care about their retention rate at all aside from the fact that they want to maximize their efficiency in app. They don’t have an upcoming test or anything, they just want to cram as much vocab into their brain for the least amount of work.

Because of this, the user should always have the option to set the DR to DR_opt, even if it’s below 70%. They don’t know what DR_opt is or any of this math. But that is what they want. Personally speaking, I would like to see DR_opt in the simulation by having the x-axis go below 70%. I understand why this is bad from the POV of giving users numbers that they can’t later choose. However from the POV of wanting to know/calculate DR_opt, it is vital.

As I said before, I strongly suspect that DR_opt is somewhere between 60 and 75 for most users.

(I also personally have this vague belief that it’s probably more optimal to just do DR_opt until you’ve gained however much R, and then to slowly bump up DR until you have the amount of ΣR that you want to pass whatever test with whatever retention. This is not based on any actual analysis, just conjecture that I’ve had while thinking about this topic over the past 2 weeks.)

Regarding SSP-MMC, I have a good number of thoughts about that topic as well, as well as about the simulation at GitHub - open-spaced-repetition/SSP-MMC-FSRS: Stochastic-Shortest-Path-Minimize-Memorization-Cost for FSRS and its results. There’s probably a better place to have that discussion than this thread. (I was surprised to see that it did not beat flat DR. I suspect there is an error in the implementation.) I have not had time to mess around with it yet.


It does beat fixed DR. The table in the repo, unfortunately, only shows DR in increments of 0.03, but it’s still possible to conclude that it “beats” the Pareto frontier of DRs.
We have two objectives that we want to maximize - total knowledge at the end of the simulation and knowledge gained per time unit of studying. If these two values for SSP-MMC are both greater than the two values for any fixed DR, then SSP-MMC “beats” the Pareto frontier. It “dominates”, to use more proper terminology.
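The dominance check itself is simple enough to sketch in a few lines of Python; the numbers here are placeholders, with the real values coming from the tables in the repo:

```python
def dominates(a, b):
    """a, b = (total_knowledge, knowledge_per_study_time) tuples.

    a dominates b if it is at least as good on both objectives
    and strictly better on at least one.
    """
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

ssp_mmc = (8500.0, 52.0)       # placeholder numbers
fixed_dr_085 = (8300.0, 50.0)  # placeholder numbers
result = dominates(ssp_mmc, fixed_dr_085)  # True with these placeholders
```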

Notice that SSP-MMC dominates DR=0.85: both total knowledge at the end and total knowledge/average time per day are higher for SSP-MMC.
Note that SSP-MMC dominates both in the first table, where the duration of the simulation is 1 year, and in the second one, where it’s 10 years.

Additionally, I added hyperparameters to SSP-MMC so that it can be fine-tuned. That’s how I got the graph in my previous comment, with the green curve. That change is not in the repo yet, though. The repo will undergo a lot of changes in the next few months.

It could be. I told the FSRS creator a long time ago that in traditional settings many students don’t revise as often as they do with Anki. In particular, in the niche community that I’m in, I think I’m revising much more frequently than other people prepping for the same test as me.

Testing with lower DR might be interesting. And then slowly increasing the DR before the exams, to emulate how people study in real life.

I think a bigger problem for us users is motivation, not long intervals. But it’s still worth trying out. For those of us who are preparing for competitive exams, optimal matters much more than “Oh, I don’t like long intervals; it doesn’t matter that there’s only a 0.1% acceptance rate in the exam I’ve spent 3 years on, let me be slightly suboptimal for the sake of shorter intervals.”

One more thing: I’ve actually seen people in the community testing very low DRs long before we had any idea about all this. (They read some research papers about how this is more optimal; it’s in the forums.)


But that’s exactly the kind of people who don’t want low retention and want high retention. Am I misunderstanding something?