Question about FSRS difficulty optimization

Hi everyone,

I have been looking at how FSRS fits its parameters and noticed that the optimiser seems to treat the card-difficulty values stored in the logs as ground truth, even though those values were produced by earlier (and possibly biased) versions of the scheduler. That bias might be why some users end up with the difficulty of most cards collapsing toward a single value after an optimisation run.

Framed mathematically, this looks like a bi-level problem.

• Lower level: infer the true, time-varying latent difficulty of each card from the review history.

• Upper level: choose the FSRS parameters that minimise the loss given those latent states.

A preliminary idea would be to solve the inner inference step with something Markovian, then update the outer parameters via a cutting-plane or other iterative scheme that alternates between “refit latent states” and “refit FSRS weights”.
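To make that concrete, here is a very rough sketch of the alternation I have in mind. Every function in it is a hypothetical placeholder, not anything that exists in the FSRS codebase:

```python
# Hypothetical sketch of the "refit latent states" / "refit FSRS weights"
# alternation. None of these helpers exist in FSRS; they only show the shape
# of the bi-level idea.

def alternating_fit(review_histories, init_weights, n_rounds=10):
    weights = init_weights
    for _ in range(n_rounds):
        # Lower level: infer each card's latent difficulty trajectory from its
        # review history, given the current weights.
        trajectories = [infer_difficulty_trajectory(h, weights)  # hypothetical
                        for h in review_histories]
        # Upper level: refit the weights against the loss, treating the
        # inferred trajectories as fixed.
        weights = fit_weights(review_histories, trajectories)  # hypothetical
    return weights
```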

Has anyone tried modelling optimisation this way, or is there a practical reason it has been avoided?

Thanks for your insights,

Arthur

Is what I said completely off? Because learning from difficulty values that were themselves generated by a potentially biased model seems fundamentally broken. If after optimization I end up with 150 cards stuck on the same difficulty, that’s just absurd. Learning from past errors makes sense, but learning from errors as if they were truth? That doesn’t. Am I missing something obvious here?

Arthur

Difficulty values stored in the logs are not used in the training/optimizing stage at all. The training is based only on the rating (which button is pressed) and interval (how long since the previous review) of each review in the log.

However, if you enable the “Ignore cards with reviews before (date)” option, then for each card that is ignored in training, that card’s memory state is based on the first review after the cutoff date, plus any later reviews. If that initial review was done with FSRS enabled, then the previously-calculated difficulty will factor into the card’s new memory state. (If the initial post-cutoff review was done with SM-2, then the SM-2 ease factor will be used instead.)

In short, if you train on your whole card set, then previously-calculated difficulties are not used. But if you exclude some cards from training, then previously-calculated difficulties may be used for some of the excluded cards.
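Roughly speaking, the per-card training input is just the sequence of grades and elapsed intervals. A simplified illustration of that, with hypothetical field names (not the actual revlog schema or optimizer code):

```python
# Simplified illustration of what training sees per card: only the grade
# (which button was pressed) and the days since the previous review.
# Stored difficulty/stability values are never part of this input.
# The dict keys below are hypothetical, not Anki's real revlog schema.

def to_training_sequence(revlog_entries):
    sequence = []
    previous_day = None
    for entry in sorted(revlog_entries, key=lambda e: e["review_day"]):
        delta_t = 0 if previous_day is None else entry["review_day"] - previous_day
        sequence.append((delta_t, entry["grade"]))  # (days since last review, 1-4)
        previous_day = entry["review_day"]
    return sequence  # e.g. [(0, 3), (2, 3), (7, 2), ...]
```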


I’ve been trying out a patch (on a personal branch) to base the initial memory state for ignored cards on the last pre-cutoff-date review (instead of the first post-cutoff-date review). This would make it possible to ensure that previous FSRS-calculated memory states are not used, by setting the cutoff date to a time before you enabled FSRS.

I’ll post a pull request for this patch shortly.

OP, if you mean “ease”, no, FSRS doesn’t use ease from SM-2. FSRS only uses interval lengths and grades.
As for the difficulty (D) formulas in FSRS, for whatever reason, (almost) every idea that Jarrett and I have tried for improving those formulas has failed to improve the accuracy of FSRS. There was one idea that improved accuracy by 1%, but that’s it. I even tried a neural-network-based approach for D and it barely did anything. Maybe D just sucks, idk.
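For reference, the current D formulas have roughly this shape; I’m writing this from memory and simplifying, so check the wiki/source for the exact, version-specific formulas:

```python
import math

# Rough, from-memory sketch of the shape of the difficulty formulas (not the
# authoritative code; the exact formulas differ between FSRS versions).
# w is the parameter vector fitted by the optimizer; grade is 1=Again, 2=Hard,
# 3=Good, 4=Easy.

def initial_difficulty(w, grade):
    d0 = w[4] - math.exp(w[5] * (grade - 1)) + 1
    return min(max(d0, 1.0), 10.0)

def next_difficulty(w, d, grade):
    delta = -w[6] * (grade - 3)       # Again/Hard push D up, Easy pulls it down
    d_new = d + delta * (10 - d) / 9  # damping so D saturates near 10
    d_new = w[7] * initial_difficulty(w, 4) + (1 - w[7]) * d_new  # mean reversion
    return min(max(d_new, 1.0), 10.0)
```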

Interesting. I’ve overwritten the ease values of my whole revlog so that it is as if I’ve been using FSRS since before it existed. I also exclude reviews from before 2022 or 2021, so some of my cards ought to be affected by the difficulty values stored in the revlog at the cutoff date?

Let’s be absolutely clear:

If the distribution of card difficulties in your dataset is broken, you’re cutting the legs out from under the whole model even if you think you don’t “use” difficulty directly for optimization.

Difficulty isn’t some cosmetic field: it’s a core input at every review, driving the update of stability and, therefore, the entire scheduling process. If your dataset collapses difficulty into clusters (say, hundreds of cards stuck at 10 from early training phases, and others at much lower levels), then your optimizer is working with fundamentally corrupted data. No amount of parameter tuning will patch over a distribution that doesn’t reflect the reality of the learner’s experience.

Frankly, it’s absurd to ignore this structural flaw.

When you optimize FSRS on a real dataset, you’re not just fitting parameters to a static world; you’re managing a dynamic system of latent variables. If you don’t recalculate the full trajectory of card difficulties when you update parameters, you’re literally training on a dataset polluted by outdated, sometimes completely wrong, difficulty estimates. Over time, this leads to catastrophic effects: clusters of cards “stuck” at high difficulty (typically from early phases when the system wasn’t in equilibrium) never normalize as more cards are added. The result is blocks of medium and maximum difficulty cards, and an optimizer that can only patch around the mess.

This isn’t just an academic detail. It neuters the optimizer. You end up with a pseudo-stable system, where hundreds of cards sit at difficulty 10 for months or years, totally disconnected from reality. It destroys both learning speed and interpretability.

What’s needed is a hardcore, bi-level optimizer: one that, when needed, can recursively update all card difficulties using the full review history and the current parameter set. Yes, it’s more computationally demanding, but you only need to run it occasionally to “reset” the system and clean up the latent space. Once that’s done, normal FSRS operation can resume as usual.

Bottom line:

This is a dynamic system, not a static snapshot. Optimizing the loss at a single point in time is fundamentally broken if you don’t update the latent beliefs. If you care about the long-term health and speed of your scheduling, you cannot ignore this. The marginal gain in model integrity is massive.

The values of D saved in the revlog don’t matter at all during optimization. The optimizer never sees them. It calculates new values using the latest parameters.

(Also, please don’t use AI to write arguments because most of the time, the text doesn’t make much sense or is very repetitive. It’s better to write the arguments in your own words. Even if you write it in your own language, the reader can use a translator.)

The use of AI is not the subject of this topic. You have to understand that this image is not normal.

I’m just explaining that the distribution of your beliefs (the difficulty estimates) is used to compute stability. Even if you think you’re not using it during optimization, those beliefs still need to be re-estimated at some point if they become wrong.

When you optimize the parameters, the stability and difficulty are calculated again for all cards based on the latest parameters. So, neither optimization nor scheduling depends upon whatever value of D was stored in the revlogs (assuming that the revlogs are not truncated/ignored as mentioned by @mbrubeck above).
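Conceptually, it works like this (a simplified sketch, not the actual optimizer code; the helper functions are stand-ins for the real FSRS formulas):

```python
# Conceptual sketch: each card's memory state is rebuilt from its raw review
# history using the *latest* parameters, so whatever D/S was written into the
# revlog earlier is never consulted.

def replay_memory_state(w, reviews):
    # reviews: list of (delta_t_in_days, grade) pairs, oldest first
    d = s = None
    for delta_t, grade in reviews:
        if d is None:
            d = initial_difficulty(w, grade)   # stand-ins for the real
            s = initial_stability(w, grade)    # FSRS formulas
        else:
            r = retrievability(s, delta_t)     # forgetting curve at delta_t
            s = next_stability(w, d, s, r, grade)
            d = next_difficulty(w, d, grade)
    return d, s
```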

The only thing that depends on the values of D stored in the revlog is this graph. If anything, it is an issue in the add-on. The add-on should use ts-fsrs to recalculate the values of memory states using the latest parameters to generate this graph.


Can you please be more specific about the optimisation process? And is the default difficulty graph also affected by the add-on problem you are describing?


It looks the same to me. Because D → S → R, I’m wondering whether you are updating all the beliefs during optimization, and how you are doing that. Are you going down each review path and recalculating all the difficulties, then finding the best parameter fit for the D transition? Is the transition function parametric?

Maybe you’re changing the initialization parameter to fit it, but what if the solution space is not convex?

In my mind, this feels like a bi-level optimization problem:

You want to minimize the model’s loss by finding the best possible transitions in card difficulty (D), but the transitions themselves depend on the very parameters you’re optimizing.

Are you actually trying to optimize the initialization and transitions of difficulty for each card? Even if the loss landscape is non-convex?

Because if you’re not optimizing both levels jointly, and the problem is non-convex, it seems easy to get stuck with multiple modes in the difficulty distribution, which is exactly what I’m seeing in my dataset.

Just want to make sure I’m not missing anything. Thanks!

You should take a look at what the optimizer actually does: The mechanism of optimization · open-spaced-repetition/fsrs4anki Wiki · GitHub.

You are very concerned about the use of historical D figures in optimization, but I don’t think there’s any basis for that concern.

Stats > Difficulty only compiles what the currently computed/stored D is for each of your cards. No historical data is displayed.

Thanks for the explanations, it’s much clearer now. FSRS is clearly the best approach out there, and the marginal gain for learners and even for humanity at large is huge. Bravo for that.

Out of curiosity (and because I like to explore all optimization avenues), I’ll test a few alternative approaches:

  • Random restarts and parameter reinitialization to check for local minima (a rough sketch follows after this list),
  • A more bilevel/EM-like optimization (alternating between refitting latent states and parameters),
  • And maybe some “transformation tricks” inspired by stochastic programming to push the boundaries.
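
On the random-restart point, what I have in mind is roughly the following; train() is a hypothetical stand-in for whatever entry point the optimizer actually exposes, not real FSRS code:

```python
import random

# Rough sketch of the random-restart experiment. train() is a hypothetical
# stand-in assumed to return fitted weights and the final loss for a given
# initialization.

def fit_with_restarts(review_histories, n_params, n_restarts=5, seed=0):
    rng = random.Random(seed)
    best_weights, best_loss = None, float("inf")
    for _ in range(n_restarts):
        init = [rng.uniform(0.1, 5.0) for _ in range(n_params)]  # perturbed start
        weights, loss = train(review_histories, init)  # hypothetical
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss
```

If the restarts land on clearly different losses or difficulty distributions, that would support the non-convexity worry; if they all converge to the same place, the multimodal difficulty histogram probably has another explanation.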

I’ll report back if I find anything significant or useful for the community. Thanks again for your work, seriously impressive.

Arthur