Higher RMSE in FSRS-5

I had 4.3% RMSE and 96.5% true retention on mature cards (with 95% desired retention), but now on FSRS-5 I have 6.18% RMSE. Why is that, and is it within the normal range?

Make sure to optimize parameters. That being said, it’s possible (albeit unlikely) that you will have higher RMSE with FSRS-5.


I also got higher RMSE, ~2.3% → ~3.7%, right after updating and optimizing. A couple of weeks later, after optimizing again, it's now down to 3.33%.

Same with me. Is this even rare?


In the benchmark, FSRS-5 performs better than FSRS-4.5 67.7% of the time. So in 2/3 of all cases, FSRS-5 is better. However, the benchmark's way of calculating RMSE is stricter than Anki's, so I expect it to happen less often in Anki.

Since this is mentioned…uhm, literally nowhere aside from my blog, I’ll explain:

Anki: all data is used for training, and all data is used for evaluation (aka testing). Training and evaluation can be done separately.

Benchmark:

  1. Data is split into 5 parts, let’s call them A-B-C-D-E. A contains the oldest reviews, E contains the most recent reviews.
  2. An algorithm is trained on A and evaluated on B. Let’s call the error that is obtained at this step error 1.
  3. An algorithm is trained on A and B and evaluated on C to obtain error 2.
  4. An algorithm is trained on A, B and C and evaluated on D to obtain error 3.
  5. An algorithm is trained on A, B, C and D and evaluated on E to obtain error 4.
  6. The final error is a simple average of the four errors, and the final parameters are from step 5.

Training and evaluation cannot be separated.
This is a much stricter and more rigorous procedure that tests the algorithm on unseen data, which is the whole point of splitting the dataset into a training set and a test set.
This also means that the benchmark metrics (RMSE and log loss) are, on average, worse than the metrics you see in Anki.
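
If you want to see the difference as code, here's a rough sketch, assuming hypothetical `train(reviews)` and `evaluate(params, reviews)` functions (placeholder names, not the real optimizer or benchmark API):

```python
import numpy as np

def anki_style_error(reviews, train, evaluate):
    """Anki: fit on all data, then measure the fit on that same data."""
    params = train(reviews)
    return evaluate(params, reviews)

def benchmark_style_error(reviews, train, evaluate, n_splits=5):
    """Benchmark: expanding-window time-series split (the A-B-C-D-E scheme).

    `reviews` must be sorted oldest-first. Each fold trains on every
    earlier chunk and is evaluated on the next, unseen chunk.
    """
    chunks = np.array_split(np.asarray(reviews), n_splits)  # A, B, C, D, E
    errors = []
    for i in range(1, n_splits):
        params = train(np.concatenate(chunks[:i]))  # train on chunks before i
        errors.append(evaluate(params, chunks[i]))  # test on unseen chunk i
    # Final error: plain average of the four fold errors.
    # Final parameters: the ones from the last, largest training set.
    return np.mean(errors)
```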


I’m sorry if these are dumb questions. It has been a while since I have done any low-level machine learning stuff.

Wouldn’t using the same data for training and testing mean that the Anki optimiser could end up over-fitting and make FSRS actually perform worse than before with the new “optimal” parameters?

I guess part of the problem is that each individual user's dataset is relatively small, especially with a new deck. Splitting that already-small dataset could leave the optimiser without enough to work with.

Can an individual user ever realistically have enough reviews for it to be worth splitting into separate Training and Test sets for the Anki optimiser?


I did, but unfortunately it's still 6.18%.

Hey rossgb, I get where you’re coming from with the overfitting concern, but I think there’s a bit of a misunderstanding here about how it applies to FSRS. Let me explain:

FSRS isn’t your typical machine learning model. It’s based on cognitive science stuff, not just pure data crunching. The goal is to predict how well you’ll remember things in the future, as accurately as possible.

Using all the data we’ve got for training is actually a good thing here. “Overfitting” isn’t really the main issue because the model’s structure already limits what it can learn. It’s not like a neural network that can go crazy complex.

When Anki uses the same data for evaluation, it’s just checking how well the model fits what we already know. This doesn’t mean it’ll do worse in the future. Usually, if it’s good with the data we have, it’ll probably be better at predicting future stuff too.

For models like FSRS, using all the data we’ve got for training makes sense. It’s not overfitting – it’s making the most of the limited info we have to get better predictions. The evaluation stuff just helps us see how well it’s doing, not decide if it’s overfitting.
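
To make "checking how well the model fits" concrete: Anki's RMSE (bins) compares the recall probabilities the model predicted with the recall rates actually observed, averaged over bins of reviews. Here's a simplified sketch of that idea (the real metric uses a more involved binning scheme):

```python
import numpy as np

def rmse_bins(predicted_p, recalled, n_bins=20):
    """Simplified bin-based calibration RMSE.

    predicted_p: model's predicted probability of recall per review.
    recalled:    1 if the review was actually recalled, else 0.
    Reviews are grouped into bins by predicted probability; in each bin
    the mean prediction is compared with the observed recall rate, and
    the squared gaps are averaged with bin sizes as weights.
    """
    predicted_p = np.asarray(predicted_p, dtype=float)
    recalled = np.asarray(recalled, dtype=float)
    bin_ids = np.minimum((predicted_p * n_bins).astype(int), n_bins - 1)
    sq_errors, weights = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        sq_errors.append((predicted_p[mask].mean() - recalled[mask].mean()) ** 2)
        weights.append(mask.sum())
    return float(np.sqrt(np.average(sq_errors, weights=weights)))
```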

Hope that clears things up! Let me know if you’ve got any more questions.


I feel like it’s the hundredth time I’m hearing this explanation from you :sweat_smile:

Thanks for explaining Jarrett.

I admit I haven’t really looked into exactly how the FSRS equations work because it feels like it could end up with me having to read a lot of the surrounding literature to understand what is going on.

To kind of bring things back around to Dian’s question:

The FSRS models may not have enough flexibility to over-fit as badly as a neural network, but could they be over-fitting very slightly?

(I understand you think any over-fitting is small enough to not really matter compared to the advantages of using all the data for training.)

Going from the optimiser saying:

FSRS 4.5 = 4.30% RMSE
FSRS 5   = 6.18% RMSE

seems like a very small change where it would be hard to notice any difference subjectively in the short term.

I could imagine a world where testing with extra data not used for training would reveal it actually went something more like:

FSRS 4.5 = 8% RMSE
FSRS 5   = 7% RMSE

but because the FSRS 4.5 model was slightly more over-fitted than the FSRS 5 model in this case, and we tested with the data we used for training, it seemed to get worse even though it actually got better.

i.e. We shouldn’t really worry about very small changes in RMSE reported by the Anki Optimiser going from FSRS 4.5 → FSRS 5 because without “clean” test data the number is not that accurate. We should just let the optimiser do its thing.

N.B. I might be misusing the term “over-fitting” here, but I’m not sure what else to call it.
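
For what it's worth, the effect I'm imagining is easy to demonstrate on toy data that has nothing to do with FSRS: the model that looks better on its own training data can be the worse one on held-out data. All the numbers below are made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up noisy data, split into a training half and a held-out half.
x = np.sort(rng.uniform(-1, 1, 60))
y = np.sin(3 * x) + rng.normal(0, 0.3, x.size)
x_train, y_train = x[::2], y[::2]   # every other point for training
x_test, y_test = x[1::2], y[1::2]   # the rest held out for testing

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# "Model A" is more flexible than "model B", so it hugs the training
# data more closely -- which is exactly what makes its training-set
# RMSE flattering, while its held-out RMSE is typically worse.
for name, degree in [("model A (degree 12)", 12), ("model B (degree 3)", 3)]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = rmse(np.polyval(coeffs, x_train), y_train)
    test_err = rmse(np.polyval(coeffs, x_test), y_test)
    print(f"{name}: train RMSE {train_err:.3f}, held-out RMSE {test_err:.3f}")
```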


thx. :smiling_face::white_heart: