Currently with FSRS 5 I get the following statistics on my Profile1:
Log loss: 0.4479, RMSE(bins): 2.76%. Smaller numbers indicate a better fit to your review history.
I also have a Profile2 that I used 2–3 years ago with SM-2. Although SM-2 is not based on recall probability, is there a way to compare, or to generate RMSE for, that collection?
No, you can’t generate RMSE on an SM-2 scheduling history. SM-2 wasn’t doing anything to predict your recall, so there’s nothing to compare it to.
RMSE (bins) can be interpreted as the average difference between the predicted probability of recalling a card (R) and the measured (from the review history) probability. For example, RMSE=0.05 means that, on average, FSRS is off by 5% when predicting R.
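To make that concrete, here is a simplified sketch of a binned RMSE in plain Python: reviews are grouped into bins by predicted R, and each bin's mean prediction is compared with its observed recall rate, weighted by bin size. The real FSRS metric uses a more elaborate binning scheme, so treat this as illustrative only.

```python
import math

def rmse_bins(predicted, actual, n_bins=10):
    """Simplified RMSE(bins) sketch.

    predicted: predicted probabilities of recall (R), one per review.
    actual: review outcomes, 1.0 for recalled and 0.0 for forgotten.
    """
    # Group reviews into equal-width bins by predicted R.
    buckets = [[] for _ in range(n_bins)]
    for p, a in zip(predicted, actual):
        i = min(int(p * n_bins), n_bins - 1)
        buckets[i].append((p, a))

    # Squared gap between each bin's mean prediction and its observed
    # recall rate, weighted by the number of reviews in the bin.
    se, total = 0.0, len(predicted)
    for bucket in buckets:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        mean_a = sum(a for _, a in bucket) / len(bucket)
        se += len(bucket) * (mean_p - mean_a) ** 2
    return math.sqrt(se / total)
```

For example, if every review is predicted at R = 0.9 and 9 out of 10 are recalled, the prediction matches the observed rate and the metric is 0.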
That probability-predicting version of SM-2 was made just for benchmarking purposes, so you'd have to copy-paste the Python code to use it on your own. The original SM-2 doesn't predict probabilities, so in the benchmark LMSherlock added some extra formulas on top of it.
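For illustration, one simple way to bolt a recall probability onto SM-2's intervals is to assume an exponential forgetting curve anchored at 90% recall exactly at the scheduled interval. This is an assumption for the sketch, not necessarily the exact formula used in the benchmark:

```python
def sm2_predict_recall(elapsed_days, scheduled_interval):
    """Hypothetical forgetting curve for SM-2 benchmarking.

    Assumes recall probability is exactly 0.9 when the elapsed time
    equals the scheduled interval, decaying exponentially otherwise.
    Illustrative only; the benchmark's actual formulas may differ.
    """
    return 0.9 ** (elapsed_days / scheduled_interval)
```

With predicted probabilities like these, SM-2's review history can be scored with log loss or RMSE(bins) just like FSRS.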
Theoretically, it’s possible to add the same probability-related formulas to Anki’s version of SM-2, hook it up to the optimizer and run optimization the same way as we do with FSRS…but why?
Btw, according to the benchmark, FSRS-5 outperforms the trainable variant of SM-2 in 97.4% of cases, so even if you level the playing field by making both optimizable and using the same optimizer, FSRS is still clearly better.
Actually, on second thought, it would be fun and it would make it clear that SM-2 isn’t as good as FSRS. We could show both FSRS and SM-2 metrics side-by-side in “Evaluate”, so that people will be like “SM-2 has higher numbers, so it’s worse, ok, got it”.
Is it worth the effort to implement? Debatable. Will Jarrett do it? I wouldn’t bet on it.