Measures to prevent new FSRS parameters from being worse than the old ones

My bad, I misunderstood. I thought you meant there is no downside to frequent optimization in the current implementation. Yes, with my idea of keeping the better of the two sets, there would be no downside to optimizing frequently. I really hope that LMSherlock and Dae will implement it.

@L.M.Sherlock is this expected? If not, it might be worth digging into in the future when you’re not as busy.

How frequently do you re-optimize? The more frequently you optimize, the more likely you are to get a worse result.

https://www.reddit.com/r/Anki/comments/19aph7t/should_i_save_fsrs_weights_when_the_numbers_are/
Here’s another post from a concerned user; I’ve seen quite a few posts like this. Since LMSherlock hasn’t found a rigorous solution to this problem, the best we can do is implement my idea of keeping the better of the two parameter sets. And to answer Dae’s question:

I agree with LMSherlock that only RMSE (bins) should be used to decide which parameters to keep.
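
To make it concrete, here’s a rough sketch of the “keep the better of the two” logic I have in mind. This is not Anki’s actual code; evaluate_rmse_bins is a hypothetical stand-in for FSRS’s real RMSE (bins) computation:

def keep_better(old_params, new_params, reviews, evaluate_rmse_bins):
    # Evaluate both parameter sets on the same (full) review history and
    # keep whichever one has the lower RMSE (bins).
    # evaluate_rmse_bins(params, reviews) is a hypothetical callable standing
    # in for FSRS's actual RMSE (bins) evaluation.
    if evaluate_rmse_bins(new_params, reviews) <= evaluate_rmse_bins(old_params, reviews):
        return new_params
    return old_params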

After thinking about this for a while, I don’t think this is a problem with the algorithm; rather, this is just what happens when you try to estimate something (mean, median, standard deviation, etc.) from a sample.

Imagine that you are sampling some random value X from a normal distribution with mean μ and standard deviation σ. As your sample size increases, you should be able to estimate μ and σ more accurately, in the sense that the difference between your estimate and the true value will become smaller and smaller as the sample size grows. But while this is true in general, it’s not necessarily true for every single new Xn that you add to your sample. Sometimes adding a new Xn will make your estimates worse. It’s not because your formulas are bad, it’s just due to the random nature of the process itself.

The code is very simple:

import numpy as np

# Draw 101 samples from a standard normal distribution (true mean = 0)
# and record the running estimate of the mean after each new sample.
data = []
means = []
for i in range(101):
    random_value = np.random.normal(0, 1)
    data.append(random_value)
    means.append(np.mean(data))
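
Here’s a minimal plotting sketch (assuming matplotlib is available) of the kind of figure I mean: the running mean in blue against the true mean of 0 in black.

import matplotlib.pyplot as plt

# Blue line: running estimate of the mean; black line: the true mean (0).
plt.plot(means, color="blue", label="running mean estimate")
plt.axhline(0, color="black", label="true mean")
plt.xlabel("sample size")
plt.ylabel("estimated mean")
plt.legend()
plt.show()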

Notice that the blue line doesn’t monotonically get closer to the black line. It gets close eventually, but not every time I add one more Xn to the sample. In other words, not every new datapoint makes the estimate closer to the ground truth. It’s basically the same thing with FSRS and its metrics, since forgetting is random.

EDIT: I thought about it some more and it still doesn’t make sense for the old parameters to perform better than the new ones on the same dataset.
Suppose we have obtained the old parameters on a dataset with n reviews and the new ones on a dataset with n+m reviews. If the old parameters perform better than the new ones on the dataset with n+m reviews, which is not what they were optimized on, then that’s strange. The old parameters were optimized on a (somewhat) different dataset; they can’t possibly provide a better fit to the new dataset than the new parameters, unless the optimization procedure itself is junk.

I thought about it some more and it still doesn’t make sense for the old parameters to perform better than the new ones on the same dataset.

Exactly, that’s the issue, though I’ve since come to understand that the algorithm doesn’t optimize for RMSE in the first place, but for log loss. RMSE and log loss don’t correlate perfectly, so even if the optimized log loss is better, the RMSE can be worse. That’s part of the problem, especially if, as a user, you care more about RMSE than about log loss. RMSE (bins) can’t be used as the loss for gradient descent (the optimization algorithm) because the binning step isn’t differentiable, which is why the proposed solution is to simply discard the new parameters if their RMSE is (significantly) worse.
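
To make the difference concrete, here’s a simplified sketch of how the two metrics are computed; the binning below (equal-width bins on the predicted probability) is only a toy version, not FSRS’s actual RMSE (bins) procedure:

import numpy as np

def log_loss(p, y):
    # Average negative log-likelihood over individual reviews.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def rmse_bins(p, y, n_bins=10):
    # Toy stand-in for RMSE (bins): group predictions into equal-width bins,
    # then compare the mean prediction with the actual retention in each bin.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    sq_errors, weights = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            sq_errors.append((p[mask].mean() - y[mask].mean()) ** 2)
            weights.append(mask.sum())
    return float(np.sqrt(np.average(sq_errors, weights=weights)))

# Example: predicted recall probabilities and actual outcomes (1 = recalled).
p = np.array([0.9, 0.8, 0.7, 0.95, 0.6])
y = np.array([1, 1, 0, 1, 1])
print(log_loss(p, y), rmse_bins(p, y))

Because log loss scores every prediction individually while RMSE (bins) only compares bin-level averages, a set of parameters can improve one metric while making the other worse.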

The old parameters were optimized on a (somewhat) different dataset; they can’t possibly provide a better fit to the new dataset than the new parameters, unless the optimization procedure itself is junk.

That was indeed my point, though back then I didn’t realize that it wasn’t optimizing for RMSE in the first place. Note, however, that getting a worse result doesn’t mean that the algorithm is junk. Gradient descent can get stuck at a local minimum (a solution that appears to be optimal, but isn’t). There are methods that increase the chances of ending up in a lower minimum, but finding the global minimum (the best possible set of parameters) can take an insane amount of time, depending on how big the search space is and how many local minima there are. I have no idea how much time it would take for this particular problem, but I’ve personally worked on a problem where you’re not guaranteed to find the solution within your lifetime. The trick is to find or design an algorithm that takes the right shortcuts to get to a decent solution in minimal time. Right now, optimization is incredibly fast, and the more reviews you have done, the better the chances that the new parameters are an improvement.
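
As a toy illustration (this is not FSRS’s actual optimizer), plain gradient descent on a one-dimensional function with two minima ends up in a different minimum depending on where it starts:

def gradient_descent(x, lr=0.01, steps=2000):
    # Minimize f(x) = x**4 - 3*x**2 + x by following its gradient
    # f'(x) = 4*x**3 - 6*x + 1.
    for _ in range(steps):
        x -= lr * (4 * x**3 - 6 * x + 1)
    return x

print(gradient_descent(-2.0))  # reaches the global minimum near x ≈ -1.30
print(gradient_descent(2.0))   # gets stuck in a worse local minimum near x ≈ 1.13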

I know RMSE can get worse, but I’m not sure whether log loss can get worse; I’ll have to test it. If it can, then something is very wrong.
Also, I plotted RMSE vs. log loss for 20k collections, in case you’re interested.

LMSherlock shared this graph with me


If you optimize after every 1000 reviews, there is a ~75% chance that the log loss will be better. Though I think this is just one collection, so the number could be different for other people.
Also, here are his thoughts:

Stochastic gradient descent is stochastic. There is no theory assuring that the algorithm will find the global minimum. And you can see that the difference between the last weights and the current weights is very small.
Also, log loss is related to the average retention.


I’ve logged the issue on Keep previous FSRS parameters if they get worse when optimizing · Issue #2972 · ankitects/anki · GitHub so it doesn’t get forgotten about.

Based on the user feedback here, I suspect we have more than just chance at play - some changes in 23.12 appear to have made the output consistently less accurate for certain users - perhaps due to changes in the parameter clamps?

