So is it better to use the 17-value (FSRS-4.5) parameters with FSRS-5 in such a case?
When you use FSRS-4.5 parameters in Anki 24.10, the two new parameters are appended “behind the scenes” (you won’t see them) and their default values are used.
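For illustration, here is a minimal sketch (not Anki's actual code) of how a 17-value FSRS-4.5 parameter list could be padded to the 19 values FSRS-5 expects; the default values below are placeholders, not the real defaults:

```python
# Hypothetical sketch: pad 17 FSRS-4.5 parameters to the 19 expected by FSRS-5.
FSRS5_PARAM_COUNT = 19
NEW_PARAM_DEFAULTS = [0.0, 0.0]  # placeholders; the real defaults live in Anki/fsrs-rs

def pad_parameters(params: list[float]) -> list[float]:
    """Append default values for the two new parameters when only 17 are supplied."""
    missing = FSRS5_PARAM_COUNT - len(params)
    if missing <= 0:
        return list(params)
    return list(params) + NEW_PARAM_DEFAULTS[:missing]
```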
Given that the v4.5 parameters are often not optimal for v5, combined with the fact that it’s not guaranteed that you’ll find the global minimum when optimizing, it makes sense that you can end up with a higher RMSE after optimizing for v5, even if the v5 global minimum is better than the v4.5 global minimum. Therefore I’m very interested in this random-seed solution, because I too have found that the Python optimizer can often (not always) come up with a (much) better set of parameters than the fsrs-rs one, and for me the v5 parameters for some decks were also initially worse than the v4.5 ones.
The only thing to consider is whether the seed should be randomized before every optimization, because then you might first get the message that the parameters are already optimal and then get a better set of parameters if you click again. That might lead to some confusion, so maybe the seed should be randomized once and then kept fixed until some condition is met.
Running the optimizer with N different seeds would mitigate the problem and improve accuracy, but it would make it N times slower, so it’s unlikely to be implemented.
Though I should add that if you optimize frequently, you’ll eventually get better parameters just by chance.
Ok, so a 50/50 chance I’d say, with no apparent bias? That would mean it’s a matter of either optimizing frequently and hoping that you settle into a more optimal set of parameters over time, or using a random seed every time the Optimize button is pressed, so a better set can be found by pressing the button a few times. But you guys might not want the optimization step to give different results every time. Maybe it could be based on something like the date, so the seed changes only once a day. Or the random seed could be saved until you leave the settings pane, etc. There are various ways to make it more stable without making it completely static. Just sharing some ideas.
Or based on the number of reviews, so that the seed only changes when new reviews have been done. That makes it unnoticeable to end users. The question is whether it makes sense statistically, i.e. whether it increases the chance of getting good parameters in a shorter amount of time.
Or based on the number of reviews, so that the seed only changes when new reviews have been done.
That’s an interesting idea. @L.M.Sherlock, what do you think?
EDIT: I think using the number of reviews as seed will be strictly better than the current implementation.
- It will increase the chances of stumbling upon better parameters and decrease the chances of what happened to MaleMonologue happening again.
- It will not create a situation where the user pressed “Optimize” and got the “parameters appear to be optimal” message, then pressed it again immediately and got new parameters, which could happen if the seed is completely random (or based on the UNIX timestamp, which I believe is what random number generators usually use for seeding).
- To users, it will look like FSRS got better at utilizing new review data without any drawbacks.
- It won’t make it harder to reproduce issues, since the number of reviews is the same before and after the user exports their collection.
EDIT 2: the benchmark can use a fixed seed for the sake of consistency, while Anki will use the number of reviews for seeding to increase the chances of obtaining better parameters.
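A minimal sketch of the two seeding schemes discussed above (illustrative names only, not Anki’s API):

```python
import datetime

def seed_from_review_count(review_count: int) -> int:
    # Seed changes only when new reviews have been done, so pressing "Optimize"
    # twice on the same data gives the same result.
    return review_count

def seed_from_date(today: datetime.date | None = None) -> int:
    # Alternative: seed changes once per day, independent of new reviews.
    if today is None:
        today = datetime.date.today()
    return today.toordinal()
```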
Is it real?
My gut feeling is that if you use the same seed and only add a small number of reviews, the chances are slim that you’ll end up with significantly better parameters. With another seed it might bump you into a more optimal minimum. So I agree with @Expertium, but I can’t prove it statistically. Either way, I can’t think of any downsides, as any seed is as good as any other seed until you try it, right? Without a random seed it might take more reviews before you stumble upon a better optimum.
What do you mean?
I was trying to say that if the seed changes after every review, it will increase the chances of finding better parameters every time the user does at least 1 review. To the user it will look like he is getting “FSRS parameters currently appear to be optimal” less frequently.
It doesn’t increase the chances. It may just as well decrease them. The chance is 50/50.
Why?
As I understand it, different seeds result in slightly different final parameters, so trying different seeds is beneficial.
When you control everything else, trying different seeds may help.
I agree on the 50/50 part, but I assume that if the dataset you’re trying to optimize for is only slightly different (for example +10 reviews in a set of 10k reviews), the same seed might have a high chance of leading to the same[1] local minimum as last time, so at best you get only slightly better parameters than before. If you change the seed, it might knock you into a better local minimum, even though the dataset is only slightly different. And yes, it might be worse today (in which case you keep the current parameters), but it might be significantly better tomorrow, whereas the same seed might not have knocked you into a much better solution in only 2 days. But this only holds true if my initial assumption is correct.
[1] With “same” I don’t mean that the parameters are equal, but that the general shape of the solution space is the same and that you end up at the same relative location.
I suppose we need to conduct an experiment. I want you to do the following:
- Choose any collection (or your own).
- Select n, where n < total number of reviews. For example, if your collection has 10,000 reviews, you can set n=1,000.
- Optimize parameters based on n reviews.
- Increment n by 1.
- Optimize parameters again. If the new ones are better for the new dataset, keep them; otherwise keep the old parameters.
Repeat the procedure described above until n=total number of reviews. In order to speed it up, you can increment n by 10 or 50 instead of 1, if you want to.
I want you to do it twice: with a fixed seed and with n used as seed. Then we’ll see whether there is a difference.
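To make the procedure concrete, here is a rough sketch of the loop; `optimize` and `evaluate` are hypothetical stand-ins (passed in as callables) for the actual parameter optimizer and the loss computation:

```python
from typing import Callable, Sequence

def run_experiment(
    reviews: Sequence,
    optimize: Callable,   # hypothetical: (subset, seed) -> parameters
    evaluate: Callable,   # hypothetical: (parameters, subset) -> loss (lower is better)
    start_n: int,
    step: int = 10,
    use_n_as_seed: bool = False,
    fixed_seed: int = 42,
):
    """Grow the dataset incrementally, keeping the better of the old/new parameters."""
    best_params = None
    n = start_n
    while n <= len(reviews):
        subset = reviews[:n]
        seed = n if use_n_as_seed else fixed_seed
        candidate = optimize(subset, seed)
        # Keep the new parameters only if they evaluate better on the new dataset.
        if best_params is None or evaluate(candidate, subset) < evaluate(best_params, subset):
            best_params = candidate
        n += step
    return best_params
```

Running this once with `use_n_as_seed=False` and once with `use_n_as_seed=True` would give the two conditions to compare.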
It can be tested by taking a dataset, removing the last N days of reviews, then adding one day back at a time, optimizing after each day and keeping the best parameters each time. Try that with a random seed, compare it with a static seed, and repeat the experiment enough times to draw a conclusion. My suspicion is that it doesn’t make much of a difference with a small dataset, but it might make a big difference if you have a year or more of data and simulate, say, 30 extra days with daily optimizations.
Edit: I was typing this before I noticed that @Expertium posted a comment 2 minutes before I posted mine.
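A small sketch of the day-based slicing mentioned above, assuming each revlog entry carries a millisecond Unix timestamp (as Anki’s revlog id does); it ignores Anki’s configurable day-rollover hour, which a more faithful test would account for:

```python
from collections import defaultdict
import datetime

def group_reviews_by_day(revlog_entries, timestamp_ms_key="id"):
    """Group revlog entries into calendar days so days can be added back one at a time."""
    days = defaultdict(list)
    for entry in revlog_entries:
        day = datetime.date.fromtimestamp(entry[timestamp_ms_key] / 1000)
        days[day].append(entry)
    return [days[d] for d in sorted(days)]
```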
I’ve done some testing with a small dataset of ~30,000 revlog entries, where I optimized every 100 reviews, keeping the old parameters if they were better (based on log loss, not RMSE for now). One test was with a static seed, the other with a random seed. Long story short: for this small test, there was only a small difference in favor of the static seed.
- For each nth optimization I compared the log loss between the two tests; the static seed was better than the random seed 51.5% of the time at the same point in the run.
- The average log loss was 0.05% worse for the random seed (I can also do median and other metrics if needed).
- Based on the log loss values (sadly I threw away the parameters), it seems I got 67 unique sets of parameters for the random seed (so they were better than the previous one 67 times out of ~300 optimizations) and 65 for the static seed.
I already predicted that it would average out over time (that’s because of the normal distribution that @L.M.Sherlock showed), but from my testing I couldn’t even detect a difference with frequent optimizations, contrary to my expectations. So my initial conclusion is that there is no reason to implement this, unless it’s a real random seed, because then you can retry a few times on the same day, but that’s bad for the UX. Instead, it would be nice to see if the FSRS-rs implementation could match the performance of the Python implementation, since apparently it’s worse in 55.4% of the cases.
Of course there’s room for improvement in my tests, as they should be repeated enough times to see if there is a significant difference. They should also be based on RMSE rather than log loss. If anyone would like to repeat the experiment to confirm my findings, that would of course be very welcome.
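For reference, a simplified sketch of the two metrics; note that FSRS’s reported RMSE is a calibration-style RMSE computed over bins of reviews, and the equal-width binning below is only a stand-in for the actual binning scheme used by the benchmark:

```python
import math

def log_loss(predicted, actual):
    """Mean negative log-likelihood; predicted = recall probabilities, actual = 1/0."""
    eps = 1e-15
    return -sum(
        a * math.log(max(p, eps)) + (1 - a) * math.log(max(1 - p, eps))
        for p, a in zip(predicted, actual)
    ) / len(predicted)

def rmse_bins(predicted, actual, n_bins=20):
    """Simplified calibration RMSE: compare mean prediction with observed recall per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, a in zip(predicted, actual):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, a))
    total = len(predicted)
    weighted_sq_error = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        mean_a = sum(a for _, a in b) / len(b)
        weighted_sq_error += (mean_p - mean_a) ** 2 * len(b) / total
    return math.sqrt(weighted_sq_error)
```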
I don’t know if this is related to this topic, but just a heads up.