I’ve done some testing with a small dataset of ~30,000 revlog entries, optimizing every 100 reviews and keeping the old parameters if they were better (based on log loss, not RMSE for now); a rough sketch of the procedure follows the results below. One test used a static seed, the other a random seed. Long story short: for this small test, there was only a small difference in favor of the static seed.
- At each optimization checkpoint I compared the log loss of the two runs; the static seed was better than the random seed 51.5% of the time.
- The average log loss was 0.05% worse for the random seed (I can also compute the median and other metrics if needed).
- Based on the log loss values (sadly I threw away the parameters), I ended up with 67 unique sets of parameters for the random seed (i.e. the new parameters beat the previous ones 67 times out of ~300 optimizations) and 65 for the static seed.
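For clarity, here is a minimal sketch of what I mean by the procedure, assuming hypothetical `optimize`, `log_loss`, and `revlog` placeholders rather than the actual fsrs-optimizer API:

```python
import random

def run_incremental(revlog, seed_fn, step=100):
    """Re-optimize every `step` reviews; keep the previous parameters if they
    still score a lower log loss on the data seen so far."""
    kept_params = None
    losses = []  # log loss after each optimization checkpoint
    for end in range(step, len(revlog) + 1, step):
        window = revlog[:end]
        new_params = optimize(window, seed=seed_fn())  # placeholder for the optimizer call
        new_loss = log_loss(new_params, window)        # placeholder for the evaluator call
        old_loss = log_loss(kept_params, window) if kept_params is not None else float("inf")
        if new_loss < old_loss:
            kept_params = new_params
        losses.append(min(new_loss, old_loss))
    return losses

# One run with a fixed seed, one drawing a fresh random seed per optimization:
static_losses = run_incremental(revlog, seed_fn=lambda: 42)
random_losses = run_incremental(revlog, seed_fn=lambda: random.randrange(2**32))
```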
I already predicted that it would average out over time (because of the normal distribution that @L.M.Sherlock showed), but in my testing I couldn’t detect a difference even with frequent optimizations, contrary to my expectations. So my initial conclusion is that there is no reason to implement this, unless it’s a truly random seed, because then you could retry a few times on the same day, but that would be bad for the UX. Instead, it would be nice to see whether the FSRS-rs implementation could match the performance of the Python implementation, since apparently it’s worse in 55.4% of the cases.
Of course there’s room for improvement in my tests: they should be repeated enough times to see whether there is a significant difference, and they should be based on RMSE rather than log loss. If anyone would like to repeat the experiment to confirm my findings, that would of course be very welcome.
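If someone does repeat this, a quick paired test on the per-checkpoint losses would show whether the 51.5% win rate is distinguishable from chance. A sketch, assuming the `static_losses`/`random_losses` lists from the snippet above:

```python
from scipy.stats import binomtest, wilcoxon

# Sign test: how often did the static seed win, and is that rate far from 50%?
wins = sum(s < r for s, r in zip(static_losses, random_losses))
n = len(static_losses)
print(binomtest(wins, n, p=0.5))

# Wilcoxon signed-rank test on the paired loss values themselves.
print(wilcoxon(static_losses, random_losses))
```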