21 parameters seems like a very low number for a model in a world of billion-parameter models. I read that parameters are tested in simulations to see how much they improve predictions. But what if a parameter improves predictions only for a very small percentage of users, yet in absolute terms that is still hundreds or thousands of users who get better predictions?
I suspect that parameters are quite cheap. So why not, when in doubt, add a parameter for anything that might improve predictions?
A parameter isn’t something you can just add. I mean, it is, but a useful parameter is very different.
In the space of theoretical "parameters" that one could add, approximately 100% of them will do absolutely nothing to improve a given model. The number of parameters that actually do anything is extremely small. I'm surprised it's as high as 21.
I'll add that there are diminishing returns on the benefit of adding more parameters, and adding them can be costly: more parameters can cause overfitting (which gives you a worse result). Most Anki collections don't have that much training data in the grand scheme of things (at least if you want to optimize for a single individual's review history).
You want the model to generalize the patterns that exist in the data, not match the data perfectly. As an example, imagine fitting a polynomial to n data points. For n data points you can always find a polynomial of order n-1 that matches them exactly (i.e. f(x_i) = y_i for each data point). That doesn't make it useful for generalizing a pattern (see e.g. this figure, where fitting a polynomial to a bell curve gives huge deviations between the data points).
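To make that concrete, here is a tiny numpy sketch of the same idea (just an illustration, not related to FSRS): interpolate n points taken from a bell curve with a degree n-1 polynomial and compare the error at the points with the error between them.

```python
# Minimal sketch: a degree n-1 polynomial through n points from a bell curve
# matches the points (almost) exactly, but deviates a lot between them.
import numpy as np

n = 10
x = np.linspace(-3, 3, n)
y = np.exp(-x**2)                         # samples of a bell curve

coeffs = np.polyfit(x, y, deg=n - 1)      # degree n-1: enough to hit every point
fit = np.poly1d(coeffs)

x_dense = np.linspace(-3, 3, 1000)
print("error at the sample points:", np.max(np.abs(fit(x) - y)))
print("error between the points:  ", np.max(np.abs(fit(x_dense) - np.exp(-x_dense**2))))
```

The first error is near zero by construction; the second one is what generalization actually cares about.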
So the main points are:
The fewer data points you have, the fewer parameters you can optimize without overfitting.
Most review histories have a relatively small number of data points (especially compared to the billion-parameter models you see in e.g. LLMs or image classifiers).
I've actually trained a neural net with 1k parameters, and another guy on GitHub trained an NN with 9k parameters for our benchmark (GitHub - open-spaced-repetition/srs-benchmark: A benchmark for spaced repetition schedulers/algorithms, LSTM), and it outperforms FSRS, so overfitting isn't that big of an issue. Plus, regularization exists for a reason.
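(For anyone wondering what "regularization" means here: the most common form is just a penalty on large weights, e.g. weight decay, which in PyTorch is a single optimizer argument. Below is a minimal sketch; it is not the srs-benchmark code, and the layer sizes and data are made up.)

```python
# Illustration only (not the srs-benchmark code): L2 regularization via
# weight decay discourages large weights, so a small net trained on a
# small review history tends to generalize instead of memorizing it.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
# weight_decay adds an L2 penalty on the weights during optimization
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(256, 4)                     # stand-in for per-review features
y = torch.randint(0, 2, (256, 1)).float()   # stand-in for recall (1) / forget (0)

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
    loss.backward()
    opt.step()
```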
He also made a neural net with 2.7 million parameters, but that one is different: it's not optimized on the data of every individual user, it's "pretrained" on 5 thousand users and evaluated on the other 5 thousand, and that is repeated twice to cover the entire dataset. I'm actually surprised that it works so well. This is a different approach compared to all other algorithms that we have benchmarked, which are trained on each user individually.
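In other words, it's roughly two-fold cross-validation at the user level. Here's a sketch of the scheme (my simplification; train_fn and eval_fn are placeholders, not functions from the benchmark repo):

```python
# Rough sketch of the split described above (my simplification, not the
# benchmark's actual code): pretrain on one half of the users, evaluate on
# the other half, then swap so that every user is evaluated exactly once.
def two_fold_by_user(users, train_fn, eval_fn):
    half = len(users) // 2
    folds = [(users[:half], users[half:]), (users[half:], users[:half])]
    results = []
    for train_users, test_users in folds:
        model = train_fn(train_users)                  # "pretrain" on ~5k users
        results += [eval_fn(model, u) for u in test_users]
    return results
```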
Bigger issues are:
More parameters = slower optimization.
FSRS isn’t a neural net. I can’t add a thousand parameters by changing one line of code. And the philosophy of FSRS is keeping it interpretable, so even if I made a hybrid neural FSRS, Jarrett wouldn’t approve. Well, I wouldn’t either, unless the benefits were absolutely massive.