You mean Anki's learning steps? If they are short, like 1h or 1d, they may work well with FSRS.
I mean "Learn" vs "Review" (and also "Relearn", I forgot that's a thing)

Is it a problem if we let Anki handle only learning and relearning cards with short steps, and leave review cards to FSRS?
It will make FSRS's predictions worse; we have known that since around FSRS-5.
Yes, looking only at the duration between reviews is much better.
Thanks for letting me know, and sorry for suggesting this method.
We can then go with an automatic reset after a card's state changes from learning to review; it will work well enough and is easy to do.
Making learn/review/relearn an input feature (like interval lengths and grades are right now) for FSRS is one thing. Not letting FSRS see the card's previous review history is… look man, just trust me when I say this, it's not a good idea.
Okay, then let's wait until we get a novel method.
Thanks for your hard work.
Okay, it's great that your analyses would flag this case. I didn't know it was so rare here. In LLMs, this is the case for most parameters, since they also store information about small towns, for example, that is only ever needed by an absolutely tiny number of users who will ever ask about these towns.
I mean, realistically, I won't find myself in that situation. If changing something doesn't affect metrics for 99% of users and makes metrics better for 1% of users, it will look like a reduction of logloss on average, so I will implement it anyway without having to think too hard.
Oversimplified: let's say for users 1 and 2 logloss doesn't change after I added a new formula to FSRS, and for user 3 it changes by -0.003. Then the average is
(0 + 0 - 0.003) / 3 = -0.001, so I will be like "alright, let's keep that new formula".
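The decision rule above can be sketched in a few lines (a toy helper, not FSRS code):

```python
def mean_logloss_delta(deltas):
    """Average per-user change in logloss after a candidate change.

    A negative mean means the change helps on average, even if it
    only affects a small fraction of users.
    """
    return sum(deltas) / len(deltas)

# Users 1 and 2 are unaffected; user 3 improves by 0.003.
deltas = [0.0, 0.0, -0.003]
print(mean_logloss_delta(deltas))  # -> -0.001, so keep the new formula
```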
Btw, it's possible that some change has a very small or zero impact on the average, but improves worst-case performance for, say, 1% or 0.5% of users with the highest logloss while making performance worse for the top users with the lowest logloss, so on average it cancels out. That would make the entire distribution narrower. This is not something that I test right now, and idk if it happens often. I think it probably doesn't: it would require performance on above-average and below-average users to cancel out in a very precise way.
It's pure speculation at this point, but making the distribution of logloss narrower could be a good thing. It means FSRS would be more consistent. Winners win less, but losers lose less, so to speak.
In any case, thanks again so much, expertium, for all the work that goes into all of this.
I plotted the distribution of logloss for different FSRS versions
…it looked better in my head
Vertical lines are averages. Also, the distributions become narrower.
version=FSRS v1, stand. dev.=0.3122, IQR=0.3333
version=FSRS v2, stand. dev.=0.2803, IQR=0.3184
version=FSRS v3, stand. dev.=0.2632, IQR=0.3024
version=FSRS v4, stand. dev.=0.1852, IQR=0.2587
version=FSRS-4.5, stand. dev.=0.1757, IQR=0.2496
version=FSRS-5, stand. dev.=0.1717, IQR=0.2432
version=FSRS-6-recency, stand. dev.=0.1631, IQR=0.2366
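The spread statistics listed above could be computed roughly like this (assuming a dict of per-user logloss arrays; this is a sketch, not the actual benchmark script):

```python
import numpy as np

def spread_stats(logloss_by_version):
    """Standard deviation and interquartile range of per-user logloss
    for each model version; a narrower distribution means more
    consistent performance across users."""
    stats = {}
    for version, losses in logloss_by_version.items():
        losses = np.asarray(losses, dtype=float)
        q1, q3 = np.percentile(losses, [25, 75])
        stats[version] = {"std": losses.std(), "iqr": q3 - q1}
    return stats

# Toy data: a "newer" version with a similar mean but a tighter spread.
rng = np.random.default_rng(0)
data = {
    "old": rng.normal(0.45, 0.30, 5000),
    "new": rng.normal(0.40, 0.16, 5000),
}
for version, s in spread_stats(data).items():
    print(version, round(s["std"], 3), round(s["iqr"], 3))
```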
How can we interpret this graph? Frequency seems to have gone down a lot; is that a good thing?
(I'm assuming logloss getting narrower is certainly a good thing)
Frequency is just "how many users have a logloss of around this much"; don't look at its absolute values, they don't matter.
Averages have been decreasing (vertical lines shifting to the left), that's definitely good. Distributions getting narrower is less-obviously-good-but-still-good: it means fewer users get really bad results and FSRS is getting more consistent.
I was hoping that if I did density estimation, it would look better. Uh, well, not really.
This is basically "how to avoid histograms if you're a nerd". At least here you don't have to worry about the width of the bins, like on the previous graph. And it gives a better sense that the distributions are getting narrower.
Notice that older versions have a thicker right tail. That's bad: it means more people get extremely poor predictions. The right tail should be as thin as possible. Newer versions have a thinner right tail, and that's good.
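The density estimation mentioned above can be sketched with a plain Gaussian kernel density estimate (numpy-only here; scipy.stats.gaussian_kde does the same with more options, and the toy gamma-distributed "losses" are only a stand-in for real per-user logloss data):

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Gaussian kernel density estimate with Scott's-rule bandwidth.
    Unlike a histogram, there is no bin width to choose."""
    samples = np.asarray(samples, dtype=float)
    h = samples.std(ddof=1) * len(samples) ** (-0.2)  # Scott's rule, 1-D
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
losses = rng.gamma(shape=4.0, scale=0.1, size=2000)  # right-skewed, like logloss
grid = np.linspace(0.0, 1.5, 300)
density = gaussian_kde_1d(losses, grid)

# The curve integrates to ~1, so absolute y-axis values ("frequency")
# carry no meaning on their own -- only the shape of the curve does.
area = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid))
print(round(float(area), 3))
```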
That is a solid improvement for all users. Very reassuring. Thank you!
Why not just do a test?
I will benchmark this change:
new_d = torch.where(short_term, state[:, 1], self.next_d(state, X[:, 1]))
For short-term reviews, the difficulty will not change.
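torch.where picks elementwise between two tensors based on a boolean mask, so the one-liner keeps the old difficulty exactly where short_term is true. A toy illustration (the tensor values and the assumption that column 1 of state holds difficulty are made up for the example, not taken from the benchmark code):

```python
import torch

# state[:, 1] holds each card's current difficulty;
# short_term flags same-day (short-term) reviews.
state = torch.tensor([[2.0, 5.0],
                      [3.0, 7.0]])
short_term = torch.tensor([True, False])
next_d = torch.tensor([4.5, 6.5])  # stand-in for self.next_d(state, X[:, 1])

# For short-term reviews keep the old difficulty; otherwise update it.
new_d = torch.where(short_term, state[:, 1], next_d)
print(new_d)  # tensor([5.0000, 6.5000])
```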
If you are willing to share your collection with me, I could evaluate the change on your data.
The difference is negligible:
Model: FSRS-6-dev
Total number of users: 2061
Total number of reviews: 68679005
Weighted average by reviews:
FSRS-6-dev LogLoss (mean±std): 0.3334±0.1511
FSRS-6-dev RMSE(bins) (mean±std): 0.0474±0.0299
FSRS-6-dev AUC (mean±std): 0.7082±0.0820
Weighted average by log(reviews):
FSRS-6-dev LogLoss (mean±std): 0.3496±0.1612
FSRS-6-dev RMSE(bins) (mean±std): 0.0626±0.0396
FSRS-6-dev AUC (mean±std): 0.7051±0.0876
Weighted average by users:
FSRS-6-dev LogLoss (mean±std): 0.3515±0.1630
FSRS-6-dev RMSE(bins) (mean±std): 0.0649±0.0407
FSRS-6-dev AUC (mean±std): 0.7044±0.0895
parameters: [0.2116, 1.0897, 2.9447, 12.7109, 6.5001, 0.7207, 3.0567, 0.0142, 1.7844, 0.1558, 0.7581, 1.5011, 0.0523, 0.3266, 1.7133, 0.3781, 1.9568, 0.7399, 0.1184, 0.1267, 0.1799]
Model: FSRS-6
Total number of users: 2061
Total number of reviews: 68679005
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3333±0.1509
FSRS-6 RMSE(bins) (mean±std): 0.0475±0.0299
FSRS-6 AUC (mean±std): 0.7081±0.0821
Weighted average by log(reviews):
FSRS-6 LogLoss (mean±std): 0.3496±0.1610
FSRS-6 RMSE(bins) (mean±std): 0.0627±0.0396
FSRS-6 AUC (mean±std): 0.7048±0.0880
Weighted average by users:
FSRS-6 LogLoss (mean±std): 0.3515±0.1629
FSRS-6 RMSE(bins) (mean±std): 0.0650±0.0407
FSRS-6 AUC (mean±std): 0.7041±0.0899
parameters: [0.2122, 1.0908, 2.9459, 12.7045, 6.4391, 0.679, 3.0999, 0.0213, 1.8084, 0.1802, 0.7802, 1.496, 0.0565, 0.3234, 1.7089, 0.3869, 1.9502, 0.7046, 0.1261, 0.1282, 0.1813]
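The three weighting schemes in the tables above (by reviews, by log(reviews), by users) can be reproduced with a small helper (toy per-user numbers; not the benchmark's actual code):

```python
import math

def weighted_mean(values, weights):
    """Weighted average of a per-user metric."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Toy per-user data: (logloss, number of reviews).
users = [(0.30, 100_000), (0.40, 1_000), (0.50, 100)]
logloss = [u[0] for u in users]
reviews = [u[1] for u in users]

by_reviews = weighted_mean(logloss, reviews)             # heavy users dominate
by_log_reviews = weighted_mean(logloss, [math.log(r) for r in reviews])
by_users = weighted_mean(logloss, [1.0] * len(users))    # plain mean

print(by_reviews, by_log_reviews, by_users)
```

Weighting by log(reviews) sits between the other two: heavy users still count for more, but not overwhelmingly so.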
Thank you, though I will still benchmark it myself using FSRS-7. I wonder if results will be similar. FSRS-7 uses fractional interval lengths instead of integer interval lengths, so maybe. And more importantly, FSRS-7 predicts p(recall) for same-day reviews.

