FSRS >= 5 is a significant regression for the minority of users who press Again a lot on new and learning cards

You mean Anki learning steps? If they are short, like 1h 1d, maybe they work well with FSRS

I mean “Learn” vs “Review” (and also “Relearn”, I forgot that’s a thing)

3 Likes

Would it be a problem if we let Anki handle only learning and relearning cards with short steps, and leave review cards to FSRS?

It will make FSRS’s predictions worse; we have known that since around FSRS-5.

2 Likes

Yes, looking only at the duration between reviews is much better.

2 Likes

Thanks for letting me know; sorry for suggesting this method.
Then we can go with an automatic reset when the card’s state changes from learning to review; it will work well enough and is easy to do.

Making learn/review/relearn an input feature (like interval lengths and grades are right now) for FSRS is one thing. Not letting FSRS see the card’s previous review history is…look man, just trust me when I say this, it’s not a good idea.

Okay, then let’s wait until we get a novel method.
Thanks for your hard work.

2 Likes

Okay, it’s great that your analyses would flag this case. I didn’t know it was so rare here. In LLMs, this is the case for most parameters, since they also store information about small towns, for example, that is only ever needed by an absolutely tiny number of users who will ever ask about these towns.

I mean, realistically, I won’t find myself in that situation. If changing something doesn’t affect metrics for 99% of users and makes metrics better for 1% of users, it will look like a reduction of logloss on average, so I will implement it anyway without having to think too hard.
Oversimplified: let’s say for users 1 and 2 logloss doesn’t change after I added a new formula to FSRS, and for user 3 it changes by -0.003. Then the average is
(0 + 0 - 0.003)/3 = -0.001, so I will be like “alright, let’s keep that new formula”.
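The toy arithmetic above, as a quick sketch (the per-user deltas are the hypothetical numbers from the example, not real benchmark output):

```python
# Hypothetical per-user changes in logloss after adding a new formula:
# users 1 and 2 are unaffected, user 3 improves by 0.003.
deltas = [0.0, 0.0, -0.003]

# The average delta across users decides whether the change is kept.
avg_delta = sum(deltas) / len(deltas)  # ≈ -0.001, so the formula stays
```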

Btw, it’s possible that some change has a very small or 0 impact on the average, but it improves worst-case performance for, say, 1% or 0.5% of users with the highest logloss and makes performance worse for top users with the lowest logloss, so on average it cancels out. So it makes the entire distribution more narrow. This is not something that I test right now, and idk if that happens often. I think no, it probably doesn’t happen very often. It would require performance on above-average and below-average users to cancel out in a very precise way.
It’s pure speculation at this point, but making the distribution of logloss more narrow could be a good thing. It means FSRS would be more consistent. Winners win less, but losers lose less, so to speak.

4 Likes

In any case, thanks again so much, expertium, for all the work that goes into all of this.

2 Likes

I plotted the distribution of logloss for different FSRS versions
…it looked better in my head


Vertical lines are averages. Also, the distributions become narrower:

version=FSRS v1, stand. dev.=0.3122, IQR=0.3333
version=FSRS v2, stand. dev.=0.2803, IQR=0.3184
version=FSRS v3, stand. dev.=0.2632, IQR=0.3024
version=FSRS v4, stand. dev.=0.1852, IQR=0.2587
version=FSRS-4.5, stand. dev.=0.1757, IQR=0.2496
version=FSRS-5, stand. dev.=0.1717, IQR=0.2432
version=FSRS-6-recency, stand. dev.=0.1631, IQR=0.2366
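For reference, the two spread measures listed above can be computed from per-user loglosses along these lines — a minimal sketch on synthetic data, not the actual benchmark code:

```python
import numpy as np

# Synthetic, right-skewed stand-in for one version's per-user loglosses.
rng = np.random.default_rng(0)
loglosses = rng.gamma(shape=4.0, scale=0.09, size=2000)

# Standard deviation and interquartile range of the distribution.
std = loglosses.std()
q1, q3 = np.percentile(loglosses, [25, 75])
iqr = q3 - q1
print(f"stand. dev.={std:.4f}, IQR={iqr:.4f}")
```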

4 Likes

How can we interpret this graph? Frequency seems to have gone down a lot; is that a good thing?

(I’m assuming logloss getting narrower is certainly a good thing)

Frequency is just “how many users have a logloss of around this much”; don’t look at its absolute values, they don’t matter.
Averages have been decreasing (vertical lines shifting to the left), that’s definitely good. Distributions getting narrower is less-obviously-good-but-still-good, it means fewer users get really bad results and FSRS is getting more consistent.

3 Likes

I was hoping that if I do density estimation, it would look better. Uh, well, not really

This is basically “how to avoid histograms if you’re a nerd”. At least here you don’t have to worry about the width of the bins, like on the previous graph. And it gives a better feel for how the distributions are getting narrower.
Notice that older versions have a thicker right tail. That’s bad: it means more people get extremely poor predictions. The right tail should be as thin as possible. Newer versions have a thinner right tail, which is good.
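A density estimate like the one plotted can be built by hand in a few lines — a sketch on synthetic data (a Gaussian kernel per data point, averaged), with a fixed bandwidth I picked arbitrarily:

```python
import numpy as np

def gaussian_kde_1d(sample, xs, bandwidth):
    """Kernel density estimate: one Gaussian bump per data point, averaged."""
    diffs = (xs[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(sample) * bandwidth)

# Synthetic, right-skewed stand-in for per-user loglosses.
rng = np.random.default_rng(1)
loglosses = rng.gamma(shape=4.0, scale=0.09, size=2000)

xs = np.linspace(-0.5, 2.0, 500)
density = gaussian_kde_1d(loglosses, xs, bandwidth=0.03)
```

Real KDE implementations (e.g. SciPy's `gaussian_kde`) choose the bandwidth automatically, which is what removes the bin-width worry mentioned above.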

4 Likes

That is solid improvement for all users. Very reassuring. Thank you!

1 Like

Why not just do a test?

I will benchmark this change:

new_d = torch.where(short_term, state[:, 1], self.next_d(state, X[:, 1]))

For short-term reviews, the difficulty will not change.
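The `torch.where` in that snippet selects, per card, either the old difficulty or the freshly computed one. A NumPy analog of the same selection (the state layout — difficulty in column 1 — is taken from the snippet; all numbers are made up, and `next_d` stands in for `self.next_d(state, X[:, 1])`):

```python
import numpy as np

# Hypothetical per-card state: column 1 holds the card's current difficulty.
state = np.array([[10.0, 5.2],
                  [ 3.0, 7.1],
                  [ 1.0, 4.0]])
next_d = np.array([5.5, 6.8, 4.3])          # stand-in for self.next_d(...)
short_term = np.array([True, False, True])  # is this a same-day review?

# Where the review is short-term, keep the old difficulty unchanged;
# otherwise take the newly computed one.
new_d = np.where(short_term, state[:, 1], next_d)
```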

2 Likes

If you are willing to share your collection with me, I could evaluate the change with your case.

2 Likes

@Expertium

The difference is negligible:

Model: FSRS-6-dev
Total number of users: 2061
Total number of reviews: 68679005
Weighted average by reviews:
FSRS-6-dev LogLoss (mean±std): 0.3334±0.1511
FSRS-6-dev RMSE(bins) (mean±std): 0.0474±0.0299
FSRS-6-dev AUC (mean±std): 0.7082±0.0820

Weighted average by log(reviews):
FSRS-6-dev LogLoss (mean±std): 0.3496±0.1612
FSRS-6-dev RMSE(bins) (mean±std): 0.0626±0.0396
FSRS-6-dev AUC (mean±std): 0.7051±0.0876

Weighted average by users:
FSRS-6-dev LogLoss (mean±std): 0.3515±0.1630
FSRS-6-dev RMSE(bins) (mean±std): 0.0649±0.0407
FSRS-6-dev AUC (mean±std): 0.7044±0.0895

parameters: [0.2116, 1.0897, 2.9447, 12.7109, 6.5001, 0.7207, 3.0567, 0.0142, 1.7844, 0.1558, 0.7581, 1.5011, 0.0523, 0.3266, 1.7133, 0.3781, 1.9568, 0.7399, 0.1184, 0.1267, 0.1799]

Model: FSRS-6
Total number of users: 2061
Total number of reviews: 68679005
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3333±0.1509
FSRS-6 RMSE(bins) (mean±std): 0.0475±0.0299
FSRS-6 AUC (mean±std): 0.7081±0.0821

Weighted average by log(reviews):
FSRS-6 LogLoss (mean±std): 0.3496±0.1610
FSRS-6 RMSE(bins) (mean±std): 0.0627±0.0396
FSRS-6 AUC (mean±std): 0.7048±0.0880

Weighted average by users:
FSRS-6 LogLoss (mean±std): 0.3515±0.1629
FSRS-6 RMSE(bins) (mean±std): 0.0650±0.0407
FSRS-6 AUC (mean±std): 0.7041±0.0899

parameters: [0.2122, 1.0908, 2.9459, 12.7045, 6.4391, 0.679, 3.0999, 0.0213, 1.8084, 0.1802, 0.7802, 1.496, 0.0565, 0.3234, 1.7089, 0.3869, 1.9502, 0.7046, 0.1261, 0.1282, 0.1813]
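The three weighting schemes in the tables above can be sketched like this (per-user numbers are made up; only the weighting logic is the point):

```python
import numpy as np

# Hypothetical per-user loglosses and review counts for three users.
logloss = np.array([0.30, 0.40, 0.25])
reviews = np.array([100_000, 5_000, 500])

by_reviews = np.average(logloss, weights=reviews)              # heavy reviewers dominate
by_log_reviews = np.average(logloss, weights=np.log(reviews))  # compressed weights
by_users = logloss.mean()                                      # every user counts equally
```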
5 Likes

Thank you, though I will still benchmark it myself using FSRS-7. I wonder if the results will be similar. FSRS-7 uses fractional interval lengths instead of integer interval lengths, so maybe. And more importantly, FSRS-7 predicts p(recall) for same-day reviews.

3 Likes