Alright, I benchmarked adding another parameter so that D is updated less during same-day reviews. It made results worse.
I’ll try disabling D updates for same-day reviews entirely, though if updating D less already makes results worse, not updating it at all will almost certainly make them worse too.
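For context, here is a minimal sketch of what “updating D less during same-day reviews” means, assuming the difficulty-update formula used by recent FSRS versions (ΔD = −w6·(G − 3), with linear damping and mean reversion toward D0(4)); the extra damping parameter `w_same_day` is hypothetical and not part of released FSRS-6:

```python
import math

def next_difficulty(d: float, rating: int, elapsed_days: float,
                    w: list[float], w_same_day: float = 1.0) -> float:
    """FSRS-style difficulty update with a hypothetical extra factor that
    damps D changes on same-day reviews (w_same_day=1.0 reproduces the
    unmodified update, w_same_day=0.0 freezes D on same-day reviews)."""
    delta_d = -w[6] * (rating - 3)
    if elapsed_days < 1:  # same-day review
        delta_d *= w_same_day
    # Linear damping: changes shrink as D approaches its maximum of 10
    d_damped = d + delta_d * (10 - d) / 9
    # Mean reversion toward D0(4), the initial difficulty after an Easy first rating
    d0_easy = w[4] - math.exp(w[5] * 3) + 1
    d_next = w[7] * d0_easy + (1 - w[7]) * d_damped
    return min(max(d_next, 1.0), 10.0)
```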
Btw, you’d be surprised, but removing that parameter (or setting it to 0 permanently) improves results (at least with FSRS-6, I haven’t tested it with FSRS-7).
It seems that there is a fundamental mismatch between how people expect D to behave and what behavior of D results in more accurate predictions.
So that parameter increases intervals but lowers the accuracy.
I tested it just now: RMSE of 7% instead of 8.5% after setting that parameter to 0. A 1.5%+ difference is not a big deal compared to getting longer intervals between cards, especially on cards with multiple Agains.
Maybe in some cases there will be a huge difference in RMSE, but I still prefer longer intervals over accuracy, because it means less work.
You mean RMSE? LogLoss is not a percentage. If you are using “Evaluate”, look at the number that is not a percentage.
Weird thing I noticed when I tried this earlier: for some collections, the optimizer actually preferred difficulty to change *more* from same-day reviews when I clamped the parameter to (0, 2).
FSRS-6 with the change (the last parameter is the short-term multiplier):
{"metrics": {"RMSE": 0.457292, "LogLoss": 0.607761, "RMSE(bins)": 0.043811, "ICI": 0.030133, "AUC": 0.617244}, "user": 21, "size": 25240, "parameters": {"0": [0.1269, 0.2151, 0.2282, 0.3773, 6.7776, 0.4657, 2.6753, 0.0096, 1.38, 0.1329, 0.3798, 1.5245, 0.0715, 0.3655, 1.944, 0.2396, 1.5716, 0.138, 0.0006, 0.0045, 0.2339, 1.4005]}}
{"metrics": {"RMSE": 0.275367, "LogLoss": 0.275673, "RMSE(bins)": 0.068607, "ICI": 0.015878, "AUC": 0.708023}, "user": 22, "size": 2650, "parameters": {"0": [0.8131, 1.9299, 1.4357, 10.2695, 6.5831, 0.6632, 3.0603, 0.02, 1.7465, 0.1873, 0.6817, 1.4881, 0.0444, 0.2384, 1.74, 0.6693, 1.6788, 0.4969, 0.2268, 0.2783, 0.3484, 0.7606]}}
FSRS-6
{"metrics": {"RMSE": 0.457351, "LogLoss": 0.60792, "RMSE(bins)": 0.044631, "ICI": 0.030507, "AUC": 0.616364}, "user": 21, "size": 25240, "parameters": {"0": [0.1263, 0.2143, 0.2272, 0.3763, 6.7978, 0.5124, 2.7831, 0.0089, 1.3768, 0.1246, 0.3738, 1.523, 0.0721, 0.3651, 1.9492, 0.2408, 1.5551, 0.1422, 0.0007, 0.0053, 0.234]}}
{"metrics": {"RMSE": 0.275343, "LogLoss": 0.275726, "RMSE(bins)": 0.068582, "ICI": 0.016181, "AUC": 0.707729}, "user": 22, "size": 2650, "parameters": {"0": [0.8125, 1.9248, 1.4359, 10.2709, 6.5785, 0.6628, 3.0769, 0.0226, 1.7433, 0.1902, 0.6783, 1.4876, 0.0439, 0.2378, 1.7383, 0.6698, 1.6755, 0.4957, 0.2264, 0.278, 0.3471]}}
It ended up having no effect on the metrics at all.
Model: FSRS-6-dev
Total number of users: 235
Total number of reviews: 6315625
Weighted average by reviews:
FSRS-6-dev LogLoss (mean±std): 0.3323±0.1576
FSRS-6-dev RMSE(bins) (mean±std): 0.0547±0.0367
FSRS-6-dev AUC (mean±std): 0.7026±0.0677
Weighted average by log(reviews):
FSRS-6-dev LogLoss (mean±std): 0.3525±0.1685
FSRS-6-dev RMSE(bins) (mean±std): 0.0652±0.0390
FSRS-6-dev AUC (mean±std): 0.7016±0.0836
Weighted average by users:
FSRS-6-dev LogLoss (mean±std): 0.3542±0.1699
FSRS-6-dev RMSE(bins) (mean±std): 0.0665±0.0393
FSRS-6-dev AUC (mean±std): 0.7006±0.0867
parameters: [0.1992, 1.1353, 2.899, 12.7241, 6.4456, 0.6845, 3.1086, 0.0216, 1.8317, 0.1873, 0.7779, 1.4912, 0.0581, 0.3191, 1.6989, 0.4104, 1.9663, 0.7013, 0.1209, 0.1148, 0.183, 1.0788]
Model: FSRS-6
Total number of users: 235
Total number of reviews: 6315625
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3323±0.1577
FSRS-6 RMSE(bins) (mean±std): 0.0546±0.0367
FSRS-6 AUC (mean±std): 0.7027±0.0679
Weighted average by log(reviews):
FSRS-6 LogLoss (mean±std): 0.3525±0.1685
FSRS-6 RMSE(bins) (mean±std): 0.0651±0.0390
FSRS-6 AUC (mean±std): 0.7017±0.0836
Weighted average by users:
FSRS-6 LogLoss (mean±std): 0.3542±0.1699
FSRS-6 RMSE(bins) (mean±std): 0.0665±0.0393
FSRS-6 AUC (mean±std): 0.7007±0.0867
parameters: [0.1981, 1.1208, 2.899, 12.7241, 6.4435, 0.6813, 3.1037, 0.0217, 1.8285, 0.1844, 0.7759, 1.491, 0.0579, 0.319, 1.6961, 0.4095, 1.9617, 0.7046, 0.1173, 0.1175, 0.18]
I benchmarked not updating D if the interval is <24 hours. It made results worse.
I’ll continue experimenting with D (and with FSRS-7 overall) to see if I can improve it somehow.
First, a test whose positive result would confirm my hypothesis: please test excluding all “Agains” that occurred before Hard, Good, or Easy was pressed for the first time.
Now the explanation: I’ve found at least one cause of the problem discussed in this thread: I, but surely other users as well, have been giving FSRS bad information. I think the main cause lies in Anki’s user interface. This affects Again gradings of a card before the user has pressed one of the three pass buttons for the first time. These Agains are bad data because they can mean very different things. For me, with non-trivial card answers, pressing Again has often meant: I’ve memorized a bit more (“I packed my bag” on Wikipedia) and want to be tested on that bit as soon as my working memory (“The Magical Number Seven, Plus or Minus Two” on Wikipedia) has been overwritten by working on a different card. Pressing the “Again” button to express this wish seems logical. But I haven’t really memorized the rest of the card yet!
The requirement for FSRS must be that the user can completely reproduce the card’s answer, at least directly after looking at it, because stability is only defined for that case! If the user couldn’t reproduce the answer at the beginning of an interval, a grading after that interval is misleading data.
It doesn’t work well.
$ python evaluate.py --fast
Model: FSRS-6-dev
Total number of users: 2520
Total number of reviews: 85813985
Weighted average by reviews:
FSRS-6-dev LogLoss (mean±std): 0.3373±0.1538
FSRS-6-dev RMSE(bins) (mean±std): 0.0494±0.0303
FSRS-6-dev AUC (mean±std): 0.6909±0.0799
Weighted average by log(reviews):
FSRS-6-dev LogLoss (mean±std): 0.3554±0.1671
FSRS-6-dev RMSE(bins) (mean±std): 0.0662±0.0411
FSRS-6-dev AUC (mean±std): 0.6838±0.0852
Weighted average by users:
FSRS-6-dev LogLoss (mean±std): 0.3576±0.1694
FSRS-6-dev RMSE(bins) (mean±std): 0.0687±0.0424
FSRS-6-dev AUC (mean±std): 0.6824±0.0875
parameters: [0.15935, 0.86735, 2.05955, 12.807, 6.43815, 0.6355, 3.1391, 0.0275, 1.7619, 0.19505, 0.74155, 1.4732, 0.0553, 0.31065, 1.68385, 0.37005, 1.9603, 0.71675, 0.0472, 0.1285, 0.16915]
Model: FSRS-6
Total number of users: 2520
Total number of reviews: 85813985
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3310±0.1506
FSRS-6 RMSE(bins) (mean±std): 0.0471±0.0291
FSRS-6 AUC (mean±std): 0.7069±0.0817
Weighted average by log(reviews):
FSRS-6 LogLoss (mean±std): 0.3460±0.1614
FSRS-6 RMSE(bins) (mean±std): 0.0624±0.0393
FSRS-6 AUC (mean±std): 0.7044±0.0876
Weighted average by users:
FSRS-6 LogLoss (mean±std): 0.3479±0.1633
FSRS-6 RMSE(bins) (mean±std): 0.0647±0.0406
FSRS-6 AUC (mean±std): 0.7033±0.0900
parameters: [0.21315, 1.09365, 3.0333, 12.9575, 6.43915, 0.67735, 3.0969, 0.0216, 1.80755, 0.17895, 0.7807, 1.4956, 0.0569, 0.3254, 1.70705, 0.3918, 1.95725, 0.70325, 0.126, 0.12735, 0.1826]
Here is the change:
```python
        # Create tensor features with shape (sequence_length, 2);
        # each row contains [time_interval, rating] for that step.
        # Exclude "Again" ratings that occurred before the first
        # Hard/Good/Easy rating.
        df["tensor"] = [
            self._create_filtered_tensor(t_item[:-1], r_item[:-1])
            for t_sublist, r_sublist in zip(t_history_list, r_history_list)
            for t_item, r_item in zip(t_sublist, r_sublist)
        ]
        return df

    def _create_filtered_tensor(self, t_history: list, r_history: list) -> torch.Tensor:
        """
        Create a tensor while excluding "Again" ratings (rating=1) that
        occurred before the first Hard/Good/Easy rating (rating=2, 3, or 4).
        """
        if not t_history or not r_history:
            return torch.tensor(([], []), dtype=torch.float32).transpose(0, 1)
        # Find the first success: the first rating of 2/3/4, or the first
        # review with a non-zero interval. If neither exists, the index
        # stays at 0 and every rating is kept.
        first_short_term_success_idx = 0
        for i, (time_interval, rating) in enumerate(zip(t_history, r_history)):
            if rating in [2, 3, 4] or time_interval > 0:
                first_short_term_success_idx = i
                break
        # Filter out "Again" ratings before the first short-term success
        filtered_t = []
        filtered_r = []
        for i, (time_interval, rating) in enumerate(zip(t_history, r_history)):
            # Keep all non-"Again" ratings, and "Again" ratings only at or
            # after the first success
            if rating != 1 or i >= first_short_term_success_idx:
                filtered_t.append(time_interval)
                filtered_r.append(rating)
        # Create the tensor from the filtered data
        if filtered_t and filtered_r:
            return torch.tensor(
                (filtered_t, filtered_r), dtype=torch.float32
            ).transpose(0, 1)
        else:
            return torch.tensor(([], []), dtype=torch.float32).transpose(0, 1)
```
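A quick sanity check of the filter on a toy history (`dataset` here stands for whatever object owns the method above; the expected output is worked out by hand):

```python
# Two same-day Agains, then Good, then a next-day Good.
t_history = [0, 0, 0, 1]   # elapsed days before each review
r_history = [1, 1, 3, 3]   # 1 = Again, 3 = Good

print(dataset._create_filtered_tensor(t_history, r_history))
# tensor([[0., 3.],
#         [1., 3.]])
# The two leading Agains are dropped; everything from the first
# Hard/Good/Easy press onward is kept.
```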
Thank you for testing this, Jarrett!
My proposal was too radical. Ignoring the initial Agains completely for all users is overkill. Cards with a lot of Agains at first are more difficult, but for some users not as difficult as FSRS-6 predicts, especially for users who learn “I pack my bag”-style. I propose inserting a parameter which determines how much influence the initial Agains have.
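To make “how much influence” concrete, here is one possible shape for such a parameter, purely as a sketch: instead of hard-dropping the initial Agains as in the patch above, scale the difficulty change they cause by a new parameter `w_init_again` in [0, 1] (the name and the placement are both hypothetical):

```python
def delta_difficulty(rating: int, w6: float,
                     seen_first_success: bool,
                     w_init_again: float = 1.0) -> float:
    """Difficulty change for one review. Agains that precede the first
    Hard/Good/Easy press are scaled by w_init_again: 0 reproduces the
    exclusion experiment above, 1 reproduces current FSRS behavior."""
    delta_d = -w6 * (rating - 3)
    if rating == 1 and not seen_first_success:
        delta_d *= w_init_again
    return delta_d
```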
There kind of is one already. The formula for calculating initial D (after the very first review) is different from the formula used for all subsequent D updates.
OK, but the problem is the many Agains (in extreme cases 10 or even more for a long list), which drive the difficulty much too high.
But actually, modifying FSRS only mitigates this problem. Anki could provide an option to not even log Again presses which occur before the first pass grading.
But Jarrett already tested it; it makes FSRS perform worse.
Well, I guess I didn’t describe well enough what I meant. And maybe the feature is too niche to ever make it into Anki.
The alternative to my idea is to teach the users who have the problem in this thread that they shouldn’t press “Again” if they never knew the full answer. How does stability make any sense if you didn’t know the answer at the start of the interval?
I will try again to describe it: if you remove *all* initial Agains, it performs worse, yes. That is because most cards of most users are probably not that difficult; they are vocabulary or other short answers. On those, not remembering the answer right away really does mean that the card is more difficult.
But if you play “I pack my bag” with long, difficult cards, the first “Agains” are not real gradings because you never knew the full answer.
But I give up, because I have the feeling that this does not matter for most cards of most users. I will stick with the workarounds.
Thanks for listening, testing, your feedback, etc!
It is not a trivial change. We should figure out how to add the new parameter to FSRS.
And it would help us make progress on your case if we had a specific dataset to evaluate these kinds of changes.
I think real data of this will be hard to obtain, because users who learn like this will probably be annoyed by the short intervals and use one of the workarounds or turn off FSRS.
But now that I think about it, wouldn’t data with the reset workaround work?
I think the forgetting curve for lists is significantly different from that of short answers. I don’t have a solution, but it doesn’t surprise me at all anymore that FSRS is only of limited use for lists, because the test data probably consists mostly of short answers, e.g., vocabulary.
I’ll write down my thoughts, but note that there is empirical research on this in psychology (see Google Scholar).
If a card’s answer contains a list, you can think of each item as its own card. In particular, each item has its own retrievability probability of less than one. The retrievability of the entire card is the product of all of these and is therefore much lower, e.g., with a .95 retrievability for the individual items on a ten-item list: .95^10 ≈ .60.
With a lower retrievability for the individual items, the retrievability of the list drops radically:
.7^10 ≈ .03
Below a per-item retrievability of 0.7, the total retrievability is practically zero.
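A quick check of the arithmetic (independent recall of the items is the assumption here):

```python
# Retrievability of a whole list as the product of per-item
# retrievabilities, assuming the items are recalled independently.
def list_retrievability(r_item: float, n_items: int) -> float:
    return r_item ** n_items

print(round(list_retrievability(0.95, 10), 2))  # 0.6
print(round(list_retrievability(0.70, 10), 2))  # 0.03
```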
The following research seems highly relevant to me on first look:
https://pubs.aip.org/aip/jmp/article/63/7/073303/2843660
“Cognitive psychologists developed experimental paradigms involving randomly collected lists of items that make possible quantitative measures of performance in memory tasks, such as recall and recognition. Here, we describe a set of mathematical models designed to predict the results of these experiments. The models are based on simple underlying assumptions and surprisingly agree with experimental results quite well, in addition to that they exhibit quite interesting mathematical behavior that can partially be understood analytically.”
J. Math. Phys. 63, 073303 (2022)
https://doi.org/10.1063/5.0088823
I found this through the Google Scholar search “Mathematical models of memory decay”.
And this exponentiation of the retrievability predicts exactly what this thread is about: at first it takes many Agains to drive difficulty up, but at a certain point the list’s retrievability rises steeply, recall is very easy, and the intervals at least seem too short.
I think the retrievability of lists is much more difficult to predict because the curve is so much steeper.
Example for a 10-item list:
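A small sketch (assuming independent items, as above) tabulating r_list = r_item^10 to show how steep the transition is:

```python
# Retrievability of a 10-item list as the per-item retrievability drops:
# the fall-off is far steeper than any single item's forgetting curve.
for r_item in [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70]:
    print(f"r_item = {r_item:.2f}  ->  r_list = {r_item ** 10:.3f}")
# r_item = 0.99  ->  r_list = 0.904
# r_item = 0.97  ->  r_list = 0.737
# r_item = 0.95  ->  r_list = 0.599
# r_item = 0.90  ->  r_list = 0.349
# r_item = 0.85  ->  r_list = 0.197
# r_item = 0.80  ->  r_list = 0.107
# r_item = 0.70  ->  r_list = 0.028
```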
@Helge, in his last comment, Jarrett expressed the need for some data to test changes. I believe that you are an affected user. So, it would help if you could share your data for testing purposes.
If privacy is a concern, you can go to Tools -> FSRS Helper -> Export Dataset for Research and share the generated file here as a Google Drive link. The file doesn’t contain any personal information or the text of your cards.
