Alright, I benchmarked adding another parameter so that D is updated less during same-day reviews. It made results worse.
I’ll try disabling D updates for same-day reviews entirely, though if updating D less already makes results worse, not updating it at all will almost certainly make them worse too.
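For context, here is a minimal sketch of what “updating D less during same-day reviews” means, assuming the difficulty-update formula used by recent FSRS versions (ΔD = −w6·(G − 3), with linear damping and mean reversion toward D0(4)); the extra damping parameter `w_same_day` is hypothetical and not part of released FSRS-6:

```python
import math

def next_difficulty(d: float, rating: int, elapsed_days: float,
                    w: list[float], w_same_day: float = 1.0) -> float:
    """FSRS-style difficulty update with a hypothetical extra factor that
    damps D changes on same-day reviews (w_same_day=1.0 reproduces the
    unmodified update, w_same_day=0.0 freezes D on same-day reviews)."""
    delta_d = -w[6] * (rating - 3)
    if elapsed_days < 1:  # same-day review
        delta_d *= w_same_day
    # Linear damping: changes shrink as D approaches its maximum of 10
    d_damped = d + delta_d * (10 - d) / 9
    # Mean reversion toward D0(4), the initial difficulty after an Easy first rating
    d0_easy = w[4] - math.exp(w[5] * 3) + 1
    d_next = w[7] * d0_easy + (1 - w[7]) * d_damped
    return min(max(d_next, 1.0), 10.0)
```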
Btw, you’d be surprised, but removing that parameter (or setting it to 0 permanently) improves results (at least with FSRS-6, I haven’t tested it with FSRS-7).
It seems that there is a fundamental mismatch between how people expect D to behave and what behavior of D results in more accurate predictions.
So that parameter increases intervals but lowers the accuracy.
I tested it just now: RMSE of 7% instead of 8.5% after setting that parameter to 0. A 1.5%+ difference is not a big deal compared to getting longer intervals between cards, especially on cards with multiple Agains.
Maybe in some cases there will be a huge difference in RMSE, but I still prefer longer intervals over accuracy, because it means less work.
You mean RMSE? LogLoss is not a percentage. If you are using “Evaluate”, look at the number that is not a percentage.
Weird thing I noticed when I tried this earlier: for some collections, the optimizer actually preferred difficulty to change *more* from same-day reviews when I clamped the parameter to (0, 2).
FSRS-6 with the change (the last parameter is the short-term multiplier):
{"metrics": {"RMSE": 0.457292, "LogLoss": 0.607761, "RMSE(bins)": 0.043811, "ICI": 0.030133, "AUC": 0.617244}, "user": 21, "size": 25240, "parameters": {"0": [0.1269, 0.2151, 0.2282, 0.3773, 6.7776, 0.4657, 2.6753, 0.0096, 1.38, 0.1329, 0.3798, 1.5245, 0.0715, 0.3655, 1.944, 0.2396, 1.5716, 0.138, 0.0006, 0.0045, 0.2339, 1.4005]}}
{"metrics": {"RMSE": 0.275367, "LogLoss": 0.275673, "RMSE(bins)": 0.068607, "ICI": 0.015878, "AUC": 0.708023}, "user": 22, "size": 2650, "parameters": {"0": [0.8131, 1.9299, 1.4357, 10.2695, 6.5831, 0.6632, 3.0603, 0.02, 1.7465, 0.1873, 0.6817, 1.4881, 0.0444, 0.2384, 1.74, 0.6693, 1.6788, 0.4969, 0.2268, 0.2783, 0.3484, 0.7606]}}
FSRS-6
{"metrics": {"RMSE": 0.457351, "LogLoss": 0.60792, "RMSE(bins)": 0.044631, "ICI": 0.030507, "AUC": 0.616364}, "user": 21, "size": 25240, "parameters": {"0": [0.1263, 0.2143, 0.2272, 0.3763, 6.7978, 0.5124, 2.7831, 0.0089, 1.3768, 0.1246, 0.3738, 1.523, 0.0721, 0.3651, 1.9492, 0.2408, 1.5551, 0.1422, 0.0007, 0.0053, 0.234]}}
{"metrics": {"RMSE": 0.275343, "LogLoss": 0.275726, "RMSE(bins)": 0.068582, "ICI": 0.016181, "AUC": 0.707729}, "user": 22, "size": 2650, "parameters": {"0": [0.8125, 1.9248, 1.4359, 10.2709, 6.5785, 0.6628, 3.0769, 0.0226, 1.7433, 0.1902, 0.6783, 1.4876, 0.0439, 0.2378, 1.7383, 0.6698, 1.6755, 0.4957, 0.2264, 0.278, 0.3471]}}
It ended up having no effect on the metrics at all.
Model: FSRS-6-dev
Total number of users: 235
Total number of reviews: 6315625
Weighted average by reviews:
FSRS-6-dev LogLoss (mean±std): 0.3323±0.1576
FSRS-6-dev RMSE(bins) (mean±std): 0.0547±0.0367
FSRS-6-dev AUC (mean±std): 0.7026±0.0677
Weighted average by log(reviews):
FSRS-6-dev LogLoss (mean±std): 0.3525±0.1685
FSRS-6-dev RMSE(bins) (mean±std): 0.0652±0.0390
FSRS-6-dev AUC (mean±std): 0.7016±0.0836
Weighted average by users:
FSRS-6-dev LogLoss (mean±std): 0.3542±0.1699
FSRS-6-dev RMSE(bins) (mean±std): 0.0665±0.0393
FSRS-6-dev AUC (mean±std): 0.7006±0.0867
parameters: [0.1992, 1.1353, 2.899, 12.7241, 6.4456, 0.6845, 3.1086, 0.0216, 1.8317, 0.1873, 0.7779, 1.4912, 0.0581, 0.3191, 1.6989, 0.4104, 1.9663, 0.7013, 0.1209, 0.1148, 0.183, 1.0788]
Model: FSRS-6
Total number of users: 235
Total number of reviews: 6315625
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3323±0.1577
FSRS-6 RMSE(bins) (mean±std): 0.0546±0.0367
FSRS-6 AUC (mean±std): 0.7027±0.0679
Weighted average by log(reviews):
FSRS-6 LogLoss (mean±std): 0.3525±0.1685
FSRS-6 RMSE(bins) (mean±std): 0.0651±0.0390
FSRS-6 AUC (mean±std): 0.7017±0.0836
Weighted average by users:
FSRS-6 LogLoss (mean±std): 0.3542±0.1699
FSRS-6 RMSE(bins) (mean±std): 0.0665±0.0393
FSRS-6 AUC (mean±std): 0.7007±0.0867
parameters: [0.1981, 1.1208, 2.899, 12.7241, 6.4435, 0.6813, 3.1037, 0.0217, 1.8285, 0.1844, 0.7759, 1.491, 0.0579, 0.319, 1.6961, 0.4095, 1.9617, 0.7046, 0.1173, 0.1175, 0.18]
I benchmarked not updating D if the interval is <24 hours. It made results worse.
I’ll continue experimenting with D (and with FSRS-7 overall) to see if I can improve it somehow.
First, a test whose positive result would confirm my hypothesis: please test excluding all “Agains” that occurred before Hard, Good, or Easy was pressed for the first time.
Now the explanation: I’ve found at least one cause of the problem discussed in this thread: I, but surely other users as well, have been giving FSRS bad information. I think the main cause lies in Anki’s user interface. This affects Again gradings of a card before the user has pressed one of the three pass buttons for the first time. These Agains are bad data because they can mean very different things. For me, with non-trivial card answers, pressing Again has often meant: I’ve memorized a bit more (“I packed my bag” on Wikipedia) and want to be tested on that bit as soon as my working memory (“The Magical Number Seven, Plus or Minus Two” on Wikipedia) has been overwritten by working on a different card. Pressing the “Again” button to express this wish seems logical. But I haven’t really memorized the rest of the card yet!
The requirement for FSRS must be that the user can completely reproduce the card’s answer, at least directly after looking at it, because stability is only defined for that case! If the user couldn’t reproduce the answer at the beginning of an interval, a grading after that interval is misleading data.
It doesn’t work well.
$ python evaluate.py --fast
Model: FSRS-6-dev
Total number of users: 2520
Total number of reviews: 85813985
Weighted average by reviews:
FSRS-6-dev LogLoss (mean±std): 0.3373±0.1538
FSRS-6-dev RMSE(bins) (mean±std): 0.0494±0.0303
FSRS-6-dev AUC (mean±std): 0.6909±0.0799
Weighted average by log(reviews):
FSRS-6-dev LogLoss (mean±std): 0.3554±0.1671
FSRS-6-dev RMSE(bins) (mean±std): 0.0662±0.0411
FSRS-6-dev AUC (mean±std): 0.6838±0.0852
Weighted average by users:
FSRS-6-dev LogLoss (mean±std): 0.3576±0.1694
FSRS-6-dev RMSE(bins) (mean±std): 0.0687±0.0424
FSRS-6-dev AUC (mean±std): 0.6824±0.0875
parameters: [0.15935, 0.86735, 2.05955, 12.807, 6.43815, 0.6355, 3.1391, 0.0275, 1.7619, 0.19505, 0.74155, 1.4732, 0.0553, 0.31065, 1.68385, 0.37005, 1.9603, 0.71675, 0.0472, 0.1285, 0.16915]
Model: FSRS-6
Total number of users: 2520
Total number of reviews: 85813985
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3310±0.1506
FSRS-6 RMSE(bins) (mean±std): 0.0471±0.0291
FSRS-6 AUC (mean±std): 0.7069±0.0817
Weighted average by log(reviews):
FSRS-6 LogLoss (mean±std): 0.3460±0.1614
FSRS-6 RMSE(bins) (mean±std): 0.0624±0.0393
FSRS-6 AUC (mean±std): 0.7044±0.0876
Weighted average by users:
FSRS-6 LogLoss (mean±std): 0.3479±0.1633
FSRS-6 RMSE(bins) (mean±std): 0.0647±0.0406
FSRS-6 AUC (mean±std): 0.7033±0.0900
parameters: [0.21315, 1.09365, 3.0333, 12.9575, 6.43915, 0.67735, 3.0969, 0.0216, 1.80755, 0.17895, 0.7807, 1.4956, 0.0569, 0.3254, 1.70705, 0.3918, 1.95725, 0.70325, 0.126, 0.12735, 0.1826]
Here is the change:
```python
        # Create tensor features with shape (sequence_length, 2);
        # each row contains [time_interval, rating] for that step.
        # Exclude "Again" ratings that occurred before the first
        # Hard/Good/Easy rating.
        df["tensor"] = [
            self._create_filtered_tensor(t_item[:-1], r_item[:-1])
            for t_sublist, r_sublist in zip(t_history_list, r_history_list)
            for t_item, r_item in zip(t_sublist, r_sublist)
        ]
        return df

    def _create_filtered_tensor(self, t_history: list, r_history: list) -> torch.Tensor:
        """
        Create a tensor while excluding "Again" ratings (rating=1) that
        occurred before the first Hard/Good/Easy rating (rating=2, 3, or 4).
        """
        if not t_history or not r_history:
            return torch.tensor(([], []), dtype=torch.float32).transpose(0, 1)
        # Find the first success: the first rating of 2/3/4, or the first
        # review with a non-zero interval. If neither exists, the index
        # stays at 0 and every rating is kept.
        first_short_term_success_idx = 0
        for i, (time_interval, rating) in enumerate(zip(t_history, r_history)):
            if rating in [2, 3, 4] or time_interval > 0:
                first_short_term_success_idx = i
                break
        # Filter out "Again" ratings before the first short-term success
        filtered_t = []
        filtered_r = []
        for i, (time_interval, rating) in enumerate(zip(t_history, r_history)):
            # Keep all non-"Again" ratings, and "Again" ratings only at or
            # after the first success
            if rating != 1 or i >= first_short_term_success_idx:
                filtered_t.append(time_interval)
                filtered_r.append(rating)
        # Create the tensor from the filtered data
        if filtered_t and filtered_r:
            return torch.tensor(
                (filtered_t, filtered_r), dtype=torch.float32
            ).transpose(0, 1)
        else:
            return torch.tensor(([], []), dtype=torch.float32).transpose(0, 1)
```
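A quick sanity check of the filter on a toy history (`dataset` here stands for whatever object owns the method above; the expected output is worked out by hand):

```python
# Two same-day Agains, then Good, then a next-day Good.
t_history = [0, 0, 0, 1]   # elapsed days before each review
r_history = [1, 1, 3, 3]   # 1 = Again, 3 = Good

print(dataset._create_filtered_tensor(t_history, r_history))
# tensor([[0., 3.],
#         [1., 3.]])
# The two leading Agains are dropped; everything from the first
# Hard/Good/Easy press onward is kept.
```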
Thank you for testing this, Jarrett!
My proposal was too radical. Ignoring the initial Agains completely for all users is overkill. Cards with a lot of Agains at first are more difficult, but for some users not as difficult as FSRS-6 predicts, especially for users who learn “I pack my bag”-style. I propose inserting a parameter which determines how much influence the initial Agains have.
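To make “how much influence” concrete, here is one possible shape for such a parameter, purely as a sketch: instead of hard-dropping the initial Agains as in the patch above, scale the difficulty change they cause by a new parameter `w_init_again` in [0, 1] (the name and the placement are both hypothetical):

```python
def delta_difficulty(rating: int, w6: float,
                     seen_first_success: bool,
                     w_init_again: float = 1.0) -> float:
    """Difficulty change for one review. Agains that precede the first
    Hard/Good/Easy press are scaled by w_init_again: 0 reproduces the
    exclusion experiment above, 1 reproduces current FSRS behavior."""
    delta_d = -w6 * (rating - 3)
    if rating == 1 and not seen_first_success:
        delta_d *= w_init_again
    return delta_d
```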
There kind of is one already. The formula for calculating initial D (after the very first review) is different from the formula used for all subsequent D updates.
OK, but the problem is the many Agains (in extreme cases 10 or even more for a long list), which drive the difficulty much too high.
But actually, modifying FSRS only mitigates this problem. Anki could provide an option to not even log Again presses which occur before the first pass grading.
But Jarrett already tested it; it makes FSRS perform worse.
Well, I guess I didn’t describe well enough what I meant. And maybe the feature is too niche to ever make it into Anki.
The alternative to my idea is to teach the users who have the problem in this thread that they shouldn’t press “Again” if they never knew the full answer. How does stability make any sense if you didn’t know the answer at the start of the interval?
I will try again to describe it: if you remove *all* initial Agains, it performs worse, yes. That is because most cards of most users are probably not that difficult; they are vocabulary or other short answers. On those, not remembering the answer right away really does mean that the card is more difficult.
But if you play “I pack my bag” with long, difficult cards, the first “Agains” are not real gradings because you never knew the full answer.
But I give up, because I have the feeling that this does not matter for most cards of most users. I will stick with the workarounds.
Thanks for listening, testing, your feedback, etc!
It is not a trivial change. We should figure out how to add the new parameter to FSRS.
And it would help us make progress on your case if we had a specific dataset to evaluate these kinds of changes.
I think real data of this will be hard to obtain, because users who learn like this will probably be annoyed by the short intervals and use one of the workarounds or turn off FSRS.
But now that I think about it, wouldn’t data with the reset workaround work?
I think the forgetting curve for lists is significantly different from that of short answers. I don’t have a solution, but it doesn’t surprise me at all anymore that FSRS is only of limited use for lists, because the test data probably consists mostly of short answers, e.g., vocabulary.
I’ll write down my thoughts, but note that there is empirical research on this in psychology (see Google Scholar).
If a card’s answer contains a list, you can think of each item as its own card. In particular, each item has its own retrievability probability of less than one. The retrievability of the entire card is the product of all of these and is therefore much lower, e.g., with a .95 retrievability for the individual items on a ten-item list: .95^10 ≈ .60.
With a lower retrievability for the individual items, the retrievability of the list drops radically:
.7^10 ≈ .03
Below a per-item retrievability of 0.7, the total retrievability is practically zero.
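A quick check of the arithmetic (independent recall of the items is the assumption here):

```python
# Retrievability of a whole list as the product of per-item
# retrievabilities, assuming the items are recalled independently.
def list_retrievability(r_item: float, n_items: int) -> float:
    return r_item ** n_items

print(round(list_retrievability(0.95, 10), 2))  # 0.6
print(round(list_retrievability(0.70, 10), 2))  # 0.03
```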
The following research seems highly relevant to me on first look:
https://pubs.aip.org/aip/jmp/article/63/7/073303/2843660
“Cognitive psychologists developed experimental paradigms involving randomly collected lists of items that make possible quantitative measures of performance in memory tasks, such as recall and recognition. Here, we describe a set of mathematical models designed to predict the results of these experiments. The models are based on simple underlying assumptions and surprisingly agree with experimental results quite well, in addition to that they exhibit quite interesting mathematical behavior that can partially be understood analytically.”
J. Math. Phys. 63, 073303 (2022)
https://doi.org/10.1063/5.0088823
I found this through the Google Scholar search “Mathematical models of memory decay”.
And this exponentiation of the retrievability predicts exactly what this thread is about: at first it takes many Agains to drive difficulty up, but at a certain point the list’s retrievability rises steeply, recall is very easy, and the intervals at least seem too short.
I think the retrievability of lists is much more difficult to predict because the curve is so much steeper.
Example for a 10-item list:
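A small sketch (assuming independent items, as above) tabulating r_list = r_item^10 to show how steep the transition is:

```python
# Retrievability of a 10-item list as the per-item retrievability drops:
# the fall-off is far steeper than any single item's forgetting curve.
for r_item in [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70]:
    print(f"r_item = {r_item:.2f}  ->  r_list = {r_item ** 10:.3f}")
# r_item = 0.99  ->  r_list = 0.904
# r_item = 0.97  ->  r_list = 0.737
# r_item = 0.95  ->  r_list = 0.599
# r_item = 0.90  ->  r_list = 0.349
# r_item = 0.85  ->  r_list = 0.197
# r_item = 0.80  ->  r_list = 0.107
# r_item = 0.70  ->  r_list = 0.028
```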
@Helge, in his last comment, Jarrett expressed the need for some data to test changes. I believe that you are an affected user. So, it would help if you could share your data for testing purposes.
If privacy is a concern, you can go to Tools -> FSRS Helper -> Export Dataset for Research and share the generated file here as a Google Drive link. The file doesn’t contain any personal information or the text of your cards.
