I think it would be better to make the 2-button mode an option of FSRS.
In SM2, the 2-button mode is maybe not recommended because it causes Ease Hell, so I think it is only after FSRS becomes the default algorithm that the 2-button mode can be made the default.
Shigeyuki, you are misremembering. In SM2, Ease Hell was happening because Again and Hard decreased Ease, whereas Good didn't increase it. It was sometimes solved by pressing the Easy button more, but more commonly people adopted 2-button use while setting the Ease to 130 percent, which is the minimum limit. Oh, this means that with the default Ease value, people will encounter Ease Hell if they don't press the Easy button. Maybe wait until FSRS is the default, then.
My intention with the four buttons was always that users would predominantly use Good and Again, with Hard and Easy being reserved for the times when the user felt the standard scheduling was either too aggressive or too conservative. A breakdown of 85% good, 10% again, and 5% hard/easy is in line with my assumptions, and less than 5% for some users would not surprise me.
If we take away hard and easy, that 5% (or whatever it happens to be) will end up being scheduled more conservatively or aggressively than they would otherwise. What I'm left wondering is whether the losses there are outweighed by the improvements in RMSE and reduced time spent answering, or not.
I think that even if it doesn't improve the metrics, it won't hurt to have a 2-button mode like the Fail/Pass add-on as a built-in feature. It can be in Tools → Preferences → Review.
Those who want to use that mode will be able to use it without an add-on, and on other devices (other than desktop) as well. Those who don't want to use it won't use it. Net improvement in user satisfaction.
I'm not sure that's the best way to solve this. One other option, for example, would be to have [again] [good] […], with hard/easy accessed via the last button. It better conveys that they're expected to be used for outliers than what we have now, while not removing the ability to use them in the cases where they are required. That could potentially become a default in the future, with an option to show them expanded instead, for users who use them more frequently.
But in any case, I don't want to be making changes to the reviewing screen until we've got more of its code shared between the clients, so we don't need to reimplement this 4 different times.
have [again] [good] […], with hard/easy accessed via the last button. It better conveys that they're expected to be used for outliers than what we have now, while not removing the ability to use them in the cases where they are required.
That sounds more clunky, IMO
I can't help but feel like the main issue here might be the contrast between subjective and objective information. Subjective evaluations are by nature inconsistent and unreliable, especially in a computing context. And whether something was difficult or easy to remember is highly subjective, influenced by things like mood, motivation, momentary distractions, and so on. Whether you got something right or wrong, however, can be determined fairly objectively, as long as you're not lying to yourself.
I find it interesting to note that the mainline SuperMemo site and apps (the ones that are based around pre-made courses) use a three-grade system: pass, fail, and "almost". In other words, they use two failing grades, and just a single passing grade. And I suspect their system is likely more fit-for-purpose compared to the ones seen in Anki or the "Incremental Reading" version of SuperMemo, as whether an answer was incorrect or only partially correct is something that can be determined objectively.
The testing effect is all about what you remember when you're being tested. The stuff you remember will be reinforced. The stuff you fail to remember won't. In this context, a partially correct answer is unequivocally better than not remembering the answer at all, even if they're both incorrect and count as a single fail/"again" in Anki's grading system.
But it's not clear that easily remembering the correct answer is always unequivocally better than (or even different from) remembering the correct answer with some or a lot of effort, at least from the perspective of the testing effect. Subjectively, they feel different, and intuitively we feel that this difference should matter. But the data shown in this thread seems to suggest that our intuitions are wrong.
In trying to choose the better interval, people end up making their intervals worse. Irony!
that 5% will end up being scheduled more conservatively or aggressively than they would otherwise.
In this hypothetical scenario, you are changing the grading behaviour without changing the intervals produced. The important line to add is "all other things being equal". I see reduced deviation of the interval from the ideal value when 2 buttons are used. Everything else is not equal.
I would add the onboarding experience of a new user, and if I were forced to limit my arguments to one, this would be it.
I would also change what the losses are. It might be "not being able to instantly graduate a learning card". For the average user, the loss certainly isn't worse intervals, as you believe.
Dare I say… based?
I kinda came to agree with sorata. Even if using four buttons and two buttons results in the same RMSE in an ideal scenario where everyone is consistent, the survey clearly shows that people aren't very consistent.
A ~20 percent gap in four-button usage must be noticeable. I do agree that this type of inconsistency is real.
According to my latest ablation experiment, treating 4 grades as 2 grades increases errors by ~8% in relative terms.
commit: add FSRS-5-binary · open-spaced-repetition/srs-benchmark@18e4a1e · GitHub
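Roughly, the ablation collapses every passing grade into Good before fitting, keeping Again as the only failing grade. A minimal sketch of that mapping (illustrative only, not code copied from srs-benchmark):

```python
AGAIN, HARD, GOOD, EASY = 1, 2, 3, 4  # Anki's rating codes

def binarize(rating: int) -> int:
    """Collapse 4 grades into 2: any failure stays Again, any pass becomes Good."""
    return AGAIN if rating == AGAIN else GOOD

# Example review history before and after collapsing.
history = [GOOD, HARD, AGAIN, GOOD, EASY]
print([binarize(r) for r in history])  # -> [3, 3, 1, 3, 3]
```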
Yes, but you're doing this also on 4-button users' data?
It's done on all users, otherwise I cannot put it in the benchmark table to compare them.
Yes, but that also makes it lose some meaning because people use 4 buttons in various different ways. For example, hard as again.
You're right. But I mean, on average, the effect of 4-button mode is positive (without considering the time spent on grading).
I'm not sure I understand you correctly. Are you saying net-net 4 buttons are positive? Read here: Pass/Fail Grading as Default - #132 by Expertium
We think a user is better at self-grading when there are only two options, regardless of time spent grading and such. So a better thing to do would be to compare the 2-button and 4-button user groups, which Expertium did.
Let's consider the simplest case: the four initial stabilities.
If the user can only use two buttons, w[1] and w[3] will be merged into w[2].
A card whose first rating should be Easy will be rated Good. So its estimated initial stability will be 9.12 instead of 25.44 (on average). Then it will be scheduled earlier. And its retrievability will be higher than the desired retention.
A card whose first rating should be Hard will be rated Good. So its estimated initial stability will be 9.12 instead of 3.47 (on average). Then it will be scheduled later. And its retrievability will be lower than the desired retention.
If the user picks the optimal retention, the two-button mode's scheduler will depart from the optimal scheduling.
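To make this concrete, here is a minimal sketch in Python. It assumes the usual FSRS power forgetting curve R(t, S) = (1 + FACTOR · t/S)^DECAY with DECAY = -0.5 and FACTOR = 19/81 (chosen so that the interval at desired retention 0.9 equals the stability); the formula and constants are my assumption of the standard FSRS-4.5 ones, and the stabilities are the averages quoted above.

```python
DECAY = -0.5
FACTOR = 19 / 81  # chosen so that R(S, S) = 0.9

def retrievability(t: float, s: float) -> float:
    """Probability of recall t days after a review, for a card with stability s."""
    return (1 + FACTOR * t / s) ** DECAY

def interval(s: float, desired_retention: float = 0.9) -> float:
    """Days until retrievability drops to desired_retention."""
    return s / FACTOR * (desired_retention ** (1 / DECAY) - 1)

# Average initial stabilities quoted above (Hard, Good, Easy).
S_HARD, S_GOOD, S_EASY = 3.47, 9.12, 25.44

# In two-button mode, every first pass is graded Good, so every passing
# card gets the Good interval of ~9.12 days.
t = interval(S_GOOD)
print(f"interval = {t:.2f} days")
print(f"true-Easy card: R = {retrievability(t, S_EASY):.3f}")  # ~0.960, above 0.9: reviewed too early
print(f"true-Hard card: R = {retrievability(t, S_HARD):.3f}")  # ~0.787, below 0.9: reviewed too late
```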
Yes, it's not surprising since we're basically corrupting data. I did something similar a long time ago.
This isn't the same as comparing people who deliberately use 4 buttons with people who deliberately use 2.
I could see the usefulness of a granular first grade being easier to verify than granular evaluations of later grades. It seems easy enough to calculate a forgetting curve that ensures the Desired Retention for each initial grade at the first repetition, since the result can be objectively quantified. (And the different grades do seem to make a very meaningful difference, seeing how different the first four optimisation parameters tend to be).
The way I see it, later grades become more and more "nebulous", exponentially so. I'm imagining that for n successful (non-fail) repetitions, you'd require something like 3^n times more data (for the three passing grades) to provide an accurate model of a person's memory (and grading habits). Because the only thing that can actually refute or verify an assumption with regard to the estimated forgetting curve following a certain sequence of hard/good/easy grades is a sufficient number of pass/fail grades.
The complexity just seems too much considering the limited dataset of each given student. But for the first grade, it should be easy, no?
(I recall earlier SuperMemo algorithms, predating SM-17, using the initial grade to classify an item's difficulty, and then only making minor adjustments after that. I don't know how useful Woźniak's early assumptions are in FSRS's case. But intuitively, it makes sense to me, as the first grade seems easy enough to verify, as indicated above by L.M.Sherlock.)
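As a back-of-the-envelope illustration of the 3^n estimate above (just the counting argument, nothing FSRS-specific):

```python
from itertools import product

# With three passing grades (hard/good/easy), the number of distinct
# pass-grade histories after n successful reviews is 3^n, so the data
# available per history shrinks exponentially.
for n in range(1, 6):
    print(n, len(list(product("HGE", repeat=n))))  # 3, 9, 27, 81, 243
```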