Pass/Fail Grading as Default

I think it would be better to make the 2-button mode an option of FSRS.
In SM-2, the 2-button mode is probably not recommended because it causes Ease Hell, so I think the 2-button mode can only be made the default after FSRS becomes the default algorithm.

3 Likes

Shigeyuki, you are misremembering. In SM-2, Ease Hell happened because Again and Hard decreased Ease, whereas Good didn’t increase it. It was sometimes solved by pressing the Easy button more, but more commonly people adopted 2-button use while setting the Ease to 130%, which is the minimum limit. Oh, this means that with the default Ease value, people will encounter Ease Hell if they don’t press the Easy button. Maybe wait until FSRS is the default, then.

2 Likes

My intention with the four buttons was always that users would predominantly use Good and Again, with Hard and Easy being reserved for the times when the user felt the standard scheduling was either too aggressive or too conservative. A breakdown of 85% good, 10% again, and 5% hard/easy is in line with my assumptions, and less than 5% for some users would not surprise me.

If we take away hard and easy, that 5% (or whatever it happens to be) will end up being scheduled more conservatively or aggressively than they would otherwise. What I’m left wondering is whether the losses there are outweighed by the improvements in RMSE and reduced time spent answering, or not.

4 Likes

I think that even if it doesn’t improve the metrics, it won’t hurt to have a 2-button mode like the Fail/Pass add-on as a built-in feature. It can be in Tools → Preferences → Review.
Those who want to use that mode will be able to use it without an add-on, and on other devices (other than desktop) as well. Those who don’t want to use it won’t use it. Net improvement in user satisfaction.

2 Likes

I’m not sure that’s the best way to solve this. One other option, for example, would be to have [again] [good] [...], with hard/easy accessed via the last button. It better conveys that they’re expected to be used for outliers than what we have now, while not removing the ability to use them in the cases where they are required. That could potentially become a default in the future, with an option to show them expanded instead, for users who use them more frequently.

But in any case, I don’t want to be making changes to the reviewing screen until we’ve got more of its code shared between the clients, so we don’t need to reimplement this 4 different times.

2 Likes

have [again] [good] [...], with hard/easy accessed via the last button. It better conveys that they’re expected to be used for outliers than what we have now, while not removing the ability to use them in the cases where they are required.

That sounds more clunky, IMO

3 Likes

I can’t help but feel like the main issue here might be the contrast between subjective and objective information. Subjective evaluations are by nature inconsistent and unreliable, especially in a computing context. And whether something was difficult or easy to remember is highly subjective, influenced by things like mood, motivation, momentary distractions, and so on. Whether you got something right or wrong, however, can be determined fairly objectively, as long as you’re not lying to yourself.

I find it interesting to note that the mainline SuperMemo site and apps (the ones that are based around pre-made courses) use a three-grade system: pass, fail, and “almost”. In other words, they use two failing grades, and just a single passing grade. And I suspect their system is likely more fit-for-purpose compared to the ones seen in Anki or the “Incremental Reading” version of SuperMemo, as whether or not an answer was incorrect or partially correct is something that can be determined objectively.

The testing effect is all about what you remember when you’re being tested. The stuff you remember will be reinforced. The stuff you fail to remember won’t. In this context, a partially correct answer is unequivocally better than not remembering the answer at all, even if both count as incorrect and receive a single Fail/“Again” under Anki’s grading system.

But it’s not clear that easily remembering the correct answer is always unequivocally better than (or even different from) remembering the correct answer with some or a lot of effort, at least from the perspective of the testing effect. Subjectively, they feel different, and intuitively we feel that this difference should matter. But the data shown in this thread seems to suggest that our intuitions are wrong.

4 Likes

In trying to choose one of the better intervals, people end up making the intervals worse themselves. The irony!

that 5% will end up being scheduled more conservatively or aggressively than they would otherwise.

In this hypothetical scenario, you are changing the grading behaviour without changing the intervals that are produced. The important qualifier to add is “all other things being equal”. I see reduced deviation of the interval from the ideal value when 2 buttons are used. Everything else is not equal.

I would add the onboarding experience of a new user, and if I were forced to limit my arguments to one, this would be it.

I would also change what the losses are. It might be ‘not being able to instantly graduate a learning card’. For the average user, the loss certainly isn’t worse intervals as you believe.

1 Like

Dare I say
based?

3 Likes

I kinda came to agree with sorata. Even if using four buttons and two buttons results in the same RMSE in an ideal scenario where everyone is consistent, the survey clearly shows that people aren’t very consistent.

1 Like

A ~20 percent gap must be noticeable in terms of how the four buttons are used. I do agree that this type of inconsistency exists.

According to my latest ablation experiment, treating the 4 grades as 2 grades increases the error by ~8% in relative terms.

commit: add FSRS-5-binary · open-spaced-repetition/srs-benchmark@18e4a1e · GitHub
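For illustration, a minimal sketch of what such an ablation could look like; the rating encoding (1 = Again, 2 = Hard, 3 = Good, 4 = Easy) is an assumption here, and the linked commit may implement this differently.

```python
# Hedged sketch: collapse a 4-grade review history into a binary one by
# mapping every passing grade (Hard/Good/Easy) to Good and keeping Again.
# The encoding 1=Again, 2=Hard, 3=Good, 4=Easy is assumed, not taken from
# the linked commit.

def to_binary(rating: int) -> int:
    return 1 if rating == 1 else 3

review_log = [3, 4, 2, 1, 3]             # hypothetical per-card rating history
binary_log = [to_binary(r) for r in review_log]
print(binary_log)                        # [3, 3, 3, 1, 3]
```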

1 Like

Yes, but are you also doing this on 4-button users’ data?

It’s done on all users, otherwise I cannot put it in the benchmark table to compare them.

Yes, but that also makes it lose some meaning, because people use the 4 buttons in various different ways, for example, treating Hard as Again.

You’re right. But I mean, on average, the effect of 4-button mode is positive (without considering the time spent on grading).

1 Like

I’m not sure I understand you correctly. Are you saying that, net-net, 4 buttons are positive? Read here: Pass/Fail Grading as Default - #132 by Expertium

We think a user is better at self-grading when there are only two options, regardless of time spent grading and such. So a better thing to do would be to compare the 2-button and 4-button user groups, which Expertium did.

Let’s consider the simplest case: the four initial stabilities, one per first rating (Again, Hard, Good, Easy):

w[0]
w[1]
w[2]
w[3]

If the user can only use two buttons, w[1] and w[3] will be merged into w[2].

A card whose first rating should be Easy will be rated Good. So its estimated initial stability will be 9.12 instead of 25.44 (on average). It will then be scheduled earlier, and its retrievability will be higher than the desired retention.

A card whose first rating should be Hard will be rated Good. So its estimated initial stability will be 9.12 instead of 3.47 (on average). It will then be scheduled later, and its retrievability will be lower than the desired retention.

If the user picks the optimal retention, the two-button scheduler will depart from the optimal scheduling.
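Here is a sketch of that deviation, assuming the FSRS-4.5/5 power forgetting curve R(t, S) = (1 + 19/81 · t/S)^-0.5 and the average initial stabilities quoted above; the exact constants may differ in other FSRS versions, so treat the numbers as illustrative.

```python
# Sketch under the assumptions above: a card is scheduled as if its initial
# stability were the "Good" average (9.12 days), but its true stability is
# either the "Easy" average (25.44) or the "Hard" average (3.47).

DECAY = -0.5
FACTOR = 19 / 81   # chosen so that the interval equals S when R = 0.9

def interval(stability: float, desired_retention: float) -> float:
    # Days until predicted retrievability falls to desired_retention.
    return stability / FACTOR * (desired_retention ** (1 / DECAY) - 1)

def retrievability(t: float, stability: float) -> float:
    return (1 + FACTOR * t / stability) ** DECAY

t = interval(9.12, 0.9)                      # scheduled from the merged "Good" estimate
print(round(t, 2))                           # ~9.12 days
print(round(retrievability(t, 25.44), 2))    # ~0.96 -> true Easy card, reviewed too early
print(round(retrievability(t, 3.47), 2))     # ~0.79 -> true Hard card, reviewed too late
```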

Yes, it’s not surprising since we’re basically corrupting data. I did something similar a long time ago.

This isn’t the same as comparing people who deliberately use 4 buttons with people who deliberately use 2.

2 Likes

I could see the usefulness of a granular first grade being easier to verify than granular evaluations of later grades. It seems easy enough to calculate a forgetting curve that ensures the Desired Retention for each initial grade at the first repetition, since the result can be objectively quantified. (And the different grades do seem to make a very meaningful difference, seeing how different the first four optimisation parameters tend to be).

The way I see it, later grades become more and more “nebulous”, exponentially so. I’m imagining that for n successful (non-fail) repetitions, you’d require something like 3^n times more data (for the three passing grades) to provide an accurate model of a person’s memory (and grading habits). Because the only thing that can actually refute or verify an assumption with regards to the estimated forgetting curve following a certain sequence of hard/good/easy grades is a sufficient number of pass/fail grades.
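As a rough sketch of that combinatorial point (my reading of the argument, not an FSRS result): with pass/fail grading there is only one possible history of n successful reviews, whereas with three passing grades there are 3^n, and each of those branches would need its own pass/fail observations to pin down a forgetting curve.

```python
# Count the distinct grade histories after n successful reviews:
# one history under pass/fail, 3**n under hard/good/easy.
for n in range(1, 6):
    print(f"n={n}: pass/fail histories = 1, three-grade histories = {3 ** n}")
```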

The complexity just seems too much considering the limited dataset of each given student. But for the first grade, it should be easy, no?

(I recall earlier SuperMemo algorithms, predating SM-17, using the initial grade to classify an item’s difficulty, and then only making minor adjustments after that. I don’t know how useful Woźniak’s early assumptions are in FSRS’s case. But intuitively, it makes sense to me, as the first grade seems easy enough to verify, as indicated above by L.M.Sherlock.)