Pass/Fail Grading as Default

I think it would be better to make the 2-button mode an option of FSRS.
In SM-2, the 2-button mode is probably not recommended because it causes Ease Hell, so I think the 2-button mode can only be made the default after FSRS becomes the default algorithm.

3 Likes

Shigeyuki, you are misremembering. In SM-2, Ease Hell happened because Again and Hard decreased Ease, whereas Good didn’t increase it. It was sometimes solved by pressing the Easy button more, but more commonly people adopted 2-button use while setting the Ease to 130%, which is the minimum limit. Oh, this means that with the default Ease value, people will encounter Ease Hell if they don’t press the Easy button. Maybe wait until FSRS is the default, then.

2 Likes

My intention with the four buttons was always that users would predominantly use Good and Again, with Hard and Easy being reserved for the times when the user felt the standard scheduling was either too aggressive or too conservative. A breakdown of 85% good, 10% again, and 5% hard/easy is in line with my assumptions, and less than 5% for some users would not surprise me.

If we take away hard and easy, that 5% (or whatever it happens to be) will end up being scheduled more conservatively or aggressively than they would otherwise. What I’m left wondering is whether the losses there are outweighed by the improvements in RMSE and reduced time spent answering, or not.

4 Likes

I think that even if it doesn’t improve the metrics, it won’t hurt to have a 2-button mode like the Fail/Pass add-on as a built-in feature. It can be in Tools → Preferences → Review.
Those who want to use that mode will be able to use it without an add-on, and on other devices (other than desktop) as well. Those who don’t want to use it won’t use it. Net improvement in user satisfaction.

2 Likes

I’m not sure that’s the best way to solve this. One other option, for example, would be to have [again] [good] [...], with hard/easy accessed via the last button. It better conveys that they’re expected to be used for outliers than what we have now, while not removing the ability to use them in the cases where they are required. That could potentially become a default in the future, with an option to show them expanded instead, for users who use them more frequently.

But in any case, I don’t want to be making changes to the reviewing screen until we’ve got more of its code shared between the clients, so we don’t need to reimplement this 4 different times.

2 Likes

have [again] [good] [...], with hard/easy accessed via the last button. It better conveys that they’re expected to be used for outliers than what we have now, while not removing the ability to use them in the cases where they are required.

That sounds more clunky, IMO

3 Likes

I can’t help but feel like the main issue here might be the contrast between subjective and objective information. Subjective evaluations are by nature inconsistent and unreliable, especially in a computing context. And whether something was difficult or easy to remember is highly subjective, influenced by things like mood, motivation, momentary distractions, and so on. Whether you got something right or wrong, however, can be determined fairly objectively, as long as you’re not lying to yourself.

I find it interesting to note that the mainline SuperMemo site and apps (the ones that are based around pre-made courses) use a three-grade system: pass, fail, and “almost”. In other words, they use two failing grades, and just a single passing grade. And I suspect their system is likely more fit-for-purpose compared to the ones seen in Anki or the “Incremental Reading” version of SuperMemo, as whether or not an answer was incorrect or partially correct is something that can be determined objectively.

The testing effect is all about what you remember when you’re being tested. The stuff you remember will be reinforced. The stuff you fail to remember won’t. In this context, a partially correct answer is unequivocally better than not remembering the answer at all, even if both count as incorrect and receive a single Fail/“Again” under Anki’s grading system.

But it’s not clear that easily remembering the correct answer is always unequivocally better than (or even different from) remembering the correct answer with some or a lot of effort, at least from the perspective of the testing effect. Subjectively, they feel different, and intuitively we feel that this difference should matter. But the data shown in this thread seems to suggest that our intuitions are wrong.

4 Likes

In trying to choose one of the better intervals, people end up making the intervals worse themselves. The irony!

that 5% will end up being scheduled more conservatively or aggressively than they would otherwise.

In this hypothetical scenario, you are changing the grading behaviour without changing the intervals that are produced. The important qualifier to add is “all other things being equal”. I see reduced deviation of the interval from the ideal value when 2 buttons are used. Everything else is not equal.

I would add the onboarding experience of a new user, and if I were forced to limit my arguments to one, this would be it.

I would also change what the losses are. It might be ‘not being able to instantly graduate a learning card’. For the average user, the loss certainly isn’t worse intervals as you believe.

1 Like

Dare I say
based?

3 Likes

I kinda came to agree with sorata. Even if using four buttons and two buttons results in the same RMSE in an ideal scenario where everyone is consistent, the survey clearly shows that people aren’t very consistent.

1 Like

A ~20 percent gap must be noticeable in terms of how the four buttons are used. I do agree that this type of inconsistency exists.

According to my latest ablation experiment, treating the 4 grades as 2 grades increases the error by ~8% in relative terms.

commit: add FSRS-5-binary · open-spaced-repetition/srs-benchmark@18e4a1e · GitHub
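For illustration, a minimal sketch of what such an ablation could look like; the rating encoding (1 = Again, 2 = Hard, 3 = Good, 4 = Easy) is an assumption here, and the linked commit may implement this differently.

```python
# Hedged sketch: collapse a 4-grade review history into a binary one by
# mapping every passing grade (Hard/Good/Easy) to Good and keeping Again.
# The encoding 1=Again, 2=Hard, 3=Good, 4=Easy is assumed, not taken from
# the linked commit.

def to_binary(rating: int) -> int:
    return 1 if rating == 1 else 3

review_log = [3, 4, 2, 1, 3]             # hypothetical per-card rating history
binary_log = [to_binary(r) for r in review_log]
print(binary_log)                        # [3, 3, 3, 1, 3]
```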

1 Like

Yes, but are you also doing this on 4-button users’ data?

It’s done on all users, otherwise I cannot put it in the benchmark table to compare them.

Yes, but that also makes it lose some meaning, because people use the 4 buttons in various different ways, for example, treating Hard as Again.

You’re right. But I mean, on average, the effect of 4-button mode is positive (without considering the time spent on grading).

1 Like

I’m not sure I understand you correctly. Are you saying that, net-net, 4 buttons are positive? Read here: Pass/Fail Grading as Default - #132 by Expertium

We think a user is better at self-grading when there are only two options, regardless of time spent grading and such. So a better thing to do would be to compare the 2-button and 4-button user groups, which Expertium did.

Let’s consider the simplest case: the four initial stabilities, one per first rating (Again, Hard, Good, Easy):

w[0]
w[1]
w[2]
w[3]

If the user can only use two buttons, w[1] and w[3] will be merged into w[2].

A card whose first rating should be Easy will be rated Good. So its estimated initial stability will be 9.12 instead of 25.44 (on average). It will then be scheduled earlier, and its retrievability will be higher than the desired retention.

A card whose first rating should be Hard will be rated Good. So its estimated initial stability will be 9.12 instead of 3.47 (on average). It will then be scheduled later, and its retrievability will be lower than the desired retention.

If the user picks the optimal retention, the two-button scheduler will depart from the optimal scheduling.
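Here is a sketch of that deviation, assuming the FSRS-4.5/5 power forgetting curve R(t, S) = (1 + 19/81 · t/S)^-0.5 and the average initial stabilities quoted above; the exact constants may differ in other FSRS versions, so treat the numbers as illustrative.

```python
# Sketch under the assumptions above: a card is scheduled as if its initial
# stability were the "Good" average (9.12 days), but its true stability is
# either the "Easy" average (25.44) or the "Hard" average (3.47).

DECAY = -0.5
FACTOR = 19 / 81   # chosen so that the interval equals S when R = 0.9

def interval(stability: float, desired_retention: float) -> float:
    # Days until predicted retrievability falls to desired_retention.
    return stability / FACTOR * (desired_retention ** (1 / DECAY) - 1)

def retrievability(t: float, stability: float) -> float:
    return (1 + FACTOR * t / stability) ** DECAY

t = interval(9.12, 0.9)                      # scheduled from the merged "Good" estimate
print(round(t, 2))                           # ~9.12 days
print(round(retrievability(t, 25.44), 2))    # ~0.96 -> true Easy card, reviewed too early
print(round(retrievability(t, 3.47), 2))     # ~0.79 -> true Hard card, reviewed too late
```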

Yes, it’s not surprising since we’re basically corrupting data. I did something similar a long time ago.

This isn’t the same as comparing people who deliberately use 4 buttons with people who deliberately use 2.

2 Likes

I could see the usefulness of a granular first grade being easier to verify than granular evaluations of later grades. It seems easy enough to calculate a forgetting curve that ensures the Desired Retention for each initial grade at the first repetition, since the result can be objectively quantified. (And the different grades do seem to make a very meaningful difference, seeing how different the first four optimisation parameters tend to be).

The way I see it, later grades become more and more “nebulous”, exponentially so. I’m imagining that for n successful (non-fail) repetitions, you’d require something like 3^n times more data (for the three passing grades) to provide an accurate model of a person’s memory (and grading habits). Because the only thing that can actually refute or verify an assumption with regards to the estimated forgetting curve following a certain sequence of hard/good/easy grades is a sufficient number of pass/fail grades.
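As a rough sketch of that combinatorial point (my reading of the argument, not an FSRS result): with pass/fail grading there is only one possible history of n successful reviews, whereas with three passing grades there are 3^n, and each of those branches would need its own pass/fail observations to pin down a forgetting curve.

```python
# Count the distinct grade histories after n successful reviews:
# one history under pass/fail, 3**n under hard/good/easy.
for n in range(1, 6):
    print(f"n={n}: pass/fail histories = 1, three-grade histories = {3 ** n}")
```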

The complexity just seems too much considering the limited dataset of each given student. But for the first grade, it should be easy, no?

(I recall earlier SuperMemo algorithms, predating SM-17, using the initial grade to classify an item’s difficulty, and then only making minor adjustments after that. I don’t know how useful Woźniak’s early assumptions are in FSRS’s case. But intuitively, it makes sense to me, as the first grade seems easy enough to verify, as indicated above by L.M.Sherlock.)