Due Column - Changing Days (from Whole Numbers to Decimals in Scheduling)

Also, if I understand it correctly, not all of the filtered reviews will actually be same-day reviews, so we will be filtering out too many reviews.
Example: the user reviewed a card at 10 PM, then went to sleep, then reviewed it again at 10 AM the next morning. His “Next day starts at” is anywhere between the two. Anki will count it as “not the same day” because of the rollover, but the absolute time difference is 12 hours. So if we used Anki’s integer intervals, it wouldn’t be filtered out (delta_t = 1), but if we use the absolute difference in time, it would be filtered out (delta_t < 1).
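For illustration, here is a minimal sketch (not Anki’s actual code) of the two ways of computing delta_t; the rollover hour and the helper name are made up:

```python
from datetime import datetime, timedelta

# Hypothetical helper: an Anki-style day index, where a "day" starts at the
# "Next day starts at" rollover hour rather than at midnight.
def anki_day(ts: datetime, rollover_hour: int = 4) -> int:
    return (ts - timedelta(hours=rollover_hour)).date().toordinal()

prev = datetime(2024, 1, 1, 22, 0)   # reviewed at 10 PM
curr = datetime(2024, 1, 2, 10, 0)   # reviewed again at 10 AM the next morning

delta_t_days = anki_day(curr) - anki_day(prev)         # 1 -> survives a "same-day" filter
delta_t_secs = (curr - prev).total_seconds() / 86400   # 0.5 -> removed if we filter delta_t < 1

print(delta_t_days, delta_t_secs)  # 1 0.5
```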

You’re using the same data set for both tests, yeah? Can’t you just note the indices used in the FSRS-5 test and filter by those for the secs test?

@L.M.Sherlock rich70521 proposed a good idea. More for you to code :joy:

I understand that. What I’m saying is, the way people have been using Anki for years was using fixed relearning intervals, so any data you have will have that.

FSRS in general would probably look like it performs a lot worse if you were using a data set with fixed > 24 hour intervals. Like, if everyone looked at every one of their cards once per week. SM-2 at least provided some of that variable data for FSRS to work with. FSRS-secs is getting none of that for intraday intervals.

Yes, some people do use different learning/relearning steps, so it’s at least getting a little, but I bet the vast majority of people just use the 1m 10m learning and 10m relearning steps. And it probably has almost zero data on > 24 hour steps after hitting Again.

The training data just isn’t ideal for evaluating intraday steps at all. That doesn’t mean it isn’t going to perform well, though. My guess is that if you had people use it for a while and then evaluated the model on their data, it would look much better.

You can’t even compare them if that’s what’s going on. It’s not just unfair, it’s apples to oranges.

We need some kind of comparison, otherwise we’ll never decide whether the hypothetical FSRS-X-secs model is worth implementing.

Regarding fixed intervals - I think you vastly overestimate how many people just use 1m 10m. And besides, nobody sits down with a timer and reviews a card after exactly 10m. There will be plenty of variance.

The problem is that we need to consider the train_index and test_index at the same time. Because we include the same-day reviews in training, TimeSeriesSplit will give a different train_index than it would if we filtered out the same-day reviews.
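To make the mismatch concrete, here is a toy sketch with sklearn’s TimeSeriesSplit; the review log is invented, and the point is only that splitting before vs. after dropping same-day rows yields indices into different arrays:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy delta_t history for one collection (0 = same-day review), invented data.
delta_t = np.array([0, 0, 1, 0, 3, 7, 0, 15, 30, 60])

tscv = TimeSeriesSplit(n_splits=3)

# Splitting the full log (same-day reviews included):
full_test_idx = [test for _, test in tscv.split(delta_t)]

# Splitting the pre-filtered log (same-day reviews removed):
filtered = delta_t[delta_t >= 1]
filtered_test_idx = [test for _, test in tscv.split(filtered)]

# The positional indices now point into different arrays, so the "same"
# train/test split no longer refers to the same underlying reviews.
print(full_test_idx)       # [array([4, 5]), array([6, 7]), array([8, 9])]
print(filtered_test_idx)   # [array([3]), array([4]), array([5])]
```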


Yeah, that’s a hard problem to solve, but you can’t just use an invalid comparison like this because it’s all you have. FSRS-secs might be performing better even if its numbers are worse. No way to know.

Also, and this is completely subjective, but I’ve been using FSRS-5 without relearning steps, letting the algorithm schedule the Again intervals, and I’ve noticed a huge difference in the way I treat a card I’m getting wrong. Before, I think knowing I’d be seeing the card again in 10 minutes caused me to just mark it wrong and move on quickly. Now, because I know I might not see the card for a few days, instead of just deciding I was wrong and hitting Again, I take a beat and re-commit it to memory.

I’m getting a lot of these > 24 hour intervals correct the next time and it’s reduced my daily reviews by a lot. I don’t need to see every card I get wrong again the same day.

Regarding variance in data, here are some summary statistics of elapsed_seconds based on the first 100 users, only including values < 86400 seconds (i.e., less than 24 hours).

Mean: 11824 seconds
Median: 627 seconds
5th percentile: 29 seconds
95th percentile: 78229 seconds

Seems like plenty of variance to me. Also, the mean being almost 20 times greater than the median, lol. That’s a distribution with a VERY fat tail.
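For reference, a sketch of how such numbers can be computed; the file and column names here are placeholders, not the benchmark’s actual layout:

```python
import numpy as np
import pandas as pd

df = pd.read_parquet("revlogs.parquet")   # placeholder file name
secs = df["elapsed_seconds"]              # gap since the previous review of the card
secs = secs[secs < 86400]                 # same-day gaps only, as above

print(secs.mean(), secs.median(),
      np.percentile(secs, 5), np.percentile(secs, 95))
```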


The best we (including you) could think of is not including same-day reviews in evaluation. I can’t think of anything better.

I’m guessing the mean is so high because the default used to be that every wrong card got an interval of 1 day. A lot of those would have been studied < 24 hours later. Is that right, or did you also filter for same day?

I just filtered out anything where elapsed_seconds is >= 86400.

How about this: keep both delta_t in days (as determined by Anki) and delta_t in seconds, but only use delta_t in days to determine which reviews must be filtered?
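A rough sketch of that proposal, with invented column names (delta_t_days for Anki’s integer interval, elapsed_seconds for the raw gap):

```python
import pandas as pd

def prepare(revlog: pd.DataFrame) -> pd.DataFrame:
    out = revlog.copy()
    # Fractional-day interval fed to the seconds-based model:
    out["delta_t_secs"] = out["elapsed_seconds"] / 86400
    # But filter on Anki's day-based interval, so exactly the same reviews
    # are excluded as in the existing FSRS-5 evaluation:
    return out[out["delta_t_days"] >= 1]
```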

What’s the minimum amount of data needed to feel comfortable evaluating these models?

You mean reviews per user? FSRS can be optimized with as few as 10 reviews if we’re only using pretrain (changing the first 4 parameters). If we’re optimizing all parameters - idk, 50 or 100. Of course, the more the better. What about it?

I’m wondering if it would be feasible to have a group of people use different models over a period of time and evaluate their data, or if you need way more than a small group of people over a short period of time to even start getting good evaluations.

Why do that when we have tons of historical data from 10k users?

Just trying to prove you wrong here. The best we could actually do is use data that’s based on the model. Using data based on SM-2 and fixed relearning steps is sufficient, but data actually based on the model would be ideal.

Oftentimes you don’t need a crazy amount of data to get a good estimate, which is why pollsters only need to call a few thousand people to get polling data for the whole population.

Not saying it would be perfect. That method would have its own issues, but we might see RMSE drop quite a bit. It also might be way more work than it’s worth.

I will think about it tomorrow.

Human memory doesn’t change if you flip a switch in Anki. If some pattern exists in the data collected with SM-2 and fixed learning steps (which, again, have plenty of variance, as I showed above), it will exist in data collected with FSRS as well. So I don’t see any reason not to rely on historical data. Unless you think that memory itself (not intervals) behaves differently with FSRS, but that would be absurd.

You would get a much better idea of what the real RMSE is.

Here’s a situation I doubt is very rare in the data. Someone gets a card wrong, hits Again, then they see that card again in 10 minutes. They know it this time, but kinda feel like they want to see it again just to make sure they have it reinforced. With SM-2 there’s no harm at all in hitting Again again, because you’re just going to get a 1-day interval regardless when you finally decide to hit Good on that card.

Because of the nature of the forgetting curve, cards all basically have a retrievability of close to 100% after 10 minutes. So every time someone decided to mark one of those cards wrong, that’s adding the max amount of error to the RMSE. Getting something wrong near 100% retrievability, and getting something right near 0% retrievability both add the most error. They’re also affecting parameters the most.
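To put numbers on that, here are the per-review error terms for a lapse at high predicted retrievability (just the raw terms, not the binned RMSE the benchmark reports):

```python
import math

p = 0.98   # predicted retrievability ~10 minutes after the card was last seen
y = 0      # ...but the user hit Again anyway

squared_error = (y - p) ** 2                               # 0.9604, close to the maximum of 1
log_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # ~3.9, vs ~0.02 had the review been Good

print(round(squared_error, 4), round(log_loss, 2))
```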

It’s not like people were trying to provide clean data for these testing purposes. I’ll bet there are a ton of short relearning intervals that got marked wrong that really didn’t need to be. I know I used to do that a lot if I wanted to see a card again.
