Discount factor in training data

I wonder if it would be a good idea to add a temporal discount factor to the training (i.e. more recent data has more weight in the score to be optimised).

This may take into account the possibility that a user's memory curve for a collection is not static. For example, the more cards they learn from the collection, the easier (or harder) it becomes for them to learn new cards.
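For concreteness, here's a minimal sketch of what such a discount could look like (this is not FSRS's actual optimizer code; the variable names and the exponential form are just assumptions for illustration): each review gets a weight that decays with its age, and the weights are plugged into the log loss that the optimizer minimises.

```python
import numpy as np

def exponential_weights(review_age_days: np.ndarray, decay_rate: float) -> np.ndarray:
    """Weight = exp(-decay_rate * age); recent reviews (age ~ 0) get weight ~ 1."""
    return np.exp(-decay_rate * review_age_days)

def weighted_log_loss(p: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Log loss with per-review weights, normalized so it stays on the usual scale."""
    eps = 1e-12
    losses = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return float(np.sum(w * losses) / np.sum(w))

# Example: p = predicted recall probabilities, y = actual outcomes (1 = recalled),
# ages = days since each review.
p = np.array([0.9, 0.8, 0.7, 0.95])
y = np.array([1, 1, 0, 1])
ages = np.array([700.0, 400.0, 30.0, 1.0])
w = exponential_weights(ages, decay_rate=np.log(2) / 365)  # roughly a one-year half-life
print(weighted_log_loss(p, y, w))
```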


If you ever learn to write Hanzi/Kanji, you'll realise that it is basically drawing the same basic shapes again and again in different but similar combinations. Wouldn't be surprised if the memory curve is too steep in the beginning.

Yes, Hanzi/Kanji is a great example!

The problem is that it's unclear how to choose the value of such a factor.

We may let users pick this factor. For example, we could add an option "Review half-life" with something like a year as its default value. Instead of just ignoring all reviews before a given date, this new option would allow more fine-grained control over how old revlogs influence the parameters.
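A rough sketch of what such an option could compute (the function name and default are assumptions, not an existing setting): a review's weight halves every half-life, and the current "ignore reviews before a date" behaviour is essentially the limiting case of a 0/1 step function.

```python
def review_weight(age_days: float, half_life_days: float = 365.0) -> float:
    """Hypothetical 'Review half-life' weighting: weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

# With the one-year default: a review from today gets weight 1.0,
# a one-year-old review gets 0.5, a two-year-old review gets 0.25.
for age in (0, 365, 730):
    print(age, review_weight(age))
```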

I don't think you understand what the problem is. If you change the factor, RMSE and log loss will change. This means that you cannot use log loss/RMSE to tell you which factor is best, since these metrics themselves would depend on it.

EDIT: to clarify, here’s an analogy. Imagine that you are reading statistics about the average wage. You read a publication from 2022. Then you read a publication from 2023, and it says that the average wage went down, but it also says that the way the average is calculated has been changed this year. Can you determine whether the reported average wage went down because people actually became poorer or because the methodology has changed? Nope.

The situation with choosing a factor for discounting past data is similar.

"But wouldn't that also be a problem with choosing a date?" you may ask. Yes, it is a problem. But it's much less severe, because choosing a date is easy and intuitive, whereas choosing some abstract number that affects the algorithm in some less-than-obvious way is not.
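To make the incomparability concrete, here's a toy numerical sketch (the numbers and the half-life weighting are purely illustrative): the very same predictions get different weighted log loss values under different discount factors, so the weighted metric can't tell you which factor is "best".

```python
import numpy as np

def weighted_log_loss(p, y, w):
    eps = 1e-12
    losses = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return float(np.sum(w * losses) / np.sum(w))

p = np.array([0.9, 0.6, 0.8])      # identical predictions in every case
y = np.array([1, 0, 1])
ages = np.array([720.0, 180.0, 10.0])

for half_life in (180.0, 365.0, 1e9):   # 1e9 ~ "no discounting at all"
    w = 0.5 ** (ages / half_life)
    print(half_life, weighted_log_loss(p, y, w))  # three different values for the same model
```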

Thank you very much for the detailed explanation.

I was not talking about the difficulty of implementing a new algorithm to incorporate this factor (which, sure enough, is not negligible), nor about the negative impact of this factor on the interpretability of the metrics. I was only thinking about the possible improvement to memory curve fitting by adding this factor; after all, the idea of discounting past data is frequently used in machine learning.

Admittedly, because RMSE is not additive, how to discount data in RMSE is already a non-trivial question. On the other hand, log loss is additive, so (at least conceptually) introducing a discount factor would not cause mayhem in the rest of the algorithm.

I was only thinking about the possible improvement to memory curve fitting by adding this factor

But you can't assess whether changing the factor improves curve fitting; that's the crux. You would need some kind of meta-metric. Log loss is used to assess the goodness of fit of the algorithm. The meta-metric (if such a thing even exists) would be used to assess the validity of log loss with a given discount factor.

Very good point.