“Memorization in Machine Learning: A Survey of Results”, Dmitrii Usynin, Moritz Knolle, Georgios Kaissis (2024-08-14)⁠:

Quantifying the impact of individual data samples on machine learning models is an open research problem. This is particularly relevant when complex and high-dimensional relationships have to be learned from a limited sample of the data generating distribution, such as in deep learning.

It was previously shown that, in these cases, models rely not only on extracting patterns which are helpful for generalization, but also seem to be required to incorporate some of the training data more or less as is, in a process often termed ‘memorization’. This raises the question: if some memorization is a requirement for effective learning, what are its privacy implications?

In this work we consider a broad range of previous definitions and perspectives on memorization in ML, discuss their interplay with model generalization, and examine the implications of these phenomena for data privacy.

We then propose a framework for reasoning about what memorization means in the context of ML training, viewed through the prism of individual samples’ influence on the model. Moreover, we systematize methods that allow practitioners to detect or quantify memorization, and contextualize our findings across a broad range of ML settings.
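One common formalization of a sample’s influence that such frameworks build on is the leave-one-out memorization score (in the style of Feldman): how much more likely the trained model is to predict a sample’s label when that sample is included in the training set versus held out. A minimal sketch of that idea, using a deterministic 1-nearest-neighbour stand-in for the model (the data and function names here are illustrative, not from the paper):

```python
import numpy as np

def train_and_predict(X, y, x_query):
    """A stand-in 'model': 1-nearest-neighbour prediction over the training set."""
    distances = np.linalg.norm(X - x_query, axis=1)
    return y[int(np.argmin(distances))]

def memorization_score(X, y, i):
    """Leave-one-out memorization score for sample i:
    P(correct label | i in training set) - P(correct label | i held out).
    The 1-NN 'model' is deterministic, so no averaging over runs is needed."""
    p_in = float(train_and_predict(X, y, X[i]) == y[i])
    mask = np.arange(len(y)) != i
    p_out = float(train_and_predict(X[mask], y[mask], X[i]) == y[i])
    return p_in - p_out

# Three clustered points with label 0 and one outlier with label 1:
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
y = np.array([0, 0, 0, 1])

print(memorization_score(X, y, 3))  # outlier: only predictable if memorized -> 1.0
print(memorization_score(X, y, 0))  # typical point: covered by its neighbours -> 0.0
```

The outlier gets the maximal score because no pattern shared with other samples predicts its label; this is the intuition behind memorization being a requirement for fitting atypical examples.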

Finally, we discuss memorization in the context of privacy attacks, differential privacy and adversarial actors.
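Among the privacy attacks such discussions typically cover, the simplest to illustrate is a loss-threshold membership-inference baseline: because memorized training samples tend to incur lower loss, an adversary who guesses “member” whenever the loss is small beats chance. A toy sketch on simulated losses (the distributions and threshold are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-example losses: members (seen during training) tend to
# have lower loss than non-members -- the signal this attack exploits.
member_losses = rng.exponential(scale=0.2, size=1000)
nonmember_losses = rng.exponential(scale=1.0, size=1000)

def infer_membership(losses, threshold=0.5):
    """Guess 'member' iff the model's loss on the sample is below the threshold."""
    return losses < threshold

guesses = np.concatenate([infer_membership(member_losses),
                          infer_membership(nonmember_losses)])
truth = np.concatenate([np.ones(1000, bool), np.zeros(1000, bool)])
accuracy = (guesses == truth).mean()
print(accuracy)  # clearly above the 0.5 chance level
```

The gap between the two loss distributions is exactly what differential privacy bounds: under a small privacy budget, member and non-member losses become nearly indistinguishable and the attack's accuracy collapses toward chance.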