
Face Recognition Training App via Triplet Loss


We might be able to train people to better recognize faces the same way we teach NNs to recognize faces, with a triplet loss: present 3 faces at a time, and pick the mismatching one. And then we can use NNs to choose or generate the triplets as well, letting us scale indefinitely a curriculum of increasingly difficult, diverse, realistic face recognition problems.

This could be easily implemented as an automated adaptive-difficulty web app using public datasets and NN models, going beyond the prior small-scale research efforts, and so is worth trying if no one has already.

Many people struggle to recognize faces. I’m not too good at it either—if not as bad as someone who once told me that they focused on shoes at events because people didn’t typically bring many pairs of shoes. I’m nothing like face super-recognizers, who apparently can sometimes see someone’s face once and somehow recognize them for the rest of their life, no matter the makeup, hair length or color, clothing, or angle; I am regularly fooled, and when I look at the Google Images search results for actors or actresses especially, I am bemused that all these different people are actually the same person. Many people struggle to match even the same face.

Face recognition is known to be highly heritable and curiously uncorrelated with intelligence, but that doesn’t mean it is immutable, as few people do any kind of training of face recognition; a face recognition training system could stress a person more than a lifetime of casually meeting people at parties and going ‘have we met before?’ would. Some past studies doing small amounts of training on rather artificial data have shown benefits, like White et al 2013, DeGutis et al 2014, or Towler et al 2021 teaching a focus on ears, although Dolzycka et al 2014 found no such benefit.

These past training approaches have been limited by their small scale in time, number of samples, diversity of samples, and adaptiveness of difficulty, so it’s unsurprising if small inputs have small outputs; but skills like chick sexing often benefit from intensity (what one might call “massive input”). How can we make face recognition training more intense?

Neural net face recognizers like FaceNet are quasi-superhuman. So how do they do it? They convert each face into an embedding, a compact numerical summary which tries to capture the things that help match faces across all conditions rather than transient details, and then compare that embedding to all other known embeddings; the closest one is the best match.
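The matching step can be illustrated with a minimal sketch. The 3-dimensional vectors, the names in `gallery`, and the choice of Euclidean distance are all assumptions made for illustration; real face embeddings have far more dimensions and come from a trained model.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(query, gallery):
    """Return the label whose stored embedding is closest to `query`."""
    return min(gallery, key=lambda name: euclidean(query, gallery[name]))

# Hypothetical toy embeddings, one per known person.
gallery = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.1, 0.8, 0.3],
    "carol": [0.0, 0.2, 0.9],
}

# A new photo whose embedding lies nearest to the stored "alice" vector.
query = [0.8, 0.2, 0.1]
print(best_match(query, gallery))
```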

They are usually trained using a similarity-based method: two photographs of the same human’s face should have embeddings which are ‘close’, while a photograph of a different human’s face, no matter how similar in every other respect, should be ‘further’. An easy way to do this is to use 3 images simultaneously, with a contrastive learning ‘triplet loss’: two from the same human whose embeddings are as far apart as possible, and a third known to be from a different human but which is the closest in the dataset (according to the current NN) to the first two; then tweak the NN to push the first two closer together, and away from the third one, and thereby get a new better NN. Repeat many times with many humans and many photos of them, and eventually the NN will get very good at the task of ‘recognizing’ a human (ie. finding which photos in a dataset are most similar face-wise and most likely to be from the same human).
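As a rough sketch of what the loss measures for one triplet: it is zero once the different-person photo is further from the anchor than the same-person photo by at least some margin, and positive otherwise. The squared-distance form and the `margin` value here are assumptions for illustration, and the 2-dimensional vectors are toy data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalty for one triplet; 0.0 means it is already well separated.

    `anchor` and `positive` are embeddings of the same person,
    `negative` is an embedding of a different person.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor, positive = [0.0, 0.0], [0.1, 0.0]

# An easy negative far from the anchor contributes no loss...
easy = triplet_loss(anchor, positive, [1.0, 0.0])
# ...while a hard negative close to the anchor still does.
hard = triplet_loss(anchor, positive, [0.2, 0.0])
print(easy, hard)
```

Only the hard triplet gives the network anything to learn from, which is why the negatives are chosen to be as close as possible.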

So… Could we train humans the same way? We cannot directly affect their ‘embeddings’, but we can easily set up a triplet scenario, with 2 matches and 1 distractor, and simply ask the human for ‘the odd one out’. Repeat many times, and maybe a human could learn much better face recognition?

The simplest approach is to train face comparison or matching: “which of these faces are the same?”

Concretely, we could create a static client-side web app using HTML/JS/CSS, which presents 3 photos at a time in a triangle. (This should be easy enough for agentic LLMs as of mid-2026; acquiring and embedding the face photo data is the main challenge.) The user must pick one of them as the odd-one-out, using mouse clicks or keybindings like ASD/123/←↑→ for speed, within a decreasing time-limit. Then the answer is confirmed or the correct answer indicated, and the next one loaded. The triplets are pre-loaded and rendered off-screen in the background, so they load instantly, allowing the user to get through multiple problems per second and to get experience as fast as possible. Difficulty should be adaptive: a correct answer slightly increases the difficulty (ie. the odd-one-out is chosen to be ‘closer’ to the true pair and thus the differences are subtler), while an incorrect answer decreases it.
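The adaptive rule could be as simple as a one-up/one-down staircase over discrete difficulty levels. The sketch below is in Python for brevity, although the app itself would run this logic in client-side JS; `next_level`, `run_session`, and the number of levels are hypothetical names and values, not part of any existing app.

```python
def next_level(level, correct, n_levels):
    """Move one level harder after a correct answer, one easier after a miss.

    Levels run from 0 (easiest) to n_levels - 1 (hardest) and are clamped.
    """
    step = 1 if correct else -1
    return max(0, min(n_levels - 1, level + step))

def run_session(answers, n_levels=10, start=0):
    """Replay a sequence of True/False answers.

    Returns the level shown on each trial and the level after the last one.
    """
    level = start
    shown = []
    for correct in answers:
        shown.append(level)
        level = next_level(level, correct, n_levels)
    return shown, level

shown, final = run_session([True, True, False, True], n_levels=3)
print(shown, final)
```

A symmetric rule like this tends to hover around the level where the user is right about half the time; stepping up only after several consecutive correct answers would hold the user at an easier point.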

The user keeps training until they end the session, or they get so good the app runs out of hard-enough triplets.

Triplets can be constructed using existing public face datasets like Labeled Faces in the Wild/VGGFace2/CelebA, or generative models, and face recognition models. (They do not have to be proprietary SOTA datasets or models to provide an excellent teacher for humans who aren’t good at face recognition!) Simply take a face dataset, embed each face, and construct triplets ahead of time, and hide the list of triplet IDs in the web page; the client-side JS requests whichever photograph ID→URLs it needs.
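One way the offline construction might look, assuming the embeddings have already been computed by some face recognition model. The function name `build_triplets`, the photo IDs, and the rule of scoring a distractor by its summed distance to both photos of the pair are illustrative choices, not a prescribed method.

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_triplets(embeddings, identity):
    """Build one hard triplet per person who has at least two photos.

    embeddings: photo ID -> embedding vector
    identity:   photo ID -> person label
    """
    by_person = {}
    for photo, person in identity.items():
        by_person.setdefault(person, []).append(photo)

    triplets = []
    for person, photos in by_person.items():
        others = [p for p in embeddings if identity[p] != person]
        if len(photos) < 2 or not others:
            continue
        # Same-person pair that the embedding considers least alike.
        a, b = max(combinations(photos, 2),
                   key=lambda pair: dist(embeddings[pair[0]], embeddings[pair[1]]))
        # Different-person photo that sits closest to that pair.
        odd = min(others,
                  key=lambda p: dist(embeddings[a], embeddings[p])
                  + dist(embeddings[b], embeddings[p]))
        triplets.append({"pair": (a, b), "odd": odd})
    return triplets

# Hypothetical toy data: two people with two photos each, one with a single photo.
embeddings = {"a1": [0.0, 0.0], "a2": [0.2, 0.0],
              "b1": [0.3, 0.1], "b2": [1.0, 1.0], "c1": [5.0, 5.0]}
identity = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
print(build_triplets(embeddings, identity))
```

The resulting list of photo IDs is the kind of thing that could be serialized into the page for the client to consume.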

The adaptive difficulty can be precomputed: starting at an easy distance, simply compute all the ±x% “difficulty bands”, and precompute, say, 500 triplets; then the client simply goes through them doing random sampling-without-replacement. (Server-side logging of results would allow improving on the embedding distance heuristic by measuring empirical difficulty of triplets and bumping them up/down tiers.)
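A sketch of one possible reading of the banding scheme: each band's upper edge is a fixed percentage smaller than the previous one, so smaller distractor distances land in harder bands, and each band is shuffled once so that popping from it is sampling without replacement. The 10% shrink, the band count, and the function names are assumptions.

```python
import random

def band_index(distance, easy_distance, shrink=0.10, n_bands=10):
    """Band 0 is easiest; every later band covers distances `shrink` smaller."""
    band = 0
    edge = easy_distance * (1 - shrink)
    while distance < edge and band < n_bands - 1:
        band += 1
        edge *= 1 - shrink
    return band

def build_bands(triplets, easy_distance, shrink=0.10, n_bands=10, seed=0):
    """Group (triplet_id, distractor_distance) pairs into shuffled bands."""
    rng = random.Random(seed)
    bands = [[] for _ in range(n_bands)]
    for triplet_id, distance in triplets:
        bands[band_index(distance, easy_distance, shrink, n_bands)].append(triplet_id)
    for band in bands:
        rng.shuffle(band)  # shuffle once, then pop: no triplet repeats
    return bands

def draw(bands, level):
    """Take an unused triplet from `level`, or None once it is exhausted."""
    return bands[level].pop() if bands[level] else None

# Hypothetical triplet IDs with their distractor distances.
bands = build_bands([("t1", 0.95), ("t2", 0.85), ("t3", 0.5), ("t4", 0.97)],
                    easy_distance=1.0)
print([len(band) for band in bands])
```

Returning `None` from `draw` gives the client a signal that a tier is used up, at which point it could fall back to a neighbouring band or end the session.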

Fun variants might include trying to filter data down to specific races, as most people are even worse at ‘other race’ face recognition.

And one can validate gains using existing face tests like Cambridge Face Memory Test or UNSW Face Test.

A possible failure mode here is that the real photo datasets may be too ‘easy’ in some way; perhaps the photos are not naturally varied enough to make a real challenge, and it’s easy to tell by simple heuristics like brightness or resolution which two photographs pair up. Then face image generators (eg. StyleGAN) might be a good substitute; fix the face identity via conditioning or rejection sampling, and then deliberately vary everything else in the generated images, as such models allow one to easily control pose, expression, lighting, age, hair etc.

Would this first simple approach be enough? It might not be; you might become very good at discriminating, as intended, but then still not remember faces. A NN face recognizer is backed by a large computer database that stores everything reliably, so it’s no issue there, but you might simply forget a face as soon as you have seen it. Then your trained perception would do you no good, as in real life you will rarely see confusable faces, such as identical twins or look-alikes, side by side.

At the cost of throughput, we could try to emphasize memory encoding more with a second, more complicated app approach. Here, we stress memory by adding a delay. First, show a ‘target’ face for n seconds. Then blank the screen for m seconds, and show 2 new faces. Now the user has to pick the matching one. This forces at least short-term storage.
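The delayed-matching trial described above could be represented as a small record that the app steps through: show the target, blank the screen, then show the two choices in random order. The `DelayedTrial` class, its field names, and the file names are hypothetical; `match` is assumed to be a different photo of the target person, and `foil` a photo of someone else.

```python
import random
from dataclasses import dataclass

@dataclass
class DelayedTrial:
    target: str           # photo shown first
    choices: tuple        # the two photos shown after the blank
    correct_index: int    # position of the matching photo in `choices`
    show_seconds: float   # n: how long the target stays on screen
    blank_seconds: float  # m: how long the screen stays blank

def make_trial(target, match, foil, show_seconds, blank_seconds, rng):
    """Randomize left/right placement so position gives no hint."""
    choices = [match, foil]
    rng.shuffle(choices)
    return DelayedTrial(target, tuple(choices), choices.index(match),
                        show_seconds, blank_seconds)

def is_correct(trial, picked_index):
    """Score the user's pick against the stored answer."""
    return picked_index == trial.correct_index

rng = random.Random(0)
trial = make_trial("p1_a.jpg", "p1_b.jpg", "p2_a.jpg",
                   show_seconds=2.0, blank_seconds=3.0, rng=rng)
print(trial.choices, trial.correct_index)
```

Lengthening `blank_seconds` after correct answers would be one way to fold the memory demand into the same adaptive-difficulty loop.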

If this worked, one could keep adding delays and additional twists to train face recognition memory, such as presenting increasingly large sets of freshly generated faces and asking, “which human has been seen before during this session?” (Loosely inspired by Andrews et al 2015, with a spaced repetition twist.)

If a user hits the limit of within-session retention, and needs across-session state, that means it cannot be a static stateless web app anymore, which is a substantial complication.