“Grounded Language Acquisition through the Eyes and Ears of a Single Child”, Wai Keen Vong, Wentao Wang, A. Emin Orhan, Brenden M. Lake (2024-02):

[supplement, previously Orhan et al 2020] How do young children learn to associate new words with specific objects or visually represented concepts? This hotly debated question in early language acquisition has traditionally been examined in laboratory settings, limiting generalizability to the real world.

Vong et al 2024 investigated the question in an unprecedented, longitudinal manner using head-mounted video recordings of a single child’s first-person experiences in naturalistic settings. Applying machine learning, they introduced the Child’s View for Contrastive Learning (CVCL) model [ResNeXt + DINO], which pairs video frames with the words uttered at the same time and embeds images and words in a shared representational space.
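The paper’s exact loss and hyperparameters aren’t reproduced here, but the core idea of embedding co-occurring frames and utterances in a shared space is standard contrastive learning. A minimal PyTorch sketch of a symmetric InfoNCE-style objective of this kind (the encoder outputs and `temperature` value are assumptions, not the paper’s settings):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(frame_emb, utt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (frame, utterance) pairs.

    frame_emb, utt_emb: (batch, dim) tensors from the vision and language
    encoders; matching rows are positive pairs, and every other row
    combination in the batch serves as a negative.
    """
    frame_emb = F.normalize(frame_emb, dim=-1)
    utt_emb = F.normalize(utt_emb, dim=-1)
    logits = frame_emb @ utt_emb.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Cross-entropy in both directions: frame-to-utterance and utterance-to-frame.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```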

CVCL represents visually similar instances of one concept (eg. puzzles) through distinct subclusters (animal versus alphabet puzzles). It combines associative and representation learning, filling gaps in language-acquisition research and theory.
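How such subclusters might be detected is not spelled out above; one straightforward probe is to cluster the model’s image embeddings for frames associated with a single word. A hypothetical sketch using k-means (not necessarily the paper’s own analysis):

```python
from sklearn.cluster import KMeans

def find_subclusters(embeddings, n_clusters=2):
    """Probe whether one concept's image embeddings split into subclusters.

    embeddings: (n_images, dim) array of image embeddings for frames
    annotated with a single word (e.g. "puzzle"). Inspecting the members
    of each returned cluster can reveal distinct subtypes, such as
    animal vs. alphabet puzzles.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(embeddings)
```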


Starting around 6–9 months of age, children begin acquiring their first words, linking spoken words to their visual counterparts. How much of this knowledge is learnable from sensory input with relatively generic learning mechanisms, and how much requires stronger inductive biases?

Using longitudinal head-mounted camera recordings from one child aged 6–25 months, we trained a relatively generic neural network on 61 hours of correlated visual-linguistic data streams, learning feature-based representations and cross-modal associations.

Our model acquires many word-referent mappings present in the child’s everyday experience, enables zero-shot generalization to new visual referents, and aligns its visual and linguistic conceptual systems.
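Zero-shot generalization in a model like this is typically evaluated by embedding a held-out image and picking the nearest candidate word embedding. A hedged sketch of that evaluation step (the function names and cosine-similarity choice are assumptions, not the paper’s exact protocol):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, word_embs, word_labels):
    """Label a held-out image by its most similar word embedding.

    image_emb: (dim,) embedding of a new image.
    word_embs: (n_words, dim) embeddings of candidate category words.
    word_labels: list of n_words strings, e.g. ["ball", "car", "cat", ...].
    """
    sims = F.normalize(word_embs, dim=-1) @ F.normalize(image_emb, dim=-1)
    return word_labels[sims.argmax().item()]
```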

These results show how critical aspects of grounded word meaning are learnable through joint representation and associative learning from one child’s input.

…We train CVCL on the SAYCam-S dataset of longitudinal egocentric video recordings from an individual child, which consists of clips spanning a 1.5-year period of the child’s life (6–25 months), with a total of 600,000 video frames paired with 37,500 transcribed utterances (extracted from 61 hours of video; data examples in Figure 1A, with additional details in the Supplementary Materials (SM), §S.4).
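The precise frame-to-utterance alignment procedure is in the paper’s SM; as a rough illustration of the pairing idea, matching each transcribed utterance to the frames that co-occur with it in time might look like the following (the tuple formats and timestamps are hypothetical):

```python
def pair_frames_with_utterances(frames, utterances):
    """Associate each transcribed utterance with co-occurring video frames.

    frames: list of (timestamp_sec, frame) tuples, in temporal order.
    utterances: list of (start_sec, end_sec, text) tuples.
    Returns a list of (frame, text) training pairs for contrastive learning.
    """
    pairs = []
    for start, end, text in utterances:
        for t, frame in frames:
            if start <= t <= end:
                pairs.append((frame, text))
    return pairs
```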

Thus, SAYCam-S provides an extended, first-person window into one child’s experiences, but it captures only about 1% of the child’s waking hours and misses other aspects of their experience (eg. action and embodiment). Despite these limitations, applying machine learning to the most realistic proxy of a child’s experience to date can help illuminate the necessary ingredients for learning [29, 30].