In recent years, we have witnessed substantial performance gains on the image captioning task driven by vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work only pre-trains transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images.
In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a Transformer model, and scale the Transformer both up and down, with model sizes ranging from 13 to 675 million parameters. In terms of data, we conduct experiments with up to 200 million image-text pairs automatically collected from the web based on the alt attribute of the image (dubbed ALT200M). Extensive analysis helps to characterize the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data.
As a result, LEMON achieves new state-of-the-art results on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.
Figure 1: Image captioning performance on COCO when upscaling the model for each dataset size. The x-axis plots the number of parameters for each model size (e.g., tiny, small, huge) on a logarithmic scale. The definition of model sizes is detailed in Table 2. Increasing the model size is not substantially beneficial at small pre-training dataset scales. However, when sufficiently large datasets are used, we observe a strong performance boost from larger models.
Figure 2a: Image captioning performance when upscaling data for each model size: finetuned and evaluated on COCO. The x-axis shows the number of image-text pairs used in pre-training, plotted on a logarithmic scale. The y-axis shows the evaluation score (CIDEr) on the COCO “Karpathy” test split and the nocaps validation set, respectively. The models are first pre-trained, then finetuned on the COCO caption training split.
Figure 2b: finetuned on COCO, evaluated on nocaps
…We adapt the pre-training task to be consistent with the captioning task, and then scale the width and depth of the Transformer model, with the number of parameters ranging from 13 million (i.e., tiny) to 675 million (i.e., huge). Combining different model and pre-training data sizes, we summarize our results in Figures 1 and 2, which characterize the linear-logarithmic scaling trend. Larger models tend to benefit more when we have more than 10 million image-text pairs for pre-training. However, with only 3 million pairs, performance starts to saturate early as the model size increases. Moreover, we also investigate other design choices of VLP, e.g., model architectures and training objectives.
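To illustrate how width and depth map onto the 13M–675M parameter range, the sketch below uses the standard estimate that a Transformer encoder block with hidden size d contributes roughly 12·d² parameters (4·d² for attention projections plus 8·d² for a 4×-expanded feed-forward layer). The vocabulary size and the specific (layers, hidden) configurations are illustrative assumptions, not the paper's Table 2 values.

```python
def transformer_params(num_layers, hidden_dim, vocab_size=30522):
    """Rough parameter count for a BERT-style Transformer encoder.

    Per block: attention (4 * d^2 for Q/K/V/output projections) plus
    feed-forward (d*4d + 4d*d = 8 * d^2), i.e. ~12 * d^2; biases and
    layer norms are ignored. Embeddings add vocab_size * d.
    """
    per_block = 12 * hidden_dim ** 2
    embeddings = vocab_size * hidden_dim
    return num_layers * per_block + embeddings

# Hypothetical configurations spanning a tiny-to-huge range:
for name, layers, dim in [("tiny", 4, 256), ("base", 12, 768), ("huge", 32, 1280)]:
    print(f"{name:>5}: ~{transformer_params(layers, dim) / 1e6:.0f}M parameters")
```

With these assumed configurations the count for "base" lands near the familiar ~110M of BERT-base, which is a quick sanity check that the estimate is in the right ballpark.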
…The final dataset, named ALT200M, contains more than 200 million images, each corresponding to one alt-text. The word cloud of the 200 most frequent words is visualized in Figure 3. As shown in Table 1, compared to CC-12M, ALT200M has nearly 16× more images, and the vocabulary is almost doubled. We observe that 56% of unigrams sum up to only 0.1% of total occurrences, characterizing an extremely long tail of rarely occurring unigrams. The average length of the captions is 13.01, longer than that of the COCO caption dataset (10.44). We also observe that our dataset contains many more short captions with only 2 or 3 unigrams. This indicates a shift in the distribution of captions from pre-training to finetuning.
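The long-tail statistic above (the share of vocabulary whose combined occurrences stay under 0.1% of all tokens) can be computed with a sketch like the following. Whitespace tokenization with lowercasing is an assumption for illustration; the paper does not specify its tokenizer here.

```python
from collections import Counter

def caption_stats(captions):
    """Vocabulary size, average caption length, and long-tail unigram share.

    tail_frac is the fraction of distinct unigrams that, taken from
    rarest upward, together cover at most 0.1% of all token occurrences.
    """
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    total = sum(counts.values())
    avg_len = sum(len(cap.split()) for cap in captions) / len(captions)

    tail, covered = 0, 0
    for _, c in sorted(counts.items(), key=lambda kv: kv[1]):
        if covered + c > total * 0.001:  # would exceed the 0.1% budget
            break
        covered += c
        tail += 1
    return {"vocab": len(counts), "avg_len": avg_len,
            "tail_frac": tail / len(counts)}

stats = caption_stats(["a dog runs", "a cat sits", "a dog sits"])
```

On a web-scale corpus, a tail_frac near 0.56 would reproduce the "56% of unigrams cover 0.1% of occurrences" observation.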
…Captioning Results: …compared to the baseline trained on COCO only (row 8), after pre-training on ALT200M (row 12), the CIDEr score is improved by 16.3 for the in-domain part, and by 45.3 for the out-of-domain part. This is evidence that large-scale pre-training improves the model’s ability to recognize a wide range of long-tailed visual objects. We also present results of models pre-trained on CC3M and CC12M. Compared to the best reported results on these datasets (rows 1, 2), our CIDEr scores (rows 9, 10) are higher by 18.4 and 13.0, respectively. This demonstrates the improvement brought by the proposed training scheme when the pre-training dataset is held fixed. On the leaderboard test set, our large and huge models (rows 19, 20) both surpass the top-ranking model (row 18), which is pre-trained on 1.8B image-text pairs, setting a new state of the art of 114.3 CIDEr. We also achieve state-of-the-art results on other image captioning benchmarks, including COCO Caption and Conceptual Captions, as summarized in Tables 4 and 5.
Figure 6: Comparison of sample efficiency for different model sizes. Figure (a) shows the learning curve in pre-training, measured by the accuracy of masked token prediction (trained with cross-entropy loss). Figures (b) and (c) show the results of finetuned intermediate checkpoints, evaluated on the COCO “Karpathy” test set and the nocaps validation set, respectively. The larger model consistently achieves better results in downstream tasks with far fewer pre-training epochs, especially for out-of-domain data.
…In addition, we observe that model capacity becomes the performance bottleneck as the amount of available data increases. Figure 1 plots the scaling trend w.r.t. the number of model parameters. When pre-training with 3M data, the “base” size appears to be sufficient, and there is no large benefit to using larger models. However, with more than 40M data, the larger models start to outperform the smaller ones by a large margin. When the data magnitude reaches hundreds of millions, and if the trend observed from “base” to “huge” holds, there is promise in training an even larger model to push the limits of VLP for captioning tasks…We observe that both models continue to improve after seeing more samples in pre-training, but the larger model learns much “faster”. To achieve similar results on the downstream COCO captioning task, the base model must see 2–8× more samples in pre-training. This factor is even greater when evaluating on the nocaps out-of-domain images: the result of the “base” model after seeing 19 billion samples is still slightly worse than that of the “huge” model after seeing 0.8 billion samples. This demonstrates the efficiency of large models in learning from large-scale data, as well as their robustness in generalization.