I just added weights for the first CLIP model trained from scratch on LAION-2B (English subset of LAION-5B) in OpenCLIP (github.com/mlfoundations/ope…). A ViT-B/32 w/ an ImageNet-1k top-1 eval of 65.62%. Compute provided by @StabilityAI and a h/t to @rom1504 for help with LAION-5B.
May 20, 2022 · 8:45 PM UTC
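A minimal sketch of loading these weights through open_clip, for anyone who wants to try them; the pretrained tag ('laion2b_e16') and the image path are assumptions for illustration, so check the repo for the exact names.

```python
# Sketch: load the ViT-B/32 LAION-2B weights with open_clip and run a
# zero-shot comparison. The pretrained tag and image path are assumed,
# not confirmed from the thread.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_e16')

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # hypothetical image
text = open_clip.tokenize(['a photo of a dog', 'a photo of a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot similarity over the candidate captions
```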
The 2B model was trained over 16 epochs instead of 32, so 2.5x the sample count for the same % of the LR schedule. Looking at the graph (green) in the previous tweet, the progress of the 2B B/32 was different from the 400M models...
... after jumping up fairly high at the start, eval progress was quite slow until accelerating at epoch 10 (epoch 20 on a 32-epoch schedule), ultimately passing the 400M (or OpenAI) results by ~2.4%.
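A back-of-envelope check of that 2.5x figure (a sketch, using nominal dataset sizes):

```python
# Samples seen: 2B(en) over 16 epochs vs 400M over 32 epochs.
samples_2b = 2_000_000_000 * 16    # LAION-2B(en), 16 epochs
samples_400m = 400_000_000 * 32    # LAION-400M runs, 32 epochs
print(samples_2b / samples_400m)   # -> 2.5, at the same % of the LR schedule
```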
For anyone who's trained contrastive image-text models at this scale, is this indicative of a poor LR choice or just the difference in samples seen? I used a fairly large global batch size here: 46592 (112 * 416). @giffmana @_jongwook_kim ?
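For reference, the batch-size and step-count arithmetic, sketched with a nominal 2B samples per epoch (the exact dataset size is an assumption):

```python
# Global batch = per-GPU batch * number of GPUs; rough step counts follow.
per_gpu_batch, num_gpus = 112, 416
global_batch = per_gpu_batch * num_gpus          # 46592
steps_per_epoch = 2_000_000_000 // global_batch  # ~42,925 optimizer steps
total_steps = steps_per_epoch * 16               # ~686,800 steps over 16 epochs
print(global_batch, steps_per_epoch, total_steps)
```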