I've been plugging away on some ViT fine-tuning experiments this week while slowly recovering from a nasty cold. As part of the LAION-2B + CLIP exploration, I am fine-tuning the recently released weights on ImageNet. Some expectations were upended, but interesting weights are inbound.
For the first runs I fine-tuned directly from the CLIP image tower weights (loadable in timm now via the HF hub models). This went okay, but I was hoping for more. I squeezed out 87.4 @ 224 L/14, 87.8 @ 336, 82.2 @ 224 B/32, and 84.4 @ 384 B/32. H/14 @ 224 reached only 87.6. Not bad, but not wow.
One of my aims was to pass the BEiT results, so after digging into their fine-tuning (FT) process more, I decided to try intermediate FT on ImageNet-22k. The best BEiT weights come from a two-stage adaptation. With the intermediate stage I got 83.3 @ 224 B/32, 85 @ 384 B/32, and now 87.9 @ 224 L/14. Looking much better!
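
To make the comparison concrete, here is a quick sketch that lines up the direct and two-stage numbers for the three configurations where both were reported. It uses only the accuracies quoted in this post; the dictionary layout and the `gains` helper are illustrative names, not part of any training script.

```python
# Accuracies quoted in the post, keyed by (model, input resolution).
# Only configurations with both a direct and a two-stage result are listed.
direct = {("B/32", 224): 82.2, ("B/32", 384): 84.4, ("L/14", 224): 87.4}
two_stage = {("B/32", 224): 83.3, ("B/32", 384): 85.0, ("L/14", 224): 87.9}


def gains(before, after):
    """Return the accuracy difference (after - before) for shared keys."""
    return {k: round(after[k] - before[k], 1) for k in before if k in after}


deltas = gains(direct, two_stage)

for (model, res), delta in sorted(deltas.items()):
    print(
        f"{model} @ {res}: {direct[(model, res)]} -> "
        f"{two_stage[(model, res)]} ({delta:+.1f})"
    )
```

On these numbers the intermediate ImageNet-22k stage helps every configuration, with the largest gain on B/32 @ 224.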