I've been plugging away at some ViT fine-tuning experiments this week while slowly recovering from a nasty cold. As part of the LAION-2B + CLIP exploration, I'm fine-tuning the recently released weights on ImageNet. Some expectations were upended, but interesting weights are inbound.
For the first runs I fine-tuned directly from the CLIP image tower weights (loadable in timm now via the HF hub models). This went okay, but I was hoping for more. I squeezed out 87.4 @ 224 L/14, 87.8 @ 336, 82.2 @ 224 B/32, 84.4 @ 384 B/32. H/14 @ 224 only hit 87.6. Not bad, but not wow.
One of my aims was to pass the BEiT results, so after digging into their FT process more, I decided to try intermediate FT on ImageNet-22k. The best BEiT weights come from two-stage adaptation. With the intermediate step: 83.3 @ 224 B/32, 85.0 @ 384 B/32, and now 87.9 @ 224 L/14. Looking much better!
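To put "much better" in numbers, here is a quick sketch that lines up the direct FT scores with the intermediate FT scores quoted in this thread and prints the difference for each config. The dict layout and the `gains` helper are just illustrative names, not anything from timm.

```python
# Scores as quoted in the thread, keyed by (model, resolution).
direct_ft = {("B/32", 224): 82.2, ("B/32", 384): 84.4, ("L/14", 224): 87.4}
intermediate_ft = {("B/32", 224): 83.3, ("B/32", 384): 85.0, ("L/14", 224): 87.9}


def gains(before, after):
    """Per-config improvement, rounded to one decimal place."""
    return {k: round(after[k] - before[k], 1) for k in after if k in before}


deltas = gains(direct_ft, intermediate_ft)
for (model, res), delta in sorted(deltas.items()):
    old, new = direct_ft[(model, res)], intermediate_ft[(model, res)]
    print(f"{model} @ {res}: {old} -> {new} ({delta:+.1f})")
```

On these numbers the intermediate step helps every config that has both results, with the largest jump on B/32 @ 224.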
Further experiments are ongoing; I'm hoping for some magic with H/14 and intermediate FT. With direct FT, H/14 was peaking very early in the LR schedule. Increasing augreg helped it barely pass L/14 and pushed the peak back a bit, but I couldn't extend the gains to a longer schedule.
@wightmanr For E2E FT it seems CLIP does not work amazingly out of the box on IN, but people seem to get good performance after an additional self-distillation stage. There are several papers on that, but this one is probably the first: arxiv.org/abs/2205.14141. Have you tried anything like that?