I've been plugging away at some ViT fine-tuning experiments this week while slowly recovering from a nasty cold. As part of the LAION-2B + CLIP exploration, I'm fine-tuning the recently released weights on ImageNet. Some expectations were upended, but interesting weights are inbound.
For the first runs I fine-tuned directly from the CLIP image tower weights (loadable in timm now via the HF hub models). This went okay, but I was hoping for more. I squeezed out 87.4 @ 224 with L/14, 87.8 @ 336; 82.2 @ 224 with B/32, 84.4 @ 384 with B/32. H/14 @ 224 hit only 87.6. Not bad, but not wow.
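For anyone following along, grabbing one of those towers in timm looks roughly like this. The hub tag below is my assumption of the naming, so list the pretrained models to confirm the real names:

```python
import timm

# Check what CLIP image-tower weights are actually published on the hub.
print(timm.list_models('*clip*', pretrained=True))

# Create the ViT-L/14 image tower with CLIP pretrained weights and a fresh
# 1000-class head for ImageNet-1k fine-tuning. The tag is an assumed example.
model = timm.create_model(
    'vit_large_patch14_clip_224.laion2b',  # assumed hub tag, verify via list above
    pretrained=True,
    num_classes=1000,
)
```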
One of my aims was to beat the BEiT results, so after digging into their FT process more, I decided to try intermediate fine-tuning on ImageNet-22k (the best BEiT weights go through a two-stage adaptation: 22k then 1k). That got me 83.3 @ 224 with B/32, 85.0 @ 384 with B/32, and now 87.9 @ 224 with L/14. Looking much better!
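A minimal sketch of that two-stage flow, with a hypothetical fine_tune() helper and dataset names standing in for the real training loop (only create_model and reset_classifier are actual timm APIs here):

```python
import timm

def fine_tune(model, dataset_name):
    """Placeholder for a standard supervised fine-tuning loop."""
    ...

# Stage 1: intermediate fine-tune of the CLIP tower on ImageNet-22k
# (class count varies by 21k/22k variant; 21841 is a common choice).
model = timm.create_model('vit_base_patch32_clip_224.laion2b',  # assumed hub tag
                          pretrained=True, num_classes=21841)
fine_tune(model, 'imagenet-22k')

# Stage 2: swap in a fresh 1k head, keep the adapted trunk, fine-tune on IN-1k.
model.reset_classifier(num_classes=1000)
fine_tune(model, 'imagenet-1k')
```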
@wightmanr For E2E FT it seems CLIP does not work amazingly out of the box on IN, but people seem to get good performance after an additional self-distillation stage. There are several papers on that, but this one is probably the first: arxiv.org/abs/2205.14141. Have you tried anything like that?
@sainingxie I have not tried that (yet). I do have the paper open in a tab somewhere and thought the idea was worth trying; it seems simple enough. First things first, I needed to see how far straight FT could get for a LAION-related study.
@sainingxie One immediate question I had on a first pass of that paper was whether it'd make sense to apply the L1 loss across multiple pairs of feature maps through the network. Also, what would happen if FT + L1 feature loss were done in the same session, i.e. loss = classification_loss + l1_feat_map_loss?
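Roughly what I have in mind, as a sketch only; how the student/teacher feature maps get extracted (hooks, matching depths) is my assumption, not something from the paper:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, student_feats, teacher_feats, l1_weight=1.0):
    """loss = classification_loss + l1_feat_map_loss, averaged over multiple
    pairs of feature maps taken from matching depths in student and teacher."""
    cls_loss = F.cross_entropy(logits, targets)
    feat_loss = sum(
        F.l1_loss(s, t.detach())  # teacher is frozen; detach to be safe
        for s, t in zip(student_feats, teacher_feats)
    ) / len(student_feats)
    return cls_loss + l1_weight * feat_loss
```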
@sainingxie Saining, on the CLIP + LAION topic: I was going to push for ConvNeXt + LAION-2B runs if we get the compute budget. Are there any experiments you're aware of that look promising? Any thoughts on image tower size vs text tower size, or hparam gotchas to watch for?