I've been told timm has a lot of hidden features. Yes, the docs need improving, that's a WIP! Curious about one of those features I've been using a lot lately in CLIP ViT fine-tuning? Every model in timm, when used with the optimizer factory, supports layer-wise LR decay.
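Rough sketch of the usage. The model name is just an example, and the exact create_optimizer_v2 signature may differ across timm versions, so treat this as an approximation, not a recipe:

```python
# A minimal sketch, assuming a recent timm where create_optimizer_v2 accepts
# a layer_decay argument (worth checking against your installed version).
import timm
from timm.optim import create_optimizer_v2

model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=1000)

# layer_decay=0.75: each layer group's LR is scaled by 0.75^(distance from the head)
optimizer = create_optimizer_v2(
    model,
    opt='adamw',
    lr=1e-4,
    weight_decay=0.05,
    layer_decay=0.75,
)
```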
Also known as discriminative LR decay, this applies a decaying LR to the model params the further they sit from the head. It's very useful when fine-tuning from a large pretraining dataset (or semi/unsupervised pretrain -> supervised fine-tune) without blowing away properties learned in pretraining.
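To make the scaling concrete, here's a toy illustration (not timm's actual code) of how the per-group LR multiplier shrinks geometrically with distance from the head:

```python
# Toy illustration: with layer_decay=0.75 and a 12-block ViT, the LR multiplier
# decays geometrically as you move from the head back toward the patch embed.
base_lr, layer_decay, num_blocks = 1e-4, 0.75, 12

# group 0 = patch embed / stem, groups 1..12 = blocks, group 13 = head
for layer_id in range(num_blocks + 2):
    scale = layer_decay ** (num_blocks + 1 - layer_id)
    print(f'group {layer_id:2d}: lr = {base_lr * scale:.2e}')
```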
I didn't just try to map parameter children / modules into a flat list (that isn't consistent across models). I sat down and wrote regexes (ugh) for every single model to map stem / block / stage / head params to meaningful 'layers', either individual blocks or 'coarse' stages.
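If you want to see that grouping for a given model, recent timm versions expose it via a group_matcher method (assumption: your installed version has it):

```python
# Inspecting the regex-based grouping; group_matcher is how recent timm models
# expose the stem / blocks / stages mapping used for layer-wise decay.
import timm

model = timm.create_model('vit_base_patch16_224')
print(model.group_matcher(coarse=False))  # per-block grouping
print(model.group_matcher(coarse=True))   # 'coarse' stage-level grouping
```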
@wightmanr Ha, that's pretty much the same way we implemented this in big_vision! Though we haven't really found layerwise lr decay to be very useful yet even though it seems popular recently.
@giffmana I have not been able to achieve comparable results fine-tuning the CLIP image tower, or even in22k supervised -> 1k weights, without it. The 1k val accuracy is higher and the OOD test set scores are better, i.e. more robust. I've done some hparam search, maybe not exhaustive enough?
@wightmanr On the other hand, I feel like we also haven't explored layer-wise decay enough on our side, so I wouldn't draw any definite conclusion either way yet.