Disagree. As soon as you throw sparsity in (and depthwise/tiny-group conv is a form of sparsity) FLOPs detach from reality. That's why sparse nets are hard (arxiv.org/abs/2006.10901), and EffNetV2 actually UNDOES a lot of depthwise. EffNetV1 == MobileNetV3 == designed for CPU.
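To make the FLOPs-vs-reality point concrete, here is a small sketch (my own illustration, not from the thread) counting multiply-accumulates for a standard 3x3 conv versus a depthwise-separable one on a hypothetical mid-network feature map; the shape and channel counts are assumptions chosen for illustration.

```python
# FLOP (MAC) counts: standard 3x3 conv vs. depthwise-separable conv.
# Shapes below are illustrative assumptions, not from any specific model.

def conv_flops(h, w, c_in, c_out, k):
    """MACs for a standard k x k convolution, stride 1, 'same' padding."""
    return h * w * c_out * c_in * k * k

def depthwise_separable_flops(h, w, c_in, c_out, k):
    """MACs for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

h = w = 28
c_in = c_out = 256
k = 3

dense = conv_flops(h, w, c_in, c_out, k)
sep = depthwise_separable_flops(h, w, c_in, c_out, k)
print(f"standard:  {dense:,} MACs")       # 462,422,016
print(f"separable: {sep:,} MACs")         # 53,186,560
print(f"ratio:     {dense / sep:.1f}x fewer")  # ~8.7x

# The ~8-9x FLOP reduction rarely shows up as an 8-9x wall-clock speedup
# on GPUs/TPUs: the depthwise stage does far less arithmetic per byte
# moved, so it tends to be memory-bound. That gap is the sense in which
# FLOPs "detach from reality" once this kind of sparsity is involved.
```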
@giffmana This is true due to limitations in existing accelerator hardware. Our work, FAST, analyzes EfficientNet bottlenecks and shows a framework capable of automatically designing custom accelerators with 4x Perf/TDP on EfficientNet-B7 relative to TPU-v3. arxiv.org/abs/2105.12842
@DZhang50 I had not seen this paper yet; interesting approach. Not sure yet if I like or dislike this direction, as it risks locking us into a specific arch for a long time, although the approach itself certainly seems flexible.
@giffmana Thanks for the feedback! Under this approach, I think the goal would be to still have general-purpose ML accelerators, but also build a few optimized accelerators for specific popular workloads, eg EfficientNet and Transformers.