Currently, precision decisions in ML are made at the layer or entire-model level, but the underlying tensor cores operate on relatively small chunks, so it should be possible to optimize "mixed precision" inside a single weight matrix by permuting the weights to put ok-for-low-precision weights together in the same tensor block. Maybe even getting parts down to 4 bits. More practically, permuting weights could probably allow the Ampere sparse matrix optimization to work better by distributing the zeros more evenly.
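As a toy illustration of the sparsity idea, here is a hedged sketch (NumPy, all names hypothetical) of one possible permutation strategy for a single weight row: sort the weights by magnitude so zeros fall to the end, then deal them round-robin across the 4-wide groups that Ampere's 2:4 structured sparsity operates on, so clustered zeros get spread out.

```python
import numpy as np

def spread_zeros(w, group=4):
    """Toy sketch: return a permutation of w that deals weights
    round-robin (sorted by magnitude, zeros last) into groups of
    `group`, spreading zeros evenly across groups.
    A real kernel would apply the same permutation to the matching
    activation dimension so the matmul result is unchanged."""
    n = len(w)
    assert n % group == 0
    n_groups = n // group
    order = np.argsort(-np.abs(w), kind="stable")  # zeros sort last
    perm = np.empty(n, dtype=int)
    for j, idx in enumerate(order):
        g = j % n_groups       # which group this weight is dealt into
        slot = j // n_groups   # position within that group
        perm[g * group + slot] = idx
    return perm

def max_nonzeros_per_group(w, group=4):
    """Worst-case nonzeros in any contiguous group; 2:4 sparsity
    needs this to be <= 2."""
    return max(np.count_nonzero(w[i:i + group])
               for i in range(0, len(w), group))

# Zeros clustered in the second group: the first group is fully dense,
# so 2:4 sparsity cannot represent it without dropping weights.
w = np.array([1., 2., 3., 4., 0., 0., 0., 0.])
perm = spread_zeros(w)
print(max_nonzeros_per_group(w))        # 4: one group is fully dense
print(max_nonzeros_per_group(w[perm]))  # 2: now satisfies 2:4
```

This is only a per-row greedy sketch; for a full matrix, one permutation must serve every row at once, which is where the real optimization problem lives.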