@techdevnotes hmm... on GPQA/AIME the numbers aren't scored the same way: Sonnet 3.7's high scores use internal scoring with parallel test-time compute, while o1's and Grok 3's high results use majority voting over N=64 samples.
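For anyone unsure what "majority voting with N=64" means in practice, here is a rough sketch: sample N answers per question, take the most common one, and grade only that. The function names `majority_vote` and `consensus_accuracy` are made up for illustration, and this is not any lab's actual eval harness, just the general idea under the assumption that answers are already normalized to comparable strings.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the sampled answers.

    Ties go to the answer that appeared first in the list.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def consensus_accuracy(samples_per_question, gold_answers):
    """Fraction of questions whose majority-voted answer matches the gold answer.

    samples_per_question: list of lists, N sampled answers per question
                          (N=64 in the setup described above).
    gold_answers: list of reference answers, one per question.
    """
    if len(samples_per_question) != len(gold_answers):
        raise ValueError("need one gold answer per question")
    correct = sum(
        majority_vote(samples) == gold
        for samples, gold in zip(samples_per_question, gold_answers)
    )
    return correct / len(gold_answers)
```

With this kind of scoring a model can get credit for a question even when many individual samples are wrong, which is why it isn't directly comparable to a single-attempt score.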
@techdevnotes damn so grok 3 performs better across the board? holy fuck
@techdevnotes Lol @Kr00ney when shown that Grok 3 is better than Claude 3.7