@Kimi_Moonshot; PhD Student @ Soochow University; working on efficient methods for LLMs; disciple of parallel programming; INTP. yzhang.site · Joined February 2023
DeepSeek is using TileLang instead of Triton. TileLang is a really elegant language!
Also reminds me of this surface-level blog I wrote when first learning about it. It takes fewer than 100 lines of code to hit 630 TFLOPS for the softmax attention forward pass in TileLang (1.3x of FA2) https://t.co/hCLu73npSW
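For context, here is a minimal plain-PyTorch sketch (my own naming, not the blog's TileLang code) of the tiled online-softmax recurrence that a FlashAttention-class kernel like this implements:

```python
import torch

def flash_attn_fwd_ref(q, k, v, block=128):
    """Online-softmax attention forward for one head (causal masking omitted).
    Sketch of the recurrence a FlashAttention/TileLang-style kernel tiles over SRAM."""
    T, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((T, 1), float("-inf"))    # running row maximum
    l = torch.zeros(T, 1)                    # running softmax denominator
    for s in range(0, T, block):
        kb, vb = k[s:s + block], v[s:s + block]
        scores = (q @ kb.T) * scale
        m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - m_new)
        alpha = torch.exp(m - m_new)          # rescale previously accumulated results
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ vb
        m = m_new
    return out / l
```

The whole point of the kernel is that this loop body maps onto a few tile copies and GEMMs per block, which is why it fits in so few lines.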
(1/6) triton kernels are a great way to understand ML models, but tutorials are scattered
the learning method for me was just to read real, high-performance code
so i wrote a blog that walks through the design and intuitions behind FLA's softmax attention kernel
🧵 also a thread below
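To give a flavor of what reading and writing such kernels looks like, here is a tiny self-contained Triton kernel, a numerically stable row-wise softmax; it is a toy of my own, not FLA's attention kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    # one program instance handles one row of the matrix
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=float("-inf"))
    x = x - tl.max(x, axis=0)            # subtract the row max for numerical stability
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, num / den, mask=mask)

x = torch.randn(64, 1000, device="cuda")
out = torch.empty_like(x)
row_softmax_kernel[(x.shape[0],)](x, out, x.shape[1], BLOCK=triton.next_power_of_2(x.shape[1]))
```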
SEED's paper on associative memory and DeltaFormer is still one of my favorites
🎉 so I'm happy to share that DeltaFormer is now supported in FLA (flash-linear-attention)! I learned an incredible amount from @yzhang_cs and Mingyu https://t.co/1UU5U9uIBx
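For anyone meeting the delta rule for the first time, here is a one-step sketch of the associative-memory update behind the DeltaNet/DeltaFormer family (my own toy code, not the FLA kernel, which is chunked and hardware-efficient):

```python
import torch

def delta_rule_step(S, k, v, beta):
    """Overwrite the value currently stored under key k with v at rate beta:
    S_t = S_{t-1} + beta * (v - S_{t-1} k) k^T, i.e. S_{t-1}(I - beta k k^T) + beta v k^T."""
    v_old = S @ k                                  # value the memory currently returns for k
    return S + beta * torch.outer(v - v_old, k)

d_k, d_v = 64, 64
S = torch.zeros(d_v, d_k)
k, v = torch.randn(d_k), torch.randn(d_v)
S = delta_rule_step(S, k / k.norm(), v, beta=0.5)
```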
swiglu-style gates working so well for attention (and not just in the ffn layers) is a beautiful thing. as it turns out, the "divine benevolence" might just be caused by better inductive biases for controlling where information goes.
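A rough sketch of what such a gate can look like, output gating on the attention branch; the module and names are mine, not any particular model's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttnOutput(nn.Module):
    """SwiGLU-style output gate: the attention output is modulated elementwise
    by silu(W_g x) computed from the layer input, before the output projection."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_model, bias=False)
        self.w_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        return self.w_out(attn_out * F.silu(self.w_gate(x)))
```

The gate lets each channel decide, per token, how much of the attention branch to pass through, which is one way to read the "inductive bias for controlling where information goes."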
Big day for AI agents!
Tongyi Lab (@Ali_TongyiLab) just dropped half a dozen new papers, most focused on Deep Research agents.
I’ll walk you through the highlights in this thread. (1/N)
#COLM2025 We introduce Adaptive Computation Pruning (ACP) for the Forgetting Transformer (FoX) — a provably safe pruning method that significantly speeds up our Forgetting Attention kernel, especially for long-context pretraining. Our simple Triton kernel with ACP is 1.7x to 2.4x…
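For readers new to FoX, here is a naive PyTorch reference (my reading of the paper; single head, no ACP) of the forgetting-attention bias the kernel computes; ACP's trick is to provably skip score blocks whose accumulated decay makes their contribution negligible:

```python
import torch
import torch.nn.functional as F

def forgetting_attention_ref(q, k, v, log_f):
    """q, k: (T, d); v: (T, d_v); log_f: (T,) log forget gates in (-inf, 0].
    Logit (i, j) gets the additive decay bias d_ij = sum_{t=j+1..i} log f_t."""
    T, d = q.shape
    cum = torch.cumsum(log_f, dim=0)                 # prefix sums of log forget gates
    bias = cum[:, None] - cum[None, :]               # d_ij = cum_i - cum_j (zero on the diagonal)
    logits = (q @ k.T) / d ** 0.5 + bias
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```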
I was lucky to work in both Chinese and US LLM labs, and I've been thinking about this for a while. The current values of pretraining are indeed different:
US labs be like:
- lots of GPUs and much larger-FLOP runs
- treating stability more seriously, and not tolerating spikes…
Congrats to @SonglinYang4: the DeltaNet series finally scaled up! Also glad to see the Qwen team use the 3:1 GatedDeltaNet:Attention hybrid ratio that our Hybrid Linear Attention analysis arxiv.org/abs/2507.06457 recommended 😊
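A toy illustration of what a 3:1 hybrid layer schedule means in practice (the names are mine, not Qwen's config):

```python
def hybrid_layer_schedule(n_layers: int, linear_per_full: int = 3) -> list[str]:
    """Interleave linear_per_full GatedDeltaNet layers for every full-attention layer."""
    period = linear_per_full + 1
    return ["gated_deltanet" if (i + 1) % period else "full_attention"
            for i in range(n_layers)]

print(hybrid_layer_schedule(8))
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
```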
Apologies that I haven't written anything since joining Thinking Machines but I hope this blog post on a topic very near and dear to my heart (reproducible floating point numerics in LLM inference) will make up for it!
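The core issue is easy to demo: floating-point addition is not associative, so any change in reduction order (batch size, kernel tiling, parallel split) can change the low bits of the result:

```python
import torch

x = torch.randn(100_000, dtype=torch.float32)
a = x.sum()                                   # one reduction order
b = x.flip(0).sum()                           # same numbers, summed in reverse order
print(a.item(), b.item(), (a - b).item())     # often differ in the last bits
```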
15K Followers · 7K Following. I build tough benchmarks for LMs and then I get the LMs to solve them. SWE-bench & SWE-agent. Postdoc @Princeton. PhD @nlpnoah @UW.
2K Followers · 1K Following. Building new AI hardware at @Positron_AI. 2013 Thiel Fellow, hardware hacker, entrepreneur. Previously founded @REXComputing | https://t.co/vqJ6oJMqWG
1K Followers · 5K Following. @OpenAI, ex @Google, @AIAtMeta. Interested in Science, Psychology, Investing and generally everything.
Good Thoughts, Good Words, Good Deeds.
914 Followers · 6K Following. Staff Research Engineer in BioAI at @InstaDeepAI (part of @BioNTech_Group)
ML for de novo peptide sequencing.
https://t.co/KOjeWuazsk
547 Followers · 2K Following. Undergrad studying AI at Renmin Univ. of China, NLP researcher, intelligence explorer & trainer, interned @Tencent AI Lab. Carpe Diem🍀
18K Followers · 80 Following. Financial writer doing in-depth reporting on Chinese business, including AI, tech giants, venture capital, and profiles; also the producer and host of the "Zhang Xiaojun Podcast" (《张小珺商业访谈录》).
889 Followers · 79 Following. 🚀Bringing China's AI & tech trends, voices and perspectives to the global stage.
⚡️Powered by Zhihu (知乎) / https://t.co/OkIemRZdcj, China's leading knowledge community.
2K Followers · 2K Following. PhD student at Tsinghua NLP & AIR, studying agents that automate tasks ranging from daily activities to creative endeavors. Two drifters with the world to see.
57K Followers · 857 Following. Figuring out AI @allen_ai, open models, RLHF, fine-tuning, etc.
Contact via email.
Writes @interconnectsai
Wrote The RLHF Book
Mountain runner
4K Followers · 2K Following. Researcher at @MSFTResearch. Prev: PhD at @Mila_Quebec, intern at @Apple MLR and FAIR Labs @MetaAI, math undergraduate at @PKU1898.