@Kimi_Moonshot; PhD Student @ Soochow University; working on efficient methods for LLMs; disciple of parallel programming; INTP. yzhang.site · Joined February 2023
DeepSeek is using TileLang instead of Triton. TileLang is a really elegant language!
Also reminds me of this surface-level blog I wrote when first learning about it. It takes fewer than 100 lines of code to hit 630 TFLOPS for the softmax attention forward pass in TileLang (1.3x of FA2) https://t.co/hCLu73npSW
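For context, here is a minimal plain-PyTorch sketch (my own naming, not the blog's TileLang code) of the tiled online-softmax recurrence that a FlashAttention-class kernel like this implements:

```python
import torch

def flash_attn_fwd_ref(q, k, v, block=128):
    """Online-softmax attention forward for one head (causal masking omitted).
    Sketch of the recurrence a FlashAttention/TileLang-style kernel tiles over SRAM."""
    T, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((T, 1), float("-inf"))    # running row maximum
    l = torch.zeros(T, 1)                    # running softmax denominator
    for s in range(0, T, block):
        kb, vb = k[s:s + block], v[s:s + block]
        scores = (q @ kb.T) * scale
        m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - m_new)
        alpha = torch.exp(m - m_new)          # rescale previously accumulated results
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ vb
        m = m_new
    return out / l
```

The whole point of the kernel is that this loop body maps onto a few tile copies and GEMMs per block, which is why it fits in so few lines.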
(1/6) triton kernels are a great way to understand ML models, but tutorials are scattered
the learning method for me was just to read real, high-performance code
so i wrote a blog that walks through the design and intuitions behind FLA's softmax attention kernel
🧵 also a thread below
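To give a flavor of what reading and writing such kernels looks like, here is a tiny self-contained Triton kernel, a numerically stable row-wise softmax; it is a toy of my own, not FLA's attention kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    # one program instance handles one row of the matrix
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=float("-inf"))
    x = x - tl.max(x, axis=0)            # subtract the row max for numerical stability
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, num / den, mask=mask)

x = torch.randn(64, 1000, device="cuda")
out = torch.empty_like(x)
row_softmax_kernel[(x.shape[0],)](x, out, x.shape[1], BLOCK=triton.next_power_of_2(x.shape[1]))
```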
SEED's paper on associative memory and DeltaFormer is still one of my favorites
🎉 so I'm happy to share that DeltaFormer is now supported in FLA (flash-linear-attention)! I learned an incredible amount from @yzhang_cs and Mingyu https://t.co/1UU5U9uIBx
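For anyone meeting the delta rule for the first time, here is a one-step sketch of the associative-memory update behind the DeltaNet/DeltaFormer family (my own toy code, not the FLA kernel, which is chunked and hardware-efficient):

```python
import torch

def delta_rule_step(S, k, v, beta):
    """Overwrite the value currently stored under key k with v at rate beta:
    S_t = S_{t-1} + beta * (v - S_{t-1} k) k^T, i.e. S_{t-1}(I - beta k k^T) + beta v k^T."""
    v_old = S @ k                                  # value the memory currently returns for k
    return S + beta * torch.outer(v - v_old, k)

d_k, d_v = 64, 64
S = torch.zeros(d_v, d_k)
k, v = torch.randn(d_k), torch.randn(d_v)
S = delta_rule_step(S, k / k.norm(), v, beta=0.5)
```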
swiglu-style gates working so well for attention (and not just in the ffn layers) is a beautiful thing. as it turns out, the "divine benevolence" might just be caused by better inductive biases for controlling where information goes.
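A rough sketch of what such a gate can look like, output gating on the attention branch; the module and names are mine, not any particular model's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttnOutput(nn.Module):
    """SwiGLU-style output gate: the attention output is modulated elementwise
    by silu(W_g x) computed from the layer input, before the output projection."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_model, bias=False)
        self.w_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        return self.w_out(attn_out * F.silu(self.w_gate(x)))
```

The gate lets each channel decide, per token, how much of the attention branch to pass through, which is one way to read the "inductive bias for controlling where information goes."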
Big day for AI agents!
Tongyi Lab (@Ali_TongyiLab) just dropped half a dozen new papers, most focused on Deep Research agents.
I’ll walk you through the highlights in this thread. (1/N)
#COLM2025 We introduce Adaptive Computation Pruning (ACP) for the Forgetting Transformer (FoX) — a provably safe pruning method that significantly speeds up our Forgetting Attention kernel, especially for long-context pretraining. Our simple Triton kernel with ACP is 1.7x to 2.4x…
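For readers new to FoX, here is a naive PyTorch reference (my reading of the paper; single head, no ACP) of the forgetting-attention bias the kernel computes; ACP's trick is to provably skip score blocks whose accumulated decay makes their contribution negligible:

```python
import torch
import torch.nn.functional as F

def forgetting_attention_ref(q, k, v, log_f):
    """q, k: (T, d); v: (T, d_v); log_f: (T,) log forget gates in (-inf, 0].
    Logit (i, j) gets the additive decay bias d_ij = sum_{t=j+1..i} log f_t."""
    T, d = q.shape
    cum = torch.cumsum(log_f, dim=0)                 # prefix sums of log forget gates
    bias = cum[:, None] - cum[None, :]               # d_ij = cum_i - cum_j (zero on the diagonal)
    logits = (q @ k.T) / d ** 0.5 + bias
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```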
I was lucky to work in both Chinese and US LLM labs, and I've been thinking about this for a while. The current values of pretraining are indeed different:
US labs be like:
- lots of GPUs and much larger-FLOP runs
- treating stability more seriously, and not tolerating spikes…
Congrats to @SonglinYang4: the DeltaNet series finally scaled up! Also glad to see the Qwen team use the 3:1 GatedDeltaNet:Attention hybrid ratio that our Hybrid Linear Attention analysis arxiv.org/abs/2507.06457 recommended 😊
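A toy illustration of what a 3:1 hybrid layer schedule means in practice (the names are mine, not Qwen's config):

```python
def hybrid_layer_schedule(n_layers: int, linear_per_full: int = 3) -> list[str]:
    """Interleave linear_per_full GatedDeltaNet layers for every full-attention layer."""
    period = linear_per_full + 1
    return ["gated_deltanet" if (i + 1) % period else "full_attention"
            for i in range(n_layers)]

print(hybrid_layer_schedule(8))
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
```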
Apologies that I haven't written anything since joining Thinking Machines but I hope this blog post on a topic very near and dear to my heart (reproducible floating point numerics in LLM inference) will make up for it!
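The core issue is easy to demo: floating-point addition is not associative, so any change in reduction order (batch size, kernel tiling, parallel split) can change the low bits of the result:

```python
import torch

x = torch.randn(100_000, dtype=torch.float32)
a = x.sum()                                   # one reduction order
b = x.flip(0).sum()                           # same numbers, summed in reverse order
print(a.item(), b.item(), (a - b).item())     # often differ in the last bits
```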
15K Followers · 7K Following. I build tough benchmarks for LMs and then I get the LMs to solve them. SWE-bench & SWE-agent. Postdoc @Princeton. PhD @nlpnoah @UW.
2K Followers · 1K Following. Building new AI hardware at @Positron_AI. 2013 Thiel Fellow, hardware hacker, entrepreneur. Previously founded @REXComputing | https://t.co/vqJ6oJMqWG
1K Followers · 5K Following. @OpenAI, ex @Google, @AIAtMeta. Interested in Science, Psychology, Investing and generally everything.
Good Thoughts, Good Words, Good Deeds.
914 Followers · 6K Following. Staff Research Engineer in BioAI at @InstaDeepAI (part of @BioNTech_Group)
ML for de novo peptide sequencing.
https://t.co/KOjeWuazsk
547 Followers · 2K Following. Undergrad studying AI at Renmin Univ. of China, NLP researcher, intelligence explorer & trainer, interned @Tencent AI Lab. Carpe Diem🍀
18K Followers · 80 Following. Financial writer doing in-depth reporting on Chinese business, including AI, tech giants, venture capital, and profiles; also the producer and host of the "Zhang Xiaojun Podcast" (《张小珺商业访谈录》).
889 Followers · 79 Following. 🚀Bringing China's AI & tech trends, voices and perspectives to the global stage.
⚡️Powered by Zhihu (知乎) / https://t.co/OkIemRZdcj, China's leading knowledge community.
2K Followers · 2K Following. PhD student at Tsinghua NLP & AIR, studying agents that automate tasks ranging from daily activities to creative endeavors. Two drifters with the world to see.
57K Followers · 857 Following. Figuring out AI @allen_ai, open models, RLHF, fine-tuning, etc.
Contact via email.
Writes @interconnectsai
Wrote The RLHF Book
Mountain runner
4K Followers · 2K Following. Researcher at @MSFTResearch. Prev: PhD at @Mila_Quebec, intern at @Apple MLR and FAIR Labs @MetaAI, math undergraduate at @PKU1898.