Have you ever wanted to train LLMs in pure C, without 245MB of PyTorch and 107MB of CPython? No? Well, now you can! With llm.c: github.com/karpathy/llm.c To start, it implements GPT-2 training on CPU/fp32 in only ~1,000 lines of clean code. It compiles and runs instantly, and it exactly matches the PyTorch reference implementation. I chose GPT-2 to start because it is the grand-daddy of LLMs: the first time the LLM stack was put together in a recognizably modern form, with model weights openly available.
You can look at the raw training implementation here: github.com/karpathy/llm.c… You'll see that we allocate all the required memory once, at startup, as a single large 1D block. From then on during training no memory is created or destroyed, so the memory footprint stays constant and it's just dynamics, streaming batches of data through. The crux of it is manually implementing the forward and backward pass of each individual layer and then stringing them together; see the layernorm forward and backward sketch below. In addition to layernorm we need the encoder, matmul, self-attention, gelu, residual, softmax, and the cross-entropy loss.
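For a concrete picture, here is a minimal, self-contained sketch of those two layernorm kernels, written from memory in the spirit of llm.c; the exact names and signatures in the repo may differ, and the toy sizes, the `main` driver, and the pointer carving are illustrative assumptions, not the repo's code:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// layernorm over the channel dim C, for activations of shape (B, T, C).
// caches per-position mean and reciprocal std for the backward pass.
void layernorm_forward(float* out, float* mean, float* rstd,
                       const float* inp, const float* weight, const float* bias,
                       int B, int T, int C) {
    const float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + (b * T + t) * C;
            float m = 0.0f;                           // mean
            for (int i = 0; i < C; i++) m += x[i];
            m /= C;
            float v = 0.0f;                           // variance
            for (int i = 0; i < C; i++) { float d = x[i] - m; v += d * d; }
            v /= C;
            float s = 1.0f / sqrtf(v + eps);          // reciprocal std
            float* o = out + (b * T + t) * C;
            for (int i = 0; i < C; i++) {
                o[i] = (x[i] - m) * s * weight[i] + bias[i]; // normalize, scale, shift
            }
            mean[b * T + t] = m;
            rstd[b * T + t] = s;
        }
    }
}

// backward pass: accumulates (+=) into dinp, dweight, dbias,
// reusing the cached mean/rstd instead of recomputing the statistics.
void layernorm_backward(float* dinp, float* dweight, float* dbias,
                        const float* dout, const float* inp, const float* weight,
                        const float* mean, const float* rstd,
                        int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* dout_bt = dout + (b * T + t) * C;
            const float* inp_bt  = inp  + (b * T + t) * C;
            float* dinp_bt = dinp + (b * T + t) * C;
            float m = mean[b * T + t];
            float s = rstd[b * T + t];
            // two reductions that appear in the input gradient
            float dnorm_mean = 0.0f, dnorm_norm_mean = 0.0f;
            for (int i = 0; i < C; i++) {
                float norm  = (inp_bt[i] - m) * s;
                float dnorm = weight[i] * dout_bt[i];
                dnorm_mean += dnorm;
                dnorm_norm_mean += dnorm * norm;
            }
            dnorm_mean /= C;
            dnorm_norm_mean /= C;
            for (int i = 0; i < C; i++) {
                float norm  = (inp_bt[i] - m) * s;
                float dnorm = weight[i] * dout_bt[i];
                dbias[i]   += dout_bt[i];
                dweight[i] += norm * dout_bt[i];
                dinp_bt[i] += (dnorm - dnorm_mean - norm * dnorm_norm_mean) * s;
            }
        }
    }
}

int main(void) {
    // toy sizes for the demo (hypothetical, not the GPT-2 config)
    int B = 2, T = 3, C = 4;
    // same spirit as the single upfront allocation: one block, carved into views
    size_t n = (size_t)(2 * B * T * C + 2 * B * T + 2 * C);
    float* block  = (float*)calloc(n, sizeof(float));
    float* inp    = block;
    float* out    = inp + B * T * C;
    float* mean   = out + B * T * C;
    float* rstd   = mean + B * T;
    float* weight = rstd + B * T;
    float* bias   = weight + C;
    for (int i = 0; i < B * T * C; i++) inp[i] = 0.1f * (float)i;
    for (int i = 0; i < C; i++) weight[i] = 1.0f; // identity scale, zero shift
    layernorm_forward(out, mean, rstd, inp, weight, bias, B, T, C);
    printf("out[0..3] = %f %f %f %f\n", out[0], out[1], out[2], out[3]);
    free(block);
    return 0;
}
```

Caching mean and rstd in the forward pass is the design choice worth noticing: the backward pass then needs only two reductions per position rather than recomputing the statistics from scratch. Compile with `cc -O2 layernorm.c -lm`.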
@karpathy Short answer, No. Long answer, heyall, sheeyit, fawk noooooooooooooooooooooooooooooo...
@karpathy Really the problem is not the amount of memory PyTorch et al. use, but the huge amount of complexity (and thus development friction, bugs, and security holes) "required" as a baseline to perform such a task.
@karpathy "Real men program in C" Brilliant code! 👏
@karpathy Respectfully, this doesn't work. 😅😂 Jk. 🫡 Thanks a lot. 😇