While instruction tuning is clearly necessary for producing usable interfaces like ChatGPT, the "magic" of language models comes from self-supervised learning on broad data, which enables emergent behavior like in-context learning and chain-of-thought.
Have large language models solved news summarization? Almost there. Our new study shows that text-davinci-002 is comparable to freelance writers.
Our group's been thinking about how AI is having its Linux 🐧 moment.
Open source models + community are driving amazing progress. There’s so much to do, and so many ways to get involved!
Check out these thoughts at the @HazyResearch blog
@joshalbrecht One issue is that even if you eval on your unpublished test set, you still have to send it to an API, which could cause leakage.
But this might not be enough either: if we want to measure cross-task generalization, we have to ensure that no examples of a task/domain are represented in the training data. This is essentially impossible.
A better solution would be to have all the LM providers agree on a common repository of examples that should be excluded from any training run.
I worry about language models being trained on test sets. Recently, we emailed [email protected] to opt out of having our (test) data be used to improve models. This isn't enough though: others running evals could still inadvertently contribute those test sets to training.
Introducing Demonstrate–Search–Predict (𝗗𝗦𝗣), a framework for composing search and LMs w/ up to 120% gains over GPT-3.5.
No more prompt engineering.❌
Describe a high-level strategy as imperative code and let 𝗗𝗦𝗣 deal with prompts and queries.🧵
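To make the "strategy as imperative code" idea concrete, here is a minimal toy sketch of the pattern the tweet describes: the pipeline is ordinary code, and a framework like DSP would fill in the prompts and LM/search calls. All function names here (`search`, `demonstrate`, `predict`, `answer`) are illustrative stand-ins, not the actual DSP API, and the "LM" and "retriever" are trivial stubs.

```python
# Hypothetical sketch of the imperative-strategy idea behind DSP.
# None of these names are the real DSP API; they only illustrate the shape.

def search(question, corpus, k=2):
    """Toy retriever: rank passages by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: -len(set(p.lower().split()) & q_words))
    return scored[:k]

def demonstrate(examples):
    """Format a few worked (question, answer) examples as a prompt prefix."""
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)

def predict(prompt):
    """Stand-in for an LM call; here it just echoes the final prompt line."""
    return prompt.splitlines()[-1]

def answer(question, corpus, examples):
    # The strategy itself is plain imperative code: retrieve, build prompt, generate.
    passages = search(question, corpus)
    prompt = (demonstrate(examples)
              + "\nContext: " + " ".join(passages)
              + f"\nQ: {question}")
    return predict(prompt)
```

The point of the pattern: the author writes only the control flow above, while the framework is responsible for turning each step into actual prompts and queries against an LM and a search index.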
Introducing FlashConv, a new technique for training state space models. Runs up to 35X faster than FlashAttention and runs the new H3 language model 2.4X faster than Transformers! Research by @tri_dao and our own @realDanFu. together.xyz/blog/h3
Attention is all you need... but how much of it do you need?
Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao
📜 arxiv.org/abs/2212.14052 1/n
Call for reviewers for our #ICLR2023 workshop on Mathematical and Empirical Understanding of Foundation Models.
Fill out this form if you are interested, and we will aim to get back to you ASAP.
Paper deadline: 3 Feb
Tentative reviewing period: 10-24 Feb.
Ce Zhang (@DS3Lab and @togethercompute) has done some crazy stuff in distributed training. In this talk, he goes over the magic behind distributed training and inference on a GLOBAL scale over slow networks!
Tune in tomorrow at 3:30 pm Pacific!
You may have received an email today, asking you to split your NeurIPS paper into two separate PDFs: one "main" paper (~9 pages + refs) and another "supplement". Why are we doing this still?
Sign this petition to stop this practice
They all have distinct research styles and directions, and each has produced exciting and insightful results that have surprised me. Of course this is a super compressed summary - check out their work to learn more!
Dimitris Tsipras (@tsiprasd) has done seminal work in adversarial robustness. Recently, he has pivoted to language models - understanding in-context learning and making major contributions to the HELM benchmark.
Transformers can do in-context learning: arxiv.org/pdf/2208.01066…
John Thickstun (@jwthickstun) develops methods to control generative models without fine-tuning, tackling challenging discrete modalities such as language & music and handling complex controls.
Sampling from autoregressive models using Langevin dynamics: arxiv.org/pdf/2105.08164…
Steve Mussmann (@MussmannSteve) develops theory (upper and lower bounds) for active learning that yields practical insights, for example, explaining the surprising success of uncertainty sampling.
Data subset selection via machine teaching: drive.google.com/file/d/1j7K7f5…
Mina Lee (@MinaLee__) studies how humans interact with language models for writing and other tasks. She brings a fresh human-centered perspective to the default automation framing of LMs.
Evaluating human-LM interaction: arxiv.org/pdf/2212.09746…
Ananya Kumar (@ananyaku) focuses on making foundation models robust to distribution shift. He develops theory on the role of data in pretraining and how best to fine-tune; these insights lead to SOTA results.
Fine-tuning can distort features: arxiv.org/pdf/2202.10054…
Niladri Chatterji (@niladrichat) develops holistic theoretical understanding in the brave new world of deep learning, capturing optimization and generalization in non-convex and overparametrized settings.
Benign overfitting without linearity: arxiv.org/pdf/2202.05928…
I have 6 fantastic students and post-docs who are on the academic job market this year. Here is a short thread summarizing their work along with one representative paper: