OpenCompass @OpenCompassX

OpenCompass focuses on the evaluation and analysis of large language models and vision language models. github: https://t.co/zF7ycuTXxs opencompass.org.cn/home China Joined April 2024

Tweets

73
Followers

261
Following

47
Likes

58

OpenMMLab @OpenMMLab

a month ago

🔥China’s Open-source VLMs boom—Intern-S1, MiniCPM-V-4, GLM-4.5V, Step3, OVIS 🧐Join the AI Insight Talk with @huggingface, @OpenCompassX, @ModelScope2022 and @ZhihuFrontier 🚀Tech deep-dives & breakthroughs 🚀Roundtable debates ⏰Aug 21, 5 AM PDT 📺Live: youtube.com/live/kh0WSMoVZ…

2 3 18 4K 3

OpenCompass @OpenCompassX

2 months ago

🚀 Introducing #CompassVerifier: A unified and robust answer verifier for #LLMs evaluation and #RLVR! ✨LLM progress is bottlenecked by weak evaluation. Looking for an alternative to rule-based verifiers? CompassVerifier can handle multiple domains including math, science, and…

0 1 4 750 3

OpenCompass @OpenCompassX

7 months ago

🥳#CodeCriticBench assesses LLMs' critiquing ability in code generation and QA tasks. Covering 10 criteria, it features a 4.3k-sample dataset with three difficulty levels and a balanced distribution. 😉CodeCriticBench is now part of the #CompassHub! 😚Feel free to download and…

0 0 3 234 0

OpenCompass @OpenCompassX

7 months ago

🥳#StructFlowBench is a structurally annotated multi-turn benchmark that leverages a structure-driven generation paradigm to enhance the simulation of complex dialogue scenarios. 🥳StructFlowBench is now part of the #CompassHub! 😉Feel free to download and explore it—available…

0 1 3 702 0

OpenCompass @OpenCompassX

7 months ago

😉#VBench is a comprehensive benchmark that evaluates video generation quality. It covers 16 dimensions of video generation and also provides a dataset of human preference annotations. 🥳VBench is now part of the #CompassHub! Feel free to download and explore it, available for…

0 0 2 156 0

OpenCompass @OpenCompassX

7 months ago

🥰VLM²-Bench is the first comprehensive benchmark that evaluates vision-language models' (#VLMs) ability to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases. 🥳VLM²-Bench is now part of the…

0 0 6 280 0

OpenCompass @OpenCompassX

8 months ago

We've uploaded the AIME 2025 exam, complete with questions and solutions, here: huggingface.co/datasets/openc…. Feel free to test your powerful LLM on this dataset.

0 0 4 209 1

OpenCompass @OpenCompassX

9 months ago

🌟 Exciting News! CompassArena is now back with some major updates: - **Judge Copilot**: An LLM-as-a-Judge tool for model comparisons. 🤖 - **Enhanced Statistical Model**: Improved Bradley-Terry accuracy by addressing confounding variables. 📊 - **20+ New LLMs**: A global mix of…

0 2 5 194 2
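
For readers unfamiliar with the Bradley-Terry model named in the CompassArena update: it gives each model a positive strength p_i and models the probability that model i beats model j as p_i / (p_i + p_j). The sketch below fits those strengths from pairwise win counts with the classic iterative update. It is a generic illustration, not CompassArena's implementation, and it does not include the confounder adjustments the post refers to. The function name `fit_bradley_terry` and the `wins` matrix layout are assumptions made for this example.

```python
from typing import List


def fit_bradley_terry(wins: List[List[float]], n_iter: int = 200) -> List[float]:
    """Estimate Bradley-Terry strengths from a square matrix of win counts.

    wins[i][j] is the number of times model i beat model j (hypothetical layout).
    Returns strengths normalized to sum to 1.
    Assumes every model has at least one win; otherwise strengths can collapse to 0.
    """
    n = len(wins)
    p = [1.0] * n  # start with equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (sum of current strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]  # fix the overall scale
    return p


# Toy example: model 0 beat model 1 three times and lost once.
strengths = fit_bradley_terry([[0, 3], [1, 0]])
win_prob_0_over_1 = strengths[0] / (strengths[0] + strengths[1])
print(round(win_prob_0_over_1, 3))
```

With only two models, the fitted win probability simply matches the observed win rate (3 of 4 games); the model becomes useful when many models are compared through overlapping pairings.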

meng shao @shao__meng

9 months ago

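Do large language models have stable reasoning ability? "Research from Shanghai AI Laboratory @OpenCompassX uses a novel evaluation method to reveal a key problem: although large language models may perform well in a single test (for example, OpenAI's latest model reaches a single-attempt accuracy of 66.5%), in scenarios that require consistently stable output, the performance of nearly all models drops sharply (the drop generally exceeds…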

1 3 10 1K 6

OpenCompass @OpenCompassX

9 months ago

You are welcome to submit your LMM to our new leaderboard.

Haodong Duan @KennyUTC

9 months ago

You are welcome to submit your LMM to our new leaderboard.

0 2 8 2K 0

0 0 1 240 1

Haodong Duan @KennyUTC

9 months ago

OpenCompass has established a leaderboard to evaluate the complex reasoning capability of LMMs, consisting of four advanced multi-modal math reasoning benchmarks. Gemini-2.0-Flash currently holds 1st place. DM me to suggest more benchmarks and models for this leaderboard.

0 2 8 2K 0

OpenCompass @OpenCompassX

9 months ago

🚀 Shocking: O1-mini scores just 15.6% on AIME under strict, real-world metrics. 🚨 📈 Introducing G-Pass@k: A metric that reveals LLMs' performance consistency across trials. 🌐 LiveMathBench: Challenging LLMs with contemporary math problems, minimizing data leaks. 🔍 Our…

3 15 66 12K 27
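
The post does not spell out how consistency across trials is computed. As a rough illustration of the idea, the sketch below answers one question: if a model produced n answers to a problem and c of them were correct, what is the probability that at least a fraction tau of k answers drawn without replacement are correct? This is a plain hypergeometric counting argument and should not be read as the authors' exact definition of G-Pass@k. The function name `consistency_at_k` and the threshold parameter `tau` are assumptions made for this example.

```python
from math import ceil, comb


def consistency_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that at least ceil(tau * k) of k sampled answers are correct.

    n: total answers generated for one problem
    c: how many of those n answers are correct
    k: number of answers drawn without replacement
    tau: required fraction of correct answers, in (0, 1] (hypothetical parameter)
    """
    if not (0 <= c <= n and 1 <= k <= n and 0 < tau <= 1):
        raise ValueError("need 0 <= c <= n, 1 <= k <= n, 0 < tau <= 1")
    need = ceil(tau * k)  # minimum number of correct answers among the k drawn
    favorable = sum(
        comb(c, j) * comb(n - c, k - j)  # j correct and k - j incorrect draws
        for j in range(need, min(c, k) + 1)
    )
    return favorable / comb(n, k)


# Toy example: 4 answers, 2 correct, draw 2.
lenient = consistency_at_k(4, 2, 2, tau=0.5)  # at least 1 of 2 correct
strict = consistency_at_k(4, 2, 2, tau=1.0)   # both correct
print(lenient, strict)
```

In the toy example the lenient setting coincides with the usual "at least one correct" view of sampling, while the strict setting demands that every drawn answer be correct. The gap between the two is the kind of drop a consistency-oriented metric is meant to expose.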

YuanLiuuuuuu @a33668874586

11 months ago

MMBench has been selected as one of the most influential papers at ECCV 2024, ranking second.🎉🎉🎉 paperdigest.org/2024/09/most-i…

0 1 1 228 0

OpenCompass @OpenCompassX

11 months ago

🚀 Excited to announce the release of CompassJudger-1, a powerful Judge LLM for diverse tasks! We've released 4 model sizes. 📷Submit your LLM's performance using CompassJudger to our leaderboard now! 📷Models: CompassJudger: github.com/open-compass/C… 📷Leaderboard:…

0 3 8 1K 3

OpenCompass @OpenCompassX

11 months ago

ProSA: a framework to evaluate and understand the prompt sensitivity of LLMs, by Sachin Kumar link.medium.com/BTUVPXsAQNb

0 0 1 130 0

OpenCompass @OpenCompassX

a year ago

Congratulations to @Alibaba_Qwen on the release of so many new models! 🚀🚀🚀 OpenCompass now supports Qwen-2.5. Stay tuned for more evaluation results, coming soon!📊📊📊 github.com/open-compass/o…