From the results in Table 3, the Qwen3-235B-A22B-Base model attains the highest scores on most of the evaluated benchmarks. We further compare Qwen3-235B-A22B-Base with the other baselines separately for a more detailed analysis.

By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B: the model produces an explicit reasoning trace to improve the quality of its final responses (a minimal usage sketch appears at the end of this section).

GPQA Diamond set: a subset of 198 high-objectivity, challenging multiple-choice questions designed for advanced testing, with difficulty at or above college-level expertise in biology, physics, and chemistry (a scoring sketch also follows below).

The leaderboard compares leading large language models such as Llama, Qwen, and DeepSeek on key benchmarks including LiveCodeBench, MMLU-Pro, and GPQA.

In that comparison, despite being the second-smallest model in the lineup, Qwen3-235B outranks all other models on every benchmark except INCLUDE (multilingual tasks), where DeepSeek-V3 leads.

Initial benchmark results indicate that Qwen3-235B-A22B matches or outperforms leading systems such as DeepSeek-R1, OpenAI's o1 and o3-mini, Grok-3, and Google DeepMind's Gemini 2.5 Pro across coding, mathematics, and general-reasoning tasks.
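As noted above, Qwen3 ships with thinking enabled by default. Below is a minimal sketch of toggling that behavior through Hugging Face transformers; the `Qwen/Qwen3-235B-A22B` checkpoint id and the `enable_thinking` chat-template flag follow the public model card, but treat the exact names as assumptions rather than guarantees.

```python
# Minimal sketch: toggling Qwen3's thinking mode via transformers.
# Assumes the Qwen/Qwen3-235B-A22B checkpoint and the `enable_thinking`
# chat-template flag documented on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are below 30?"}]

# enable_thinking=True (the default) lets the model emit a <think>...</think>
# reasoning trace before its final answer; set it to False to suppress it.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
response = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True
)
print(response)
```

Disabling thinking trades some reasoning quality for shorter, cheaper generations, which is why benchmark reports usually state which mode was evaluated.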
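For concreteness, here is a hedged sketch of how GPQA Diamond is commonly scored as four-way multiple choice. The Hugging Face dataset id `Idavidrein/gpqa` with the `gpqa_diamond` config, its column names, and the `ask_model` callable are assumptions for illustration; the dataset is gated and requires accepting its terms.

```python
# Hedged sketch: accuracy on GPQA Diamond as 4-way multiple choice.
# Dataset id, config, and column names are assumptions based on the
# public Hugging Face release; `ask_model` is a hypothetical stand-in
# for whichever model is being evaluated.
import random
from datasets import load_dataset

def score_gpqa_diamond(ask_model, seed: int = 0) -> float:
    rows = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
    rng = random.Random(seed)
    correct = 0
    for row in rows:  # 198 questions in the Diamond subset
        options = [
            row["Correct Answer"],
            row["Incorrect Answer 1"],
            row["Incorrect Answer 2"],
            row["Incorrect Answer 3"],
        ]
        rng.shuffle(options)  # randomize the position of the gold answer
        gold = "ABCD"[options.index(row["Correct Answer"])]
        prompt = (
            row["Question"]
            + "\n"
            + "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
            + "\nAnswer with a single letter."
        )
        if ask_model(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(rows)
```

Shuffling the answer positions per question guards against position bias, which matters on a set this small, where a few systematically misplaced answers can shift the reported score by whole percentage points.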