This conversation was exported from ChatGPT.[1]
gpt-5-2-thinking
User:
Why is ChatGPT 5.2 Pro not evaluated on benchmark sites? If it is, which sites?
Assistant:
“ChatGPT 5.2 Pro” is a bit of a shape-shifter as far as benchmarking goes.
Most benchmark/leaderboard sites don’t (and often can’t) evaluate the ChatGPT product experience directly. They usually evaluate an API model with a stable, scriptable interface, because benchmarks require thousands of automated runs with controlled settings.
Why you often don’t see “ChatGPT 5.2 Pro” on the big public leaderboards
- ChatGPT (the app) isn’t a clean benchmark target. The ChatGPT experience includes extra layers that benchmarks try to avoid: system instructions, safety layers, tool routing, UI behaviors, feature toggles, and sometimes “smart defaults” that aren’t exposed as simple knobs. That makes results harder to reproduce and compare apples-to-apples across vendors.
- Automated evaluation via ChatGPT is generally a no-go. OpenAI’s Terms of Use forbid automatically or programmatically extracting data or output from the services, which blocks the typical “benchmark harness logs into the website and blasts 50k prompts” approach. OpenAI[2] So serious evaluators generally stick to the API.
- GPT-5.2 Pro is API-accessible, but it’s a specialty endpoint. OpenAI’s model page states GPT-5.2 Pro is available via the Responses API (not the older Chat Completions flow), which can be an extra integration hurdle for some benchmark rigs. OpenAI Platform[3]
- It’s expensive to benchmark at scale. The official model listing shows GPT-5.2 Pro priced at $21 per 1M input tokens and $168 per 1M output tokens, so benchmarks can become hilariously pricey, fast (see the rough cost sketch after this list). OpenAI Platform[4] Even if a site could run it, many choose not to continuously retest the most expensive tiers.
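To put those prices in perspective, here is a rough back-of-the-envelope cost sketch in Python. The per-million-token prices are the ones quoted above; the benchmark size and the average token counts per prompt are illustrative assumptions, not figures from any actual leaderboard.

```python
# Back-of-the-envelope benchmark cost estimate for GPT-5.2 Pro, using the
# list prices quoted above. The benchmark size and per-prompt token counts
# are illustrative assumptions.

INPUT_PRICE_PER_M = 21.00    # USD per 1M input tokens (quoted above)
OUTPUT_PRICE_PER_M = 168.00  # USD per 1M output tokens (quoted above)

num_prompts = 50_000       # hypothetical benchmark size
avg_input_tokens = 1_000   # assumed prompt length
avg_output_tokens = 4_000  # assumed response length (reasoning output runs long)

input_cost = num_prompts * avg_input_tokens / 1_000_000 * INPUT_PRICE_PER_M
output_cost = num_prompts * avg_output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M

print(f"Input cost:  ${input_cost:,.0f}")                 # $1,050
print(f"Output cost: ${output_cost:,.0f}")                # $33,600
print(f"Total:       ${input_cost + output_cost:,.0f}")   # $34,650 for one run
```

Under those assumptions a single pass already lands in the tens of thousands of dollars, before any reruns or alternate settings, which is why the priciest tiers are often skipped or tested only once.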
So is it evaluated anywhere? Yep, just not everywhere.
Where GPT-5.2 Pro does show up (examples)
- SimpleBench: SimpleBench’s public leaderboard includes “GPT-5.2 Pro (xhigh)” (and also “GPT-5.2 (high)”). SimpleBench[5]
- ARC Prize leaderboard (ARC-AGI): ARC Prize’s leaderboard includes GPT-5.2 Pro results (it tracks performance versus cost per task for ARC-AGI-style evaluations). ARC Prize[6] Note: the page is heavily visual/interactive, so some text scrapers show less detail, but the leaderboard is a real benchmarking venue.
Where you’ll often NOT see GPT-5.2 Pro
- LMArena / Chatbot Arena (LMSYS): the Arena currently lists GPT-5.2 variants like gpt-5.2, gpt-5.2-high, and gpt-5.2-search, but not a “Pro” entry. LMArena[7] That’s consistent with the cost + access + “which exact settings did you run?” problem.
Bonus: sites that track GPT-5.2 generally (but may not label “Pro” separately)
- Artificial Analysis has a GPT-5.2 model page and comparisons, but it commonly shows GPT-5.2 variants (like xhigh) rather than always breaking out “Pro” as its own named entry. Artificial Analysis[8]
- Epoch AI’s ECI aggregates benchmark scores and reports GPT-5.2 on its capability index (again, not always split into every tier the way OpenAI markets them). Substack[9]
If you want a clean mental model: benchmark sites are usually grading the underlying API model (gpt-5.2-pro) under very specific settings, not the full ChatGPT “Pro” experience you use in the app. That’s why the label you’re looking for often seems to “vanish” like a magician’s coin, even when the model itself is being tested. 🪄
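For a concrete picture of what “grading the underlying API model under very specific settings” looks like, here is a minimal sketch using the OpenAI Python SDK’s Responses API (the endpoint mentioned above). The gpt-5.2-pro model id comes from the discussion above, and the “high” reasoning effort is an assumed setting chosen for illustration; a real harness would pin whatever configuration the leaderboard actually reports.

```python
# Minimal sketch of how a benchmark harness queries the API model directly
# (rather than the ChatGPT app), with its settings pinned explicitly.
# Assumptions: the OpenAI Python SDK (openai>=1.x), the gpt-5.2-pro model id
# discussed above, and a "high" reasoning effort chosen for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(question: str) -> str:
    """Send one benchmark prompt through the Responses API and return the text."""
    response = client.responses.create(
        model="gpt-5.2-pro",           # the API model, not the ChatGPT "Pro" app tier
        reasoning={"effort": "high"},  # assumed setting; leaderboards report this explicitly
        input=question,
    )
    return response.output_text


if __name__ == "__main__":
    print(ask("How many r's are in 'strawberry'?"))
```

Pinning settings like the reasoning effort is what makes results reproducible, and it’s why leaderboard entries carry suffixes such as “(xhigh)” or “(high)” instead of just the marketing name.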
References
- ↑ "ChatGPT 5.2 Pro benchmarks". ChatGPT. Retrieved 2025-12-22.
- ↑ "OpenAI". openai.com. Retrieved 2025-12-22.
- ↑ "OpenAI Platform". OpenAI Platform. Retrieved 2025-12-22.
- ↑ "OpenAI Platform". OpenAI Platform. Retrieved 2025-12-22.
- ↑ "SimpleBench". simple-bench.com. Retrieved 2025-12-22.
- ↑ "ARC Prize". ARC Prize. Retrieved 2025-12-22.
- ↑ "LMArena". lmarena.ai. Retrieved 2025-12-22.
- ↑ "Artificial Analysis". Artificial Analysis. Retrieved 2025-12-22.
- ↑ "Substack". substack.com. Retrieved 2025-12-22.