{{DISPLAYTITLE:ChatGPT 5.2 Pro benchmarks}}
''This conversation was exported from ChatGPT.''<ref>{{cite web|title=ChatGPT 5.2 Pro benchmarks|url=https://chatgpt.com/share/694896a0-6814-800f-b484-4ed0ec0b6e9c|publisher=ChatGPT|access-date=2025-12-22}}</ref>

== gpt-5-2-thinking ==

=== User: Why is ChatGPT 5.2 Pro not evaluated on benchmark sites? ===
Why is ChatGPT 5.2 Pro not evaluated on benchmark sites? If it is, which sites?

=== Assistant: “ChatGPT 5.2 Pro” is a bit of a shape-shifter as far as benchmarking goes. ===
Most benchmark/leaderboard sites don’t (and often can’t) evaluate the ChatGPT product experience directly. They usually evaluate an API model with a stable, scriptable interface, because benchmarks require thousands of automated runs with controlled settings.

Why you often don’t see “ChatGPT 5.2 Pro” on the big public leaderboards:

# ChatGPT (the app) isn’t a clean benchmark target. The ChatGPT experience includes extra layers that benchmarks try to avoid: system instructions, safety layers, tool routing, UI behaviors, feature toggles, and sometimes “smart defaults” that aren’t exposed as simple knobs. That makes results harder to reproduce and compare apples-to-apples across vendors.
# Automated evaluation via ChatGPT is generally a no-go. OpenAI’s Terms of Use forbid automatically or programmatically extracting data or output from the services, which blocks the typical “benchmark harness logs into the website and blasts 50k prompts” approach.<ref>{{cite web|title=OpenAI|url=https://openai.com/policies/row-terms-of-use/|publisher=openai.com|access-date=2025-12-22}}</ref> So serious evaluators generally stick to the API.
# GPT-5.2 Pro is API-accessible, but it’s a specialty endpoint. OpenAI’s model page states GPT-5.2 Pro is available via the Responses API (not the older Chat Completions flow), which can be an extra integration hurdle for some benchmark rigs; a short call sketch follows this list.<ref>{{cite web|title=OpenAI Platform|url=https://platform.openai.com/docs/models/gpt-5.2-pro|publisher=OpenAI Platform|access-date=2025-12-22}}</ref>
# It’s expensive to benchmark at scale. The official model listing shows GPT-5.2 Pro priced at $21 per 1M input tokens and $168 per 1M output tokens, so benchmarks can become hilariously pricey, fast; a rough cost estimate also follows this list.<ref>{{cite web|title=OpenAI Platform|url=https://platform.openai.com/docs/models/gpt-5.2-pro|publisher=OpenAI Platform|access-date=2025-12-22}}</ref> Even if a site could run it, many choose not to continuously retest the most expensive tiers.
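To illustrate the integration point in item 3, a benchmark rig targeting the API model would go through the Responses API rather than Chat Completions. The sketch below is not any leaderboard’s actual harness: it assumes the official <code>openai</code> Python SDK, and only the model name <code>gpt-5.2-pro</code> comes from the platform listing cited above; the prompt handling is a placeholder.

<syntaxhighlight lang="python">
# Minimal sketch of running one benchmark item against the Responses API.
# Assumes the official `openai` Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()


def run_item(prompt: str) -> str:
    # GPT-5.2 Pro is documented as a Responses API model, so this uses
    # client.responses.create rather than the older chat.completions.create flow.
    response = client.responses.create(
        model="gpt-5.2-pro",  # model name from the platform listing cited above
        input=prompt,
    )
    # output_text is the SDK's convenience accessor for the concatenated text output.
    return response.output_text
</syntaxhighlight>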
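To make the cost point in item 4 concrete, here is a back-of-the-envelope estimate at the listed prices. The suite size and per-item token counts are illustrative assumptions, not figures from any real leaderboard run.

<syntaxhighlight lang="python">
# Rough cost of one full benchmark pass at the listed GPT-5.2 Pro prices:
# $21 per 1M input tokens, $168 per 1M output tokens.
INPUT_PRICE_PER_TOKEN = 21 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 168 / 1_000_000

n_items = 10_000                 # hypothetical benchmark size
input_tokens_per_item = 2_000    # hypothetical prompt length
output_tokens_per_item = 8_000   # hypothetical reasoning + answer length

cost = n_items * (input_tokens_per_item * INPUT_PRICE_PER_TOKEN
                  + output_tokens_per_item * OUTPUT_PRICE_PER_TOKEN)
print(f"~${cost:,.0f} per full run")  # ~$13,860 under these assumptions
</syntaxhighlight>

Rerunning something on that scale after every model or settings update is exactly why many sites choose not to continuously retest the most expensive tiers.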
So is it evaluated anywhere? Yep, just not everywhere.

Where GPT-5.2 Pro does show up (examples):
* SimpleBench: SimpleBench’s public leaderboard includes “GPT-5.2 Pro (xhigh)” (and also “GPT-5.2 (high)”).<ref>{{cite web|title=SimpleBench|url=https://simple-bench.com/|publisher=simple-bench.com|access-date=2025-12-22}}</ref>
* ARC Prize leaderboard (ARC-AGI): ARC Prize’s leaderboard includes GPT-5.2 Pro results (they track performance vs cost per task for ARC-AGI-style evaluations).<ref>{{cite web|title=ARC Prize|url=https://arcprize.org/leaderboard|publisher=ARC Prize|access-date=2025-12-22}}</ref> Note: their page is heavily visual/interactive, so some text scrapers show less detail, but the leaderboard is a real benchmarking venue.

Where you’ll often NOT see GPT-5.2 Pro:
* LMArena / Chatbot Arena (LMSYS): Arena currently lists GPT-5.2 variants like gpt-5.2, gpt-5.2-high, and gpt-5.2-search, but not a “Pro” entry.<ref>{{cite web|title=LMArena|url=https://lmarena.ai/leaderboard|publisher=lmarena.ai|access-date=2025-12-22}}</ref> That’s consistent with the cost + access + “which exact settings did you run?” problem.

Bonus: sites that track GPT-5.2 generally (and may not label “Pro” separately):
* Artificial Analysis has a GPT-5.2 model page and comparisons, but it commonly shows GPT-5.2 variants (like xhigh) rather than always breaking out “Pro” as its own named entry.<ref>{{cite web|title=Artificial Analysis|url=https://artificialanalysis.ai/models/gpt-5-2|publisher=Artificial Analysis|access-date=2025-12-22}}</ref>
* Epoch AI’s ECI aggregates benchmark scores and reports GPT-5.2 on its capability index (again, not always split into every tier the way OpenAI markets them).<ref>{{cite web|title=Substack|url=https://substack.com/%40epochai/note/c-187985664|publisher=substack.com|access-date=2025-12-22}}</ref>

If you want a clean mental model: benchmark sites are usually grading the underlying API model (gpt-5.2-pro) under very specific settings, not the full ChatGPT “Pro” experience you use in the app. That’s why the label you’re looking for often seems to “vanish” like a magician’s coin, even when the model itself is being tested. 🪄

== References ==
<references />

[[Category:ChatGPT Shares]]
[[Category:openai]]