
Openai/694896a0-6814-800f-b484-4ed0ec0b6e9c

From freem


This conversation was exported from ChatGPT[1]

gpt-5-2-thinking

User: Why is ChatGPT 5.2 Pro not evaluated on benchmark sites?

Why is ChatGPT 5.2 Pro not evaluated on benchmark sites? If it is, which sites?

Assistant: “ChatGPT 5.2 Pro” is a bit of a shape-shifter as far as benchmarking goes.

“ChatGPT 5.2 Pro” is a bit of a shape-shifter as far as benchmarking goes.

Most benchmark/leaderboard sites don’t (and often can’t) evaluate the ChatGPT product experience directly. They usually evaluate an API model with a stable, scriptable interface, because benchmarks require thousands of automated runs with controlled settings.

Why you often don’t see “ChatGPT 5.2 Pro” on the big public leaderboards

  1. ChatGPT (the app) isn’t a clean benchmark target. The ChatGPT experience includes extra layers that benchmarks try to avoid: system instructions, safety layers, tool routing, UI behaviors, feature toggles, and sometimes “smart defaults” that aren’t exposed as simple knobs. That makes results harder to reproduce and compare apples-to-apples across vendors.
  2. Automated evaluation via ChatGPT is generally a no-go. OpenAI’s Terms of Use forbid automatically or programmatically extracting data or output from the services,[2] which blocks the typical “benchmark harness logs into the website and blasts 50k prompts” approach. So serious evaluators generally stick to the API.
  3. GPT-5.2 Pro is API-accessible, but it’s a specialty endpoint. OpenAI’s model page states GPT-5.2 Pro is available via the Responses API (not the older Chat Completions flow),[3] which can be an extra integration hurdle for some benchmark rigs.
  4. It’s expensive to benchmark at scale. The official model listing shows GPT-5.2 Pro priced at $21 per 1M input tokens and $168 per 1M output tokens.[4] Benchmarks can become hilariously pricey, fast. Even if a site could run it, many choose not to continuously retest the most expensive tiers.
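
To make point 4 concrete, here is a back-of-envelope cost estimate at those listed rates. The prompt count and per-item token counts below are illustrative assumptions, not figures from any real benchmark suite:

```python
# Rough benchmark cost at GPT-5.2 Pro's listed rates:
# $21 per 1M input tokens, $168 per 1M output tokens.

INPUT_PRICE_PER_M = 21.0    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 168.0  # USD per 1M output tokens

def benchmark_cost(prompts, in_tokens_each, out_tokens_each):
    """Estimated USD cost of one full benchmark run."""
    input_cost = prompts * in_tokens_each / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = prompts * out_tokens_each / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost

# Hypothetical run: 50,000 prompts, ~1,000 input and ~2,000 output
# tokens each (reasoning models emit many output tokens).
print(f"${benchmark_cost(50_000, 1_000, 2_000):,.2f}")  # → $17,850.00
```

And that is a single pass at one settings configuration; leaderboards that retest continuously, or sweep multiple reasoning-effort levels, multiply that figure accordingly.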

So is it evaluated anywhere? Yep, just not everywhere.

Where GPT-5.2 Pro does show up (examples)

SimpleBench: SimpleBench’s public leaderboard includes “GPT-5.2 Pro (xhigh)” (and also “GPT-5.2 (high)”).[5]

ARC Prize leaderboard (ARC-AGI): ARC Prize’s leaderboard includes GPT-5.2 Pro results (they track performance vs. cost per task for ARC-AGI-style evaluations).[6] Note: their page is heavily visual/interactive, so some text scrapers show less detail, but the leaderboard is a real benchmarking venue.

Where you’ll often NOT see GPT-5.2 Pro

LMArena / Chatbot Arena (LMSYS): Arena currently lists GPT-5.2 variants like gpt-5.2, gpt-5.2-high, and gpt-5.2-search, but not a “Pro” entry.[7] That’s consistent with the cost + access + “which exact settings did you run?” problem.

Bonus: sites that track GPT-5.2 generally (may not label “Pro” separately)

Artificial Analysis has a GPT-5.2 model page and comparisons, but it commonly shows GPT-5.2 variants (like xhigh) rather than always breaking out “Pro” as its own named entry.[8] Epoch AI’s ECI aggregates benchmark scores and reports GPT-5.2 on its capability index (again, not always split into every tier the way OpenAI markets them).[9]

If you want a clean mental model: benchmark sites are usually grading the underlying API model (gpt-5.2-pro) under very specific settings, not the full ChatGPT “Pro” experience you use in the app. That’s why the label you’re looking for often seems to “vanish” like a magician’s coin, even when the model itself is being tested. 🪄
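
That “very specific settings” point is why serious harnesses pin and log every request parameter. A minimal sketch of the kind of request body such a harness might build for the Responses API — where the model name, the reasoning-effort value, and the token cap are illustrative assumptions, not confirmed parameters from OpenAI’s docs:

```python
import json

# Sketch of a pinned, reproducible benchmark request.
# "gpt-5.2-pro", the effort level, and max_output_tokens are
# illustrative assumptions; a real harness would copy the exact
# field names from OpenAI's current API reference and log them
# alongside every run.
request_body = {
    "model": "gpt-5.2-pro",
    "input": "One benchmark prompt goes here.",
    "reasoning": {"effort": "high"},  # the "which settings?" knob
    "max_output_tokens": 2048,
}

# Serializing the payload deterministically makes runs auditable
# and comparable across re-tests.
print(json.dumps(request_body, sort_keys=True))
```

The leaderboard label then describes this exact configuration (e.g. “GPT-5.2 Pro (xhigh)”), not the ChatGPT app experience.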

References

  1. "ChatGPT 5.2 Pro benchmarks". ChatGPT. Retrieved 2025-12-22.
  2. "OpenAI". openai.com. Retrieved 2025-12-22.
  3. "OpenAI Platform". OpenAI Platform. Retrieved 2025-12-22.
  4. "OpenAI Platform". OpenAI Platform. Retrieved 2025-12-22.
  5. "SimpleBench". simple-bench.com. Retrieved 2025-12-22.
  6. "ARC Prize". ARC Prize. Retrieved 2025-12-22.
  7. "LMArena". lmarena.ai. Retrieved 2025-12-22.
  8. "Artificial Analysis". Artificial Analysis. Retrieved 2025-12-22.
  9. "Substack". substack.com. Retrieved 2025-12-22.