{{DISPLAYTITLE:LLM Abstention Research}}
''This conversation was exported from ChatGPT''<ref>{{cite web|title=LLM Abstention Research|url=https://chatgpt.com/share/695ac389-b450-800e-a8b3-540cf70d0ecc|publisher=ChatGPT|access-date=2026-01-05}}</ref>

== gpt-5-2 ==

=== User: Someone potentially Anthropic recently published some innovation with something like a null response or somehow includi… ===
Someone (potentially Anthropic) recently published some innovation with something like a null response or somehow including the option of “not knowing” as a potential response, improving the reliability and responses of LLMs. Find where that approach is described.

=== Assistant: There isn’t a single Anthropic-published article that specifically introduces a “null response” or “not knowing” output… ===
There isn’t a single Anthropic-published article that specifically introduces a “null response” or “not knowing” output option as the main innovation in LLMs, but there ''are'' several research efforts in the broader AI/LLM community that formalize this idea of models knowing when not to answer (i.e., abstention or refusal). Particularly clear recent academic descriptions of the concept include:

===== 1. ''“Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models”'' =====
This very recent paper (21 November 2025) formalizes the idea that an LLM should abstain from answering when it lacks reliable knowledge, rather than hallucinating a response. It introduces Aspect-Based Causal Abstention (ABCA), a framework that assesses the internal diversity of the model’s knowledge and enables the LLM to output an abstention (such as “I don’t know”) when appropriate, improving reliability and interpretability compared to previous methods.<ref>{{cite web|title=arXiv|url=https://arxiv.org/abs/2511.17170|publisher=arxiv.org|access-date=2026-01-05}}</ref>

===== 2. ''“Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models”'' =====
This paper defines Abstention Ability (AA) for LLMs: the capability to withhold an answer when the model is uncertain or lacks enough information. It highlights that traditional metrics and models do not reliably measure or exhibit this behavior, and it evaluates models’ ability to use answers like “I don’t know” correctly across test cases.<ref>{{cite web|title=arXiv|url=https://arxiv.org/abs/2407.16221|publisher=arxiv.org|access-date=2026-01-05}}</ref>

===== 3. ''“R-Tuning: Instructing Large Language Models to Say ‘I Don’t Know’”'' =====
This earlier work (November 2023) presents a method to train LLMs so that they explicitly learn to refuse to answer questions that fall outside their knowledge, rather than generate a hallucinated answer. This is done through a “refusal-aware instruction tuning” process.<ref>{{cite web|title=arXiv|url=https://arxiv.org/abs/2311.09677|publisher=arxiv.org|access-date=2026-01-05}}</ref>
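The shared mechanism in all three papers is treating “I don’t know” as a first-class output rather than forcing an answer. As a rough illustration only (not the method of any paper above), the Python sketch below abstains when repeated samples from a model disagree too much; the <code>ask_model</code> callable, the sample count, and the agreement threshold are hypothetical placeholders.

<syntaxhighlight lang="python">
# Hypothetical sketch: abstain when repeated samples from a model disagree.
# `ask_model` is a placeholder for any text-generation call; it is NOT an API
# from the papers cited above.
from collections import Counter
from typing import Callable
import random

ABSTAIN = "I don't know"

def answer_or_abstain(
    question: str,
    ask_model: Callable[[str], str],
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> str:
    """Sample the model several times; abstain if no single answer dominates."""
    answers = [ask_model(question).strip().lower() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    # Low self-consistency is used here as a cheap proxy for uncertainty.
    if count / n_samples < min_agreement:
        return ABSTAIN
    return top_answer

if __name__ == "__main__":
    # Toy stand-in for a real model: confident on one fact, guessing on the other.
    def toy_model(q: str) -> str:
        if "capital of France" in q:
            return "Paris"
        return random.choice(["42", "17", "7", "a million", "zero", "many"])

    # The first call returns a consistent answer; the second is likely to abstain.
    print(answer_or_abstain("What is the capital of France?", toy_model))
    print(answer_or_abstain("What number am I thinking of?", toy_model))
</syntaxhighlight>

Methods such as ABCA and R-Tuning are considerably more sophisticated (causal analysis of internal knowledge, refusal-aware fine-tuning), but the abstention output itself plays the same role in each.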
As for Anthropic specifically:

* Anthropic’s public research on how language models process information and reason internally helps explain why LLMs hallucinate and what is needed to build more reliable systems, but it does not frame the “null/abstention” answer option as a standalone innovation in a published blog post or paper.
* There are Anthropic papers on interpretability and internal concept representation, but nothing currently in their official research archive focuses on the abstention mechanism specifically as “null responses.”<ref>{{cite web|title=Anthropic|url=https://www.anthropic.com/research/mapping-mind-language-model|publisher=anthropic.com|access-date=2026-01-05}}</ref>
* Anthropic’s research into model calibration and understanding when a model “knows what it knows” is related in spirit to this idea, but it is not always presented as a concrete abstention protocol (see the calibration sketch below).<ref>{{cite web|title=Anthropic|url=https://www.anthropic.com/research/language-models-mostly-know-what-they-know|publisher=anthropic.com|access-date=2026-01-05}}</ref>

The core idea behind these research directions is that allowing an LLM to decline to answer, or to output “I don’t know” / “None of the above” rather than inventing a response, can improve model reliability and user trust. This is especially valuable in high-stakes or factual tasks where incorrect answers are harmful. The academic work above is where this approach is being formalized, evaluated, and benchmarked.<ref>{{cite web|title=arXiv|url=https://arxiv.org/abs/2511.17170|publisher=arxiv.org|access-date=2026-01-05}}</ref>

If you want links to the actual papers or summaries of their methods (e.g., how ABCA works internally), let me know!
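On the calibration point mentioned above: calibration can be checked empirically by comparing a model’s stated confidence with how often it is actually correct. The sketch below computes a simple expected calibration error over hypothetical (confidence, correct) pairs; the ten-bin scheme and the example data are illustrative assumptions, not Anthropic’s evaluation setup.

<syntaxhighlight lang="python">
# Hypothetical sketch: expected calibration error (ECE) over a set of
# (stated confidence, was the answer correct) pairs. The data and the
# 10-bin scheme are illustrative, not drawn from any cited paper.
from typing import List, Tuple

def expected_calibration_error(
    results: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    """Average |accuracy - mean confidence| per bin, weighted by bin size."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in results:
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))

    total = len(results)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        accuracy = sum(c for _, c in bucket) / len(bucket)
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece

if __name__ == "__main__":
    # A roughly well-calibrated model: 90%-confidence answers are right ~90% of the time.
    sample = [(0.9, True)] * 9 + [(0.9, False)] + [(0.3, True)] * 3 + [(0.3, False)] * 7
    print(f"ECE = {expected_calibration_error(sample):.3f}")
</syntaxhighlight>

A well-calibrated model (low ECE) is exactly the kind of model that can be trusted to say “I don’t know” at the right moments.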
=== User: There was something like an attention sink, I recall ===
There was something like an attention sink, I recall

=== Assistant: Yes — there is a recent research direction tied to Anthropic’s work that mechanistically explains how LLMs decide wheth… ===
Yes, there ''is'' a recent research direction tied to Anthropic’s work that mechanistically explains how LLMs decide whether to answer at all, essentially modeling a sort of “don’t know / abstain” behavior. The key idea is that internal circuits influence whether a model answers, refuses, or hallucinates, and this has been described in their interpretability research on Claude.<ref>{{cite web|title=Anthropic|url=https://www.anthropic.com/research/introspection|publisher=anthropic.com|access-date=2026-01-05}}</ref>

===== 📌 Where This Approach Is Described =====

====== Anthropic: “Signs of introspection in large language models” ======
Anthropic published a research article (October 29, 2025) showing that models like Claude have internal mechanisms related to introspection and self-monitoring. It suggests that parts of Claude’s internal activations can reflect whether the model really knows something versus is guessing, which relates to reliability and to being able to output null/abstain responses.<ref>{{cite web|title=Anthropic|url=https://www.anthropic.com/research/introspection|publisher=anthropic.com|access-date=2026-01-05}}</ref>

* They are investigating how models might report on their own internal states, which includes recognizing when they lack sufficient information to answer reliably, a step toward systems that can say “I don’t know” rather than hallucinate.<ref>{{cite web|title=Anthropic|url=https://www.anthropic.com/research/introspection|publisher=anthropic.com|access-date=2026-01-05}}</ref>

====== Anthropic interpretability: “Tracing the thoughts of a large language model” ======
In this related research, Anthropic describes deep analysis of Claude’s internal computations, showing how internal pathways and activations correspond to reasoning steps. Importantly for your question, this research also touches on why models may refuse to answer if they lack needed information rather than always producing a response.<ref>{{cite web|title=Anthropic|url=https://www.anthropic.com/research/tracing-thoughts-language-model|publisher=anthropic.com|access-date=2026-01-05}}</ref>

====== Anthropic alignment science: “Reasoning models don’t always say what they think” ======
This paper from Anthropic’s Alignment Science team discusses how reasoning output (like chain-of-thought) may not reflect internal processes faithfully and how models may deliberately omit certain internal signals (e.g., refraining from answering when uncertain), which connects to the idea of modeling abstention and null responses.<ref>{{cite web|title=Anthropic|url=https://www.anthropic.com/research/reasoning-models-dont-say-think|publisher=anthropic.com|access-date=2026-01-05}}</ref>

===== 🔬 Why This Matters =====

* '''Hallucination mitigation:''' By understanding internal circuits that signal uncertainty or lack of knowledge, researchers can begin to build LLMs that withhold answers rather than provide confident misinformation. This effectively gives the model an option to produce a null/abstain response when unsure.<ref>{{cite web|title=Anthropic|url=https://www.anthropic.com/research/introspection|publisher=anthropic.com|access-date=2026-01-05}}</ref>
* '''Mechanistic interpretability:''' Anthropic’s focus on identifying circuit-level behavior (e.g., introspection, refusal circuits) fits into the broader mechanistic interpretability research area, where you don’t just see the answer; you understand why and when a model decides to answer.<ref>{{cite web|title=Wikipedia|url=https://en.wikipedia.org/wiki/Mechanistic_interpretability|publisher=en.wikipedia.org|access-date=2026-01-05}}</ref>

===== 🧠 Connection to “Attention Sink” =====
The attention sink phenomenon you recalled (from other papers) is a separate technical effect in transformers where attention tends to focus on certain tokens (e.g., early tokens in a sequence). This is not directly about abstention or null responses, but it is part of the set of insights researchers are using to understand and improve LLM reliability and behaviour (see the measurement sketch at the end of this reply).<ref>{{cite web|title=arXiv|url=https://arxiv.org/abs/2410.10781|publisher=arxiv.org|access-date=2026-01-05}}</ref>

===== 📚 Other Relevant Work (Non-Anthropic) =====
If you want a formal method for enforcing abstention/“don’t know” behavior in LLMs, there is also academic work like R-Tuning, which explicitly trains models to refuse to answer questions outside their knowledge rather than hallucinate, a clear example from the literature.<ref>{{cite web|title=arXiv|url=https://arxiv.org/abs/2311.09677|publisher=arxiv.org|access-date=2026-01-05}}</ref>

If you’d like direct links to the specific papers or summaries of how the abstention/uncertainty circuits work internally, I can provide those too.
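To make the attention-sink effect described above concrete, here is a small measurement sketch using the Hugging Face <code>transformers</code> library: it reports how much attention mass each layer places on the very first token of a prompt. The choice of <code>gpt2</code>, the prompt, and the averaging scheme are arbitrary illustrations, not taken from the cited paper.

<syntaxhighlight lang="python">
# Hypothetical sketch: measure the "attention sink" effect by checking how much
# attention mass lands on the first token of a prompt. gpt2 is an arbitrary
# small model chosen only for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any small causal LM that returns attentions works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Attention sinks concentrate attention mass on early tokens."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer with shape (batch, heads, query, key).
for layer_idx, attn in enumerate(outputs.attentions):
    # Average attention paid to the first token, over heads and query positions
    # (query position 0 is skipped because it can only attend to itself).
    sink_mass = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer_idx:2d}: mean attention on token 0 = {sink_mass:.3f}")
</syntaxhighlight>

In many decoder-only models a large share of attention ends up on such early tokens, which is the effect the attention-sink literature studies.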
=== User: Summarize and translate to Spanish the information you provided in this conversation. ===
Summarize and translate to Spanish the information you provided in this conversation.

=== Assistant: Aquí tienes un resumen en español de la información clave que apareció a lo largo de esta conversación, integrando y ac… ===
Here is a summary of the key information that came up over the course of this conversation, integrating and clarifying the important points:

==== Summary ====
In recent years, an important line of research on language models (LLMs) has emerged that focuses on improving their reliability by allowing them to “not answer” when they do not know something, instead of inventing answers (hallucinations). This idea is formalized in several ways:

===== 1. Abstention / “I don’t know” as a valid response =====
Several academic works propose that LLMs explicitly have the option to abstain:

* Papers such as “Do LLMs Know When to NOT Answer?” and “R-Tuning: Instructing LLMs to Say ‘I Don’t Know’” study and train models to recognize their uncertainty and answer “I don’t know” when appropriate.
* Other, more recent work proposes causal or internal methods (e.g., Aspect-Based Causal Abstention) to detect when the model’s knowledge is inconsistent and it should therefore abstain.

The goal is to reduce hallucinations and increase user trust, especially in high-stakes tasks.

===== 2. Related Anthropic (Claude) research =====
Although Anthropic has not published a paper whose title is explicitly “null response,” it has investigated the internal mechanisms that let a model know when it knows something and when it does not:

* In work on introspection and on tracing reasoning, Anthropic shows that models contain internal signals related to certainty, doubt, and the decision to answer or to refuse a question.
* This line of research is part of mechanistic interpretability, which seeks to understand the internal circuits of the model that control behaviors such as refusing to answer or exercising caution.

This connects conceptually with the idea of allowing “I don’t know” responses in a more reliable way, even if it is not always presented as an explicit null-output option.

===== 3. “Attention sink” =====
What you recall as an “attention sink” is a real concept, but a different one:

* It refers to a phenomenon in transformers where certain tokens (often initial or special tokens) attract a disproportionate amount of attention.
* It is relevant for understanding the internal behavior and stability of models, but it is not in itself a “do not answer” mechanism.
* Even so, it is part of the set of technical findings that help improve the reliability and reasoning control of LLMs.

===== 4. Common central idea =====
All of these lines converge on the same principle:

: A more reliable model is not the one that always answers, but the one that knows when it should not.

Explicitly allowing abstention, supported by the model’s internal signals (uncertainty, internal conflict, lack of information), is one of the most promising strategies for reducing errors and hallucinations in LLMs.

If you want, as a next step I can:

* Point out the paper that most likely matches your memory of the “attention sink” applied to reliability, or
* Build a comparison table covering explicit abstention, Anthropic’s introspection work, and attention sinks.

== References ==
<references />

[[Category:ChatGPT Shares]]
[[Category:openai]]