=== Assistant: Short answer: “training δ” Delta Activations instead of generating a LoRA is not exactly worse, it’s mostly a different… ===
Short answer: “training δ” (Delta Activations) instead of generating a LoRA is not exactly worse; it is mostly a different kind of object, and by itself it can’t replace LoRA generation. To actually adapt the base model, you still need either a mapping from δ back to weights or a retrieval mechanism over existing adapters.

Let me unpack that a bit.

==== Delta Activations define, for a finetuned model f, ====

<math>\Delta_f(x) = h_f(x) - h_{\text{base}}(x)</math>
where <math>h_f(x)</math> is the last-layer hidden state of the final token for a probe prompt <math>x</math>. Then they average over a small probe set <math>D_{\text{probe}}</math> (like 5 generic Alpaca-style prompts) to get a single embedding vector <math>v_f \in \mathbb{R}^{4096}</math> per finetuned model (2509.04442v1).

So:
* LoRA: millions of parameters that actually change the model.
* Delta Activations: a 4K-dimensional descriptor of how the model changed, useful for clustering, selection, etc., but not itself an adapter.

That’s why they use it for:
* Clustering models by domain,
* Additive composition: the embedding of a Math+Code model ≈ the Math embedding + the Code embedding,
* Model selection / merging in hubs like LoraHub (2509.04442v1).

They never actually run the base model with δ; δ is only a feature.
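The definition above can be sketched in a few lines. This is only a toy illustration: `toy_base` and `toy_finetuned` are made-up stand-ins for "run the model and take the last-layer hidden state of the final token", and the 3-dimensional vectors replace the 4096-dimensional ones.

```python
from typing import Callable, List, Sequence


def delta_activation(
    h_finetuned: Callable[[str], Sequence[float]],
    h_base: Callable[[str], Sequence[float]],
    probe_prompts: Sequence[str],
) -> List[float]:
    """Average h_f(x) - h_base(x) over the probe prompts."""
    if not probe_prompts:
        raise ValueError("need at least one probe prompt")
    dim = len(h_base(probe_prompts[0]))
    total = [0.0] * dim
    for prompt in probe_prompts:
        hf = h_finetuned(prompt)
        hb = h_base(prompt)
        for i in range(dim):
            total[i] += hf[i] - hb[i]
    return [t / len(probe_prompts) for t in total]


# Hypothetical stand-ins for real hidden-state extraction.
def toy_base(prompt: str) -> List[float]:
    return [float(len(prompt)), 1.0, 0.0]


def toy_finetuned(prompt: str) -> List[float]:
    base = toy_base(prompt)
    # Pretend finetuning shifted the hidden state by a fixed offset.
    return [base[0] + 0.5, base[1], base[2] - 2.0]


v_f = delta_activation(toy_finetuned, toy_base, ["Explain X.", "Summarize Y."])
print(v_f)  # [0.5, 0.0, -2.0]
```

Note that the result is just a vector describing the shift; nothing in this computation gives you a way to apply it back to the base model.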

==== There are at least two possible interpretations of your idea: ====

===== Workflow: =====
# Precompute <math>v_f</math> (Δ) for a library of finetuned LoRAs.
# For a new task:
#* Do some training or a few-shot finetune to get a temporary model,
#* Compute its Δ embedding <math>v_{\text{task}}</math>,
#* Use nearest neighbors in Δ-space to select existing LoRAs, maybe merge them.

This is actually very close to what the Delta paper already explores (few-shot task embeddings + model selection / merging) (2509.04442v1).
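The retrieval step of that workflow could look roughly like this. It is a minimal sketch: cosine similarity is my assumption for the distance in Δ-space, and the library names and 3-dimensional embeddings are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_loras(
    v_task: Sequence[float],
    library: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[str]:
    """Return the names of the k library LoRAs closest to v_task."""
    ranked = sorted(
        library.items(),
        key=lambda item: cosine(v_task, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical precomputed Delta embeddings for a LoRA library.
library = {
    "math": [1.0, 0.0, 0.0],
    "code": [0.0, 1.0, 0.0],
    "legal": [0.0, 0.0, 1.0],
}
v_task = [0.8, 0.6, 0.0]  # embedding of the temporary few-shot model
selected = nearest_loras(v_task, library, k=2)
print(selected)  # ['math', 'code']
```

The function can only ever return names that are already in `library`, which is exactly the limitation discussed next.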

Pros:
* You only ever train small finetunes (few-shot) to get a task embedding.
* The heavy adaptation is done by reusing LoRAs you already have.

Cons:
* You still need real LoRAs in your library – δ alone does not adapt the base model.
* If your library doesn’t already contain something close to the new task, retrieval can’t magically create it.

Compared to DnD (“generate a fresh LoRA for any new dataset”), this is strictly weaker in expressiveness: it can’t cover capabilities that aren’t already somewhere in the library.

===== More radical version: =====
# Collect many tasks, each with:
#* LoRA weights <math>\Delta W</math>,
#* A Δ-embedding <math>v_f</math> of that LoRA.
# Learn a generator <math>G</math> such that <math>G(v_f) \approx \Delta W</math>.
# For a new task, only train a low-dimensional code <math>v</math> via gradients (hundreds or thousands of params), and get its LoRA as <math>G(v)</math>.

This is basically:
* Factorizing LoRA space into:
** a big shared generator <math>G</math> (frozen at test time),
** a small task code <math>v</math> that you optimize.
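Here is a toy sketch of that factorization, with heavy simplifications. The generator is a made-up fixed linear map (in the real scheme it would be learned from many tasks), the "LoRA" is a 3-number vector, and a squared error against a known target stands in for the real task loss, which would be computed by running the adapted model.

```python
from typing import List, Sequence

# Hypothetical frozen generator G: a fixed linear map from a 2-dim code
# to a 3-dim flattened LoRA update. The weights are invented.
G_WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def generate_lora(v: Sequence[float]) -> List[float]:
    """G(v): map a small task code to a (flattened) LoRA update."""
    return [sum(w * x for w, x in zip(row, v)) for row in G_WEIGHTS]


def fit_code(target_lora: Sequence[float], steps: int = 200, lr: float = 0.1) -> List[float]:
    """Gradient descent on the code v only; G stays frozen.

    Loss is 0.5 * ||G(v) - target||^2, a stand-in for a real task loss.
    """
    v = [0.0] * len(G_WEIGHTS[0])
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate_lora(v), target_lora)]
        # Gradient of the loss w.r.t. v is G^T @ residual.
        grad = [
            sum(G_WEIGHTS[i][j] * residual[i] for i in range(len(residual)))
            for j in range(len(v))
        ]
        v = [x - lr * g for x, g in zip(v, grad)]
    return v


target = generate_lora([2.0, -1.0])  # a LoRA the generator can express
v_task = fit_code(target)
print([round(x, 4) for x in v_task])  # [2.0, -1.0]
```

The example recovers the code only because the target lies in the range of the generator; a target outside that range could at best be approximated, which is the toy version of the expressiveness concern.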

Compared to DnD:
* DnD: learns a direct mapping from prompts → LoRA, then no training at adaptation.
* “Train δ instead”: learns mapping code → LoRA, then you still train δ per task.

So yes, if your goal is “no training for new tasks”, this is worse than generating LoRAs à la DnD: you added optimization back into the loop.

But as a meta-learning scheme it’s actually interesting:
* You only optimize ~4k parameters instead of millions (LoRA).
* The generator encodes a strong prior over “reasonable” LoRAs (learned from many tasks).
* It could be more sample-efficient than training a fresh LoRA from scratch for each new task.
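To make the "~4k vs millions" comparison concrete, here is a back-of-the-envelope count. The LoRA configuration (rank 8, two adapted 4096×4096 projections per layer, 32 layers) is an assumed example, not something taken from either paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int, num_matrices: int) -> int:
    """Each adapted matrix gets A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out) * num_matrices


hidden = 4096
# Assumed setup: 2 adapted projections per layer, 32 layers, rank 8.
lora_params = lora_param_count(hidden, hidden, rank=8, num_matrices=2 * 32)
code_params = hidden  # one Delta-sized code vector

print(lora_params)                 # 4194304
print(code_params)                 # 4096
print(lora_params // code_params)  # 1024
```

Under these assumptions the per-task code is about three orders of magnitude smaller than the LoRA it would replace.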

The weak points:
* The Δ embedding is very lossy by design; many different LoRAs map to similar Δ. The inverse map <math>G</math> is underdetermined and might collapse diversity.
* Delta Activations were engineered to cluster domain-level behaviors, not to preserve all fine-grained information needed for optimal performance. Using them as the only latent for reconstruction is risky.
* You’d have to show empirically that optimizing δ + using <math>G</math> can reach the same task performance as training the LoRA directly. My guess: you’d often lose some peak performance, especially for hard, narrow tasks.

==== So, is training δ “worse” than generating LoRA? ====

If you mean strictly in the DnD sense (“I want zero-shot prompt-to-weights”):
* Yes, training δ is worse on the axis they care about (no optimization at adaptation time).
* δ is a summary of behavior, not an adapter. You still need a path from δ → weights or δ → retrieval.

If you mean as a general design choice for adaptation/meta-learning:
* It’s not obviously worse; it’s just another trade-off:
** Pros: lower-dimensional, better for clustering/retrieval, maybe more stable and regularized.
** Cons: doesn’t directly control the model, lossy, needs extra machinery (retrieval or generator) to actually adapt the LLM.

A realistic “best of both worlds” design might be:
# Use Delta Activations to build a model/task embedding space and organize your LoRA zoo (2509.04442v1).
# On top of that, learn a small hypernetwork <math>H</math> that maps from Δ (plus maybe statistics of the new dataset’s prompts) to a correction LoRA (or to mixing coefficients over existing LoRAs).
# Optionally allow a low-dim δ code to be optimized per task, but keep most parameters in <math>H</math> shared.

That would sit in between:
* pure retrieval in Δ-space, and
* full DnD-style LoRA generation.
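The "mixing coefficients" variant of the hypernetwork step could be sketched like this. Everything here is hypothetical: the hypernetwork is reduced to a single linear layer followed by a softmax, its weights are invented rather than trained, and each library LoRA is a 2-number vector.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_loras(
    delta: Sequence[float],
    h_weights: Sequence[Sequence[float]],
    lora_library: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Map a Delta embedding to mixing coefficients, then merge the LoRAs.

    h_weights has one row per library LoRA (a linear stand-in for H).
    """
    logits = [sum(w * d for w, d in zip(row, delta)) for row in h_weights]
    coeffs = softmax(logits)
    dim = len(lora_library[0])
    merged = [
        sum(c * lora[i] for c, lora in zip(coeffs, lora_library))
        for i in range(dim)
    ]
    return coeffs, merged


lora_library = [[2.0, 0.0], [0.0, 4.0]]  # two flattened toy LoRAs
h_weights = [[2.0, 0.0], [0.0, 2.0]]     # invented, untrained weights
coeffs, merged = mix_loras([1.0, 0.0], h_weights, lora_library)
print([round(c, 3) for c in coeffs])  # [0.881, 0.119]
```

Because the output is a convex combination of library LoRAs, this variant stays close to retrieval; predicting a correction LoRA instead would move it toward the DnD end of the spectrum.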

If you want, we can sketch a concrete toy experiment design to test “train δ instead of LoRA” and see where it breaks (and where it might actually help).