Openai/693c0f4f-255c-8008-92e9-0cd44c6d6226
This conversation was exported from ChatGPT[1]
gpt-5-1-thinking
User: This paper is quite complicated.
This paper is quite complicated. Would you explain it in detail? It seems they assume a trained generator and encoder here, but I cannot understand Section 4.
Assistant: Short version:
Short version: Yes, in Sec. 4 they assume you already have a trained decoder $\hat f$ (the generator) and usually also an encoder $\hat g$ trained on in-domain (ID) data. Section 4 is about how to actually invert $\hat f$ efficiently, especially out-of-domain (OOD), using:
- Gradient-based search (Sec. 4.1) – slow but flexible, done per image.
- Generative replay (Sec. 4.2) – offline training on synthetic OOD images, to get a faster encoder.
Let me walk through the logic step by step and connect it to the rest of the paper.
From Sec. 2:
- There is a ground-truth generator $f: Z \to X$, $x = f(z)$, where:
  - $Z = \mathbb{R}^{d_z}$ is the latent space, decomposed into slots $z = (z_1, \dots, z_K)$, $z_k \in \mathbb{R}^m$, e.g. "animal 1 slot", "animal 2 slot", "background slot".
  - $X \subset \mathbb{R}^{d_x}$ is the image space (the data manifold).
- A representation $\hat z = \varphi(x)$ is "good perception" if it inverts $f$ up to slot-wise reparametrization + permutation: $\forall z \in Z_S: \varphi(f(z)) = h_\pi(z)$ (Eq. 2.1). $h_\pi$ just re-labels and reparametrizes slots.
They define two regimes:
- Generative approach:
  - Learn a decoder $\hat f : Z \to \mathbb{R}^{d_x}$.
  - Use its inverse to represent images: $\varphi(x) = \hat f^{-1}(x)$.
  - For this to match the truth, $\hat f$ must identify $f$: $\hat f(h_\pi(z)) = f(z)$ (Eq. 2.2).
- Non-generative approach:
  - Learn an encoder directly: $\varphi(x) = \hat g(x)$.
  - For this to match the truth, $\hat g$ must approximate the inverse: $\hat g(x) = h_\pi(g(x))$, $g := f^{-1}$ (Eq. 2.3).
The difference is not “does the architecture have an encoder or decoder”, but:
- Are you constraining a decoder $\hat f$ and then inverting it (generative)?
- Or are you only constraining an encoder $\hat g$ (non-generative)?
- ID region: $X_{\text{ID}} = f(Z_{\text{ID}})$ from some subset $Z_{\text{ID}} \subset Z$ of concept combinations.
- OOD region: $Z_{\text{OOD}}$ is all other combinations of slot values; $X_{\text{OOD}} = f(Z_{\text{OOD}})$.
The goal: if Eq. (2.1) holds on $Z_{\text{ID}}$, does it generalize to all combinations in $Z_{\text{OOD}}$? That's compositional generalization.
- They define a generator class $F_{\text{int}}$ where concepts interact in a structured way (polynomial interactions across slots).
- The inverse class $G_{\text{int}} = \{ f^{-1} \mid f \in F_{\text{int}} \}$ has strong constraints on its Jacobian and Hessian (Eq. 3.3).
Key asymmetry:
- It's feasible to constrain a decoder to lie in $F_{\text{int}}$ (by architecture/regularization).
- It's basically infeasible to constrain an encoder to lie in $G_{\text{int}}$ when images live on a low-dimensional manifold in a high-dimensional ambient space ($d_x \gg d_z$). The Hessian/Jacobian structure "disappears"; almost any encoder is compatible with some $f \in F_{\text{int}}$.
So: you can enforce the right inductive bias on $\hat f$, but not on $\hat g$. This is why they say: to guarantee compositional generalization, you need a generative approach.
Given Sec. 3's conclusion, Sec. 4 now assumes:
- You have already trained a decoder $\hat f$ with the right inductive bias (so $\hat f \approx f$, $\hat f \in F_{\text{int}}$).
- Typically you also have an encoder $\hat g$ trained on ID data by simple autoencoding.
And now you want to actually use $\hat f$ generatively: given a (possibly OOD) image $x$, find latents $z$ such that $x \approx \hat f(z)$ (Eq. 4.1).
That's the inference problem for a generative model.
They say: if the decoder had a nice analytic inverse $\hat f^{-1}$, this would be trivial. But:
- For images, data live on a manifold $X \subset \mathbb{R}^{d_x}$, not all of $\mathbb{R}^{d_x}$.
- Making $\hat f$ a global diffeomorphism from $Z$ onto $\mathbb{R}^{d_x}$ is unnatural and hard; it effectively forces $X = \mathbb{R}^{d_x}$, which is unrealistic.
So in practice, we have a differentiable network $\hat f$ whose inverse we can only get via optimization, not in closed form.
For ID images $x \in X_{\text{ID}}$, you do have training data. So you can just train an encoder $\hat g$ to invert $\hat f$ on those:
$$\min_{\hat g} \; \mathbb{E}_{x \sim X_{\text{ID}}} \big\|x - \hat f(\hat g(x))\big\|^2 \tag{4.2}$$
This is a standard autoencoder objective:
- $\hat g$ encodes images to latents.
- $\hat f$ decodes them back.
On ID data, $\hat g(x)$ approximates $z^*$ s.t. $x \approx \hat f(z^*)$.
So: after training with (4.2), on ID you can just set $z^*(x) \approx \hat g(x)$.
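A minimal sketch of that Eq. (4.2) training loop in PyTorch, assuming `decoder` is the frozen, already-trained $\hat f$ and `encoder` is the trainable $\hat g$ (module names, shapes, and hyperparameters are illustrative, not the paper's code):

```python
import torch

def train_id_encoder(encoder, decoder, id_loader, epochs=10, lr=1e-4):
    """Eq. (4.2): train g-hat so that f-hat(g-hat(x)) reconstructs ID images."""
    decoder.requires_grad_(False)              # f-hat stays fixed
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for x in id_loader:                    # batches of ID images only
            x_rec = decoder(encoder(x))        # f-hat(g-hat(x))
            loss = ((x - x_rec) ** 2).mean()   # reconstruction MSE
            opt.zero_grad(); loss.backward(); opt.step()
    return encoder
```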
This is what you were noticing: Sec. 4 indeed assumes a trained $\hat f$ and a trained $\hat g$ on ID.
But the whole point is: this is not enough for OOD, because $X_{\text{OOD}}$ was never seen during this training.
For OOD images $x \in X_{\text{OOD}}$:
- During training with (4.2), you never saw these images, so:
  - You cannot minimize (4.2) on $X_{\text{OOD}}$ directly.
  - Nothing in the encoder-only training forces $\hat g$ to generalize correctly to unseen slot combinations (this is exactly the point of Sec. 3).
So the question of Sec. 4 is: how do you actually invert $\hat f$, especially on OOD images?
They explore two strategies to solve Eq. (4.1) more generally:
- Gradient-based search (Sec. 4.1): optimize $z$ per OOD image.
- Generative replay (Sec. 4.2): synthesize fake OOD images and train a new encoder that works OOD.
They rewrite Eq. (4.1) as an optimization problem:
$$z^* = \arg\min_z \; \big\|x - \hat f(z)\big\|^2 \tag{4.3}$$
So, for a given (possibly OOD) image $x$:
- Define the reconstruction loss $L(z) = \big\|x - \hat f(z)\big\|^2$.
- Start from some initial guess $z^{(0)}$.
- Do gradient descent: $z^{(t+1)} = z^{(t)} - \eta \, \nabla_z L(z^{(t)})$.
Because $\hat f$ is differentiable, you can compute $\nabla_z L$ via backprop.
The efficiency depends heavily on initialization:
- If you start from a random $z^{(0)}$, you may need many steps or get stuck.
- Their trick: use the encoder as a "System 1" guess: $z^{(0)} = \hat g(x)$.
So the pipeline for Search is:
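Roughly, in PyTorch-style pseudocode (a hedged sketch, not the paper's implementation; `encoder` and `decoder` stand in for $\hat g$ and $\hat f$):

```python
import torch

def invert_by_search(x, encoder, decoder, steps=300, lr=0.05):
    """Eq. (4.3): start from the encoder's guess, refine z by gradient descent."""
    with torch.no_grad():
        z = encoder(x)                          # "System 1" initialization g-hat(x)
    z = z.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)          # optimize the latent, not the networks
    for _ in range(steps):
        loss = ((x - decoder(z)) ** 2).mean()   # ||x - f-hat(z)||^2
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()                           # refined slot representation z*
```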
Intuition:
- $\hat g$ is trained only on $X_{\text{ID}}$, so on OOD it might be biased or slightly wrong.
- But it's often close to a good solution.
- Gradient-based search corrects $\hat g$'s mistakes by explicitly enforcing that the decoded image matches $x$ under $\hat f$.
This is what they show in Fig. 4 (left): the encoder gives an initial slot decomposition; search refines it to better match $x$.
Pros:
- In principle, if $\hat f$ truly identifies $f$, then solving (4.3) correctly recovers the "right" slots for any $x \in X$, including OOD.
- You don’t need extra training data.
Cons:
- It’s online and per-image expensive (hundreds of gradient steps in experiments).
- For large models / many images, this is slow.
The second strategy is to fix the runtime cost by doing the hard work offline.
Idea:
- Your decoder $\hat f$ identifies the ground-truth generator $f$ up to slot-wise transforms.
- That means OOD images are generated by re-combining slots from ID in new ways.
So they say: "If I can sample slot combinations $\hat z$ from a distribution $p_{\hat z}$ whose marginals are independent per slot, and decode them via $\hat f$, I can generate synthetic OOD images $\hat f(\hat z)$."
Concretely:
- Choose $p_{\hat z}$ such that $p_{\hat z}(z) = \prod_{k=1}^K p_k(z_k)$, where each $p_k$ is estimated from the ID slot distributions, but the slots are independent of each other.
- Sample $\hat z \sim p_{\hat z}$ and decode: $\hat x = \hat f(\hat z)$.
- Because slots are sampled independently, many of these $(\hat z, \hat x)$ pairs correspond to unseen combinations ⇒ they lie in $X_{\text{OOD}}$.
Now you have unlimited synthetic OOD data with ground-truth latents!
Then you can train a new encoder $\hat g$ to invert $\hat f$ on this synthetic data:
$$\min_{\hat g} \; \mathbb{E}_{\hat z \sim p_{\hat z}} \big\|\hat z - \hat g(\hat f(\hat z))\big\|^2 \tag{4.4}$$
So we're training on pairs:
- input image: $\hat f(\hat z)$ (possibly an OOD combination),
- target latent: $\hat z$.
This is just a self-supervised autoencoder on synthetic data, but crucially:
- The images explore combinatorial variations (thanks to independent slot sampling).
- So $\hat g$ is trained to correctly invert $\hat f$ across a much larger region of $X$, including OOD (see the sketch below).
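A hedged sketch of that replay loop (Eq. 4.4), assuming the ID slot values are collected in a tensor `id_slots` of shape `(N, K, m)` so that resampling each slot independently produces unseen combinations (names and shapes are assumptions, not the paper's code):

```python
import torch

def sample_recombined_slots(id_slots, batch_size):
    """Resample each slot independently from its ID marginal -> new combinations."""
    n, k, m = id_slots.shape
    idx = torch.randint(n, (batch_size, k))              # one source example per slot
    return torch.stack([id_slots[idx[:, j], j] for j in range(k)], dim=1)

def generative_replay(encoder, decoder, id_slots, steps=10_000, bs=64, lr=1e-4):
    """Eq. (4.4): train g-hat to invert f-hat on synthetic (z, f-hat(z)) pairs."""
    decoder.requires_grad_(False)
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(steps):
        z = sample_recombined_slots(id_slots, bs).flatten(1)   # many OOD combos
        with torch.no_grad():
            x_syn = decoder(z)                                 # synthetic images f-hat(z)
        loss = ((encoder(x_syn) - z) ** 2).mean()              # latent regression target
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder
```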
At test time, you just run the retrained encoder: $\hat z = \hat g(x)$, a single forward pass.
That's what Fig. 4 (right) illustrates: decode synthetic OOD images with $\hat f$ → train $\hat g$ on them → deploy $\hat g$ at test time.
Pros:
- After offline training, inference is fast (single forward pass).
- The encoder has seen many OOD combinations via generated images, so it can generalize better.
Cons:
- Depends on how well $\hat f$ approximates the true generative process.
- If $\hat f$ can't generate realistic or diverse OOD combinations, replay is limited.
In Sec. 5.2 they say:
- They take VAEs trained in § 5.1 (with a Transformer decoder) on PUG datasets. Those give:
  - a decoder $\hat f$,
  - an encoder $\hat g$ trained with a VAE objective on ID.
Then they:
- Apply generative replay (Eq. 4.4):
  - Use $\hat f$ to generate OOD images by recombining slots (like shuffling animal/background slots).
  - Train a new encoder on these synthetic images.
- Apply search (Eq. 4.3) at test time:
  - Use the encoder's output as the initialization.
  - Refine it by gradient-based optimization.
They show that:
- Non-generative approaches (encoder-only, even with lots of data) struggle with OOD combinations.
- Generative approaches using search and replay on top of a good decoder significantly boost OOD performance.
You’re exactly right that they’re assuming some things already trained:
- $\hat f$ (decoder/generator) is learned first, with the inductive biases that make it belong to $F_{\text{int}}$.
- $\hat g$ is initially trained on ID via an autoencoding/VAE objective (Eq. 4.2).
Section 4 then does not focus on how to train $\hat f$; it assumes it already exists and is good, and studies:
- How to invert $\hat f$ on OOD images by:
  - Search: per-image gradient optimization (Eq. 4.3) initialized with $\hat g(x)$.
  - Replay: offline training of $\hat g$ on synthetic OOD images generated by $\hat f$ (Eq. 4.4).
So conceptually:
- Non-generative baseline: just use $\hat g(x)$ trained on ID and hope it generalizes.
- Generative approach (Sec. 4):
  - either refine $\hat g(x)$ via optimization (search),
  - or retrain $\hat g$ on synthetic OOD from $\hat f$ (replay),
  - or both.
If you’d like, next we can:
- Draw pseudo-code for the search and replay procedures as if you were implementing them.
- Or unpack the assumptions on $p_{\hat z}$ and what "independent slot marginals" really mean in practice.
User: hmm...
hmm... but we can assume 'pseudo labels' for the given data, say, the number of classes is the same as the number of data points, and can't we train the encoder to 'predict the label with a limited hidden representation', which is << len(class)?
Assistant: Nice idea, and you're definitely poking in the right place (capacity + bottleneck vs structure), but this particular tric…
Nice idea, and you’re definitely poking in the right place (capacity + bottleneck vs structure), but this particular trick doesn’t actually get you what the paper is worried about.
Let me restate what you’re proposing in my words:
- You have a dataset with $N$ images.
- You assign one unique pseudo-label per image (so $N$ classes).
- You train an encoder $g_\theta(x) \in \mathbb{R}^d$ with a small latent dimension $d \ll N$.
- A classifier head on top of $g_\theta(x)$ predicts the pseudo-label.
- Hope: because the representation is low-dimensional but must separate many “classes”, it will be forced to learn something like the underlying compositional structure (slots).
This feels like an “information bottleneck self-supervised instance discrimination” idea.
The paper's impossibility result is not about the number of classes or the kind of supervision; it's about the space of encoders that fit the ID data and how unconstrained they are OOD.
Roughly, they show (Theorem 3.2) that when $d_x \gg d_z$, for almost any Jacobian/Hessian you want for an encoder $g$ at a point on the manifold, there exists some generator $f \in F_{\text{int}}$ such that $g$ is a valid inverse of $f$ on the data manifold – but its behavior off the manifold (i.e., in unseen combinations) can be arbitrarily bad. The structure that characterizes the "good" inverses $G_{\text{int}}$ only shows up restricted to the data manifold geometry, which you don't know OOD. So you can't encode the right inductive bias with architecture/regularization alone. 2512.08854v1
Your pseudo-label idea still only constrains $g$ on the training points (ID images), not on the whole manifold or its OOD regions. So there are many encoders that:
- give perfect pseudo-label accuracy on all training images,
- but implement completely different—and arbitrarily bad—extensions to any new combination (OOD).
Nothing in the pseudo-label objective forces you towards the particular inverse $g \in G_{\text{int}}$ that corresponds to the "true slots".
Intuitively it's tempting to think: "but $d \ll N$, so the representation can't just memorize!"
Unfortunately, it absolutely can.
A deep net encoder with small output dimension can still:
- map each training image $x_i$ to some distinct vector $z_i \in \mathbb{R}^d$,
- and a classifier head can memorize a mapping $\{z_i\} \to \{1, \dots, N\}$.
Key points:
- Continuous codes carry lots of information. Even a 32-dim float vector can encode absurdly many distinct codes. We are not limited to “one dimension per class” – that’s only true if you force something like one-hot or linear separability constraints and low numeric precision.
- Memorization in low-dim is easy for finite data. For a finite set of $N$ points, you can always construct an injective map $x_i \mapsto z_i \in \mathbb{R}^d$ for very small $d$ (even $d=2$ for reasonable $N$). Then the classifier head (linear or shallow MLP) just learns decision regions around those codes. Geometrically: you only need each training point to sit in a little "cell" of representation space that maps to its label. There is no requirement that those cells are arranged in a way that reflects compositional structure.
- No pressure to share factors. The bottleneck doesn’t force the encoder to separate “animal slot” vs “background slot”; it just forces some kind of compression. The easiest compression for SGD might be something like “two big clusters corresponding to (camel,desert) images vs (penguin,snow) images”, not “4 independent factors”.
So the encoder can happily learn a highly entangled representation that just supports “memorize N instance labels” with no notion of compositionality.
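A tiny runnable illustration of that memorization argument (toy random data, nothing here reflects the paper's setup): a 2-dimensional bottleneck plus a classifier head can still fit one pseudo-label per "image" on a small finite set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, d_x, d = 64, 256, 2                         # 64 "images", 2-dim bottleneck
x = torch.randn(N, d_x)                        # toy stand-ins for images
y = torch.arange(N)                            # one unique pseudo-label per image

encoder = nn.Sequential(nn.Linear(d_x, 128), nn.ReLU(), nn.Linear(128, d))
head = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, N))
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-2)

for _ in range(2000):                          # full-batch training on the finite set
    loss = nn.functional.cross_entropy(head(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()

acc = (head(encoder(x)).argmax(dim=1) == y).float().mean().item()
print(f"pseudo-label accuracy with a {d}-dim code: {acc:.2f}")  # typically close to 1.0
```

Nothing about this fit says anything about where unseen combinations land in the code space, which is the whole problem.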
Take the simple case: two binary factors
- animal: {camel, penguin}
- background: {desert, snow}
There are 4 possible combos, but ID only contains:
- (camel, desert) images
- (penguin, snow) images
Your scheme:
- give each training image its own label,
- train an encoder $g(x) \in \mathbb{R}^d$, with $d$ small.
What can $g$ do?
- It can map all (camel, desert) images into some complicated 1D curve in latent space, all (penguin, snow) images into a different curve.
- The classifier head just cuts that space to separate each individual sample.
When you show an OOD (penguin, desert) image:
- There is zero incentive from training that its representation should be near “penguin slot + desert slot”.
- It will land wherever the network’s entangled features send it—likely near one of the existing ID clusters—so you misinterpret it.
This is exactly the phenomenon they show empirically: even with proper supervised labels for animal/background, encoders trained non-generatively get near-perfect ID accuracy but often fail badly OOD unless you add huge pretraining or extra supervision. 2512.08854v1
Your pseudo-labels are actually weaker supervision than that (they don’t tell the model which factors are shared across images).
What you're proposing is very close to:
- Instance discrimination / contrastive learning (each image as its own “class”),
- with a small embedding dimension.
People already do something like this (SimCLR, MoCo, etc.), and:
- It does give good general representations in practice if you use tons of data and strong augmentations.
- But it still falls under their “non-generative” umbrella: you’re just learning an encoder whose behavior OOD is determined by the training distribution and optimization, not by inverting a structured generator.
And it doesn't touch the paper's main theoretical obstruction: you still have a huge space of encoders consistent with ID data, and nothing that forces you into the specific inverse class $G_{\text{int}}$.
To turn your idea into something that could in principle guarantee compositional generalization, you'd need more than:
- small latent dim, and
- classification/pseudo-label objectives.
You’d need constraints that approximate the derivative structure they identify for good inverses (block-diagonal-ish behavior on the tangent space of the data manifold). But:
- those constraints depend on the geometry of the manifold $X$, especially in OOD regions, which you don't observe;
- that’s exactly why they argue encoder-side inductive biases are fundamentally weaker than decoder-side ones.
So, even if you add pseudo-labels + bottleneck, you’re still in the “encoder can fit ID, OOD is unconstrained” regime.
TL;DR on your proposal
- It’s a reasonable heuristic (and quite close to things tried in SSL already).
- But it does not escape the paper's core critique of non-generative methods:
  - You're still only constraining the encoder on $X_{\text{ID}}$.
  - The bottleneck isn't structured enough to enforce the slotwise/polynomial structure required for compositional generalization.
  - There remain many encoders that fit ID pseudo-labels perfectly and behave arbitrarily OOD.
If you want, we can sketch a concrete toy construction where two different low-dim encoders get 100% pseudo-label accuracy on the same dataset but give totally different decompositions—and opposite predictions—on OOD combos. That’s a nice way to see the ambiguity very explicitly.
User: Hmm, recently I've read this paper.
Hmm, recently I've read this paper. It's about 'not training but generating LoRA'. Do you think there are connections between these?
Assistant: Yeah, I actually think there is a nice conceptual connection – but it's a bit indirect.
Yeah, I actually think there is a nice conceptual connection – but it’s a bit indirect.
Let me phrase both papers in the same language first, then line up the parallels / differences.
- There is a generator $f: z \mapsto x$ mapping latent slots $z = (z_1, \dots, z_K)$ to images $x$.
- They argue: if your decoder $\hat f$ is trained and constrained to a structured class $F_{\text{int}}$ (polynomial-like interactions between slots), then:
  - It can identify the true generator up to slot-wise reparametrization / permutation.
  - Inverting $\hat f$ gives a representation that provably compositionalizes and generalizes to unseen slot combinations (OOD). 2512.08854v1
- Non-generative encoders $\hat g: x \mapsto z$ can't easily be constrained this way; the correct inverse lies in a tiny, hard-to-enforce subclass $G_{\text{int}}$. On a low-dimensional data manifold in a high-dimensional ambient space, lots of wild encoders fit the training data but explode OOD. 2512.08854v1
- Sec. 4 then shows how to use the trained generator:
  - Search: gradient-based inversion $\arg\min_z \|x - \hat f(z)\|^2$.
  - Generative replay: sample synthetic $z$ → $\hat f(z)$ → train a new encoder on those synthetic OOD images. 2512.08854v1
So: learn a structured generator → invert it → get good OOD behavior.
- They view a LoRA adapter $\Delta W$ as a function of its training data $D$: gradient descent "drags" the base weights $W_0$ to $W_0 + \Delta W(D)$. 2506.16406v1
- Instead of re-running optimization for every new dataset, they learn a hyper-network: $G: \text{prompt batch } p \mapsto \Delta W$
- Training data: lots of prompt–checkpoint pairs from many datasets:
  1. Fine-tune LoRAs on many tasks → store checkpoints.
  2. For each checkpoint $m_j$, take a random batch of unlabeled prompts $p_i$ from its dataset.
  3. Encode the prompts with a frozen text encoder (Sentence-BERT, etc.) to get a condition embedding $c_i$.
  4. Train a hyper-convolutional decoder to map $c_i \mapsto$ tokenized LoRA weights, with MSE loss. 2506.16406v1
- At test time, for a new dataset:
  - Feed a handful of its prompts into the encoder + hyper-decoder.
  - Get a LoRA $\Delta W$ in one forward pass – no per-task training.
They explicitly frame this as treating network parameters as a new data modality and doing prompt-conditioned parameter generation instead of gradient-based adaptation. 2506.16406v1
So: learn a generator over weights → sample task-specific LoRAs conditioned on prompts.
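A hedged sketch of that prompt-to-LoRA mapping (a toy MLP "hyper-decoder" over a flattened LoRA; DnD's actual model is a hyper-convolutional decoder over tokenized weights, and every name and size below is a placeholder):

```python
import torch
import torch.nn as nn

class PromptToLoRA(nn.Module):
    """Toy prompt-conditioned LoRA generator: pooled prompt embeddings -> flat LoRA weights."""
    def __init__(self, embed_dim=384, lora_numel=65_536):    # real LoRAs are far larger
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 1024), nn.ReLU(),
            nn.Linear(1024, lora_numel),                      # DnD decodes tokenized chunks instead
        )

    def forward(self, prompt_embeddings):                     # (num_prompts, embed_dim), frozen text encoder output
        cond = prompt_embeddings.mean(dim=0)                  # pool the prompt batch into one condition
        return self.decoder(cond)                             # predicted flat LoRA weight vector

def train_step(model, opt, prompt_emb_t, lora_flat_t):
    """One DnD-style step: MSE between generated and reference LoRA weights for task t."""
    loss = ((model(prompt_emb_t) - lora_flat_t) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```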
At a high level, both papers say: don't directly train a new thing for every case; instead, learn a structured generator once and then adapt via inversion or generation.
More concretely:
- In the compositional reasoning paper, the generator side = the decoder $\hat f(z)$, the encoder side = the representation $\hat g(x)$.
- They show that structural assumptions are much easier to impose on the generator side $F_{\text{int}}$ than on the inverse side $G_{\text{int}}$, which is why generative modeling wins for OOD compositionality. 2512.08854v1
DnD is philosophically similar:
- Instead of "training the adapter by gradient descent for each task" (the analog of repeatedly learning a discriminative model), they build a generator over LoRA space: $\text{condition (prompts)} \rightarrow \text{weights}$
- That’s a generator living in weight space rather than input space.
You could say:
- The compositional paper: “be generative about images”.
- DnD: “be generative about adapters”.
Both are pushing the work into a single, amortized generative system instead of re-doing task-specific optimization.
- Compositional paper:
  - True inference: given $x$, solve $x = \hat f(z)$.
  - Naïve method: per-image gradient search.
  - Amortized method: train an encoder via generative replay so a single forward pass approximates the solution. 2512.08854v1
- DnD:
  - True adaptation: given a dataset $D$, run gradient descent to find the LoRA $\Delta W(D)$.
  - Naïve method: per-task fine-tuning.
  - DnD: train a hyper-network on past (prompts, checkpoints) pairs so a single forward pass approximates $\Delta W(D)$. 2506.16406v1
So DnD is basically “amortized fine-tuning,” exactly like an amortized inference network is “amortized posterior estimation.” That’s structurally very close to the search vs replay discussion in Sec. 4.
The kind of generalization is different, but the flavor is similar:
- Compositional model:
  - Generalizes from some combinations of latent slots (ID) to new combinations (OOD).
  - Because the generator class $F_{\text{int}}$ encodes how slots interact across all of $Z$, not just on data you've seen.
- DnD:
  - Generalizes from some datasets / tasks (the ones you have LoRAs and prompts for) to new datasets.
  - The hope is that prompts act as a "fingerprint" of dataset semantics, and the hyper-network has learned a structured weight space where nearby prompts map to useful nearby LoRAs. 2506.16406v1
You can think of this (informally) as a kind of compositionality over tasks:
- Each dataset’s prompts contain mixtures of “concepts” (domain, difficulty, style, modality, etc.).
- DnD is trying to learn how those conceptual patterns in prompts map into specific regions of LoRA space.
The compositional theory paper gives you a math-y framework for this kind of story in input space; DnD is doing an analogous trick empirically in weight space.
This is the part where I'll be a bit critical, since you asked me to push on weaknesses.
In the compositional paper, they prove that:
- If $\hat f \in F_{\text{int}}$ fits the data, then it essentially must be the true generator (up to slotwise transforms).
- This is a strong structural statement about which decoders are allowed, via Jacobian/Hessian constraints. 2512.08854v1
DnD doesn’t have anything analogous:
- The hyper-convolutional decoder is chosen mostly for efficiency, not because it’s the unique class of functions that can map prompts to LoRAs “correctly.” 2506.16406v1
- For a given dataset, there may be many LoRAs that give similar performance; and for a given set of prompt batches, many mappings from text embeddings to weights will fit the training checkpoints.
So the "generator over weights" in DnD is more like a strong practical prior than a theoretically justified $F_{\text{int}}$-style class.
If you were hoping to cleanly import the compositional identifiability result into DnD, that doesn’t work yet: they haven’t characterized the space of good hyper-networks the same way.
The compositional paper is about per-example latent variables $z(x)$, with slots that can be recombined across images.
DnD:
- Condition = a batch of prompts from the dataset, not individual examples.
- It generates one LoRA per dataset/task, not a per-input latent code.
So the “slots” here are really “tasks across datasets”, not “factors inside one datapoint”. That’s a much coarser granularity.
You could imagine a more compositional version of DnD, where the parameter generator has explicit factors like:
- domain slot,
- reasoning-style slot,
- modality slot,
- difficulty slot,
and you combine them to synthesize adapters for totally new mixtures. The current DnD architecture doesn’t enforce or expose such a factorization; it just hopes the text encoder + hyper-conv decoder learn something reasonable.
Exactly like we discussed for your pseudo-label idea:
- Given finitely many training datasets and checkpoints, a high-capacity hyper-network can, in principle, memorize the mapping from condition embeddings to weights.
- Even with MSE on tokenized weights, you can approximate the mapping arbitrarily well on those training pairs, while behaving unpredictably on new datasets.
The paper shows empirically that DnD does generalize to new benchmarks (e.g., common sense → science, text → multimodal) and often beats the average of training LoRAs. 2506.16406v1
But unlike the generator-identifiability result, there’s nothing that rules out “memorize and interpolate poorly” solutions theoretically.
This is where it gets interesting for research ideas:
- Compositional hyper-networks. You could try to design a structurally constrained hyper-network class over LoRA space that plays the same role as $F_{\text{int}}$:
  - Decompose LoRA updates into a sum of slot-specific contributions, $\Delta W = \sum_k h_k(s_k)$, where the $s_k$ are "task factors" extracted from prompts.
  - Constrain interactions between these factors in a polynomial / low-order fashion (like the image generator).
  - Aim for formal results like: if the hyper-network fits many tasks, it must be close to the "true" factorized mapping.
- Two-level generative story. Combine them:
  - Bottom level: a compositional generator over data (like the first paper), giving strong OOD reasoning inside a task.
  - Top level: a parameter generator (DnD-style) that quickly moves that generator into a task-specific region of parameter space given prompts from a new domain.
  Then you're using generative structure both in input space and in weight space.
- Replay in weight space vs replay in data space. The compositional paper uses generative replay of images to train a better encoder. DnD can be read as a form of "replay in parameter space": you train on many past LoRAs and learn a distribution over them. There might be a neat meta-learning angle where you use a structured data generator to create synthetic tasks, fine-tune them into synthetic LoRAs, and then train a DnD-style hypernetwork on that synthetic parameter corpus.
So, to answer your question directly:
- Yes, there is a conceptual connection: both papers say “don’t just directly train new things for each case; instead, learn a structured generator (over data or weights) and then adapt via inversion or generation.”
- The compositional paper is more theoretical, giving a clean story about why structure on the generator side leads to compositional OOD generalization and why encoder-only approaches are under-constrained. 2512.08854v1
- DnD is more engineering / meta-learning, showing empirically that a generator over LoRA parameters conditioned on prompts can replace gradient-based adapter training in many scenarios. 2506.16406v1
But DnD doesn’t yet have the kind of identifiability / “right inverse” guarantees the first paper has. If you’re thinking about bridging the two, I’d focus on:
- designing structured parameter generators with explicit factorization,
- and asking whether you can prove “good OOD behavior over tasks” in the same spirit as their compositional generalization over slot combinations.
If you want, we can try to sketch a concrete toy setting where:
- tasks are defined by combinations of a few discrete “factors”, and
- a DnD-style hypernetwork with an $F_{\text{int}}$-like constraint over LoRA space could provably generalize to unseen task combinations.
User: oh, how about training 'delta' instead?
oh, how about training 'delta' instead? oh, umm... this must be worse than generating lora right?
Assistant: Short answer: "training δ" (Delta Activations) instead of generating a LoRA is not exactly worse, it's mostly a different…
Short answer: “training δ” (Delta Activations) instead of generating a LoRA is not exactly worse, it’s mostly a different object – and by itself it can’t replace LoRA generation. To actually adapt the base model, you still need some mapping back to weights or some retrieval mechanism.
Let me unpack that a bit.
Delta Activations define, for a finetuned model $f$,
$$\Delta_f(x) = h_f(x) - h_{\text{base}}(x)$$
where $h_f(x)$ is the last-layer hidden state of the final token for a probe prompt $x$. Then they average over a small probe set $D_{\text{probe}}$ (like 5 generic Alpaca-style prompts) to get a single embedding vector $v_f \in \mathbb{R}^{4096}$ per finetuned model. 2509.04442v1
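A hedged sketch of computing $v_f$ with Hugging Face transformers-style causal LMs (assumes the models return `hidden_states`; the probe prompts and model handles are placeholders, not the paper's exact setup):

```python
import torch

@torch.no_grad()
def last_token_state(model, tokenizer, prompt):
    """Last-layer hidden state of the final token for one probe prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]            # shape: (hidden_dim,)

@torch.no_grad()
def delta_activation(base_model, finetuned_model, tokenizer, probe_prompts):
    """v_f: average of h_f(x) - h_base(x) over a small generic probe set."""
    deltas = [
        last_token_state(finetuned_model, tokenizer, x)
        - last_token_state(base_model, tokenizer, x)
        for x in probe_prompts
    ]
    return torch.stack(deltas).mean(dim=0)         # one embedding per finetuned model
```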
So:
- LoRA: millions of parameters that actually change the model.
- Delta Activations: a 4K-dimensional descriptor of how the model changed, useful for clustering, selection, etc., but not itself an adapter.
That’s why they use it for:
- Clustering models by domain,
- Additive “Math+Code ≈ Math + Code in embedding space”,
- Model selection / merging in hubs like LoraHub. 2509.04442v1
They never actually run the base model with δ; δ is only a feature.
There are at least two possible interpretations of your idea:
Interpretation 1 – train something small per task, embed it with Δ, and reuse existing LoRAs. Workflow:
- Precompute $v_f$ (the Δ embedding) for a library of finetuned LoRAs.
- For a new task:
  - Do some training or a few-shot finetune to get a temporary model,
  - Compute its Δ embedding $v_{\text{task}}$,
  - Use nearest neighbors in Δ-space to select existing LoRAs, maybe merge them.
This is actually very close to what the Delta paper already explores (few-shot task embeddings + model selection / merging). 2509.04442v1
Pros:
- You only ever train small finetunes (few-shot) to get a task embedding.
- The heavy adaptation is done by reusing LoRAs you already have.
Cons:
- You still need real LoRAs in your library – δ alone does not adapt the base model.
- If your library doesn’t already contain something close to the new task, retrieval can’t magically create it.
Compared to DnD “generate a fresh LoRA for any new dataset”, this is strictly weaker in expressiveness; it can’t cover capabilities that aren’t already somewhere in the library.
More radical version:
- Collect many tasks, each with:
  - LoRA weights $\Delta W$,
  - a Δ-embedding $v_f$ of that LoRA.
- Learn a generator $G$ such that $G(v_f) \approx \Delta W$.
- For a new task, only train a low-dimensional code $v$ via gradients (hundreds or thousands of params), and get its LoRA as $G(v)$.
This is basically:
- Factorizing LoRA space into:
  - a big shared generator $G$ (frozen at test time),
  - a small task code $v$ that you optimize (see the sketch below).
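A hedged sketch of that "optimize only the code" loop, where `generator` stands in for a frozen $G$, `apply_lora` for whatever plugs a flat LoRA into the base model, and `task_loss` for the downstream objective (all placeholders, not an existing API):

```python
import torch

def adapt_via_code(generator, apply_lora, task_loss, task_batches,
                   code_dim=4096, steps=500, lr=1e-2):
    """Optimize a small task code v; the frozen generator maps v -> LoRA weights."""
    generator.requires_grad_(False)
    v = torch.zeros(code_dim, requires_grad=True)    # ~4k params instead of millions
    opt = torch.optim.Adam([v], lr=lr)
    for _, batch in zip(range(steps), task_batches):
        lora_flat = generator(v)                     # G(v): candidate LoRA weights
        model = apply_lora(lora_flat)                # base model + generated adapter
        loss = task_loss(model, batch)               # ordinary task loss; grads flow to v
        opt.zero_grad(); loss.backward(); opt.step()
    return generator(v).detach()                     # final LoRA for this task
```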
Compared to DnD:
- DnD: learns a direct mapping from prompts → LoRA, then no training at adaptation.
- “Train δ instead”: learns mapping code → LoRA, then you still train δ per task.
So yes, if your goal is “no training for new tasks”, this is worse than generating LoRAs à la DnD: you added optimization back into the loop.
But as a meta-learning scheme it’s actually interesting:
- You only optimize ~4k parameters instead of millions (LoRA).
- The generator encodes a strong prior over “reasonable” LoRAs (learned from many tasks).
- It could be more sample-efficient than training a fresh LoRA from scratch for each new task.
The weak points:
- The Δ embedding is very lossy by design; many different LoRAs map to similar Δ. The inverse map $G$ is underdetermined and might collapse diversity.
- Delta Activations were engineered to cluster domain-level behaviors, not to preserve all fine-grained information needed for optimal performance. Using them as the only latent for reconstruction is risky.
- You'd have to show empirically that optimizing δ + using $G$ can reach the same task performance as training the LoRA directly. My guess: you'd often lose some peak performance, especially for hard, narrow tasks.
So, is training δ "worse" than generating LoRA?
If you mean strictly in the DnD sense (“I want zero-shot prompt-to-weights”):
- Yes, training δ is worse on the axis they care about (no optimization at adaptation time).
- δ is a summary of behavior, not an adapter. You still need a path from δ → weights or δ → retrieval.
If you mean as a general design choice for adaptation/meta-learning:
- It’s not obviously worse; it’s just another trade-off: - Pros: lower-dimensional, better for clustering/retrieval, maybe more stable and regularized. - Cons: doesn’t directly control the model, lossy, needs extra machinery (retrieval or generator) to actually adapt the LLM.
A realistic “best of both worlds” design might be:
- Use Delta Activations to build a model/task embedding space and organize your LoRA zoo. 2509.04442v1
- On top of that, learn a small hypernetwork $H$ that maps from Δ (plus maybe statistics of the new dataset's prompts) to a correction LoRA (or to mixing coefficients over existing LoRAs).
- Optionally allow a low-dim δ code to be optimized per task, but keep most parameters in $H$ shared.
That would sit in between:
- pure retrieval in Δ-space, and
- full DnD-style LoRA generation.
If you want, we can sketch a concrete toy experiment design to test “train δ instead of LoRA” and see where it breaks (and where it might actually help).
User: Hmm....
Hmm.... how about generating lora with delta activations, not with datasets?
Assistant: Nice, now we're in the fun "mash up two papers and see what falls out" zone 😄
Nice, now we’re in the fun “mash up two papers and see what falls out” zone 😄
Let me restate your idea in my words first: condition the LoRA generator on a Delta Activation vector δ (a behavioral embedding of a finetune), instead of on a batch of prompts from a dataset.
Short version:
- You can build a LoRA generator conditioned on Delta Activations,
- but by itself it doesn’t solve “no-training adaptation for new tasks”.
- It’s strongest as a way to organize, compose, or refine existing LoRAs, not as a replacement for dataset-based conditioning.
Let’s break that down a bit.
In the Delta Activations paper:
- For each finetuned model (or LoRA), you compute a single vector $\delta \in \mathbb{R}^d$ (say $d \approx$ a few thousand) summarizing how its internal activations differ from the base model on a fixed small probe set.
- This $\delta$ behaves like a "semantic embedding of the finetune":
  - similar tasks → similar deltas,
  - you can add / interpolate them for rough composition,
  - you can cluster models or select good LoRAs based on this space.
Crucial: $\delta$ is not an adapter; it does not change the base model. It's more like a descriptor of the adapter you already trained.
Imagine we train a network $G$ such that:
$$G(\delta) \approx \text{LoRA weights}$$
using many (δ, LoRA) pairs from your LoRA zoo.
Then:
- For an existing LoRA, you can encode it → δ → decode back LoRA (autoencoder in behavior space).
- You can do ops like $\delta_{\text{combo}} = \delta_A + \delta_B$ and then feed $G(\delta_{\text{combo}})$ to get a new, "blended" LoRA.
That’s actually a cool idea:
- Delta Activations give you a nice, low-dimensional latent space over model behaviors.
- A generator $G$ makes that latent space "actionable" by turning latents back into weights.
But notice: this is fundamentally about recombining / editing behaviors of LoRAs you already have.
If you don’t use datasets at all, then:
- You can sample random δ and generate random LoRAs…
- …but you don’t know what these LoRAs are good for. You’ve lost the link to any task semantics.
So with only δ as condition, you can:
- Explore a smooth manifold of “things like the LoRAs I trained before”
- Compose known domains (e.g., mix math & code LoRAs via δ-math + δ-code)
- Possibly regularize / denoise noisy LoRAs via autoencoding
…but you cannot say “give me a LoRA for this new dataset / capability”, because δ doesn’t know what that is unless you somehow produced δ from that task.
This is the key point where it becomes weaker than DnD-style conditioning.
- DnD:
  - Condition = a batch of prompts from the new dataset.
  - No training required at adaptation time: you just feed prompts in and get a LoRA out.
- "Generate LoRA from δ":
  - Condition = a Delta Activation vector describing "what the model should do".
  - To get δ for a new task, you need something that already behaves like that task:
    - a finetuned LoRA,
    - a few gradient steps,
    - or some other expensive process.
So unless you have a magical way to compute δ directly from raw data (without training), you're back to "first train something that does the task, then embed it".
In that sense, yes, for “no-training” adaptation it’s strictly worse than DnD:
- DnD: dataset → LoRA (zero-shot)
- δ-conditioning: requires a model that already approximates the task to define δ.
Even though it's not a drop-in replacement for dataset-conditioning, there are interesting uses:
- Train an encoder E: LoRA → δ (basically Delta Activations).
- Train a decoder G: δ → LoRA.
- Now: - Edit δ (add, interpolate, project, do PCA or whatever), - Decode back to LoRA.
This gives you:
- A smooth latent editing space for LoRAs, like a StyleGAN latent space but for adapters.
- Strong prior: G only outputs LoRAs similar to those seen during training, reducing weird failure LoRAs.
Weakness:
- It’s still limited to the convex-ish hull of behaviors in your LoRA zoo.
- Doesn’t help you create fundamentally new capabilities you’ve never finetuned for.
If δ behaves compositional-ish (e.g., "math + code ≈ math + code in delta space"), you can:
- Use δ-space as the slot space for task factors.
- Then G(δ) is the generator that instantiates those factors as a concrete LoRA.
This connects nicely to the compositional paper you read earlier:
- δ plays the role of “slots” (task factors),
- G is the generator f,
- LoRA weights are “images” in that manifold.
You can then play generative games like:
- δ_new = δ_A (domain) + δ_B (behavior style)
- LoRA_new = G(δ_new)
Again the limitation: you’re composing known high-level behaviors; you still don’t get a new capability without some supervision somewhere.
A more powerful approach would be:
- Condition G on both δ and the dataset / prompts, e.g. $\text{condition} = (\delta_{\text{prior}}, \text{prompt stats})$
- δ_prior: your best guess of “what kind of task this is like” (e.g. from retrieval in Δ-space).
- Prompt stats: information from the new dataset, like DnD.
Then you get:
- Prior over LoRA space from δ (reuse of existing skills),
- Fine-grained, task-specific signal from prompts.
This could make adaptation faster and more stable than using prompts alone.
- If you literally mean only δ, no dataset at inference:
  - You can generate some LoRA, but you don't know what it solves.
  - It's good for exploring / composing existing LoRAs, not for new tasks.
  - It does not give the "no-training adaptation to arbitrary new datasets" that DnD aims for.
- If you're willing to:
  - either (a) use δ for existing LoRAs, and
  - (b) maybe combine δ with dataset prompts for new tasks,
  then it becomes a cool meta-learning & composition tool:
  - δ as a low-dim task/behavior code,
  - a generator G(δ) as a LoRA decoder,
  - possible compositional tricks (δ_add, δ_mix) in that space.
So conceptually it’s promising, but its main weakness is: δ depends on having some model that already does the task, so by itself it can’t replace dataset-conditioned LoRA generation if your goal is “not training but generating LoRA.” It’s a better fit for “not training a new LoRA but re-mixing/compressing existing LoRAs.”
If you want, next step we could sketch a toy experimental design for:
- build δ-space over a bunch of LoRAs,
- train G(δ) → LoRA,
- test: does δ_math + δ_code → LoRA that actually improves on math+code benchmarks?
User: hmm...
hmm... how about... introducing it as an auxiliary task? imo, generating and computing a loss on the LoRA adapter directly is not that easy a task
Assistant: Yeah, using Delta Activations as an auxiliary task actually makes a lot of sense.
Yeah, using Delta Activations as an auxiliary task actually makes a lot of sense. And you’re right: “generate full LoRA and do direct loss on weights” is an ugly, high-dimensional target.
Let me structure this as:
- What’s hard about “generate LoRA and MSE on weights”
- How Delta Activations can be used as an auxiliary signal
- Concrete training designs
- Where this helps vs where it doesn’t
In DnD-style LoRA generation, the core thing is: a hyper-network maps a batch of prompts straight to LoRA weights and is trained with MSE against a reference adapter's weights.
Problems:
- Huge dimension: LoRA is millions of parameters. Even with tokenization, it’s a massive regression target.
- Multi-modality: many different LoRAs can reach similar task performance. Forcing MSE to one specific reference LoRA is over-constraining (penalizing functionally equivalent alternatives).
- Alignment mismatch: what we care about is behavior (how the model answers prompts), not exact equality of weights.
So your intuition is right: “generate LoRA and just do MSE on adapter” is conceptually crude.
Delta Activations $\delta$ give you something much smaller and more behavioral.
Delta Activations give, for each trained LoRA/model:
- A single vector $\delta \in \mathbb{R}^d$ (e.g., 4k dim)
- Computed as average difference in hidden states on a small probe set between base and finetuned.
- That vector summarizes “how this LoRA changes the model’s behavior on generic probes.”
So for each training example (a task / dataset) you have:
- Condition input: prompts / dataset features / descriptions
- Target 1: LoRA weights $\Delta W$
- Target 2: Delta Activation $\delta$
You can absolutely make "predict $\delta$" an auxiliary task:
- Because $\delta$ is low-dimensional and behavior-based,
- It can regularize the representation / conditioning side of the generator in a much nicer way than raw weight MSE.
Key idea: keep the weight-level objective, but add δ as a low-dimensional, behavior-aligned auxiliary target (or even as a bottleneck) for the generator.
Let's sketch a few variants.
Data for each training LoRA $t$:
- Prompt batch $P_t$ (like DnD)
- True LoRA weights $\Delta W_t$
- True Delta Activation $\delta_t$
Model:
- Condition encoder: $e_t = E(P_t)$ (e.g., text encoder + pooling)
- LoRA generator: $\hat{\Delta W}_t = G_\theta(e_t)$
- Delta head:
  - Option 1 (cheaper): a small learned approximator $\hat\delta_t = H_\phi(\hat{\Delta W}_t)$.
  - Option 2 (more faithful): actually run base + LoRA on a fixed probe set to compute $\hat\delta_t$.
Loss:
$$\mathcal{L} = \lambda_\text{weights} \underbrace{\|\hat{\Delta W}_t - \Delta W_t\|^2}_{\text{classical DnD-style loss}} + \lambda_\delta \underbrace{\|\hat\delta_t - \delta_t\|^2}_{\text{auxiliary behavioral loss}}$$
What does the δ-loss do?
- It pushes the generated LoRA to not only be close in parameter space, but also to induce the correct global behavior on probe prompts.
- It gives an extra gradient signal that’s much lower-dim and semantically aligned, which can stabilize training especially when weight loss is noisy / multi-modal.
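A hedged sketch of one training step with this combined loss, using the cheaper Option 1 (a learned `delta_head` that approximates the Delta operator; all module names are placeholders):

```python
import torch

def variant_a_step(cond_encoder, lora_generator, delta_head, opt,
                   prompt_emb_t, lora_flat_t, delta_t,
                   lam_w=1.0, lam_delta=0.1):
    """One step of: DnD-style weight MSE + delta auxiliary loss."""
    e_t = cond_encoder(prompt_emb_t)                     # condition embedding for task t
    lora_hat = lora_generator(e_t)                       # generated flat LoRA weights
    delta_hat = delta_head(lora_hat)                     # cheap surrogate for Delta(LoRA)
    loss_w = ((lora_hat - lora_flat_t) ** 2).mean()      # classical DnD weight MSE
    loss_d = ((delta_hat - delta_t) ** 2).mean()         # auxiliary behavioral loss
    loss = lam_w * loss_w + lam_delta * loss_d
    opt.zero_grad(); loss.backward(); opt.step()
    return loss_w.item(), loss_d.item()
```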
Weak points:
- If you compute $\hat\delta_t$ via true forward passes on probes, this adds heavy cost per step.
- If you approximate it with $H_\phi$ (LoRA → δ), you've added another network that might be lossy; but maybe that's okay because δ itself is coarse.
To avoid directly regressing millions of weights from prompts, you can insert δ as a bottleneck:
Stage 1: learn a task encoder using δ
- Train $E_\psi$ s.t. $E_\psi(P_t) \approx \delta_t$.
- That’s a simple regression in low dimension; way easier than full LoRA.
This makes δ the “task embedding space” learned from actual finetuned behaviors.
Stage 2: learn δ → LoRA map
Now train a separate generator $G_\theta$ to map δ to LoRA:
$$\hat{\Delta W}_t = G_\theta(\delta_t)$$
with weight-level loss:
$$\mathcal{L}_\text{stage2} = \|\hat{\Delta W}_t - \Delta W_t\|^2$$
At inference:
- New dataset ➜ prompts $P_{\text{new}}$
- Stage 1: $\hat\delta_{\text{new}} = E_\psi(P_{\text{new}})$
- Stage 2: $\hat{\Delta W}_{\text{new}} = G_\theta(\hat\delta_{\text{new}})$
Now δ is both:
- A learned task representation (Stage 1), supervised by behavioral deltas;
- The bottleneck that the LoRA generator uses (Stage 2).
Pros:
- You've decomposed a very hard mapping "prompts → millions of weights" into:
  - "prompts → 4k-dim δ" (reasonable), and
  - "δ → weights" (a pure parameter-space mapping).
- You can even pretrain Stage 2 using any LoRA–δ pairs you have, independent of prompts.
Weaknesses:
- δ is intentionally lossy. Many LoRAs can share similar δ, especially if probes are generic. So δ alone might not contain enough info to reconstruct exact LoRA.
- You’re back to weight-level MSE in Stage 2; though G has an easier job because δ already encodes “what kind of task” it is.
You can combine both worlds: in Stage 2, also add a δ-consistency loss:
$$\|\text{Delta}(\hat{\Delta W}_t) - \delta_t\|^2$$
so that even if the weights aren't exactly matching, their induced δ is.
A more behaviorally principled variant:
- Prompts $P_t$ → embedding $e_t$
- Generator $G(e_t) = \hat{\Delta W}_t$
- Use two behavioral losses instead of relying heavily on raw weight MSE:
  - Delta loss: $\|\text{Delta}(\hat{\Delta W}_t) - \delta_t\|^2$ (behavior on a fixed probe set).
  - Output distillation loss: on a small set of task-specific prompts $Q_t$, minimize, e.g., $\text{KL}\big(p_{\text{base}+\hat{\Delta W}_t}(\cdot \mid q) \,\|\, p_{\text{base}+\Delta W_t}(\cdot \mid q)\big)$ so that the generated LoRA matches the real LoRA's outputs.
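A hedged sketch of the output-distillation term (token-level KL on a few task prompts; `logits_fn` is a placeholder for "run base + this adapter on prompt q and return logits", and the sketch uses the standard teacher-to-student distillation direction):

```python
import torch
import torch.nn.functional as F

def distillation_loss(logits_fn, lora_real, lora_gen, task_prompts):
    """Match the generated-LoRA model's outputs to the real LoRA's outputs on Q_t."""
    total = 0.0
    for q in task_prompts:
        with torch.no_grad():
            teacher = logits_fn(lora_real, q)            # (seq_len, vocab) reference logits
        student = logits_fn(lora_gen, q)                 # gradients flow into lora_gen
        total = total + F.kl_div(                        # KL(teacher || student)
            F.log_softmax(student, dim=-1),
            F.softmax(teacher, dim=-1),
            reduction="batchmean",
        )
    return total / len(task_prompts)
```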
Now weight-level supervision becomes optional or just a small regularizer. Most of the signal is:
- “your LoRA should behave like this known one on probes and on some task prompts”.
This is much closer to what we actually want (function match), and δ is a cheap, global summarization of that behavior.
Weaknesses:
- Needs two model evaluations (base+true-LoRA and base+gen-LoRA) during training. Expensive but maybe ok if you restrict to small probe sets.
- Still doesn't solve the identifiability problem perfectly: lots of LoRAs can match the same δ and the distillation targets on a small $Q_t$ but differ elsewhere. But that might be fine in practice.
Where this helps
- Better supervision for the condition encoder:
  - Training "prompts → δ" is a cheap, well-shaped regression problem.
  - δ is aligned with the semantic behavior of the LoRA, so the encoder learns a meaningful task embedding instead of overfitting to weird idiosyncrasies of weight tensors.
- Regularization / robustness:
  - A generator that matches δ (behavior on generic probes) and not just weights is less likely to cheat by pinning random parameters that happen to match the target LoRA numerically.
- Easier debugging / interpretability:
  - You can inspect δ predictions directly and see:
    - "Does the model think this task is close to math+code?"
    - "Do δ's cluster as we expect?"
  - That gives you a handle that raw weight MSE doesn't.
Where it doesn't help
- Cost of computing δ during training:
  - If you rely on the true Delta operator (base + LoRA forward passes on probes) for every gradient step, this is heavy.
  - You'll probably approximate it with a small network $H(\Delta W)$, but then you're stacking approximations.
- No direct effect on "no-training adaptation":
  - The hard part DnD solves is: from only prompts of a NEW dataset, output a useful LoRA with zero finetuning.
  - The δ auxiliary loss helps learn a better mapping during training, but at test time you still need the prompts → LoRA generator. You're not replacing dataset-conditioning, just helping it.
I’d say:
- Yes, that's a genuinely good idea as an architectural / training improvement:
  - δ is a natural, low-dim behavioral target.
  - Auxiliary "predict δ" or "match δ" losses can stabilize training and push the generator toward behaviorally relevant solutions instead of pure weight matching.
- But:
  - δ cannot be the only supervision if you want precise adapters. It's too low-bandwidth and probe-dependent.
  - You still need some combination of:
    - weight-level MSE (even if downweighted),
    - and/or distillation on outputs for a subset of prompts.
If you want to turn this into a concrete research idea, an experiment that’s not insane in scale could be:
- Take a smaller LLM + a collection of LoRAs on multiple tasks.
- Compute δ for each LoRA using fixed probes.
- Train three LoRA generators (prompt-conditioned) with:
  - (a) weight-only MSE (DnD-style baseline),
  - (b) weight MSE + δ auxiliary loss,
  - (c) δ + output distillation but no weight MSE.
- Evaluate on:
  - in-domain tasks (reproducing known LoRAs),
  - cross-task generalization (unseen tasks or unseen mixtures),
  - robustness when reference LoRAs are noisy or partially corrupted.
My prediction:
- (b) will outperform (a) in stability and maybe generalization,
- (c) will be competitive on easy tasks but may underperform on hard ones → this will expose exactly how much information δ carries.
If you'd like, we can sketch that experimental setup more concretely (num tasks, probe design, what exact losses to use), or even think about borrowing the compositional-theory angle: δ as slot-space, the generator as $f$, and asking when "δ-compositionality" implies good LoRA compositionality.
User: ok would you write a research plan for this, including backgrounds and motivations?
ok, would you write a research plan for this, including background and motivation? I think this can be an incremental proposal, or an improvement for the data-to-LoRA task. ofc I won't use delta as the only supervision.
Assistant: Nice, this is a really reasonable incremental project.
Nice, this is a really reasonable incremental project. I’ll write this as a research-plan skeleton you could almost paste into a proposal.
Tentative Title
Behavior-Aware Data-to-LoRA Generation via Delta Activations
1. Background
1.1 Parameter-efficient finetuning and data-to-LoRA generation
Large language models (LLMs) are commonly adapted via LoRA: low-rank adapters $\Delta W = BA$ added to a frozen base model, trained on task data $D$. 2506.16406v1
Recent work like Drag-and-Drop LLMs (DnD) treats LoRA updates themselves as generative targets: a hyper-network takes a batch of unlabeled prompts from a dataset and directly outputs the LoRA weights for that dataset. This collapses the classical data→gradients→weights loop into a single forward pass and removes per-task finetuning.
DnD trains the hyper-network with MSE on tokenized LoRA weights between generated and reference adapters. 2506.16406v1
Limitations of current data-to-LoRA training:
- The target space (LoRA weights) is huge, so pure MSE is noisy and high-variance.
- Weight space is multi-modal: many different LoRAs can yield similar task behavior, but MSE to a single reference penalizes all but one.
- The supervision is purely parametric, not directly aligned with model behavior.
This makes training the generator harder, and probably limits generalization to unseen tasks.
1.2 Delta Activations as behavioral model embeddings
Delta Activations propose a compact, behavior-based representation for finetuned models. For a base model $f_\text{base}$ and a finetuned model $f$, they define:
- For each probe input $x$ from a small generic probe set $D_\text{probe}$, compute the last-layer hidden state for the final token:
  - $h_f(x)$ for the finetuned model,
  - $h_\text{base}(x)$ for the base model.
- The delta activation for that input is: $\Delta_f(x) = h_f(x) - h_\text{base}(x)$
- Aggregate over the probe set: $v_f = \frac{1}{|D_\text{probe}|} \sum_{x \in D_\text{probe}} \Delta_f(x)$, giving a single vector $v_f \in \mathbb{R}^d$.
They show that:
- $v_f$ clusters finetuned models by domain (math, coding, legal, etc.).
- The embedding is approximately additive: combining finetuning datasets corresponds to vector addition in the delta space. 2509.04442v1
- A few-shot finetuned model can serve as a task embedding via its Delta Activations.
So Delta Activations give a low-dimensional, behavior-aligned representation of how a LoRA modifies a model.
1.3 Gap and opportunity
DnD learns a mapping:
- prompts from dataset → LoRA weights,
- supervised only in weight space.
Delta Activations gives:
- finetuned model (or LoRA) → behavioral embedding δ.
These two lines almost meet, but not quite:
- DnD knows how to go from data to weights, but doesn’t explicitly use any behavioral summary.
- Delta Activations knows how to go from weights to behavior, but doesn’t feed back into adapter generation.
Idea: use Delta Activations as auxiliary behavioral supervision for the data→LoRA generator, to make training easier and improve generalization.
You won’t replace weight-level supervision with δ (good instinct) – you’ll add δ as an extra, low-dimensional, behavior-aligned target.
2. Goals and Research Questions
Goal: Improve data-to-LoRA generation by incorporating Delta Activations as an auxiliary task, making the generator more behavior-aware and easier to train.
RQ1 – In-domain reconstruction: Does delta-based auxiliary supervision improve how well a prompt-conditioned generator reproduces known LoRAs and their behavior on the same tasks?
RQ2 – Zero-shot cross-task generalization: Given only prompts from a new dataset, does the delta-augmented generator produce LoRAs that perform better (or at least more robustly) than a DnD-style baseline?
RQ3 – Robustness and sample efficiency: Does δ supervision help when training data (number of prompt–checkpoint pairs) is small, or when LoRA targets are noisy?
RQ4 – Compositional task behavior: If we interpolate or add deltas (e.g., δ_math + δ_code), and decode them into LoRAs, do we get meaningful joint capabilities?
3. Proposed Method
3.1 Setup and data
Start from a DnD-like setup:
- Base LLM (e.g., LLaMA-3.1-8B or a smaller variant).
- A pool of finetuned LoRA adapters over diverse domains (e.g., LEGAL, MATH, MEDICAL, COMMONSENSE, CODING) as in the Delta paper. 2509.04442v1
- For each LoRA $\Delta W_t$, you have: its training prompts / dataset, the finetuned adapter weights, and its Delta Activation vector $\delta_t$ (which you can compute).
Generator architecture:
- Reuse DnD’s macro-architecture: frozen text encoder over prompt batches, feeding a hyper-convolutional decoder that outputs tokenized LoRA weights.
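To make the interfaces concrete, here is a toy stand-in for the prompt-conditioned generator. This is not DnD's actual hyper-convolutional decoder; the module structure, pooling, and sizes are placeholder assumptions just to pin down what $E$ and $G$ consume and produce.

```python
# Toy stand-in for the generator interface, NOT DnD's actual architecture.
# A pooled prompt-batch embedding e_t is mapped to a flattened LoRA weight vector;
# sizes are illustrative placeholders.
import torch
import torch.nn as nn

class PromptToLoraGenerator(nn.Module):
    def __init__(self, prompt_dim=768, hidden_dim=1024, lora_numel=65536):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(prompt_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lora_numel),   # flattened ("tokenized") LoRA weights
        )

    def forward(self, prompt_embeddings):        # (num_prompts_in_batch, prompt_dim)
        e_t = prompt_embeddings.mean(dim=0)      # pool the prompt batch into e_t
        return self.decoder(e_t)                 # \hat{ΔW}_t as a flat vector
```

The frozen text encoder that produces `prompt_embeddings` (e.g., a sentence encoder over the prompt batch) is assumed to exist separately.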
3.2 Computing Delta Activations
For each LoRA $t$:
- Fix a small generic probe set $D_\text{probe}$ (e.g., 5 Alpaca-style generic instructions as in the Delta paper). 2509.04442v1
- For each probe $x \in D_\text{probe}$: run the base model to get $h_\text{base}(x)$, run base + LoRA_t to get $h_t(x)$, and compute $\Delta_t(x) = h_t(x) - h_\text{base}(x)$.
- Average to get the embedding $\delta_t$.
We'll treat $\delta_t$ as the behavioral target for LoRA $t$.
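A minimal sketch of this computation, assuming `base_model` and `tuned_model` (base + LoRA_t) are already-loaded Hugging Face causal LMs and `probe_prompts` is $D_\text{probe}$; function and variable names are placeholders.

```python
# Compute the Delta Activation vector delta_t for one LoRA:
# delta_t = mean over probes of [ h_t(x) - h_base(x) ],
# using the last-layer hidden state of the final token.
import torch

@torch.no_grad()
def last_token_state(model, tokenizer, prompt, device="cuda"):
    """Last-layer hidden state of the final token for one prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]          # shape: (d_hidden,)

@torch.no_grad()
def delta_activation(base_model, tuned_model, tokenizer, probe_prompts):
    deltas = []
    for x in probe_prompts:
        h_tuned = last_token_state(tuned_model, tokenizer, x)
        h_base = last_token_state(base_model, tokenizer, x)
        deltas.append(h_tuned - h_base)
    return torch.stack(deltas).mean(dim=0)       # behavioral target delta_t
```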
3.3 Core model variants
You can define a few variants—one as your main method, the rest as ablations.
Variant A – Multi-task (prompts → weights + δ)
- Condition encoder $E$: $e_t = E(P_t)$, where $P_t$ is a batch of prompts from dataset $t$.
- Hyper-decoder $G$: $\hat{\Delta W}_t = G(e_t)$.
- δ-head $H$: Option A1 (cheaper): learn a small network $H$ so that $\hat{\delta}_t = H(\hat{\Delta W}_t)$. Option A2 (more accurate): compute $\hat{\delta}_t$ by actually running base + $\hat{\Delta W}_t$ on the probe set.
Loss:
$$\mathcal{L}_\text{total} = \lambda_\text{w} \underbrace{\|\hat{\Delta W}_t - \Delta W_t\|^2}_{\text{DnD weight MSE}} + \lambda_\delta \underbrace{\|\hat{\delta}_t - \delta_t\|^2}_{\text{Delta auxiliary loss}}$$
This keeps the original DnD objective but adds δ as an extra constraint.
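A minimal sketch of one Variant A training step under Option A1, where `encoder`, `hyper_decoder`, and `delta_head` are placeholder modules standing in for $E$, $G$, and $H$, and `prompt_emb`, `target_weights` (flattened reference LoRA), and `target_delta` come from one training triple.

```python
# Multi-task objective: DnD weight MSE plus the delta auxiliary loss (Option A1).
import torch
import torch.nn.functional as F

def variant_a_loss(encoder, hyper_decoder, delta_head,
                   prompt_emb, target_weights, target_delta,
                   lambda_w=1.0, lambda_delta=0.1):
    e_t = encoder(prompt_emb)                          # e_t = E(P_t)
    pred_weights = hyper_decoder(e_t)                  # \hat{ΔW}_t = G(e_t)
    pred_delta = delta_head(pred_weights)              # \hat{δ}_t = H(\hat{ΔW}_t)
    loss_w = F.mse_loss(pred_weights, target_weights)  # DnD weight MSE
    loss_d = F.mse_loss(pred_delta, target_delta)      # delta auxiliary loss
    return lambda_w * loss_w + lambda_delta * loss_d
```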
Variant B – Two-stage with δ bottleneck
Stage 1: prompts → δ
Train a task encoder:
$\hat{\delta}_t = E_\psi(P_t)$ with loss:
$\mathcal{L}_{\delta\text{-reg}} = \|\hat{\delta}_t - \delta_t\|^2$. Now δ becomes an explicit task embedding space that you learn.
Stage 2: δ → LoRA
Train a separate generator:
$\hat{\Delta W}_t = G_\theta(\delta_t)$ with loss:
$\mathcal{L}_\text{stage2} = \|\hat{\Delta W}_t - \Delta W_t\|^2 + \lambda_{\delta\text{-cons}} \|\text{Delta}(\hat{\Delta W}_t) - \delta_t\|^2$. At inference on a new dataset:
- $P_\text{new} \to \hat{\delta}_\text{new} = E_\psi(P_\text{new})$,
- $\hat{\Delta W}_\text{new} = G_\theta(\hat{\delta}_\text{new})$.
This cleanly factorizes:
- “understand the dataset” (prompts → δ),
- “instantiate a LoRA for a desired δ” (δ → weights).
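A minimal sketch of the Variant B inference path, assuming the two stages `task_encoder` ($E_\psi$: prompt embedding → δ) and `lora_generator` ($G_\theta$: δ → flattened LoRA weights) have already been trained; names are placeholders.

```python
# Zero-shot "drag" on a new dataset with the two-stage δ bottleneck.
import torch

@torch.no_grad()
def generate_lora_for_new_dataset(task_encoder, lora_generator, new_prompt_emb):
    delta_hat = task_encoder(new_prompt_emb)   # \hat{δ}_new = E_psi(P_new)
    flat_weights = lora_generator(delta_hat)   # \hat{ΔW}_new = G_theta(\hat{δ}_new)
    return flat_weights                        # still to be un-flattened into per-layer A, B factors
```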
Variant C – Behavior-first: δ + output distillation
To reduce reliance on raw weight MSE:
- Generate $\hat{\Delta W}_t$ from prompts.
- Compute: δ-consistency $\|\text{Delta}(\hat{\Delta W}_t) - \delta_t\|^2$, and output distillation: on a small set of task prompts $Q_t$, minimize the KL between base + $\hat{\Delta W}_t$ and base + $\Delta W_t$.
Total loss:
$$\mathcal{L}_\text{beh} = \lambda_\delta \|\text{Delta}(\hat{\Delta W}_t) - \delta_t\|^2 + \lambda_\text{distill}\, \mathbb{E}_{q \in Q_t}\, \mathrm{KL}\big( p_{\text{base}+\hat{\Delta W}_t}(\cdot \mid q) \,\|\, p_{\text{base}+\Delta W_t}(\cdot \mid q) \big) + \lambda_\text{w} \|\hat{\Delta W}_t - \Delta W_t\|^2$$
Here the core supervision is behavioral (δ + distillation), with weight MSE as a regularizer.
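A minimal sketch of this objective. Here `delta_of` (exact Delta Activation of a generated adapter) and `logits_with_adapter` (next-token logits of base + adapter on one prompt, shape `(1, vocab_size)`) are hypothetical helpers that would wrap the base model; their names and signatures are assumptions, not an existing API.

```python
# Behavior-first objective: delta-consistency + output distillation + weight MSE regularizer.
import torch
import torch.nn.functional as F

def variant_c_loss(pred_weights, target_weights, target_delta,
                   delta_of, logits_with_adapter, task_prompts,
                   lambda_delta=1.0, lambda_distill=1.0, lambda_w=0.1):
    # delta-consistency: the generated adapter's behavior should match delta_t
    loss_delta = F.mse_loss(delta_of(pred_weights), target_delta)

    # output distillation: KL( p_{base+ΔŴ_t} || p_{base+ΔW_t} ) averaged over Q_t
    kl_terms = []
    for q in task_prompts:
        logp_gen = F.log_softmax(logits_with_adapter(pred_weights, q), dim=-1)
        logp_ref = F.log_softmax(logits_with_adapter(target_weights, q), dim=-1)
        p_gen = logp_gen.exp()
        kl_terms.append((p_gen * (logp_gen - logp_ref)).sum(dim=-1).mean())
    loss_distill = torch.stack(kl_terms).mean()

    # weight MSE kept only as a regularizer
    loss_w = F.mse_loss(pred_weights, target_weights)
    return lambda_delta * loss_delta + lambda_distill * loss_distill + lambda_w * loss_w
```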
4. Experimental Plan
4.1 Datasets and tasks
Use a setup that’s as close as possible to existing works (for easy comparison):
- Base model: a mid-size LLaMA/Gemma/Qwen.
- Task domains: reuse the 5 tasks from Delta Activations (LEGAL, MATH, MEDICAL, COMMONSENSE, CODING) and ideally some from DnD’s benchmarks.
- For each domain: train a reference LoRA (or reuse existing checkpoints if available), and build prompt batches $P_t$ for training the generator.
4.2 Baselines
- DnD baseline – prompts → LoRA, trained only with weight MSE, same architecture.
- LoRA retrieval baseline – maybe a simple “find closest existing LoRA via bag-of-words or prompt embedding” to show that your method indeed does more than retrieval.
4.3 Metrics
- Task performance (main metric): accuracy / exact match / BLEU / etc., depending on domain, on: - in-domain test sets, - unseen test datasets (cross-drag).
- Adapter reconstruction quality: - Parameter MSE (for sanity), - Functional distance: average KL divergence between outputs of base+generated vs base+true LoRA on a held-out set.
- Embedding behavior: distance between $\hat{\delta}_t$ and $\delta_t$, and cluster quality (silhouette score) in δ space for generated vs true LoRAs, as in the Delta paper (see the sketch after this list). 2509.04442v1
- Generalization to new tasks: - Train generator on a subset of domains and evaluate on held-out domains/datasets (e.g., train on MATH+LEGAL+CODING, test on MEDICAL or new dataset). Compare DnD vs δ-augmented variants.
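A minimal sketch of the embedding-behavior metrics, assuming `pred_deltas` and `true_deltas` are `(num_loras, d)` arrays of Delta Activations for generated vs reference adapters and `domain_labels` marks each LoRA's task domain; names are placeholders.

```python
# Embedding-space metrics: per-LoRA L2 distance to the reference delta,
# plus silhouette score of generated-LoRA embeddings grouped by domain.
import numpy as np
from sklearn.metrics import silhouette_score

def embedding_metrics(pred_deltas, true_deltas, domain_labels):
    l2 = np.linalg.norm(pred_deltas - true_deltas, axis=1).mean()
    sil = silhouette_score(pred_deltas, domain_labels)
    return {"delta_l2": float(l2), "silhouette": float(sil)}
```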
4.4 Ablation studies
- Remove δ loss (back to vanilla DnD) to measure gain.
- Vary: $|D_\text{probe}|$ (1, 5, 20 prompts), the loss weights ($\lambda_\delta$, $\lambda_\text{distill}$), and A1 vs A2 for δ computation (predicted vs exact).
- Train with fewer prompt–checkpoint pairs and see whether δ helps under low-data regime.
- Randomize δ (shuffle across tasks) as a negative control to show that improvements vanish.
4.5 Compositional experiments (optional but cool)
Leverage the additive properties of Delta Activations:
- Take δ_math, δ_code, δ_medical from existing LoRAs.
- Form combined codes: δ_math+code = δ_math + δ_code, and δ_math+medical = δ_math + δ_medical.
- Decode via $G$ to LoRAs and test on mixed benchmarks (e.g., math problems with code, math problems referencing medical scenarios).
Compare:
- δ-addition + decode vs
- naive LoRA addition/merging vs
- DnD directly trained on mixed-task data (if available).
This probes whether δ truly captures compositional structure over tasks and whether your generator can realize that composition in weights.
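A minimal sketch of the two composition routes being compared, assuming `lora_generator` is a trained $G_\theta$ from Variant B and the δ vectors come from existing single-task LoRAs; names are placeholders.

```python
# Route 1: add behavioral embeddings, then decode the sum into a candidate "math+code" LoRA.
# Route 2: naive baseline that merges the two adapters directly in weight space.
import torch

@torch.no_grad()
def compose_via_delta(lora_generator, delta_math, delta_code):
    return lora_generator(delta_math + delta_code)

def compose_via_weight_merging(lora_math_weights, lora_code_weights, alpha=0.5):
    return alpha * lora_math_weights + (1 - alpha) * lora_code_weights
```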
5. Expected Contributions
- Method: A data-to-LoRA generator augmented with Delta Activations as auxiliary behavioral supervision, showing improved training stability and/or generalization versus a pure weight-MSE baseline.
- Analysis: Empirical evidence about: - how much behavioral information δ actually carries for reconstructing adapters, - whether δ-based supervision is particularly helpful in low-data or OOD settings.
- Practical takeaway: A better recipe for training DnD-like systems: “always add a δ-head/δ-loss; it costs little and improves robustness.”
- Optional: Insights into compositionality in δ-space (task mixing, factorization), connecting the “generator-side structure” story in the compositional paper to task/adapter-level behavior.
6. Risks / Weaknesses (the honest part)
- δ is inherently lossy. Even with a good probe set, different LoRAs can share similar Delta Activations. So δ can’t fully disambiguate adapters; improvements might be modest rather than dramatic.
- Probe bias. Fixed generic probes might underspecify behavior for narrow domains (e.g., some specialized coding style), and δ could mislead the generator.
- Compute overhead. Computing exact δ for generated LoRAs inside the training loop doubles forward passes for probe prompts. You'll probably need an approximate δ-head $H$, which adds its own approximation error.
- Still no theory. Unlike the compositional generator-identifiability work, there is no clean theorem saying “this δ-regularized generator must produce the right LoRA.” This will be an empirical improvement story, not a theory paper.
But as an incremental, concrete improvement to data-to-LoRA, this is very defensible: it piggybacks on two existing ideas (DnD + Delta Activations), adds a new, plausible type of supervision, and gives lots of measurable knobs to experiment with.
If you want, next step we can narrow down to one main variant (probably multi-task A or two-stage B), and I can help you turn this into a 1–2 page “mini-proposal” with more formal notation and maybe pseudo-code.