=== Assistant: Yeah, I actually think there is a nice conceptual connection – but it’s a bit indirect. ===
Let me phrase both papers in the same language first, then line up the parallels / differences.

==== The compositional-generalization paper, read as a generative story ====
* There is a generator <math>f : z \mapsto x</math> mapping latent slots <math>z = (z_1, \dots, z_K)</math> to images <math>x</math>.
* They argue: if your decoder <math>\hat f</math> is trained and constrained to a structured class <math>F_{\text{int}}</math> (polynomial-like interactions between slots), then:
  - It can identify the true generator up to slot-wise reparameterization / permutation.
  - Inverting <math>\hat f</math> gives a representation that provably compositionalizes and generalizes to unseen slot combinations (OOD). [arXiv:2512.08854v1]
* Non-generative encoders <math>\hat g : x \mapsto z</math> can’t easily be constrained this way; the correct inverse lies in a tiny, hard-to-enforce subclass <math>G_{\text{int}}</math>. On a low-dimensional data manifold in a high-dimensional ambient space, lots of wild encoders fit the training data but explode OOD. [arXiv:2512.08854v1]
* Sec. 4 then shows how to use the trained generator:
  - Search: gradient-based inversion <math>\arg\min_z \|x - \hat f(z)\|^2</math>.
  - Generative replay: sample synthetic <math>z</math>, decode to <math>\hat f(z)</math>, and train a new encoder on those synthetic OOD images. [arXiv:2512.08854v1]

So: learn a structured generator → invert it → get good OOD behavior.

==== DnD, read in the same language ====
* They view a LoRA adapter <math>\Delta W</math> as a function of its training data <math>D</math>: gradient descent “drags” base weights <math>W_0</math> to <math>W_0 + \Delta W(D)</math>. [arXiv:2506.16406v1]
* Instead of re-running optimization for every new dataset, they learn a hyper-network <math>G : \text{prompt batch } p \mapsto \Delta W</math>.
* Training data: lots of prompt–checkpoint pairs from many datasets:
  1. Fine-tune LoRA on many tasks and store the checkpoints.
  2.
     For each checkpoint <math>m_j</math>, take a random batch of unlabeled prompts <math>p_i</math> from its dataset.
  3. Encode the prompts with a frozen text encoder (Sentence-BERT, etc.) to get a condition embedding <math>c_i</math>.
  4. Train a hyper-convolutional decoder to map <math>c_i</math> to tokenized LoRA weights, with an MSE loss. [arXiv:2506.16406v1]
* At test time, for a new dataset:
  - Feed a handful of its prompts into the encoder + hyper-decoder.
  - Get a LoRA <math>\Delta W</math> in one forward pass – no per-task training.

They explicitly frame this as treating network parameters as a new data modality and doing prompt-conditioned parameter generation instead of gradient-based adaptation. [arXiv:2506.16406v1]

So: learn a generator over weights → sample task-specific LoRAs conditioned on prompts.

==== Where the two papers line up ====
At a high level, both papers say: don’t just directly train new things for each case; instead, learn a structured generator and then adapt via inversion or generation. More concretely:

===== 1. Structure is imposed on the generator side =====
* In the compositional reasoning paper, the generator side is the decoder <math>\hat f(z)</math>; the encoder side is the representation <math>\hat g(x)</math>.
* They show that structural assumptions are much easier to impose on the generator side <math>F_{\text{int}}</math> than on the inverse side <math>G_{\text{int}}</math>, which is why generative modeling wins for OOD compositionality. [arXiv:2512.08854v1]

DnD is philosophically similar:
* Instead of “training the adapter by gradient descent for each task” (the analog of repeatedly learning a discriminative model), they build a generator over LoRA space: <math>\text{condition (prompts)} \rightarrow \text{weights}</math>.
* That’s a generator living in weight space rather than input space.

You could say:
* The compositional paper: “be generative about '''images'''”.
* DnD: “be generative about '''adapters'''”.

Both push the work into a single, amortized generative system instead of re-doing task-specific optimization.

===== 2. Amortized inference vs amortized fine-tuning =====
* Compositional paper:
  - True inference: given <math>x</math>, solve <math>x = \hat f(z)</math>.
  - Naïve method: per-image gradient search.
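That naïve per-image gradient search is easy to sketch. The toy below uses a ''linear'' decoder <math>\hat f(z) = Wz</math> purely for illustration (the paper’s decoder class allows richer slot interactions), with orthonormal columns so the descent is well conditioned:

```python
import numpy as np

# Toy "search" inversion: recover z from an image x by gradient descent on
# L(z) = ||x - f_hat(z)||^2.  Here f_hat(z) = W z (illustrative assumption),
# so dL/dz = -2 W^T (x - W z).
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((8, 3)))  # decoder: 3 slots -> 8-dim "image"
z_true = rng.standard_normal(3)
x = W @ z_true                                    # observed image

def invert(x, W, steps=2000, lr=0.01):
    z = np.zeros(W.shape[1])
    for _ in range(steps):
        z += lr * 2.0 * W.T @ (x - W @ z)         # gradient descent step
    return z

z_hat = invert(x, W)   # converges to z_true since W has orthonormal columns
```

A real version would replace `W @ z` with a learned nonlinear decoder and run the same loop with autograd; amortization then means training an encoder to predict `z_hat` in one forward pass instead.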
  - Amortized methods: train an encoder via generative replay so a single forward pass approximates the solution. [arXiv:2512.08854v1]
* DnD:
  - True adaptation: given dataset <math>D</math>, run gradient descent to find the LoRA <math>\Delta W(D)</math>.
  - Naïve method: per-task fine-tuning.
  - DnD: train a hyper-network on past (prompts, checkpoints) pairs so a single forward pass approximates <math>\Delta W(D)</math>. [arXiv:2506.16406v1]

So DnD is basically “amortized fine-tuning,” exactly like an amortized inference network is “amortized posterior estimation.” That’s structurally very close to the search-vs-replay discussion in Sec. 4.

===== 3. Generalization to unseen combinations =====
The kind of generalization is different, but the flavor is similar:
* Compositional model:
  - Generalizes from some combinations of latent slots (ID) to new combinations (OOD).
  - Because the generator class <math>F_{\text{int}}</math> encodes how slots interact across all of <math>Z</math>, not just on the data you’ve seen.
* DnD:
  - Generalizes from some datasets / tasks (the ones you have LoRAs and prompts for) to new datasets.
  - The hope is that prompts act as a “fingerprint” of dataset semantics, and that the hyper-network has learned a structured weight space where nearby prompts map to useful nearby LoRAs. [arXiv:2506.16406v1]

You can think of this (informally) as a kind of compositionality over tasks:
* Each dataset’s prompts contain mixtures of “concepts” (domain, difficulty, style, modality, etc.).
* DnD is trying to learn how those conceptual patterns in prompts map into specific regions of LoRA space.

The compositional theory paper gives you a math-y framework for this kind of story in input space; DnD does an analogous trick empirically in weight space.

==== Where the analogy breaks down ====
This is the part where I’ll be a bit critical, since you asked me to push on weaknesses.

===== 1. No identifiability story for DnD =====
In the compositional paper, they prove that:
* If <math>\hat f \in F_{\text{int}}</math> fits the data, then it essentially must be the true generator (up to slot-wise transforms).
* This is a strong structural statement about which decoders are allowed, via Jacobian/Hessian constraints. [arXiv:2512.08854v1]

DnD doesn’t have anything analogous:
* The hyper-convolutional decoder is chosen mostly for efficiency, not because it’s the unique class of functions that can map prompts to LoRAs “correctly.” [arXiv:2506.16406v1]
* For a given dataset there may be many LoRAs with similar performance, and for a given set of prompt batches many mappings from text embeddings to weights will fit the training checkpoints.

So the “generator over weights” in DnD is more a strong practical prior than a theoretically justified <math>F_{\text{int}}</math>-style class. If you were hoping to cleanly import the compositional identifiability result into DnD, that doesn’t work yet: nobody has characterized the space of good hyper-networks the same way.

===== 2. Task-level, not example-level, granularity =====
The compositional paper is about per-example latent variables <math>z(x)</math>, with slots that can be recombined across images. DnD:
* Condition = a batch of prompts from the dataset, not individual examples.
* It generates one LoRA per dataset/task, not a per-input latent code.

So the “slots” here are really “tasks across datasets”, not “factors inside one datapoint”. That’s a much coarser granularity. You could imagine a more compositional version of DnD, where the parameter generator has explicit factors like:
* a domain slot,
* a reasoning-style slot,
* a modality slot,
* a difficulty slot,
and combines them to synthesize adapters for totally new mixtures. The current DnD architecture doesn’t enforce or expose such a factorization; it just hopes the text encoder + hyper-conv decoder learn something reasonable.

===== 3. Memorization risk =====
Exactly like we discussed for your pseudo-label idea:
* Given finitely many training datasets and checkpoints, a high-capacity hyper-network can, in principle, memorize the mapping from condition embeddings to weights.
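The prompt-conditioned generation step itself is easy to sketch. Everything below is hypothetical stand-in machinery (a toy deterministic embedding instead of the frozen Sentence-BERT encoder, one random linear map instead of the hyper-convolutional decoder), but it shows the shape of the idea: a batch of prompts goes in, a low-rank adapter comes out, with no gradient steps:

```python
import numpy as np

# Toy prompt-conditioned parameter generation: condition embedding -> factors
# (A, B) of a low-rank update Delta_W = B @ A, in one forward pass.
# All names and shapes here are illustrative, not the paper's.
rng = np.random.default_rng(1)
d_emb, d_in, d_out, r = 16, 32, 32, 4     # embedding dim, layer dims, LoRA rank

def embed_prompts(prompt_ids):
    """Stand-in for a frozen text encoder: deterministic toy embeddings, mean-pooled."""
    table = np.sin(np.outer(prompt_ids, np.arange(1, d_emb + 1)))
    return table.mean(axis=0)             # pooled condition embedding c_i

# The "hyper-decoder": a single (untrained, random) linear map c -> flat (A, B).
H = rng.standard_normal((r * d_in + d_out * r, d_emb)) * 0.1

def generate_lora(prompt_ids):
    c = embed_prompts(prompt_ids)
    theta = H @ c
    A = theta[: r * d_in].reshape(r, d_in)
    B = theta[r * d_in :].reshape(d_out, r)
    return B @ A                          # Delta_W, rank <= r by construction

dW = generate_lora(np.array([3, 17, 42]))   # "a handful of prompts" from a new task
```

Training would fit `H` (in the paper, a much richer decoder) by MSE between generated and checkpointed LoRA weights across many (prompt batch, checkpoint) pairs; the memorization worry above is exactly that nothing in this setup constrains `generate_lora` off the training conditions.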
* Even with an MSE loss on tokenized weights, you can approximate the mapping arbitrarily well on those training pairs while behaving unpredictably on new datasets.

The paper shows empirically that DnD does generalize to new benchmarks (e.g., common sense → science, text → multimodal) and often beats the average of the training LoRAs. [arXiv:2506.16406v1] But unlike the generator-identifiability result, nothing rules out “memorize and interpolate poorly” solutions theoretically.

==== Research directions this suggests ====
This is where it gets interesting for research ideas:

# Compositional hyper-networks. You could try to design a structurally constrained hyper-network class over LoRA space that plays the same role as <math>F_{\text{int}}</math>:
#* Decompose LoRA updates into a sum of slot-specific contributions <math>\Delta W = \sum_k h_k(s_k)</math>, where the <math>s_k</math> are “task factors” extracted from prompts.
#* Constrain interactions between these factors in a polynomial / low-order fashion (like the image generator).
#* Aim for formal results like: if the hyper-network fits many tasks, it must be close to the “true” factorized mapping.
# Two-level generative story. Combine them:
#* Bottom level: a compositional generator over data (like the first paper), giving strong OOD reasoning inside a task.
#* Top level: a parameter generator (DnD-style) that quickly moves that generator into a task-specific region of parameter space given prompts from a new domain.
#* Then you’re using generative structure both in input space and in weight space.
# Replay in ''weight'' space vs replay in ''data'' space. The compositional paper uses generative replay of images to train a better encoder. DnD can be read as a form of “replay in parameter space”: you train on many past LoRAs and learn a distribution over them.
There might be a neat meta-learning angle where you use a structured data generator to create synthetic tasks, fine-tune them into synthetic LoRAs, and then train a DnD-style hyper-network on that synthetic parameter corpus.

==== Bottom line ====
So, to answer your question directly:
* Yes, there is a conceptual connection: both papers say “don’t just directly train new things for each case; instead, learn a structured generator (over data or weights) and then adapt via inversion or generation.”
* The compositional paper is more theoretical, giving a clean story about why structure on the generator side leads to compositional OOD generalization and why encoder-only approaches are under-constrained. [arXiv:2512.08854v1]
* DnD is more engineering / meta-learning, showing empirically that a generator over LoRA parameters conditioned on prompts can replace gradient-based adapter training in many scenarios. [arXiv:2506.16406v1]

But DnD doesn’t yet have the kind of identifiability / “right inverse” guarantees the first paper has. If you’re thinking about bridging the two, I’d focus on:
* designing structured parameter generators with explicit factorization, and
* asking whether you can prove “good OOD behavior over tasks” in the same spirit as their compositional generalization over slot combinations.

If you want, we can try to sketch a concrete toy setting where:
* tasks are defined by combinations of a few discrete “factors”, and
* a DnD-style hyper-network with an <math>F_{\text{int}}</math>-like constraint over LoRA space could provably generalize to unseen task combinations.
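As a seed for that toy setting, here is a hypothetical factorized parameter generator in the <math>\Delta W = \sum_k h_k(s_k)</math> spirit: each task is a tuple of discrete factor values, each factor contributes its own additive piece of the adapter, and an unseen factor combination still gets a fully determined adapter. The per-factor maps are random here; in a real study they would be learned from the training tasks.

```python
import numpy as np

# Toy factorized adapter generator: tasks are tuples (s_1, ..., s_K) of
# discrete factor values, and Delta_W = sum_k h_k(s_k).  All names and
# shapes are hypothetical.
rng = np.random.default_rng(2)
K, n_values, d_out, d_in = 3, 4, 8, 8   # 3 factors, 4 values each, 8x8 update

# h[k, v] is factor k's additive weight contribution when it takes value v.
h = rng.standard_normal((K, n_values, d_out, d_in))

def generate_adapter(task):
    """Additive composition across factors: Delta_W = sum_k h_k(s_k)."""
    return sum(h[k, s] for k, s in enumerate(task))

# Training tasks might cover every factor *value* without covering every
# *combination*; the generator still defines an adapter for a new combination.
seen = [(0, 1, 2), (1, 2, 3), (2, 3, 0), (3, 0, 1)]
train_adapters = {t: generate_adapter(t) for t in seen}

dW_unseen = generate_adapter((0, 0, 0))   # combination not in `seen`
```

The identifiability question would then be: if learned <math>h_k</math> reproduce the adapters of enough seen combinations, are they forced (up to symmetries) to the true per-factor contributions, so that `dW_unseen` is provably correct? That mirrors the slot-combination result in the compositional paper.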