=== Assistant: Short version ===

Yes, in Sec. 4 they assume you already have a trained decoder <math>\hat f</math> (the generator) and usually also an encoder <math>\hat g</math> trained on in-domain (ID) data. Section 4 is about how to actually invert <math>\hat f</math> efficiently, especially out-of-domain (OOD), using:

# Gradient-based search (Sec. 4.1) – slow but flexible, done per image.
# Generative replay (Sec. 4.2) – offline training on synthetic OOD images, to get a faster encoder.

Let me walk through the logic step by step and connect it to the rest of the paper.

==== From Sec. 2: the setup ====

* There is a ground-truth generator <math>f: Z \to X,\; x = f(z)</math>, where:
** <math>Z = \mathbb{R}^{d_z}</math> is the latent space, decomposed into slots <math>z = (z_1, \dots, z_K)</math> with <math>z_k \in \mathbb{R}^m</math>, e.g. "animal 1 slot", "animal 2 slot", "background slot";
** <math>X \subset \mathbb{R}^{d_x}</math> is image space (the data manifold).
* A representation <math>\hat z = \varphi(x)</math> is "good perception" if it inverts <math>f</math> up to slot-wise reparametrization plus permutation:
*: <math>\forall z \in Z_S:\quad \varphi(f(z)) = h_\pi(z) \qquad \text{(Eq. 2.1)}</math>
*: where <math>h_\pi</math> just re-labels and reparametrizes slots.

===== They define two regimes =====

* Generative approach:
** Learn a decoder <math>\hat f : Z \to \mathbb{R}^{d_x}</math>.
** Use its inverse to represent images: <math>\varphi(x) = \hat f^{-1}(x)</math>.
** For this to match the truth, <math>\hat f</math> must identify <math>f</math>: <math>\hat f(h_\pi(z)) = f(z)</math> (Eq. 2.2).
* Non-generative approach:
** Learn an encoder directly: <math>\varphi(x) = \hat g(x)</math>.
** For this to match the truth, <math>\hat g</math> must approximate the inverse: <math>\hat g(x) = h_\pi(g(x))</math> with <math>g := f^{-1}</math> (Eq. 2.3).

The difference is not "does the architecture have an encoder or decoder", but:

* Are you constraining a decoder <math>\hat f</math> and then inverting it (generative)?
* Or are you only constraining an encoder <math>\hat g</math> (non-generative)?

===== ID vs. OOD regions =====

* ID region: <math>X_{\text{ID}} = f(Z_{\text{ID}})</math> for some subset <math>Z_{\text{ID}} \subset Z</math> of concept combinations.
* OOD region: <math>Z_{\text{OOD}}</math> is all other combinations of slot values; <math>X_{\text{OOD}} = f(Z_{\text{OOD}})</math>.

The goal: if Eq. (2.1) holds on <math>Z_{\text{ID}}</math>, does it generalize to all combinations in <math>Z_{\text{OOD}}</math>? That's compositional generalization. (A toy code sketch of such a split appears below, just before the Sec. 4 discussion.)

===== From Sec. 3: the asymmetry =====

* They define a generator class <math>F_{\text{int}}</math> where concepts interact in a structured way (polynomial interactions across slots).
* The inverse class <math>G_{\text{int}} = \{ f^{-1} \mid f \in F_{\text{int}} \}</math> has strong constraints on the Jacobian and Hessian (Eq. 3.3).

Key asymmetry:

* It's feasible to constrain a decoder to lie in <math>F_{\text{int}}</math> (by architecture/regularization).
* It's basically infeasible to constrain an encoder to lie in <math>G_{\text{int}}</math> when images live on a low-dimensional manifold in a high-dimensional ambient space (<math>d_x \gg d_z</math>). The Hessian/Jacobian structure "disappears"; almost any encoder is compatible with some <math>f \in F_{\text{int}}</math>.

So: you can enforce the right inductive bias on <math>\hat f</math>, but not on <math>\hat g</math>. This is why they say: to guarantee compositional generalization, you need a generative approach.
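Before moving on to Sec. 4, here is a toy sketch of the ID/OOD split described above. This is my own illustration, not code from the paper: each slot takes one of a few concept values, <math>Z_{\text{ID}}</math> covers only some combinations, and <math>Z_{\text{OOD}}</math> is every other recombination of the same per-slot concepts. All names (<code>sample_latent</code>, <code>concept_means</code>) and numbers are arbitrary assumptions.

<syntaxhighlight lang="python">
import itertools
import numpy as np

K, m = 2, 4                       # two slots, each a vector in R^m
concepts_per_slot = [3, 3]        # e.g. 3 animals x 3 backgrounds

# All possible concept combinations across slots.
all_combos = list(itertools.product(*[range(n) for n in concepts_per_slot]))

# Z_ID: only a subset of combinations is seen during training.
# Z_OOD: every other combination of the *same* per-slot concepts.
id_combos = [(0, 0), (0, 1), (1, 0), (2, 2)]
ood_combos = [c for c in all_combos if c not in id_combos]

rng = np.random.default_rng(0)
concept_means = [rng.normal(size=(n, m)) for n in concepts_per_slot]

def sample_latent(combo):
    """Sample z = (z_1, ..., z_K) around the chosen concept for each slot."""
    return np.concatenate([concept_means[k][c] + 0.1 * rng.normal(size=m)
                           for k, c in enumerate(combo)])

z_id = np.stack([sample_latent(c) for c in id_combos])    # latents from Z_ID
z_ood = np.stack([sample_latent(c) for c in ood_combos])  # latents from Z_OOD
</syntaxhighlight>

Compositional generalization in this toy setting means: a representation learned only from images of <code>id_combos</code> should still decompose images of <code>ood_combos</code> correctly, because they reuse the same per-slot concepts.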
==== Given Sec. 3's conclusion, Sec. 4 now assumes ====

* You have already trained a decoder <math>\hat f</math> with the right inductive bias (so <math>\hat f \approx f</math>, <math>\hat f \in F_{\text{int}}</math>).
* Typically you also have an encoder <math>\hat g</math> trained on ID data by simple autoencoding.

And now you want to actually use <math>\hat f</math> generatively:

: given an image <math>x</math> (possibly OOD), recover its latent, i.e. compute <math>\varphi(x) = \hat f^{-1}(x)</math> (Eq. 4.1).

That's the inference problem for a generative model.

===== Why there is no closed-form inverse =====

They say: if the decoder had a nice analytic inverse <math>\hat f^{-1}</math>, this would be trivial. But:

* For images, data live on a manifold <math>X \subset \mathbb{R}^{d_x}</math>, not on all of <math>\mathbb{R}^{d_x}</math>.
* Making <math>\hat f</math> a global diffeomorphism from <math>Z</math> onto <math>\mathbb{R}^{d_x}</math> is unnatural and hard; it effectively forces <math>X = \mathbb{R}^{d_x}</math>, which is unrealistic.

So in practice we have a differentiable network <math>\hat f</math> whose inverse we can only get via optimization, not in closed form.

==== Inverting the decoder on ID images (Eq. 4.2) ====

For ID images <math>x \in X_{\text{ID}}</math>, you do have training data, so you can just train an encoder <math>\hat g</math> to invert <math>\hat f</math> on those:

: <math>\min_{\hat g} \; \mathbb{E}_{x \sim X_{\text{ID}}} \, \big\| x - \hat f(\hat g(x)) \big\|^2 \qquad \text{(4.2)}</math>

This is a standard autoencoder objective:

* <math>\hat g</math> encodes images to latents.
* <math>\hat f</math> decodes them back.
* On ID data, <math>\hat g(x)</math> approximates the <math>z^*</math> such that <math>x \approx \hat f(z^*)</math>.

So after training with (4.2), on ID you can just set <math>\hat z(x) \approx \hat g(x)</math>.

This is what you were noticing: Sec. 4 indeed assumes a trained <math>\hat f</math> and a trained <math>\hat g</math> on ID. But the whole point is: this is ''not enough'' for OOD, because <math>X_{\text{OOD}}</math> was never seen during this training.

==== Inverting the decoder on OOD images ====

For OOD images <math>x \in X_{\text{OOD}}</math>:

* During training on (4.2), you never saw these images, so:
** You cannot minimize (4.2) on <math>X_{\text{OOD}}</math> directly.
** Nothing in the encoder-only training forces <math>\hat g</math> to generalize correctly to unseen slot combinations (this is exactly the point of Sec. 3).

So the question of Sec. 4 is:

: how do you invert <math>\hat f</math> efficiently on images the encoder never saw?

They explore two strategies to solve Eq. (4.1) more generally:

# Gradient-based search (Sec. 4.1): optimize <math>z</math> per OOD image.
# Generative replay (Sec. 4.2): synthesize fake OOD images and train a new encoder that works OOD.

==== Sec. 4.1: gradient-based search ====

They rewrite Eq. (4.1) as an optimization problem:

: <math>z^* = \arg\min_z \; \big\| x - \hat f(z) \big\|^2 \qquad \text{(4.3)}</math>

So, for a given (possibly OOD) image <math>x</math>:

# Define the reconstruction loss <math>L(z) = \big\| x - \hat f(z) \big\|^2</math>.
# Start from some initial guess <math>z^{(0)}</math>.
# Do gradient descent: <math>z^{(t+1)} = z^{(t)} - \eta \, \nabla_z L(z^{(t)})</math>.

Because <math>\hat f</math> is differentiable, you can compute <math>\nabla_z L</math> via backprop.

The efficiency depends heavily on the initialization:

* If you start from a random <math>z^{(0)}</math>, you may need many steps or get stuck.
* Their trick: use the encoder as a "System 1" guess, <math>z^{(0)} = \hat g(x)</math>.

So the pipeline for search is:

: encode <math>x</math> with <math>\hat g</math> to get <math>z^{(0)}</math>, then refine it by gradient descent on (4.3).

Intuition:

* <math>\hat g</math> is trained only on <math>X_{\text{ID}}</math>, so on OOD it might be biased or slightly wrong.
* But it's often close to a good solution.
* Gradient-based search corrects <math>\hat g</math>'s mistakes by explicitly enforcing that the decoded image matches <math>x</math> under <math>\hat f</math>.

A minimal code sketch of this search loop follows below.
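Here is a minimal sketch of that search loop, assuming a trained decoder <code>f_hat</code> and encoder <code>g_hat</code> are available as PyTorch modules that map a latent to an image and an image to a latent, respectively. The function name, step count, and learning rate are illustrative assumptions, not values from the paper.

<syntaxhighlight lang="python">
import torch

def search(x, f_hat, g_hat, steps=300, lr=1e-2):
    """Invert f_hat on one (possibly OOD) image x by gradient descent on
    L(z) = ||x - f_hat(z)||^2, starting from the encoder's fast guess."""
    z = g_hat(x).detach().clone().requires_grad_(True)   # z^(0) = g_hat(x)
    for _ in range(steps):
        loss = ((x - f_hat(z)) ** 2).sum()                # L(z)
        grad, = torch.autograd.grad(loss, z)              # grad_z L via backprop through f_hat
        with torch.no_grad():
            z -= lr * grad                                # z^(t+1) = z^(t) - eta * grad_z L
    return z.detach()                                     # approximate z* from Eq. (4.3)
</syntaxhighlight>

At test time one would call <code>search(x, f_hat, g_hat)</code> per image; swapping the plain gradient step for Adam, or adding early stopping on the loss, are obvious variations.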
This refinement is what they show in Fig. 4 (left): the encoder gives an initial slot decomposition, and search refines it to better match <math>x</math>.

Pros:

* In principle, if <math>\hat f</math> truly identifies <math>f</math>, then solving (4.3) correctly recovers the "right" slots for any <math>x \in X</math>, including OOD.
* You don't need extra training data.

Cons:

* It's online and per-image expensive (hundreds of gradient steps in the experiments).
* For large models / many images, this is slow.

==== Sec. 4.2: generative replay ====

The second strategy fixes the runtime cost by doing the hard work offline.

Idea:

# Your decoder <math>\hat f</math> identifies the ground-truth generator <math>f</math> up to slot-wise transforms.
# That means OOD images are generated by re-combining slots from ID in new ways.

So they say: "If I can sample slot combinations <math>\hat z</math> from a distribution <math>p_{\hat z}</math> whose marginals are independent per slot, and decode them via <math>\hat f</math>, I can generate synthetic OOD images <math>\hat f(\hat z)</math>."

Concretely (see the code sketch at the end of this subsection):

* Choose <math>p_{\hat z}</math> such that <math>p_{\hat z}(z) = \prod_{k=1}^K p_k(z_k)</math>, where each <math>p_k</math> is estimated from the ID slot distributions, but the slots are independent of each other.
* Sample <math>\hat z \sim p_{\hat z}</math> and decode: <math>\hat x = \hat f(\hat z)</math>.
* Because slots are sampled independently, many of these <math>(\hat z, \hat x)</math> pairs correspond to unseen combinations, i.e. they lie in <math>X_{\text{OOD}}</math>.

Now you have unlimited synthetic OOD data with ground-truth latents. Then you can train a new encoder <math>\hat g</math> to invert <math>\hat f</math> on this synthetic data:

: <math>\min_{\hat g} \; \mathbb{E}_{\hat z \sim p_{\hat z}} \, \big\| \hat z - \hat g(\hat f(\hat z)) \big\|^2 \qquad \text{(4.4)}</math>

So we're training on pairs:

* input image: <math>\hat f(\hat z)</math> (possibly an OOD combination),
* target latent: <math>\hat z</math>.

This is just a self-supervised autoencoder on synthetic data, but crucially:

* The images explore combinatorial variations (thanks to independent slot sampling).
* So <math>\hat g</math> is trained to correctly invert <math>\hat f</math> across a much larger region of <math>X</math>, including OOD.

At test time:

: just run the retrained encoder in a single forward pass, <math>\hat z(x) = \hat g(x)</math>.

That's what Fig. 4 (right) illustrates: decode synthetic OOD images with <math>\hat f</math>, train <math>\hat g</math> on them, then deploy <math>\hat g</math> at test time.

Pros:

* After offline training, inference is fast (a single forward pass).
* The encoder has seen many OOD combinations via generated images, so it can generalize better.

Cons:

* It depends on how well <math>\hat f</math> approximates the true generative process.
* If <math>\hat f</math> can't generate realistic or diverse OOD combinations, replay is limited.
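And here is a minimal sketch of replay training under the same assumptions. Here <code>slot_samplers</code> is a hypothetical list of K callables, each drawing a batch of samples from the estimated ID marginal <math>p_k</math> of one slot, and <code>new_encoder</code> is the fresh encoder being trained; the optimizer, step count, and batch size are again placeholders, not the paper's settings.

<syntaxhighlight lang="python">
import torch

def replay_train(f_hat, slot_samplers, new_encoder,
                 n_steps=10_000, batch_size=64, lr=1e-4):
    """Train a fresh encoder to invert f_hat on synthetic images decoded
    from independently sampled slots (Eq. 4.4)."""
    opt = torch.optim.Adam(new_encoder.parameters(), lr=lr)
    for _ in range(n_steps):
        # Independent per-slot sampling: many combinations are unseen (OOD).
        z_hat = torch.cat([p_k(batch_size) for p_k in slot_samplers], dim=-1)
        with torch.no_grad():
            x_hat = f_hat(z_hat)                           # synthetic image f_hat(z_hat)
        loss = ((z_hat - new_encoder(x_hat)) ** 2).mean()  # ||z_hat - g_hat(f_hat(z_hat))||^2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_encoder
</syntaxhighlight>

At test time the retrained <code>new_encoder</code> replaces (or initializes) <code>g_hat</code>, and its output can optionally still be refined with the <code>search</code> sketch above.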
==== In Sec. 5.2 they say ====

* They take the VAEs trained in § 5.1 (with a Transformer decoder) on the PUG datasets. Those give:
** a decoder <math>\hat f</math>,
** an encoder <math>\hat g</math> trained with a VAE objective on ID data.

Then they:

# Apply generative replay (Eq. 4.4):
#* use <math>\hat f</math> to generate OOD images by recombining slots (like shuffling animal/background slots),
#* train a new encoder on these synthetic images.
# Apply search (Eq. 4.3) at test time:
#* use the encoder's output as the initialization,
#* refine it by gradient-based optimization.

They show that:

* Non-generative approaches (encoder-only, even with lots of data) struggle with OOD combinations.
* Generative approaches using search and replay on top of a good decoder significantly boost OOD performance.

==== Back to your question ====

You're exactly right that they're assuming some things are already trained:

* <math>\hat f</math> (the decoder/generator) is learned first, with the inductive biases that make it belong to <math>F_{\text{int}}</math>.
* <math>\hat g</math> is initially trained on ID via an autoencoding/VAE objective (Eq. 4.2).

Section 4 then does not focus on how to train <math>\hat f</math>; it assumes it already exists and is good, and studies how to invert <math>\hat f</math> on OOD images by:

* Search: per-image gradient optimization (Eq. 4.3) initialized with <math>\hat g(x)</math>.
* Replay: offline training of <math>\hat g</math> on synthetic OOD images generated by <math>\hat f</math> (Eq. 4.4).

So conceptually:

* Non-generative baseline: just use <math>\hat g(x)</math> trained on ID and hope it generalizes.
* Generative approach (Sec. 4):
** either refine <math>\hat g(x)</math> via optimization (search),
** or retrain <math>\hat g</math> on synthetic OOD data from <math>\hat f</math> (replay),
** or both.

If you'd like, next we can:

* draw pseudo-code for the search and replay procedures as if you were implementing them, or
* unpack the assumptions on <math>p_{\hat z}</math> and what "independent slot marginals" really mean in practice.