=== Assistant: No, <math>\gamma = 0</math> does not mean infinite work for fixed accuracy <math>\varepsilon > 0</math> ===

No: <math>\gamma = 0</math> does ''not'' mean the algorithm needs infinite work for any fixed accuracy <math>\varepsilon > 0</math>. What it means is that the particular “local work” bound you are quoting was proved only under the strict margin assumption <math>\gamma > 0</math>; if you plug in <math>\gamma = 0</math>, you are applying the theorem outside its validity range. The “<math>+\infty</math>” you get is the bound becoming vacuous, not the algorithm becoming non-terminating. Here is exactly what is going on.

==== In our derivation, the key counting step was: ====

* if an inactive coordinate <math>i \in I^\star</math> becomes nonzero at an iterate,
* then the prox input must cross the threshold by at least <math>t\gamma</math>,
* which leads to a bound like

<math display="block">|A_k| \;\le\; \frac{\|u(y_k) - u(x^\star)\|_2^2}{t^2\gamma^2} \;\le\; \left(\frac{1 + tL}{t\gamma}\right)^2 \|y_k - x^\star\|_2^2.</math>

That inequality divides by <math>\gamma^2</math>. If <math>\gamma = 0</math>, the statement “each spurious active coordinate forces a jump of at least <math>t\gamma</math>” becomes “forces a jump of at least 0,” which gives no information. The proof technique therefore cannot upper-bound the number of spurious activations, and the resulting constant <math>1/\gamma^2</math> blows up. So what fails at <math>\gamma = 0</math> is finite identification, i.e. stability of the optimal inactive set, not convergence of APG itself.

==== <math>\gamma = 0</math> means there exists an inactive coordinate <math>i \in I^\star</math> for which the KKT inequality is tight ====

<math display="block">|\nabla f(x^\star)_i| = \lambda_i.</math>

Equivalently, at the optimum the prox input lies exactly on the soft-threshold boundary:

<math display="block">|u(x^\star)_i| = t\lambda_i, \qquad u(x) := x - t\nabla f(x).</math>

When you are exactly on that boundary, arbitrarily small perturbations of <math>y</math> around <math>x^\star</math> can flip that coordinate between
* returning 0 (inactive), and
* returning a tiny nonzero value (active),
because soft-thresholding switches exactly at equality. Accelerated methods generate extrapolated points <math>y_k</math> that can oscillate around <math>x^\star</math>, so there is no reason, in the worst case, that such “tight” coordinates must become permanently zero after a finite time (the short numerical sketch below illustrates the flip). In other words:

* <math>\gamma > 0</math> ⇒ there is a buffer zone in which inactive coordinates are guaranteed to stay inactive, hence finite identification, hence the spurious-work term is a constant (independent of <math>\varepsilon</math>).
* <math>\gamma = 0</math> ⇒ there is no buffer zone, so spurious coordinates can keep flickering for as long as you run; the spurious work is no longer a constant independent of <math>\varepsilon</math>.

That is exactly why the margin matters.

==== For any fixed <math>\varepsilon > 0</math>: no. ====

Accelerated proximal gradient still has the iteration complexity

<math display="block">N_\varepsilon = O\!\left(\sqrt{1/\alpha}\,\log(1/\varepsilon)\right)</math>

for strongly convex composite objectives, regardless of <math>\gamma</math>. This is the standard statement you quoted in the open-problem note [open-problem-fountoulakis22a]. What you lose at <math>\gamma = 0</math> is the guarantee that each iteration can be implemented with local cost <math>\tilde{O}(1/\rho)</math> (or that the cumulative support outside <math>S^\star</math> is bounded by a constant). In the worst case you may have to fall back to the pessimistic bound you already mention in your note,

<math display="block">\text{Work}(N_\varepsilon) \;\le\; O\!\left(n\sqrt{\frac{1}{\alpha}}\,\log\frac{1}{\varepsilon}\right),</math>

because without a margin you cannot prove that the iterate supports stay small [open-problem-fountoulakis22a]. So: finite work, but potentially non-local.
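To make the flip concrete, here is a minimal Python sketch of soft-thresholding at and near the switching point <math>t\lambda_i</math>; the step size and weight are invented for illustration, and the code is a sketch of the mechanism rather than anything from the note. With <math>\gamma = 0</math>, arbitrarily small perturbations of the prox input toggle the coordinate between active and inactive; with <math>\gamma > 0</math>, every perturbation smaller than <math>t\gamma</math> is absorbed.

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, tau):
    # Prox of tau * |.|: returns exactly 0 on [-tau, tau], shrinks otherwise.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

t, lam = 0.1, 1.0        # step size and l1 weight (hypothetical values)
boundary = t * lam       # the soft-threshold switching point t * lambda_i

# gamma = 0: the optimal prox input sits exactly at the boundary, so
# arbitrarily small perturbations flip the coordinate active/inactive.
for eps in (1e-2, 1e-6, 1e-12):
    inactive = soft_threshold(boundary - eps, boundary)  # stays 0.0
    active = soft_threshold(boundary + eps, boundary)    # tiny nonzero
    print(f"eps={eps:.0e}: {inactive} vs {active}")

# gamma > 0: the optimal prox input is t*gamma inside the boundary, so any
# perturbation smaller than t*gamma cannot spuriously activate the coordinate.
gamma = 0.2
print(soft_threshold(boundary - t * gamma + 0.5 * t * gamma, boundary))  # 0.0
</syntaxhighlight>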
==== Two different “limits” get conflated: ====

===== Case 1: you run <math>N_\varepsilon < \infty</math> iterations. =====

* Work is at most <math>nN_\varepsilon < \infty</math>.
* Nothing is infinite.

===== Case 2: you let <math>\varepsilon \to 0</math>, so <math>N_\varepsilon \to \infty</math> for any method. =====

* If <math>\gamma > 0</math>, we proved that the spurious-support contribution stays bounded even as <math>\varepsilon \to 0</math>, because identification happens in finite time.
* If <math>\gamma = 0</math>, there may be no finite identification time, so spurious activations can persist for arbitrarily many iterations, and the spurious contribution can grow with <math>N_\varepsilon</math>. In that sense, a bound that tries to be uniform (independent of <math>\varepsilon</math>) will necessarily break.

That is exactly what the <math>\tfrac{1}{\gamma^2\sqrt{\alpha}}</math> term is signaling.

==== Not a missing step, but a missing assumption: the bound you quoted is a conditional theorem. ====

* It is correct under the assumption <math>\gamma > 0</math>.
* At <math>\gamma = 0</math>, that assumption is false, and the bound is not meant to apply.

More strongly: the blow-up is not an artifact. It reflects a real phenomenon: at <math>\gamma = 0</math> you generally cannot guarantee finite-time identification or an <math>\varepsilon</math>-independent bound on the cumulative spurious support for inertial/accelerated proximal schemes.

If you want, I can also show (still without LaTeX) how to state an alternative runtime bound that remains valid even when <math>\gamma = 0</math>: it will necessarily depend on <math>n</math> (or on how many “tight” inactive coordinates there are), matching the pessimistic <math>O(n\sqrt{1/\alpha}\,\log(1/\varepsilon))</math> running time in your open-problem statement [open-problem-fountoulakis22a].
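As a complement, here is a minimal Python sketch of the work accounting; the FISTA-style loop, the random least-squares data, and the use of the final iterate’s support as a stand-in for <math>S^\star</math> are all illustrative assumptions, not details from the note. It counts cumulative activations outside the proxy optimal support, which is the quantity that stays bounded under a margin <math>\gamma > 0</math> and that the pessimistic <math>O(n\sqrt{1/\alpha}\,\log(1/\varepsilon))</math> bound gives up on controlling.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Illustrative strongly convex composite problem:
#   min_x 0.5 * ||A x - b||^2 + lam * ||x||_1
n, m = 50, 200
A = rng.standard_normal((m, n)) / np.sqrt(m)
b = rng.standard_normal(m)
lam = 0.1
L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient of f
t = 1.0 / L                     # step size

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# FISTA-style accelerated proximal gradient, recording iterate supports.
x = np.zeros(n)
y = x.copy()
theta = 1.0
supports = []
for k in range(500):
    grad = A.T @ (A @ y - b)
    x_new = soft_threshold(y - t * grad, t * lam)
    theta_new = (1.0 + np.sqrt(1.0 + 4.0 * theta**2)) / 2.0
    y = x_new + ((theta - 1.0) / theta_new) * (x_new - x)
    x, theta = x_new, theta_new
    supports.append(np.flatnonzero(x))

# Proxy for the optimal support S* (assumption: the last iterate has identified it).
S_star = supports[-1]
# Cumulative spurious work: activations outside the (proxy) optimal support.
spurious = sum(np.setdiff1d(S, S_star).size for S in supports)
print("final support size:", S_star.size,
      "| cumulative spurious activations:", spurious)
</syntaxhighlight>

On a well-conditioned instance with a margin, the spurious count saturates after finitely many iterations; a tight coordinate (<math>\gamma = 0</math>) is what would let it keep growing with the iteration count, which is the failure mode described above.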