Introduction¶
The code and training logs for this submission are available in the Parameter Golf PR #2145.
This is a follow-up to the original MHALM blog post. If you haven’t read that one, the short version is: MHALM is an experimental language model architecture that replaces the transformer’s attention + MLP blocks with IGL’s kernel-based readout on learned Stäckel coordinates.
The bet is simple: token embeddings live on a low-dimensional manifold (Fefferman et al., 2016), and standard transformers (Vaswani et al., 2017) ignore this structure — they compute dense \(O(T^2)\) attention over the full head dimension \(d_h\). If you can find nonlinear coordinates where the attention kernel separates into a product of 1D kernels, three things follow:
- Parameter efficiency — the readout stores 1D basis functions instead of \(R \times V\) weight matrices, which saves a large fraction of parameters at large vocabularies. In the competition I operated at \(V = 1024\), where this advantage is negligible, but at \(V = 50\text{K}\)+ it becomes significant.
- Linear-time spatial readout — the kernel heads evaluate at \(O(T \cdot R)\) instead of \(O(T^2)\), because the factorised kernel enables per-axis precomputation. V2’s temporal path still uses standard \(O(T^2)\) attention alongside the SSM, but the spatial path is already linear in \(T\).
- Geometric inductive bias — the Stäckel constraint and Stiefel enforcement give the model a structured prior on the manifold. V2 demonstrates this concretely: Weyl spectral init and z-space processing both produced measurable gains.
In March I entered MHALM V1 into the Parameter Golf competition and got 1.46 bpb — far from the 1.23 bpb of the heavily optimised transformer baseline. Since then I’ve run 45+ ablations, combining architectural changes (processing in Stäckel coordinates, Weyl spectral initialisation) with engineering fixes (Stiefel enforcement, cuFFT SSM, Muon routing) to reach 1.35 bpb — closing 47% of the gap to the transformer. Perhaps the most telling result: when given the freedom to balance the spatial (geometric) and temporal (SSM + attention) paths, the model consistently upweights the geometric one — the token manifold’s structure carries genuine predictive signal.
The remaining gap is likely a mix of hardware disadvantage and format mismatch. MHALM’s kernel operations (exp, cos, sqrt) run on CUDA ALUs at ~60 TFLOPS while the transformer’s matmuls hit H100 Tensor Cores at ~989 TFLOPS — a 15× hardware penalty that translates directly into fewer tokens seen per wallclock. The 10-minute speedrun format of Parameter Golf amplifies this: it measures which architecture learns faster per H100-second, not which learns better. At iso-tokens the gap narrows (+0.27 → +0.11 val_loss), which is consistent with higher data efficiency per token — but at this stage it is not possible to extrapolate whether the gap would close at convergence. A proper comparison would require iso-token evaluation on converged models, or at minimum a setting where MHALM’s advantages — larger vocabularies where the VP trick saves parameters, longer sequences where separable kernels scale — actually come into play.
Regardless of these future considerations, here’s what I learned.
From V1 to V2¶
V1 had the right ingredients — three chart encoders, multi-kernel readout, SSM + attention — but several implementation choices limited it. The temporal processing (SSM and attention) operated on the 1024-dim vocabulary logits rather than the encoded Stäckel coordinates, meaning the attention’s QK dot product had no geometric meaning. The Stiefel enforcement used torch.linalg.matrix_norm(ord=2), which silently triggered cuSOLVER host-device synchronisation and crushed throughput. The SSM used a Python-level parallel scan instead of cuFFT. And the SSM eigenvalues were initialised randomly instead of from the Weyl-law spectrum of the Stäckel metric.
V2 fixes all of these: temporal processing moves to the encoded \(z\)-space, Stiefel enforcement uses power-iteration (no host-device sync), the SSM uses cuFFT causal convolution, and eigenvalues are initialised from Weyl’s law. The result is the same architecture with better-grounded design choices — and −0.107 bpb to show for it.
| Version | val_bpb | Hardware | Tokens seen | Artifact |
|---|---|---|---|---|
| V1 (March) | 1.4574 | 8×H100 | 3.59B | 11.0 MB |
| V2 (April, best) | 1.3477 | 8×H100 | 3.19B | 13.0 MB |
| Transformer baseline | 1.2268 | 8×H100 | 7.19B | — |
−0.107 bpb from V1 to V2. V2 closed about 47% of the V1→transformer gap.
The V2 architecture¶
Here’s what V2 looks like in detail.
The starting point is the IGL (Inverse Green’s Learning) framework. The attention kernel \(K(q_i, k_j) = \exp(q_i^\top k_j / \sqrt{d_h})\) defines an integral operator — a Green’s function acting on the token sequence. If one can find coordinates where this operator separates — factorises into a product of independent 1D operators — then the full kernel decomposes into a product of 1D kernels, each computable independently. This is the Stäckel separability condition (Stäckel, 1891; Eisenhart, 1934): in the right coordinate system, the metric becomes diagonal, the PDE separates, and the Green’s function factors.
The problem is that Stäckel coordinates are not the standard embedding coordinates. They are a nonlinear coordinate system on the token manifold where the metric induced by the attention kernel is (approximately) diagonal. Finding them is the encoder’s job: learn a nonlinear map \(\Psi: \mathbb{R}^{d_\text{emb}} \to \mathbb{R}^d\) such that the pullback metric \(\Psi^* g\) is as close to diagonal as possible. Once you have such coordinates, the kernel heads can evaluate separable kernel functions (Gabor, Laplacian, Nyström) cheaply — each head computes spatial similarity in the space where the geometry is factored.
A caveat: symmetry vs causality. A diagonal metric implies a symmetric kernel: \(K(\Psi(x_i), \Psi(x_j)) = K(\Psi(x_j), \Psi(x_i))\). But causal self-attention is inherently non-symmetric — token \(i\) attends to token \(j < i\) but not the reverse. In standard transformers this asymmetry comes from the causal mask and from the separate \(W_Q\), \(W_K\) projections (\(q_i \neq k_i\) in general). MHALM’s current design uses a single encoder per kernel head, enforcing symmetry in the kernel evaluation, and relies on the causal mask and the temporal path (SSM + attention) to break the symmetry. I tested an alternative — the dual encoder approach — where a separate encoder \(\Psi_K\) produces keys while \(\Psi_Q\) produces queries, so \(K(\Psi_Q(x_i), \Psi_K(x_j)) \neq K(\Psi_K(x_i), \Psi_Q(x_j))\). This is the cleanest way to restore asymmetry in the kernel itself, at the cost of +920K parameters and a 26% per-step speed penalty. The dual encoder showed better per-step quality (−0.028 bpb at matched steps), but the speed penalty meant fewer total training steps, and the net result was +0.030 bpb under the competition format. Whether the asymmetric formulation works with a longer training budget remains an open question.
The VP trick: parameter savings at scale. In a standard transformer, the output projection stores an \(H \times d_h \times V\) weight tensor — every head needs a full \(d_h \times V\) matrix to map from head space to vocabulary. In MHALM, the spatial anchors are tied to the vocabulary embeddings rather than requiring separate readout matrices: the kernel readout uses \(R\) anchor points with 1D basis functions per axis. The readout weight is \(R \times V\), but because the kernel factorises, \(R\) can be kept small (256 in V2) regardless of vocabulary size. At \(V = 1024\) (the Parameter Golf setting), this saves almost nothing. But at \(V = 50\text{K}\) (GPT-2) or \(V = 128\text{K}\) (LLaMA-3), the VP trick could save 30–40% of total model weights.
Complexity reduction via Stäckel coordinates. Standard softmax attention computes all \(T^2\) pairwise scores in \(d_h\) dimensions: cost \(O(T^2 \cdot d_h)\) per head. When the kernel separates in Stäckel coordinates of intrinsic dimension \(d \ll d_h\), each 1D kernel can be approximated by a degree-\(D\) polynomial feature map, giving \(R = \binom{d+D}{D}\) features. The attention sum then factorises into a precompute step (\(O(R \cdot T \cdot d_h)\), once per sequence) and a per-query evaluation (\(O(R \cdot d_h)\) per token) — total \(O(R \cdot T \cdot d_h)\), linear in sequence length. At \(d = 50\), \(D = 1\): \(R = 51\), the framework predicts a speedup of 1 to 2 orders of magnitude over standard attention (this is a theoretical bound for the polynomial-feature formulation, not yet realised in V2). The key insight is that the speedup comes from operating at the intrinsic dimension \(d\) of the data manifold, not the ambient head dimension \(d_h\).
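For concreteness, here is a minimal sketch of that factorisation — not V2 code (V2's spatial path uses anchor-based kernel heads, and this linear-time formulation is, as noted, not yet realised) — using a degree-1 polynomial feature map so that the causal kernel sum reduces to a prefix sum. All function names and shapes are illustrative:

```python
import torch

def poly_features(z: torch.Tensor) -> torch.Tensor:
    """Degree-1 polynomial feature map phi(z) = [1, z]: R = d + 1 features."""
    return torch.cat([torch.ones_like(z[..., :1]), z], dim=-1)    # (B, T, R)

def separable_causal_readout(q_z, k_z, v):
    """When K(q, k) ~= phi(q) . phi(k), the causal sum y_i = sum_{j<=i} K(q_i, k_j) v_j
    factorises into a prefix sum over phi(k_j) v_j^T -- O(T * R * d_v), linear in T."""
    phi_q, phi_k = poly_features(q_z), poly_features(k_z)          # (B, T, R)
    # Precompute: running sum of outer products phi(k_j) v_j^T along the sequence.
    state = torch.cumsum(torch.einsum("btr,btv->btrv", phi_k, v), dim=1)
    # Per-query evaluation: contract phi(q_i) against the prefix state.
    return torch.einsum("btr,btrv->btv", phi_q, state)             # (B, T, d_v)

# d = 50 intrinsic dims, degree D = 1 -> R = 51 features, as in the example above.
q_z, k_z, v = torch.randn(2, 128, 50), torch.randn(2, 128, 50), torch.randn(2, 128, 64)
y = separable_causal_readout(q_z, k_z, v)                          # (2, 128, 64)
```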
MHALM instantiates this with three learned coordinate encoders (an “atlas” of the token manifold) followed by kernel-based spatial readout in the encoded space and temporal processing (SSM + attention) that also operates in the encoded coordinates, where the QK dot product is geometrically meaningful.
Data flow¶
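Roughly, the data flow through one block looks like this. The snippet below is a toy sketch with stub modules — plain linear layers and Gaussian RBFs standing in for the real chart encoders, gated SSM+attention passes, and kernel heads — and all names and shapes are illustrative, not the repo's:

```python
import torch
import torch.nn as nn

class HybridAtlasBlockSketch(nn.Module):
    """Toy sketch of the per-block data flow; stubs stand in for the real components."""

    def __init__(self, d_emb=512, d_chart=160, n_anchors=256, vocab=1024):
        super().__init__()
        # Stage 1: three chart encoders (the atlas), 512 -> 160 each
        self.charts = nn.ModuleList([nn.Linear(d_emb, d_chart) for _ in range(3)])
        d_cat = 3 * d_chart                                      # 480
        # Stage 2: two temporal passes in z-space (stubs for gated SSM + attention)
        self.pass1 = nn.Linear(d_cat, d_cat)
        self.pass2 = nn.Linear(d_cat, d_cat)
        self.back = nn.Linear(d_cat, d_cat)                      # residual injection
        self.temporal_head = nn.Linear(d_cat, vocab)
        # Stage 3: kernel heads against learned anchors, stacked readout
        self.anchors = nn.ParameterList(
            [nn.Parameter(torch.randn(n_anchors, d_chart)) for _ in range(3)])
        self.readout = nn.Linear(4 * n_anchors + d_chart, vocab, bias=False)
        self.geom_scale = nn.Parameter(torch.zeros(()))          # gamma, opens during training

    def forward(self, x):                                        # x: (B, T, 512)
        z = [torch.tanh(c(x)) for c in self.charts]              # Stage 1: (B, T, 160) x 3
        z_cat = torch.cat(z, dim=-1)                             # (B, T, 480)
        h1 = self.pass1(z_cat)                                   # Stage 2, pass 1
        h2 = self.pass2(z_cat + self.back(h1))                   # pass 2 on residual
        temporal_logits = self.temporal_head(h2)
        # Stage 3: Gaussian RBF stubs for the Nystrom / Gabor / Laplacian heads
        feats = [torch.exp(-((zi.unsqueeze(-2) - a) ** 2).sum(-1))
                 for zi, a in zip(z, self.anchors)]              # (B, T, 256) each
        phi = torch.cat(feats + [feats[1] * feats[2], z[0]], dim=-1)  # + Tucker GL + linear
        spatial_logits = self.readout(phi)
        return temporal_logits + self.geom_scale * spatial_logits
```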
Each HybridAtlasBlock has three stages:
Stage 1 — Chart encoders (512 → 160 × 3)¶
Three independent ChartEncoders map the 512-dim token embedding into 160-dim Stäckel coordinates. Each encoder is a 4-hidden-layer MLP (SiLU activations, pre-norm residuals) followed by a tanh output with learnable per-dimension temperature. The output weights are Stiefel-enforced (Absil et al., 2008) — projected back to the orthogonal manifold after every optimizer step via Newton-Schulz iteration (Björck & Bowie, 1971) with power-iteration spectral norm (−0.048 bpb from enforcement alone; −0.065 bpb once the implementation bug was fixed — the single largest improvement in V2). This is distinct from Muon (Jordan, 2024), which applies an orthogonalising Newton step inside the optimizer (on the gradient momentum): Stiefel enforcement is a hard constraint on the weights themselves, guaranteeing orthogonal encoder output dimensions regardless of the optimizer used.
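A minimal sketch of what that enforcement step might look like — a hard projection applied to each constrained weight after every optimizer step, with the spectral norm estimated by power iteration instead of `torch.linalg.matrix_norm`. Function names and iteration counts are mine, not the repo's:

```python
import torch

@torch.no_grad()
def spectral_norm_power_iter(W: torch.Tensor, iters: int = 10) -> torch.Tensor:
    """Largest singular value via power iteration -- no cuSOLVER, no host-device sync."""
    v = torch.randn(W.shape[1], device=W.device, dtype=W.dtype)
    v = v / v.norm()
    for _ in range(iters):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.T @ u
        v = v / (v.norm() + 1e-12)
    return u @ W @ v                                   # sigma_max

@torch.no_grad()
def stiefel_project(W: torch.Tensor, ns_iters: int = 5) -> torch.Tensor:
    """Newton-Schulz iteration pushing W toward the nearest (semi-)orthogonal matrix."""
    X = W / (spectral_norm_power_iter(W) + 1e-12)      # scale so the iteration converges
    for _ in range(ns_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Applied after every optimizer step to each Stiefel-constrained weight:
# for p in stiefel_params: p.copy_(stiefel_project(p))
```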
Why three encoders? A single coordinate chart cannot cover a manifold with non-trivial topology — this is a theorem in differential geometry (no single chart covers the sphere). But there’s a more concrete reason: each encoder feeds a dedicated kernel head — Ψ₀ feeds the Nyström (spherical) head, Ψ₁ feeds the Gabor (oscillatory) head, Ψ₂ feeds the Laplacian (proximity) head. This lets each encoder specialise its learned coordinates for the geometric property its kernel measures. The temporal path uses the concatenation \(z_\text{cat} = [z_0, z_1, z_2]\), giving it access to all three perspectives simultaneously.
Why Stiefel enforcement? Without it, the encoder output dimensions tend to collapse — multiple axes learn the same direction, wasting capacity. The orthogonality constraint ensures the 160 output dimensions span distinct directions, maximising the information per coordinate.
Why tanh with learnable temperature? The tanh bounds the coordinates to \([-1, 1]\), preventing unbounded drift during training, while the per-dimension temperature \(\tau\) controls how much of the \([-1, 1]\) range each axis uses — a form of learned dynamic range.
Why 4 hidden layers? Deeper encoders (n=6) give −0.076 bpb but cost training steps due to higher per-step time. With n=4 and the Stäckel penalty active, the quality is recovered at 3M fewer parameters and 11% more training steps — a better trade-off under the 10-minute wallclock budget.
The three encoders form an atlas: three overlapping coordinate charts on the token manifold. Their outputs are concatenated into \(z_\text{cat} \in \mathbb{R}^{480}\).
Stage 2 — Temporal processing in Stäckel coordinates (2-pass refinement)¶
This is the key architectural insight of V2: temporal processing happens in the encoded coordinate space \(z_\text{cat}\), not in the vocabulary-sized output space. Since \(z\) lives where the metric is approximately diagonal, attention’s QK dot product is geometrically meaningful there.
Why temporal processing at all? The kernel heads provide spatial readout — similarity between the current token and learned anchors — but no mechanism for propagating information across time. A language model needs to condition on context: “the” means different things after different histories. The SSM and attention branches provide complementary temporal capabilities.
Two parallel branches, applied twice (2-pass refinement):
- FFT SSM — a complex-diagonal LTI state-space model in the S4D (Gu et al., 2022) family, implemented as causal convolution via cuFFT. Eigenvalues initialised from Weyl’s law rather than the standard log-spaced default. Cost: \(O(T \log T)\).
- Causal self-attention — 8-head attention with RoPE (Su et al., 2021) and XSA (exclusive self-attention: project out the self-alignment component; −0.002 bpb at zero parameter cost — the only competition technique that helped MHALM). 2 layers per pass.
Why SSM? It provides cheap long-range context propagation — every token receives a summary of the full history at \(O(T \log T)\) cost. The Weyl-law initialisation sets the SSM’s frequency spectrum to match the eigenvalue growth predicted by the Stäckel metric, giving it a physics-informed starting point.
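As a sketch, the FFT SSM path can be written as a per-channel impulse response built from the complex-diagonal eigenvalues and applied as a causal convolution via `torch.fft`. The parameterisation below (one state per channel, illustrative shapes) is an assumption, not the actual V2 kernel:

```python
import torch

def fft_causal_conv(u: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Causal convolution y[t] = sum_{s<=t} kernel[s] * u[t-s] via FFT, O(T log T).
    u: (B, T, D) input; kernel: (T, D) per-channel impulse response.
    Zero-padding to 2T avoids circular wrap-around, keeping the convolution causal."""
    T = u.shape[1]
    n = 2 * T
    U = torch.fft.rfft(u, n=n, dim=1)
    K = torch.fft.rfft(kernel, n=n, dim=0)
    return torch.fft.irfft(U * K.unsqueeze(0), n=n, dim=1)[:, :T]

def s4d_kernel(log_decay, freqs, coeffs, T: int) -> torch.Tensor:
    """Impulse response of a complex-diagonal LTI SSM: h[t] = Re(c_k * exp(lambda_k * t)),
    with lambda_k = -exp(log_decay_k) + i * freqs_k (one state per channel for simplicity)."""
    t = torch.arange(T, dtype=torch.float32).unsqueeze(-1)        # (T, 1)
    lam = -log_decay.exp() + 1j * freqs                           # (D,)
    return (coeffs * torch.exp(lam * t)).real                     # (T, D)

# Usage on the 480-dim z_cat stream (shapes illustrative):
B, T, D = 2, 1024, 480
u = torch.randn(B, T, D)
kernel = s4d_kernel(torch.zeros(D), torch.linspace(0.1, 3.14, D), torch.ones(D), T)
y = fft_causal_conv(u, kernel)                                    # (B, T, D)
```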
Why keep attention? Pure SSMs struggle with associative recall — retrieving a specific earlier token based on content. Attention excels at this. Running both in parallel and gating the mixture lets the model route each token through whichever mechanism is more useful.
The branches are combined via a learned sigmoid gate: \(H = g \cdot H_\text{SSM} + (1-g) \cdot H_\text{attn}\).
Mathematically, the two passes work as follows. Let \(f_\theta\) denote the gated SSM+Attention operator (with independent weights per pass):
Pass 1 — initial temporal context:

\[
H^{(1)} = f_{\theta^{(1)}}\big(z_\text{cat}\big)
\]

Pass 2 — refinement on residual:

\[
z_\text{cat}^{(2)} = z_\text{cat} + W_\text{back}^{(2)}\, H^{(1)}, \qquad H^{(2)} = f_{\theta^{(2)}}\big(z_\text{cat}^{(2)}\big)
\]
The key is the residual injection in \(z_\text{cat}^{(2)}\): the second pass receives the original Stäckel coordinates \(z_\text{cat}\) plus a learned projection of what Pass 1 produced. This means Pass 2 operates on a representation that combines the raw geometry with first-order temporal context — it can refine the temporal dependencies without losing the geometric signal. All weights (\(g_1, g_2, W_\text{proj}^{(1)}, W_\text{proj}^{(2)}, W_\text{back}^{(2)}\), SSM and attention parameters) are independent between passes — weight tying was tested and failed catastrophically (+0.375 bpb).
Why 2 passes? A single pass captures first-order temporal dependencies. The second pass corrects the residual: it sees what the first pass produced and can fix errors or capture higher-order interactions that the first pass missed. This is analogous to iterative refinement in numerical methods.
Stage 3 — Kernel heads (spatial readout)¶
Why multiple kernel heads? Different kernel functions capture different geometric properties of the token manifold. A Gabor kernel detects oscillatory patterns (periodic structure in the coordinates), a Laplacian kernel measures proximity (smooth decay with distance), and the Nyström kernel approximates the full attention matrix via landmark points. No single kernel can capture all these properties simultaneously — the multi-kernel design lets the model express a rich family of spatial relationships, analogous to how multi-head attention lets different heads attend to different patterns.
Five kernel heads compute token-to-anchor similarities, each on its dedicated encoder’s output. Each head defines a feature map \(\Phi: \mathbb{R}^d \to \mathbb{R}^R\) that maps a token’s Stäckel coordinates to a vector of \(R\) kernel evaluations against learned anchor points \(\{\mu_r\}_{r=1}^R\):
| Head | Input | Kernel | What it captures | R |
|---|---|---|---|---|
| Nyström | \(z_0\) (Ψ₀) | Gegenbauer on cosine similarity | Global token-to-token attention (causal) | 256 |
| Gabor | \(z_1\) (Ψ₁) | Gaussian × cosine | Oscillatory / periodic structure | 256 |
| Laplacian | \(z_2\) (Ψ₂) | Gaussian + Laplacian + Matérn-3/2 | Proximity / smooth decay | 256 |
| Tucker GL | \(z_1 \times z_2\) | Gabor × Laplacian | Conjunctions (oscillatory AND proximate) | 256 |
| Linear | \(z_0\) (Ψ₀) | None (identity) | Direct coordinate-to-prediction path | 160 |
In detail:
Nyström (causal spherical, on \(z_0\)) (Williams & Seeger, 2001). Approximates full token-to-token attention via \(R = 256\) landmark points. Tokens and landmarks are \(\ell_2\)-normalised, and the kernel is a Gegenbauer polynomial mixture on cosine similarity:

\[
\Phi_r(z) = m_r \, e^{-\eta \Delta t_r} \sum_{n=1}^{3} w_n \, P_n(c_r)
\]

where \(c_r = \hat{z}^\top \hat{\mu}_r\) is the cosine similarity, \(P_n\) are Gegenbauer polynomials (\(P_1 = \tfrac{1+c}{2}\), \(P_2 = \tfrac{3c^2-1}{2}\), \(P_3 = \tfrac{5c^3-3c}{2}\)), \(w_n\) are the learnable mixture weights, \(m_r\) is the causal mask (zero for future landmarks), and \(e^{-\eta \Delta t_r}\) is a temporal decay.
Gabor (on \(z_1\)) (Gabor, 1946). Captures oscillatory structure — periodic patterns in Ψ₁’s coordinate system. Each anchor has a position \(\mu_r\), a wave vector \(k_r\), a width \(\sigma_r\), and a phase \(\varphi_r\):

\[
\Phi_r(z) = \exp\!\left(-\frac{\|z - \mu_r\|^2}{2\sigma_r^2}\right) \cos\!\left(k_r^\top (z - \mu_r) + \varphi_r\right)
\]
The Gaussian window localises the response; the cosine detects frequency content at the anchor’s characteristic scale.
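As a sketch of how such a head might be evaluated (shapes and names illustrative, and assuming the head materialises the full token–anchor difference tensor, which the real implementation may avoid):

```python
import torch

def gabor_features(z, mu, k, sigma, phase):
    """Gabor kernel head: Gaussian window around each anchor times a cosine carrier.
    z: (B, T, d) Stäckel coordinates; mu, k: (R, d) anchor positions / wave vectors;
    sigma, phase: (R,) widths and phases. Returns (B, T, R) features."""
    diff = z.unsqueeze(-2) - mu                              # (B, T, R, d)
    window = torch.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))
    carrier = torch.cos((diff * k).sum(-1) + phase)
    return window * carrier

# Usage (shapes only): R = 256 anchors in d = 160 coordinates.
z = torch.randn(2, 64, 160)
phi = gabor_features(z, torch.randn(256, 160), torch.randn(256, 160),
                     torch.ones(256), torch.zeros(256))      # (2, 64, 256)
```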
Laplacian (on \(z_2\)). Measures proximity in Ψ₂’s coordinate system via a learnable RBF mixture. Three kernel families with complementary smoothness properties:

\[
\Phi_r(z) = a_r\, e^{-d_r^2 / 2\sigma_r^2} + b_r\, e^{-d_r / \sigma_r} + c_r \left(1 + \tfrac{\sqrt{3}\, d_r}{\sigma_r}\right) e^{-\sqrt{3}\, d_r / \sigma_r}
\]

where \(d_r = \|z - \mu_r\|\), \(a_r, b_r, c_r\) are learnable mixture weights, and the three terms are Gaussian, Laplacian, and Matérn-3/2 kernels respectively (Rasmussen & Williams, 2006). A temporal bandwidth modulation \(\sigma_r^2 \leftarrow \sigma_r^2(1 + \gamma \cdot t/T)\) widens the bandwidth for later sequence positions.
Tucker GL (cross-encoder: \(z_1 \times z_2\)). Captures conjunctions — “oscillatory AND proximate” — via the element-wise product of Gabor and Laplacian outputs:

\[
\Phi^\text{GL}(z_1, z_2) = \Phi^\text{Gabor}(z_1) \odot \Phi^\text{Lap}(z_2)
\]
No additional parameters beyond those of the Gabor and Laplacian heads.
Linear (on \(z_0\)). A non-kernel baseline: the raw encoded coordinates \(z_0 \in \mathbb{R}^{160}\) used directly as features, with no nonlinear kernel transformation.
All five heads are stacked into a single GEMM for the readout: \(\text{mixed} = \Phi_\text{all} \cdot W_\text{all}\), where \(\Phi_\text{all} \in \mathbb{R}^{B \times T \times (4 \times 256 + 160)}\). Per-head learnable scales (softmax-normalised) control the mixture. The 5-head ensemble was not ablated per-head (removing individual heads); kernel diversity was validated as a whole against degenerate baselines (identity encoders, frozen anchors, MLP replacement). The asymmetric dual-encoder variant of Nyström was tested and rejected (+0.030 bpb — the symmetric single-encoder design is superior).
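A sketch of that stacked readout, assuming the softmax-normalised per-head scales are folded into the features before the single GEMM (equivalent to scaling blocks of \(W_\text{all}\)); names are illustrative:

```python
import torch
import torch.nn.functional as F

def mixed_readout(phi_heads, W_all, head_scales):
    """Stack the five heads' features, weight them by softmax-normalised per-head
    scales, and apply one readout GEMM. phi_heads: list of (B, T, R_h) tensors;
    W_all: (sum R_h, V); head_scales: (num_heads,) learnable logits."""
    scales = F.softmax(head_scales, dim=0)
    phi_all = torch.cat([s * phi for s, phi in zip(scales, phi_heads)], dim=-1)
    return phi_all @ W_all                                   # (B, T, V)
```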
Output¶
The final logits combine temporal and spatial paths:

\[
\text{logits} = \text{logits}_\text{temporal} + \gamma \cdot \text{logits}_\text{spatial}
\]
where \(\gamma\) (the geom_scale gate) is initialised to 0 — the spatial kernel path opens gradually during training.
Why two separate paths? The temporal path (SSM + attention) captures sequential dependencies; the spatial path (kernel heads) captures geometric structure in the Stäckel coordinates. Combining them at the logit level lets each path specialise without interfering. Initialising \(\gamma = 0\) ensures the temporal path establishes stable gradients first; the spatial path then opens once the encoders have learned meaningful coordinates.
The model prefers the spatial path¶
I ran a convex sweep on the temporal/spatial balance (\(\alpha = 0.27, 0.50, 0.73\)) and also let \(\gamma\) be learned freely. The model consistently learned \(\gamma \approx 1.33\) — it upweighted the spatial kernel path beyond equal balance. The convex sweep confirmed this: \(\alpha = 0.73\) (favouring spatial) beat \(\alpha = 0.27\) (favouring temporal) by −0.003 bpb. The geometric kernel readout carries more signal than the temporal SSM+attention path for next-token prediction in this architecture. This is one of the strongest reasons to pursue the geometric approach: the spatial path — which is the novel, IGL-derived component — is not a marginal add-on that the model tolerates. It is at least as important as the more traditional temporal one. The token manifold’s geometry contains genuine signal for prediction, and the architecture is learning to exploit it.
Parameter budget¶
| Component | Parameters | % |
|---|---|---|
| 3 ChartEncoders × 2 blocks | 8.4M | 46% |
| 5 kernel readout matrices × 2 blocks | 2.4M | 13% |
| BigramHash (16384×160 + projection) | 2.7M | 15% |
| 4 SSMs (2 per block, 2-pass) | 0.6M | 3% |
| 8 Attention layers (2 per pass × 2 passes × 2 blocks) | 2.1M | 12% |
| Basis params (anchors, σ, φ) | 0.25M | 1% |
| Other (embeddings, gates, norms, skip) | 1.8M | 10% |
| Total | 18.3M | 100% |
Compressed artifact: 13.0 MB (int8 per-tensor quantisation + zstd-22).
Training¶
The optimizer is decoupled: Muon (momentum 0.92→0.99, 1500-step warmup) for all 2D weight matrices (encoders, attention QKV/proj, SSM projections), AdamW at varying learning rates for everything else (1e-4 for basis params, 3e-4 for gates, 0.05 for embeddings). Whole-model torch.compile with a 3-step compile warmup before the training clock starts (−0.005 bpb + 16% speed; step 1 at 105ms instead of 1400ms).
Why decoupled? Muon’s equivariant Newton step accelerates training on 2D weight matrices (−0.022 bpb from extending it to attention and SSM projections), but applying it to scalar parameters (gate biases, kernel anchors, temperatures) destabilises Stiefel enforcement — full Muon trained at +0.038 bpb. The decoupling lets each parameter type use the optimizer suited to its geometry.
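A sketch of the routing — splitting parameters by shape and name into a Muon group and AdamW groups at the learning rates above. The exact matching rules are assumptions, not the repo's code:

```python
def build_param_groups(model):
    """Route parameters by geometry: Muon for 2D weight matrices (encoders, attention
    QKV/proj, SSM projections), AdamW for everything else at per-group learning rates."""
    muon_params = []
    adamw_groups = {"embed": [], "basis": [], "other": []}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "embed" in name:
            adamw_groups["embed"].append(p)                  # AdamW, lr 0.05
        elif p.ndim == 2:
            muon_params.append(p)                            # Muon, momentum 0.92 -> 0.99
        elif any(k in name for k in ("anchor", "sigma", "phase")):
            adamw_groups["basis"].append(p)                  # AdamW, lr 1e-4
        else:
            adamw_groups["other"].append(p)                  # gates, norms, temperatures: lr 3e-4
    # muon_params go to a Muon optimizer (e.g. the modded-nanoGPT implementation);
    # adamw_groups go to torch.optim.AdamW with the per-group learning rates above.
    return muon_params, adamw_groups
```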
A soft Stäckel penalty encourages diagonal covariance of the encoded coordinates. For each encoder output \(z \in \mathbb{R}^{B \times T \times d}\), I compute the empirical covariance \(C = z^\top z / N\) and penalise its off-diagonal energy:

\[
\mathcal{L}_\text{Stäckel} = \beta \, \big\| C - \operatorname{diag}(C) \big\|_F^2
\]
with \(\beta = 0.02\), subsampled to 4096 tokens for efficiency. This nudges the learned metric toward separability — if \(C\) is diagonal, the coordinate axes are statistically independent, which is the Stäckel condition in the empirical limit. The penalty is soft: the model is free to violate it if the task loss demands correlated coordinates, but in practice it converges to near-diagonal covariance. With deep encoders (n=6), the penalty is redundant — Stiefel enforcement alone suffices. But with shallow encoders (n=4, as in V2), the penalty is needed: the shallow encoder can’t enforce statistical independence of coordinates on its own.
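A minimal sketch of the penalty as described (the subsampling and flattening details are assumptions):

```python
import torch

def stackel_penalty(z: torch.Tensor, beta: float = 0.02, max_tokens: int = 4096) -> torch.Tensor:
    """Soft penalty on the off-diagonal covariance energy of encoded coordinates.
    z: (B, T, d) encoder output. Tokens are flattened and subsampled, the empirical
    covariance C = z^T z / N is formed, and its off-diagonal Frobenius energy is
    penalised, nudging the learned coordinates toward separability."""
    z = z.reshape(-1, z.shape[-1])
    if z.shape[0] > max_tokens:
        idx = torch.randperm(z.shape[0], device=z.device)[:max_tokens]
        z = z[idx]
    C = z.T @ z / z.shape[0]                              # (d, d) empirical covariance
    off_diag = C - torch.diag(torch.diagonal(C))
    return beta * (off_diag ** 2).sum()
```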
Ablation results¶
Here’s the attribution of where the V1→V2 gains came from (converted to est_bpb, the competition metric):
| Change | Impact (est_bpb) | Type |
|---|---|---|
| Stiefel enforcement fix (power iteration) | −0.065 | Bug fix |
| Weyl spectral SSM initialisation | −0.062 | Architecture |
| d_max 128 → 160 (Tensor Core alignment) | −0.036 | Engineering |
| z-space temporal processing (SSM + attention on z, not on logits) | −0.034 | Architecture |
| FFT SSM (cuFFT causal convolution) | −0.031 | Engineering |
| Surgical Muon routing | −0.022 | Engineering |
| torch.compile whole-model | −0.005 | Engineering |
The largest single gain came from fixing the Stiefel enforcement. The original torch.linalg.matrix_norm(ord=2) call silently triggered cuSOLVER host-device synchronisation, breaking CUDA graphs and crushing throughput. Replacing it with power iteration removed the performance cliff and unlocked headroom for the architectural gains to matter.
What worked¶
z-space temporal processing (−0.034 bpb). The V1 architecture applied SSM and attention to the 1024-dim mixed output — the vocabulary-sized logit space. V2 moved these operations to the encoded coordinates \(z_\text{cat}\) (480-dim). This is not just a dimensionality reduction: \(z\) lives in the space where the metric is approximately diagonal, so attention’s QK dot product computes distances that respect the manifold geometry rather than operating on an arbitrary post-kernel mixture. The SSM similarly benefits — its recurrence propagates information along geometrically meaningful axes rather than coupled logit dimensions.
Weyl spectral SSM initialisation (−0.062 bpb, zero parameters). The SSM’s complex eigenvalues \(\lambda_k = -\sigma_k + i\omega_k\) control which frequencies the recurrence propagates. Random initialisation forces the model to discover the right frequency spectrum from scratch. Weyl’s law (Weyl, 1911) predicts the asymptotic eigenvalue growth of the Laplacian on a Riemannian manifold: \(\omega_n \sim n^{1/d}\) where \(d\) is the intrinsic dimension. I set the SSM frequencies to a 50/50 blend of Weyl frequencies (\(n\pi / R_s\)) and log-spaced frequencies, giving the SSM a physics-informed starting point that matches the spectral structure of the Stäckel metric. This is the cleanest example of IGL theory transferring directly to language modelling — at zero additional parameters.
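A sketch of the initialisation. The text above doesn't say whether the 50/50 blend is an element-wise average of the two families or half of the modes from each; the sketch assumes the former, and \(R_s\), `f_min`, `f_max` are illustrative values:

```python
import math
import torch

def weyl_blend_frequencies(n_modes: int, R_s: float = 1.0,
                           f_min: float = 0.5, f_max: float = 1e3) -> torch.Tensor:
    """Initial SSM frequencies omega_k: a 50/50 blend of Weyl-law modes (n * pi / R_s)
    and the standard log-spaced default."""
    n = torch.arange(1, n_modes + 1, dtype=torch.float32)
    weyl = n * math.pi / R_s                                   # Weyl-law spectrum
    log_spaced = torch.logspace(math.log10(f_min), math.log10(f_max), n_modes)
    return 0.5 * weyl + 0.5 * log_spaced

# The imaginary parts of the SSM eigenvalues lambda_k = -sigma_k + i*omega_k are then
# initialised from this blend; the decay terms sigma_k are initialised separately.
omega = weyl_blend_frequencies(64)
```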
Surgical Muon routing (−0.022 bpb). Muon’s Newton step provides equivariant updates that respect the geometry of weight matrices. V1 applied Muon only to encoder weights. V2 extended it to attention QKV/proj and SSM projections — any 2D weight matrix in the architecture. The key constraint: Muon must not be applied to scalar parameters (gates, temperatures, kernel anchors), where it destabilises the Stiefel manifold constraints (+0.038 bpb with full Muon). The surgical routing — Muon for matrices, AdamW for scalars — lets each parameter type use the optimiser suited to its geometry.
What didn’t work¶
Not all failures are equal. Some ideas are theoretically sound but penalised by the competition’s 10-minute wallclock format, which favours fast steps over better-but-slower architectures. I distinguish between ideas that are fundamentally flawed and ideas that are format-disadvantaged.
Fundamentally flawed:
- Phi-normalisation (+0.013 bpb): I normalised the kernel design matrix rows before readout — L2 normalisation for Gabor (preserving sign from cosine oscillation), softmax for Laplacian (making positive RBF outputs a probability distribution), Nyström unchanged (already row-sum normalised). The idea was to make kernel outputs scale-invariant. The result: it destroyed load-bearing magnitude information. The raw kernel outputs carry useful scale information — how strongly a token matches an anchor — that the readout weights rely on. Normalising it away forces the model to reconstruct scale from other signals, at a net cost.
- Competition techniques (+0.036 to +0.041 bpb): EMA, QK-Gain, and Muon-EQ are tricks borrowed from competitors on the regular transformer track of Parameter Golf (Keller Jordan’s modded-nanoGPT and its derivatives). They were tuned for the transformer baseline’s specific architecture and training dynamics. EMA (exponential moving average of weights, decay=0.997): +0.015 bpb — SWA (Izmailov et al., 2018) works better for Stiefel architectures because averaging on the orthogonal manifold requires different interpolation than Euclidean averaging. QK-Gain (learnable per-head scalar gain on queries, set to 5.0): +0.041 bpb — it oversaturates the tanh-bounded Stäckel coordinates, pushing them into the flat tails where gradients vanish. The default gain of 1.5 already accounts for the bounded range. Only XSA (exclusive self-attention: project out the self-alignment component, −0.002 bpb) transferred successfully — it’s architecture-agnostic.
Format-disadvantaged (may work at larger scale or longer training):
- Koopman propagator (+0.050 bpb): Koopman operator theory (Koopman, 1931; Budišić et al., 2012) shows that any nonlinear dynamical system can be represented as a linear operator acting on a lifted space of observables. If you find the right set of observables — functions of the state — the nonlinear dynamics become a linear map in that space. Applied to MHALM, the idea was to replace attention with this linearisation: lift \(z_\text{cat}\) to a higher-dimensional space via a polynomial (Carleman linearisation at degree 2: concatenate \(z_\text{cat}\) with \(z_\text{cat}^2\), expanding 384-dim to 768-dim), then apply a single linear projection \(W \in \mathbb{R}^{768 \times 256}\) — see the sketch after this list. The result: +0.050 bpb. The degree-2 polynomial lift is too crude to capture the content-based retrieval that attention provides — it can represent quadratic interactions but not the sharp, input-dependent routing of softmax. Language requires precise associative recall (“which earlier token matches this query?”), not just smooth quadratic mixing. To be honest, the evidence here is not just a format issue: D-koopman ran 2708 steps vs 1422 for the baseline (90% more tokens, since the single GEMM is much faster than attention), and still ended +0.050 bpb worse. More tokens didn’t help — the degree-2 lift is genuinely insufficient. But the fix paths are concrete: (1) higher-degree lifts on Stäckel coordinates — at intrinsic dimension \(d = 160\), not the ambient 384-dim, which keeps \(R = \binom{d+p}{p}\) tractable; (2) learned Koopman dictionaries (EDMD-style autoencoder where the latent dynamics are constrained to be linear; Williams et al., 2015); (3) input-dependent observables, which connects to Mamba’s selectivity — the lifting functions should depend on the current token. This is still work in progress on my side.
- Dual encoder (+0.030 bpb): standard attention uses separate \(W_Q\) and \(W_K\) projections — queries and keys live in different spaces. MHALM’s Nyström head uses a single encoder Ψ₀ for both query tokens and landmark keys, enforcing \(s(i,j) = s(j,i)\) (symmetric kernel). The dual encoder variant adds a second encoder Ψ_K (H=256) specifically for landmarks, breaking the symmetry: \(s(i,j) = K(\Psi_Q(x_i), \Psi_K(x_j)) \neq s(j,i)\). Cost: +920K parameters. Per-step, the dual encoder was actually better — at 1400 matched steps, it led by −0.028 bpb. But the 26% speed penalty (155ms vs 123ms per step) meant fewer total steps in the wallclock budget, and the net result was +0.030 bpb. In a longer training regime where per-step cost is amortised, the asymmetric representation might learn finer-grained attention patterns.
- Depth recurrence (+0.375 bpb in V2): instead of two independent blocks with separate weights, use a single block iterated \(L = 2\) times (Magnus expansion, first-order approximation: \(e^{L_1 + L_2} \approx e^{2L}\); Magnus, 1954). This frees ~2.5M parameters that can be reinvested into wider encoders. In isolation, weight-tying was genuinely positive: −0.041 bpb with 21% more steps (1723 vs 1422, since shared weights are cheaper). But combined with the full V2 architecture (spectral init, 2-pass temporal), it catastrophically failed at +0.375 bpb. The root cause is gradient interference: Pass 2 must correct the residual from Pass 1, but shared weights receive conflicting gradients from both roles. The Magnus theory predicts this should work — the issue is training dynamics, not architecture. Promising fix paths: (1) detached residual (stop gradients from Pass 1 to Pass 2, so each pass trains independently); (2) progressive unfreezing (train Pass 1 first, then unfreeze sharing); (3) partial weight sharing (share SSM weights — which are frequency-based and should be universal — but keep attention weights independent). All of these require longer training runs to validate.
- More layers (L=3, −0.025 bpb per-token but rejected): adding a third HybridAtlasBlock with shallower encoders (n=4 to fit the parameter budget) showed −0.025 bpb improvement over L=2 at matched steps (1446 vs 1439). The per-token learning efficiency was better. But the compressed artifact hit 18.2 MB, exceeding the 16 MB competition limit. L=3 is a promising direction for settings without an artifact size constraint.
- Input-dependent anchors (+0.021 bpb under competition format): MHALM’s kernel heads use fixed learned anchors — positions \(\mu_r\), widths \(\sigma_r\), frequencies \(k_r\) are constant at inference. The SSM is similarly LTI. This means the model cannot selectively propagate or forget information depending on the current token — the same limitation that separated S4/H3 from Mamba (Gu et al., 2021; Gu & Dao, 2023). The fix: make the anchors functions of the input (\(\mu_r(x)\), \(\sigma_r(x)\), \(k_r(x)\)), so different tokens attend to different geometric neighbourhoods in Stäckel space. I made a preliminary attempt — input-dependent bandwidth modulation (\(\sigma_r(x) = \sigma_r \cdot (1 + 0.5\tanh(W_\sigma z))\)) combined with a selective SSM — which showed +0.021 bpb due to +15ms/step overhead. The mechanism learned non-trivial modulation patterns; the wallclock penalty killed it. Re-evaluating with full input-dependent anchors and a proper training budget is the top priority.
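For concreteness, the degree-2 Carleman lift from the Koopman bullet above is just a concatenation with the element-wise square, followed by a single GEMM. A sketch with illustrative shapes and values, not the experiment's code:

```python
import torch

def carleman_lift_deg2(z: torch.Tensor) -> torch.Tensor:
    """Degree-2 Carleman lift: concatenate the state with its element-wise square,
    doubling the width, so nonlinear dynamics can be approximated by one linear map."""
    return torch.cat([z, z ** 2], dim=-1)             # (B, T, d) -> (B, T, 2d)

# The lifted state is then mapped by a single linear operator (the finite-dimensional
# Koopman approximation), standing in for attention; W's values here are illustrative.
z = torch.randn(2, 64, 384)
W = torch.randn(768, 256)
out = carleman_lift_deg2(z) @ W                       # (2, 64, 256)
```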
Putting the numbers in context¶
The competition format also imposes constraints worth understanding when interpreting the V2 result:
Hardware. H100 Tensor Cores do BF16 matmul at ~989 TFLOPS. MHALM’s Gabor/Laplacian/Nyström kernels use exp, cos, sqrt — ALU-bound ops at ~60 TFLOPS on CUDA cores. That’s a 15× hardware efficiency gap. V2’s 2.2× step-time ratio is actually the best achievable given this — meaning the architecture is already well-optimised for the hardware it’s fighting against.
Walltime vs learning efficiency. MHALM sees 3.19B tokens vs the transformer’s 7.19B in the same 10-minute wallclock. The competition measures “which learns better per H100-second?” — not “which learns better per token?” At iso-tokens (matched data, matched batch), the gap narrows from +0.27 → +0.11 val_loss and is still closing at the end of training. The gap narrows with more tokens, consistent with higher data efficiency, though I cannot extrapolate to crossover from this data.
The baseline is heavily engineered. The provided baseline is a modded-nanoGPT variant with Muon optimizer, Flash SDP attention, and ReLU² MLPs — already a well-optimised transformer tuned for FineWeb + H100. A vanilla transformer without these tricks would most likely land above MHALM’s 1.35 bpb (higher bpb, i.e. worse).
V=1024 hides IGL’s main lever. IGL’s marquee feature — the VP trick where you don’t store \(R \times V\) readout matrices — saves almost nothing at \(V = 1024\). At \(V = 50\text{K}\) or \(V = 128\text{K}\), the VP trick could save 30–40% of total model weights.
What comes next¶
The results so far — 45+ ablations, a 47% gap closure, theory-driven initialisations that work out of the box, and a spatial path the model actively prefers — suggest this direction is worth pursuing seriously. But I’ve reached the limit of what I can reasonably fund in compute on my own. The next steps (iso-token scaling to 1B parameters, long-context evaluation, input-dependent anchors at convergence) require sustained multi-node GPU access that my personal budget can’t absorb.
Looking for support. If you’re a lab, a compute sponsor, or a researcher interested in geometric approaches to language modelling, I’d welcome collaboration — whether that’s compute credits, mentoring, or just a conversation. You can reach me at alexandre@quemy.info.
The theoretical framework¶
MHALM is one instantiation of a broader theoretical programme that connects the transformer’s attention block to differential geometry and PDEs. The attention kernel \(K(q, k) = \exp(q^\top k / \sqrt{d_h})\) defines a Green’s function; the Stäckel separability condition identifies when that Green’s function factorises; and the IGL convolution formula turns the factorisation into a concrete linear-time computation. I am developing this framework into a full treatment — linking attention, MLP, LayerNorm, and the residual stream to operators on the token manifold (Witten Laplacian (Witten, 1982), Koopman propagator, Weyl spectral theory). The goal is a complete PDE dictionary for the transformer, where each architectural choice has a geometric interpretation and each geometric insight suggests a concrete architectural improvement — as Weyl spectral init already demonstrated.
Breaking free from the competition format¶
Parameter Golf proved MHALM works and improves, but its 10-minute speedrun on H100s is the wrong surface to evaluate a kernel-based architecture. The GEMM-optimised transformer runs at 989 TFLOPS on Tensor Cores; MHALM’s kernel ops run at ~60 TFLOPS on CUDA ALUs. In this format, MHALM doesn’t stand a chance — yet. Moreover, the IGL theory provides a path to making MHALM fully GEMM-native: structured polynomial feature maps can replace the explicit exp/cos kernel evaluations with matrix multiplications, leveraging a similar trick as the Performer (Choromanski et al., 2020) but with deterministic, Stäckel-structured features instead of random Fourier ones. This would likely eliminate the hardware penalty.
The immediate plan is to extend MHALM to a regime where its strengths actually matter:
- Iso-token evaluation at convergence. Train both architectures to the same number of tokens on Pile or FineWeb (10–30B tokens at Chinchilla ratio (Hoffmann et al., 2022)), at scales from 125M to 1B parameters. Evaluate on standard zero-shot benchmarks (LAMBADA, HellaSwag, PIQA, ARC). The iso-token gap was already narrowing sharply (+0.27 → +0.11 val_loss at 3B tokens) — I want to see where it goes at convergence.
- Large vocabularies. At \(V = 50\text{K}\)+ the VP trick saves 30–40% of model weights. Parameter Golf’s \(V = 1024\) hides this entirely.
- Long sequences. The LTI SSM with cuFFT should extrapolate to 2K–16K without retraining, where transformer’s \(O(T^2)\) cost becomes prohibitive.
Re-validating promising ideas¶
Several ideas showed theoretically sound foundations and promising per-step results but were killed by the speedrun format. With a longer training budget, I want to revisit:
- Koopman propagator — the degree-2 lift was too crude, but higher-degree lifts on Stäckel coordinates (at intrinsic dimension \(d\), not ambient \(d_h\)) and learned Koopman dictionaries are concrete next steps.
- Depth recurrence — weight-tying worked in isolation (−0.041 bpb) but failed in combination due to gradient interference. Detached residuals, progressive unfreezing, or partial sharing (SSM shared, attention independent) could fix the training dynamics.
- Dual encoder — per-step quality was better (−0.028 bpb at matched steps), just too slow for the wallclock budget.
- More layers (L=3) — genuine per-token improvement (−0.025 bpb), rejected only for artifact size.
- Input-dependent anchors — the most important open direction, and the top priority for re-evaluation with a proper training budget (see the format-disadvantaged section above for details and preliminary results).
Diagnostic experiments¶
Before scaling up, cheap diagnostics can validate or kill the direction:
- MQAR (multi-query associative recall) (Arora et al., 2024): does MHALM’s kernel readout provide recall capacity that gated-convolutions lack? If MHALM scales linearly in \(d\) on MQAR like vanilla gated-conv, the recall story is broken and input-dependent anchors are needed before further scaling.
- “Is it just a kernel method?” ablation: three degenerate baselines (identity encoders, frozen anchors, MLP replacement) to confirm the architecture adds value beyond the kernel.
- Atlas specialisation check: do the three encoders learn distinct coordinate charts, or do they collapse to identical representations?
Conclusion¶
After 45+ ablations, MHALM V2 reaches 1.35 bpb — closing about half the gap to the transformer baseline, despite a significant hardware disadvantage on kernel operations.
The architecture runs end-to-end and improves with iteration. The model prefers the spatial (geometric) path over the temporal one when given the choice. Two theory-driven changes — Weyl spectral initialisation and processing in Stäckel coordinates — produced measurable gains at zero or minimal parameter cost. These were predictions from the IGL framework, not post-hoc tuning.
Not everything worked though (yet?!). The Koopman propagator, depth recurrence in combination, and input-dependent anchors all failed under the competition format. Some of these are likely format-specific (the 10-minute wallclock penalises anything that adds per-step cost), some may be genuine limitations. Disentangling the two requires longer training runs and iso-token comparisons — experiments I haven’t been able to run yet.
The path forward is concrete: iso-token evaluation at convergence, input-dependent anchors with a proper training budget, and a GEMM-native formulation of the kernel readout. The theoretical framework connecting attention to differential geometry and PDEs is being developed in parallel.
If any of this is interesting to you, I’d be happy to talk.