← Back to Research

Research entry

Physics-Inspired LLM Optimization: 5 Directions, 9 Experiments

Active · March 2026

Testing whether structures from physics — thermodynamics, renormalization group, lattice theory, and mean-field dynamics — can guide LLM compression. Five research directions validated on GPT-2 124M: entropy-based quantization (2.6× better than uniform), RG-guided pruning, lattice attention with a phase transition, GPTQ interaction analysis, and metastable cluster verification.

Physics LLM Quantization Pruning Attention Mean-Field Theory GPT-2 PyTorch

View external resource →

Motivation

Large language models are expensive to run. The standard approaches to compression — quantization, pruning, efficient attention — rely on heuristics or expensive Hessian computations. But transformers have deep structural connections to physics: attention is Hopfield energy minimization, layer depth maps to renormalization group flow, and token dynamics follow mean-field interacting particle systems.

Can we exploit these physics structures to compress LLMs more intelligently?

This research program tests five physics-inspired optimization directions on GPT-2 124M, with plans to scale to LLaMA-7B.


Direction 1: Thermodynamic Quantization

Idea: Layers with high activation entropy carry more information and need more bits. Low-entropy layers can be aggressively quantized.

Method: Collect per-layer activation entropy via streaming histograms on 128 WikiText-2 calibration samples. Allocate bits proportionally: high entropy → more bits, low entropy → fewer bits, constrained to 4-bit mean.

Results (Absmax Quantization)

MethodAvg BitsPerplexityvs Uniform
FP32 baseline3229.95
Uniform 4-bit4.012,1961.0×
Entropy-linear 4-bit4.04,7302.6× better
Hessian-linear 4-bit4.016,0890.76× (worse)

Entropy allocation is 2.6× better than uniform at the same total bit budget.

Per-layer activation entropy across GPT-2

Entropy vs Hessian scatter — Pearson r = -0.006

Surprise: Entropy and Hessian Are Orthogonal

We also computed diagonal Fisher (Hessian) sensitivity per layer and tested the correlation with entropy:

  • Pearson r = −0.006 (essentially zero)
  • Spearman r = 0.082 (negligible)

Entropy (information content) and Hessian sensitivity (loss impact) measure completely different things. This is novel — the field assumes Hessian-based methods capture the important signal, but entropy captures something orthogonal.

GPTQ Changes the Story

When we replaced naive absmax with GPTQ (Cholesky-based Hessian error compensation):

MethodPPL
GPTQ uniform 4-bit985
GPTQ entropy [3,6]1,500
GPTQ entropy [2,8]2,677

GPTQ’s Hessian compensation makes entropy redundant. Uniform wins because GPTQ already handles per-layer sensitivity. Entropy allocation is most valuable in the low-sophistication quantization regime where you can’t afford Hessian computation.


Direction 2: Renormalization Group Flow Analysis

Idea: In RG theory, “relevant operators” grow under coarse-graining while “irrelevant operators” decay. If transformers implement RG, singular values of weight matrices should show growth/decay patterns across layers — telling us which weights to prune.

Method: Compute SVD of each weight matrix type across all 12 GPT-2 layers. Track singular value trajectories and compute growth rates.

Results

Weight TypeRG ClassificationMean Growth RatePruning Safety
attn.c_proj65% relevant (growing)+0.108Dangerous to prune
mlp.c_projMarginal+0.077Moderate risk
attn.c_attnMarginal (decay)−0.027Safer to prune
mlp.c_fcFixed point−0.008Safest to prune

The growth rate distribution is multimodal — four distinct peaks for four weight types, not Gaussian noise.

Key insight: Output projections accumulate and amplify signal through the residual stream (relevant operators). Input projections are scale-invariant (marginal/fixed point). This gives a physics-principled pruning strategy: compress input projections aggressively, leave output projections intact.

RG flow trajectories — singular value evolution across layers

Growth rate distribution — multimodal, four distinct peaks


Direction 3: Lattice-Structured Attention

Idea: Replace full O(n²) attention with a physics-motivated decay kernel: a(i,j) ∝ exp(−|i−j|/ξ). This is the Green’s function of a 1D lattice — it describes how correlations decay with distance.

Method: Multiply standard attention scores by the lattice decay mask before softmax. Sweep correlation length ξ from 16 to 1024 tokens.

Results: A Phase Transition

ξ (tokens)Perplexityvs Vanilla
102430.20+0.8%
51232.47+8.4%
25668.20+128%
128406+1,256%
642,391+7,882%
1638,678+129,041%

Sharp phase transition at ξ ≈ 256–512 tokens. GPT-2 is robust down to half its context window, then collapses catastrophically. This is not gradual degradation — it’s a phase transition, the hallmark of a physical system.

This directly confirms the theoretical prediction from Rigollet 2025 (arXiv 2512.01868, ICM 2026) about critical attention scaling.

Perplexity vs correlation length — sharp phase transition

Lattice decay mask patterns at different correlation lengths


Direction 4: Metastable Cluster Verification

Idea: Rigollet 2025 proves that tokens in transformers form metastable clusters that eventually collapse to full synchronization. We verify this empirically on GPT-2.

Method: Capture hidden states at every layer, compute pairwise cosine similarity and cluster count across depth.

Results

LayerMean Cosine SimClusters
00.488107
70.488125 (max diversity)
110.693119
120.9366 (collapse)

Layers 0–11 maintain ~125 clusters (a long metastable plateau). Layer 12 collapses everything to ~6 clusters in one shot. It’s a single cliff, not Rigollet’s predicted multi-step staircase — GPT-2 may be too shallow for multi-step merging.

The 1/t² polynomial decay predicted for Pre-LN does not fit (R² = −611). Residual connections actively fight mean-field collapse until the final layer.

Cluster count across layers — plateau then cliff

Pairwise cosine similarity heatmaps at layers 0, 4, 8, 11


Combined Compression

We applied all three physics principles simultaneously:

MethodPPLCompression
FP32 baseline29.951.0×
Lattice attn (ξ=512)32.471.0×
Entropy quant (4-bit)4,7302.48×
Entropy + RG pruning5,5152.63×
All three combined6,9332.63×

Entropy + RG pruning are slightly superadditive (combining is less bad than sum of individual costs). But lattice attention interferes with degraded weights — when the model is already damaged by quantization, it becomes more dependent on long-range attention.


Big Picture

Five physics principles, each validated experimentally:

  1. Entropy tells you how to quantize (which layers need more bits)
  2. RG flow tells you how to prune (which weight matrices to compress)
  3. Lattice decay tells you the minimum attention range (~512 tokens for GPT-2)
  4. Mean-field theory predicts the clustering dynamics (confirmed: metastable plateau + final collapse)
  5. Hessian and entropy are orthogonal — the field’s standard approach misses half the signal

What’s Next

  • Scale to LLaMA-7B to test whether patterns generalize
  • Direction 5: DMRG tensor compression — the last unexplored direction
  • Combined entropy+Hessian allocator — exploit the orthogonality