Scattering Transform

March 15, 2026

The Scattering Transform: Wavelets, Stability, and the Geometry of Deep Learning

Why Does Deep Learning Work on Signals?
Geometric Stability: The Core Design Principle
Why Fourier Fails
Building the Scattering Transform
Key Mathematical Properties
Example 1 - The Cascade: Energy Decay Across Layers
Example 2 - Deformation Stability: Scattering vs. Fourier
Example 3 - Texture Discrimination: Same Spectrum, Different Structure
Example 4 - Multifractal Analysis: Capturing Intermittency
Extensions and Applications
Summary

1. Why Does Deep Learning Work on Signals?

Convolutional neural networks (CNNs) achieve remarkable performance on images and audio, yet a rigorous mathematical explanation of why remains elusive. The scattering transform (ST) is a simplified version of a CNN with rigorous mathematical foundations. It originated in signal and image processing (S. Mallat; J. Bruna & S. Mallat) and has found applications in diverse tasks: image classification, finance, astrophysics, materials science, …

The key insight is that the success of CNNs on structured signals is not accidental - it follows directly from the geometric properties of those signal domains. Images, audio, and physical fields share a common regularity: they are stable to small local deformations. A slightly warped image of a cat is still a cat. A pitch-shifted vowel is still the same vowel.

The scattering transform is a hand-crafted signal representation that provably exploits this regularity. It combines classical wavelet analysis with a deep convolutional architecture whose filters are never learned - they are fixed by mathematics. The result is a feature extractor with formally certified stability guarantees that trained CNNs do not have, and that provides a mathematical lens through which to understand what CNNs implicitly learn to do.

2. Geometric Stability: The Core Design Principle

2.1 The Setting

Let $x \in L^2(\mathbb{R}^d)$ be a signal (e.g., an image or audio waveform). We want to build a representation $\Phi(x) \in \mathbb{R}^K$ to feed into a linear classifier $\hat{f}(x) = \langle \Phi(x), \theta \rangle$.

For this to generalize well, $\Phi$ must satisfy two things:

Stability to additive noise: $\| \Phi(x) - \Phi(x') \| \lesssim \| x - x' \|$

Stability to deformations: Given a smooth displacement field $\tau : \mathbb{R}^d \to \mathbb{R}^d$, let $x_\tau(u) = x(u - \tau(u))$ be the deformed signal. We want: $\| \Phi(x_\tau) - \Phi(x) \| \lesssim \| x \| \cdot \| \tau \|$

where $\|\tau\|$ is some appropriate norm on the displacement field.

2.2 Why Deformation Stability Is the Right Prior

Global translation invariance ($f(T_v x) = f(x)$ for all translations $T_v$) is a weak constraint - the translation group is only $d$-dimensional. The space of local deformations is infinite-dimensional and captures far more natural variability: changes in viewpoint, non-rigid motion, pronunciation variation in speech.

A key structural consequence is scale separation: since deformations act differently at different frequencies, a deformation-stable representation must separate scales. This is precisely what wavelet decompositions do - and precisely what the layers of a CNN do implicitly.

3. Why Fourier Fails

The Fourier modulus $ \Phi(x) = |\hat{x}(\omega)| $ is translation invariant and stable to additive noise, but it is catastrophically unstable to deformations at high frequencies.

The Dilation Example

Let $\tau(u) = su$ be a small uniform dilation ($|s| \ll 1$), and let $x(u) = e^{i\xi u} \theta(u)$ be a modulated window centered at frequency $\xi$. The dilated signal $x_\tau(u) = x(u - \tau(u)) = x((1-s)u)$ has its central frequency shifted to $(1-s)\xi$.

The frequency spread of $x$ is $\sigma_\theta^2 = \int |\omega - \xi|^2 |\hat{x}(\omega)|^2 d\omega = \int |\omega|^2 |\hat{\theta}(\omega)|^2 d\omega$, and after dilation (normalizing by the signal energy) it becomes $(1-s)^2 \sigma_\theta^2$.

When the frequency shift $|s|\xi$ is large compared to the bandwidth $\sigma_\theta$, the supports of $|\hat{x}|$ and $|\hat{x}_\tau|$ are nearly disjoint, so: $\big\| |\hat{x}_\tau| - |\hat{x}| \big\| \sim \| x \|$

This is an $O(1)$ error from an arbitrarily small deformation when $\xi$ is large. The Fourier modulus is not Lipschitz continuous with respect to deformations.
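The instability can be checked numerically. The sketch below is a minimal illustration, assuming a Gaussian window for $\theta$ and the deformation convention $x_\tau(u) = x(u - \tau(u))$ from Section 2.1; the helper names, the window width, and the two test frequencies are choices made for this example only.

```python
import numpy as np

def gaussian_window(u, width):
    """Gaussian window theta(u) (an assumed choice of window)."""
    return np.exp(-0.5 * (u / width) ** 2)

def fourier_modulus_gap(xi, s, n=2**14, width=0.05):
    """Relative L2 distance between |FFT(x)| and |FFT(x_tau)| for tau(u) = s * u."""
    u = np.linspace(-1.0, 1.0, n, endpoint=False)
    x = np.exp(1j * xi * u) * gaussian_window(u, width)       # x(u) = e^{i xi u} theta(u)
    v = u - s * u                                              # u - tau(u)
    x_tau = np.exp(1j * xi * v) * gaussian_window(v, width)   # x_tau(u) = x(u - tau(u))
    mod = np.abs(np.fft.fft(x))
    mod_tau = np.abs(np.fft.fft(x_tau))
    return np.linalg.norm(mod_tau - mod) / np.linalg.norm(mod)

s = 0.02                                                       # the same 2% dilation in both cases
gap_low = fourier_modulus_gap(xi=20.0, s=s)                    # shift s*xi well below the bandwidth
gap_high = fourier_modulus_gap(xi=4000.0, s=s)                 # shift s*xi several bandwidths wide
print(f"xi = 20:   relative gap = {gap_low:.3f}")
print(f"xi = 4000: relative gap = {gap_high:.3f}")
```

With the deformation held fixed, the gap should be a few percent at the low carrier frequency and of order one at the high one, which is the non-Lipschitz behavior described above.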

The fix is to band-limit the signal before measuring it - that is, to use a wavelet transform that isolates each frequency band before applying a modulus nonlinearity.

4. Building the Scattering Transform

4.1 The Wavelet Filter Bank

A Littlewood-Paley wavelet transform is built from a mother wavelet $\psi \in L^2(\mathbb{R}^d)$ by dilating and rotating: $\psi_\lambda(u) = a^{-dj} \psi(a^{-j} r^{-1} u), \quad \lambda = a^j r$

where $j \in \mathbb{Z}$ controls scale ($a^j$ is the scale factor, typically $a = 2$) and $r \in G$ is a rotation. For $a = 2$, the Littlewood-Paley condition ensures the filter bank is a frame whose bounds differ by at most $\varepsilon$:

\[1 - \varepsilon \leq |\hat{\phi}(2^J \omega)|^2 + \frac{1}{2} \sum_{j \leq J} \sum_{r \in G} |\hat{\psi}(2^j r \omega)|^2 \leq 1\]

This means the decomposition is contractive and stably invertible for $\varepsilon < 1$; it is exactly energy-preserving (a tight frame) when $\varepsilon = 0$.
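To make the condition concrete, the following sketch evaluates the Littlewood-Paley sum for a one-dimensional bank, where $G = \{+1, -1\}$ reduces to evaluating each filter at $\omega$ and $-\omega$. The Gaussian band-pass design and the parameters `xi0`, `sigma0`, and `sigma_phi` are assumptions made for illustration, not the filters of any particular library.

```python
import numpy as np

def psi_hat(omega, j, xi0=0.4, sigma0=0.16):
    """Analytic Gaussian band-pass response centered at xi0 / 2^j cycles/sample."""
    return np.exp(-0.5 * ((omega - xi0 / 2**j) / (sigma0 / 2**j)) ** 2)

def phi_hat(omega, J, sigma_phi=0.35):
    """Gaussian low-pass response with bandwidth sigma_phi / 2^J."""
    return np.exp(-0.5 * (omega * 2**J / sigma_phi) ** 2)

def littlewood_paley_sum(omega, J):
    """|phi_hat(w)|^2 + 1/2 * sum_j (|psi_hat_j(w)|^2 + |psi_hat_j(-w)|^2)."""
    total = phi_hat(omega, J) ** 2
    for j in range(J):
        total += 0.5 * (psi_hat(omega, j) ** 2 + psi_hat(-omega, j) ** 2)
    return total

omega = np.linspace(0.0, 0.5, 2001)          # cycles/sample, from DC to Nyquist
lp = littlewood_paley_sum(omega, J=6)
lp /= lp.max()                               # same as dividing every filter by one constant
epsilon = 1.0 - lp.min()
print(f"Littlewood-Paley sum in [{lp.min():.3f}, {lp.max():.3f}], epsilon = {epsilon:.3f}")
```

This crude one-wavelet-per-octave bank is far from tight, so the printed $\varepsilon$ is large; using more wavelets per octave or better-matched bandwidths brings the sum closer to 1 everywhere.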

4.2 The Modulus Nonlinearity

Wavelet coefficients $x \otimes \psi_\lambda(u)$ are not translation invariant. Their average is zero (wavelets have zero mean). The key step is to apply the complex modulus: $|x \otimes \psi_\lambda(u)|$

This produces a non-negative envelope that, unlike the raw coefficients, does not average to zero and is roughly translation invariant at the scale of $\psi_\lambda$. The modulus is chosen as the pointwise nonlinearity because it:

Is non-expansive: $| |a| - |b| | \leq | a - b |$
Preserves signal energy across layers

4.3 The Cascade

Averaging $|x \otimes \psi_{\lambda_1}|$ over a window of size $2^J$ gives a translation-invariant first-order feature: $S_J[\lambda_1] x(u) = |x \otimes \psi_{\lambda_1}| \otimes \phi_{2^J}(u)$

But averaging discards information - specifically, the spatial modulation of the wavelet envelope. This lost information is recovered by applying another wavelet transform to $|x \otimes \psi_{\lambda_1}|$, taking the modulus again, and averaging.

This produces second-order coefficients: $S_J[\lambda_1, \lambda_2] x(u) = \big| |x \otimes \psi_{\lambda_1}| \otimes \psi_{\lambda_2} \big| \otimes \phi_{2^J}(u)$

Iterating this process defines the full scattering transform. For a path $p = (\lambda_1, \lambda_2, \ldots, \lambda_m)$, we define the propagator: $U[p]x = \big| \cdots \big| |x \otimes \psi_{\lambda_1}| \otimes \psi_{\lambda_2} \big| \cdots \big| \otimes \psi_{\lambda_m} \big|$

and the scattering coefficient: $S_J[p]x(u) = U[p]x \otimes \phi_{2^J}(u)$

The resulting architecture is a convolutional network whose filters are fixed wavelets, not learned parameters.

Input x 
   │
   ├── S_J[∅]x = x ⊗ φ_J                  (order 0: low-pass average)
   │
   ├── U[λ₁]x = |x ⊗ ψ_λ₁|
   │     ├── S_J[λ₁]x                      (order 1 outputs)
   │     └── U[λ₁,λ₂]x = |U[λ₁]x ⊗ ψ_λ₂|
   │           ├── S_J[λ₁,λ₂]x             (order 2 outputs)
   │           └── ...                     (order 3, ...)
   └── ...
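The cascade can be written in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation used by Kymatio: the Gaussian filters, the parameters `xi0` and `sigma0`, the absence of subsampling, and the restriction of second-order paths to coarser scales ($j_2 > j_1$) are simplifications chosen for this example.

```python
import numpy as np

def gaussian_bank(n, J, xi0=0.4, sigma0=0.16):
    """Frequency responses of J analytic band-pass filters and one low-pass filter.

    Hypothetical Morlet-like design: wavelet j is centered at xi0 / 2^j
    cycles/sample with bandwidth sigma0 / 2^j.
    """
    omega = np.fft.fftfreq(n)
    psi_hat = [np.exp(-0.5 * ((omega - xi0 / 2**j) / (sigma0 / 2**j)) ** 2)
               for j in range(J)]
    phi_hat = np.exp(-0.5 * (omega * 2**J / sigma0) ** 2)
    return psi_hat, phi_hat

def circular_conv(x, h_hat):
    """Circular convolution of x with the filter whose FFT is h_hat."""
    return np.fft.ifft(np.fft.fft(x) * h_hat)

def scattering(x, J):
    """Return {path: S_J[path] x} for paths of order 0, 1 and 2 (no subsampling)."""
    psi_hat, phi_hat = gaussian_bank(len(x), J)
    coeffs = {(): circular_conv(x, phi_hat).real}                   # order 0
    for j1 in range(J):
        u1 = np.abs(circular_conv(x, psi_hat[j1]))                  # U[j1] x
        coeffs[(j1,)] = circular_conv(u1, phi_hat).real             # order 1
        for j2 in range(j1 + 1, J):                                 # coarser scales only
            u2 = np.abs(circular_conv(u1, psi_hat[j2]))             # U[j1, j2] x
            coeffs[(j1, j2)] = circular_conv(u2, phi_hat).real      # order 2
    return coeffs

t = np.linspace(0, 1, 1024, endpoint=False)
x = np.sin(2 * np.pi * 40 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
S = scattering(x, J=5)
orders = [len(path) for path in S]
print({m: orders.count(m) for m in (0, 1, 2)})
```

Each dictionary key is a path $p$ and each value is $S_J[p]x$ sampled on the input grid; with $J = 5$ there is 1 path of order 0, 5 of order 1, and 10 of order 2.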

5. Key Mathematical Properties

5.1 Non-Expansiveness (Stability to Noise)

Proposition. The windowed scattering transform is non-expansive: $\| S_J[P_J] x - S_J[P_J] x' \| \leq \| x - x' \|, \quad \forall x, x' \in L^2(\mathbb{R}^d)$

This follows because (1) the Littlewood-Paley wavelet frame satisfies $\|W_J x\| \leq \|x\|$, and (2) the modulus satisfies $| |a| - |b| | \leq |a - b|$ pointwise. A composition of non-expansive operators is itself non-expansive.
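A single layer already shows the mechanism. The sketch below builds a hypothetical Gaussian filter bank, rescales it so that $|\hat\phi|^2 + \sum_j |\hat\psi_j|^2 \leq 1$ at every frequency (a sufficient, slightly stronger form of the Littlewood-Paley upper bound), and compares the two sides of the inequality for one wavelet-modulus layer. The filter design and parameter values are assumptions for illustration.

```python
import numpy as np

def normalized_bank(n, J, xi0=0.4, sigma0=0.16):
    """Stack [phi_hat, psi_hat_0, ..., psi_hat_{J-1}], rescaled so sum of squares <= 1."""
    omega = np.fft.fftfreq(n)
    filters = [np.exp(-0.5 * (omega * 2**J / sigma0) ** 2)]                 # low-pass phi
    filters += [np.exp(-0.5 * ((omega - xi0 / 2**j) / (sigma0 / 2**j)) ** 2)
                for j in range(J)]                                          # band-pass psi_j
    filters = np.stack(filters)
    return filters / np.sqrt(np.max(np.sum(filters ** 2, axis=0)))

def one_layer(x, filters):
    """Rows: x * phi, then |x * psi_j| for each j (circular convolutions)."""
    coeffs = np.fft.ifft(np.fft.fft(x)[None, :] * filters, axis=1)
    return np.concatenate([coeffs[:1].real, np.abs(coeffs[1:])])

rng = np.random.default_rng(0)
n, J = 1024, 5
filters = normalized_bank(n, J)
x = rng.standard_normal(n)
x_noisy = x + 0.1 * rng.standard_normal(n)

lhs = np.linalg.norm(one_layer(x, filters) - one_layer(x_noisy, filters))
rhs = np.linalg.norm(x - x_noisy)
print(f"||Ux - Ux'|| = {lhs:.4f}  <=  ||x - x'|| = {rhs:.4f}")
```

Because each layer is non-expansive, stacking layers and averaging with $\phi_{2^J}$ cannot amplify the perturbation, which is the content of the proposition.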

5.2 Energy Conservation and Exponential Decay

Under mild conditions on the wavelet (roughly: the wavelet is analytic and has at least one vanishing moment), the total scattering energy is preserved: $\|x\|^2 = \sum_{p \in P_\infty} \|S_J[p]x\|^2$

More importantly, the energy decays exponentially with path depth: $R_{J,x}(m) := \sum_{|p|=m} \|U[p]x\|^2 \leq \|x\|^2 - \|x \otimes \chi_{ra^m}\|^2$

where $\chi_s$ is a Gaussian low-pass filter whose bandwidth is proportional to $s$, and $r > 0$, $a > 1$ are constants that depend on the wavelet. Energy at frequency $2^k$ therefore disappears after $O(k)$ layers, so typical signals require no more than 2–3 layers. Empirically, on image datasets, over 99% of the energy is captured by paths of length $m \leq 2$.

5.3 Asymptotic Translation Invariance

The scattering metric $d_J(x, x') := \|S_J[P_J]x - S_J[P_J]x'\|$ is non-increasing in $J$: $d_{J+1}(x, x') \leq d_J(x, x')$

and in the limit it is translation invariant: $\lim_{J \to \infty} \| S_J[P_J] x - S_J[P_J] x_v \| = 0, \quad \forall x, v$

where $x_v(u) = x(u - v)$.
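The effect of $J$ on translation sensitivity can be seen with first-order coefficients alone. The sketch below is a simplified illustration: it keeps a fixed, hypothetical set of Gaussian wavelets, varies only the averaging scale $2^J$, and uses circular shifts, so it is not the full path set $P_J$ of the theorem.

```python
import numpy as np

def first_order_scattering(x, J, n_wavelets=6, xi0=0.4, sigma0=0.16):
    """Rows S_J[j]x = |x * psi_j| * phi_{2^J}, with assumed Gaussian filters."""
    omega = np.fft.fftfreq(len(x))
    phi_hat = np.exp(-0.5 * (omega * 2**J / sigma0) ** 2)           # averaging at scale 2^J
    x_hat = np.fft.fft(x)
    rows = []
    for j in range(n_wavelets):
        psi_hat = np.exp(-0.5 * ((omega - xi0 / 2**j) / (sigma0 / 2**j)) ** 2)
        envelope = np.abs(np.fft.ifft(x_hat * psi_hat))             # U[j]x
        rows.append(np.fft.ifft(np.fft.fft(envelope) * phi_hat).real)
    return np.stack(rows)

rng = np.random.default_rng(1)
x = rng.standard_normal(2048)
x_shifted = np.roll(x, 7)                                           # x_v with v = 7 samples

scales = list(range(2, 10))
distances = [
    np.linalg.norm(first_order_scattering(x, J) - first_order_scattering(x_shifted, J))
    for J in scales
]
for J, d in zip(scales, distances):
    print(f"J = {J}:  d_J(x, x_v) = {d:.4f}")
```

The printed distances shrink as $J$ grows, since a wider averaging window leaves less room for a 7-sample shift to change the output.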

5.4 Lipschitz Stability to Deformations

This is the central theorem. For a $C^2$ displacement field $\tau$ with $\|\nabla\tau\|_\infty \leq 1/2$:

\[\| S_J[P_J] x_\tau - S_J[P_J] x \| \leq C \|U[P_J]x\|_1 \cdot K(\tau)\]

where: $K(\tau) = 2^{-J}\|\tau\|_\infty + \|\nabla\tau\|_\infty \max\!\left(1,\, \log \frac{\sup_{u,u'}|\tau(u)-\tau(u')|}{\|\nabla\tau\|_\infty}\right) + \|H\tau\|_\infty$

The bound decomposes into:

Translation term $2^{-J}\|\tau\|_\infty$: suppressed by increasing $J$, capturing local translation invariance
Deformation term $\|\nabla\tau\|_\infty$: controlled by scale separation in the wavelet decomposition
Curvature term $\|H\tau\|_\infty$: second-order correction

The proof hinges on controlling the commutator $[W_J, L_\tau] = W_J L_\tau - L_\tau W_J$ between the wavelet transform and the deformation operator, which is bounded by $\|\nabla\tau\|_\infty$ due to the scale-localization property of wavelets.
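For a concrete displacement field, the three terms of $K(\tau)$ can be evaluated directly. The sketch below assumes a one-dimensional field sampled on an integer grid (so $\tau$ and $2^J$ are both in samples) and approximates derivatives by finite differences; the function name and the sinusoidal warp are illustrative choices.

```python
import numpy as np

def deformation_cost_terms(tau, du, J):
    """Three terms of K(tau) for a sampled 1-D displacement field (finite differences)."""
    grad = np.gradient(tau, du)                  # tau'
    hess = np.gradient(grad, du)                 # tau''
    sup_tau = np.max(np.abs(tau))
    sup_grad = np.max(np.abs(grad))
    spread = np.max(tau) - np.min(tau)           # sup_{u,u'} |tau(u) - tau(u')|
    log_factor = max(1.0, np.log(spread / sup_grad)) if sup_grad > 0 else 1.0
    return {
        "translation": 2.0 ** (-J) * sup_tau,
        "deformation": sup_grad * log_factor,
        "curvature": np.max(np.abs(hess)),
    }

n = 4096
u = np.arange(n, dtype=float)                    # positions in samples
du = 1.0
tau = 40.0 * np.sin(2 * np.pi * 3 * u / n)       # smooth warp, at most 40 samples
terms = deformation_cost_terms(tau, du, J=6)
K = sum(terms.values())
for name, value in terms.items():
    print(f"{name:12s} {value:.5f}")
print(f"K(tau) = {K:.5f},  max |tau'| = {np.max(np.abs(np.gradient(tau, du))):.3f}")
```

For this warp $\|\nabla\tau\|_\infty \approx 0.18$, which satisfies the hypothesis $\|\nabla\tau\|_\infty \leq 1/2$, and the curvature term is much smaller than the other two.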

Example 1 - The Cascade: Energy Decay Across Layers

This example builds a synthetic multi-scale signal and visualizes how energy is distributed across scattering orders. The theoretical prediction - exponential decay with path depth - is confirmed empirically.

Important note for installation: pip install kymatio numpy scipy matplotlib

"""
example 1: the scattering cascade and energy decay
================================================
visualizes how energy distributes across scattering orders (0, 1, 2)
and confirms the theoretical exponential decay.

install: pip install kymatio numpy scipy matplotlib
"""

import numpy as np
import matplotlib.pyplot as plt
from kymatio.numpy import Scattering1D

# signal parameters
T = 2**13    # signal length (must be power of 2)
J = 6        # number of dyadic scales
Q = 8        # wavelets per octave (frequency resolution)
t = np.linspace(0, 1, T)

np.random.seed(0)

# construct a signal with energy at multiple scales:
#   - slow carrier (5 Hz, period 1/5 s)
#   - mid-frequency burst (80 Hz, localized at t=0.5)
#   - high-frequency burst (200 Hz, localized at t=0.3)
#   - low-amplitude noise
signal = (
    np.sin(2 * np.pi * 5 * t)
    + 0.6 * np.sin(2 * np.pi * 80 * t) * np.exp(-((t - 0.5)**2) / 0.003)
    + 0.3 * np.sin(2 * np.pi * 200 * t) * np.exp(-((t - 0.3)**2) / 0.001)
    + 0.05 * np.random.randn(T)
)

# scattering transform
scat = Scattering1D(J=J, shape=T, Q=Q)
Sx   = scat(signal)          # shape: [num_paths, T // 2^J]
meta = scat.meta()           # path metadata (order, center frequency xi, scale j, ...)
order = meta['order']        # integer array: 0, 1, or 2 for each path

print(f"signal length:              {T}")
print(f"scattering output shape:    {Sx.shape}")
print(f"downsampling factor:        {T // Sx.shape[-1]}x  (scale 2^J = {2**J})")
print(f"total paths:                {Sx.shape[0]}")
for m in [0, 1, 2]:
    n = np.sum(order == m)
    E = np.sum(Sx[order == m]**2)
    print(f"  Order {m}: {n:4d} paths, energy fraction = {E / np.sum(Sx**2):.4f}")

fig = plt.figure(figsize=(13.5, 8.5))

gs = fig.add_gridspec(4, 2, width_ratios=[60, 1.2], height_ratios=[1, 1, 1, 1])

axes = [fig.add_subplot(gs[i, 0]) for i in range(4)]
cax1 = fig.add_subplot(gs[2, 1])
cax2 = fig.add_subplot(gs[3, 1])

fig.subplots_adjust(left=0.065, right=0.94, top=0.91, bottom=0.075, hspace=0.62, wspace=0.06)

# original signal
axes[0].plot(t, signal, lw=0.6, color='#2c7bb6')
axes[0].set_title("input signal  (multi-scale: low carrier + mid burst + high modulation)", pad=5)
axes[0].set_xlabel("time", labelpad=3)
axes[0].set_ylabel("amplitude")
axes[0].autoscale(enable=True, axis='x', tight=True)

# order-0 scattering
S0 = Sx[order == 0]
x0 = np.arange(S0.shape[-1])

axes[1].plot(x0, S0.T, color='#d7191c', lw=1.5)
axes[1].set_title(r"order 0: $S_J[\emptyset] x = x \otimes \phi_{2^J}$  - global low-pass envelope", pad=5)
axes[1].set_xlabel("spatial position (downsampled)", labelpad=3)
axes[1].autoscale(enable=True, axis='x', tight=True)

# order-1 coefficients
S1 = np.log1p(np.abs(Sx[order == 1]))
im1 = axes[2].imshow(S1, aspect='auto', cmap='YlOrRd', origin='lower', extent=[0, S1.shape[1] - 1, 0, S1.shape[0] - 1])
axes[2].set_title("order 1: $S_J[\\lambda_1] x$  - energy by (scale, time)", pad=5)
axes[2].set_xlabel("spatial position", labelpad=3)
axes[2].set_ylabel("path index ($\\propto$ scale)")
fig.colorbar(im1, cax=cax1, label='log(1 + |coeff|)')

# order-2 coefficients
S2 = np.log1p(np.abs(Sx[order == 2]))
im2 = axes[3].imshow(S2, aspect='auto', cmap='PuBuGn', origin='lower', extent=[0, S2.shape[1] - 1, 0, S2.shape[0] - 1])
axes[3].set_title("order 2: $S_J[\\lambda_1, \\lambda_2] x$  - scale-interaction features", pad=5)
axes[3].set_xlabel("spatial position", labelpad=3)
axes[3].set_ylabel("path index")
fig.colorbar(im2, cax=cax2, label='log(1 + |coeff|)')

for ax in axes:
    ax.tick_params(axis='both', pad=2)

fig.suptitle("scattering cascade: energy distributes across orders and decays", fontsize=13)
plt.show()

# energy decay table
print("\n=== Energy fraction per order ===")
total_energy = np.sum(Sx**2)
for m in [0, 1, 2]:
    frac = np.sum(Sx[order == m]**2) / total_energy
    print(f"  Order {m}: {frac*100:.2f}%")

Expected output:

Order 0:    1 paths, energy fraction = 0.9986
Order 1:   38 paths, energy fraction = 0.0014
Order 2:   87 paths, energy fraction = 0.0000

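The decay across orders is clear: order 0 captures the 5 Hz carrier, which lies inside the passband of the low-pass filter $\phi_{2^J}$ and dominates the signal energy; order 1 captures the energy of the band-limited bursts; and order 2 holds a small residual, below the four printed decimals, that encodes higher-order structure.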

Example 2 - Deformation Stability: Scattering vs. Fourier

This example applies a smooth time warp (a sinusoidal displacement field) to a test signal and measures the resulting error in both the Fourier modulus and the scattering representation. The scattering error should be dramatically smaller.

"""
example 2: deformation stability
==============================
empirically verifies that the scattering transform is Lipschitz continuous
to deformations, while the Fourier modulus is not.

install: pip install kymatio numpy scipy matplotlib
"""

import numpy as np
import matplotlib.pyplot as plt
from kymatio.numpy import Scattering1D
from scipy.ndimage import map_coordinates

T  = 2**10
J  = 5
Q  = 8
t  = np.linspace(0, 1, T)

np.random.seed(42)

# test signal: two harmonics at very different scales
signal = np.sin(2 * np.pi * 20 * t) + 0.1 * np.cos(2 * np.pi * 60 * t)

# deformation operator
def apply_deformation(x, tau_max_frac, freq=3):
    """
    warp signal x by a smooth sinusoidal displacement field:
        x_tau(t) = x(t - tau(t))
    tau(t) = tau_max * sin(2π * freq * t)

    tau_max_frac: max displacement as fraction of signal length T
    """
    displacement = tau_max_frac * T * np.sin(2 * np.pi * freq * t)
    warped_idx   = np.clip(np.arange(T) - displacement, 0, T - 1)
    return map_coordinates(x, [warped_idx], order=3, mode='nearest')

# sweep over deformation amplitudes
tau_values    = np.linspace(0.0, 0.025, 20)
fourier_errors = []
scat_errors    = []

scat = Scattering1D(J=J, shape=T, Q=Q)
Sx_orig = scat(signal)
F_orig  = np.abs(np.fft.rfft(signal))

for tau_max in tau_values:
    sig_w     = apply_deformation(signal, tau_max)
    F_w       = np.abs(np.fft.rfft(sig_w))
    Sx_w      = scat(sig_w)
    fourier_errors.append(np.linalg.norm(F_orig - F_w)  / np.linalg.norm(F_orig))
    scat_errors.append(   np.linalg.norm(Sx_orig - Sx_w) / np.linalg.norm(Sx_orig))

# qualitative example at a fixed deformation
tau_demo   = 0.015
signal_w   = apply_deformation(signal, tau_demo)
F_w_demo   = np.abs(np.fft.rfft(signal_w))
Sx_w_demo  = scat(signal_w)
meta       = scat.meta()
order      = meta['order']

fe_demo = np.linalg.norm(F_orig - F_w_demo)  / np.linalg.norm(F_orig)
se_demo = np.linalg.norm(Sx_orig - Sx_w_demo) / np.linalg.norm(Sx_orig)

print(f"deformation tau_max = {tau_demo:.3f}")
print(f"  Fourier modulus relative error:  {fe_demo:.4f}")
print(f"  scattering relative error:       {se_demo:.4f}")
print(f"  stability improvement:           {fe_demo/se_demo:.1f}x")

# plot
fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# original vs. warped signal
axes[0, 0].plot(t[:300], signal[:300],   lw=1.2, color='#2c7bb6', label='original')
axes[0, 0].plot(t[:300], signal_w[:300], lw=1.2, color='#d7191c', alpha=0.8, label='warped')
axes[0, 0].set_title(f"signal vs. warped  ($\\tau_{{\\max}} = {tau_demo}$)")
axes[0, 0].legend()

# Fourier modulus comparison
freqs = np.fft.rfftfreq(T)
axes[0, 1].plot(freqs[:T//8], F_orig[:T//8],    lw=1.0, color='#2c7bb6', label='original')
axes[0, 1].plot(freqs[:T//8], F_w_demo[:T//8],  lw=1.0, color='#d7191c', alpha=0.8,
                label='warped')
axes[0, 1].set_title(f"Fourier modulus  (error = {fe_demo:.3f})")
axes[0, 1].set_xlabel("frequency")
axes[0, 1].legend()

# scattering coefficients comparison (order-1 paths)
idx1 = np.where(order == 1)[0]
mean_orig = np.mean(np.abs(Sx_orig[idx1]),  axis=1)
mean_w    = np.mean(np.abs(Sx_w_demo[idx1]), axis=1)
x_idx     = np.arange(len(idx1))
axes[1, 0].bar(x_idx - 0.2, mean_orig, width=0.4, color='#2c7bb6', label='original', alpha=0.85)
axes[1, 0].bar(x_idx + 0.2, mean_w,    width=0.4, color='#d7191c', label='warped',   alpha=0.85)
axes[1, 0].set_title(f"Order-1 scattering path energies  (error = {se_demo:.3f})")
axes[1, 0].set_xlabel("scale index")
axes[1, 0].legend()

# error vs. deformation amplitude
axes[1, 1].plot(tau_values, fourier_errors, 'o-', color='#d7191c', lw=1.5,
                label='Fourier modulus')
axes[1, 1].plot(tau_values, scat_errors,    's-', color='#1a9641', lw=1.5,
                label='scattering')
axes[1, 1].set_title("relative error vs. deformation amplitude")
axes[1, 1].set_xlabel(r"$\tau_{\max}$ (fraction of signal length)")
axes[1, 1].set_ylabel("relative $L^2$ error")
axes[1, 1].legend()
axes[1, 1].set_xlim(0, None)
axes[1, 1].set_ylim(0, None)

plt.suptitle("deformation stability: scattering vs Fourier modulus", fontsize=13)
plt.tight_layout()
plt.show()

The bottom-right panel is the key result: the Fourier error grows rapidly and nonlinearly with deformation amplitude, while the scattering error grows slowly and approximately linearly - consistent with the Lipschitz bound $\|S_J x_\tau - S_J x\| \lesssim \|\nabla\tau\|_\infty$.

Example 3 - Texture Discrimination: Same Spectrum, Different Structure

This example reproduces the central experiment from Section 4.2 of Mallat (2012): two stochastic processes with identical power spectra (i.e., identical second-order statistics) but different higher-order statistics. The Fourier modulus cannot distinguish them; second-order scattering can.

The theoretical explanation: the expected scattering coefficient $\mathbb{E}[S_J[p]X]$ for a path $p$ of length $m$ captures moments of $X$ up to order $2^m$. First-order scattering ($m=1$) depends only on second-order moments (the power spectrum); second-order scattering ($m=2$) depends on up to fourth-order moments, enabling discrimination.

"""
example 3: texture discrimination where only order-2 scattering separates them.
============================================================================
two processes with exactly identical power spectra (by construction, via phase
randomization) but different cross-scale coupling. 
the power spectrum is broadband (multi-peak). 
first-order scattering cannot distinguish them, second-order scattering can.

install: pip install kymatio numpy scipy matplotlib
"""

import numpy as np
import matplotlib.pyplot as plt
from kymatio.numpy import Scattering1D
from scipy.stats import kurtosis

rng = np.random.default_rng(13)
T   = 2**14
J   = 8
Q   = 8

CARRIERS = np.linspace(0.05,0.25,20)   # broadband spectrum
F_ENV    = 0.010                       # slow shared envelope
DEPTH    = 0.95

# process B: several carriers sharing one slow positive envelope
# the shared modulation couples all carrier bands to the coarse envelope scale,
# producing strong, distributed second-order scattering coefficients.
def slow_positive_envelope(T, rng, f_env, depth):
    Wn    = np.fft.rfft(rng.standard_normal(T))
    freqs = np.fft.rfftfreq(T)
    env   = np.fft.irfft(Wn * np.exp(-(freqs / f_env)**2), n=T)
    return 1.0 + depth * (env / (np.abs(env).max() + 1e-9))   # strictly positive

def multi_carrier_texture(T, rng, carriers, f_env, depth):
    tt  = np.arange(T)
    env = slow_positive_envelope(T, rng, f_env, depth)        # one shared envelope
    x   = np.zeros(T)
    for fc in carriers:
        x += np.cos(2 * np.pi * fc * tt + rng.uniform(0, 2 * np.pi))
    x *= env                                                   # shared modulation
    return x / x.std()

# process A: Gaussian surrogate of B (phase randomization)
# keeps |FFT(B)| exactly, randomizes phases => identical PSD, coupling destroyed.
def phase_randomize(x, rng):
    X   = np.fft.rfft(x)
    mag = np.abs(X)
    ph  = rng.uniform(0, 2 * np.pi, len(X)); ph[0] = 0.0      # keep DC real
    if len(x) % 2 == 0:
        ph[-1] = 0.0                                          # keep Nyquist real
    return np.fft.irfft(mag * np.exp(1j * ph), n=len(x))

proc_B = multi_carrier_texture(T, rng, CARRIERS, F_ENV, DEPTH)
proc_A = phase_randomize(proc_B, rng); proc_A /= proc_A.std()

# check PSD difference and kurtosis
PSD_A = np.abs(np.fft.rfft(proc_A))**2
PSD_B = np.abs(np.fft.rfft(proc_B))**2
psd_rel_err = np.linalg.norm(PSD_A - PSD_B) / np.linalg.norm(PSD_B)
kurt_A, kurt_B = kurtosis(proc_A), kurtosis(proc_B)
print(f"PSD relative error (A vs B):  {psd_rel_err:.2e}   (=> identical by construction)")
print(f"kurtosis A / B:               {kurt_A:.2f} / {kurt_B:.2f}")

# scattering
scat = Scattering1D(J=J, shape=T, Q=Q)
Sx_A, Sx_B = scat(proc_A), scat(proc_B)
meta = scat.meta()
order = meta['order']
E_A = np.mean(np.abs(Sx_A), axis=1)
E_B = np.mean(np.abs(Sx_B), axis=1)

# energy-weighted distance over the top energy-carrying paths
def weighted_distance(EA, EB, mask, top_frac=0.6):
    idx = np.where(mask)[0]
    ref = 0.5 * (EA[idx] + EB[idx])
    thresh = np.quantile(ref, 1 - top_frac)
    keep   = idx[ref >= thresh]
    num = np.linalg.norm(EA[keep] - EB[keep])
    den = np.linalg.norm(0.5 * (EA[keep] + EB[keep])) + 1e-12
    return num / den

d1 = weighted_distance(E_A, E_B, order == 1)
d2 = weighted_distance(E_A, E_B, order == 2)
print(f"\n")
print(f"order-1 weighted distance: {d1:.3f}   (small -> indistinguishable)")
print(f"order-2 weighted distance: {d2:.3f}   (large -> discriminated)")
print(f"ratio order2/order1:       {d2/d1:.1f}x")

# plot
fig, axes = plt.subplots(3, 2, figsize=(14, 11))
cA, cB = '#2c7bb6', '#d7191c'

axes[0, 0].plot(proc_A[:1500], lw=0.6, color=cA)
axes[0, 0].set_title(f"process A - gaussian surrogate (no coupling)  (kurtosis = {kurt_A:.1f})")
axes[0, 1].plot(proc_B[:1500], lw=0.6, color=cB)
axes[0, 1].set_title(f"process B - shared modulation (coupling)  (kurtosis = {kurt_B:.1f})")

freqs = np.fft.rfftfreq(T); mf = freqs < 0.28
axes[1, 0].plot(freqs[mf], PSD_A[mf], color=cA, lw=0.7, label='A')
axes[1, 0].plot(freqs[mf], PSD_B[mf], color=cB, lw=0.7, alpha=0.6, label='B')
axes[1, 0].set_title(f"power spectra  (rel. error = {psd_rel_err:.1e} -> identical, broadband)")
axes[1, 0].legend(); axes[1, 0].set_xlabel("frequency")
axes[1, 1].plot(freqs[mf], np.abs(PSD_A - PSD_B)[mf], color='gray', lw=0.7)
axes[1, 1].set_title("PSD difference  (machine zero)"); axes[1, 1].set_xlabel("frequency")

def plot_paths(ax, EA, EB, mask, title, top=40):
    idx = np.where(mask)[0]
    ref = 0.5 * (EA[idx] + EB[idx])
    sel = idx[np.argsort(ref)[::-1][:top]]
    sel = sel[np.argsort(sel)]
    xx  = np.arange(len(sel)); w = 0.4
    ax.bar(xx - w/2, EA[sel], width=w, color=cA, label='A', alpha=0.85)
    ax.bar(xx + w/2, EB[sel], width=w, color=cB, label='B', alpha=0.85)
    ax.set_title(title); ax.legend(); ax.set_xlabel("path (top energy)")

plot_paths(axes[2, 0], E_A, E_B, order == 1, f"order-1 energies  (dist = {d1:.2f} -> similar)")
plot_paths(axes[2, 1], E_A, E_B, order == 2, f"order-2 energies  (dist = {d2:.2f} -> different)")

plt.suptitle("texture discrimination: identical broadband PSD, only order-2 scattering separates them",
             fontsize=13)
plt.tight_layout()
plt.show()

The result is stark: order-1 scattering energies are nearly identical between the two processes (consistent with the fact that they share a power spectrum), while order-2 coefficients diverge significantly.

Example 4 - Multifractal Analysis: Capturing Intermittency

One of the most powerful applications of scattering is robust estimation of multifractal properties of stochastic processes. Classical wavelet moment estimators are unstable for heavy-tailed processes because high polynomial moments have large variance. Scattering moments are computed with a non-expansive operator and are therefore statistically stable.

For a self-similar process with Hurst exponent $H$, the renormalized first-order scattering satisfies: $\tilde{S}_X(j) := \frac{\mathbb{E}[|X \otimes \psi_j|]}{\mathbb{E}[|X \otimes \psi_0|]} = 2^{jH}$

and the deviation from linearity of $\log \tilde{S}_X(j)$ vs. $j$ measures intermittency (the curvature of the scaling exponent $\zeta(q)$).
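The scaling law can be checked without any scattering library. The sketch below is a simplified stand-in: it replaces $\psi_j$ by a Haar-like increment filter at lag $2^j$ (an assumption for this example) and uses standard Brownian motion, for which $H = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2**16
walk = np.cumsum(rng.standard_normal(T))          # Brownian motion, H = 0.5

scales = np.arange(1, 9)                          # j = 1, ..., 8
# mean absolute increment at lag 2^j plays the role of E|X * psi_j|
mean_abs = np.array([
    np.mean(np.abs(walk[2**j:] - walk[:-2**j])) for j in scales
])
log_ratio = np.log2(mean_abs / mean_abs[0])       # renormalize by the finest scale
H_est = np.polyfit(scales, log_ratio, 1)[0]       # slope of log2 S(j) versus j
print(f"estimated Hurst exponent: {H_est:.3f}  (theory: 0.5)")
```

The fitted slope should be close to 0.5, since the mean absolute increment of Brownian motion at lag $n$ grows like $\sqrt{n}$.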

"""
example 4: multifractal analysis with scattering moments
======================================================
compare three stochastic processes with different intermittency:
  - fractional Brownian Motion (fBm): Gaussian, self-similar, H=0.7
  - Ornstein-Uhlenbeck (OU):          Gaussian stationary, finite-range correlation
  - multifractal Random Walk (MRW):   Non-Gaussian, intermittent

install: pip install kymatio numpy scipy matplotlib
"""

import numpy as np
import matplotlib.pyplot as plt
from kymatio.numpy import Scattering1D
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
T   = 2**14
J   = 8
Q   = 1   # Q=1 for multiscale analysis (log-scale resolution)

# process generators
def fractional_brownian_motion(T, H, rng):
    """
    approximate fBm via spectral synthesis (random phases, power-law amplitude spectrum);
    this is a simple approximation, not the exact Davies-Harte method.
    H in (0,1): Hurst exponent. H=0.5 -> standard Brownian motion.
    """
    freqs   = np.fft.rfftfreq(T)[1:]       # skip DC
    phases  = rng.uniform(0, 2 * np.pi, len(freqs))
    amplitudes = freqs ** (-(H + 0.5))     # power-law spectral density
    W       = amplitudes * np.exp(1j * phases)
    W       = np.concatenate([[0], W])
    fbm     = np.fft.irfft(W, n=T)
    return (fbm - fbm.mean()) / fbm.std()

def ornstein_uhlenbeck(T, theta, rng):
    """
    OU process: dX = -theta*X dt + dW. stationary, short-range correlation.
    theta: mean-reversion speed.
    """
    x  = np.zeros(T)
    dt = 1 / T
    for i in range(1, T):
        x[i] = x[i-1] - theta * x[i-1] * dt + np.sqrt(dt) * rng.standard_normal()
    return (x - x.mean()) / x.std()

def multifractal_random_walk(T, lambda2, rng, n_scales=10):
    """
    multifractal random walk (Bacry & Muzy, 2003) approximation.
    lambda2 > 0 controls intermittency (larger -> more intermittent).
    constructed as: X(t) = cumsum(noise(t) * exp(omega(t)))
    where omega is a Gaussian log-volatility with logarithmic correlations.
    """
    # approximate the log-volatility by summing Gaussian fields with a 1/f spectrum
    freqs = np.fft.rfftfreq(T)[1:]                 # skip DC
    # target covariance: C(t1 - t2) ~ lambda2 * log(T / |t1 - t2|)
    log_vol = np.zeros(T)
    for _ in range(n_scales):
        phase    = rng.uniform(0, 2 * np.pi, len(freqs))
        amp      = freqs ** (-0.5)
        # prepend a zero DC bin so each amplitude sits at its own frequency (as in the fBm generator)
        W        = np.concatenate([[0], amp * np.exp(1j * phase)])
        # irfft divides by T; rescaling by sqrt(T / 2) gives var(w) ~ log(T)
        w        = np.fft.irfft(W, n=T) * np.sqrt(T / 2)
        log_vol += w * np.sqrt(lambda2 / n_scales)
    envelope = np.exp(log_vol - log_vol.var() / 2)
    noise    = rng.standard_normal(T)
    signal   = noise * envelope
    # integrate to get random walk
    signal   = np.cumsum(signal) / np.sqrt(T)
    return (signal - signal.mean()) / signal.std()

# generate processes
fbm = fractional_brownian_motion(T, H=0.7, rng=rng)
ou  = ornstein_uhlenbeck(T, theta=50, rng=rng)
mrw = multifractal_random_walk(T, lambda2=0.04, rng=rng)

processes = {'fBm (H=0.7)': fbm, 'OU process': ou, 'MRW (intermittent)': mrw}
colors    = {'fBm (H=0.7)': '#2c7bb6', 'OU process': '#1a9641', 'MRW (intermittent)': '#d7191c'}

# scattering moments
scat  = Scattering1D(J=J, shape=T, Q=Q)
meta  = scat.meta()
order = meta['order']
scales_j1 = meta['j'][order == 1, 0]    # scale index for order-1 paths

def renormalized_scattering(x, Sx, order, scales_j1):
    """
    compute normalized first-order scattering moments:
        tilde_S(j) = E[|U[j]x|] (approximated by spatial mean)
    and return log2(tilde_S(j)) vs j for scaling analysis.
    """
    S1 = Sx[order == 1]                  # shape: [n_scales, T//2^J]
    E1 = np.mean(np.abs(S1), axis=1)     # mean over time positions
    # normalize by the coarsest scale
    E1_norm = E1 / (E1[-1] + 1e-12)
    return scales_j1, np.log2(E1_norm + 1e-12)

fig, axes = plt.subplots(2, 3, figsize=(15, 9))

for col, (name, proc) in enumerate(processes.items()):
    color = colors[name]
    kurt  = kurtosis(np.diff(proc))     # kurtosis of increments
    Sx    = scat(proc)

    # signal realization
    axes[0, col].plot(proc[:1000], lw=0.6, color=color)
    axes[0, col].set_title(f"{name}\n(increment kurtosis = {kurt:.1f})")
    axes[0, col].set_xlabel("time")
    if col == 0:
        axes[0, col].set_ylabel("amplitude")

    # log-scattering vs. scale (slope = Hurst exponent for self-similar)
    j_idx, log_E = renormalized_scattering(proc, Sx, order, scales_j1)

    # fit a line to the scaling region (all scales)
    valid = np.isfinite(log_E)
    if valid.sum() >= 2:
        p = np.polyfit(j_idx[valid], log_E[valid], 1)
        fit_line = np.polyval(p, j_idx)
        H_est = p[0]
    else:
        fit_line = np.zeros_like(j_idx, dtype=float)
        H_est = float('nan')

    axes[1, col].plot(j_idx, log_E, 'o-', color=color, lw=1.5, ms=5,
                      label='scattering moments')
    axes[1, col].plot(j_idx, fit_line, '--', color='black', lw=1.2, alpha=0.6,
                      label=f'slope ≈ {H_est:.2f}')
    axes[1, col].set_title(f"Log-scattering scaling  (slope = H estimate)")
    axes[1, col].set_xlabel("scale j")
    if col == 0:
        axes[1, col].set_ylabel(r"$\log_2 \tilde{S}(j)$")
    axes[1, col].legend(fontsize=9)

    print(f"{name:25s}  kurtosis = {kurt:6.1f},  estimated H = {H_est:.3f}")

plt.suptitle("multifractal analysis via scattering moments", fontsize=13)
plt.tight_layout()
plt.show()

The three processes are chosen to show distinct scaling behavior:

fBm: linear log-scattering curve with slope $\approx H$ - the transform correctly recovers the Hurst exponent
OU: non-power-law curve (finite correlation length → rolls off at large scales)
MRW: non-linear log-scattering curve with clear curvature - the “signature” of multifractality

The curvature $\zeta(2) - 2\zeta(1) < 0$ is directly detectable from the decay of second-order scattering coefficients $\tilde{S}(j_1, j_2)$ as a function of $j_2 - j_1$, providing a statistically robust intermittency estimator.

10. Extensions and Applications

Roto-Translation Scattering

For images, the scattering transform extends to the roto-translation group $G_\text{rot} \cong \mathbb{R}^2 \rtimes SO(2)$, building joint invariants to both translations and rotations. The key distinction from a separable approach (first translation-invariant, then rotation-invariant) is that the joint representation can discriminate textures that a separable one cannot - for example, distinguishing a texture from its mirror image.

Time-Frequency Scattering for Audio

Audio recognition requires stability to both time-warps and frequency transpositions. The signal is first lifted to the time-frequency plane via a scalogram $z(t, \lambda) = |x \otimes \psi_\lambda(t)|$, and then a joint two-dimensional wavelet decomposition is applied to $z$ along time $t$ and log-frequency $\log \lambda$. This joint time-frequency scattering has been used as a front end for audio classification systems.

Quantum Chemistry (Solid Harmonic Scattering)

For 3D molecular signals, rotational and translational invariance are physically mandated - quantum-mechanical energies cannot depend on the molecule’s orientation. Scattering representations over $SO(3)$ using solid harmonic wavelets achieve competitive accuracy on QM7/QM9 datasets for energy regression, with formal stability guarantees.

Graph and Manifold Scattering

For data on graphs (social networks, molecular graphs), there is no global group structure. The scattering formalism extends by replacing Euclidean wavelets with diffusion wavelets built from the graph Laplacian $L = D - A$. The $k$-th diffusion wavelet captures signal variations at the $k$-th diffusion time scale. Geometric stability is now expressed in terms of metric perturbations of the graph structure.
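One common way to build such wavelets, assumed here for illustration, starts from the lazy diffusion operator $T = \frac{1}{2}(I + D^{-1/2} A D^{-1/2})$, which equals $I - \frac{1}{2} D^{-1/2} L D^{-1/2}$, and takes differences of dyadic powers of $T$. The function name and the ring graph below are choices made for this sketch.

```python
import numpy as np

def diffusion_wavelets(adjacency, J):
    """Dyadic diffusion wavelets from a lazy diffusion operator (one common choice)."""
    n = len(adjacency)
    degree = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    normalized_adj = d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]
    T = 0.5 * (np.eye(n) + normalized_adj)        # lazy diffusion, spectrum in [0, 1]
    wavelets = [np.eye(n) - T]                    # psi_0 = I - T
    power = T                                     # T^(2^0)
    for _ in range(1, J + 1):
        next_power = power @ power                # T^(2^j)
        wavelets.append(power - next_power)       # psi_j = T^(2^(j-1)) - T^(2^j)
        power = next_power
    return wavelets, power                        # power = T^(2^J) acts as the low-pass

# ring graph on 16 nodes: each node is linked to its two neighbours
n = 16
adjacency = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
wavelets, lowpass = diffusion_wavelets(adjacency, J=3)

rng = np.random.default_rng(0)
x = rng.standard_normal(n)                        # a signal on the graph nodes
first_order = [np.mean(np.abs(w @ x)) for w in wavelets]
print("first-order graph scattering:", np.round(first_order, 4))
```

By construction the wavelets and the final low-pass operator telescope to the identity, so no part of the signal is discarded; the first-order coefficients are node averages of $|\psi_j x|$, mirroring the Euclidean cascade.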

11. Summary

Property	Fourier Modulus	Scattering
Translation invariant	Y	Y (asymptotically)
Stable to additive noise	Y	Y
Lipschitz to deformations	N	Y
Energy conserving	Y (but phase information is lost)	Y
Captures higher-order moments	N	Y (order $m$ captures up to $2^m$-th moments)
Generalizes to non-Euclidean	N	Y (Lie groups, graphs, manifolds)
Filters learned from data	N/A	N (mathematically fixed)

The scattering transform occupies a unique position: it is simultaneously a theoretically grounded signal processing tool and a practical deep learning architecture. Its provable properties make it a mathematical template for understanding what CNNs implicitly learn when trained on structured signal domains - and its computable, non-learned nature makes it a powerful feature extractor in the data-scarce regime where large networks overfit.

References

Mallat, S. (2012). Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10), 1331–1398.
Bruna, J., & Mallat, S. (2013). Invariant scattering convolution networks. IEEE TPAMI, 35(8), 1872–1886.
Waldspurger, I. (2017). Exponential decay of scattering coefficients. SampTA.
Oyallon, E., & Mallat, S. (2015). Deep roto-translation scattering for object classification. CVPR.
Andreux, M. et al. (2020). Kymatio: Scattering transforms in Python. JMLR, 21(60), 1–6.
Bacry, E., & Muzy, J.-F. (2003). Log-infinitely divisible multiscale random walk processes. Communications in Mathematical Physics, 236(3), 449–475.