Random Features

February 15, 2023

Random features for machine learning were introduced in the kernel-machine setting by Rahimi & Recht, NIPS 2007. This note explains the idea, the theory (Bochner’s theorem), and shows three short Python tutorials: approximating a kernel, watching the approximation converge, and using random features to make kernel regression scale.

1. The problem with kernels

Kernel methods build a predictor as a linear combination of kernel evaluations:

\[f(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i)\]

A common choice is the Gaussian (RBF) kernel:

\[k(x, x') = \exp\!\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)\]

The catch: fitting requires working with the \(N \times N\) Gram matrix. Solving the linear system costs \(O(N^3)\) time and \(O(N^2)\) memory. For large \(N\) this is infeasible.

2. The random-features idea

Replace the implicit, infinite-dimensional kernel feature map with an explicit, finite, randomized map

\[x \in \mathbb{R}^d \;\longmapsto\; z(x) \in \mathbb{R}^D\]

chosen so that the inner product approximates the kernel:

\[k(x, x') \;\approx\; z(x)^\top z(x').\]

Here \(D\) is the number of random features. Unlike the implicit kernel map (which is infinite-dimensional), \(D\) is finite and we control it - typically \(D \gg d\), with accuracy improving as \(D\) grows. Once we have \(z(\cdot)\), we fit a plain linear model in the \(D\)-dimensional feature space, which costs \(O(N D^2)\) instead of \(O(N^3)\). When \(D \ll N\), this is a big win.

3. Bochner’s theorem

A continuous, shift-invariant kernel \(k(x, x’) = k(x - x’)\) that is normalized so \(k(0) = 1\) is the Fourier transform of a probability density \(p(\omega)\):

\[k(x - y) = \int_{\mathbb{R}^d} p(\omega)\, e^{\, i\,\omega^\top (x - y)}\, d\omega = \mathbb{E}_{\omega \sim p}\!\left[ e^{i\omega^\top x}\,\overline{e^{i\omega^\top y}} \right].\]

So if we draw \(\omega_1, \dots, \omega_D \sim p\) i.i.d., the Monte Carlo estimate of the expectation gives our approximation. For the Gaussian kernel with bandwidth \(\sigma\), the density \(p(\omega)\) is itself Gaussian:

\[\omega \sim \mathcal{N}\!\left(0,\; \sigma^{-2} I_d\right).\]

The two feature maps

Both are valid; the real-valued cosine map is what you use in practice.

Complex exponential \(z(x) = \frac{1}{\sqrt{D}}\big[\, e^{i\,\omega_j^\top x} \,\big]_{j=1}^{D}\)

Real cosine (with phase \(b_j \sim \text{Uniform}(0, 2\pi)\)) \(z(x) = \sqrt{\frac{2}{D}}\,\big[\, \cos(\omega_j^\top x + b_j) \,\big]_{j=1}^{D}\)

The \(\frac{1}{\sqrt{D}}\) (resp. \(\sqrt{2/D}\)) factor turns the sum into the average that Monte Carlo requires - this normalization is essential.

4. Example 1: Approximate the kernel

import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """exact Gaussian (RBF) Gram matrix."""
    sq = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2 * X @ Y.T
    )
    return np.exp(-sq / (2 * sigma**2))

class RFF:
    """random Fourier features for the RBF kernel (cosine map)."""
    def __init__(self, n_features=200, sigma=1.0, seed=0):
        self.D, self.sigma, self.rng = n_features, sigma, np.random.default_rng(seed)

    def fit(self, X):
        d = X.shape[1]
        # omega ~ N(0, sigma^-2 I);  b ~ U(0, 2*pi)
        self.W = self.rng.normal(scale=1.0 / self.sigma, size=(d, self.D))
        self.b = self.rng.uniform(0, 2 * np.pi, size=self.D)
        return self

    def transform(self, X):
        proj = X @ self.W + self.b
        return np.sqrt(2.0 / self.D) * np.cos(proj)

# compare approximate vs. exact kernel values
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
sigma = 1.5

rff = RFF(n_features=5000, sigma=sigma).fit(X)
Z = rff.transform(X)

K_approx = Z @ Z.T
K_exact  = rbf_kernel(X, X, sigma=sigma)

print("max abs error:", np.abs(K_approx - K_exact).max())
# -> a small number (~1e-2); shrinks as n_features grows

5. Example 2: Convergence of random-feature kernel

The approximation error of a Monte Carlo estimate shrinks like \(O(1/\sqrt{D})\). Let’s confirm that scaling empirically.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
sigma = 1.0
K_exact = rbf_kernel(X, X, sigma=sigma)

Ds = [10, 30, 100, 300, 1000, 3000, 10000]
errors = []
for D in Ds:
    # average over a few seeds to reduce noise
    errs = []
    for s in range(10):
        Z = RFF(n_features=D, sigma=sigma, seed=s).fit(X).transform(X)
        errs.append(np.linalg.norm(Z @ Z.T - K_exact) / np.linalg.norm(K_exact))
    errors.append(np.mean(errs))

plt.loglog(Ds, errors, "o-", label="relative error")
plt.loglog(Ds, errors[0] * np.sqrt(Ds[0]) / np.sqrt(Ds), "--",
           label=r"$O(1/\sqrt{D})$ reference")
plt.xlabel("number of random features $D$")
plt.ylabel("relative Frobenius error")
plt.legend(); plt.title("random-feature kernel approximation error")
plt.show()

The measured error tracks the \(1/\sqrt{D}\) reference line - exactly what the theory predicts.

6. Example 3: Scalable kernel ridge regression

Here is the payoff. We fit ridge regression in random-feature space and compare against exact kernel ridge regression on a noisy 1-D function.

import numpy as np
import matplotlib.pyplot as plt

# synthetic data
rng = np.random.default_rng(42)
n = 400
X = np.sort(rng.uniform(-4, 4, size=(n, 1)), axis=0)
y = np.sin(2 * X[:, 0]) * np.exp(-0.1 * X[:, 0] ** 2) + 0.1 * rng.standard_normal(n)

sigma, lam = 0.5, 1e-3

# exact kernel ridge regression:  alpha = (K + lam I)^-1 y
K = rbf_kernel(X, X, sigma=sigma)
alpha = np.linalg.solve(K + lam * np.eye(n), y)

Xt = np.linspace(-4, 4, 300)[:, None]
y_exact = rbf_kernel(Xt, X, sigma=sigma) @ alpha

# random-feature ridge regression:  w = (Z^T Z + lam I)^-1 Z^T y
rff = RFF(n_features=300, sigma=sigma, seed=0).fit(X)
Z = rff.transform(X)
w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
y_rff = rff.transform(Xt) @ w

# plot
plt.scatter(X, y, s=8, alpha=0.3, label="data")
plt.plot(Xt, y_exact, "k-", lw=2, label="exact KRR")
plt.plot(Xt, y_rff, "r--", lw=2, label="RFF ridge (D=300)")
plt.legend(); plt.title("kernel ridge regression vs. random features")
plt.show()

The random-feature curve closely matches exact KRR, but the linear solve scales with \(D\) rather than \(N\). When \(N\) runs into the hundreds of thousands, the exact \(O(N^3)\) solve becomes impractical while random features stay cheap.

7. Summary

Random features replace an expensive implicit kernel with a cheap explicit feature map \(z(x)\) such that \(k(x,x’) \approx z(x)^\top z(x’)\).
The justification is Bochner’s theorem plus a Monte Carlo average; error decays as \(O(1/\sqrt{D})\).
For the Gaussian kernel, sample frequencies from \(\mathcal{N}(0, \sigma^{-2}I)\) and use the cosine map with a random phase.
The practical win: \(O(N^3)\) kernel methods become \(O(N D^2)\) linear models.

References

A. Rahimi and B. Recht, Random Features for Large-Scale Kernel Machines, NIPS 2007.