personal notes

Random Features

Random features for machine learning were first investigated within the kernel machine learning framework (see Rahimi & Recht, NIPS 2007).

In kernel methods, the approximating function is a linear combination of kernel evaluations at the \(N\) training points: \(f(x) = \sum_{i=1}^{N} \alpha_i k(x,x_i)\)

The kernel function can be, for example, a Gaussian (RBF) kernel:

\[k(x,x') = \exp\left( -\frac{1}{2} \|x-x'\|^2_2 \right)\]
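As a concrete reference (a minimal NumPy sketch with my own variable names, not code from the paper), the exact Gaussian kernel and a kernel ridge regression fit look like this; the coefficients \(\alpha_i\) come from solving an \(N \times N\) linear system, which is the expensive step discussed next:

```python
import numpy as np

def gaussian_kernel(X, Y):
    """Exact Gaussian kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / 2)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-0.5 * sq_dists)

def fit_kernel_ridge(X, y, lam=1e-3):
    """Solve (K + lam * I) alpha = y; cost grows as O(N^3) in the training size."""
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_test):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return gaussian_kernel(X_test, X_train) @ alpha
```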

For large datasets, direct application of kernel methods is time-consuming due to the inversion of the \(N \times N\) kernel matrix. The idea of kernel approximation is to use a randomized feature map \( x \in \mathbb{R}^d \rightarrow z(x,\omega) \) built from \( D \) sampled frequencies \( \omega_1, \dots, \omega_D \), with \( D \ll N \), such that:

\[k(x,x') \approx \frac{1}{D} \sum_{j=1}^{D} z(x,\omega_j)\, z(x',\omega_j)\]

For the Gaussian kernel, the feature map can be written as:

\[z(x,\omega) = \exp(i \omega^\top x)\]

where \( \omega \sim \mathcal{N}(0,\mathbb{I}_d) \), a \(d\)-dimensional standard normal.

An alternative, real-valued form of the feature map for the Gaussian kernel is:

\[z(x,\omega) = \sqrt{2} \cos(\omega^\top x + b)\]

where \( b \sim \text{Uniform}(0,2\pi) \).
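Putting the pieces together, here is a minimal sketch of the cosine random Fourier features (my own variable names, assuming the unit-bandwidth Gaussian kernel above); the inner product of the scaled feature vectors approaches the exact kernel as \(D\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 2000
X = rng.normal(size=(10, d))

W = rng.normal(size=(D, d))              # omega_j ~ N(0, I_d)
b = rng.uniform(0.0, 2 * np.pi, size=D)  # b_j ~ Uniform(0, 2*pi)

# z(x, omega_j) = sqrt(2) cos(omega_j^T x + b_j); the 1/sqrt(D) factor folds the
# 1/D average into the inner product, so Z @ Z.T approximates the kernel matrix.
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)
K_approx = Z @ Z.T

sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
K_exact = np.exp(-0.5 * sq)

print(np.max(np.abs(K_approx - K_exact)))  # shrinks roughly like 1/sqrt(D)
```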

With this kernel approximation, the approximating function becomes:

\[f(x) = \sum_{j=1}^{D} \beta_j z(x,\omega_j)\]
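In practice the \(\beta_j\) can be fit by ordinary (ridge) linear regression in the \(D\)-dimensional random feature space, so the expensive step is a \(D \times D\) solve rather than \(N \times N\). A minimal self-contained sketch on toy data (names and targets are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, N = 5, 500, 1000
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)   # toy regression targets

W = rng.normal(size=(D, d))                      # omega_j ~ N(0, I_d)
b = rng.uniform(0.0, 2 * np.pi, size=D)          # b_j ~ Uniform(0, 2*pi)
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)       # random Fourier features

lam = 1e-3
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)   # D x D system

# f(x) = sum_j beta_j z(x, omega_j) on new points
X_test = rng.normal(size=(5, d))
Z_test = np.sqrt(2.0 / D) * np.cos(X_test @ W.T + b)
f_test = Z_test @ beta
```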

The fundamental theory behind kernel approximation with random features is Bochner's theorem, which states that every shift-invariant positive-definite kernel \( k(x-y) \) is the Fourier transform of a non-negative function \( p(\omega) \):

\[k(x-y) = \int_{\mathbb{R}^d} p(\omega) \exp\left( i\omega^\top (x-y) \right) d\omega = \mathbb{E}_{p(\omega)}\left[ \exp(i\omega^\top x) \left(\exp(i\omega^\top y)\right)^\star \right]\]

For the Gaussian kernel \( k(x) = \exp( - \|x\|^2/2) \), the corresponding Fourier transform \( p(\omega) \) is also a Gaussian function. This leads to the choice of a standard normal distribution for \( \omega \).
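As a quick numerical sanity check of Bochner's theorem in this case (a sketch with my own variable names), a Monte Carlo average of \( \exp(i\omega^\top (x-y)) \) over \( \omega \sim \mathcal{N}(0,\mathbb{I}_d) \) should approach \( \exp(-\|x-y\|^2/2) \):

```python
import numpy as np

rng = np.random.default_rng(2)
d, M = 3, 200_000
delta = rng.normal(size=d)                   # delta = x - y

omega = rng.normal(size=(M, d))              # omega ~ p(omega) = N(0, I_d)
mc = np.mean(np.exp(1j * omega @ delta))     # E_p[exp(i omega^T (x - y))]
exact = np.exp(-0.5 * delta @ delta)         # k(x - y) = exp(-||x - y||^2 / 2)

print(mc.real, exact)   # real parts agree; the imaginary part averages to ~0
```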