personal notes

Normalizing Flows

Assuming \(\textbf{z}\in \mathbb{R}^n\) and \(\textbf{x} \in \mathbb{R}^n\) are two random variables related by an invertible mapping \(\textbf{z}=f(\textbf{x})\), the change-of-variables formula gives

\[p_x(\textbf{x}) = p_z(f(\textbf{x}))\left|\text{det}\left(\frac{\partial f(\textbf{x})}{\partial \textbf{x}}\right)\right|\]
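As a quick sanity check (a worked example, not part of the original notes): with a standard normal prior and the one-dimensional invertible map \(z = f(x) = \frac{x-\mu}{\sigma}\), the formula recovers the density of \(\mathcal{N}(\mu, \sigma^2)\):

\[p_x(x) = p_z\left(\frac{x-\mu}{\sigma}\right)\left|\frac{df}{dx}\right| = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]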

Given a dataset \(\textbf{x}_1, …, \textbf{x}_N\) and a prior distribution \(p_z(\textbf{z})\), the idea is to model the data density \(p_x(\textbf{x})\) through an unknown invertible function \(\textbf{z}=f(\textbf{x})\). Typically, \(f\) is approximated by a neural network whose weights are obtained by minimizing the negative log-likelihood:

\[-\sum_i \log p_x(\textbf{x}_i) = -\sum_i \left[\log p_z(f(\textbf{x}_i)) + \log\left|\text{det}\left(J_f(\textbf{x}_i)\right)\right|\right]\]

Here, \( (J_f(\textbf{x}))_{ij} = \frac{\partial f_i(\textbf{x})}{\partial x_j}\) is the Jacobian matrix of the function \( f: \mathbb{R}^n \rightarrow \mathbb{R}^n \), which must be square and invertible for the determinant to be defined.
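A minimal PyTorch sketch of this objective, assuming a `flow` module that returns both \(\textbf{z}=f(\textbf{x})\) and the per-sample \(\log\left|\text{det}\left(J_f(\textbf{x})\right)\right|\) for a batch (the names `flow`, `log_det`, and `prior` are placeholders, not a specific library API):

```python
import torch

def nll_loss(flow, x, prior):
    """Negative log-likelihood of a batch x under the flow.

    Assumes `flow(x)` returns (z, log_det), where `log_det` is
    log|det J_f(x)| per sample, and `prior.log_prob(z)` sums over
    the event dimensions.
    """
    z, log_det = flow(x)
    log_px = prior.log_prob(z) + log_det  # log p_x(x_i) per sample
    return -log_px.mean()                 # average instead of sum over the dataset

# example prior: standard normal over D dimensions
# prior = torch.distributions.Independent(
#     torch.distributions.Normal(torch.zeros(D), torch.ones(D)), 1)
```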

In Real NVP, the function \(f\) is obtained by stacking affine coupling layers. More precisely, given an input \(\textbf{x} \in \mathbb{R}^D\) and a split index \(d<D\), each coupling layer produces an output \(\textbf{y} \in \mathbb{R}^D\) defined as:

\[\begin{aligned} \textbf{y}_{1:d} &= \textbf{x}_{1:d}\\ \textbf{y}_{d+1:D} &= \textbf{x}_{d+1:D} \odot \exp(s(\textbf{x}_{1:d})) + t(\textbf{x}_{1:d}) \end{aligned}\]

where \(s\) (scale) and \(t\) (translation) are neural networks mapping \(\mathbb{R}^d\) to \(\mathbb{R}^{D-d}\), and \(\odot\) denotes element-wise multiplication.
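A minimal sketch of one such coupling layer in PyTorch, with `s_net` and `t_net` as small MLPs from \(\mathbb{R}^d\) to \(\mathbb{R}^{D-d}\) (the class name and layer sizes are my own choices, not the paper's architecture):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, D, d, hidden=128):
        super().__init__()
        self.d = d
        # s and t map the first d coordinates to the remaining D - d
        self.s_net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, D - d))
        self.t_net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, D - d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.s_net(x1), self.t_net(x1)
        y1 = x1                        # identity on the first d coordinates
        y2 = x2 * torch.exp(s) + t     # affine transform of the rest
        log_det = s.sum(dim=1)         # log|det J| = sum_j s(x_{1:d})_j
        return torch.cat([y1, y2], dim=1), log_det
```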

A nice property of this affine coupling layer design is that it is invertible, without having to invert \(s\) or \(t\):

\[\begin{cases} \textbf{y}_{1:d} &= \textbf{x}_{1:d}\\ \textbf{y}_{d+1:D} &= \textbf{x}_{d+1:D} \odot \exp(s(\textbf{x}_{1:d})) + t(\textbf{x}_{1:d}) \end{cases} \Longleftrightarrow \begin{cases} \textbf{x}_{1:d} &= \textbf{y}_{1:d}\\ \textbf{x}_{d+1:D} &= (\textbf{y}_{d+1:D}-t(\textbf{y}_{1:d})) \odot \exp(-s(\textbf{y}_{1:d})) \end{cases}\]
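Continuing the `AffineCoupling` sketch above, the inverse can be written directly from these formulas (again just an illustrative sketch):

```python
def coupling_inverse(layer, y):
    """Invert the AffineCoupling sketch above: recover x from y."""
    d = layer.d
    y1, y2 = y[:, :d], y[:, d:]
    s, t = layer.s_net(y1), layer.t_net(y1)
    x2 = (y2 - t) * torch.exp(-s)      # note the minus sign on s
    return torch.cat([y1, x2], dim=1)

# round-trip check:
# layer = AffineCoupling(D=6, d=3)
# x = torch.randn(4, 6)
# y, _ = layer(x)
# assert torch.allclose(coupling_inverse(layer, y), x, atol=1e-5)
```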

The Jacobian of an affine coupling layer is:

\[J(\textbf{x}) = \frac{\partial \textbf{y}}{\partial \textbf{x}} = \begin{bmatrix} \mathbb{I}_d & \textbf{0}_{d\times (D-d)}\\ \frac{\partial \textbf{y}_{d+1:D}}{\partial \textbf{x}_{1:d}} & \text{diag}(\exp(s(\textbf{x}_{1:d}))) \end{bmatrix}\]

Since this matrix is lower block-triangular, its determinant is simply the product of its diagonal entries:

\[\left| \text{det}(J(\textbf{x})) \right| = \exp\left(\sum_{j=1}^{D-d} s(\textbf{x}_{1:d})_j\right)\]
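This formula can be sanity-checked numerically against autograd on a single input, for example with `torch.autograd.functional.jacobian` applied to the `AffineCoupling` sketch above:

```python
import torch

layer = AffineCoupling(D=4, d=2)
x = torch.randn(4)

# full D x D Jacobian of the layer evaluated on one sample
J = torch.autograd.functional.jacobian(
    lambda v: layer(v.unsqueeze(0))[0].squeeze(0), x)
log_det_autograd = torch.logdet(J)

# closed form: sum of the scale outputs s(x_{1:d})
log_det_closed = layer.s_net(x[:2].unsqueeze(0)).sum()

print(torch.allclose(log_det_autograd, log_det_closed, atol=1e-5))
```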

In the implementation, a binary mask \(\textbf{b}=(1,…,1,0,…,0)\) (with \(d\) ones) is used to express the affine coupling layer:

\[\begin{aligned} \textbf{y} &= \textbf{x} \odot \exp\left((1-\textbf{b})\odot s(\textbf{b}\odot \textbf{x})\right) + (1-\textbf{b}) \odot t(\textbf{b}\odot \textbf{x})\\ \textbf{x} &= (\textbf{y}-(1-\textbf{b})\odot t(\textbf{b}\odot \textbf{y})) \odot \exp\left(-(1-\textbf{b})\odot s(\textbf{b}\odot \textbf{y})\right) \end{aligned}\]

With this formulation, the log-determinant becomes:

\[\log\left|\text{det}(J(\textbf{x}))\right| = \sum_{j=1}^D \left((1-\textbf{b})\odot s(\textbf{b}\odot \textbf{x})\right)_j\]
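A sketch of this masked formulation in PyTorch, following the two formulas above (names are placeholders; a full Real NVP would stack several such layers and alternate the mask so that every coordinate gets transformed):

```python
import torch
import torch.nn as nn

class MaskedAffineCoupling(nn.Module):
    def __init__(self, D, mask, hidden=128):
        super().__init__()
        self.register_buffer("b", mask)   # binary mask, e.g. (1,...,1,0,...,0)
        self.s_net = nn.Sequential(nn.Linear(D, hidden), nn.ReLU(), nn.Linear(hidden, D))
        self.t_net = nn.Sequential(nn.Linear(D, hidden), nn.ReLU(), nn.Linear(hidden, D))

    def forward(self, x):
        bx = self.b * x
        s, t = self.s_net(bx), self.t_net(bx)
        y = x * torch.exp((1 - self.b) * s) + (1 - self.b) * t
        log_det = ((1 - self.b) * s).sum(dim=1)
        return y, log_det

    def inverse(self, y):
        by = self.b * y
        s, t = self.s_net(by), self.t_net(by)
        return (y - (1 - self.b) * t) * torch.exp(-(1 - self.b) * s)

# usage:
# D = 6
# mask = torch.tensor([1., 1., 1., 0., 0., 0.])
# layer = MaskedAffineCoupling(D, mask)
# y, log_det = layer(torch.randn(8, D))
```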