Note: This article is a summary of Chapter 11 in Pattern Recognition and Machine Learning (PRML).
Sampling from complex distributions, such as those that are high-dimensional, lack a known normalization term, or do not have a closed-form mathematical expressionm, is a fundamental problem in machine learning. These techniques are widely applied in deep learning contexts, including inference-time sampling for large language models and the denoising process in diffusion models. To study this problem, this article introduces foundational sampling strategies and progresses to more advanced scenarios.
Transform Sampling
Assume we can draw samples from a uniform distribution, $z \sim U(0, 1)$. Consider the objective of sampling from an exponential distribution, $y \sim \mathcal{E}(\lambda)$, where the probability density function is $f_y(y) = \lambda \exp(-\lambda y)$ for $y \geq 0$. One approach is to sample $z$ and apply a transformation such that the result follows the distribution of $y$. Specifically, we seek a function $g: [0, 1] \mapsto [0, +\infty)$ such that $g(z) \sim \mathcal{E}(\lambda)$. The transformation is derived as follows:
$$ \begin{aligned} & f_y(y) = f_z(z) \left|\frac{\mathrm dz}{\mathrm dy} \right| = \left|\frac{\mathrm dz}{\mathrm dy} \right|\\\ \Rightarrow{} & z = \int_{-\infty}^y f_y(\hat y)\mathrm d\hat y = 1 - \exp(-\lambda y) \\\ \Rightarrow{} & y = -\lambda^{-1} \ln (1-z) = g(z) \end{aligned} $$Through this transformation, $y$ follows the target exponential distribution.
This strategy generalizes to multivariate distributions. A standard example is the Box-Muller method for generating Gaussian variables. First, we draw a pair of values $z_1, z_2 \sim U(-1, 1)$ and discard the pair if $z_1^2 + z_2^2 > 1$. This yields a uniform distribution of 2D points within the unit circle, characterized by the density $f_{z_1,z_2}(z_1, z_2) = 1/\pi$. We then compute:
$$ y_1 = z_1 \sqrt{\frac{-2 \ln(s)}{s}}, \quad y_2 = z_2 \sqrt{\frac{-2 \ln(s)}{s}} $$where $s = z_1^2 + z_2^2$. The joint distribution is given by:
$$ \begin{aligned} &f_{y_1, y_2}(y_1, y_2) = f_{z_1, z_2}(z_1, z_2) \left|\frac{\partial(z_1, z_2)}{\partial(y_1, y_2)} \right| \\\ ={}&\left[\frac{1}{\sqrt{2\pi}}\exp(-y_1^2/2) \right] \left[\frac{1}{\sqrt{2\pi}}\exp(-y_2^2/2) \right] \end{aligned} $$This result demonstrates that $y_1$ and $y_2$ are independent and both follow the standard normal distribution $\mathcal{N}(0, 1)$.
However, transform sampling requires an analytically tractable cumulative distribution function, which is difficult to construct for most distributions in practice (e.g., if the integral lacks a closed form). Consequently, more general and robust sampling frameworks are required.
Rejection Sampling
To sample from a target distribution $f(x)$ using rejection sampling, we introduce a proposal distribution $g(x)$ and a constant $M \geq 1$ such that $Mg(x) \geq f(x)$ for all $x$. In this context, $Mg(x)$ is called the envelope distribution. The algorithm operates as follows:
- Sample $x \sim g(x)$.
- Sample $u \sim U(0, 1)$.
- If $u \leq \frac{f(x)}{Mg(x)}$, accept $x$ as a valid sample from $f(x)$.
- Otherwise, reject $x$ and return to step 1.
The probability of accepting a sample is $1/M$. Therefore, to maximize sampling efficiency, the proposal distribution $g(x)$ must closely approximate the shape of the target distribution $f(x)$.
Adaptive Rejection Sampling
Finding a suitable proposal distribution $g(x)$ a priori is often challenging. Gilks and Wild (1992) introduced adaptive rejection sampling (ARS) to dynamically construct and iteratively refine the envelope distribution. ARS requires the target distribution $f(x)$ to be log-concave (i.e., $\log f(x)$ is strictly concave), a property held by many standard distributions.
For log-concave distributions, piecewise linear tangents of the log-density naturally form an upper bound, facilitating efficient rejection sampling on the fly. Suppose we have evaluated the density at a set of points, yielding $S_n = \{(x_1, f(x_1)), (x_2, f(x_2)), \dots, (x_n, f(x_n))\}$, where $x_1 < x_2 < \dots < x_n$. We construct two piecewise linear functions to approximate $\log f(x)$:
- Upper Envelope $u_n(x)$: Formed by the intersection of the tangent lines to $\log f(x)$ at each $x_i$. Due to log-concavity, $u_n(x) \geq \log f(x)$ is mathematically guaranteed.
- Lower Envelope $l_n(x)$: Formed by the secant lines connecting adjacent points $(x_i, \log f(x_i))$ and $(x_{i+1}, \log f(x_{i+1}))$. It follows that $l_n(x) \leq \log f(x)$.
The algorithmic procedure is as follows:
- Initialize the set $S_n$ with at least two distinct points.
- Draw $x \sim \exp(u_n(x))$ and $u \sim U(0, 1)$.
- Squeeze test: If $u \leq \exp(l_n(x) - u_n(x))$, accept $x$. This step computationally bypasses the direct evaluation of $f(x)$.
- Rejection test: If the condition in step 3 is not met, evaluate $f(x)$. If $u \leq \exp(\log f(x) - u_n(x))$, accept $x$. Otherwise, reject $x$.
- Update: If $f(x)$ was evaluated in step 4 (regardless of acceptance or rejection), append $(x, f(x))$ to $S_n$ and update the envelope functions to $u_{n+1}(x)$ and $l_{n+1}(x)$.
As $n$ increases, $u_n(x)$ and $l_n(x)$ converge to $\log f(x)$, systematically reducing the rejection rate.
Consider the application of rejection sampling in high-dimensional spaces. Suppose the target is an $n$-dimensional zero-mean Gaussian distribution with covariance $\sigma_p^2 I_n$, and the proposal is a zero-mean Gaussian with covariance $\sigma_q^2 I_n$. For rejection sampling to be valid, we must set $\sigma_q^2 \geq \sigma_p^2$ to ensure a constant $k$ exists such that $k f_q(x) \geq f_p(x)$.
The optimal scaling constant is $k = (\sigma_q / \sigma_p)^n$, resulting in an acceptance rate of $1/k$. Consequently, the acceptance rate decays exponentially as dimensionality increases. While rejection sampling remains highly effective in low-dimensional settings, it becomes computationally intractable for high-dimensional distributions.
Importance Sampling
A primary objective of statistical sampling is to estimate the expectation of a function $f(x)$ (no need to be a probability distribution) under a target distribution $p(x)$. For a set of $n$ independent and identically distributed samples $\{x_i\}_{i=1}^n$ drawn from $p(x)$, the standard Monte Carlo estimate is given by:
$$ \mathbb{E}_{x\sim p}[f(x)] \approx \frac{1}{n} \sum_{i=1}^n f(x_i) $$When direct sampling from $p(x)$ is intractable, importance sampling introduces a tractable proposal distribution, $q(x)$, from which samples can be easily drawn. The expectation can then be rewritten by multiplying and dividing by $q(x)$:
$$ \mathbb{E}_{x\sim p}[f(x)] = \int f(x)\frac{p(x)}{q(x)}q(x)\mathrm{d}x = \mathbb{E}_{x\sim q}\left[f(x)\frac{p(x)}{q(x)}\right] $$In many practical scenarios, the distributions $p(x)$ and $q(x)$ can only be evaluated up to their respective normalization constants, $Z_p$ and $Z_q$. That is, $p(x) = \tilde{p}(x) / Z_p$ and $q(x) = \tilde{q}(x) / Z_q$, where $\tilde{p}(x)$ and $\tilde{q}(x)$ are the unnormalized density functions. Substituting these into the expectation yields:
$$ \mathbb{E}_{x\sim p}[f(x)] = \frac{Z_q}{Z_p} \mathbb{E}_{x\sim q}\left[f(x) \frac{\tilde{p}(x)}{\tilde{q}(x)}\right] \approx \frac{Z_q}{Z_p}\frac{1}{n} \sum_{i=1}^n \tilde{r}_i f(x_i) $$where $\tilde{r}_i = \tilde{p}(x_i) / \tilde{q}(x_i)$ represents the unnormalized importance weights, and the samples $\{x_i\}_{i=1}^n$ are drawn from $q(x)$.
The unknown ratio of normalization constants, $Z_q/Z_p$, can be approximated using the same sample set $\{x_i\}_{i=1}^n$ by evaluating the expectation of the constant function $f(x) \equiv 1$:
$$ 1 = \frac{Z_q}{Z_p} \mathbb{E}_{x\sim q}\left[\frac{\tilde{p}(x)}{\tilde{q}(x)}\right] \approx \frac{Z_q}{Z_p} \frac{1}{n}\sum_{i=1}^n \tilde{r}_i $$Rearranging this expression provides an estimator for the ratio: $Z_p/Z_q \approx \frac{1}{n}\sum_{i=1}^n \tilde{r}_i$. Substituting this back into the expectation estimate yields the self-normalized importance sampling estimator:
$$ \mathbb{E}_{x\sim p}[f(x)] \approx \sum_{i=1}^n w_i f(x_i), \quad \text{where } w_i = \frac{\tilde{r}_i}{\sum_{k=1}^n \tilde{r}_k} $$Here, $w_i$ denotes the normalized importance weight for the $i$-th sample.
To minimize the estimation variance, the theoretically optimal choice for the proposal distribution $q(x)$ should be proportional to $p(x)\cdot |f(x)|$. Specifically, the optimal proposal distribution $q^*(x)$ is formulated as:
$$ q^*(x) = \frac{|f(x)|p(x)}{\int |f(x)|p(x) \mathrm{d}x} $$For a mathematical derivation, let’s analyze the variance of the standard importance sampling estimator:
$$ \hat{I} = \frac{1}{n}\sum_{i=1}^n f(x_i)\frac{p(x_i)}{q(x_i)} $$Its variance is given by
$$ \text{Var}(\hat{I}) = \frac{1}{n} \left( \mathbb{E}_{x\sim q}\left[ \left( f(x)\frac{p(x)}{q(x)} \right)^2 \right] - I^2 \right) $$where $I = \mathbb{E}_{x\sim p}[f(x)]$ is the true expectation. Since $I^2$ is a constant, minimizing the variance is strictly equivalent to minimizing the second moment:
$$ \mathbb{E}_{x\sim q}\left[ \left( f(x)\frac{p(x)}{q(x)} \right)^2 \right] = \int \frac{f^2(x)p^2(x)}{q(x)} \mathrm{d}x $$Applying the Cauchy-Schwarz inequality:
$$ \left( \int \frac{f^2(x)p^2(x)}{q(x)} \mathrm{d}x \right) \left( \int q(x) \mathrm{d}x \right) \ge \left( \int \sqrt{\frac{f^2(x)p^2(x)}{q(x)}} \sqrt{q(x)} \mathrm{d}x \right)^2 $$We know that $\int q(x) \mathrm{d}x = 1$. Thus, the inequality simplifies to:
$$ \int \frac{f^2(x)p^2(x)}{q(x)} \mathrm{d}x \ge \left( \int |f(x)|p(x) \mathrm{d}x \right)^2 $$This implies that the optimal unnormalized proposal distribution should be $q(x) \propto |f(x)|\cdot p(x)$. In practice, the probability density of $q(x)$ should be highly concentrated in regions where $|f(x)|\cdot p(x)$ is large.
Sampling-Importance-Resampling
The efficiency of the rejection sampling strategy is depended on the constant $M$, which is often impractical to determine for some pairs of target distributions $p(z)$ and proposal distributions $q(z)$. To avoid determining the constant, the sampling-importance-resampling (SIR) strategy is proposed.
It has two stages. In the first stage, $L$ samples $z_1, z_2, \cdots, z_L$ are drawn from $q(z)$. Then, in the second stage, the weights $w_1, w_2, \cdots, w_L$ are calculated according to the importance sampling strategy. Finally, a second set of $L$ samples is drawn from the discrete distribution $(z_1, z_2, \cdots, z_L)$ with probabilities given by $w_1, w_2, \cdots, w_L$. The resulting $L$ samples are approximately distributed according to $p(z)$ when $L$ is large:
$$ \begin{aligned} \mathbb P(z\leq a) &= \sum_{l} I(z_l \leq a) w_l \\\ &= \frac{\sum_{l} I(z_l \leq a) \tilde p(z_l) / \tilde q(z_l)}{\sum_{l} \tilde p(z_l) / \tilde q(z_l)} \\\ (L\to\infty) &= \frac{\int I(z\leq a) (\tilde p(z) / \tilde q(z)) q(z)\mathrm dz}{\int (\tilde p(z) / \tilde q(z)) q(z)\mathrm dz} \\\ &= \frac{\int I(z\leq a) \tilde p(z)\mathrm dz}{\int \tilde p(z)\mathrm dz}\\\ &= \int I(z\leq a) p(z) \mathrm dz \end{aligned} $$which is exactly the cumulative distribution function of $p(z)$.
Markov Chain Monte Carlo
Both rejection sampling and importance sampling suffer from the curse of dimensionality. We next introduce a very general and powerful sampling strategy called Markov Chain Monte Carlo (MCMC).
Markov Chains
Before detailing the MCMC algorithm, we first review the basics of Markov chains.
Remember that our goal is to sample from a distribution $p(z)$. Taking advantage of the properties of Markov chains, we can construct a specific Markov chain whose invariant (stationary) distribution is $p(z)$. In such cases, no matter what the initial distribution is, the final distribution will converge to the invariant distribution $p(z)$.
Let $T(z, z^\prime) = p(z^\prime\mid z)$ be the transition probability from $z$ to $z^\prime$. A sufficient (not necessary) condition for ensuring that the required distribution $p(z)$ is invariant is the detailed balance defined by
$$ p(z) T(z, z^\prime) = p(z^\prime) T(z^\prime, z) $$In practice, we often construct the transition probabilities from a set of base transitions $B_1, B_2, \cdots, B_K$:
$$ T(z, z^\prime) = \sum_k \alpha_k B_k(z, z^\prime), \quad\text{where }\sum_k \alpha_k = 1 $$Alternatively, the base transitions may be combined through successive application:
$$ T(z, z^\prime) = \sum_{z_1} \cdots \sum_{z_{K-1}} B_1(z, z_1)\cdots B_{K-1}(z_{K-2}, z_{K-1}) B_K(z_{K-1}, z^\prime) $$If a distribution is invariant with respect to each base transition, then it will be invariant with respect to either of the $T(z, z^\prime)$ given by the above two equations.
The Metropolis-Hastings Algorithm
Similar to rejection sampling, we also need a proposal function $q(z)$ for MCMC. We now introduce the Metropolis-Hastings algorithm. At time step $\tau$, we keep the current state $z_{\tau}$, draw a sample $z^\star$ from the distribution $q(z\mid z_{\tau})$, and accept it with probability $A(z^*, z_{\tau})$ where
$$ A(z^\star, z_{\tau}) = \min\left(1, \frac{\tilde p(z^\star)q(z_{\tau}\mid z^\star)}{\tilde p(z_{\tau}) q(z^\star\mid z_{\tau})} \right) $$If the candidate sample is accepted, then $z_{\tau+1} = z^\star$. Otherwise, $z_{\tau+1}$ is set to $z_{\tau}$ and another sample is drawn from $q(z\mid z_{\tau+1})$. Again, this criterion doesn’t require knowing the normalizing constant $Z_p$ for $p(z)$. We show that $p(z)$ is an invariant distribution of the Markov chain defined by the Metropolis-Hastings algorithm, by showing that detailed balance is satisfied:
$$ \begin{aligned} p(z) q(z^\prime\mid z) A(z^\prime, z) &= \min\left(p(z)q(z^\prime\mid z), p(z^\prime)q(z\mid z^\prime) \right)\\\ &= p(z^\prime) q(z\mid z^\prime) A(z, z^\prime) \end{aligned} $$Therefore, as $\tau\to\infty$, the distribution of $z_{\tau}$ tends to $p(z)$.
Gibbs Sampling
Gibbs sampling is a special case of the Metropolis-Hastings algorithm. Consider we want to sample from the distribution $p(z_1,\cdots,z_M)$, and it’s easy to sample from each conditional distribution $p(z_i\mid z_{-i})$. Then, Gibbs sampling says that we can repeatedly sample from $p(z_i\mid z_{-i})$ and update $z_i$, where the order of the variables ($z_i$) is chosen either in some particular order or randomly from some distribution. The algorithm can be written as:
- Initialize $\{z_i:i=1,2,\cdots,M \}$
- For $\tau = 1, 2, \cdots, T$:
- Sample $z_1^{(\tau+1)} \sim p(z_1\mid z_2^{(\tau)}, z_3^{(\tau)},\cdots,z_M^{(\tau)})$
- Sample $z_2^{(\tau+1)} \sim p(z_2\mid z_1^{(\tau+1)}, z_3^{(\tau)},\cdots, z_M^{(\tau)})$
- …
- Sample $z_M^{(\tau+1)} \sim p(z_M\mid z_1^{(\tau+1)}, z_2^{(\tau+1)}, \cdots, z_{M-1}^{(\tau+1)})$
For a given step that updates the $k$-th variable $z_k$, and the transition probability from $z$ to $z^{\star}$ is given by $q_k(z^{\star}\mid z) = p(z_k^{\star}\mid z_{-k})$. Since the variables $z_{-k}$ are kept unchanged, we have $z_{-k} = z_{-k}^{\star}$. Then, the acceptance probability in the Metropolis-Hastings algorithm is given by:
$$ A(z^*, z) = \frac{p(z^{\star})q_k(z\mid z^{\star})}{p(z)q_k(z^{\star}\mid z)} = \frac{p(z_k^{\star}\mid z_{-k}^{\star})p(z_{-k}^{\star})p(z_k\mid z_{-k}^{\star})}{p(z_k\mid z_{-k})p(z_{-k})p(z_k^{\star}\mid z_{-k})} = 1 $$This shows that Gibbs sampling is a special case of the Metropolis-Hastings algorithm with acceptance probability always equal to $1$.