This post delves into Variational Bayesian (VB) inference, comparing and contrasting it with traditional Bayesian techniques such as MCMC. We begin by introducing the fundamental concepts of Variational Inference (VI) and explore how it applies to Bayesian inference. We then present the optimization problem behind VB and work through some basic examples, one of which involves a neural network.
Notations
- $\Omega$ is the probability space.
- For $X$ a random variable defined on $\mathcal{R}$, we will call $P_X$ its probability measure and $F_X$ its C.D.F. If $X$ has a density, we will call $f_X$ its p.d.f.
- When using a lower case $x$, we denote a particular value and not the random variable $X$ itself. In the rest of this article, we will suppose that all the studied random variables admit a density.
Bayesian modeling and the limits of MCMC methods
Variational Bayesian (VB) techniques provide a means of approximating Bayesian inference problems using optimization. In addition to being a foundation for Bayesian neural networks, VB techniques can overcome some of the limitations of traditional Bayesian methods like Markov Chain Monte Carlo (MCMC).
Bayesian Inference (recall)
Given one observed random variable $X = (X_1, …, X_{p})$ and one latent random variable (unobserved) $Z = (Z_1, …, Z_{q})$, Bayesian inference involves computing the posterior distribution $f_{Z | X = x}$ of a latent variable $Z$ given observed data $X$. We update our belief about the latent variable based on the observed data using Bayes’ theorem:
$$ f_{Z | X = x} (z) = \frac{f_{X | Z = z}(x) f_Z(z)}{f_X(x)} $$
Here, $f_{X | Z = z}$ is the likelihood density of $X$ given $Z = z$, $f_Z$ is the prior distribution representing our prior belief about the unobservable data, and $f_X$ is the marginal density or evidence.
We will call a latent variable any variable which cannot be observed directly. It could be a quantity derived from a model generating the observed variable, or the generative parameter of a model itself (like the weights of a neural network).
In order to compute $f_{Z | X = x}$, we need to compute $\frac{f_{X | Z = z}(x) f_Z(z)}{f_X(x)}$ and this will be our end-goal.
Why do we need Variational Inference?
Computing the evidence $f_X(x)$ can be challenging in practice. It is often intractable, as it requires solving the integral $f_X(x) = \int_{\mathcal{R}} f_{X | Z = z}(x) f_Z(z)\, dz$. MCMC methods are commonly used to estimate the posterior distribution without computing the evidence, but they can be slow for complex models. In contrast, VB simplifies the problem by transforming it into an optimization problem.
Examples
Bernoulli Bayesian Inference
The Bernoulli inference example involves estimating the probability of obtaining a head from an unfair coin. Let $X$ be the binary random variable representing the outcome of the coin toss. We make the following assumptions about the model:
- $X \mid \Theta = \theta \sim Bernoulli(\theta)$
- $\Theta \sim Beta(\alpha, \beta)$
Here, $\alpha$ and $\beta$ are known and represent prior knowledge about the coin. If we have strong prior knowledge about the value, we choose $\alpha$ and $\beta$ such that $Beta(\alpha, \beta)$ has a small variance and an expected value reflecting our knowledge. If we lack prior knowledge, we choose a weak prior.
We want to estimate the posterior distribution using a sample $x = (x_1, x_2, …, x_n)$ of $n$ independent coin tosses. We use Bayes’ formula:
$$ \begin{aligned} P(\Theta = \theta | X_1 = x_1, …, X_n = x_n) &= \frac{\prod_i^n P(X_i = x_i| \Theta=\theta)\, P(\Theta=\theta)}{P(X = x)} \\ &= \frac{P(X = x| \Theta=\theta) P(\Theta=\theta)}{P(X = x)} \end{aligned} $$
And we have:
- $P(X = x| \Theta=\theta) = \theta^{\sum x_i}(1-\theta)^{n - \sum x_i}$ (Bernoulli)
- $P(\Theta=\theta) = \frac{1}{B(\alpha, \beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$ (Beta)
- $P(X = x) = \int_{[0, 1]} P(X = x| \Theta=\theta) P(\Theta = \theta) d\theta = \frac{B(\alpha + \sum x_i, \beta + n - \sum x_i)}{B(\alpha, \beta)}$
Where $B$ denotes the beta function, which normalizes the beta distribution. Putting everything together, we have:
$$ \begin{aligned} P(\Theta = \theta | X_1 = x_1, …, X_n = x_n) &= \frac{\theta^{\sum x_i + \alpha - 1}(1-\theta)^{n - \sum x_i + \beta - 1}}{B(\alpha + \sum x_i, \beta + n - \sum x_i)} \\ &= Beta(\alpha + \sum x_i, \beta + n - \sum x_i) \end{aligned} $$
By counting the number of heads observed during the experiment, we can derive the posterior distribution analytically: it is again a Beta distribution. The parameters $\alpha$ and $\beta$ act as “pseudo-counts,” as they determine the strength of the prior and hence its impact on the posterior; higher values of $\alpha$ and $\beta$ give the prior more influence. Note that we were only able to compute the posterior in closed form because the Beta prior is the conjugate prior of the Bernoulli distribution, which simplifies the calculation of the marginal. In general, this computation is not possible. The sketch below illustrates the update numerically.
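Here is a minimal sketch of this conjugate update in Python; the true bias, the prior pseudo-counts, and the sample size are illustrative choices, not values from the derivation above.

```python
# Beta-Bernoulli conjugate update: posterior is Beta(alpha + #heads, beta + #tails).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_theta = 0.7                 # unknown coin bias (illustrative)
alpha, beta = 2.0, 2.0           # weak prior centred on 0.5

x = rng.binomial(1, true_theta, size=100)   # n independent coin tosses
heads = x.sum()

posterior = stats.beta(alpha + heads, beta + len(x) - heads)
print(f"posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```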
Gaussian Mixture Inference
Take the Bayesian mixture of Gaussians model with $K$ components, a fixed prior variance $\tau^2$ on the means, and a fixed observation variance $\sigma^2$:
- $\mu_k \sim N(0, \tau^2)$ for $k=1,…K$ (unobserved)
- $K_i \sim Cat(\pi)$ (unobserved)
- $X_i \sim N(\mu_{K_i}, \sigma^2)$
The posterior distribution is equal to:
$$ f_{\mu, K \mid X = x}(\mu, k) = \frac{\prod_{j=1}^K f_{\mu_j} (\mu_j) \prod_{i=1}^n f_{K_i} (k_i)\, f_{X_i | K_i = k_i, \mu = \mu}(x_i)}{\int \sum_{k'} \prod_{j=1}^K f_{\mu_j} (\mu_j) \prod_{i=1}^n f_{K_i} (k'_i)\, f_{X_i | K_i = k'_i, \mu = \mu}(x_i)\, d\mu} $$
The denominator involves an integral over the means and a sum over all $K^n$ possible assignments; it is not computable analytically, and the problem becomes intractable. In this post, we propose to use VB to solve this inference problem. The snippet below simulates data from this generative model.
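To make the model concrete, here is a short sketch simulating its generative process; all numeric values ($K$, $n$, $\tau$, $\sigma$, uniform weights $\pi$) are illustrative choices.

```python
# Generative model: means ~ N(0, tau^2), assignments ~ Cat(pi),
# observations ~ N(mu_{k_i}, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 500
tau, sigma = 5.0, 1.0
pi = np.full(K, 1.0 / K)             # uniform mixture weights

mu = rng.normal(0.0, tau, size=K)    # unobserved component means
k = rng.choice(K, size=n, p=pi)      # unobserved assignments
x = rng.normal(mu[k], sigma)         # observed data
```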
Variational Inference
Variational Inference (VI) offers a way to sidestep the computation of the marginal $f_X(x)$ by transforming the problem into an optimization one. The key is to approximate the posterior density $f_{Z | X = x} (z)$ directly with a density function $g: \mathcal{R} \rightarrow \mathcal{R}_+$ selected from a family of density functions $\mathcal{L}$. The objective is to achieve a close match between $g$ and $f_{Z | X = x}$.
The Kullback-Leibler divergence
The Kullback-Leibler (KL) divergence is a measure of how different two probability distributions are from each other, defined for any distributions $f$ and $g$ over $\mathcal{R}$ as:
$$ KL(g || f) = \int_{\mathcal{R}} g(z) \log \frac{g(z)}{f(z)} dz = \mathbf{E}_g [\log \frac{g(Z)}{f(Z)}] $$
This measure is asymmetric: $KL(g || f)$ is not necessarily equal to $KL(f || g)$. Where $g(z)$ is high and $f(z)$ is high, the contribution to the divergence is small, indicating a good match between the two distributions. Where $g(z)$ is high and $f(z)$ is low, the contribution is large, indicating a bad match. Where $g(z)$ is low, the value of $f(z)$ barely affects the divergence.
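As a quick numeric illustration of this asymmetry, here is a small sketch comparing $KL(g || f)$ and $KL(f || g)$ for two arbitrary discrete distributions:

```python
# KL(p || q) = sum_z p(z) * log(p(z) / q(z)) for discrete distributions.
import numpy as np

g = np.array([0.8, 0.1, 0.1])
f = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

print(kl(g, f))   # KL(g || f)
print(kl(f, g))   # KL(f || g) -- a different value
```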
The VB problem
After finding a suitable measure to match $g \in \mathcal{L}$ and $f_{Z | X = x}$ using the KL divergence, we can rephrase the problem as follows: $$ g^* = \argmin_{g\in\mathcal{L}} (KL(g || f_{Z | X = x})) $$
To solve this problem, we can study some properties of the KL divergence:
$$ \begin{aligned} KL(g || f_{Z | X = x}) &= E_g [\log \frac{g(Z)}{f_{Z | X = x}(Z)}] \\ &= E_g [\log g(Z)] - E_g [\log f_{Z | X = x}(Z)] \\ &= E_g [\log g(Z)] - E_g [\log f_{Z, X}(Z, x)] + E_g [\log f_X (x)] \\ &= E_g [\log g(Z)] - E_g [\log f_{Z, X}(Z, x)] + \log f_X (x) \end{aligned} $$
The KL divergence decomposes into three terms, one of which, $\log f_X (x)$, does not depend on $g$: it is the log of the marginal that we cannot compute. We can therefore optimize the KL divergence up to this constant. We can now define our final objective function:
$$ ELBO(g) = E_g [\log f_{Z, X}(Z, x)] - E_g [\log g(Z)] $$
called the Evidence Lower BOund (ELBO). The ELBO is so named because it is a lower bound on the evidence $\log f_X (x)$: $$ \begin{aligned} \log f_X (x) &= KL(g || f_{Z | X = x}) + ELBO(g) \\ &\geq ELBO(g) \end{aligned} $$
We can now reformulate the problem as finding the optimal $g$ function within a family of functions $\mathcal{L}$: $$ g^* = \argmax_{g\in \mathcal{L}}(ELBO(g)) $$
The final step is to choose an appropriate family of functions $\mathcal{L}$ for the optimization of the ELBO.
Mean-Field hypothesis
Definition of Mean-Field Variational families
One of the most generic variational families is the mean-field family. For a vector $Z=(Z_1, …, Z_m)$ of latent variables, we suppose that the approximate density function $g(z)$ factorizes as:
$$ g(z) = g(z_1, …, z_m) = \prod^m_i g_i (z_i) $$
which means that the posterior is approximated with a density of independent variables. This hypothesis is very useful: it deliberately ignores more complex dependence structure between the latent variables, and it applies to a large family of problems.
Coordinate Ascent Optimization
With the mean-field hypothesis and using the chain rule of probability, we can rewrite the $ELBO$ objective as:
$$ \begin{aligned} ELBO(g) &= E_g [\log f_{Z, X}(Z, x)] - E_g [\log g(Z)] \\ &= E_g [\log f_{X} (x) \prod^m_i f_{Z_i | Z_{1:i-1}, X = x}(Z_i)] - \sum_i E_g [\log g_i(Z_i)] \\ &= \log f_{X} (x) + \sum^m_i E_g [\log f_{Z_i | Z_{1:i-1}, X = x}(Z_i)] - \sum_i E_g [\log g_i(Z_i)] \end{aligned} $$
The goal is to iterate over the random variables $Z_i$ and optimize one dimension at a time. In optimization, this algorithm is called coordinate ascent. At each step $i$, we solve the following problem:
$$ \begin{aligned} \argmax_{g_i \in \mathcal{L}}(ELBO(g_i, g_{-i})) &= \argmax_{g_i \in \mathcal{L}}[E_g [\log f_{Z_i | Z_{-i}, X = x}(Z_i)] - E_g [\log g_i(Z_i)]] \\ &= \argmax_{g_i \in \mathcal{L}}[\int_{\mathcal{R}} g_i(z_i) E_{g_{-i}}[\log f_{Z_i | Z_{-i}, X = x}(z_i)]dz_i - \int_{\mathcal{R}} g_i(z_i) \log g_i(z_i) dz_i] \\ \end{aligned} $$
Also, since $g_i$ is a density, the problem is subject to the constraint $\int_{\mathcal{R}} g_i(z_i)\, dz_i = 1$. The Lagrangian of this problem can be written:
$$ F(g_i, \lambda) = \int_{\mathcal{R}} g_i(z_i) E_{g_{-i}}[\log f_{Z_i | Z_{-i}, X = x}(z_i)]dz_i - \int_{\mathcal{R}} g_i(z_i) \log g_i(z_i) dz_i + \lambda (\int_{\mathcal{R}} g_i(z_i)\, dz_i - 1) $$
Setting the functional derivative of the Lagrangian to zero gives the following solution:
$$ \begin{aligned} \forall z_i, \quad g_i^*(z_i) &= \frac{\exp \{E_{g_{-i}}[\log f_{Z, X}(Z_{-i}, z_i, x)]\}}{\int \exp \{E_{g_{-i}}[\log f_{Z, X}(Z_{-i}, z_i, x)]\} dz_i} \\ &\propto \exp \{E_{g_{-i}}[\log f_{Z, X}(Z_{-i}, z_i, x)]\} \end{aligned} $$
We now have an update rule for optimizing the ELBO objective with Coordinate Ascent optimization. For a chosen probability density $f_{Z, X}$ the final algorithm is the following:
- Initialize the $g_i^0$
- While the ELBO has not converged: for every index $i$, set $g_{i}^{n+1}(z_i) \propto \exp \{E_{g^{n}_{-i}}[\log f_{Z, X}(Z_{-i}, z_i, x)]\}$
- Return $g$
This algorithm yields easy-to-compute updates for the Gaussian mixture example (see the sketch below), and more generally for Bayesian models built from exponential-family distributions. Here is a good reference studying the problem: https://www.cs.cmu.edu/~epxing/Class/10708-17/notes-17/10708-scribe-lecture13.pdf
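For the Gaussian mixture, the update rule gives a Gaussian $q(\mu_k) = N(m_k, s_k^2)$ for each mean and a categorical $q(k_i)$ for each assignment (see the reference above for the derivation). Below is a minimal NumPy sketch of these coordinate updates, assuming uniform mixture weights; variable names, initialization, and numeric values are illustrative.

```python
# Coordinate-ascent VI (CAVI) for the Bayesian Gaussian mixture with
# known variances: prior mu_k ~ N(0, tau^2), likelihood x_i ~ N(mu_{k_i}, sigma^2).
import numpy as np

rng = np.random.default_rng(1)

# Toy data from the generative model (same assumptions as the earlier sketch).
K, n, tau, sigma = 3, 500, 5.0, 1.0
true_mu = rng.normal(0.0, tau, size=K)
x = rng.normal(true_mu[rng.choice(K, size=n)], sigma)

def cavi_gmm(x, K, tau, sigma, n_iter=100):
    m = rng.normal(0.0, 1.0, size=K)   # variational means of q(mu_k)
    s2 = np.ones(K)                    # variational variances of q(mu_k)
    for _ in range(n_iter):
        # Update q(k_i): phi_ik proportional to exp(E_q[log N(x_i; mu_k, sigma^2)]),
        # which reduces (up to constants) to the logits below.
        logits = (np.outer(x, m) - (m**2 + s2) / 2.0) / sigma**2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Update q(mu_k): precision-weighted combination of prior and assigned data.
        s2 = 1.0 / (1.0 / tau**2 + phi.sum(axis=0) / sigma**2)
        m = s2 * (phi.T @ x) / sigma**2
    return m, s2, phi

m, s2, phi = cavi_gmm(x, K, tau, sigma)
print(np.sort(m), np.sort(true_mu))   # variational means vs. true means
```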
Variational Inference and Bayesian Neural Networks
Variational Inference provides a way to turn neural networks from a frequentist approach into a Bayesian one by treating the weights of the network as random variables. The joint distribution of these random variables defines a probability measure on the set of neural networks. This approach offers several benefits, including the ability to quantify the uncertainty of predictions and outputs and the ability to easily incorporate priors. For example, in a simple classification problem, we have:
- A dataset $D = (X, Y)$
- Some weights or parameters $\Theta$ of a neural network with a predefined architecture, which are treated as random
- A likelihood $f_{Y |X, \Theta}$ represented with a categorical softmax distribution.
- A prior $f_{\Theta}$ on the neural network parameters, represented with a normal distribution
- A posterior $f_{\Theta |Y, X}$ on our NN parameters (after seeing the data)
For this setup, we keep the mean-field hypothesis ($g(\Theta)=\prod_i g_i(\Theta_i)$) and derive the ELBO objective for the problem, using the same notation as in the previous section. The ELBO objective can be written as:
$$ \begin{aligned} ELBO(g) &= E_g[\log f_{\Theta, Y | X}(\Theta, Y)] - E_g[\log g(\Theta)] \\ &= E_g[\log f_{Y | X, \Theta}(Y)] + E_g[\log f_{\Theta}(\Theta)] - E_g[\log g(\Theta)] \\ &= E_g[\log f_{Y | X, \Theta}(Y)] - KL(g || f_{\Theta}) \end{aligned} $$
In this formula, the first term $E_g[\log f_{Y | X, \Theta}(Y)]$ is the expected log-likelihood of the model, which can be the softmax likelihood for a classification problem. The second term, $KL(g || f_{\Theta})$, is the Kullback-Leibler divergence between the distribution $g$ and the prior distribution $f_{\Theta}$ of the model parameters. This term acts as a regularizer: a smaller variance in $f_{\Theta}$ leads to a higher value of $KL(g || f_{\Theta})$ in the ELBO, reflecting stronger prior knowledge.
To compute the ELBO, we use the closed-form expression for the KL divergence between Gaussian distributions to evaluate $KL(g || f_{\Theta})$ between the approximate posterior $g$ and the prior $f_{\Theta}$. Meanwhile, to estimate the log-likelihood term $E_g[\log f_{Y | X, \Theta}(Y)]$, we sample $\Theta$ values from the current approximate posterior $g$ using Monte Carlo simulation; usually, only one sample is used for this estimate.
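As a concrete illustration, here is a minimal sketch of this one-sample estimate for a toy "network" reduced to a single linear layer with a sigmoid likelihood (a binary special case of the softmax). The data, the prior variance, and all names are illustrative assumptions, not part of the post.

```python
# One-sample Monte-Carlo ELBO under a mean-field Gaussian posterior over weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)

prior_var = 1.0                      # variance of the N(0, prior_var) prior

def elbo_estimate(mu_q, log_var_q):
    """ELBO estimate for g = N(mu_q, diag(exp(log_var_q)))."""
    var_q = np.exp(log_var_q)
    # Reparameterization: theta = mu + sigma * eps, eps ~ N(0, I).
    theta = mu_q + np.sqrt(var_q) * rng.normal(size=mu_q.shape)
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    log_lik = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # Closed-form KL( N(mu_q, var_q) || N(0, prior_var) ), summed over weights.
    kl = 0.5 * np.sum(var_q / prior_var + mu_q**2 / prior_var
                      - 1.0 + np.log(prior_var) - log_var_q)
    return log_lik - kl

print(elbo_estimate(np.zeros(2), np.zeros(2)))
```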
For making predictions, we sample multiple $\Theta$ values from the final posterior approximation $g^*$ and calculate the mean and variance of the posterior prediction $(Y^0 | X^0, D)$, where $X^0$ is a new predictor. The optimal $g^*$ is obtained with a regular neural-network optimizer, such as Adam, rather than with coordinate ascent optimization.
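Continuing the toy sketch above, prediction amounts to averaging over weight samples, with the spread of the sampled predictions serving as an uncertainty estimate; `predict` and its arguments are illustrative, not a library API.

```python
def predict(x0, mu_q, log_var_q, n_samples=200):
    """Predictive mean and variance at a new input x0,
    obtained by sampling weights theta from the fitted posterior g*."""
    eps = rng.normal(size=(n_samples, mu_q.shape[0]))
    thetas = mu_q + np.exp(0.5 * log_var_q) * eps    # samples from g*
    probs = 1.0 / (1.0 + np.exp(-(thetas @ x0)))     # one prediction per sample
    return probs.mean(), probs.var()

mean, var = predict(np.array([1.0, -1.0]), np.zeros(2), np.zeros(2))
print(mean, var)
```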
For a Python implementation of this method, you can refer to the blog post https://neptune.ai/blog/bayesian-neural-networks-with-jax, which provides a good example. Note, however, some issues with the terminology used in that post: the author seems to confuse the prior with the initial values, and what is referred to as $\beta$ is actually the prior variance. Variational Bayes with neural networks also applies to other problems, such as Variational Autoencoders; we may explore this topic further in a future post.
Conclusion
In this post, we have explored the concept of Variational Inference, how it can be used for complex Bayesian inference problems, and how it can turn a frequentist view of neural networks into a Bayesian one.
To optimize the problem, we introduced the Evidence Lower Bound (ELBO) as an objective function, and described how it can be estimated using Monte Carlo methods.
Finally, let’s mention the NumPyro library, which provides an efficient and user-friendly way to perform Bayesian inference, including an implementation of Variational Inference for Bayesian neural networks.
For further reading, I recommend the following resources:
- Variational Inference: A Review for Statisticians, by David Blei, Alp Kucukelbir, and Jon McAuliffe
- Deep Learning and Bayesian Methods, by Yarin Gal and Zoubin Ghahramani
- The Numpyro documentation and tutorials, available at https://num.pyro.ai/