Variational Inference: Kullback-Leibler Divergence and Likelihood
An important relationship in variational learning is that between the Kullback-Leibler divergence and the likelihood. It also happens to be a relation that I always fumble and have to scramble through papers to find when I need it. So I'm writing it here once and for all.
Consider a dataset $X$. It might be difficult to model the probability distribution of $X$, $P(X)$, directly, so we assume that $X$ is generated by a set of random variables, $Z$, which may be easier to model or offer a more succinct description of $X$.
The likelihood of $X$ is thus given by
$$P(X) = \int P(X~|~Z) P(Z) dZ$$
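To make the marginalisation concrete, here is a minimal sketch for a discrete $Z$, where the integral becomes a sum. The three-state latent variable and all the numbers are made up purely for illustration.

```python
# Hypothetical toy model: Z takes three values and X is a single fixed observation.
prior = [0.5, 0.3, 0.2]        # P(Z = k), assumed values
likelihood = [0.9, 0.2, 0.1]   # P(X | Z = k) for the observed X, assumed values

# For discrete Z the integral over Z becomes a sum over its values.
marginal = sum(l * p for l, p in zip(likelihood, prior))  # P(X)
print(f"P(X) = {marginal:.4f}")
```

With a continuous or high-dimensional $Z$ this sum turns into an integral that usually has no closed form, which is what motivates the approximation that follows.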
Now we don't know the true distribution of $Z$; in particular, the posterior $P(Z|X)$ is usually intractable. So we again make the assumption that there is another distribution $Q(Z)$ which can be easily modeled. The intuition is then to make sure that $Q(Z)$ closely approximates the true distribution so that $P(X)$ can be properly calculated and maximized. This amounts to minimizing the Kullback-Leibler divergence between the two.
So let's consider the Kullback-Leibler divergence between $Q(Z)$ and the posterior $P(Z|X)$.
$$\begin{align} \mathrm{KL}(Q(Z) || P(Z|X)) =& \mathbb{E}_{z\sim Q}\left[\log Q(Z) - \log P(Z|X)\right] \\ =& \mathbb{E}_{z\sim Q}\left[\log Q(Z) - \log P(X|Z) - \log P(Z)\right] + \log P(X) \\ =& \mathrm{KL}(Q(Z) || P(Z))-\mathbb{E}_{z\sim Q}\left[\log P(X|Z)\right] + \log P(X) \end{align} $$
Note $Q(Z)$ may or may not depend on $X$.
Rearranging the equation, we get
$$\log P(X) = \mathrm{KL}(Q(Z|X) || P(Z|X)) - \mathrm{KL}(Q(Z|X) || P(Z))+\mathbb{E}_{z\sim Q}\left[\log P(X|Z)\right]$$
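As a sanity check, the rearranged identity can be verified numerically on a small discrete model. This is only a sketch: the three-state $Z$, the probabilities, and the choice of $Q$ are arbitrary assumptions, and `kl` is a helper defined here, not a library function.

```python
import math

# Hypothetical toy model (all numbers are assumptions for illustration).
prior = [0.5, 0.3, 0.2]        # P(Z)
likelihood = [0.9, 0.2, 0.1]   # P(X | Z) for one fixed observation X
q = [0.6, 0.3, 0.1]            # an arbitrary approximating distribution Q(Z|X)

# Exact evidence and posterior, available here only because Z is tiny.
evidence = sum(l * p for l, p in zip(likelihood, prior))          # P(X)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]  # P(Z | X)

def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# E_{z ~ Q}[log P(X | Z)]
expected_loglik = sum(qi * math.log(li) for qi, li in zip(q, likelihood))

lhs = math.log(evidence)
rhs = kl(q, posterior) - kl(q, prior) + expected_loglik
print(f"log P(X) = {lhs:.6f}, right-hand side = {rhs:.6f}")
```

The two printed values should agree up to floating-point error for any valid choice of `q`.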
Here, I've made the dependence of $Q$ on $X$ explicit since in actual computation, we really only want to see values of $Z$ which are significant given the dataset.
The above equation can also be written as $\log P(X) = \mathrm{KL}(Q(Z|X) || P(Z|X)) + \mathcal{L}$ where $\mathcal{L} = - \mathrm{KL}(Q(Z|X) || P(Z))+\mathbb{E}_{z\sim Q}\left[\log P(X|Z)\right]$ is also known as the variational lower bound. It is a lower bound on $\log P(X)$ because $\mathrm{KL}(\cdot)$ is never negative, so $\log P(X) \geq \mathcal{L}$.
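By maximising $\mathcal{L}$, we are in effect choosing $Q$ such that $\mathrm{KL}(Q(Z|X) || P(Z))$ is as small as possible, on top of maximising the expected log probability of the dataset given $Z$. And since $\log P(X)$ does not depend on $Q$, maximising $\mathcal{L}$ over $Q$ also minimises $\mathrm{KL}(Q(Z|X) || P(Z|X))$.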
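The bound can also be checked numerically. The sketch below takes $\mathcal{L}$ to be the last two terms of the rearranged equation, $-\mathrm{KL}(Q(Z|X) || P(Z)) + \mathbb{E}_{z\sim Q}\left[\log P(X|Z)\right]$, and evaluates it for a few candidate $Q$ on a made-up three-state model. The numbers and the helper names `kl` and `elbo` are assumptions for illustration.

```python
import math

# Hypothetical toy model (all numbers are assumptions for illustration).
prior = [0.5, 0.3, 0.2]        # P(Z)
likelihood = [0.9, 0.2, 0.1]   # P(X | Z) for one fixed observation X

evidence = sum(l * p for l, p in zip(likelihood, prior))          # P(X)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]  # P(Z | X)
log_evidence = math.log(evidence)

def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def elbo(q):
    """Variational lower bound: -KL(Q || P(Z)) + E_Q[log P(X | Z)]."""
    expected_loglik = sum(qi * math.log(li) for qi, li in zip(q, likelihood))
    return -kl(q, prior) + expected_loglik

candidates = {
    "prior": prior,
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "posterior": posterior,
}
for name, q in candidates.items():
    gap = log_evidence - elbo(q)  # equals KL(Q || P(Z|X)) by the identity above
    print(f"{name:9s}  L = {elbo(q):.6f}  gap to log P(X) = {gap:.6f}")
```

Every candidate should give a value of $\mathcal{L}$ no larger than $\log P(X)$, with the gap closing (up to floating-point error) only when $Q$ is the exact posterior.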