Deep Probabilistic Models

Tutorial 2: An Introduction to Pyro and Variational Inference

In this tutorial, you will cover some Pyro basics and see two types of model implemented.

Recall that the general framework for training models via Variational Inference is to solve the problem

$$\arg \max_{\boldsymbol{\phi}, \boldsymbol{\theta}}\mathbb{E}_{q_{\boldsymbol{\phi}}}\big[\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{Z}) - \log q_{\boldsymbol{\phi}}(\mathbf{Z}|\mathbf{x}) \big]$$

via (stochastic) gradient methods.

In Pyro, this optimization is carried out for us by the SVI class paired with an ELBO loss; we will see this pattern repeatedly below.

First, we shall generate some binary-valued data...
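A minimal sketch of how such data might be simulated is given below. The "true" success probability true_p, the sample size N, and the variable name x_obs are illustrative assumptions, not necessarily those used in the original notebook.

```python
import torch
import pyro
import pyro.distributions as dist

pyro.set_rng_seed(0)

# hypothetical "ground truth" used only to simulate the data
true_p = 0.3
N = 100

# N iid Bernoulli(true_p) draws, stored as a float tensor of 0s and 1s
x_obs = torch.bernoulli(true_p * torch.ones(N))
```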

Beta-Bernoulli Model

We begin with a classic model, a Bernoulli likelihood with a Beta prior. The generative form of the model is

$$p \sim {\rm Beta}(1,1)$$

$$X_k \,|\, p \sim_{\rm iid} {\rm Bernoulli}(p),\quad k=1,\ldots,N.$$

Thus, the variable $p$, which is the parameter of the Bernoulli distribution, is latent. However, we have observed $X_1, \ldots, X_N$ and wish to perform posterior inference on $p$, i.e., to infer the distribution of $p \,|\, X_1, \ldots, X_N$.

The model is implemented in Pyro below. Note that we specify obs for x, which are the observations. This tells Pyro that ${\bf X}$ is an observed variable in our directed graphical model. Note also that we use pyro.plate as a context. To learn more about context managers in Python, have a read here.
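A minimal sketch of such a model is given below; the site names "p" and "x" are illustrative, and the notebook's own cell may differ in details.

```python
def beta_bernoulli(x):
    # Beta(1, 1) prior on the success probability p
    p = pyro.sample("p", dist.Beta(1., 1.))
    # the plate declares the observations as conditionally iid given p;
    # obs=x tells Pyro that this site is observed
    with pyro.plate("data", len(x)):
        pyro.sample("x", dist.Bernoulli(p), obs=x)
```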

Bayesian Inference with VI

Next, we will perform "fully Bayesian" inference over the parameter (as opposed to maximizing the likelihood). To do so, we need to specify the variational family $q_{\phi}$. In Pyro, anything that provides an approximate posterior is called a guide (variational approximations are one example of this, and the main use). One can specify a variational distribution for any model in Pyro by creating a guide: a model that gives the generative form of the variational approximation (thus, you can build all sorts of interesting structure into your approximation if you desire).

As $p \in (0,1)$, we will use the Beta distribution family as our approximation. (It is actually a very good choice: in this case one can derive by hand that the posterior over $p$ is itself a Beta distribution!)

Note that below, we tell Pyro our variational parameters (elements of $\phi$) and the constraints we wish them to have. In the Beta distribution's parametrization, both parameters must be positive. This is easy to specify; note below how we tell Pyro this.
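For concreteness, a guide of this form might look as follows. The parameter names alpha_q and beta_q, the initial values, the learning rate, and the number of SVI steps are illustrative assumptions.

```python
from pyro.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def beta_bernoulli_guide(x):
    # variational parameters (elements of phi), constrained to be positive
    a = pyro.param("alpha_q", torch.tensor(1.0), constraint=constraints.positive)
    b = pyro.param("beta_q", torch.tensor(1.0), constraint=constraints.positive)
    # variational approximation q(p) = Beta(a, b)
    pyro.sample("p", dist.Beta(a, b))

# fit by stochastic maximization of the ELBO
svi = SVI(beta_bernoulli, beta_bernoulli_guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for step in range(2000):
    svi.step(x_obs)
```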

What actually happens in the background is that Pyro optimizes over unconstrained space and then transforms the variational parameters back to the constrained space. For example, it may optimize over $\log(a)$ and $\log(b)$, which are unconstrained, and then transform them back at the end.

The following code plots our approximate posterior by simulating from the variational approximation and fitting a kernel density estimate. (We could also have plotted the Beta pdf directly, but more generally the marginals of the approximation are not known in closed form, so in many cases simulation is the only option.)
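A sketch of this, assuming the guide above and that seaborn is available for the kernel density estimate:

```python
import matplotlib.pyplot as plt
import seaborn as sns  # assumed available for the KDE

# read off the fitted variational parameters (names as defined in the guide above)
a_hat = pyro.param("alpha_q").item()
b_hat = pyro.param("beta_q").item()

# simulate from the variational approximation and plot a kernel density estimate
samples = dist.Beta(a_hat, b_hat).sample((5000,))
sns.kdeplot(samples.numpy())
plt.xlabel("p")
plt.show()
```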

Pyro actually has an MCMC engine as well! It has the same state-of-the-art implementation of the Hamiltonian Monte Carlo No-U-Turn sampler as Stan. Run the code below to approximate the posterior distribution with MCMC.
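A minimal sketch of this, reusing the model from above (the number of samples and warmup steps are arbitrary choices):

```python
from pyro.infer import MCMC, NUTS

nuts_kernel = NUTS(beta_bernoulli)
mcmc = MCMC(nuts_kernel, num_samples=1000, warmup_steps=500)
mcmc.run(x_obs)
```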

After running MCMC, we can call .summary() on the sampler to obtain information about the parameters, including the effective sample size and the Gelman-Rubin diagnostic r_hat.

Note that both MCMC and VI give roughly the same answer! That is neat, but remember (1) this is a simple 1D example, and (2) we know that our variational family (guide) contains the true posterior.

Autoguide

Generally, we may have a complicated model that we don't want to spend the effort crafting a guide for piece by piece. Fortunately, Pyro provides off-the-shelf variational approximations in the autoguide module.

Below we tell Pyro we want a variational approximation of our model BetaBernoulli that is a multivariate normal with diagonal covariance matrix (i.e., independent normals). However, because this is a one-dimensional example, this is the same as saying we want to just have a 1D ${\cal N}(\mu, \sigma^2)$ approximation.

You may recall, however, that the variable $p$ needs to lie in the interval $(0,1)$, and a normal distribution clearly has support outside that interval. The good news is that Autoguide automatically reparametrizes (transforms) everything behind the scenes, so it fits a normal distribution in unconstrained space and then transforms it to $(0,1)$.
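A sketch of how this looks, reusing the model and SVI setup from above (learning rate and step count are again illustrative):

```python
from pyro.infer.autoguide import AutoDiagonalNormal

pyro.clear_param_store()  # discard the hand-crafted guide's parameters

auto_guide = AutoDiagonalNormal(beta_bernoulli)
svi = SVI(beta_bernoulli, auto_guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for step in range(2000):
    svi.step(x_obs)
```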

Other AutoGuides

It is worth noting that Pyro also has an autoguide function for Inverse Autoregressive Flows, and general Normalizing Flows. See here for the autoguide API details.

2. Bayesian Linear Regression featuring Shrinkage Priors

In the next example, we will do Bayesian Linear Regression with two different priors.

The following code defines the Bayesian Linear Regression model.

$$\beta_j \sim_{\rm iid} {\cal N}(0, 10^2), \quad j=1,\ldots,d$$

$$\sigma^2 \sim {\rm InverseGamma}(0.1,\,0.1)$$

$$y_k \,|\, \beta, \sigma^2 \sim_{\rm ind} {\cal N}(\beta^\top {\bf x}_k, \sigma^2), \quad k=1,\ldots,N.$$
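A sketch of how such a model might be written in Pyro; the function and variable names are placeholders, and the notebook's own code may differ.

```python
def linear_regression(x, y=None):
    N, d = x.shape
    # iid N(0, 10^2) priors on the regression coefficients
    with pyro.plate("coefficients", d):
        beta = pyro.sample("beta", dist.Normal(0., 10.))
    # InverseGamma(0.1, 0.1) prior on the noise variance
    sigma2 = pyro.sample("sigma2", dist.InverseGamma(0.1, 0.1))
    # Gaussian likelihood with mean x @ beta and standard deviation sqrt(sigma2)
    with pyro.plate("data", N):
        pyro.sample("y", dist.Normal(x @ beta, sigma2.sqrt()), obs=y)
```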

As the model is more sophisticated and we do not wish to go to the effort of crafting a guide, we can use an "off the shelf" AutoGuide that fits a multivariate normal.
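For example, assuming a design matrix x_obs and responses y_obs have already been simulated (the optimizer settings are illustrative):

```python
from pyro.infer.autoguide import AutoMultivariateNormal

pyro.clear_param_store()
lr_guide = AutoMultivariateNormal(linear_regression)
svi = SVI(linear_regression, lr_guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for step in range(3000):
    svi.step(x_obs, y_obs)
```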

The code below will plot the median (red) and 95% credible intervals (blue) of the approximate posterior (i.e., the variational approximation). The true parameters that generated the data will be plotted as black stars.

Not too bad! However, you will have noticed that the parameter estimates are all over the place. This is unsurprising, as we have a lot of parameters relative to the amount of data.

Horseshoe Prior

In settings where there are a lot of parameters and not much data, it is advantageous to use a prior that shrinks most coefficients toward zero while leaving the genuinely large ones essentially untouched.

The Horseshoe Prior is a prior that encourages exactly this behaviour. It is specially designed to shrink the posterior of any parameters that are small toward zero, while simultaneously leaving larger parameters unaffected. See this paper if you are interested in learning more about this magical prior.

The model in generative form is:

$$\tau \sim {\rm HalfCauchy}(1)$$

$$\lambda_j \sim_{\rm iid} {\rm HalfCauchy}(1), \quad j = 1,\ldots, d.$$

$$\beta_j \,|\, \tau, \lambda_j \sim_{\rm ind} {\cal N}(0, \tau^2 \lambda_j^2), \quad j=1,\ldots,d.$$

$$y_k \,|\, \beta, \sigma^2 \sim_{\rm ind} {\cal N}(\beta^{\top}\mathbf{x}_k, \sigma^2), \quad k=1,\ldots, N.$$

Note that more generally we can pick some other prior for $\tau$.

It is important to note that the above model does not put a prior on $\sigma^2$. We could if we wanted to, but instead we will simply optimize over $\sigma^2$ in the empirical Bayes fashion. Note that in the code below, we tell Pyro that we wish to do this by specifying sigma as a param and not by giving it a sample statement (which would require us to specify a prior for it).

Exercise (optional, but fun!): Write the above model as a directed graphical model using Plate Notation. Be sure to include all conditional independences!

N.B. For simplicity, I have not included an intercept term in the model. It would be easy to add one by appending a column of ones to the matrix $x_{\rm obs}$ and changing $d$ to $d+1$, but we won't worry about that in this tutorial.
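A sketch of how this model might be written (again, names such as horseshoe_regression are placeholders for the notebook's own code):

```python
from pyro.distributions import constraints

def horseshoe_regression(x, y=None):
    N, d = x.shape
    # global shrinkage parameter
    tau = pyro.sample("tau", dist.HalfCauchy(1.))
    # local shrinkage parameters: expand replicates the HalfCauchy d times,
    # and to_event(1) treats the d values as a single vector-valued variable
    lam = pyro.sample("lambda", dist.HalfCauchy(1.).expand([d]).to_event(1))
    # coefficients with scale tau * lambda_j, i.e. variance tau^2 * lambda_j^2
    beta = pyro.sample("beta", dist.Normal(torch.zeros(d), tau * lam).to_event(1))
    # sigma is a pyro.param (empirical Bayes), not a pyro.sample: no prior, just optimization
    sigma = pyro.param("sigma", torch.tensor(1.0), constraint=constraints.positive)
    with pyro.plate("data", N):
        pyro.sample("y", dist.Normal(x @ beta, sigma), obs=y)
```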

Above, you will notice the code uses two new commands: expand and to_event. The former tells Pyro to replicate the distribution across the $d$ coefficients; the latter tells Pyro to treat the resulting elements as a single (vector-valued) random variable, which we need for the autoguide to work, for technical reasons.

As we are being empirically Bayesian about $\sigma^2$, and also because, even if we weren't, the Horseshoe prior induces a posterior that is highly problematic for HMC (due to funnel-like behaviour), we are going to use Variational Learning to fit our model.

In a little bit, you will get to experiment by changing the dimension of the data. We will assume all of our latents (in this case, parameters we are being Bayesian about) can be fit with a variational approximation that is of the form ${\cal N}(\mu, {\rm diag}({\bf v}) + {\bf b}{\bf b}^\top)$, where ${\bf b}$ is a $d$-dimensional vector. For further discussion of variational distributions of this form, see the paper by Ong et al. (2017).
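In Pyro, this corresponds to a low-rank multivariate normal autoguide with rank 1. A sketch, with learning rate and step count again arbitrary:

```python
from pyro.infer.autoguide import AutoLowRankMultivariateNormal

pyro.clear_param_store()
# rank=1 gives a covariance of the form diag(v) + b b^T,
# with b a vector of the same length as the packed latent vector
hs_guide = AutoLowRankMultivariateNormal(horseshoe_regression, rank=1)
svi = SVI(horseshoe_regression, hs_guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for step in range(3000):
    svi.step(x_obs, y_obs)
```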

An Aside:

It is worth noting that Pyro has other variational objective functions besides the ELBO: for example, the Tail-adaptive f-Divergence and the Importance Weighted variational objective. (I believe doubly reparametrized "Sticking the Landing" gradients for the latter are a feature that's coming soon.)

Exercise

Investigate the effectiveness of the Gaussian prior vs the Horseshoe prior for different dimensions of the data d and different numbers of samples n_samples. As Variational Inference/Learning is quite quick, you should be able to do this in high dimensions and with bigger datasets.

3. A (very) Brief Introduction to torch.nn.Module and its use in Pyro

In PyTorch, module objects (torch.nn.Module) are a very important concept. A key aspect of Pyro is that one can use modules everywhere. This allows for very flexible models, and for approaches such as amortized inference (using a neural network to determine the variational parameters given the data).

For now, it suffices to note that a module can be something as simple as an activation function (e.g., nn.Sigmoid), a layer of a neural network (e.g., nn.Linear, nn.Conv2d), or even a composition of other module objects (e.g., a whole neural network can be a single module object).

For those interested in how they are used within Pyro, see the online tutorial Modules in Pyro. Though note that this does require a more advanced knowledge of how to create and use module objects.

Using module objects within one's model (e.g., a Deep Bernoulli model) or one's variational approximation (e.g., the neural nets used in Amortized Inference) is the core of deep probabilistic programming. Pyro can also have module objects that are fully Bayesian, i.e., with priors on all of their parameters. Another example of how one may use modules in the guide is through flexible conditionally structured approximations. For example, our posterior-approximating variational family may take the form:

$${\bf Z}_1 \sim {\cal N}({\bf \mu}_1, \Sigma_1)$$

$${\bf Z}_2 \,|\, {\bf Z}_1 \sim {\cal N}(g_{\xi}({\bf Z}_1), \Sigma_2),$$

where ${\bf g}_{\xi}: \mathbb{R}^{{\rm dim}({\bf Z}_1)} \rightarrow \mathbb{R}^{{\rm dim}({\bf Z}_2)} $ is a neural network, say. Of course, designing such approximations that optimize well and capture the posterior characteristics of interest is the topic of ongoing research, but Pyro in principle can easily implement such things.
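As a purely illustrative sketch of such a guide (assuming a model with latent sites "z1" and "z2", using diagonal covariances for simplicity; g_xi, its architecture, the dimensions, and all parameter names are made-up assumptions):

```python
import torch.nn as nn
from pyro.distributions import constraints

dim_z1, dim_z2 = 4, 3  # hypothetical dimensions
g_xi = nn.Sequential(nn.Linear(dim_z1, 16), nn.Tanh(), nn.Linear(16, dim_z2))

def structured_guide(x):
    # register the network with Pyro so its weights become part of phi
    pyro.module("g_xi", g_xi)
    mu1 = pyro.param("mu1", torch.zeros(dim_z1))
    s1 = pyro.param("s1", torch.ones(dim_z1), constraint=constraints.positive)
    s2 = pyro.param("s2", torch.ones(dim_z2), constraint=constraints.positive)
    # q(z1) = N(mu1, diag(s1^2)); q(z2 | z1) = N(g_xi(z1), diag(s2^2))
    z1 = pyro.sample("z1", dist.Normal(mu1, s1).to_event(1))
    pyro.sample("z2", dist.Normal(g_xi(z1), s2).to_event(1))
```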

These incredible features are beyond the scope of this tutorial; in this section we simply provide a brief overview of how module objects work in base PyTorch. This is also useful to know if you ever use PyTorch on its own.

Below is a very basic walkthrough of creating a module object and how it can be used.

Below we create a new module type called Linear that mimics how the torch.nn.Linear module works under the hood.
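A minimal version might look like this (the notebook's own version may name the parameters differently):

```python
import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # nn.Parameter registers these tensors as trainable parameters of the module
        self.A = nn.Parameter(torch.randn(out_features, in_features))
        self.b = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # works for a single vector x, or for a batch of vectors stacked as rows
        return x @ self.A.t() + self.b
```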

Below, I create a new object LR of type Linear.
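For example (the input and output sizes here are arbitrary choices):

```python
LR = Linear(5, 2)
print(LR)
```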

The forward method of an nn.Module object is its default method. That is, if you call the module as if it were a function, its forward method is what actually gets executed.
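For instance, continuing with the LR object created above:

```python
x = torch.randn(5)
y = LR(x)   # equivalent to LR.forward(x)
print(y)
```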

Calling parameters() will not actually display the parameters, because it returns a generator object. This just means it is something that yields values when iterated over.

We can view all the parameters in a module by iterating through its parameters() generator.

We can do the above and also show the names by iterating with the named_parameters() generator. Having all parameters contained in a module object makes it easy to pass things to an optimizer object (e.g., torch.optim.Adam): we just need to give it the parameters generator upon initialization. It is also easy to tell Pyro that all parameters in a module should be part of $\theta$ in a model, or of $\phi$ in the variational approximation, and hence optimized over.
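For example, continuing with LR from above:

```python
# parameters() yields the tensors themselves...
for p in LR.parameters():
    print(p.shape)

# ...while named_parameters() also yields their names
for name, p in LR.named_parameters():
    print(name, p.shape)

# the parameters() generator is exactly what a torch optimizer expects
optimizer = torch.optim.Adam(LR.parameters(), lr=0.01)
```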

Note that by default all parameters in the layer have requires_grad set to True, so PyTorch tracks operations made on them for backpropagation.

The purpose of the above was to demonstrate constructing a custom module. As a linear operation is very simple, we could have just used PyTorch's built-in nn.Linear module.
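For instance (sizes again arbitrary):

```python
builtin = nn.Linear(5, 2)   # PyTorch's built-in equivalent of our custom module
print(builtin.weight.shape, builtin.bias.shape)
```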

Note that nn.modules.Sigmoid is an nn.Module, as are most activation functions.

Vectorization

While it is very natural to think of a linear module as applying the transformation ${\bf y} = \mathrm{A}{\bf x} + {\bf b}$ to a single vector, above we (and PyTorch in its own linear module) do not actually do it that way. The reason is that we want to vectorize, that is, to do the computation for several vectors $\mathbf{x}$ in one go by stacking them as the rows of a matrix.

Tricks like this are very important for speeding up PyTorch's performance when using GPUs (as GPUs can do tensor operations very fast, but other things not so much).
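A small check of what vectorization buys us, continuing with the LR object from above:

```python
# three input vectors stacked as the rows of a matrix
X = torch.randn(3, 5)

# one tensor operation processes all rows at once...
Y_batched = LR(X)

# ...and matches applying the layer to each row separately in a loop
Y_loop = torch.stack([LR(X[i]) for i in range(3)])
print(torch.allclose(Y_batched, Y_loop))  # True
```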

It should hopefully not surprise you that using

does not yield the same answer, as the latter computes a product of two matrices (treats x as a matrix and not as two rows).

The advantage of this vectorization is that we can compute everything with one tensor operation and no for loops; GPUs are very good at such operations!

4. Pyro Tutorials on Variational Autoencoders

Where Pyro really shines is in implementing sophisticated models with techniques such as amortized inference. This is quite advanced. However, the above tutorial should leave you well equipped to begin to understand some of the more advanced Pyro examples. Have a read through the following two examples, and see how much you can now follow!

N.B. In the Pyro tutorials below, the plate diagram for the VAE has a circle around the parameter vector $\theta$. If one follows the convention that variables that are random (have a prior) are circled in the plate diagram, then the diagram they have is incorrect. The implemented VAE is not "fully Bayesian" with respect to $\theta$, but simply maximizes the ELBO over the parameter values (recall we discussed why this may be a better choice in the lecture).

Variational Autoencoders Tutorial

Conditional Variational Autoencoders Tutorial