Recall that the general framework for training models via Variational Inference is to solve the problem

$$\arg \max_{\boldsymbol{\phi}, \boldsymbol{\theta}}\mathbb{E}_{q_{\boldsymbol{\phi}}}\big[\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{Z}) - \log q_{\boldsymbol{\phi}}(\mathbf{Z}|\mathbf{x}) \big]$$

via (stochastic) gradient methods.

In Pyro:

- The `model` function specifies your model, whereas the `guide` function specifies your posterior approximation.
- `sample` denotes a random variable in the model (i.e., an element of the latent variables $\mathbf{z}$ or the observed variables $\mathbf{x}$).
- `param` denotes something you wish to optimize over (i.e., an element of $\phi$ or $\theta$):
    - All `param`s in the `model` function are elements of $\theta$.
    - All `param`s in the `guide` function are elements of $\phi$.
- All inputs to the `model` function denote deterministic quantities that will **not** be optimized.
- `pyro.plate` is used to denote conditional independence.

In [1]:

```
import pyro
import torch as t
import matplotlib.pyplot as plt
t.manual_seed(123)
```

Out[1]:

<torch._C.Generator at 0x7ffe06b6a830>

First, we shall generate some binary-valued data...

In [2]:

```
distX = t.distributions.Bernoulli(0.65)
x_obs = distX.sample([100])
print(x_obs)
```

We begin with a classic model, a Bernoulli likelihood with a Beta prior. The generative form of the model is

$$p \sim {\rm Beta}(1,1)$$

$$X_k \mid p \sim_{\rm iid} {\rm Bernoulli}(p),\quad k=1,\ldots,N.$$

Thus, the variable $p$, which is the parameter of the Bernoulli distribution, is **latent**. However, we have **observed** $X_1, \ldots, X_N$ and wish to perform **posterior inference** on $p$, i.e., infer the distribution of $p \mid X_1, \ldots, X_N$.

The model is implemented in Pyro below. Note that we specify `obs` for `x`, which are the observations. This tells Pyro that ${\bf X}$ is an observed variable in our directed graphical model. Note also that we use `pyro.plate` as a **context**. To learn more about *context managers* in Python, see the Python documentation.
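As a quick illustration of the mechanics (a toy class, not anything from Pyro): a context manager is any object with `__enter__` and `__exit__` methods, and a `with` block guarantees both get called.

```python
events = []

class Tracker:
    # a toy context manager: __enter__ runs on entry, __exit__ on exit
    def __enter__(self):
        events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        events.append("exit")
        return False  # do not suppress exceptions

with Tracker():
    events.append("body")

print(events)  # ['enter', 'body', 'exit']
```

`pyro.plate` works the same way: everything sampled inside the `with` block is marked as conditionally independent across the plate.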

In [3]:

```
import pyro.distributions as dist
import torch
from torch.distributions import constraints
def BetaBernoulli():
    # "sample" statements put variables in the graphical model
    p = pyro.sample('p', dist.Beta(1., 1.))
    with pyro.plate("data", x_obs.shape[0]):
        x = pyro.sample('x', dist.Bernoulli(probs=p), obs=x_obs)
    return x
```

Next, we will perform "fully Bayesian" inference over the parameter (as opposed to maximizing the likelihood). To do so, we need to specify the variational family $q_{\phi}$. In Pyro, anything that provides an approximate posterior is called a `guide` (variational approximations are one example of this, and the main use). One can specify a variational distribution for any `model` in Pyro by creating a `guide` function that gives the generative form of the variational approximation (thus, you can have all kinds of interesting structure in your approximation if you desire).

As $p \in (0,1)$, we will use the Beta distribution family as our approximation. (It is actually a very good choice: in this case one can derive by hand that the posterior over $p$ is itself a Beta distribution!)

Note that below we tell Pyro our **variational parameters** (elements of $\phi$) and the constraints we wish them to have. In the Beta distribution's parametrization, both parameters **must** be positive, which is easy to specify.

What actually happens in the background is that Pyro optimizes over unconstrained space and then transforms the variational parameters back to constrained space. For example, it may optimize over $\log(a)$ and $\log(b)$, which are unconstrained, and then exponentiate them at the end.
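The trick can be sketched in plain PyTorch (the names and toy objective here are illustrative): optimize an unconstrained variable $\log a$ and recover $a = \exp(\log a) > 0$.

```python
import torch

log_a = torch.tensor(0.0, requires_grad=True)  # unconstrained variable
opt = torch.optim.Adam([log_a], lr=0.1)

for _ in range(300):
    a = log_a.exp()           # constrained value, always positive
    loss = (a - 3.0) ** 2     # toy objective with minimum at a = 3
    opt.zero_grad()
    loss.backward()
    opt.step()
```

No matter what value `log_a` takes during optimization, the constrained parameter `a` stays positive by construction.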

In [4]:

```
def guide():
    a = pyro.param("a", torch.tensor(1.), constraint=constraints.positive)
    b = pyro.param("b", torch.tensor(1.), constraint=constraints.positive)
    p = pyro.sample('p', dist.Beta(a, b))
    return p
```

In [5]:

```
pyro.clear_param_store()
svi = pyro.infer.SVI(model=BetaBernoulli,
                     guide=guide,
                     optim=pyro.optim.Adam({"lr": 0.1}),
                     loss=pyro.infer.Trace_ELBO())
ELBO = []
num_steps = 1000
for step in range(num_steps):  # avoid naming the loop variable t: that would shadow the torch alias
    ELBO.append(svi.step())
plt.plot(ELBO)
plt.ylabel("Estimated ELBO")
plt.xlabel("Iteration")
```

Out[5]:

Text(0.5, 0, 'Iteration')

The following code plots our approximate posterior by simulating from the variational approximation and fitting a kernel density estimate. (We could also have plotted the Beta pdf directly, but in general the marginals of the guide are not known in closed form, so sampling is the more widely applicable approach.)

In [6]:

```
import torch as t
from seaborn import kdeplot as kde
vals = t.tensor([guide().item() for k in range(5000)]).numpy()
kde(x = vals)
```

Out[6]:

<AxesSubplot:ylabel='Density'>

Pyro actually has an MCMC engine as well! It has the same state-of-the-art implementation of the Hamiltonian Monte Carlo No-U-Turn sampler as **Stan**. Run the code below to approximate the posterior distribution with MCMC.

In [7]:

```
mySampler = pyro.infer.MCMC(
    kernel=pyro.infer.NUTS(model=BetaBernoulli, jit_compile=True),
    num_samples=5000,
    warmup_steps=500,
    num_chains=1)
mySampler.run()
```

After running MCMC, we can call `.summary()` on the sampler to obtain information about the parameters. We also see the effective sample size, as well as the Gelman-Rubin diagnostic `r_hat`.

In [8]:

```
mySampler.summary()
```

In [9]:

```
x = mySampler.get_samples()['p'].numpy()
kde(x)
```

Out[9]:

<AxesSubplot:ylabel='Density'>

Note that both MCMC and VI give roughly the same answer! That is neat, but remember (1) this is a simple 1D example, and (2) we know that our variational family (guide) contains the true posterior.

Generally, we may have a complicated model that we don't want to spend the effort crafting a guide for piece by piece. Fortunately, `Pyro` provides off-the-shelf variational approximations in the `autoguide` module.

Below we tell `Pyro` we want a variational approximation of our model `BetaBernoulli` that is a multivariate normal with diagonal covariance matrix (i.e., independent normals). However, because this is a one-dimensional example, this is the same as saying we want a 1D ${\cal N}(\mu, \sigma^2)$ approximation.

You may recall, however, that the variable $p$ needs to lie in the interval $(0,1)$, and a normal distribution clearly has support outside that interval. The good news is that the autoguide automatically reparametrizes (transforms) everything behind the scenes: it fits a normal distribution in unconstrained space, and then transforms it to $(0,1)$.
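The idea can be sketched with `torch.distributions` directly: pushing a normal through a sigmoid yields a distribution on $(0,1)$. (This mirrors what the autoguide does internally; the exact transform Pyro chooses is the library's own business.)

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import SigmoidTransform

base = Normal(0.0, 1.0)                                   # unconstrained normal
q = TransformedDistribution(base, [SigmoidTransform()])   # distribution on (0, 1)

samples = q.sample((10000,))
print(samples.min().item(), samples.max().item())         # all samples inside (0, 1)
```

`log_prob` of the transformed distribution automatically includes the Jacobian correction, so it can be used directly inside an ELBO.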

In [10]:

```
from pyro.infer.autoguide import AutoDiagonalNormal
myAutoGuide = AutoDiagonalNormal(BetaBernoulli)
```

In [11]:

```
pyro.clear_param_store()
svi = pyro.infer.SVI(model=BetaBernoulli,
                     guide=myAutoGuide,
                     optim=pyro.optim.Adam({"lr": 0.01}),
                     loss=pyro.infer.Trace_ELBO())
ELBO = []
num_steps = 1000
for step in range(num_steps):
    ELBO.append(svi.step())
plt.plot(ELBO)
plt.ylabel("Estimated ELBO")
plt.xlabel("Iteration")
```

Out[11]:

Text(0.5, 0, 'Iteration')

In [12]:

```
import torch as t
from seaborn import kdeplot as kde
vals = t.tensor([myAutoGuide()["p"].item() for k in range(5000)]).numpy()
kde(x = vals)
```

Out[12]:

<AxesSubplot:ylabel='Density'>

In the next example, we will do Bayesian Linear Regression with two different priors.

In [13]:

```
import numpy as np
import torch as t
import torch
# -------- we will change this in the exercise ---
d = 100          # number of parameters
n_samples = 100  # number of samples
# ------------------------------------------------
beta = t.zeros(d)
for i in range(d):
    if i < d / 2:
        if np.random.rand() < .5:
            beta[i] = 5 + torch.randn(1)
        else:
            beta[i] = -5 + torch.randn(1)
    else:
        beta[i] = 0
X = t.randn([n_samples, d])
y_obs = X @ beta.reshape(-1, 1) + 0.5 * t.randn(n_samples, 1)
y_obs.shape
```

Out[13]:

torch.Size([100, 1])

The following code defines the Bayesian Linear Regression model.

$$\beta_i \sim_{\rm iid} {\cal N}(0, 10^2), \quad i=1,\ldots,d$$

$$\sigma^2 \sim {\rm InverseGamma}(0.1,0.1)$$

$$y_k \mid \beta, \sigma^2 \sim_{\rm ind} {\cal N}(\beta^\top {\bf x}_k, \sigma^2), \quad k=1,\ldots,N.$$

In [14]:

```
import pyro
from pyro.distributions import Normal
from pyro import sample
import pyro.distributions as dist

def LR():
    # ---- p(theta) -----------------------
    beta = sample('beta', Normal(loc=0., scale=10.).expand([d]).to_event(1))
    sigma_sq = sample('sigma_sq', dist.InverseGamma(.1, .1))
    # ---- p(y|theta) ---------------------
    with pyro.plate("data", X.shape[0]):
        y = pyro.sample('y', Normal(X @ beta.reshape(-1, 1), t.sqrt(sigma_sq)).to_event(1), obs=y_obs)
    return y
```

As the model is more sophisticated and we do not wish to go to the effort of crafting a guide, we can use an "off-the-shelf" `AutoGuide` that fits a multivariate normal.

In [15]:

```
guide = pyro.infer.autoguide.AutoMultivariateNormal(LR)
pyro.clear_param_store()
svi = pyro.infer.SVI(model=LR,
                     guide=guide,
                     optim=pyro.optim.Adam({"lr": 1e-3}),
                     loss=pyro.infer.Trace_ELBO())
ELBO = []
num_steps = 10000
for s in range(num_steps):
    ELBO.append(svi.step())
    if s % 250 == 0: print(s, round(ELBO[-1], 2))
plt.plot(ELBO)
```

0 1093.88 250 993.49 500 957.6 750 879.38 1000 834.18 1250 803.74 1500 789.23 1750 767.76 2000 731.52 2250 692.86 2500 669.57 2750 657.27 3000 631.49 3250 607.38 3500 602.58 3750 590.15 4000 593.4 4250 571.58 4500 576.04 4750 577.53 5000 577.43 5250 565.42 5500 565.96 5750 566.99 6000 558.22 6250 573.35 6500 556.48 6750 564.27 7000 553.37 7250 559.26 7500 560.0 7750 564.82 8000 547.85 8250 561.31 8500 551.63 8750 555.76 9000 551.26 9250 554.2 9500 546.67 9750 548.38

Out[15]:

[<matplotlib.lines.Line2D at 0x7ffdf2cb56d0>]

The code below will plot the median (red) and 95% credible intervals (blue) of the approximate posterior (i.e., the variational approximation). The true parameters that generated the data will be plotted as black stars.

In [16]:

```
l = guide.quantiles([0.025])['beta'][0].numpy()
m = guide.quantiles([0.5])['beta'][0].numpy()
u = guide.quantiles([0.975])['beta'][0].numpy()
plt.plot(l, 'b.', m, 'r.', u, 'b.')
plt.plot(beta.numpy(), 'k*')
```

Out[16]:

[<matplotlib.lines.Line2D at 0x7ffdf31390d0>]

Not too bad! However, you will have noticed that the parameter estimates are all over the place. This is unsurprising, as we have a lot of parameters relative to the amount of data.

In settings with many parameters and not much data, it is advantageous to use a prior that induces sparsity. The **Horseshoe Prior** is a prior that encourages this behaviour: it is specially designed to shrink the posterior of any parameters that are small toward zero, while simultaneously leaving larger parameters essentially unaffected. See the original paper on the horseshoe prior if you are interested in learning more about this magical prior.

The model in generative form is:

$$\tau \sim {\rm HalfCauchy}(1)$$

$$\lambda_i \sim_{\rm iid} {\rm HalfCauchy}(1), \quad i = 1,\ldots, d$$

$$\beta_i \mid \tau, \lambda_i \sim_{\rm ind} {\cal N}(0, \tau^2 \lambda_i^2), \quad i=1,\ldots,d$$

$$y_k \sim_{\rm ind} {\cal N}(\beta^{\top}\mathbf{x}_k, \sigma^2), \quad k=1,\ldots, N.$$

Note that more generally we can pick some other prior for $\tau$.
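To get a feel for the prior, one can draw from it directly (a sketch using `torch.distributions`; the sizes and seed are illustrative):

```python
import torch
from torch.distributions import HalfCauchy, Normal

torch.manual_seed(0)

tau = HalfCauchy(1.0).sample()           # global scale, shared by all coefficients
lam = HalfCauchy(1.0).sample((1000,))    # one local scale per coefficient
beta = Normal(0.0, tau * lam).sample()   # many draws near zero, a few very large
```

The heavy Cauchy tails of the local scales are what let a handful of coefficients escape the shrinkage toward zero.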

It is important to note that the above model does not put a prior on $\sigma^2$. We could if we wanted to, but instead we will simply optimize over $\sigma^2$ in the **empirical Bayes** fashion. Note that in the code below, we tell `Pyro` that we wish to do this by specifying `sigma` as a `param`, rather than giving it a `sample` statement (which would require us to specify a prior for it).

N.B. *For simplicity, I have not included an intercept term in the model. It is easy to add one by appending a column of ones to the design matrix $X$ and changing $d$ to $d+1$, but we won't worry about doing that in this tutorial.*

In [17]:

```
def HorseshoeLR():
    # ---- p(theta) -----------------------
    tau = pyro.sample('tau', dist.HalfCauchy(scale=1.))
    lam = pyro.sample('lambda', dist.HalfCauchy(scale=1.).expand([d]).to_event(1))
    beta = sample('beta', Normal(loc=0., scale=tau * lam).to_event(1))
    sigma = pyro.param("sigma", torch.tensor(1.), constraint=dist.constraints.positive)
    # ---- p(y|theta) ---------------------
    with pyro.plate("data", X.shape[0]):
        y = pyro.sample('y', Normal(X @ beta.reshape(-1, 1), sigma**2).to_event(1), obs=y_obs)
    return y
```

Above, you will notice the code uses two new commands: `expand` and `to_event`. The former tells `Pyro` to replicate the distribution into a batch of independent copies; the latter tells `Pyro` to treat the resulting elements as a single multivariate object (we need to do this for `autoguide` to work, for technical reasons).

As we are being empirically Bayesian about $\sigma^2$, and because (even if we weren't) the Horseshoe prior yields a posterior that is highly problematic for HMC (due to funnel-like geometry), we are going to use variational learning to fit our model.

In a little bit, you will get to experiment by changing the dimension of the data. We will assume all of our latents (in this case, parameters we are being Bayesian about) can be fit with a variational approximation that is of the form ${\cal N}(\mu, {\rm diag}({\bf v}) + {\bf b}{\bf b}^\top)$, where ${\bf b}$ is a $d$-dimensional vector. For further discussion of variational distributions of this form, see the paper by Ong et al. (2017).
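The covariance structure itself is easy to build by hand (a sketch with illustrative sizes; in practice `AutoLowRankMultivariateNormal` manages these parameters for you):

```python
import torch

d = 4
v = torch.rand(d) + 0.1         # positive diagonal entries
b = torch.randn(d, 1)           # rank-1 factor
cov = torch.diag(v) + b @ b.T   # diag(v) + b b^T: ~2d numbers instead of O(d^2)

q = torch.distributions.MultivariateNormal(torch.zeros(d), covariance_matrix=cov)
sample = q.sample()
```

This family captures one dominant direction of posterior correlation while keeping the parameter count linear in $d$.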

In [18]:

```
from pyro.infer.autoguide import AutoLowRankMultivariateNormal
guideHS = AutoLowRankMultivariateNormal(HorseshoeLR, rank=1)
pyro.clear_param_store()  # clear once, before training; clearing inside the loop would reset the fit every step
svi = pyro.infer.SVI(model=HorseshoeLR,
                     guide=guideHS,
                     optim=pyro.optim.Adam({"lr": 1e-3}),
                     loss=pyro.infer.Trace_ELBO())
ELBO = []
num_steps = 50000
for step in range(num_steps):
    ELBO.append(svi.step())
    if step % 500 == 0: print(step, round(ELBO[-1], 2))
```

In [19]:

```
plt.plot(ELBO)
plt.xlabel("Iteration")
plt.ylabel("Estimated ELBO")
```

Out[19]:

Text(0, 0.5, 'Estimated ELBO')

In [36]:

```
plt.rcParams['figure.figsize'] = [12, 8]
ax1 = plt.subplot(121)
l = guide.quantiles([0.025])['beta'][0].numpy()
m = guide.quantiles([0.5])['beta'][0].numpy()
u = guide.quantiles([0.975])['beta'][0].numpy()
plt.plot(l, 'b.', m, 'r.', u, 'b.', markersize=14)
plt.plot(beta.numpy(), 'k*', alpha=0.5, markersize=14)
plt.title("Gaussian Prior w/VI")
ax2 = plt.subplot(122, sharey = ax1)
lHS = guideHS.quantiles([0.025])['beta'][0].numpy()
mHS = guideHS.quantiles([0.5])['beta'][0].numpy()
uHS = guideHS.quantiles([0.975])['beta'][0].numpy()
plt.plot(lHS, 'b.', mHS, 'r.', uHS, 'b.', markersize=14)
plt.plot(beta.numpy(), 'k*', alpha = 0.5, markersize=14)
plt.title("Horseshoe Prior w/VI")
```

Out[36]:

Text(0.5, 1.0, 'Horseshoe Prior w/VI')

It is worth noting that Pyro has other variational objective functions besides the ELBO: for example, the Tail-adaptive f-Divergence and the Importance Weighted variational objective. (I believe doubly reparametrized "Sticking the Landing" gradients for the latter are a feature that's coming soon.)

Investigate the effectiveness of the Gaussian prior vs. the Horseshoe prior for different data dimensions `d` and different numbers of samples `n_samples`. As Variational Inference/Learning is quite quick, you should be able to do this in high dimensions and with bigger datasets.

## `torch.nn.Module` and its use in `Pyro`

In `PyTorch`, a very important concept is the `module` object. A key aspect of `Pyro` is that one can use modules everywhere. This allows for **very flexible** models, and approaches such as amortized inference (using a neural network to determine the variational parameters given the data).

For now, it suffices to note that a `module` can be anything from something as simple as an activation function (e.g., `nn.Sigmoid`), to a layer of a neural network (e.g., `nn.Linear`, `nn.Conv2d`), to a composition of other `module` objects (e.g., a whole neural network can be a single `module` object).

For those interested in how they are used within `Pyro`, see the online tutorial Modules in Pyro. Note, though, that this requires more advanced knowledge of how to create and use `module` objects.

Using `module` objects within one's model (e.g., a deep Bernoulli model) or one's variational approximation (e.g., neural nets used in amortized inference) is the core of **deep probabilistic programming**. `Pyro` can also have `module` objects that are fully Bayesian, i.e., have priors on all of their parameters. Another example of how one may use `modules` in the `guide` is through flexible, conditionally structured approximations. For example, our posterior-approximating variational family may take the form

$$q_{\boldsymbol{\phi}}(\mathbf{Z}_1, \mathbf{Z}_2) = q_{\boldsymbol{\phi}}(\mathbf{Z}_1)\, q\big(\mathbf{Z}_2 \mid {\bf g}_{\xi}(\mathbf{Z}_1)\big),$$

where ${\bf g}_{\xi}: \mathbb{R}^{{\rm dim}({\bf Z}_1)} \rightarrow \mathbb{R}^{{\rm dim}({\bf Z}_2)}$ is a neural network, say, whose output provides the parameters of the conditional approximation. Of course, designing such approximations so that they optimize well and capture the posterior characteristics of interest is the topic of ongoing research, but `Pyro` can in principle easily implement such things.

These **incredible features** are beyond the scope of this tutorial; instead, in this section we provide a brief overview of how `module` objects work in base PyTorch. This is also useful to know if you plan to use `PyTorch` on its own.

Below is a **very basic** walkthrough of creating a module object and how it can be used.

In [37]:

```
import torch
import torch.nn as nn
```

Below we create a new `module` type called `Linear` that mimics how the `torch.nn.Linear` module works under the hood.

In [38]:

```
class Linear(nn.Module):
    def __init__(self, dim_in, dim_out):
        # run the constructor of the class we are inheriting from
        super().__init__()
        # initialize the weight and bias parameters
        self.W = nn.Parameter(torch.randn(dim_out, dim_in))
        self.b = nn.Parameter(torch.zeros(dim_out))

    def forward(self, X):
        return X @ self.W.T + self.b
```

Below, I create a new object `LR` of type `Linear`.

In [39]:

```
LR = Linear(dim_in=5, dim_out = 1)
```

In [40]:

```
V = torch.randn(3,5)
```

The `forward` method of an `nn.Module` object is its *default* method: if you call the module as if it were a function, `forward` is what gets invoked.

In [41]:

```
# calling the module like a function invokes its forward method
LR(V)
```

Out[41]:

tensor([[ 1.4408], [ 1.8738], [-1.6663]], grad_fn=<AddBackward0>)

The explicit call `LR.forward(V)` below does exactly the same thing. Calling `parameters()`, on the other hand, will not actually show the parameters, because it returns a `generator` object. This just means it is something that yields values when iterated through.

In [42]:

```
LR.forward(V)
```

Out[42]:

tensor([[ 1.4408], [ 1.8738], [-1.6663]], grad_fn=<AddBackward0>)

In [43]:

```
LR.parameters()
```

Out[43]:

<generator object Module.parameters at 0x7ffdfd3856d0>

We can view all the parameters in a module by iterating through its `parameters()` generator.

In [44]:

```
for p in LR.parameters(): print(p)
```

We can do the above and also show the names by iterating with the `named_parameters()` generator. Having all parameters contained in a module object makes it easy to pass things to an `optimizer` object (e.g., `torch.optim.Adam`): we just need to give it the `parameters()` generator upon initialization. It is likewise easy to tell `Pyro` that all parameters in a module should be part of $\theta$ in a model, or of $\phi$ in the variational approximation, to be optimized over.

Note that by default all parameters in the layer have `requires_grad` set to `True`, so PyTorch tracks operations made on them for backpropagation.
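This is easy to check (plain PyTorch, with illustrative sizes):

```python
import torch
import torch.nn as nn

layer = nn.Linear(5, 1)
# every parameter created by the layer tracks gradients by default
assert all(p.requires_grad for p in layer.parameters())

out = layer(torch.randn(3, 5)).sum()
out.backward()  # populates .grad on the weight and bias
grads = [p.grad for p in layer.parameters()]
```

After `backward()`, each parameter's `.grad` field holds the gradient an optimizer would use for its update.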

In [45]:

```
for p in LR.named_parameters(): print(p)
```

The purpose of the above was to demonstrate constructing a custom module. As a linear operation is very simple, we could have just used `PyTorch`'s built-in module.

In [46]:

```
L = torch.nn.modules.Linear(5,1)
```

In [47]:

```
for p in L.named_parameters(): print(p)
```

Note that `nn.modules.Sigmoid` is an `nn.Module`, as are most activation functions.

While it is very natural to think of a linear module as applying the transformation ${\bf y} = \mathrm{A}{\bf x} + {\bf b}$, above we (like `PyTorch` in its own linear module) do not actually implement it that way. The reason is that we want to **vectorize**, that is, to do the computation for several vectors $\mathbf{x}$ in one go.

Tricks like this are very important for speeding up `PyTorch`'s performance when using GPUs (GPUs can do tensor operations very fast, but other things not so much).

In [48]:

```
import torch as t
A = t.tensor([[1,2], [3,4]])
X = t.tensor([[1,1], [3,2]])
```

In [49]:

```
# the tedious way
h = t.zeros(2, 2)
for i in [0, 1]:
    h[i, :] = A @ X[i, :]
print(h)
```

tensor([[ 3., 7.], [ 7., 17.]])

In [50]:

```
X @ A.T # the smart way to do the same thing
```

Out[50]:

tensor([[ 3, 7], [ 7, 17]])

It should hopefully not surprise you that using

In [51]:

```
A @ X
```

Out[51]:

tensor([[ 7, 5], [15, 11]])

does not yield the same answer, as it computes the matrix product $AX$ (treating $X$ as a single matrix rather than as a batch of row vectors to be transformed individually).

The advantage of this **vectorization** is that we can compute everything with **one** tensor operation and no for loops; GPUs are very good at such operations!
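At more realistic sizes, the equivalence of the loop and the vectorized product is easy to verify (the speedup itself you can time on your own machine):

```python
import torch

A = torch.randn(64, 128)     # weight matrix
X = torch.randn(1000, 128)   # 1000 input vectors, one per row

# loop version: one matrix-vector product per row
h_loop = torch.stack([A @ X[i] for i in range(X.shape[0])])

# vectorized version: a single matrix-matrix product
h_vec = X @ A.T

assert torch.allclose(h_loop, h_vec, atol=1e-4)
```

The single `X @ A.T` call lets the backend dispatch one large, highly parallel kernel instead of 1000 small ones.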

Where `Pyro` really shines is in implementing sophisticated models with techniques such as **amortized inference**. This is quite advanced; however, the tutorial above should leave you well equipped to begin to understand some of the more advanced `Pyro` examples. Have a read through the following two examples, and see how much you can now follow!

N.B. In the `Pyro` tutorials below, the plate diagram for the VAE has a circle around the parameter vector $\theta$. If one follows the convention that variables which are random (have a prior) are circled in the plate diagram, then their diagram is incorrect. The implemented VAE is not "fully Bayesian" with respect to $\theta$; it simply maximizes the ELBO over the parameter values (recall we discussed in the lecture why this may be a better choice).