# Deep Probabilistic Models: Tutorial 1

## Tutorial 1: Flow-Based Models and GANs

Welcome to the first tutorial for this week! This tutorial is a brief computational exploration of flow-based models and GANs within PyTorch.

The goal of this tutorial is simple: to have you play around with flows and GANs on some simple examples (the full Iris data, and a Swiss roll dataset).

This notebook has also been written in a manner that will serve as a nice reference implementation for you in the future.

N.B. Please be sure to run each code cell as you progress through the notebook.

# Flow-Based Models

## Guided Learning

The following code cell will import the famous "Iris" data.

Flows typically fit more quickly if the data is first standardized (subtract the sample mean and divide by the sample standard deviation for each column), so we do that below using a StandardScaler from sklearn.
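As a sketch of this step (the variable names `X_raw` and `X` are assumptions; the notebook's own cell may differ):

```python
import torch
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_raw = iris.data                               # (150, 4) numpy array
scaler = StandardScaler().fit(X_raw)            # learns per-column mean and std
X = torch.tensor(scaler.transform(X_raw), dtype=torch.float32)
```

After this step each column of `X` has sample mean 0 and sample standard deviation 1.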

Run the code below to visualize the data with four different bivariate plots (visualizing two dimensions at a time). You will notice that the data has an interesting shape and thus will require a flexible model!
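The plotting cell might look like the following sketch (the exact choice of the four column pairs is an assumption):

```python
import matplotlib
matplotlib.use("Agg")                       # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
pairs = [(0, 1), (0, 2), (1, 3), (2, 3)]    # assumed choice of the four 2D views
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (i, j) in zip(axes, pairs):
    ax.scatter(X[:, i], X[:, j], s=12)
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
fig.tight_layout()
```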

Next, we will fit a flow-based model to the four-dimensional data (note that the figure in the lectures was fit only to the 2D data of petal length vs. petal width, so we are doing something a little different from that).

The code below creates a 4D base distribution, as well as a single autoregressive transform with each of the four splines having 8 bins.

The code below computes the total number of parameters used...

That is a lot of parameters for fitting a 4D density (by contrast, you can fit a mixture of two Gaussians with only 16!), but such is the nature of using neural networks.

The train_flow_model function below takes five inputs:

• dataset : this is a dataset in the form of a rectangular array (numpy or torch.Tensor)
• params : a collection of parameters which the optimizer will optimize over
• num_samples : the number of subsampled observations from the data at each iteration
• steps : the total number of steps that the optimizer will take
• lr : the learning rate parameter of the optimizer. Note that when deeper neural nets or longer flows are used, this will need to be set smaller to ensure stable training!

The code below will train the flow we set up for Xdist previously. It may take a minute or so depending on the number of steps and subsample batch size.

Plotting the (estimated) loss over time seems to imply that the training has converged.

Pro Tip: If training is not performing well, or is unstable, try reducing the learning rate, and/or increasing the number of subsamples used at each iteration. The former is typically required for deeper (more transforms) flows.

### Transformation Objects

Prior to looking at our fit, we will talk about how to use the transformation object we created that defines our distributional family.

To create new data (or assess how well the original data gets transformed back to a normal), we wish to send samples through the learned $T$ and its inverse.

Recall that we created a transformation object T above.

Transformation objects have both a forward and inverse method defined on them. The following methods are worth noting:

1. T(Z): returns $T(Z)$
2. T.inv(X): returns $T^{-1}(X)$

The above is demonstrated below by generating a sample $Z \sim {\cal N}(0,{\rm I})$, computing $X = T(Z)$, and then computing $Z = T^{-1}(X)$. Note that putting the sample through the transform and then the resulting sample through the inverse transform correctly yields the original sample back.

The above also works with an arbitrary number of samples. The .sample method of the base distribution just needs to be passed the requested number of samples inside of [ ]. Calling distZ.sample() is the same as distZ.sample([]), i.e., a single draw with no extra sample dimension. Below, we generate 3 samples.

Below, we sample 500 times from the learned distribution, and plot the observations in bivariate plots with the original data.

Note that, as is the case with minimizing $KL(p||q)$, the learned distribution is very conservative where it places mass.

If the model has fit the training data well, we should expect all the bivariate plots of the inverse transformed data to look like samples from a $N(\mathbf{0}, I)$ distribution. Below we plot the inverse transformed dataset, along with samples from a $N(\mathbf{0}, I)$ for comparison.

## Multiple Transformations

For the next example, we will use multiple iterations of Real NVP with reverse permutation operations to fit data of a very challenging shape.

In this section, you will explore using flows of simpler transformations. The code below applies only one Real NVP transformation, so it will leave half of the variables unchanged.

Note that training using Real NVP without splines is very fast!

Run the cell below to plot the result. Recall that the variable on the $x$-axis is marginally ${\cal N}(0,1)$, as we only used one layer of Real NVP!

### Flow-Based Models: Exercise (Practical)

Modify the Real NVP flow code above and train it so you obtain a good fit to the points.

You may wish to change:

• The number of transforms (layers)
• The number of hidden units in the neural nets in each layer
• The number of hidden layers in the neural nets (e.g., use [25,25,25] to have three hidden layers)
• A different transform from Real NVP (e.g., pyro.distributions.affine_autoregressive - but for a challenge, no splines are allowed!)
• The learning rate so training is more stable (this may necessitate more steps)
• The random seed (maybe after changing the above you have a good model and just got unlucky if you didn't get a good fit)

### Flow-Based Models: Exercise (Analytical)

Johnson's SU-distribution is a four-parameter one-dimensional distribution arising from the transformation of a standard normal:

$$X = \mu + \sigma \sinh\left(\frac{Z - \gamma}{\delta} \right)$$

where $\mu \in \mathbb{R}$, $\gamma \in \mathbb{R}$, $\sigma > 0$, $\delta >0$, and $Z \sim {\cal N}(0,1)$.

Using that $\sinh^{-1}(x) = \log\left(x + \sqrt{x^2+1}\right)$, derive the probability density function of $X$ using the change of variables theorem.
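One route through the derivation (a sketch; verify each step yourself). First invert the transformation:

$$\frac{X - \mu}{\sigma} = \sinh\left(\frac{Z - \gamma}{\delta}\right) \quad \Longrightarrow \quad Z = g(X) = \gamma + \delta \sinh^{-1}\left(\frac{X - \mu}{\sigma}\right).$$

Differentiating the inverse map, using $\frac{d}{du}\sinh^{-1}(u) = \frac{1}{\sqrt{1+u^2}}$:

$$g'(x) = \frac{\delta}{\sigma\sqrt{1 + \left(\frac{x-\mu}{\sigma}\right)^2}}.$$

Then the change of variables theorem, $f_X(x) = f_Z(g(x))\,|g'(x)|$ with $f_Z$ the standard normal density, gives

$$f_X(x) = \frac{\delta}{\sigma\sqrt{2\pi}\sqrt{1 + \left(\frac{x-\mu}{\sigma}\right)^2}}\, \exp\left(-\frac{1}{2}\left[\gamma + \delta \sinh^{-1}\left(\frac{x-\mu}{\sigma}\right)\right]^2\right).$$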