# Journal club, 18.05

#### Paper info

- Original VAE paper
  - Auto-Encoding Variational Bayes
  - Diederik P. Kingma, Max Welling, ICLR 2014
  - https://arxiv.org/abs/1312.6114
  - https://openreview.net/forum?id=33X9fd2-9FyZd
  - 14 pages incl. appendix
- Recent VAE review / tutorial
  - An Introduction to Variational Autoencoders
  - Diederik P. Kingma, Max Welling (2019)
  - Foundations and Trends in Machine Learning, 12, 307-392, doi:10.1561/2200000056
  - https://arxiv.org/abs/1906.02691
  - 86 pages

#### Discussion

JJ: Tutorial paper, p. 34, Eq. (2.71): decomposition of the VAE loss into reconstruction and regularization terms
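The decomposition into a reconstruction term and a regularization (KL) term can be sketched numerically. This is a minimal sketch, assuming a diagonal-Gaussian encoder, a standard-normal prior, a Bernoulli decoder and a single latent sample per example; these are common choices, not necessarily the exact setting of the equation cited. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def bernoulli_nll(x, x_prob, eps=1e-7):
    """Negative Bernoulli log-likelihood of x under decoder probabilities x_prob."""
    x_prob = np.clip(x_prob, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=-1)

def vae_loss(x, x_prob, mu, log_var):
    """Negative ELBO per example = reconstruction term + KL regularizer."""
    recon = bernoulli_nll(x, x_prob)
    kl = gaussian_kl(mu, log_var)
    return recon + kl, recon, kl

# Toy inputs (made up for illustration): 2 examples, 4 pixels, 3 latent dims.
x = np.array([[1.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0]])
x_prob = np.array([[0.9, 0.2, 0.8, 0.7], [0.1, 0.3, 0.6, 0.2]])  # decoder output
mu = np.array([[0.5, -0.3, 0.0], [0.0, 0.0, 0.0]])               # encoder mean
log_var = np.array([[-0.1, 0.2, 0.0], [0.0, 0.0, 0.0]])          # encoder log-variance

loss, recon, kl = vae_loss(x, x_prob, mu, log_var)
print("reconstruction:", recon, "KL:", kl, "total:", loss)
```

The second example has an encoder output equal to the prior, so its KL term is zero and the loss is pure reconstruction error.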

VAE and (its intractable) marginal likelihood:
"Marginal likelihood. For very low-dimensional latent space it is possible to estimate the marginal likelihood of the learned generative models using an MCMC estimator. More information about the marginal likelihood estimator is available in the appendix. For the encoder and decoder we again used neural networks, this time with 100 hidden units, and 3 latent variables; for **higher dimensional latent space** the **estimates became unreliable**. Again, the MNIST dataset was used. The AEVB and Wake-Sleep methods were compared to Monte Carlo EM (MCEM) with a Hybrid Monte Carlo (HMC) [DKPR87] sampler; details are in the appendix. We compared the convergence speed for the three algorithms, for a small and large training set size. Results are in figure"

MC: This is only an estimate of the marginal likelihood for very few latent variables; it is not applicable in general. Source: https://arxiv.org/pdf/1312.6114.pdf (the original VAE paper)

Q: Jenia, you compared VAEs to GANs. Can you also compare them to normalizing flows in 2 or 3 sentences? AFAIK normalizing flows give you access to the underlying distribution, but I thought VAEs couldn't.

- JJ, MC: P(X) is accessible in autoregressive models (PixelCNN); in a VAE the marginal likelihood is intractable, so the density P(X) is not available; in normalizing flows the marginal likelihood P_{\theta}(X) is tractable. A further advantage of NF over VAE is the use of a more expressive inference model (employing a more complex prior).
- TT: p_{\theta}(x) = \int p_{\theta}(z) p_{\theta}(x|z) dz is intractable in our case, where p_{\theta}(x|z) is given by a NN (the decoder) (see Paper Sect. 2.1).
- JJ: Normalizing flows were motivated by the need for a more expressive inference and posterior model (beyond the simple (Gaussian, uniform), very low-dimensional ones in VAEs). They have a tractable marginal likelihood P_{\theta}(X). See Section 3.2, "Improving the Flexibility of Inference Models", in the tutorial paper.
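To make "tractable marginal likelihood" concrete for flows: with an invertible map, the density follows exactly from the change-of-variables formula. The sketch below uses the simplest possible case, a one-dimensional affine map of a standard normal. This is an illustrative assumption, not a model from either paper, and the function names are hypothetical.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log density of N(0, 1)."""
    return -0.5 * (np.log(2.0 * np.pi) + z**2)

def affine_flow_logpdf(x, a, b):
    """Exact log p(x) for x = a * z + b with z ~ N(0, 1), via change of variables."""
    z = (x - b) / a                 # inverse transform f^{-1}(x)
    log_det = -np.log(np.abs(a))    # log |d f^{-1}(x) / dx|
    return standard_normal_logpdf(z) + log_det

x = np.array([-1.0, 0.0, 2.5])
a, b = 2.0, 0.5
flow_logp = affine_flow_logpdf(x, a, b)

# Reference: x is exactly N(b, a^2), so the two must agree.
ref_logp = -0.5 * (np.log(2.0 * np.pi * a**2) + ((x - b) / a) ** 2)
print(flow_logp, ref_logp)
```

No integral over z appears anywhere, which is the contrast with the VAE marginal in TT's comment.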

JJ: "Autoregressive models provide tractable likelihoods, whereas variational autoencoders have intractable marginal likelihoods."

MC: There is a nice diagram in Ian Goodfellow's tutorial on generative models that shows the different kinds of generative models and whether the density can be calculated exactly, approximated, or is only implicit. See https://arxiv.org/pdf/1701.00160.pdf, Figure 9.

https://deepgenerativemodels.github.io/notes/flow/

JJ: From the tutorial paper, Section 2.6 "Estimation of the Marginal Likelihood": the marginal likelihood can only be estimated (in low-dimensional toy cases) and is otherwise intractable. The ELBO (Evidence Lower Bound) is available and can be used as a lower bound on the inaccessible marginal likelihood. A VAE can therefore be seen as an explicit likelihood model, since the form of the likelihood is available, in contrast to e.g. GANs.

JJ: See Figure 2.1 in the tutorial paper for a very good schematic overview of the VAE.
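The relation between the ELBO and an estimated marginal likelihood can be checked on a toy case where the exact answer is known. This is a minimal sketch, assuming a one-dimensional linear-Gaussian model and a hand-picked, deliberately imperfect approximate posterior. None of these numbers come from the papers, and the importance-sampling estimator here stands in for whatever estimator the papers actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_logpdf(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

# Toy model (an assumption for illustration): p(z) = N(0, 1), p(x|z) = N(z, sigma2).
sigma2 = 0.5
x = 1.0
exact_log_px = normal_logpdf(x, 0.0, 1.0 + sigma2)  # marginal is N(0, 1 + sigma2)

# Deliberately imperfect approximate posterior q(z|x) = N(q_mean, q_var).
q_mean, q_var = 0.5, 0.5
z = q_mean + np.sqrt(q_var) * rng.standard_normal(20_000)

# Log importance weights: log p(x, z) - log q(z|x).
log_w = (normal_logpdf(z, 0.0, 1.0) + normal_logpdf(x, z, sigma2)
         - normal_logpdf(z, q_mean, q_var))

elbo_est = log_w.mean()                            # Monte Carlo ELBO
m = log_w.max()                                    # shift for numerical stability
is_est = m + np.log(np.mean(np.exp(log_w - m)))    # importance-sampled log p(x)

print("ELBO:", elbo_est, "IS estimate:", is_est, "exact:", exact_log_px)
```

With one latent dimension the importance-sampling estimate sits close to the exact value while the ELBO stays below it; the gap is the KL divergence between q(z|x) and the true posterior. In high-dimensional latent spaces the weights become very uneven, which is one way to read the "estimates became unreliable" remark quoted from the original paper.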

Q: Latent space dimension, or more generally any characteristics of the latent space: are there models that determine those from data?

- JJ: Non-parametric models; worth checking whether something has been done in that direction.
- MC: I would say in general that this problem can be thought of in terms of rate-distortion theory, the part of information theory that covers lossy compression. There is a trade-off between how much you compress your input (dimension of the latent space) and the reconstruction error ("distortion"). In other words, you can control how much reconstruction error you incur by controlling the size of the latent space. In practice, it all depends on your task and what you use the generative model for. See https://arxiv.org/abs/1511.01844 for a discussion of evaluating generative models, which should be task-specific: the "downstream" task needs to be considered.
  - If the goal is to learn good features in an unsupervised way for a classification task, then classification accuracy should be the metric to optimize.
  - If the goal is to generate "realistic" images, there are options such as perceptual metrics. A standard metric used for images is PSNR (https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio); more recent ones are based on neural nets (https://arxiv.org/abs/1801.03924). Fréchet Inception Distance (https://arxiv.org/abs/1706.08500) is standard in the generative-model literature, but I would say it's still an open question, so a lot of papers still include human evaluation, for instance asking people to judge whether an image is real or generated.
  - Once the metric is figured out for the task, it is then possible to optimize the hyper-parameters (such as the dimension of the latent space) on a validation set.
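For reference, PSNR is simple enough to compute directly when comparing reconstructions across latent sizes. A minimal sketch, assuming images scaled to [0, 1]; the function name `psnr` and the `max_val` parameter are hypothetical, and the test arrays are made up.

```python
import numpy as np

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images: no distortion
    return 10.0 * np.log10(max_val**2 / mse)

# A constant offset of 0.1 gives MSE = 0.01.
reference = np.zeros((8, 8))
reconstruction = reference + 0.1
print(psnr(reference, reconstruction))
```

Higher is better; a sweep over latent dimensions on a validation set would report this (or whichever task metric was chosen) for each size.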

JJ: Here we also have to keep in mind that a large latent space may reduce the reconstruction loss during training, but will often lead to a huge generalization error (because without enforcing compression, we do not learn anything useful about the true underlying latent causes that generate the observed signal in data space). In the trivial case, imagine you learn during training to make a 1:1 copy of the original signal, without learning anything about the underlying generative model. So the latent space has to be much more compact than the original data space, and equipped with regularizing priors.

JJ: See Section 3.2, "Improving the Flexibility of Inference Models", in the tutorial paper for examples that go beyond simple prior assumptions (e.g. Gaussian), including a discussion of normalizing flows (which have a tractable marginal likelihood p_{\theta}(X)).

Q: During training, how do you handle the trade-off between the KL divergence and the reconstruction loss?

JJ: Grid search (the simplest and roughest way to do it) or meta-learning, i.e. learning to learn the hyperparameters.

Scheduling KL vs reconstruction term:

- https://arxiv.org/pdf/1511.06349.pdf (section 3.1, 'KL cost annealing')
- https://arxiv.org/abs/1704.03477
- In general the keyword to search for is 'KL annealing'; there are other works about this.
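The idea behind KL annealing is to start training with the KL term switched off and ramp its weight up to 1. Below is a minimal sketch, assuming a linear warm-up; the cited papers may use other schedule shapes. The names `kl_weight`, `annealed_loss` and `warmup_steps` are hypothetical, and the loss values are made up.

```python
def kl_weight(step, warmup_steps=10_000):
    """Linear warm-up of the KL weight from 0 to 1 over warmup_steps."""
    if warmup_steps <= 0:
        return 1.0  # no annealing requested
    return min(1.0, max(0.0, step / warmup_steps))

def annealed_loss(recon, kl, step, warmup_steps=10_000):
    """Reconstruction term plus KL term scaled by the current schedule value."""
    return recon + kl_weight(step, warmup_steps) * kl

# Same loss terms at different training steps, to show the effect of the weight.
for step in (0, 2_500, 10_000, 50_000):
    print(step, kl_weight(step), annealed_loss(recon=100.0, kl=20.0, step=step))
```

Once the weight reaches 1 the objective is the ordinary negative ELBO again; `warmup_steps` becomes one more hyperparameter to tune alongside the latent dimension.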