Thursday, April 3, 2025

How can we make use of latent spaces when generating data? The Posit AI Blog is excited to share an example of representation learning with MMD-VAE (Maximum Mean Discrepancy Variational Autoencoder). Like GANs, variational autoencoders (VAEs) are often used to generate images, but they add an extra promise: modelling an underlying latent space. A central challenge is designing a loss function that actually makes the latent representations informative. To address this, we use MMD-VAE, a member of the InfoVAE family that combines maximum mean discrepancy (MMD) with a variational autoencoder. MMD measures the difference between two probability distributions, allowing us to match the aggregate distribution of latent codes to the prior rather than constraining each code individually. In this post, we first train a standard VAE that maximizes the evidence lower bound (ELBO), then compare it to an MMD-VAE trained on the same data, and inspect how the resulting latent spaces differ. Stay tuned!

Recently, we showed how to generate images using generative adversarial networks (GANs). GANs can produce remarkable results, but the implicit contract is a limited one: samples that look real are all we are promised.
Sometimes that is all we need. In other cases, though, we are more interested in actually modelling a domain: we don't just want to generate samples that look authentic, we want to place our samples at specific coordinates in a latent space.

Take facial expressions as an example of such a space. Our naive psychology tends to reduce complex emotions to a few underlying dimensions, viewing expressions as varying, say, between positive and negative along one axis, and from mild to intense along another. If we train a variational autoencoder (VAE) on a dataset of faces and it actually “discovers” our hypothesized dimensions, we can then use the model to generate novel, previously unseen faces at chosen locations in that latent space.

Like probabilistic graphical models, variational autoencoders assume an underlying, hidden latent space that is responsible for generating the observable data. Like standard (non-variational) autoencoders, they compress the input and subsequently decompress it again. Unlike traditional autoencoders, though, the crucial point is the design of a loss function that makes the model learn informative representations in the latent space.

In a nutshell

In traditional Variational Autoencoders (VAEs), the objective is to maximize the evidence lower bound (ELBO):
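Written out in its usual form, with approximate posterior $q(z|x)$, decoder likelihood $p(x|z)$, and prior $p(z)$, the ELBO reads

$$ ELBO = E_{q(z|x)}\left[\log p(x|z)\right] \;-\; KL\left(q(z|x)\,\|\,p(z)\right) $$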

The first component corresponds directly to what we know from plain (non-variational) autoencoders: the reconstruction loss, here stated as the expected log-likelihood of the data given the latent code. The second component, the KL divergence, measures how far the posterior inferred from the data diverges from the prior assumed over the latent space.

A common concern with the traditional VAE loss is that it tends to produce latent representations that carry little information, especially when a flexible decoder can model the data well on its own.

MMD-VAE, the model covered in this post, is a member of the InfoVAE family. Instead of making each individual latent representation as close as possible to the prior, it constrains the aggregate distribution of latent codes to match the prior. MMD (maximum mean discrepancy), a similarity measure for probability distributions, is based on matching their moments. We explain this in more detail below.

Our objective today

First, we will train a conventional Variational Autoencoder (VAE) that optimizes the evidence lower bound (ELBO). We then compare it against an Info-VAE that uses the Maximum Mean Discrepancy (MMD) loss instead.

Our main interest lies in inspecting the latent spaces and seeing how their characteristics differ depending on the optimization criterion.

The domain we model will be glamorous (fashion!), yet manageable in size, with images of 28×28 pixels: we will compress and reconstruct images from the Fashion-MNIST dataset, the drop-in replacement for MNIST.

A standard variational autoencoder

Since we haven't used TensorFlow's eager execution in a while, we will run the model eagerly.
If you are new to eager execution, a period of adjustment is normal; but once you get the hang of it, many tasks become noticeably easier. The complete code for this example is available as a ready-to-run template.

Setup and data preparation

We start by making sure we are using the TensorFlow implementation of Keras and enabling eager execution. Besides tensorflow and keras, we also load tfdatasets, which we will use to stream the training data.
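A minimal sketch of this setup, assuming the TF-1.x-era R APIs in use at the time (function names such as tfe_enable_eager_execution may differ in current package versions):

```r
library(keras)
# make sure the TensorFlow implementation of Keras is used
use_implementation("tensorflow")

library(tensorflow)
# enable eager execution (TF 1.x-style API)
tfe_enable_eager_execution(device_policy = "silent")

# tfdatasets will be used to stream the data to the training loop
library(tfdatasets)
```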

The overall approach, custom Keras models trained eagerly, has appeared in our earlier Keras examples on convolutional and recurrent neural networks.

The data conveniently comes with keras; all that remains is some straightforward normalization and reshaping.
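For instance, with Fashion-MNIST as shipped by keras, the preparation could look roughly like this (a sketch, not a verbatim reproduction of the original code):

```r
fashion <- dataset_fashion_mnist()
c(train_images, train_labels) %<-% fashion$train
c(test_images, test_labels) %<-% fashion$test

# scale to [0, 1], add a channels dimension, and cast to float32
train_x <- array_reshape(train_images / 255, c(nrow(train_images), 28, 28, 1))
train_x <- k_cast(train_x, "float32")

test_x <- array_reshape(test_images / 255, c(nrow(test_images), 28, 28, 1))
test_x <- k_cast(test_x, "float32")
```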

 

What do we need a test set for, given that we are training an unsupervised model? We will use it to see how previously unseen data points cluster together in latent space.

We stream the data to the training loop via tfdatasets:
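A sketch of that streaming step (buffer and batch sizes are illustrative):

```r
buffer_size <- 60000
batch_size <- 100

train_dataset <- tensor_slices_dataset(train_x) %>%
  dataset_shuffle(buffer_size) %>%
  dataset_batch(batch_size)

test_dataset <- tensor_slices_dataset(test_x) %>%
  dataset_batch(10000)
```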

 

Next, we define the model.

Encoder-decoder model

Actually, there are two models here: the encoder and the decoder. As we will see shortly, in the standard VAE there is an additional component in between, performing what is known as the reparameterization trick.

The encoder is a custom model, consisting of two convolutional layers and a dense layer. The dense layer's output is split into two parts: one holds the mean of the latent variables, the other their variance.
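A sketch of such an encoder as a custom Keras model (filter counts and kernel sizes are illustrative choices, not necessarily those of the original post):

```r
latent_dim <- 2

encoder_model <- function(name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$conv1 <- layer_conv_2d(filters = 32, kernel_size = 3,
                                strides = 2, activation = "relu")
    self$conv2 <- layer_conv_2d(filters = 64, kernel_size = 3,
                                strides = 2, activation = "relu")
    self$flatten <- layer_flatten()
    # twice the latent dimensionality: one half for the means,
    # one half for the (log) variances
    self$dense <- layer_dense(units = 2 * latent_dim)

    function(x, mask = NULL) {
      x %>%
        self$conv1() %>%
        self$conv2() %>%
        self$flatten() %>%
        self$dense() %>%
        tf$split(num_or_size_splits = 2L, axis = 1L)
    }
  })
}
```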

 

The latent space here is just two-dimensional, which keeps things simple and makes visualization easy.
With more complex data, you would likely benefit from choosing a higher dimensionality.

The encoder compresses the real data into estimates of the mean and variance of the latent space.
We then sample “indirectly” from this distribution, via the so-called reparameterization trick:
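A sketch of that sampling step; mean and logvar are the two outputs of the encoder:

```r
reparameterize <- function(mean, logvar) {
  # sample from a standard normal, then shift and scale:
  # z = mean + sigma * eps, with sigma = exp(logvar / 2)
  eps <- k_random_normal(shape = k_shape(mean))
  eps * k_exp(logvar * 0.5) + mean
}
```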

 

The sampled values then enter the decoder, which attempts to map them back to locations in the original input space.
The decoder is essentially a sequence of transposed convolutions, progressively upsampling until a final output resolution of 28×28 is reached.
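A corresponding decoder sketch (again, layer sizes are illustrative; two stride-2 transposed convolutions take a 7×7 feature map up to 28×28):

```r
decoder_model <- function(name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$dense <- layer_dense(units = 7 * 7 * 32, activation = "relu")
    self$reshape <- layer_reshape(target_shape = c(7, 7, 32))
    self$deconv1 <- layer_conv_2d_transpose(filters = 64, kernel_size = 3,
                                            strides = 2, padding = "same",
                                            activation = "relu")
    self$deconv2 <- layer_conv_2d_transpose(filters = 32, kernel_size = 3,
                                            strides = 2, padding = "same",
                                            activation = "relu")
    # note: no sigmoid activation on the last layer (see below)
    self$deconv3 <- layer_conv_2d_transpose(filters = 1, kernel_size = 3,
                                            strides = 1, padding = "same")

    function(x, mask = NULL) {
      x %>%
        self$dense() %>%
        self$reshape() %>%
        self$deconv1() %>%
        self$deconv2() %>%
        self$deconv3()
    }
  })
}
```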

 

Why does the final deconvolution layer lack the sigmoid activation one might expect? Because we will use tf$nn$sigmoid_cross_entropy_with_logits when calculating the loss, which works on raw logits.

Which brings us to the loss calculations.

Loss calculations

A VAE's loss function combines a reconstruction loss (cross-entropy, in this case) with the Kullback-Leibler divergence. In Keras, the latter is directly available as loss_kullback_leibler_divergence.

Here, though, we take a different route and estimate the complete evidence lower bound (ELBO) directly by sampling, rather than computing just the reconstruction loss and adding the analytically derived Kullback-Leibler divergence.

The calculation of the normal log-likelihood is wrapped in a function so it can be reused in the training loop.
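One way to write such a helper (a sketch; normal_loglik is just the name used here) computes the log-density of a diagonal Gaussian and sums over the latent dimensions:

```r
normal_loglik <- function(sample, mean, logvar, reduce_axis = 2) {
  # per-dimension negative log-density of N(mean, exp(logvar))
  loglik <- k_constant(0.5) * (
    k_log(2 * k_constant(pi)) +
      logvar +
      k_exp(-logvar) * (sample - mean)^2
  )
  # sum over the latent dimensions, keeping the batch dimension
  -k_sum(loglik, axis = reduce_axis)
}
```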

 

Jumping ahead to the training loop for a moment, this is how those quantities will be computed there (a consolidated sketch follows below).

First, we obtain the log-likelihood of the reconstructed samples, conditional on values drawn from the latent space; this is the reconstruction loss.

Then, we compute the prior log-likelihood of the latent codes. As is typical for VAEs, the prior is assumed to be a standard normal distribution.

Lastly, we compute the log-likelihood of the latent samples, conditioned on the mean and variance estimated by the encoder.

From these three parts, the final loss is computed.
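Putting those pieces together, here is a sketch of how the loss could be computed for a batch x inside the training loop (variable names such as preds are just the ones used in this sketch):

```r
c(mean, logvar) %<-% encoder(x)
z <- reparameterize(mean, logvar)
preds <- decoder(z)

# reconstruction loss: log p(x|z), computed from the decoder's raw logits,
# summed over the image dimensions (batch dimension kept)
crossentropy_loss <- tf$nn$sigmoid_cross_entropy_with_logits(
  labels = x, logits = preds
)
logpx_z <- -tf$reduce_sum(crossentropy_loss, axis = c(1L, 2L, 3L))

# prior log-likelihood log p(z): standard normal (mean 0, log-variance 0)
logpz <- normal_loglik(z, k_constant(0), k_constant(0))

# posterior log-likelihood log q(z|x)
logqz_x <- normal_loglik(z, mean, logvar)

# negative ELBO estimate, averaged over the batch
loss <- -k_mean(logpx_z + logpz - logqz_x)
```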

After this brief peek ahead, let's return to the present and finish preparing for training.

Final setup

To reduce that loss, we need an optimizer that will strive to minimize it.

We instantiate our models …

 

And we set up checkpointing, so that trained weights can be restored later.
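A sketch of this remaining setup, assuming the TF 1.x-era APIs (the checkpoint directory name is just an example):

```r
# an optimizer that will work on reducing the loss
optimizer <- tf$train$AdamOptimizer(1e-4)

# instantiate the two models
encoder <- encoder_model()
decoder <- decoder_model()

# checkpointing, so trained weights can be restored later
checkpoint_dir <- "./checkpoints_cvae"      # example path
checkpoint_prefix <- file.path(checkpoint_dir, "ckpt")
checkpoint <- tf$train$Checkpoint(
  optimizer = optimizer,
  encoder = encoder,
  decoder = decoder
)
```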

 

In the training loop, we will periodically call three functions that are not shown here but are part of the complete example: generate_random_clothes, which generates clothes from random samples drawn from the latent space; show_latent_space, which displays the complete test set in the two-dimensional latent space; and show_grid, which generates clothes from latent values arranged on a grid.

Let's start training! Actually, before we do, let's look at what those functions display prior to any training. Instead of clothes, we see seemingly random pixels. The latent space has no structure; no classes of clothing cluster together.

Training loop

We train for 50 epochs. In each epoch, we loop over the training data in batches. For every batch, we follow the usual eager-execution rhythm: inside a GradientTape, run the models and compute the current loss; then, outside its context, obtain the gradients and have the optimizer apply them to the weights.

What is special here is that we have two models, both of which need their gradients calculated and their weights updated. To do this with a single tape, we create it as persistent.

After every epoch, we save the current weights. In addition, every ten epochs we save the plots for later inspection.
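Here is a condensed sketch of that loop (the calls to the plotting helpers assume hypothetical signatures such as generate_random_clothes(epoch); loss bookkeeping and progress output are omitted):

```r
num_epochs <- 50

for (epoch in seq_len(num_epochs)) {
  iter <- make_iterator_one_shot(train_dataset)

  until_out_of_range({
    x <- iterator_get_next(iter)

    # persistent = TRUE: we need the tape twice, once per model
    with(tf$GradientTape(persistent = TRUE) %as% tape, {
      c(mean, logvar) %<-% encoder(x)
      z <- reparameterize(mean, logvar)
      preds <- decoder(z)

      crossentropy_loss <- tf$nn$sigmoid_cross_entropy_with_logits(
        labels = x, logits = preds
      )
      logpx_z <- -tf$reduce_sum(crossentropy_loss, axis = c(1L, 2L, 3L))
      logpz <- normal_loglik(z, k_constant(0), k_constant(0))
      logqz_x <- normal_loglik(z, mean, logvar)
      loss <- -k_mean(logpx_z + logpz - logqz_x)
    })

    # gradients for both models, then one optimizer step each
    encoder_gradients <- tape$gradient(loss, encoder$variables)
    decoder_gradients <- tape$gradient(loss, decoder$variables)

    optimizer$apply_gradients(
      purrr::transpose(list(encoder_gradients, encoder$variables))
    )
    optimizer$apply_gradients(
      purrr::transpose(list(decoder_gradients, decoder$variables))
    )
  })

  # save weights after every epoch, plots every ten epochs
  checkpoint$save(file_prefix = checkpoint_prefix)
  if (epoch %% 10 == 0) {
    generate_random_clothes(epoch)
    show_latent_space(epoch)
    show_grid(epoch)
  }
}
```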

 

Results

How well did that work? Here are random clothes generated after 50 epochs of training.

And this is how the different classes are distributed in latent space; the internal structure is not very clear, and the classes are hard to tell apart.

Finally, let's watch clothes morph into one another as we traverse the latent space along a grid: one type of garment transitions gradually into the next.

How good are these representations? That is hard to say in the absence of a comparison.

So let's train MMD-VAE on this same dataset and see how it performs.

MMD-VAE

MMD-VAE promises to yield more informative latent representations, so we would expect noticeably different behavior in the clustering and morphing plots.

The data setup stays the same, and there are only minor differences in the model. As announced, we will just highlight the differences.

Differences in the model(s)

There are three differences with respect to the models.

First, the encoder does not need to return a variance, so there is no need for tf$split. Its call method now simply is
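A sketch of that call method (the dense layer now has just latent_dim units, and its output is returned directly):

```r
function(x, mask = NULL) {
  x %>%
    self$conv1() %>%
    self$conv2() %>%
    self$flatten() %>%
    self$dense()    # latent_dim units, no split into mean and variance
}
```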

 

Second, without a sampling step between the encoder and the decoder, there is no notion of reparameterization.
And third, since we won't use tf$nn$sigmoid_cross_entropy_with_logits to calculate the loss, the decoder applies a sigmoid activation in its final deconvolution layer.
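For instance, the decoder's last layer could now be defined like this:

```r
self$deconv3 <- layer_conv_2d_transpose(filters = 1, kernel_size = 3,
                                        strides = 1, padding = "same",
                                        activation = "sigmoid")
```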

 

Loss calculations

As expected, the big difference lies in the loss function.

The idea behind maximum mean discrepancy (MMD) is that two probability distributions are identical if and only if all of their moments are identical.
Concretely, MMD is estimated using a kernel, such as the Gaussian kernel, to assess similarity between distributions.
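In its usual form, the Gaussian kernel over latent codes $z$ and $z'$ is

$$ k(z, z') = e^{-\frac{\lVert z - z' \rVert^2}{2\sigma^2}} $$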

The idea then is that if two distributions are identical, the average similarity between samples from each distribution on its own should equal the average similarity between mixed samples drawn from both distributions.
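Written out with a kernel $k$, this yields the standard MMD estimate between distributions $p$ and $q$:

$$ MMD(p, q) = E_{p(z), p(z')}\left[k(z, z')\right] + E_{q(z), q(z')}\left[k(z, z')\right] - 2\, E_{p(z), q(z')}\left[k(z, z')\right] $$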

The code is a fairly straightforward implementation of this.
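A sketch of that implementation with a Gaussian kernel (function names as used here; the kernel bandwidth is folded into a division by the latent dimensionality):

```r
# pairwise Gaussian kernel values between the rows of x and y
compute_kernel <- function(x, y) {
  x_size <- k_shape(x)[1]
  y_size <- k_shape(y)[1]
  dim <- k_shape(x)[2]
  tiled_x <- k_tile(k_reshape(x, k_stack(list(x_size, 1, dim))),
                    k_stack(list(1, y_size, 1)))
  tiled_y <- k_tile(k_reshape(y, k_stack(list(1, y_size, dim))),
                    k_stack(list(x_size, 1, 1)))
  k_exp(-k_mean(k_square(tiled_x - tiled_y), axis = 3) /
          k_cast(dim, tf$float32))
}

# MMD: within-sample similarities minus twice the cross-sample similarity
compute_mmd <- function(x, y) {
  x_kernel <- compute_kernel(x, x)
  y_kernel <- compute_kernel(y, y)
  xy_kernel <- compute_kernel(x, y)
  k_mean(x_kernel) + k_mean(y_kernel) - 2 * k_mean(xy_kernel)
}
```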

 

Training loop

The training loop differs from the standard VAE example only in its loss calculations.
Here are the respective lines:
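A sketch of those lines (codes is the encoder output for the current batch x, and true_samples are samples drawn from the prior):

```r
codes <- encoder(x)
preds <- decoder(codes)

# draw samples from the prior to compare against the encoder's codes
true_samples <- k_random_normal(shape = c(batch_size, latent_dim))
loss_mmd <- compute_mmd(true_samples, codes)

# plain mean squared error as the reconstruction loss
loss_nll <- k_mean(k_square(x - preds))

loss <- loss_nll + loss_mmd
```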

 

We compute the MMD loss as well as the reconstruction loss, and then simply add them up. Note that no sampling is involved in this model.
Now we are curious: how well does this work?

Results

Again, let's first look at the randomly generated clothes. The shapes appear a bit more defined here.

The clusters now spread out neatly into distinct patterns in the two-dimensional space, centered at the origin (0, 0), just as expected.

Now let's again watch clothes morph into one another. The transitions look smooth and gradual.
Notably, most of the latent space is filled with meaningful objects, in stark contrast to the situation described earlier.

MNIST

Out of curiosity, we produced the same plots after training on the original MNIST dataset.
The randomly generated digits after 50 epochs of training show little difference between the two approaches.

Left: random digits as generated after training with ELBO loss. Right: MMD loss.

The differences in clustering do not look substantial either.

Left: latent space as observed after training with ELBO loss. Right: MMD loss.

Here, though, the morphing again appears noticeably more natural with MMD-VAE.

Left: Morphing as observed after training with ELBO loss. Right: MMD loss.

Conclusion

To us, this demonstrates strikingly how much impact the choice of cost function has on a VAE.
Another promising direction for experimentation is the prior distribution applied to the latent space, for example the approach proposed in the “Variational Mixture of Posteriors” (VampPrior) paper.

Presumably, variations in cost functions and priors will pay off even more once we leave the controlled setting of (Fashion-)MNIST and work with real-world datasets.

Burgess, C. P., I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. 2018. “Understanding Disentangling in β-VAE.” arXiv preprint, April.
Doersch, C. 2016. “Tutorial on Variational Autoencoders.” arXiv preprint, June.

Kingma, D. P., and M. Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv abs/1312.6114.

Tomczak, J. M., and M. Welling. 2017. “VAE with a VampPrior.” arXiv abs/1705.07120.

Zhao, S., J. Song, and S. Ermon. 2017. “InfoVAE: Information Maximizing Variational Autoencoders.” arXiv abs/1706.02262.
