Variational Autoencoder


Introduction

A central challenge in machine learning systems that interact with the physical world — robotics, autonomous vehicles, embodied agents that the industry has lately taken to calling "Physical AI" — is uncertainty. Real-world interaction is inherently stochastic: sensor noise, unobserved physical properties, and environmental variability all conspire against deterministic models. Even classical robotic systems handle this with probabilistic methods (Kalman filtering being the canonical example); modern approaches push further by treating system states as full probability distributions rather than point estimates.

The Variational Autoencoder (VAE) is one such approach. It synthesizes ideas from information theory and Bayesian inference into a model that learns a continuous latent distribution characterized by a mean (\(\mu\)) and a standard deviation (\(\sigma\), with variance \(\sigma^2\)), providing a quantitative handle on what a learned system knows and what it does not.

Variational Inference

In Bayesian statistics, Variational Inference (VI) transforms the problem of posterior inference - which typically involves solving high-dimensional, intractable integrals - into a constrained optimization problem.

While deterministic control algorithms rely on point estimates (e.g., "the center of mass is exactly at coordinates \(\mathbf{r}\)"), VI treats the state as a probability distribution. The structural shift is this: instead of seeking a single value, we search for the parameters \(\boldsymbol{\phi}\) of a distribution \(q_{\boldsymbol{\phi}}(\mathbf{z})\) that minimize the Kullback-Leibler (KL) divergence \(D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}}(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\right)\) to the true, unknown posterior \(p(\mathbf{z} \mid \mathbf{x})\).

Since the true posterior is analytically intractable for complex models, we restrict our search to a family of simpler distributions (such as Gaussians). By maximizing the Evidence Lower Bound (ELBO), the system finds the "best fit" distribution that balances data fidelity with prior beliefs.

For a robot, this approximation is one ingredient in uncertainty-aware control. The posterior variance — the "spread" of the distribution — provides one quantitative signal that can inform decisions such as whether to proceed with an action or trigger a fallback. We note, however, that the relation between the VAE's posterior spread and the conceptual notion of epistemic uncertainty (uncertainty due to limited model knowledge, as distinct from aleatoric uncertainty inherent in the observation noise) is not automatic; the VAE's encoder learns a single distribution per input and does not by construction separate these two sources. Established methods for principled epistemic-uncertainty estimation — Monte Carlo dropout, deep ensembles, Bayesian neural networks — operate on different principles, and the literature on the reliability of generative-model latent variances as uncertainty proxies is active and unsettled (e.g., Kendall & Gal, 2017). The posterior-spread signal complements rather than replaces classical safety mechanisms (limit switches, mechanical fail-safes, model-based control bounds), and the demo below should be read as a pedagogical illustration of one such complement, not a deployment-ready uncertainty-quantification scheme.

The variational framework introduced here rests on measure-theoretic foundations that are developed across our probability section. Speaking of densities \(q(\boldsymbol{z})\) and \(p(\boldsymbol{z} \mid \boldsymbol{x})\) with respect to a common dominating measure is licensed by the Radon-Nikodym theorem, and conditional expectation — defined rigorously as a Radon-Nikodym derivative — provides the idealised abstract framework that the densities and ELBO decomposition implemented here approximate. The implementation works directly with explicit Gaussian densities, Monte Carlo samples, and reparameterized gradients; the measure-theoretic construction is what licenses these computations as approximations to a well-defined object, not a layer that the code instantiates directly. The full VI framework derives the ELBO using these tools and situates the VAE's amortized inference as a special case of the broader variational principle.

Variational Autoencoder (VAE)

From Factor Analysis to Nonlinear Generative Models

In classical factor analysis (FA), we model the observed data \(\boldsymbol{x} \in \mathbb{R}^D\) as a linear function of a latent variable \(\boldsymbol{z} \in \mathbb{R}^K\) with \(K \ll D\), plus Gaussian noise: \[ p(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{Wz}, \sigma^2 \boldsymbol{I}) \] where \(\boldsymbol{W} \in \mathbb{R}^{D \times K}\) is the factor loading matrix. (General FA allows a diagonal noise covariance; the isotropic form \(\sigma^2 \boldsymbol{I}\) used here is the probabilistic PCA special case.) The linearity of the mapping \(\boldsymbol{z} \mapsto \boldsymbol{Wz}\), combined with a Gaussian prior on \(\boldsymbol{z}\), makes posterior inference tractable: given an observation \(\boldsymbol{x}\), the posterior \(p(\boldsymbol{z} \mid \boldsymbol{x})\) is Gaussian in closed form. However, this linearity also limits the model's expressiveness - real-world data distributions are rarely well-captured by linear mappings.

The Variational Autoencoder (VAE) extends factor analysis by replacing the linear mapping \(\boldsymbol{Wz}\) with an arbitrary nonlinear function parameterized by a neural network: \[ p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid f_d (\boldsymbol{z}; \boldsymbol{\theta}), \, \sigma^2 \boldsymbol{I}) \] where \(f_d(\cdot\,; \boldsymbol{\theta})\) is a decoder network with parameters \(\boldsymbol{\theta}\). This nonlinear generative model can represent far richer data distributions, but it comes at a cost: the posterior \(p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\) is no longer analytically tractable, because computing it requires the marginal likelihood \[ p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x}) = \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})}{p_{\boldsymbol{\theta}}(\boldsymbol{x})} \quad \text{where} \quad p_{\boldsymbol{\theta}}(\boldsymbol{x}) = \int p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})\, d\boldsymbol{z} \] which involves an intractable integral over the latent space when \(f_d\) is a deep network. This intractability is the central challenge that motivates the variational approach.

Amortized Variational Inference

Since the true posterior is intractable, we introduce an approximate posterior - a recognition network (also called the inference network or encoder) - that is trained simultaneously with the generative model. We restrict the approximate posterior to a tractable family: a Gaussian with diagonal covariance, yielding \[ q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x}) = \mathcal{N}\!\left(\boldsymbol{z} \;\middle|\; f_{e,\mu} (\boldsymbol{x} ; \boldsymbol{\phi}), \, \text{diag}\!\left(f_{e,\sigma}(\boldsymbol{x};\boldsymbol{\phi})^2\right)\right) \] where \(f_{e,\mu}\) and \(f_{e,\sigma}\) are the encoder's output heads that produce the mean vector and the vector of per-dimension standard deviations respectively (the diagonal covariance entries are the element-wise squares of the latter), and \(\boldsymbol{\phi}\) denotes all encoder parameters.

This approach is called amortized inference: rather than running a separate optimization procedure to compute the posterior for each individual data point (as in classical variational inference), we train a single neural network that directly maps any input \(\boldsymbol{x}\) to the parameters of its approximate posterior. The cost of inference at test time is thus reduced to a single forward pass through the encoder, with the computational investment "amortized" across the entire training set.
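As a rough illustration of what "a single forward pass" means, the sketch below uses a tiny linear encoder in plain Python. The weight values, the function name `encode`, and the choice of a log-variance output head (converted to \(\sigma\) via \(\exp(\tfrac{1}{2}\log\sigma^2)\)) are assumptions made for this example; a real encoder would be a trained deep network.

```python
import math

def encode(x, W_mu, b_mu, W_logvar, b_logvar):
    """Map one input vector x to (mu, sigma) of a diagonal Gaussian q(z | x)."""
    def affine(W, b):
        return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

    mu = affine(W_mu, b_mu)
    log_var = affine(W_logvar, b_logvar)
    # Predicting log-variance keeps sigma strictly positive (assumed convention).
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return mu, sigma

# Hypothetical fixed parameters phi for D = 3 inputs and K = 2 latent dimensions.
W_mu = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]
b_mu = [0.0, 0.1]
W_logvar = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
b_logvar = [-1.0, -1.0]

# The same parameters serve every input: no per-datapoint optimization loop.
for x in ([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]):
    mu, sigma = encode(x, W_mu, b_mu, W_logvar, b_logvar)
    print(x, mu, sigma)
```

The point of the sketch is only the calling pattern: once \(\boldsymbol{\phi}\) is fixed, inference for a new \(\boldsymbol{x}\) costs one function evaluation.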

The Evidence Lower Bound (ELBO)

Since the marginal likelihood \(p_{\boldsymbol{\theta}}(\boldsymbol{x})\) is intractable, we cannot maximize it directly. Instead, we derive a tractable lower bound. For a single observation \(\boldsymbol{x}\), define the evidence lower bound (ELBO) as follows.

Definition: Evidence Lower Bound (ELBO)

For a generative model \(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\) and approximate posterior \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\), the ELBO is \[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \right]. \]

To see that this is indeed a lower bound on the log-evidence, apply Jensen's inequality. Since \(\log\) is concave, moving it outside the expectation can only increase the value: \[ \begin{align*} L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) &= \int q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\, d\boldsymbol{z} \\\\ &\leq \log \int q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\, d\boldsymbol{z} \\\\ &= \log \int p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\, d\boldsymbol{z} \\\\ &= \log p_{\boldsymbol{\theta}}(\boldsymbol{x}). \end{align*} \] That is, \(L(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{x}) \leq \log p_{\boldsymbol{\theta}}(\boldsymbol{x})\) for any choice of \(q_{\boldsymbol{\phi}}\). Maximizing the ELBO with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) therefore simultaneously pushes up the log-likelihood and tightens the bound by making \(q_{\boldsymbol{\phi}}\) a better approximation to the true posterior.
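The size of the gap in this bound can be made explicit. The short derivation below uses only the factorization \(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) = p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\, p_{\boldsymbol{\theta}}(\boldsymbol{x})\) and the fact that \(\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\) does not depend on \(\boldsymbol{z}\): \[ \begin{align*} L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x}) + \log p_{\boldsymbol{\theta}}(\boldsymbol{x}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \right] \\ &= \log p_{\boldsymbol{\theta}}(\boldsymbol{x}) - D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\right). \end{align*} \] The gap is therefore exactly the KL divergence from the approximate posterior to the true posterior. Because this divergence is non-negative and vanishes only when the two distributions agree (almost everywhere), the bound is tight precisely when \(q_{\boldsymbol{\phi}}\) recovers the true posterior.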

ELBO Decomposition: Reconstruction + Regularization

The ELBO can be decomposed into two interpretable terms by expanding the joint \(\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) = \log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) + \log p(\boldsymbol{z})\) and rearranging.

Theorem: ELBO Decomposition

\[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right)}_{\text{regularization}} \]

Proof:

Expanding the joint inside the ELBO definition, \[ \begin{align*} L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) + \log p(\boldsymbol{z}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\right] \\\\ &= \mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log \frac{p(\boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\right] \\\\ &= \mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right] - D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right). \end{align*} \]

The first term is the expected reconstruction log-likelihood: it encourages the decoder to reconstruct the input \(\boldsymbol{x}\) from latent samples \(\boldsymbol{z} \sim q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\). The second term is the KL divergence between the approximate posterior and the prior \(p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), which regularizes the latent space by penalizing approximate posteriors that deviate from the prior. This regularization is what gives the VAE its structured, continuous latent space - without it, the encoder could collapse each data point to an isolated delta function, losing the ability to generate new data by sampling from \(p(\boldsymbol{z})\).
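The decomposition can be checked numerically in a case where everything is available in closed form. The sketch below assumes a one-dimensional linear-Gaussian model (\(z \sim \mathcal{N}(0, 1)\), \(x \mid z \sim \mathcal{N}(wz, s^2)\)) chosen for this illustration only; the function names and numeric values are not from the demo. In this model the evidence is \(\mathcal{N}(x \mid 0, w^2 + s^2)\), so the ELBO can be compared with \(\log p(x)\) directly.

```python
import math

def log_evidence(x, w, s2):
    """log p(x) for z ~ N(0, 1), x | z ~ N(w z, s2), i.e. x ~ N(0, w^2 + s2)."""
    var = w * w + s2
    return -0.5 * math.log(2 * math.pi * var) - x * x / (2 * var)

def elbo(x, w, s2, m, v):
    """Reconstruction term minus KL term for q(z | x) = N(m, v)."""
    # E_q[log p(x | z)], using E_q[(x - w z)^2] = (x - w m)^2 + w^2 v.
    recon = -0.5 * math.log(2 * math.pi * s2) - ((x - w * m) ** 2 + w * w * v) / (2 * s2)
    # KL(N(m, v) || N(0, 1)) in closed form.
    kl = 0.5 * (m * m + v - math.log(v) - 1.0)
    return recon - kl

x, w, s2 = 1.3, 2.0, 0.5
v_star = 1.0 / (1.0 + w * w / s2)   # exact posterior variance
m_star = v_star * w * x / s2        # exact posterior mean

print(log_evidence(x, w, s2))
print(elbo(x, w, s2, m_star, v_star))  # q = exact posterior: bound is tight
print(elbo(x, w, s2, 0.0, 1.0))        # q = prior: strictly below log p(x)
```

With \(q\) set to the exact posterior the two quantities coincide up to floating-point error; any other Gaussian \(q\) gives a strictly smaller value.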

The Reparameterization Trick

To train the VAE end-to-end via gradient-based optimization, we need to differentiate the ELBO with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\). The gradient with respect to the decoder parameters \(\boldsymbol{\theta}\) poses no difficulty, since \(\boldsymbol{\theta}\) appears only inside the expectation. However, differentiating with respect to the encoder parameters \(\boldsymbol{\phi}\) is problematic: the expectation \(\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}[\cdot]\) is taken with respect to a distribution that itself depends on \(\boldsymbol{\phi}\), so we cannot simply interchange the gradient and the expectation.

The reparameterization trick resolves this by expressing the stochastic latent variable \(\boldsymbol{z}\) as a deterministic, differentiable transformation of a parameter-free noise variable. Since the encoder outputs a diagonal Gaussian, we write

Reparameterization Trick:

Sample \(\boldsymbol{z}\) from \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\) via \[ \boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}} (\boldsymbol{x})\odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \] where \(\mu_{\boldsymbol{\phi}} = f_{e,\mu}(\boldsymbol{x}; \boldsymbol{\phi})\) and \(\sigma_{\boldsymbol{\phi}} = f_{e,\sigma}(\boldsymbol{x}; \boldsymbol{\phi})\) are the encoder outputs, and \(\odot\) denotes element-wise multiplication.

Under this reparameterization, the ELBO becomes \[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) = \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})} \!\left[\log p_{\boldsymbol{\theta}}\!\left(\boldsymbol{x} \mid \boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}}(\boldsymbol{x})\odot \boldsymbol{\epsilon}\right)\right] - D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right). \] The key observation is that the expectation is now taken with respect to the fixed distribution \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), which does not depend on \(\boldsymbol{\phi}\). This means we can interchange the gradient and the expectation: \(\nabla_{\boldsymbol{\phi}} \mathbb{E}_{\boldsymbol{\epsilon}}[\cdot] = \mathbb{E}_{\boldsymbol{\epsilon}}[\nabla_{\boldsymbol{\phi}}(\cdot)]\), and estimate the gradient via Monte Carlo sampling of \(\boldsymbol{\epsilon}\). In practice, even a single sample per data point provides a sufficiently low-variance gradient estimate for stochastic gradient descent.

The full VAE training objective over a dataset \(\mathcal{D}\) is therefore \[ \min_{\boldsymbol{\theta}, \, \boldsymbol{\phi}} \;-\, \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) \right] \] which is optimized end-to-end using standard backpropagation through the reparameterized sampling step.
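To see the reparameterized gradient estimator at work outside a neural network, the sketch below uses a toy integrand chosen for this illustration: \(\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2] = \mu^2 + \sigma^2\), whose exact gradients are \(2\mu\) and \(2\sigma\). The derivatives are written out by hand with the chain rule; an automatic-differentiation framework would compute the same quantities. The function name and sample count are assumptions of the example.

```python
import random

def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2], z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # parameter-free noise
        z = mu + sigma * eps        # deterministic and differentiable in (mu, sigma)
        g_mu += 2.0 * z             # d(z^2)/dmu    = 2 z * dz/dmu    = 2 z
        g_sigma += 2.0 * z * eps    # d(z^2)/dsigma = 2 z * dz/dsigma = 2 z * eps
    return g_mu / n_samples, g_sigma / n_samples

mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma)
print(g_mu, 2 * mu)        # estimate next to the exact gradient 2 * mu
print(g_sigma, 2 * sigma)  # estimate next to the exact gradient 2 * sigma
```

The many-sample average here is only for checking the estimator against the known answer; in VAE training the averaging is typically done across the minibatch instead.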

Putting It All Together

The components developed above - the nonlinear generative model, the amortized inference network, the ELBO objective, and the reparameterization trick - collectively define the Variational Autoencoder.

Variational Autoencoder (VAE):

A VAE is a latent variable model consisting of:

  1. A prior over the latent space: \(p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\)
  2. A decoder (generative model): \(p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid f_d(\boldsymbol{z}; \boldsymbol{\theta}), \, \sigma^2 \boldsymbol{I})\)
  3. An encoder (inference network): \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) = \mathcal{N}\!\left(\boldsymbol{z} \mid f_{e,\mu}(\boldsymbol{x}; \boldsymbol{\phi}), \, \text{diag}(f_{e,\sigma}(\boldsymbol{x}; \boldsymbol{\phi})^2)\right)\)

The parameters \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) are trained jointly by maximizing the ELBO \[ \max_{\boldsymbol{\theta}, \, \boldsymbol{\phi}} \; \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})} \!\left[\log p_{\boldsymbol{\theta}}\!\left(\boldsymbol{x} \mid \boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}}(\boldsymbol{x}) \odot \boldsymbol{\epsilon}\right)\right] - D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right) \right] \] where gradients with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) are computed via backpropagation through the reparameterized sampling step.

VAE Uncertainty in Robotic Manipulation

Overview

This simulation demonstrates how a Variational Autoencoder (VAE) can be used as a real-time safety monitor during robotic manipulation. A robot arm with two primary links attempts to lift a box with an off-center mass. The spread of the VAE's posterior distribution \(q(\mathbf{z} \mid \mathbf{x})\) serves as the robot's internal uncertainty signal about the physical state of the grasp. When this signal exceeds a user-set threshold \(\sigma_{\text{thr}}\), the system aborts the lift to prevent a drop.

The Physical Setup

The robot is a 3-DOF articulated arm: a rotating base plus two revolute joints. Its reach is governed by two primary links with lengths \(L_1 = 24\) and \(L_2 = 22\). Analytic inverse kinematics maps the 3D target coordinates \((x, y, z)\) of the end-effector to the arm's joint angles.

A rigid box of adjustable mass \(m\) sits on the ground plane. Its center of mass (CoM) is offset from its geometric center by a user-controlled displacement \(\mathbf{c} = (c_x, c_y, c_z)\). This offset serves as the primary source of epistemic uncertainty: the robot cannot directly observe the true CoM location and must infer the resulting torque anomalies from sensor feedback during the initial lift phase.

Torque as Sensory Input

Once the gripper secures the box and begins to lift, gravity acting on the offset CoM produces a torque about the grip point:

\[ \boldsymbol{\tau} = (\mathbf{r}_{\text{CoM}} - \mathbf{r}_{\text{grip}}) \times (-mg\,\hat{\mathbf{y}}) \]

where \(\mathbf{r}_{\text{CoM}}\) is the world-space CoM position and \(\mathbf{r}_{\text{grip}}\) is the grip point. In this demo the torque vector serves as the input feature on which the encoder is defined; in a deployed robotic system the input would typically be a higher-dimensional sensor stream (force-torque sensor readings, joint encoder readings, vision) processed by a learned encoder rather than the closed-form mapping below. The torque vector is displayed as a green arrow on the box during the lift.
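The torque formula above is a single cross product, sketched below in plain Python. The value \(g = 9.81\), the numeric positions, and the function names are assumptions for illustration; the demo's own units and constants are not stated here.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def gravity_torque(r_com, r_grip, mass, g=9.81):
    """Torque about the grip point from gravity acting at the center of mass."""
    lever = tuple(c - p for c, p in zip(r_com, r_grip))
    weight = (0.0, -mass * g, 0.0)  # gravity along -y, matching the formula above
    return cross(lever, weight)

# CoM displaced 0.1 along +x from the grip point, mass 2 (illustrative values).
tau = gravity_torque(r_com=(0.1, 0.0, 0.0), r_grip=(0.0, 0.0, 0.0), mass=2.0)
print(tau)
```

A CoM directly above or below the grip point (offset purely along \(y\)) gives zero torque, since the lever arm is then parallel to gravity.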

Rotational Dynamics

The torque drives the box's rotational dynamics via the damped Euler equation:

\[ I\,\dot{\boldsymbol{\omega}} = \boldsymbol{\tau} - C_d\,\boldsymbol{\omega} \]

where \(I\) is the moment of inertia, \(\boldsymbol{\omega}\) is the angular velocity, and \(C_d\) is the damping coefficient. This produces the visible tilt of the box during lifting. Higher mass or larger CoM offset leads to larger torques, faster rotation, and more pronounced tilt - all signals that feed into the VAE.
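A minimal sketch of integrating this equation along a single axis follows. It assumes explicit Euler time-stepping and a constant torque, both simplifications chosen for the example; the demo's actual integrator and time step are not specified in this text. Under constant torque the angular velocity settles at \(\tau / C_d\).

```python
def simulate_tilt_rate(tau, inertia, damping, dt=0.01, steps=2000):
    """Explicit-Euler integration of I * domega/dt = tau - C_d * omega (one axis)."""
    omega = 0.0
    for _ in range(steps):
        omega += dt * (tau - damping * omega) / inertia
    return omega

omega_end = simulate_tilt_rate(tau=1.0, inertia=1.0, damping=2.0)
print(omega_end)  # settles near the steady state tau / C_d
```

Raising `damping` lowers the steady-state rotation rate and shortens the transient, which matches the qualitative description of suppressed wobble.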

VAE Posterior: \(q(\mathbf{z} \mid \mathbf{x})\)

The sensory input \(\mathbf{x}\) (here, the torque feedback) is encoded into a 2D latent space \(\mathbf{z} = (z_1, z_2)\) via a diagonal Gaussian posterior:

\[ q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z};\; \boldsymbol{\mu}(\mathbf{x}),\; \sigma(\mathbf{x})^2 \mathbf{I}) \]

For the purposes of this pedagogical demo, we do not train an encoder neural network end-to-end. Instead we hand-design a fixed, interpretable mapping from torque to posterior parameters, chosen so that the qualitative behaviour of an uncertainty-aware encoder is visible at a glance:

Under this hand-designed mapping, \(\sigma\) is by construction monotone in \(\|\boldsymbol{\tau}\|\), and we use it in this demo as a stand-in for an uncertainty signal: when the torque is small (CoM near grip axis), \(\sigma\) is small; when the torque is large (CoM far off-center), \(\sigma\) is large. In a learned VAE encoder the relationship between input features and posterior spread would emerge from training rather than being prescribed; the qualitative dynamics shown in the simulation are intended to convey how such a learned signal could be consumed by downstream safety logic, not to claim that this fixed mapping is itself a learned uncertainty estimate.
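The demo's exact mapping is not reproduced in this text. The sketch below is a purely hypothetical stand-in with the one property stated above, namely that \(\sigma\) increases monotonically with \(\|\boldsymbol{\tau}\|\). The affine form, the constants `sigma_min`, `gain`, and `mu_scale`, and the choice of which torque components feed the latent mean are all assumptions made for illustration.

```python
import math

def toy_posterior(tau, sigma_min=0.1, gain=0.5, mu_scale=0.2):
    """Hypothetical fixed mapping torque -> (mu, sigma); NOT the demo's actual formula."""
    tau_norm = math.sqrt(sum(t * t for t in tau))
    sigma = sigma_min + gain * tau_norm          # monotone increasing in ||tau||
    mu = (mu_scale * tau[0], mu_scale * tau[2])  # assumed: two torque components -> 2D mean
    return mu, sigma

for tau in [(0.0, 0.0, 0.0), (0.0, 0.0, -0.5), (0.0, 0.0, -2.0)]:
    print(tau, toy_posterior(tau))
```

Any other monotone function of \(\|\boldsymbol{\tau}\|\) would serve equally well for the purpose of the illustration.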

The Latent Space Visualization

The top-right canvas shows the 2D latent space in real time:

The display uses dynamic scaling: both the \(\sigma\) circle and the threshold circle always fit within the canvas, preserving their true ratio. When the lift fails, the orange (\(\sigma\)) circle is visibly larger than the red (threshold) circle, making the safety violation immediately apparent.

KL Divergence

The telemetry panel reports the KL divergence, which measures how far the approximate posterior \(q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})\) diverges from the prior \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\). It is asymmetric, so it is not a distance in the metric sense; in VAE training this same term acts as the regularizer. The general closed-form expression for a \(K\)-dimensional Gaussian with diagonal covariance is: \[ D_{\mathbb{KL}}\!\left[\,q(\mathbf{z} \mid \mathbf{x})\;\|\;p(\mathbf{z})\,\right] = \frac{1}{2} \sum_{j=1}^{K} \left( \mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1 \right) \]

In this specific simulation, where our latent space is 2-dimensional (\(K=2\)) and uses isotropic variance (\(\sigma_1^2 = \sigma_2^2 = \sigma^2\)), this simplifies to the formula used in our telemetry engine: \[ \begin{aligned} D_{\mathbb{KL}} &= \frac{1}{2} \left[ (\mu_1^2 + \sigma^2 - \ln \sigma^2 - 1) + (\mu_2^2 + \sigma^2 - \ln \sigma^2 - 1) \right] \\ &= \frac{1}{2} \left[ (\mu_1^2 + \mu_2^2) + 2\sigma^2 - 2\ln \sigma^2 - 2 \right] \\ &= \frac{1}{2} \left( \|\boldsymbol{\mu}\|^2 + 2\sigma^2 - 2\ln \sigma^2 - 2 \right) \end{aligned} \]
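Both forms of the KL expression translate directly into code. The sketch below implements the general diagonal formula and the isotropic two-dimensional simplification so that they can be compared; the function names are chosen for this example and are not taken from the telemetry engine.

```python
import math

def kl_diag_gaussian(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for per-dimension standard deviations."""
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0 for m, s in zip(mu, sigma))

def kl_isotropic_2d(mu, sigma):
    """Same quantity for K = 2 with a single shared standard deviation sigma."""
    mu_sq = mu[0] ** 2 + mu[1] ** 2
    return 0.5 * (mu_sq + 2 * sigma ** 2 - 2 * math.log(sigma ** 2) - 2.0)

mu, sigma = (0.3, -0.4), 1.5
print(kl_diag_gaussian(mu, (sigma, sigma)))
print(kl_isotropic_2d(mu, sigma))            # agrees with the general form
print(kl_isotropic_2d((0.0, 0.0), 1.0))      # posterior equals prior: zero divergence
```

Note that the divergence grows both when \(\sigma\) rises above 1 and when it shrinks toward 0, so a large KL value alone does not say in which direction the posterior departed from the prior.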

This quantity measures the "information gain" or surprise from the sensory input. A large KL divergence indicates that the physical interaction has pushed the robot's belief significantly away from its prior expectations - a sign of an unusual or difficult grasp configuration.

Ghost Arms: Sampling Uncertainty

The five semi-transparent ghost arms visualize samples from the posterior. Each ghost arm \(i\) receives a perturbed end-effector target:

\[ \mathbf{t}^{(i)} = \mathbf{t}_{\text{primary}} + \boldsymbol{\mu} \cdot s_{\mu} + \sigma \cdot \boldsymbol{\epsilon}^{(i)} \cdot s_{\sigma} \]

where \(s_{\mu}\) and \(s_{\sigma}\) are scalar scale factors on the mean and noise terms, and \(\boldsymbol{\epsilon}^{(i)}\) are Gaussian samples drawn from \(\mathcal{N}(\mathbf{0}, \mathbf{I})\) and then smoothed over time (using an exponential moving average with \(\alpha = 0.92\) for visual stability). When \(\sigma\) is small, the ghosts cluster tightly around the primary arm. When \(\sigma\) is large, they splay outward, creating a visual "cloud of possible robot states" that communicates the degree of uncertainty.
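The ghost-arm perturbation can be sketched as follows. Several details are assumptions of this example rather than facts about the demo: the EMA convention (here \(\alpha\) weights the previous noise value), the scale factors `s_mu` and `s_sigma` set to 1, and the restriction to a two-dimensional target, since the text does not say how the 2D latent offset is embedded in the 3D workspace.

```python
import random

ALPHA = 0.92  # EMA weight on the previous noise value (assumed convention)

def update_noise(prev_eps, rng, alpha=ALPHA):
    """Blend fresh N(0, 1) draws into the previous noise, per coordinate."""
    return [alpha * e + (1.0 - alpha) * rng.gauss(0.0, 1.0) for e in prev_eps]

def ghost_targets(t_primary, mu, sigma, eps_list, s_mu=1.0, s_sigma=1.0):
    """Perturbed 2D targets: t_primary + mu * s_mu + sigma * eps * s_sigma."""
    return [
        tuple(t + m * s_mu + sigma * e * s_sigma for t, m, e in zip(t_primary, mu, eps))
        for eps in eps_list
    ]

rng = random.Random(0)
eps_list = [[0.0, 0.0] for _ in range(5)]   # five ghost arms, 2D noise each
for _ in range(100):                        # advance a few animation frames
    eps_list = [update_noise(eps, rng) for eps in eps_list]

tight = ghost_targets((10.0, 5.0), (0.2, -0.1), 0.05, eps_list)  # small sigma
wide = ghost_targets((10.0, 5.0), (0.2, -0.1), 1.5, eps_list)    # large sigma
print(tight)
print(wide)
```

Under this EMA convention the smoothed noise has a stationary variance of \((1-\alpha)/(1+\alpha) \approx 0.042\) per coordinate rather than 1, so the visual spread is governed jointly by \(\sigma\), \(s_{\sigma}\), and the smoothing.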

The Safety Decision

The lift sequence proceeds through a state machine:

  1. APPROACH → Arm moves from home to hover above the box
  2. DESCEND → Arm lowers to grasp height
  3. GRASP → Fingers close on the box
  4. PRE_LIFT → Small test lift (5 units). During this phase, the system assesses \(\sigma\) for 0.5 seconds. If \(\sigma > \sigma_{\text{thr}}\), the lift is aborted immediately.
  5. LIFT_OK → Full lift if \(\sigma\) stayed below threshold. Continuous safety monitoring continues; an abort can still trigger if \(\sigma\) spikes during the lift.
  6. ABORT → Three-phase emergency: lower the box, release grip, return home.

The critical decision in this demo is the comparison \(\sigma \lessgtr \sigma_{\text{thr}}\). This is a pedagogical analogy to a class of out-of-distribution (OOD) detection heuristics in which a learned uncertainty score gates further action: if the score exceeds a threshold, the system flags the observation as untrusted. We caution that the reliability of generative-model latent variances or likelihoods as OOD detectors is an active research debate, not a settled question: in published experiments, deep generative models (including VAEs and normalizing flows) have assigned higher likelihoods to clearly out-of-distribution inputs than to in-distribution ones (Nalisnick et al., 2019, "Do Deep Generative Models Know What They Don't Know?"). The simulation here is best read as a clean illustration of how an uncertainty signal can be wired into a control state machine, not as evidence that posterior-spread thresholding is a deployment-ready OOD method.
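The state machine and its gating rule can be sketched compactly. The names `Phase`, `next_phase`, and `motion_done` are hypothetical, the three-phase abort sequence is collapsed into a single `ABORT` state, and the 0.5-second window is taken from the list above; this is an illustration of the control flow, not the demo's source code.

```python
from enum import Enum, auto

class Phase(Enum):
    APPROACH = auto()
    DESCEND = auto()
    GRASP = auto()
    PRE_LIFT = auto()
    LIFT_OK = auto()
    ABORT = auto()

ASSESS_SECONDS = 0.5  # PRE_LIFT assessment window

def next_phase(phase, sigma, sigma_thr, elapsed, motion_done):
    """One transition step; `elapsed` is the time in seconds spent in `phase`."""
    # The uncertainty gate applies during the test lift and the full lift.
    if phase in (Phase.PRE_LIFT, Phase.LIFT_OK) and sigma > sigma_thr:
        return Phase.ABORT
    if phase is Phase.APPROACH and motion_done:
        return Phase.DESCEND
    if phase is Phase.DESCEND and motion_done:
        return Phase.GRASP
    if phase is Phase.GRASP and motion_done:
        return Phase.PRE_LIFT
    if phase is Phase.PRE_LIFT and elapsed >= ASSESS_SECONDS:
        return Phase.LIFT_OK
    return phase

print(next_phase(Phase.PRE_LIFT, sigma=0.4, sigma_thr=1.0, elapsed=0.6, motion_done=False))
print(next_phase(Phase.PRE_LIFT, sigma=1.2, sigma_thr=1.0, elapsed=0.1, motion_done=False))
print(next_phase(Phase.LIFT_OK, sigma=1.2, sigma_thr=1.0, elapsed=2.0, motion_done=False))
```

Checking the gate before any other transition makes the abort take priority, which mirrors the "aborted immediately" wording for the test-lift phase.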

What the Parameters Control

CoM Offset \((c_x, c_y, c_z)\): Shifts the center of mass away from the box's geometric center. Larger offsets produce larger torques, higher \(\sigma\) and a higher likelihood of abort. This simulates real-world scenarios where the load distribution inside a package is unknown.

Box Mass: Scales the gravitational force \(mg\), amplifying the torque for a given CoM offset. Heavy objects are harder to lift safely.

\(\sigma\) Threshold: The abort boundary. Lowering it makes the robot more cautious (aborting earlier); raising it makes the robot more risk-tolerant. This models the engineering trade-off between safety and task completion in autonomous systems.

Damping \(C_d\): Controls how quickly the box's rotational oscillation decays. High damping suppresses wobble; low damping allows the box to swing more freely, producing a more dynamic and potentially uncertain lift.

Key Takeaways

This demo illustrates three interconnected ideas from modern Physical AI:

1. Uncertainty quantification through latent representations:
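The VAE does not output a single point estimate of the physical state; it outputs a distribution. The spread of this distribution (\(\sigma\)) provides a quantitative uncertainty signal derived from sensory feedback. In this demo \(\sigma\) comes from a hand-designed mapping and stands in for a learned estimate; it should not be read as a calibrated measure of epistemic uncertainty.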

2. Safety-critical decision-making under uncertainty:
Rather than blindly executing a motion plan, the robot uses its uncertainty estimate as a gating signal. When \(\sigma\) exceeds the threshold, the system recognizes that it is operating outside its confident regime and aborts. This pattern — using a learned uncertainty estimate as a gating signal — appears throughout uncertainty-aware reinforcement learning and Bayesian safe control.

3. The connection between physics and information:
The torque - a purely physical quantity governed by Newton's laws - becomes the input to a probabilistic inference engine. The KL divergence between posterior and prior quantifies how much information the robot has gained (or how surprised it is) from the physical interaction. Large KL divergence means the physical situation deviates significantly from the robot's expectations.