Variational Autoencoder


Introduction

In the intersection of pure mathematics and autonomous systems lies the challenge of uncertainty. While classical robotics often relies on deterministic models, real-world interaction is inherently stochastic. Sensory noise, unobservable physical properties, and environmental variances require a framework that does not merely calculate values, but manages probabilities.

The Variational Autoencoder (VAE) represents a profound synthesis of information theory and Bayesian inference. Unlike standard autoencoders that map data to discrete points in a latent space, the VAE learns the underlying structural manifold of the data. By encoding inputs into a continuous latent distribution characterized by a mean (\(\mu\)) and a standard deviation (\(\sigma\)), the VAE provides a principled way to quantify what the system knows - and, more importantly, what it does not.

Variational Inference

In Bayesian statistics, Variational Inference (VI) transforms the problem of posterior inference - which typically involves solving high-dimensional, intractable integrals - into a constrained optimization problem.

While deterministic control algorithms rely on point estimates (e.g., "the center of mass is exactly at coordinates \(\mathbf{r}\)"), VI treats the state as a probability distribution. This shift is mathematically profound: instead of seeking a single value, we search for the parameters of a distribution \(q_{\boldsymbol{\theta}}(\mathbf{z})\) that minimize the Kullback-Leibler (KL) divergence from the true, unknown posterior \(p(\mathbf{z} \mid \mathbf{x})\).

Since the true posterior is analytically intractable for complex models, we restrict our search to a family of simpler distributions (such as Gaussians). By maximizing the Evidence Lower Bound (ELBO), the system finds the "best fit" distribution that balances data fidelity with prior beliefs.

For a robot, this mathematical approximation is the key to real-time safety. The robot does not merely estimate a property; it quantifies the epistemic uncertainty (the "spread" of the distribution). This variance serves as a rigorous proxy for risk, allowing the AI to distinguish between a confident execution and a situation requiring an emergency abort.

Variational Autoencoder (VAE)

From Factor Analysis to Nonlinear Generative Models

In classical factor analysis (FA), we model the observed data \(\boldsymbol{x} \in \mathbb{R}^D\) as a linear function of a latent variable \(\boldsymbol{z} \in \mathbb{R}^K\) with \(K \ll D\): \[ p(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{Wz}, \sigma^2 \boldsymbol{I}) \] where \(\boldsymbol{W} \in \mathbb{R}^{D \times K}\) is the factor loading matrix. The linearity of \(\boldsymbol{W}\) makes posterior inference tractable: given an observation \(\boldsymbol{x}\), the posterior \(p(\boldsymbol{z} \mid \boldsymbol{x})\) is Gaussian in closed form. However, this linearity also limits the model's expressiveness - real-world data distributions are rarely well-captured by linear mappings.
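The tractability claim can be checked numerically. The sketch below (with illustrative values for \(\boldsymbol{W}\), \(\sigma^2\), and \(\boldsymbol{x}\), and a 1-D latent so the posterior can be evaluated on a grid) compares the closed-form Gaussian posterior of factor analysis against a brute-force application of Bayes' rule:

```python
import numpy as np

# Factor analysis with a 1-D latent (K=1) and 2-D observation (D=2):
#   z ~ N(0, 1),  x | z ~ N(W z, s2 * I)   (values below are illustrative)
W = np.array([[2.0], [1.0]])   # factor loading matrix (D x K)
s2 = 0.5                       # observation noise variance
x = np.array([1.0, -0.5])

# Closed-form Gaussian posterior p(z | x) = N(m, S)
S = np.linalg.inv(np.eye(1) + W.T @ W / s2)   # posterior variance
m = (S @ W.T @ x / s2).item()                 # posterior mean

# Brute-force check: apply Bayes' rule on a dense grid over z
z = np.linspace(-5.0, 5.0, 20001)
dz = z[1] - z[0]
log_lik = -0.5 * np.sum((x[:, None] - W @ z[None, :])**2, axis=0) / s2
post = np.exp(log_lik - 0.5 * z**2)           # likelihood times prior
post /= post.sum() * dz                       # normalize numerically
m_grid = np.sum(z * post) * dz
var_grid = np.sum((z - m_grid)**2 * post) * dz
```

The grid-based mean and variance agree with the closed-form \(m\) and \(S\), which is precisely the luxury the nonlinear decoder takes away.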

The Variational Autoencoder (VAE) extends factor analysis by replacing the linear mapping \(\boldsymbol{Wz}\) with an arbitrary nonlinear function parameterized by a neural network: \[ p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid f_d (\boldsymbol{z}; \boldsymbol{\theta}), \, \sigma^2 \boldsymbol{I}) \] where \(f_d(\cdot\,; \boldsymbol{\theta})\) is a decoder network with parameters \(\boldsymbol{\theta}\). This nonlinear generative model can represent far richer data distributions, but it comes at a cost: the posterior \(p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\) is no longer analytically tractable, because computing it requires the marginal likelihood \[ p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x}) = \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})}{p_{\boldsymbol{\theta}}(\boldsymbol{x})} \quad \text{where} \quad p_{\boldsymbol{\theta}}(\boldsymbol{x}) = \int p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})\, d\boldsymbol{z} \] which involves an intractable integral over the latent space when \(f_d\) is a deep network. This intractability is the central challenge that motivates the variational approach.

Amortized Variational Inference

Since the true posterior is intractable, we introduce an approximate posterior - a recognition network (also called the inference network or encoder) - that is trained simultaneously with the generative model. We restrict the approximate posterior to a tractable family: a Gaussian with diagonal covariance, yielding \[ q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x}) = \mathcal{N}\!\left(\boldsymbol{z} \;\middle|\; f_{e,\mu} (\boldsymbol{x} ; \boldsymbol{\phi}), \, \text{diag}\!\left(f_{e,\sigma}(\boldsymbol{x};\boldsymbol{\phi})\right)\right) \] where \(f_{e,\mu}\) and \(f_{e,\sigma}\) are the encoder's output heads that produce the mean vector and the diagonal covariance respectively, and \(\boldsymbol{\phi}\) denotes all encoder parameters.

This approach is called amortized inference: rather than running a separate optimization procedure to compute the posterior for each individual data point (as in classical variational inference), we train a single neural network that directly maps any input \(\boldsymbol{x}\) to the parameters of its approximate posterior. The cost of inference at test time is thus reduced to a single forward pass through the encoder, with the computational investment "amortized" across the entire training set.
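A minimal numpy sketch of such an encoder (with an arbitrary untrained one-hidden-layer architecture standing in for the learned parameters \(\boldsymbol{\phi}\)) shows the key structural point: a single forward pass maps any \(\boldsymbol{x}\) to \((\boldsymbol{\mu}, \boldsymbol{\sigma})\), with no per-example optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 4, 16, 2   # observation dim, hidden width, latent dim (illustrative)

# Random (untrained) weights -- stand-ins for the learned parameters phi.
W1, b1 = rng.normal(0, 0.1, (H, D)), np.zeros(H)
W_mu, b_mu = rng.normal(0, 0.1, (K, H)), np.zeros(K)
W_lv, b_lv = rng.normal(0, 0.1, (K, H)), np.zeros(K)

def encode(x):
    """Amortized inference: one forward pass maps x to (mu, sigma)."""
    h = np.tanh(W1 @ x + b1)
    mu = W_mu @ h + b_mu
    log_var = W_lv @ h + b_lv          # predict log-variance for positivity
    return mu, np.exp(0.5 * log_var)   # sigma = exp(log_var / 2) > 0

mu, sigma = encode(rng.normal(size=D))
```

Predicting the log-variance rather than \(\sigma\) directly is a common choice: it keeps the output unconstrained while guaranteeing a positive standard deviation.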

The Evidence Lower Bound (ELBO)

Since the marginal likelihood \(p_{\boldsymbol{\theta}}(\boldsymbol{x})\) is intractable, we cannot maximize it directly. Instead, we derive a tractable lower bound. For a single observation \(\boldsymbol{x}\), define the evidence lower bound (ELBO) as follows.

Definition: Evidence Lower Bound (ELBO)

For a generative model \(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\) and approximate posterior \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\), the ELBO is \[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \right]. \]

To see that this is indeed a lower bound on the log-evidence, apply Jensen's inequality. Since \(\log\) is concave, moving it outside the expectation can only increase the value: \[ \begin{align*} L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) &= \int q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\, d\boldsymbol{z} \\\\ &\leq \log \int q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\, d\boldsymbol{z} \\\\ &= \log \int p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\, d\boldsymbol{z} \\\\ &= \log p_{\boldsymbol{\theta}}(\boldsymbol{x}). \end{align*} \] That is, \(L(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{x}) \leq \log p_{\boldsymbol{\theta}}(\boldsymbol{x})\) for any choice of \(q_{\boldsymbol{\phi}}\). Maximizing the ELBO with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) therefore simultaneously pushes up the log-likelihood and tightens the bound by making \(q_{\boldsymbol{\phi}}\) a better approximation to the true posterior.
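The bound can be verified numerically on a model where the evidence is tractable. In the linear-Gaussian model of the previous section (scalar case, illustrative values), \(\log p(x)\) is available in closed form; choosing a deliberately mismatched \(q\) (here, the prior itself) and estimating the ELBO by Monte Carlo shows a strictly loose bound:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tractable linear-Gaussian model: z ~ N(0,1), x | z ~ N(w z, s2).
# Marginally, x ~ N(0, w^2 + s2), so the log-evidence is closed form.
w, s2, x = 1.0, 1.0, 1.0
log_px = -0.5 * np.log(2 * np.pi * (w**2 + s2)) - x**2 / (2 * (w**2 + s2))

# Deliberately mismatched q(z|x) = N(0, 1) (the prior), so the bound is loose.
n = 200_000
z = rng.normal(0.0, 1.0, n)                       # z ~ q
log_lik = -0.5 * np.log(2 * np.pi * s2) - (x - w * z)**2 / (2 * s2)
elbo = log_lik.mean()          # log p(z) - log q(z) = 0 when q is the prior
```

Here `elbo` comes out around \(-1.92\) against a log-evidence of about \(-1.52\): the bound holds, and the gap is exactly the KL divergence from \(q\) to the true posterior.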

ELBO Decomposition: Reconstruction + Regularization

The ELBO can be decomposed into two interpretable terms by expanding the joint \(\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) = \log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) + \log p(\boldsymbol{z})\) and rearranging.

Theorem: ELBO Decomposition

\[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right)}_{\text{regularization}} \]

Proof:

Expanding the joint inside the ELBO definition, \[ \begin{align*} L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) + \log p(\boldsymbol{z}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\right] \\\\ &= \mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log \frac{p(\boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\right] \\\\ &= \mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right] - D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right). \end{align*} \]

The first term is the expected reconstruction log-likelihood: it encourages the decoder to reconstruct the input \(\boldsymbol{x}\) from latent samples \(\boldsymbol{z} \sim q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\). The second term is the KL divergence between the approximate posterior and the prior \(p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), which regularizes the latent space by penalizing approximate posteriors that deviate from the prior. This regularization is what gives the VAE its structured, continuous latent space - without it, the encoder could collapse each data point to an isolated delta function, losing the ability to generate new data by sampling from \(p(\boldsymbol{z})\).
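The last step of the proof replaced \(\mathbb{E}_{q}[\log p(\boldsymbol{z}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})]\) with the negative closed-form KL divergence. A quick Monte Carlo check (illustrative \(\boldsymbol{\mu}\), \(\boldsymbol{\sigma}\)) confirms the two agree:

```python
import numpy as np

rng = np.random.default_rng(2)

# Diagonal Gaussian q(z|x) = N(mu, diag(sigma^2)) against prior p(z) = N(0, I).
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.3])

# Closed-form KL divergence (sum over dimensions)
kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1)

# Monte Carlo estimate of E_q[log q(z|x) - log p(z)]
z = mu + sigma * rng.normal(size=(500_000, 2))
log_q = -0.5 * np.sum(np.log(2*np.pi*sigma**2) + ((z - mu)/sigma)**2, axis=1)
log_p = -0.5 * np.sum(np.log(2*np.pi) + z**2, axis=1)
kl_mc = (log_q - log_p).mean()
```

Because the KL term has this closed form for Gaussian \(q\) and prior, only the reconstruction term needs sampling during VAE training.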

The Reparameterization Trick

To train the VAE end-to-end via gradient-based optimization, we need to differentiate the ELBO with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\). The gradient with respect to the decoder parameters \(\boldsymbol{\theta}\) poses no difficulty, since \(\boldsymbol{\theta}\) appears only inside the expectation. However, differentiating with respect to the encoder parameters \(\boldsymbol{\phi}\) is problematic: the expectation \(\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}[\cdot]\) is taken with respect to a distribution that itself depends on \(\boldsymbol{\phi}\), so we cannot simply interchange the gradient and the expectation.

The reparameterization trick resolves this by expressing the stochastic latent variable \(\boldsymbol{z}\) as a deterministic, differentiable transformation of a parameter-free noise variable. Since the encoder outputs a diagonal Gaussian, we write

Reparameterization Trick:

Sample \(\boldsymbol{z}\) from \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\) via \[ \boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}} (\boldsymbol{x})\odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \] where \(\mu_{\boldsymbol{\phi}} = f_{e,\mu}(\boldsymbol{x}; \boldsymbol{\phi})\) and \(\sigma_{\boldsymbol{\phi}} = f_{e,\sigma}(\boldsymbol{x}; \boldsymbol{\phi})\) are the encoder outputs, and \(\odot\) denotes element-wise multiplication.

Under this reparameterization, the ELBO becomes \[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) = \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})} \!\left[\log p_{\boldsymbol{\theta}}\!\left(\boldsymbol{x} \mid \boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}}(\boldsymbol{x})\odot \boldsymbol{\epsilon}\right)\right] - D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right). \] The key observation is that the expectation is now taken with respect to the fixed distribution \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), which does not depend on \(\boldsymbol{\phi}\). This means we can interchange the gradient and the expectation: \(\nabla_{\boldsymbol{\phi}} \mathbb{E}_{\boldsymbol{\epsilon}}[\cdot] = \mathbb{E}_{\boldsymbol{\epsilon}}[\nabla_{\boldsymbol{\phi}}(\cdot)]\), and estimate the gradient via Monte Carlo sampling of \(\boldsymbol{\epsilon}\). In practice, even a single sample per data point provides a sufficiently low-variance gradient estimate for stochastic gradient descent.
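The gradient interchange can be sanity-checked on a toy objective where the expectation is known analytically. For \(f(z) = z^2\) with \(z \sim \mathcal{N}(\mu, \sigma^2)\), we have \(\mathbb{E}[f] = \mu^2 + \sigma^2\), so the true gradients are \(2\mu\) and \(2\sigma\); the pathwise (reparameterized) estimator recovers both:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.7, 0.4
eps = rng.normal(size=400_000)

# Reparameterized samples: z = mu + sigma * eps, with eps ~ N(0, 1)
z = mu + sigma * eps

# f(z) = z^2, so f'(z) = 2z. Chain rule through the reparameterization:
#   d/dmu    E[f(z)] = E[f'(z) * dz/dmu]    with dz/dmu = 1
#   d/dsigma E[f(z)] = E[f'(z) * dz/dsigma] with dz/dsigma = eps
grad_mu = (2 * z).mean()          # true value: 2*mu    = 1.4
grad_sigma = (2 * z * eps).mean() # true value: 2*sigma = 0.8
```

This is exactly what backpropagation computes automatically once sampling is written as a deterministic function of \(\boldsymbol{\epsilon}\).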

The full VAE training objective over a dataset \(\mathcal{D}\) is therefore \[ \min_{\boldsymbol{\theta}, \, \boldsymbol{\phi}} \;-\, \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) \right] \] which is optimized end-to-end using standard backpropagation through the reparameterized sampling step.

Putting It All Together

The components developed above - the nonlinear generative model, the amortized inference network, the ELBO objective, and the reparameterization trick - collectively define the Variational Autoencoder.

Variational Autoencoder (VAE):

A VAE is a latent variable model consisting of:

  1. A prior over the latent space: \(p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\)
  2. A decoder (generative model): \(p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid f_d(\boldsymbol{z}; \boldsymbol{\theta}), \, \sigma^2 \boldsymbol{I})\)
  3. An encoder (inference network): \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) = \mathcal{N}\!\left(\boldsymbol{z} \mid f_{e,\mu}(\boldsymbol{x}; \boldsymbol{\phi}), \, \text{diag}(f_{e,\sigma}(\boldsymbol{x}; \boldsymbol{\phi}))\right)\)

The parameters \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) are trained jointly by maximizing the ELBO \[ \max_{\boldsymbol{\theta}, \, \boldsymbol{\phi}} \; \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})} \!\left[\log p_{\boldsymbol{\theta}}\!\left(\boldsymbol{x} \mid \boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}}(\boldsymbol{x}) \odot \boldsymbol{\epsilon}\right)\right] - D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right) \right] \] where gradients with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) are computed via backpropagation through the reparameterized sampling step.

VAE Uncertainty in Robotic Manipulation

Overview

This simulation demonstrates how a Variational Autoencoder (VAE) can be used as a real-time safety monitor during robotic manipulation. A two-link arm mounted on a rotating base attempts to lift a box with an off-center mass. The VAE's posterior distribution \(q(\mathbf{z} \mid \mathbf{x})\) encodes the robot's internal uncertainty about the physical state of the grasp. When this uncertainty exceeds a learned threshold, the system aborts the lift to prevent a catastrophic drop.

The Physical Setup

The robot is a 3-DOF articulated arm. While its spatial positioning is achieved via a rotating base and two revolute joints, its reaching kinematics are governed by two primary links with lengths \(L_1 = 24\) and \(L_2 = 22\). The end-effector pose is calculated through analytic inverse kinematics that maps 3D target coordinates \((x, y, z)\) to the arm's joint angles.

A rigid box of adjustable mass \(m\) sits on the ground plane. Its center of mass (CoM) is offset from its geometric center by a user-controlled displacement \(\mathbf{c} = (c_x, c_y, c_z)\). This offset serves as the primary source of epistemic uncertainty: the robot cannot directly observe the true CoM location and must infer the resulting torque anomalies from sensor feedback during the initial lift phase.

Torque as Sensory Input

Once the gripper secures the box and begins to lift, gravity acting on the offset CoM produces a torque about the grip point:

\[ \boldsymbol{\tau} = (\mathbf{r}_{\text{CoM}} - \mathbf{r}_{\text{grip}}) \times (-mg\,\hat{\mathbf{y}}) \]

where \(\mathbf{r}_{\text{CoM}}\) is the world-space CoM position and \(\mathbf{r}_{\text{grip}}\) is the grip point. This torque is the robot's primary sensory signal - it reveals how much the CoM deviates from the grip axis. The torque vector is displayed as a green arrow on the box during the lift.
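The torque formula is a single cross product. A small numpy sketch (with illustrative positions and mass, in the simulation's y-up convention):

```python
import numpy as np

m, g = 2.0, 9.81                          # box mass and gravity (illustrative)
r_grip = np.array([0.0, 10.0, 0.0])       # grip point (world frame, y up)
r_com  = np.array([1.5, 10.0, -0.5])      # true (hidden) center of mass

# tau = (r_com - r_grip) x (-m g y_hat)
F_gravity = np.array([0.0, -m * g, 0.0])
tau = np.cross(r_com - r_grip, F_gravity)

# A perfectly centered CoM produces zero torque -- a "quiet" sensor signal.
tau_centered = np.cross(np.zeros(3), F_gravity)
```

Note that the torque lies entirely in the horizontal plane: the lever arm and gravity determine a tilt axis, and its magnitude grows linearly with both the mass and the CoM offset.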

Rotational Dynamics

The torque drives the box's rotational dynamics via the damped Euler equation:

\[ I\,\dot{\boldsymbol{\omega}} = \boldsymbol{\tau} - C_d\,\boldsymbol{\omega} \]

where \(I\) is the moment of inertia, \(\boldsymbol{\omega}\) is the angular velocity, and \(C_d\) is the damping coefficient. This produces the visible tilt of the box during lifting. Higher mass or larger CoM offset leads to larger torques, faster rotation, and more pronounced tilt - all signals that feed into the VAE.
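The damped Euler equation can be integrated with a simple explicit Euler step (illustrative constants; the demo's actual integrator is not specified here). Under a constant torque, damping drives the angular velocity to the steady state \(\boldsymbol{\tau} / C_d\):

```python
import numpy as np

I_box, C_d = 2.0, 5.0                  # inertia and damping (illustrative)
tau = np.array([0.0, 0.0, -3.0])       # constant torque from the offset CoM
omega = np.zeros(3)                    # angular velocity, initially at rest
dt = 0.001

# Explicit Euler integration of I * d(omega)/dt = tau - C_d * omega
for _ in range(20_000):                # 20 s of simulated time
    omega += dt * (tau - C_d * omega) / I_box
```

The time constant is \(I / C_d\) (here 0.4 s), which is why high damping suppresses the visible wobble quickly while low damping lets the box swing.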

VAE Posterior: \(q(\mathbf{z} \mid \mathbf{x})\)

The sensory input \(\mathbf{x}\) (here, the torque feedback) is encoded into a 2D latent space \(\mathbf{z} = (z_1, z_2)\) via a Gaussian posterior with shared (isotropic) variance:

\[ q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z};\; \boldsymbol{\mu}(\mathbf{x}),\; \sigma(\mathbf{x})^2 \mathbf{I}) \]

The encoder maps the torque to the posterior parameters \(\boldsymbol{\mu}(\mathbf{x})\) and \(\sigma(\mathbf{x})\).

The key insight is that \(\sigma\) encodes uncertainty. When the torque is small (CoM near grip axis), \(\sigma\) is small and the robot is confident in its grasp. When the torque is large (CoM far off-center), \(\sigma\) grows, reflecting the robot's increasing uncertainty about whether the grasp is stable enough to complete the lift.
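The demo's actual encoder is not reproduced here, but a hand-crafted stand-in captures the qualitative behavior just described: \(\sigma\) grows monotonically with torque magnitude. The mapping below (function name, scale constant, and the tanh shape are all illustrative assumptions, not the simulation's code):

```python
import numpy as np

def encode_torque(tau, tau_scale=30.0):
    """Illustrative torque-to-posterior mapping (NOT the demo's learned encoder).

    Preserves the qualitative behavior described in the text:
    small torque -> confident (low sigma), large torque -> diffuse posterior.
    """
    t = np.linalg.norm(tau) / tau_scale          # normalized torque magnitude
    mu = np.array([tau[0], tau[2]]) / tau_scale  # project torque into 2-D latent
    sigma = 0.05 + 0.95 * np.tanh(t)             # monotone, bounded in (0.05, 1)
    return mu, sigma

mu_small, sig_small = encode_torque(np.array([0.5, 0.0, 0.5]))   # centered CoM
mu_big, sig_big = encode_torque(np.array([25.0, 0.0, -20.0]))    # far off-center
```

Any monotone map with a floor on \(\sigma\) would exhibit the same gating behavior; what matters for the safety monitor is the ordering, not the exact functional form.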

The Latent Space Visualization

The top-right canvas shows the 2D latent space in real time: the posterior mean \(\boldsymbol{\mu}\), the surrounding \(\sigma\) circle (orange), and the abort threshold circle (red).

The display uses dynamic scaling: both the \(\sigma\) circle and the threshold circle always fit within the canvas, preserving their true ratio. When the lift fails, the orange circle is visibly larger than the red circle, making the safety violation immediately apparent.

KL Divergence

The telemetry panel reports the KL divergence, which acts as a regularizer by measuring the distance between the approximate posterior \(q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})\) and the prior \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\). The general closed-form expression for a \(K\)-dimensional Gaussian with diagonal covariance is: \[ D_{\mathbb{KL}}\!\left[\,q(\mathbf{z} \mid \mathbf{x})\;\|\;p(\mathbf{z})\,\right] = \frac{1}{2} \sum_{j=1}^{K} \left( \mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1 \right) \]

In this specific simulation, where our latent space is 2-dimensional (\(K=2\)) and uses isotropic variance (\(\sigma_1^2 = \sigma_2^2 = \sigma^2\)), this simplifies to the formula used in our telemetry engine: \[ \begin{aligned} D_{\mathbb{KL}} &= \frac{1}{2} \left[ (\mu_1^2 + \sigma^2 - \ln \sigma^2 - 1) + (\mu_2^2 + \sigma^2 - \ln \sigma^2 - 1) \right] \\ &= \frac{1}{2} \left[ (\mu_1^2 + \mu_2^2) + 2\sigma^2 - 2\ln \sigma^2 - 2 \right] \\ &= \frac{1}{2} \left( \|\boldsymbol{\mu}\|^2 + 2\sigma^2 - 2\ln \sigma^2 - 2 \right) \end{aligned} \]
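The algebraic simplification is easy to verify numerically: for any \(\boldsymbol{\mu}\) and shared \(\sigma^2\), the isotropic \(K=2\) formula and the general diagonal formula coincide.

```python
import numpy as np

def kl_general(mu, sigma2):
    """K-dimensional diagonal-Gaussian KL against the standard normal prior."""
    return 0.5 * np.sum(mu**2 + sigma2 - np.log(sigma2) - 1)

def kl_isotropic_2d(mu, sigma2):
    """The simplified K=2 isotropic form used by the telemetry engine."""
    return 0.5 * (np.sum(mu**2) + 2*sigma2 - 2*np.log(sigma2) - 2)

mu = np.array([0.3, -0.8])   # illustrative posterior parameters
s2 = 1.7
kl_a = kl_general(mu, np.array([s2, s2]))
kl_b = kl_isotropic_2d(mu, s2)
```
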

This quantity measures the "information gain" or surprise from the sensory input. A large KL divergence indicates that the physical interaction has pushed the robot's belief significantly away from its prior expectations - a sign of an unusual or difficult grasp configuration.

Ghost Arms: Sampling Uncertainty

The five semi-transparent ghost arms visualize samples from the posterior. Each ghost arm \(i\) receives a perturbed end-effector target:

\[ \mathbf{t}^{(i)} = \mathbf{t}_{\text{primary}} + \boldsymbol{\mu} \cdot s_{\mu} + \sigma \cdot \boldsymbol{\epsilon}^{(i)} \cdot s_{\sigma} \]

where \(\boldsymbol{\epsilon}^{(i)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) are smoothed Gaussian samples (using exponential moving average with \(\alpha = 0.92\) for visual stability). When \(\sigma\) is small, the ghosts cluster tightly around the primary arm. When \(\sigma\) is large, they splay outward, creating a visual "cloud of possible robot states" that communicates the degree of uncertainty.

The Safety Decision

The lift sequence proceeds through a state machine:

  1. APPROACH → Arm moves from home to hover above the box
  2. DESCEND → Arm lowers to grasp height
  3. GRASP → Fingers close on the box
  4. PRE_LIFT → Small test lift (5 units). During this phase, the system assesses \(\sigma\) for 0.5 seconds. If \(\sigma > \sigma_{\text{thr}}\), the lift is aborted immediately.
  5. LIFT_OK → Full lift if \(\sigma\) stayed below threshold. Continuous safety monitoring continues; an abort can still trigger if \(\sigma\) spikes during the lift.
  6. ABORT → Three-phase emergency: lower the box, release grip, return home.

The critical decision is the comparison \(\sigma \lessgtr \sigma_{\text{thr}}\). This is a direct analogy to how real-world autonomous systems use learned uncertainty estimates for out-of-distribution detection: if the VAE posterior is too diffuse, the current observation lies far from the training distribution, and the robot should not trust its control policy.
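The PRE_LIFT gating logic reduces to a threshold test over the assessment window. A minimal sketch (frame rate and function name are assumptions; the demo's state machine has more phases than shown here):

```python
def safety_gate(sigma_history, sigma_thr, assess_time=0.5, dt=1/60):
    """PRE_LIFT assessment: abort if sigma ever exceeds the threshold
    during the assessment window (a sketch of the described state machine)."""
    n_frames = int(assess_time / dt)            # 0.5 s at 60 fps -> 30 frames
    window = sigma_history[:n_frames]
    return "LIFT_OK" if all(s <= sigma_thr for s in window) else "ABORT"

# Confident grasp: sigma stays low for the whole assessment window
verdict_ok = safety_gate([0.2] * 60, sigma_thr=0.6)
# Off-center CoM: sigma spikes above threshold mid-assessment
verdict_abort = safety_gate([0.2] * 10 + [0.9] * 50, sigma_thr=0.6)
```

Lowering `sigma_thr` makes the gate more conservative, which is exactly the safety/completion trade-off discussed in the parameter section below.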

What the Parameters Control

CoM Offset \((c_x, c_y, c_z)\): Shifts the center of mass away from the box's geometric center. Larger offsets produce larger torques, higher \(\sigma\) and a higher likelihood of abort. This simulates real-world scenarios where the load distribution inside a package is unknown.

Box Mass: Scales the gravitational force \(mg\), amplifying the torque for a given CoM offset. Heavy objects are harder to lift safely.

\(\sigma\) Threshold: The abort boundary. Lowering it makes the robot more cautious (aborting earlier); raising it makes the robot more risk-tolerant. This models the engineering trade-off between safety and task completion in autonomous systems.

Damping \(C_d\): Controls how quickly the box's rotational oscillation decays. High damping suppresses wobble; low damping allows the box to swing more freely, producing a more dynamic and potentially uncertain lift.

Key Takeaways

This demo illustrates three interconnected ideas from modern physical AI:

1. Uncertainty quantification through latent representations:
The VAE does not output a single point estimate of the physical state - it outputs a distribution. The spread of this distribution (\(\sigma\)) is a principled measure of epistemic uncertainty derived from sensory feedback.

2. Safety-critical decision-making under uncertainty:
Rather than blindly executing a motion plan, the robot uses its uncertainty estimate as a gating signal. When \(\sigma\) exceeds the threshold, the system recognizes that it is operating outside its confident regime and aborts. This is directly analogous to uncertainty-aware reinforcement learning and Bayesian safe control.

3. The connection between physics and information:
The torque - a purely physical quantity governed by Newton's laws - becomes the input to a probabilistic inference engine. The KL divergence between posterior and prior quantifies how much information the robot has gained (or how surprised it is) from the physical interaction. Large KL divergence means the physical situation deviates significantly from the robot's expectations.