Introduction
At the intersection of pure mathematics and autonomous systems lies the challenge of uncertainty.
While classical robotics often relies on deterministic models, real-world interaction is inherently stochastic.
Sensory noise, unobservable physical properties, and environmental variability require a framework that does not
merely calculate values, but manages probabilities.
The Variational Autoencoder (VAE) represents a profound synthesis of information theory and
Bayesian inference. Unlike standard autoencoders that map data to discrete points in a latent space, the VAE
learns the underlying structural manifold of the data. By encoding inputs into a continuous
latent distribution characterized by a mean (\(\mu\)) and a standard deviation (\(\sigma\)), the VAE provides a principled
way to quantify what the system knows - and, more importantly, what it does not.
Variational Inference
In Bayesian statistics, Variational Inference (VI)
transforms the problem of posterior inference - which typically involves solving high-dimensional, intractable integrals -
into a constrained optimization problem.
While deterministic control algorithms rely on point estimates (e.g., "the center of mass is exactly
at coordinates \(\mathbf{r}\)"), VI treats the state as a probability distribution. This shift is
mathematically profound: instead of seeking a single value, we search for the parameters of a distribution
\(q_{\boldsymbol{\theta}}(\mathbf{z})\) that minimize the Kullback-Leibler (KL) divergence
\(D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\theta}}(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\right)\)
to the true, unknown posterior \(p(\mathbf{z} \mid \mathbf{x})\).
Since the true posterior is analytically intractable for complex models, we restrict our search to a family of simpler
distributions (such as Gaussians). By maximizing the Evidence Lower Bound (ELBO), the system finds the
"best fit" distribution that balances data fidelity with prior beliefs.
For a robot, this mathematical approximation is the key to real-time safety. The robot does not merely
estimate a property; it quantifies the epistemic uncertainty (the "spread" of the distribution).
This variance serves as a rigorous proxy for risk, allowing the AI to distinguish between a confident execution and a
situation requiring an emergency abort.
Variational Autoencoder (VAE)
From Factor Analysis to Nonlinear Generative Models
In classical factor analysis (FA), we model the observed data \(\boldsymbol{x} \in \mathbb{R}^D\) as a
linear function of a latent variable \(\boldsymbol{z} \in \mathbb{R}^K\) with \(K \ll D\):
\[
p(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{Wz}, \sigma^2 \boldsymbol{I})
\]
where \(\boldsymbol{W} \in \mathbb{R}^{D \times K}\) is the factor loading matrix. The linearity of \(\boldsymbol{W}\)
makes posterior inference tractable: given an observation \(\boldsymbol{x}\), the posterior
\(p(\boldsymbol{z} \mid \boldsymbol{x})\) is Gaussian in closed form. However, this linearity also limits the model's
expressiveness - real-world data distributions are rarely well-captured by linear mappings.
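To make the tractability claim concrete, the FA posterior can be stated in closed form (assuming the standard prior \(p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\)):
\[
p(\boldsymbol{z} \mid \boldsymbol{x}) = \mathcal{N}\!\left(\boldsymbol{z} \;\middle|\; \boldsymbol{M}^{-1}\boldsymbol{W}^{\top}\boldsymbol{x},\; \sigma^{2}\boldsymbol{M}^{-1}\right),
\qquad
\boldsymbol{M} = \boldsymbol{W}^{\top}\boldsymbol{W} + \sigma^{2}\boldsymbol{I}
\]
Every quantity on the right is a fixed matrix computation, so inference reduces to a \(K \times K\) linear solve. It is exactly this step that disappears once \(\boldsymbol{Wz}\) is replaced by a nonlinear network.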
The Variational Autoencoder (VAE) extends factor analysis by replacing the linear mapping
\(\boldsymbol{Wz}\) with an arbitrary nonlinear function parameterized by a neural network:
\[
p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid f_d (\boldsymbol{z}; \boldsymbol{\theta}), \, \sigma^2 \boldsymbol{I})
\]
where \(f_d(\cdot\,; \boldsymbol{\theta})\) is a decoder network with parameters \(\boldsymbol{\theta}\).
This nonlinear generative model can represent far richer data distributions, but it comes at a cost:
the posterior \(p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\) is no longer analytically tractable,
because computing it requires the marginal likelihood
\[
p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x}) = \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})}{p_{\boldsymbol{\theta}}(\boldsymbol{x})}
\quad \text{where} \quad
p_{\boldsymbol{\theta}}(\boldsymbol{x}) = \int p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})\, d\boldsymbol{z}
\]
which involves an intractable integral over the latent space when \(f_d\) is a deep network. This intractability is the central challenge that motivates the variational approach.
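To see the intractability from a computational angle, consider the naive Monte Carlo estimator \(p_{\boldsymbol{\theta}}(\boldsymbol{x}) \approx \frac{1}{S}\sum_{s} p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}^{(s)})\) with \(\boldsymbol{z}^{(s)} \sim p(\boldsymbol{z})\). The sketch below uses illustrative values and a 1-D linear decoder, so the true marginal is available in closed form for comparison; in a high-dimensional latent space, almost no prior samples land where \(p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\) is non-negligible, and the variance of this same estimator becomes unusable.

```python
import numpy as np

rng = np.random.default_rng(0)

w, sigma = 2.0, 0.5            # 1-D linear "decoder": x = w*z + noise
x = 1.3                        # a single observation

def gauss_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# True marginal: z ~ N(0,1) and x|z ~ N(w*z, sigma^2)  =>  x ~ N(0, w^2 + sigma^2)
true_px = gauss_pdf(x, 0.0, w**2 + sigma**2)

# Naive Monte Carlo estimate of the integral p(x) = E_{z~p(z)}[ p(x|z) ]
z = rng.standard_normal(200_000)
mc_px = gauss_pdf(x, w * z, sigma**2).mean()

print(true_px, mc_px)  # close in 1-D; the gap widens rapidly with latent dimension
```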
Amortized Variational Inference
Since the true posterior is intractable, we introduce an approximate posterior, parameterized by a
recognition network (also called the inference network or encoder),
that is trained simultaneously with the generative model. We restrict the approximate posterior to a tractable family:
a Gaussian with diagonal covariance, yielding
\[
q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})
= \mathcal{N}\!\left(\boldsymbol{z} \;\middle|\; f_{e,\mu} (\boldsymbol{x} ; \boldsymbol{\phi}), \, \text{diag}\!\left(f_{e,\sigma}(\boldsymbol{x};\boldsymbol{\phi})^{2}\right)\right)
\]
where \(f_{e,\mu}\) and \(f_{e,\sigma}\) are the encoder's output heads that produce the mean vector and the vector of
standard deviations (squared element-wise to form the diagonal covariance), and \(\boldsymbol{\phi}\) denotes all encoder parameters.
This approach is called amortized inference: rather than running a separate optimization procedure
to compute the posterior for each individual data point (as in classical variational inference), we train a single
neural network that directly maps any input \(\boldsymbol{x}\) to the parameters of its approximate posterior.
The cost of inference at test time is thus reduced to a single forward pass through the encoder, with the computational
investment "amortized" across the entire training set.
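A minimal sketch of such a recognition network, with illustrative layer sizes and randomly initialized (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, K = 8, 16, 2   # input dim, hidden width, latent dim (illustrative sizes)

# Randomly initialized encoder weights; in practice these are learned.
W1, b1 = rng.standard_normal((H, D)) * 0.1, np.zeros(H)
W_mu, b_mu = rng.standard_normal((K, H)) * 0.1, np.zeros(K)
W_ls, b_ls = rng.standard_normal((K, H)) * 0.1, np.zeros(K)

def encode(x):
    """Amortized inference: one forward pass maps x to (mu, sigma)."""
    h = np.tanh(W1 @ x + b1)
    mu = W_mu @ h + b_mu
    log_var = W_ls @ h + b_ls          # predict log-variance for positivity
    return mu, np.exp(0.5 * log_var)   # sigma = exp(log_var / 2) > 0

x = rng.standard_normal(D)
mu, sigma = encode(x)
print(mu.shape, sigma.shape)   # (2,) (2,)
```

Predicting the log-variance rather than \(\sigma\) directly is a common design choice: the exponential guarantees positivity without constraining the network's raw output.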
The Evidence Lower Bound (ELBO)
Since the marginal likelihood \(p_{\boldsymbol{\theta}}(\boldsymbol{x})\) is intractable, we cannot maximize it directly.
Instead, we derive a tractable lower bound. For a single observation \(\boldsymbol{x}\), define the
evidence lower bound (ELBO) as follows.
Definition: Evidence Lower Bound (ELBO)
For a generative model \(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\) and
approximate posterior \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\), the ELBO is
\[
L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x})
= \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \right].
\]
To see that this is indeed a lower bound on the log-evidence, apply Jensen's inequality.
Since \(\log\) is concave, moving it outside the expectation can only increase the value:
\[
\begin{align*}
L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x})
&= \int q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\, d\boldsymbol{z} \\\\
&\leq \log \int q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\, d\boldsymbol{z} \\\\
&= \log \int p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\, d\boldsymbol{z} \\\\
&= \log p_{\boldsymbol{\theta}}(\boldsymbol{x}).
\end{align*}
\]
That is, \(L(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{x}) \leq \log p_{\boldsymbol{\theta}}(\boldsymbol{x})\)
for any choice of \(q_{\boldsymbol{\phi}}\). Maximizing the ELBO with respect to both \(\boldsymbol{\theta}\) and
\(\boldsymbol{\phi}\) therefore simultaneously pushes up the log-likelihood and tightens the bound by making
\(q_{\boldsymbol{\phi}}\) a better approximation to the true posterior.
ELBO Decomposition: Reconstruction + Regularization
The ELBO can be decomposed into two interpretable terms by expanding the joint
\(\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) = \log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) + \log p(\boldsymbol{z})\)
and rearranging.
Theorem: ELBO Decomposition
\[
L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x})
= \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right]}_{\text{reconstruction}}
\;-\; \underbrace{D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right)}_{\text{regularization}}
\]
Proof:
Expanding the joint inside the ELBO definition,
\[
\begin{align*}
L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x})
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) + \log p(\boldsymbol{z}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\right] \\\\
&= \mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right]
+ \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \left[\log \frac{p(\boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\right] \\\\
&= \mathbb{E}_{q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x})} \left[\log p_{\boldsymbol{\theta}} (\boldsymbol{x} \mid \boldsymbol{z})\right]
- D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}} (\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right).
\end{align*}
\]
The first term is the expected reconstruction log-likelihood: it encourages the decoder to reconstruct
the input \(\boldsymbol{x}\) from latent samples \(\boldsymbol{z} \sim q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\).
The second term is the KL divergence between the approximate posterior and the prior
\(p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), which regularizes the latent space by penalizing
approximate posteriors that deviate from the prior. This regularization is what gives the VAE its structured,
continuous latent space - without it, the encoder could collapse each data point to an isolated delta function,
losing the ability to generate new data by sampling from \(p(\boldsymbol{z})\).
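Both terms can be evaluated directly when the distributions are Gaussian. A single-sample sketch with made-up numbers (in a real VAE, \(\hat{\boldsymbol{x}}\) would be the decoder output \(f_d(\boldsymbol{z}; \boldsymbol{\theta})\)):

```python
import numpy as np

def diag_gauss_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), in closed form."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

def gauss_loglik(x, x_hat, noise_sigma):
    """log N(x | x_hat, noise_sigma^2 I): the reconstruction term."""
    d = x.size
    return (-0.5 * np.sum((x - x_hat) ** 2) / noise_sigma**2
            - 0.5 * d * np.log(2 * np.pi * noise_sigma**2))

# Single-sample ELBO estimate for one observation.
x     = np.array([1.0, -0.5, 0.2])
x_hat = np.array([0.9, -0.4, 0.1])   # stand-in for the decoder output
mu    = np.array([0.3, -0.1])
sigma = np.array([0.8, 1.1])

elbo = gauss_loglik(x, x_hat, noise_sigma=1.0) - diag_gauss_kl(mu, sigma)
print(elbo)
```

Note that the KL term vanishes exactly when the posterior matches the prior (\(\boldsymbol{\mu} = \boldsymbol{0}\), \(\boldsymbol{\sigma} = \boldsymbol{1}\)), which is the regularization pulling the latent code toward \(p(\boldsymbol{z})\).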
The Reparameterization Trick
To train the VAE end-to-end via gradient-based optimization, we need to differentiate the ELBO with respect to
both \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\). The gradient with respect to the decoder parameters
\(\boldsymbol{\theta}\) poses no difficulty, since \(\boldsymbol{\theta}\) appears only inside the expectation.
However, differentiating with respect to the encoder parameters \(\boldsymbol{\phi}\) is problematic: the expectation
\(\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}[\cdot]\) is taken with respect to a distribution
that itself depends on \(\boldsymbol{\phi}\), so we cannot simply interchange the gradient and the expectation.
The reparameterization trick resolves this by expressing the stochastic latent variable
\(\boldsymbol{z}\) as a deterministic, differentiable transformation of a parameter-free noise variable.
Since the encoder outputs a diagonal Gaussian, this transformation takes a particularly simple form.
Reparameterization Trick:
Sample \(\boldsymbol{z}\) from \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\) via
\[
\boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}} (\boldsymbol{x})\odot \boldsymbol{\epsilon},
\quad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})
\]
where \(\mu_{\boldsymbol{\phi}} = f_{e,\mu}(\boldsymbol{x}; \boldsymbol{\phi})\) and
\(\sigma_{\boldsymbol{\phi}} = f_{e,\sigma}(\boldsymbol{x}; \boldsymbol{\phi})\) are the encoder outputs,
and \(\odot\) denotes element-wise multiplication.
Under this reparameterization, the ELBO becomes
\[
L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x})
= \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}
\!\left[\log p_{\boldsymbol{\theta}}\!\left(\boldsymbol{x} \mid \boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}}(\boldsymbol{x})\odot \boldsymbol{\epsilon}\right)\right]
- D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right).
\]
The key observation is that the expectation is now taken with respect to the fixed distribution
\(\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), which does not depend on \(\boldsymbol{\phi}\). This means we
can interchange the gradient and the expectation:
\(\nabla_{\boldsymbol{\phi}} \mathbb{E}_{\boldsymbol{\epsilon}}[\cdot] = \mathbb{E}_{\boldsymbol{\epsilon}}[\nabla_{\boldsymbol{\phi}}(\cdot)]\),
and estimate the gradient via Monte Carlo sampling of \(\boldsymbol{\epsilon}\). In practice, even a single sample
per data point provides a sufficiently low-variance gradient estimate for stochastic gradient descent.
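That the reparameterized gradient is unbiased can be checked numerically. A 1-D sketch with \(f(z) = z^2\), for which \(\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[f(z)] = \mu^2 + \sigma^2\) and the exact gradient with respect to \(\mu\) is \(2\mu\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 0.4

# f(z) = z^2, so E_{z ~ N(mu, sigma^2)}[f(z)] = mu^2 + sigma^2,
# and the exact gradient with respect to mu is 2*mu.
exact_grad = 2 * mu

# Reparameterize: z = mu + sigma*eps with eps ~ N(0, I). Then
# d f(z) / d mu = f'(z) * dz/dmu = 2*z * 1, averaged over eps.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps
mc_grad = np.mean(2 * z)

print(exact_grad, mc_grad)  # the Monte Carlo estimate matches the exact gradient
```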
The full VAE training objective over a dataset \(\mathcal{D}\) is therefore
\[
\min_{\boldsymbol{\theta}, \, \boldsymbol{\phi}} \;-\, \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ L(\boldsymbol{\theta}, \, \boldsymbol{\phi}\mid \boldsymbol{x}) \right]
\]
which is optimized end-to-end using standard backpropagation through the reparameterized sampling step.
Putting It All Together
The components developed above - the nonlinear generative model, the amortized inference network, the ELBO objective,
and the reparameterization trick - collectively define the Variational Autoencoder.
Variational Autoencoder (VAE):
A VAE is a latent variable model consisting of:
- A prior over the latent space: \(p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\)
- A decoder (generative model): \(p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid f_d(\boldsymbol{z}; \boldsymbol{\theta}), \, \sigma^2 \boldsymbol{I})\)
- An encoder (inference network): \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) = \mathcal{N}\!\left(\boldsymbol{z} \mid f_{e,\mu}(\boldsymbol{x}; \boldsymbol{\phi}), \, \text{diag}(f_{e,\sigma}(\boldsymbol{x}; \boldsymbol{\phi})^{2})\right)\)
The parameters \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) are trained jointly by maximizing the ELBO
\[
\max_{\boldsymbol{\theta}, \, \boldsymbol{\phi}} \; \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[
\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}
\!\left[\log p_{\boldsymbol{\theta}}\!\left(\boldsymbol{x} \mid \boldsymbol{z} = \mu_{\boldsymbol{\phi}}(\boldsymbol{x}) + \sigma_{\boldsymbol{\phi}}(\boldsymbol{x}) \odot \boldsymbol{\epsilon}\right)\right]
- D_{\mathbb{KL}}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right)
\right]
\]
where gradients with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) are computed via
backpropagation through the reparameterized sampling step.
VAE Uncertainty in Robotic Manipulation
Overview
This simulation demonstrates how a Variational Autoencoder (VAE) can be used as a real-time safety monitor during
robotic manipulation. A 3-DOF arm with two primary links attempts to lift a box with an off-center mass. The VAE's posterior distribution
\(q(\mathbf{z} \mid \mathbf{x})\) encodes the robot's internal uncertainty about the physical state of the grasp. When this uncertainty exceeds a
learned threshold, the system aborts the lift to prevent a catastrophic drop.
The Physical Setup
The robot is a 3-DOF articulated arm. While its spatial positioning is achieved via a rotating base and two revolute joints,
its reaching kinematics are governed by two primary links with lengths \(L_1 = 24\) and \(L_2 = 22\). The end-effector pose is
calculated through analytic inverse kinematics that maps 3D target coordinates \((x, y, z)\) to the arm's joint angles.
A rigid box of adjustable mass \(m\) sits on the ground plane. Its center of mass (CoM)
is offset from its geometric center by a user-controlled displacement \(\mathbf{c} = (c_x, c_y, c_z)\).
This offset serves as the primary source of epistemic uncertainty: the robot
cannot directly observe the true CoM location and must infer the resulting torque anomalies
from sensor feedback during the initial lift phase.
Torque as Sensory Input
Once the gripper secures the box and begins to lift, gravity acting on the offset CoM produces a torque about
the grip point:
\[
\boldsymbol{\tau} = (\mathbf{r}_{\text{CoM}} - \mathbf{r}_{\text{grip}}) \times (-mg\,\hat{\mathbf{y}})
\]
where \(\mathbf{r}_{\text{CoM}}\) is the world-space CoM position and \(\mathbf{r}_{\text{grip}}\) is the grip point.
This torque is the robot's primary sensory signal - it reveals how much the CoM deviates from the grip axis.
The torque vector is displayed as a green arrow on the box during the lift.
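A direct translation of this formula (the coordinate convention, with gravity along \(-\hat{\mathbf{y}}\), follows the text; the numbers are illustrative):

```python
import numpy as np

def grip_torque(r_com, r_grip, m, g=9.81):
    """Torque about the grip point from gravity acting at the CoM.

    Gravity points along -y, matching the simulation's convention.
    """
    gravity = np.array([0.0, -m * g, 0.0])
    return np.cross(r_com - r_grip, gravity)

r_grip = np.array([10.0, 5.0, 0.0])
r_com  = np.array([11.5, 4.0, 0.3])   # CoM displaced from the grip axis
tau = grip_torque(r_com, r_grip, m=2.0)
print(tau)  # the y-component is always zero, since gravity is parallel to y
```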
Rotational Dynamics
The torque drives the box's rotational dynamics via the damped Euler equation:
\[
I\,\dot{\boldsymbol{\omega}} = \boldsymbol{\tau} - C_d\,\boldsymbol{\omega}
\]
where \(I\) is the moment of inertia, \(\boldsymbol{\omega}\) is the angular velocity, and \(C_d\) is the damping coefficient.
This produces the visible tilt of the box during lifting. Higher mass or larger CoM offset leads to larger torques, faster rotation,
and more pronounced tilt - all signals that feed into the VAE.
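An explicit-Euler sketch of this update (per-axis scalar inertia and illustrative constants; the simulation's actual integrator may differ):

```python
import numpy as np

def step_rotation(omega, tau, I, C_d, dt):
    """One explicit-Euler step of I * domega/dt = tau - C_d * omega."""
    return omega + dt * (tau - C_d * omega) / I

# Under a constant torque, omega settles at the steady state tau / C_d.
omega = np.zeros(3)
tau = np.array([0.5, 0.0, -0.2])
for _ in range(2000):
    omega = step_rotation(omega, tau, I=1.0, C_d=2.0, dt=0.01)
print(omega)  # approaches tau / C_d = [0.25, 0, -0.1]
```

The damping term is what makes the box's wobble decay: without it (\(C_d = 0\)), a constant torque would spin the box up indefinitely.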
VAE Posterior: \(q(\mathbf{z} \mid \mathbf{x})\)
The sensory input \(\mathbf{x}\) (here, the torque feedback) is encoded into a 2D latent space \(\mathbf{z} = (z_1, z_2)\) via a
diagonal Gaussian posterior:
\[
q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z};\; \boldsymbol{\mu}(\mathbf{x}),\; \sigma(\mathbf{x})^2 \mathbf{I})
\]
The encoder maps torque to the posterior parameters as follows:
- \(\boldsymbol{\mu} = (0.07\,\tau_x,\; 0.07\,\tau_z)\) - the posterior mean shifts proportionally to the horizontal torque components
- \(\sigma = 0.011\,\|\boldsymbol{\tau}\|\) - the posterior spread grows with the torque magnitude
The key insight is that \(\sigma\) encodes uncertainty.
When the torque is small (CoM near grip axis), \(\sigma\) is small and the robot is confident in its grasp.
When the torque is large (CoM far off-center), \(\sigma\) grows, reflecting the robot's increasing uncertainty
about whether the grasp is stable enough to complete the lift.
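The encoder described above is small enough to write out in full (constants taken directly from the text):

```python
import numpy as np

def torque_to_posterior(tau):
    """Map torque feedback to the demo's posterior parameters (mu, sigma)."""
    mu = np.array([0.07 * tau[0], 0.07 * tau[2]])   # horizontal torque components
    sigma = 0.011 * np.linalg.norm(tau)             # spread grows with |tau|
    return mu, sigma

mu, sigma = torque_to_posterior(np.array([5.0, 0.0, -3.0]))
print(mu, sigma)  # mu = [0.35, -0.21]; sigma scales with the torque magnitude
```

A perfectly centered grasp (zero torque) therefore yields \(\sigma = 0\): the posterior collapses onto its mean and the robot is maximally confident.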
The Latent Space Visualization
The top-right canvas shows the 2D latent space in real time:
- Blue dot at the origin - the prior \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\)
- Orange dot - the posterior mean \(\boldsymbol{\mu}\), which drifts as torque changes
- Orange circle - the \(1\sigma\) contour of the posterior distribution
- Orange scatter points - stochastic samples \(\mathbf{z}^{(i)} \sim q(\mathbf{z} \mid \mathbf{x})\), visualized as ghost arms in 3D
- Red dashed circle - the abort threshold on \(\sigma\)
The display uses dynamic scaling: both the \(\sigma\) circle and the threshold circle always fit
within the canvas, preserving their true ratio. When the lift fails, the orange circle is visibly larger than the
red circle, making the safety violation immediately apparent.
KL Divergence
The telemetry panel reports the KL divergence, which acts as a regularizer by measuring how far
the approximate posterior \(q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})\) has moved from the prior \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\).
The general closed-form expression for a \(K\)-dimensional Gaussian with diagonal covariance is:
\[
D_{\mathbb{KL}}\!\left[\,q(\mathbf{z} \mid \mathbf{x})\;\|\;p(\mathbf{z})\,\right] = \frac{1}{2} \sum_{j=1}^{K} \left( \mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1 \right)
\]
In this specific simulation, where our latent space is 2-dimensional (\(K=2\)) and uses isotropic variance (\(\sigma_1^2 = \sigma_2^2 = \sigma^2\)),
this simplifies to the formula used in our telemetry engine:
\[
\begin{aligned}
D_{\mathbb{KL}} &= \frac{1}{2} \left[ (\mu_1^2 + \sigma^2 - \ln \sigma^2 - 1) + (\mu_2^2 + \sigma^2 - \ln \sigma^2 - 1) \right] \\
&= \frac{1}{2} \left[ (\mu_1^2 + \mu_2^2) + 2\sigma^2 - 2\ln \sigma^2 - 2 \right] \\
&= \frac{1}{2} \left( \|\boldsymbol{\mu}\|^2 + 2\sigma^2 - 2\ln \sigma^2 - 2 \right)
\end{aligned}
\]
This quantity measures the "information gain" or surprise from the sensory input. A large KL divergence indicates that the
physical interaction has pushed the robot's belief significantly away from its prior expectations - a sign of an unusual
or difficult grasp configuration.
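A quick sanity check that the telemetry engine's simplified formula agrees with the general \(K\)-dimensional expression (the values for \(\boldsymbol{\mu}\) and \(\sigma^2\) are illustrative):

```python
import numpy as np

def kl_general(mu, sigma2):
    """General diagonal-Gaussian-to-standard-normal KL (per-dimension variances)."""
    return 0.5 * np.sum(mu**2 + sigma2 - np.log(sigma2) - 1.0)

def kl_isotropic_2d(mu, sigma2):
    """Simplified form used by the telemetry engine (K=2, shared sigma^2)."""
    return 0.5 * (np.sum(mu**2) + 2 * sigma2 - 2 * np.log(sigma2) - 2.0)

mu = np.array([0.35, -0.21])
sigma2 = 0.064**2
print(kl_general(mu, np.full(2, sigma2)), kl_isotropic_2d(mu, sigma2))
```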
Ghost Arms: Sampling Uncertainty
The five semi-transparent ghost arms visualize samples from the posterior. Each ghost arm \(i\) receives a
perturbed end-effector target:
\[
\mathbf{t}^{(i)} = \mathbf{t}_{\text{primary}} + \boldsymbol{\mu} \cdot s_{\mu} + \sigma \cdot \boldsymbol{\epsilon}^{(i)} \cdot s_{\sigma}
\]
where \(\boldsymbol{\epsilon}^{(i)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) are smoothed Gaussian samples
(using exponential moving average with \(\alpha = 0.92\) for visual stability). When \(\sigma\) is small, the ghosts cluster
tightly around the primary arm. When \(\sigma\) is large, they splay outward, creating a visual "cloud of possible robot states"
that communicates the degree of uncertainty.
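A sketch of the ghost-arm sampler. Only the five samples, the EMA with \(\alpha = 0.92\), and the perturbation formula come from the text; the scale factors \(s_\mu\), \(s_\sigma\) and the treatment of the posterior mean as a 3-D offset are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N_GHOSTS, ALPHA = 5, 0.92            # five ghosts, EMA factor from the demo
S_MU, S_SIGMA = 1.0, 1.0             # visual scale factors (illustrative)

eps_smooth = np.zeros((N_GHOSTS, 3))  # persistent smoothed noise per ghost

def ghost_targets(t_primary, mu3, sigma):
    """Perturbed end-effector targets for the semi-transparent ghost arms."""
    global eps_smooth
    eps = rng.standard_normal((N_GHOSTS, 3))
    eps_smooth = ALPHA * eps_smooth + (1 - ALPHA) * eps  # smooth for stability
    return t_primary + mu3 * S_MU + sigma * eps_smooth * S_SIGMA

t = ghost_targets(np.array([10.0, 5.0, 0.0]), np.zeros(3), sigma=0.0)
print(t)  # with sigma = 0 every ghost collapses onto the primary target
```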
The Safety Decision
The lift sequence proceeds through a state machine:
- APPROACH → Arm moves from home to hover above the box
- DESCEND → Arm lowers to grasp height
- GRASP → Fingers close on the box
- PRE_LIFT → Small test lift (5 units). During this phase, the system assesses \(\sigma\) for 0.5 seconds. If \(\sigma > \sigma_{\text{thr}}\), the lift is aborted immediately.
- LIFT_OK → Full lift if \(\sigma\) stayed below threshold. Continuous safety monitoring continues; an abort can still trigger if \(\sigma\) spikes during the lift.
- ABORT → Three-phase emergency: lower the box, release grip, return home.
The critical decision is the comparison \(\sigma \lessgtr \sigma_{\text{thr}}\). This is a direct analogy to how real-world
autonomous systems use learned uncertainty estimates for out-of-distribution detection: if the VAE posterior
is too diffuse, the current observation lies far from the training distribution, and the robot should not trust its control policy.
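The gating logic itself is only a few lines. This sketch compresses the state machine to the uncertainty check, omitting the 0.5-second assessment window and the three-phase abort sequence:

```python
def safety_gate(state, sigma, sigma_thr):
    """Uncertainty-gated transition, mirroring the demo's state machine."""
    if state in ("PRE_LIFT", "LIFT_OK") and sigma > sigma_thr:
        return "ABORT"               # posterior too diffuse: do not trust policy
    if state == "PRE_LIFT":
        return "LIFT_OK"             # test lift passed: proceed to full lift
    return state                     # other states are unaffected by sigma

print(safety_gate("PRE_LIFT", sigma=0.03, sigma_thr=0.05))  # LIFT_OK
print(safety_gate("PRE_LIFT", sigma=0.08, sigma_thr=0.05))  # ABORT
print(safety_gate("LIFT_OK",  sigma=0.09, sigma_thr=0.05))  # ABORT
```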
What the Parameters Control
CoM Offset \((c_x, c_y, c_z)\): Shifts the center of mass away from the box's
geometric center. Larger offsets produce larger torques, higher \(\sigma\) and a higher likelihood of abort. This simulates
real-world scenarios where the load distribution inside a package is unknown.
Box Mass: Scales the gravitational force \(mg\), amplifying the torque for a given CoM offset. Heavy objects
are harder to lift safely.
\(\sigma\) Threshold: The abort boundary. Lowering it makes the robot more cautious (aborting earlier);
raising it makes the robot more risk-tolerant. This models the engineering trade-off between safety and task completion
in autonomous systems.
Damping \(C_d\): Controls how quickly the box's rotational oscillation decays. High damping suppresses
wobble; low damping allows the box to swing more freely, producing a more dynamic and potentially uncertain lift.
Key Takeaways
This demo illustrates three interconnected ideas from modern physical AI:
1. Uncertainty quantification through latent representations:
The VAE does not output a single point estimate of the physical state - it outputs a distribution.
The spread of this distribution (\(\sigma\)) is a principled measure of epistemic uncertainty derived from sensory feedback.
2. Safety-critical decision-making under uncertainty:
Rather than blindly executing a motion plan, the robot uses its uncertainty estimate as a gating signal. When \(\sigma\) exceeds
the threshold, the system recognizes that it is operating outside its confident regime and aborts. This is directly analogous to
uncertainty-aware reinforcement learning and Bayesian safe control.
3. The connection between physics and information:
The torque - a purely physical quantity governed by Newton's laws - becomes the input to a probabilistic inference engine.
The KL divergence between posterior and prior quantifies how much information the robot has gained (or how surprised it is)
from the physical interaction. Large KL divergence means the physical situation deviates significantly from the robot's expectations.