HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

Hao Zhang1,2, Yaru Niu2, Yikai Wang2, Ding Zhao2, H. Eric Tseng1
1The University of Texas at Arlington, 2Carnegie Mellon University

Overall demonstration of HALyPO in human-robot collaboration tasks.

Abstract

To improve generalization and resilience in human–robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state and trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning itself. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of the RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
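The gradient rectification described above can be sketched as a half-space projection: if the joint update direction would increase the Lyapunov disagreement metric V, subtract the minimal quadratic correction along the stability normal ∇V so the decrease condition holds. This is an illustrative sketch, not the paper's implementation; the function name `lyapunov_project`, the `margin` parameter, and the use of a single stacked gradient vector are assumptions.

```python
import numpy as np

def lyapunov_project(grads, v_grad, margin=0.0):
    """Rectify stacked decentralized gradients so the joint update
    direction does not increase a Lyapunov disagreement metric V.

    grads  : (d,) joint ascent direction from independent agents
    v_grad : (d,) gradient of V w.r.t. the joint policy parameters
    margin : required decrease rate (epsilon >= 0)

    Closed-form projection onto the half-space
        { g : <v_grad, g> <= -margin },
    i.e. the smallest (Euclidean) correction certifying decrease.
    """
    violation = v_grad @ grads + margin
    if violation <= 0.0:
        # Decrease condition already holds; leave the gradient unchanged.
        return grads
    # Subtract the offending component along the stability normal.
    return grads - (violation / (v_grad @ v_grad)) * v_grad
```

Because the constraint is a single linear half-space, the projection is analytic; no iterative quadratic-program solver is needed at each step.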

Experimental Results
Figure 1: Sim-to-Real Deployment Across Embodied Tasks: macro-view of deployment on OSP (top), SCT (middle), and SLH (bottom). Temporal progression is indicated by color gradients; arrows trace the stable trajectories maintained by HALyPO despite complex physical coupling and human-induced perturbations.

Experimental Demonstrations A

Scenario A.1: Path-constrained pushing
Scenario A.2: Orientation-sensitive pushing
Scenario A.3: Orientation-sensitive pushing
Scenario A.4: Orientation-sensitive pushing
Scenario A.5: Orientation-sensitive pushing
Scenario A.6: Orientation-sensitive pushing

HALyPO Methodology & Framework

HALyPO Framework
Figure 2: Overview of the HALyPO Methodology and Framework. The architecture illustrates the transition from standard decentralized learning to Lyapunov policy optimization for real-world HRC. Key components include computation of the rationality gap and the stability normal vector, from which the final analytic closed-form projection is derived.

Experimental Demonstrations B

Scenario B.1: Spatially-confined transport
Scenario B.2: Spatially-confined transport
Scenario B.3: Spatially-confined transport
Scenario B.4: Spatially-confined transport
Scenario B.5: Super-long object handling
Scenario B.6: Super-long object handling

Cognition to Control (C2C) Framework

Stability Proof
Figure 3: Cognition to Control Framework for HRC. The proposed hierarchical HRC framework for humanoid-object coordination partitions decision-making into three cascaded layers: a cognition layer (VLM) that generates semantic-aware object-moving directions (anchors) from visual input; a skill policy layer (MARL), in which agents maintain independent policies to derive tactical coordination commands; and a cerebellum layer (WBC) for high-frequency whole-body stabilization and joint-level execution.
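The three-layer cascade can be sketched as a rate-separated control loop in which each layer reuses the latest output of the slower layer above it. Everything below is an illustrative assumption: the function names, the update rates, and the toy computations inside each layer stand in for the real VLM, MARL policy, and WBC modules, which are not specified here.

```python
# Hypothetical stand-ins for the three cascade layers.
def cognition_layer(image):
    # Slow layer: semantic anchor (object-moving direction) from vision.
    return {"move_dir": (1.0, 0.0)}

def skill_policy(anchor, obs):
    # Mid-rate layer: tactical coordination command from the anchor.
    return [a * 0.5 for a in anchor["move_dir"]]

def cerebellum_wbc(command, state):
    # Fast layer: joint-level stabilization around the command.
    return [c - 0.1 * s for c, s in zip(command, state)]

def run_c2c(steps, image, obs, state):
    """Minimal cascade loop: each layer runs at its own rate,
    reusing the last output of the slower layer above it."""
    anchor, command = None, None
    log = []
    for t in range(steps):
        if t % 100 == 0:                          # cognition layer (slowest)
            anchor = cognition_layer(image)
        if t % 10 == 0:                           # skill policy layer
            command = skill_policy(anchor, obs)
        torques = cerebellum_wbc(command, state)  # cerebellum layer, every step
        log.append(torques)
    return log
```

The point of the rate separation is that the fast stabilization loop never blocks on the slow semantic layer; it always acts on the most recent cached anchor and command.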

Experimental Demonstrations C

Scenario C.1: Movement synchronization: obstruction resilience
Scenario C.2: Vertical synchronization: adaptive squatting
Simulation Environments
Figure 4: Simulation Benchmark and Performance Matrix. Comprehensive performance matrix and global optimization analysis of heterogeneous coordination: the upper part evaluates scenario-specific success rates across nine representative coordination challenges in the OSP, SCT, and SLH tasks, reported as mean ± std. The lower part provides a synchronized mechanism analysis at the 2B-step steady state, correlating overall task proficiency with fundamental optimization metrics, including overall success rate, convergence, final return, gradient alignment, rationality gap, and gradient-conflict rate. Bold and underlined indicate first and second best, respectively.

BibTeX

@article{zhang2026halypo,
  title={HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration},
  author={Zhang, Hao and Niu, Yaru and Wang, Yikai and Zhao, Ding and Tseng, H Eric},
  journal={arXiv preprint arXiv:2603.03741},
  year={2026}
}