HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

Hao Zhang1,2, Yaru Niu2, Yikai Wang2, Ding Zhao2, H. Eric Tseng1
1The University of Texas at Arlington, 2Carnegie Mellon University

Overall demonstration of HALyPO in human-robot collaboration tasks.

Abstract

To improve generalization and resilience in human–robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state and trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning itself. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of the RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
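The gradient rectification described above can be sketched as a half-space projection: if the joint update direction would increase the Lyapunov disagreement metric V, subtract the minimal quadratic correction along the stability normal ∇V so the decrease condition holds. This is an illustrative sketch, not the paper's implementation; the function name `lyapunov_project`, the `margin` parameter, and the use of a single stacked gradient vector are assumptions.

```python
import numpy as np

def lyapunov_project(grads, v_grad, margin=0.0):
    """Rectify stacked decentralized gradients so the joint update
    direction does not increase a Lyapunov disagreement metric V.

    grads  : (d,) joint ascent direction from independent agents
    v_grad : (d,) gradient of V w.r.t. the joint policy parameters
    margin : required decrease rate (epsilon >= 0)

    Closed-form projection onto the half-space
        { g : <v_grad, g> <= -margin },
    i.e. the smallest (Euclidean) correction certifying decrease.
    """
    violation = v_grad @ grads + margin
    if violation <= 0.0:
        # Decrease condition already holds; leave the gradient unchanged.
        return grads
    # Subtract the offending component along the stability normal.
    return grads - (violation / (v_grad @ v_grad)) * v_grad
```

Because the constraint is a single linear half-space, the projection is analytic; no iterative quadratic-program solver is needed at each step.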

Experimental Results
Figure 1: Sim-to-Real Deployment Across Embodied Tasks: macro-view of deployment on OSP (top), SCT (middle), and SLH (bottom). Temporal progression is indicated by color gradients; arrows trace the stable trajectories maintained by HALyPO despite complex physical coupling and human-induced perturbations.

Experimental Demonstrations A

Scenario A.1: Path-constrained pushing
Scenario A.2: Orientation-sensitive pushing
Scenario A.3: Orientation-sensitive pushing
Scenario A.4: Orientation-sensitive pushing
Scenario A.5: Orientation-sensitive pushing
Scenario A.6: Orientation-sensitive pushing

HALyPO Methodology & Framework

HALyPO Framework
Figure 2: Overview of the HALyPO Methodology and Framework. The architecture illustrates the transition from standard decentralized learning to Lyapunov policy optimization for real-world HRC. Key components include computation of the rationality gap and the stability normal vector, from which the final analytic closed-form projection is derived.

Experimental Demonstrations B

Scenario B.1: Spatially-confined transport
Scenario B.2: Spatially-confined transport
Scenario B.3: Spatially-confined transport
Scenario B.4: Spatially-confined transport
Scenario B.5: Super-long object handling
Scenario B.6: Super-long object handling

Cognition to Control (C2C) Framework

Stability Proof
Figure 3: Cognition to Control Framework for HRC. The proposed hierarchical HRC framework for humanoid-object coordination partitions decision-making into three cascaded layers: a cognition layer (VLM) that generates semantic-aware object-moving directions (anchors) from visual input; a skill policy layer (MARL), in which agents maintain independent policies to derive tactical coordination commands; and a cerebellum layer (WBC) for high-frequency whole-body stabilization and joint-level execution.
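The three-layer cascade can be sketched as a rate-separated control loop in which each layer reuses the latest output of the slower layer above it. Everything below is an illustrative assumption: the function names, the update rates, and the toy computations inside each layer stand in for the real VLM, MARL policy, and WBC modules, which are not specified here.

```python
# Hypothetical stand-ins for the three cascade layers.
def cognition_layer(image):
    # Slow layer: semantic anchor (object-moving direction) from vision.
    return {"move_dir": (1.0, 0.0)}

def skill_policy(anchor, obs):
    # Mid-rate layer: tactical coordination command from the anchor.
    return [a * 0.5 for a in anchor["move_dir"]]

def cerebellum_wbc(command, state):
    # Fast layer: joint-level stabilization around the command.
    return [c - 0.1 * s for c, s in zip(command, state)]

def run_c2c(steps, image, obs, state):
    """Minimal cascade loop: each layer runs at its own rate,
    reusing the last output of the slower layer above it."""
    anchor, command = None, None
    log = []
    for t in range(steps):
        if t % 100 == 0:                          # cognition layer (slowest)
            anchor = cognition_layer(image)
        if t % 10 == 0:                           # skill policy layer
            command = skill_policy(anchor, obs)
        torques = cerebellum_wbc(command, state)  # cerebellum layer, every step
        log.append(torques)
    return log
```

The point of the rate separation is that the fast stabilization loop never blocks on the slow semantic layer; it always acts on the most recent cached anchor and command.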

Experimental Demonstrations C

Scenario C.1: Movement synchronization: obstruction resilience
Scenario C.2: Vertical synchronization: adaptive squatting
Simulation Environments
Figure 4: Simulation Benchmark and Performance Matrix. Comprehensive performance matrix and global optimization analysis of heterogeneous coordination: the upper part evaluates scenario-specific success rates across nine representative coordination challenges in the OSP, SCT, and SLH tasks, reported as mean ± std. The lower part provides a synchronized mechanism analysis at the 2B-step steady state, correlating overall task proficiency with fundamental optimization metrics, including overall success rate, convergence, final return, gradient alignment, rationality gap, and gradient-conflict rate. Bold and underlined indicate first and second best, respectively.

BibTeX

@article{zhang2026halypo,
  title={HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration},
  author={Zhang, Hao and Niu, Yaru and Wang, Yikai and Zhao, Ding and Tseng, H Eric},
  journal={arXiv preprint arXiv:2603.03741},
  year={2026}
}