TARLoco

🐾 Teacher-Aligned Representations via Contrastive Learning for Quadrupedal Locomotion

Amr Mousa¹, Neil Karavis², Michele Caprio¹, Wei Pan¹, Richard Allmendinger¹

¹ University of Manchester, UK | ² BAE Systems, UK
Conference: IROS 2025 (Accepted)

📢 News

2025‑08‑18 — Project website goes live with videos and results.
2025‑06‑15 — Accepted at IROS 2025.

📌 Abstract

Quadrupedal locomotion via reinforcement learning (RL) is commonly addressed using the teacher–student paradigm, where a privileged teacher guides a proprioceptive student policy. However, key challenges such as representation misalignment between privileged teacher and proprioceptive‑only student, covariate shift due to behavioral cloning, and lack of deployable adaptation lead to poor generalization in real‑world scenarios. We propose Teacher‑Aligned Representations via Contrastive Learning (TAR), a framework that leverages privileged information with self‑supervised contrastive learning to bridge this gap. By distilling from a privileged teacher in simulation and constructing structured latent spaces through contrastive objectives, our student policy surpasses the fully privileged “Teacher” and exhibits robust generalization to out‑of‑distribution (OOD) scenarios. Results showed 2× faster training to reach peak performance compared to state‑of‑the‑art baselines, and ~40% better OOD generalization on average. Additionally, TAR transitions seamlessly into privileged‑free fine‑tuning during deployment, enabling continual adaptation in the real world.

3× Faster Convergence

7,500 Iterations to Peak

+61% Return vs Baselines

103% Return vs Teacher

⚙️ Training Framework Overview

The core idea of our method is to leverage contrastive learning to align latent representations between a privileged teacher and a proprioceptive student within an RL paradigm. By structuring a shared latent space, the student utilizes the teacher’s privileged signals during training, enabling improved generalization and sim2real transfer. At deployment, the student operates with proprioception only, maintaining robust performance in diverse and dynamic environments.

Pipeline summary

Teacher encoder consumes privileged states $S_t$ to produce structured embeddings $Z^{T}_t$.
Student encoder consumes proprioceptive inputs $O_t$ and hidden state $h_{t-1}$ to produce $Z^{S}_t$.
Contrastive alignment (triplet loss): the student’s next-state prediction $\tilde{Z}^{+}_{t+1}$ is pulled towards the teacher’s future code $Z^{T}_{t+1}$ and pushed away from negatives $Z^{-}_{t+1}$ sampled from other contexts.
Policy optimization: actor–critic is trained with policy gradients; the critic additionally leverages the contrastive signal for representation shaping.
Velocity estimator: trained via regression and frozen post-training to stabilize deployment.
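
The contrastive alignment step can be illustrated with a standard triplet margin loss. The sketch below is a minimal NumPy version, not the authors' implementation: the function name `triplet_alignment_loss`, the Euclidean distance, the margin of 1.0, and the toy batch are all assumptions chosen for illustration.

```python
import numpy as np

def triplet_alignment_loss(z_pred, z_pos, z_neg, margin=1.0):
    """Mean triplet margin loss over a batch of latent codes.

    z_pred: student's predicted next-step codes, shape (B, D)
    z_pos:  teacher codes for the same transitions, shape (B, D)
    z_neg:  codes sampled from other contexts, shape (B, D)
    margin: hypothetical margin value, not taken from the paper
    """
    d_pos = np.linalg.norm(z_pred - z_pos, axis=1)  # distance to the positive
    d_neg = np.linalg.norm(z_pred - z_neg, axis=1)  # distance to the negative
    # Penalize only when the positive is not closer than the negative by `margin`.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# Toy batch: random "teacher" codes and a student prediction that nearly matches.
rng = np.random.default_rng(0)
z_teacher = rng.normal(size=(8, 16))
z_student = z_teacher + 0.01 * rng.normal(size=(8, 16))
# Negatives: reuse codes from a different batch element (a different context).
z_negative = np.roll(z_teacher, shift=1, axis=0)

aligned_loss = triplet_alignment_loss(z_student, z_teacher, z_negative)
misaligned_loss = triplet_alignment_loss(z_negative, z_teacher, z_student)
```

In this toy setup the nearly aligned prediction incurs a smaller loss than the deliberately mismatched one, which is the behavior the alignment objective relies on.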

Design goals

Robust latent structure that transfers to diverse terrains and dynamics.
Student policy that remains privileged-free at test time without performance collapse.

Privileged-Free Fine-Tuning & Adaptation

During adaptation (or privileged-free learning), the teacher encoder is removed:

The student forms positive/negative pairs from its own proprioceptive rollouts:
- Positives from temporally adjacent observations $O_{t+1}$ with consistent hidden state context.
- Negatives from other agents or distant contexts $O^{j\neq i}_{t+1}$.
This self-supervised contrastive sampling enforces temporal consistency and context separation without external supervision.
The method is off-policy compatible, enabling efficient fine-tuning in dynamic, non-stationary environments.
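
The sampling scheme above can be sketched for a batch of parallel agents. This is an illustrative assumption about how pairs might be drawn, not the released code: `sample_contrastive_pairs` is a hypothetical helper, and it covers only the "other agents" case for negatives, not distant contexts.

```python
import numpy as np

def sample_contrastive_pairs(next_obs, rng):
    """Build positive and negative targets from parallel rollouts.

    next_obs: array of shape (N, D); row i holds O_{t+1} for agent i (N >= 2).
    Returns (positives, negatives, neg_idx), where negatives[i] is the
    next observation of a different agent j != i.
    """
    n = next_obs.shape[0]
    # A random offset in [1, n-1] guarantees j != i for every row.
    offsets = rng.integers(1, n, size=n)
    neg_idx = (np.arange(n) + offsets) % n
    # Positives are each agent's own temporally adjacent observation.
    return next_obs, next_obs[neg_idx], neg_idx

rng = np.random.default_rng(0)
obs_next = np.arange(12, dtype=float).reshape(6, 2)  # 6 agents, 2-dim toy observations
positives, negatives, neg_idx = sample_contrastive_pairs(obs_next, rng)
```

Because both targets come from the student's own rollouts, no privileged state is needed at this stage.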

📈 Results

🔬 Training Protocol: All methods were trained under identical conditions in IsaacSim with curriculum learning and domain randomization. Optimization used PPO with GAE over 20k iterations, and each experiment was repeated with 3 seeds.

Training conducted in IsaacSim with 4096 parallel environments
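
For readers unfamiliar with GAE, the advantage estimate used alongside PPO can be sketched for a single rollout as follows. This is the textbook recursion, not code from this project; the function name `compute_gae` and the default `gamma` and `lam` values are illustrative assumptions, since the hyperparameters are not stated here.

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout of length T.

    rewards, values, dones: sequences of length T (dones[t] = 1 if the
    episode ended at step t). last_value bootstraps the value after step T-1.
    gamma and lam are placeholder defaults, not values from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    advantages = np.zeros_like(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # regression targets for the critic
    return advantages, returns

adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, [0, 0, 0],
                       gamma=1.0, lam=1.0)
# With gamma = lam = 1 and zero values, advantages equal reward-to-go: [3, 2, 1]
```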

Baselines

Hybrid Internal Model (HIM): HIMLoco.
Self-learning Latent Representation: SLR.
Teacher: Privileged expert policy with full state access (upper bound).

Our Ablations

Ours w/ MLP: Student encoder replaced by a 10-step MLP.
Ours w/ TCN: Student encoder replaced by a temporal convolutional network.
Ours w/o Priv: Same architecture but trained without privileged states.
Ours w/o Priv Vel: No privileged states + no velocity inputs.
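
One plausible reading of the "10-step MLP" ablation is an MLP that receives the last 10 proprioceptive observations stacked into a single vector. The sketch below shows only that stacking step under this assumption; `ObservationHistory`, the zero padding at episode start, and the observation size are hypothetical and not taken from the paper.

```python
from collections import deque

import numpy as np

class ObservationHistory:
    """Fixed-length window of recent observations, flattened for an MLP."""

    def __init__(self, obs_dim, steps=10):
        # Start with zeros so the window has full length from the first step.
        self.buffer = deque([np.zeros(obs_dim) for _ in range(steps)], maxlen=steps)

    def push(self, obs):
        """Add the newest observation and return the stacked window (oldest first)."""
        self.buffer.append(np.asarray(obs, dtype=float))
        return np.concatenate(list(self.buffer))

history = ObservationHistory(obs_dim=3, steps=10)  # obs_dim=3 is a toy value
stacked = history.push(np.ones(3))  # shape (30,): nine zero frames, then the new one
```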


Our method achieves:

✅ Faster convergence than HIM, SLR, and all ablations.
✅ Higher final returns, surpassing even the privileged Teacher.
✅ Robust OOD generalization, maintaining strong performance under terrain, friction, and payload shifts.

🎬 Evaluation

🔬 Evaluation Protocol: Scenarios include both in-distribution cases (similar to training conditions) and challenging out-of-distribution scenarios that test generalization capabilities.

In-Distribution Testing

Ours: Error = 0.29
HIM: Error = 0.32 (-10.3%)
SLR: Error = 0.42 (-44.8%)

Model: 7500 | Friction Coefficient: 0.1 | Payload: 0 kg | Max Velocity: ±1.0 m/s
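
The percentages in parentheses appear to express how much larger each baseline's error is relative to ours. Assuming that convention (the formula and sign are not stated on this page, and `relative_gap` is a hypothetical helper), the in-distribution figures can be reproduced from the listed errors:

```python
def relative_gap(baseline_error, ours_error):
    """Percentage by which a baseline's error exceeds ours."""
    return 100.0 * (baseline_error - ours_error) / ours_error

ours, him, slr = 0.29, 0.32, 0.42
him_gap = round(relative_gap(him, ours), 1)  # 10.3
slr_gap = round(relative_gap(slr, ours), 1)  # 44.8
```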

Out-of-Distribution Testing

Ours: Error = 0.39
HIM: Error = 0.47 (-21%)
SLR: Error = 0.63 (-64.53%)

Model: 20000 | Friction Coefficient: 1.0 | Payload: 7.5 kg | Max Velocity: ±2.0 m/s

🐾 Real-World Deployment

The videos below showcase TARLoco in action on the Unitree Go2 robot, completely BLIND 🧑🏻‍🦯.

Dense Vegetation

Different Terrains

High-Step Descent

External Pushes

Soft Mattress

10kg Payload

Joint Degradation

* Simulating actuator degradation by reducing the joint torque by 90%. Inspired by ADAPT—but without custom policy training, we just let the robot figure it out 😎!

📚 Citation

If you find this work useful, please consider citing our paper:

@misc{mousa2025tar,
      title={TAR: Teacher-Aligned Representations via Contrastive Learning for Quadrupedal Locomotion}, 
      author={Amr Mousa and Neil Karavis and Michele Caprio and Wei Pan and Richard Allmendinger},
      year={2025},
      eprint={2503.20839},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2503.20839}, 
}

🙏🏻 Acknowledgments and Community

Special thanks to Bruno Adorno, Amy Johnson, Darren Cunningham and Lesley Pater from the University of Manchester for providing hardware for testing and their unwavering support.

This work builds upon IsaacLab, RSL-RL and the broader research community. The original licenses apply; new contributions are under CC BY-NC-SA 4.0.

For technical questions and implementation support:

GitHub Issues: Report bugs and request features
Discussions: Ask questions and share experiences
Email: Direct contact for collaboration opportunities
