Human-Aligned Policy Learning
Reinforcement Learning
Overview
As a researcher at Duke’s General Robotics Lab, I explored training methods for reinforcement learning agents that incorporate multi-modal human feedback. The motivation: every RL system needs a reward function, and designing a good one is often the hardest part of the problem.
This matters because reward functions are easy to underspecify. A trivial example: if you’re training an agent to autonomously dock a ship with a reward of “don’t crash,” you might end up with a policy that sails off into the void (technically optimal, completely useless).
The Algorithm
The algorithm starts with imitation learning, a method that trains a policy to directly mimic human demonstrations. This avoids the time-lag problem (more on that below) and gives the agent a strong starting policy. The limitation: imitation learning alone can never surpass human performance.
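The warm start can be sketched as a tiny behavior-cloning loop. Everything here (a 1-D linear policy, the demo pairs, plain gradient descent) is an illustrative assumption, not the lab's actual setup:

```python
# Minimal behavior-cloning sketch (hypothetical setup): fit a linear policy
# a = w * s to (state, expert action) demonstration pairs by minimizing
# squared error with plain gradient descent.

demos = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # expert acts as a = 2s

w = 0.0
lr = 0.05
for _ in range(200):
    # Gradient of mean squared error between policy actions and expert actions
    grad = sum(2 * (w * s - a) * s for s, a in demos) / len(demos)
    w -= lr * grad

# The cloned policy now roughly reproduces the expert mapping a = 2s.
print(round(w, 2))
```

In practice the policy is a neural network and the demonstrations come from humans playing the environment, but the principle is the same: supervised learning on demonstrations before any reward signal enters the picture.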
To go beyond human performance, the algorithm defines a standard environment reward function alongside a human reward function. Early in training, it prioritizes the human reward, then gradually shifts weight toward the standard reward.
The intuition: even if the standard reward is underspecified and has many optima, the imitation learning and human feedback from earlier in training pushed the policy closer to a human-aligned optimum. So as the algorithm transitions to the standard reward function, the agent converges to that aligned optimum rather than an arbitrary one. This also solves a convergence problem. Human rewards aren't a true mathematical function (humans are inconsistent), so relying on them alone can't guarantee convergence. Eventually prioritizing the standard reward gives us that guarantee.
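The blending schedule can be sketched in a few lines. The names (`r_human`, `r_env`) and the linear weight ramp are assumptions for illustration; the actual schedule used in the research may differ:

```python
# Sketch of a reward-blending schedule: weight the human reward early in
# training, then shift toward the standard environment reward.

def blended_reward(r_human, r_env, step, total_steps):
    """Return a convex combination of the two rewards; the mixing
    weight alpha ramps linearly from 0 to 1 over training."""
    alpha = min(step / total_steps, 1.0)
    return (1 - alpha) * r_human + alpha * r_env

# At the start the agent optimizes the human signal...
print(blended_reward(r_human=1.0, r_env=0.0, step=0, total_steps=100))    # 1.0
# ...halfway through, an even mix...
print(blended_reward(r_human=1.0, r_env=0.0, step=50, total_steps=100))   # 0.5
# ...and by the end, only the standard environment reward.
print(blended_reward(r_human=1.0, r_env=0.0, step=100, total_steps=100))  # 0.0
```

Because alpha reaches 1.0 and stays there, late-stage training optimizes a fixed, well-defined reward function, which is what restores the convergence guarantee.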
The practical upside: because the algorithm tolerates underspecified reward functions, it opens the door to training agents in more complex environments where crafting a precise reward function is prohibitively difficult. It also means engineers spend less time iterating on reward design and more time on the problem itself.
Human Feedback Challenges
Working with human feedback surfaced several hard problems:
- Inconsistency: humans rate the same behavior differently across sessions, making the reward signal noisy and non-stationary
- Time lag: in real-time scenarios, there’s always a delay between when a human rates something and the state they intended to rate. Learning that offset is surprisingly difficult
- Data cost: human feedback is expensive to collect, so reducing model size to require less data (fighting the curse of dimensionality) was critical
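The time-lag problem can be illustrated with a small buffering sketch. The fixed two-step delay here is a simplifying assumption (estimating the true offset is the hard part in practice):

```python
from collections import deque

# Sketch of delayed credit assignment: a rating that arrives at time t is
# credited to the state observed `delay` steps earlier, via a ring buffer.
# The fixed delay and the string stand-ins for states/ratings are illustrative.

delay = 2
history = deque(maxlen=delay + 1)  # holds the last delay+1 (time, state) pairs
credited = {}

for t, state in enumerate(["s0", "s1", "s2", "s3", "s4"]):
    history.append((t, state))
    feedback = f"rating@{t}"            # stand-in for a human rating at time t
    if len(history) == delay + 1:
        _, lagged_state = history[0]    # the state from `delay` steps ago
        credited[lagged_state] = feedback

print(credited)
```

Each rating lands on the state from two steps back, so the agent learns from the state the human actually intended to judge rather than the one on screen when the button was pressed.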
We explored multiple feedback modalities to help with the noise problem:
- Written feedback: natural language corrections and suggestions
- Discrete feedback: categorical judgments (good, bad, neutral)
- Continuous feedback: real-valued signals capturing degrees of preference
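To use these modalities in a single training loop, each one has to be collapsed into a scalar reward. The mappings below (keyword spotting for text, fixed category values, a clamped rescale for the continuous signal) are crude illustrative assumptions, not the models used in the research:

```python
# Sketch: normalize three feedback modalities into a scalar in [-1, 1].

def discrete_to_reward(label):
    # Categorical judgment -> fixed value.
    return {"good": 1.0, "neutral": 0.0, "bad": -1.0}[label]

def written_to_reward(text):
    # Crude sentiment proxy: balance of positive vs. negative keywords.
    pos = sum(w in text.lower() for w in ("good", "nice", "correct"))
    neg = sum(w in text.lower() for w in ("bad", "wrong", "crash"))
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def continuous_to_reward(value, lo=-10.0, hi=10.0):
    # Clamp a raw slider reading and rescale it into [-1, 1].
    value = max(lo, min(hi, value))
    return 2 * (value - lo) / (hi - lo) - 1

print(discrete_to_reward("good"))
print(written_to_reward("nice move, but the landing was wrong"))
print(continuous_to_reward(5.0))
```

Putting all modalities on a common scale lets the learner average or ensemble them, which helps wash out the per-modality noise.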
Training Environments
We built several games in Unity to train and evaluate our agents:
- Multi-player hide and seek: human players could team up with or compete against RL agents
- Bowling and Tetris: simpler environments for controlled experiments
This research contributed to the “Generative AI for Human-AI Teaming” conference that the Duke General Robotics Lab hosted in 2023.