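- Simply fits the learner policy to the dataset of state-action pairs generated by the expert policy (plain supervised learning)
- Problem: the learner policy might reach states that the expert's dataset does not cover. It has no guidance there, so small mistakes push it further from the expert's states and the errors compound over the trajectory (the compounding error problem)
- Solution: DAgger (Dataset Aggregation)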
-
Initialize the dataset
-
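First we start with the expert dataset and train the learner policy only on data from the expert policy. Then we slowly shift to taking samples from the learner policy. That is why we have beta: it is the probability of following the expert rather than the learner when collecting data, and it tends to zero over time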
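
One common way to write this down, assuming the standard DAgger formulation (the symbols $\pi^*$ for the expert, $\hat{\pi}_i$ for the learner at iteration $i$, and the decay rate $p$ are notation introduced here, not taken from these notes):

$$
\pi_i = \beta_i \, \pi^* + (1 - \beta_i) \, \hat{\pi}_i, \qquad \beta_i = p^{\,i-1}, \quad 0 < p < 1
$$

With this schedule $\beta_1 = 1$, so the first iteration is pure expert data, and $\beta_i \to 0$ as $i$ grows, so later iterations are driven almost entirely by the learner. Other decaying schedules would work as well; the exponential one is only an example.
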
-
Collect the states visited by the "beta policy", but label them with the expert's actions. This way the learner sees the correct action in the states it actually reaches, which avoids the compounding error problem
-
Add them to the aggregated dataset and retrain the learner policy on all the data collected so far
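
The steps above can be sketched as a small loop. This is only a toy illustration under made-up assumptions: a 1-D state, a hypothetical `expert_policy`, hypothetical `step` dynamics, a linear learner fitted by least squares, and an exponential beta schedule `p ** i`. None of these names come from the notes or from any library.

```python
import numpy as np


def expert_policy(state):
    # Hypothetical expert: pushes the state toward zero.
    return -0.5 * state


def step(state, action, rng):
    # Hypothetical toy dynamics with a little noise.
    return state + action + 0.01 * rng.normal()


def fit_learner(states, actions):
    # Least-squares fit of a linear learner policy a = w * s.
    s = np.asarray(states)
    a = np.asarray(actions)
    return float(s @ a / (s @ s))


def dagger(n_iters=5, horizon=20, p=0.5, seed=0):
    rng = np.random.default_rng(seed)
    states, actions = [], []  # init dataset
    w = 0.0                   # learner parameter (untrained at first)
    betas = []
    for i in range(n_iters):
        beta = p ** i         # 1.0 on the first iteration, then decays toward 0
        betas.append(beta)
        s = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            # Always record the expert's action as the label for the visited state.
            a_expert = expert_policy(s)
            states.append(s)
            actions.append(a_expert)
            # Act with the beta policy: expert with probability beta, learner otherwise.
            a = a_expert if rng.random() < beta else w * s
            s = step(s, a, rng)
        # Retrain the learner on the aggregated dataset.
        w = fit_learner(states, actions)
    return w, betas, len(states)
```

Calling `dagger()` returns the fitted weight, the beta schedule, and the dataset size. The key design point is that the action executed in the environment and the action stored as a label are decoupled: the rollout follows the beta policy, while the label always comes from the expert.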