Behavioural Cloning

Takes the dataset generated by the expert policy and fits the learner policy to it with supervised learning
Problem: The learner policy might encounter states that the expert never visited, so they are not in the dataset. Its mistakes there push it further from the expert's state distribution, and the errors compound
Solution: DAgger
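
A minimal sketch of the behavioural cloning step described above, assuming a toy setup that is not part of these notes: the expert is a hypothetical linear controller (EXPERT_GAIN), states are sampled directly instead of coming from real expert rollouts, and the learner is a linear policy fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a fixed linear controller a = K s (assumption for illustration).
EXPERT_GAIN = np.array([[-1.0, -0.5]])

def expert_policy(state):
    return EXPERT_GAIN @ state

def collect_expert_dataset(num_samples):
    # States are sampled directly here; in practice they come from expert rollouts.
    states = rng.normal(size=(num_samples, 2))
    actions = np.array([expert_policy(s) for s in states])
    return states, actions

def fit_learner(states, actions):
    # Supervised regression: least-squares fit of actions on states.
    weights, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return weights.T  # shape (action_dim, state_dim)

states, actions = collect_expert_dataset(200)
learner_gain = fit_learner(states, actions)
```

The learner only ever sees states drawn from the expert's data, which is exactly why it has no guidance once it drifts to states outside that distribution.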

DAgger

Init dataset
First we train the learner policy on data from the expert policy only. Then we slowly shift to taking samples from the learner policy. That is why we have beta: it is the probability of following the expert during data collection, and it tends to zero over time
Get the states visited by the "beta policy" (the mixture of expert and learner), but take the actions for them from the expert to avoid the compounding error problem
Add these state-action pairs to the dataset and retrain the learner policy on the whole aggregated dataset
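
The steps above can be sketched as a short loop. This is an illustrative sketch under stated assumptions, not a reference implementation: the 1-D dynamics in step, the linear expert, the linear learner, and the geometric schedule beta = decay ** i are all hypothetical choices, and the expert/learner choice is sampled at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_policy(state):
    # Hypothetical expert for a toy 1-D task: push the state toward zero.
    return -0.5 * state

def step(state, action):
    # Toy dynamics (assumption): the action shifts the state, plus small noise.
    return state + action + 0.05 * rng.normal()

def fit_learner(states, actions):
    # Least-squares fit of a linear learner a = w * s on the aggregated dataset.
    s = np.asarray(states).reshape(-1, 1)
    a = np.asarray(actions)
    w, *_ = np.linalg.lstsq(s, a, rcond=None)
    return float(w[0])

def rollout(learner_w, beta, horizon=20):
    # Run the "beta policy": expert with probability beta, learner otherwise.
    state = rng.normal()
    visited = []
    for _ in range(horizon):
        visited.append(state)
        if rng.random() < beta:
            action = expert_policy(state)
        else:
            action = learner_w * state
        state = step(state, action)
    return visited

def dagger(num_iterations=5, decay=0.5):
    states, actions = [], []  # init dataset
    learner_w = 0.0  # untrained learner
    for i in range(num_iterations):
        beta = decay ** i  # 1 on the first iteration, then tends to zero
        visited = rollout(learner_w, beta)
        # Label every visited state with the expert's action, whoever acted.
        states.extend(visited)
        actions.extend(expert_policy(s) for s in visited)
        learner_w = fit_learner(states, actions)
    return learner_w, len(states)

learner_w, dataset_size = dagger()
```

Because beta is 1 on the first iteration, that round is plain behavioural cloning on expert data; later rounds add states the learner itself reaches, each labelled by the expert.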