### 🚀 Feature
Add a double variant of the DQN algorithm (Double DQN).
### Motivation
It is listed in the roadmap (#1).
### Pitch
I suggest we go from:
```python
with th.no_grad():
    # Compute the next Q-values using the target network
    next_q_values = self.q_net_target(replay_data.next_observations)
    # Follow greedy policy: use the one with the highest value
    next_q_values, _ = next_q_values.max(dim=1)
```
to:
```python
with th.no_grad():
    # Compute the next Q-values using the target network
    next_q_values = self.q_net_target(replay_data.next_observations)
    if self.double_dqn:
        # Use the current (online) model to select the action with the maximal Q-value
        max_actions = th.argmax(self.q_net(replay_data.next_observations), dim=1)
        # Evaluate the Q-value of that action using the fixed target network,
        # then squeeze so the shape matches the greedy branch below
        next_q_values = th.gather(next_q_values, dim=1, index=max_actions.unsqueeze(-1)).squeeze(dim=1)
    else:
        # Follow greedy policy: use the one with the highest value
        next_q_values, _ = next_q_values.max(dim=1)
```
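For reference, this corresponds to the Double DQN target of van Hasselt et al. (2016): the online network selects the greedy action while the target network evaluates it,

```math
y = r + \gamma \, (1 - d) \, Q_{\theta^-}\!\left(s',\ \operatorname*{arg\,max}_{a'} Q_{\theta}(s', a')\right)
```

where `Q_θ` is the online network (`self.q_net`) and `Q_{θ⁻}` is the target network (`self.q_net_target`).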
The new behaviour would be controlled by a `double_dqn` flag passed as an additional argument to the `DQN` constructor.
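Not the official API, just a minimal sketch of how the flag could be prototyped today via a subclass (the `DoubleDQN` name is only illustrative; the actual proposal is to add `double_dqn` to `DQN.__init__` itself):

```python
from stable_baselines3 import DQN


class DoubleDQN(DQN):
    def __init__(self, *args, double_dqn: bool = True, **kwargs):
        super().__init__(*args, **kwargs)
        # Read inside train() to switch between the vanilla and double targets
        self.double_dqn = double_dqn


# Example usage (CartPole as a stand-in environment)
model = DoubleDQN("MlpPolicy", "CartPole-v1", double_dqn=True, verbose=0)
```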
### Checklist
- [x] I have checked that there is no similar issue in the repo (required)