Conversation
Hi, this is some cool stuff! Feel free to run some benchmarks with MuJoCo to see how it performs.
sontungkieu
left a comment
Issue: When running with `num_envs > 1`, the line

```python
new_pg_loss = (advantages[mb_inds] * ratio).mean()
```

fails because `advantages[mb_inds]` has shape `[batch, action_dim]` while `ratio` is `[batch]`, causing a dimension mismatch.

Proposed fix: Use the flattened `b_advantages` (shape `[batch]`) instead of `advantages` so both tensors align:

```diff
- new_pg_loss = (advantages[mb_inds] * ratio).mean()
+ mb_advantages = b_advantages[mb_inds]  # shape [batch]
+ new_pg_loss = (mb_advantages * ratio).mean()
```

This ensures that `mb_advantages` and `ratio` are both 1-D tensors of length `batch`, resolving the error when `num_envs > 1`.
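The shape mismatch can be reproduced outside the training script. Below is a minimal NumPy sketch (not the PR's PyTorch code; the `num_steps`, `num_envs`, and `batch` sizes are made up for illustration):

```python
import numpy as np

# Hypothetical sizes, chosen only to make the shapes visible.
num_steps, num_envs = 4, 2
batch = num_steps * num_envs

# With num_envs > 1, the unflattened advantages keep a per-env axis.
advantages = np.arange(batch, dtype=np.float64).reshape(num_steps, num_envs)
b_advantages = advantages.reshape(-1)  # flattened, shape (batch,)

mb_inds = np.arange(batch)   # a full minibatch, for simplicity
ratio = np.ones(batch)       # stand-in for exp(newlogprob - oldlogprob)

# advantages[mb_inds] would index the first axis (length num_steps) with
# indices up to batch - 1, so it raises/misaligns when num_envs > 1.
# Indexing the flattened array gives a 1-D tensor that lines up with ratio.
mb_advantages = b_advantages[mb_inds]  # shape (batch,)
new_pg_loss = (mb_advantages * ratio).mean()
```

With these toy values, `mb_advantages` and `ratio` are both 1-D of length `batch`, so the elementwise product is well defined.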
```python
_, newlogprob, entropy = actor.get_action(b_obs[mb_inds], b_actions[mb_inds])
logratio = newlogprob - b_logprobs[mb_inds]
ratio = logratio.exp()
new_pg_loss = (advantages[mb_inds] * ratio).mean()
```
Hello, I tried your code and it worked with the MuJoCo environments listed on Gymnasium when the number of environments is one. When I increased it, I got an error:

```
Traceback (most recent call last):
  File "/home/tung/practice-gymnasium/TRPO.py", line 405, in <module>
    new_pg_loss = (advantages[mb_inds] * ratio).mean()
```

Changing it to

```python
new_pg_loss = (mb_advantages * ratio).mean()
```

solved the issue (switching to the flattened advantages so the dimensions line up 😊).
Description
TRPO is a representative policy-gradient algorithm in reinforcement learning. Although it is no longer widely used in practice, its ideas and mathematical principles are still worth studying. I have not seen a single-file implementation of TRPO, so this PR adds one to help beginners understand the algorithm.
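For context, the constrained surrogate objective that TRPO optimizes can be sketched as follows (the standard textbook formulation, not quoted from this PR's code):

```latex
\max_{\theta} \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}
\, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}
\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)
\,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
```

The `ratio` computed in the training loop is exactly the importance weight $\pi_\theta / \pi_{\theta_{\text{old}}}$ in the objective above.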
Types of changes
Checklist:
- `pre-commit run --all-files` passes (required).
- Documentation changes previewed via `mkdocs serve`.

If you need to run benchmark experiments for a performance-impacting change:

- Tracked experiments recorded, optionally with `--capture_video`.
- RLops run via `python -m openrlbenchmark.rlops`.
- Results from the `python -m openrlbenchmark.rlops` utility added to the documentation.
- Report generated with `python -m openrlbenchmark.rlops ...your_args... --report` added to the documentation.