Conversation
Hi, this is some cool stuff! Feel free to run some benchmarks with MuJoCo to see how it performs.
sontungkieu
left a comment
Issue: When running with `num_envs > 1`, the line

```python
new_pg_loss = (advantages[mb_inds] * ratio).mean()
```

fails because `advantages[mb_inds]` has shape `[batch, action_dim]` while `ratio` is `[batch]`, causing a dimension mismatch.

Proposed fix: Use the flattened `b_advantages` (shape `[batch]`) instead of `advantages` so both tensors align:

```diff
- new_pg_loss = (advantages[mb_inds] * ratio).mean()
+ mb_advantages = b_advantages[mb_inds]  # shape [batch]
+ new_pg_loss = (mb_advantages * ratio).mean()
```

This ensures that `mb_advantages` and `ratio` are both 1-D tensors of length `batch`, resolving the error when `num_envs > 1`.
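The shape mismatch can be reproduced outside the training script. Below is a minimal NumPy sketch (not the PR's PyTorch code; the `num_steps`, `num_envs`, and `batch` sizes are made up for illustration):

```python
import numpy as np

# Hypothetical sizes, chosen only to make the shapes visible.
num_steps, num_envs = 4, 2
batch = num_steps * num_envs

# With num_envs > 1, the unflattened advantages keep a per-env axis.
advantages = np.arange(batch, dtype=np.float64).reshape(num_steps, num_envs)
b_advantages = advantages.reshape(-1)  # flattened, shape (batch,)

mb_inds = np.arange(batch)   # a full minibatch, for simplicity
ratio = np.ones(batch)       # stand-in for exp(newlogprob - oldlogprob)

# advantages[mb_inds] would index the first axis (length num_steps) with
# indices up to batch - 1, so it raises/misaligns when num_envs > 1.
# Indexing the flattened array gives a 1-D tensor that lines up with ratio.
mb_advantages = b_advantages[mb_inds]  # shape (batch,)
new_pg_loss = (mb_advantages * ratio).mean()
```

With these toy values, `mb_advantages` and `ratio` are both 1-D of length `batch`, so the elementwise product is well defined.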
```python
_, newlogprob, entropy = actor.get_action(b_obs[mb_inds], b_actions[mb_inds])
logratio = newlogprob - b_logprobs[mb_inds]
ratio = logratio.exp()
new_pg_loss = (advantages[mb_inds] * ratio).mean()
```
Hello, I tried your code and it worked with the MuJoCo environments listed on Gymnasium when the number of environments is one. When I increased it, I got an error:

```
Traceback (most recent call last):
  File "/home/tung/practice-gymnasium/TRPO.py", line 405, in <module>
    new_pg_loss = (advantages[mb_inds] * ratio).mean()
```

Changing it to

```python
new_pg_loss = (mb_advantages * ratio).mean()
```

solved the issue (switching to the flattened advantages so the dimensions line up 😊).
Description
TRPO is a representative policy-gradient algorithm in reinforcement learning. Although it is no longer widely used in practice, its ideas and mathematical principles are still worth studying. I have not seen a single-file implementation of TRPO, so this PR adds one to help beginners understand the algorithm.
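For context, the constrained surrogate objective that TRPO optimizes can be sketched as follows (the standard textbook formulation, not quoted from this PR's code):

```latex
\max_{\theta} \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}
\, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}
\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)
\,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
```

The `ratio` computed in the training loop is exactly the importance weight $\pi_\theta / \pi_{\theta_{\text{old}}}$ in the objective above.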
Types of changes
Checklist:
- `pre-commit run --all-files` passes (required).
- Documentation changes previewed via `mkdocs serve`.

If you need to run benchmark experiments for a performance-impacting change:

- Tracked experiments recorded, optionally with `--capture_video`.
- RLops run via `python -m openrlbenchmark.rlops`.
- Results from the `python -m openrlbenchmark.rlops` utility added to the documentation.
- Report generated with `python -m openrlbenchmark.rlops ...your_args... --report` added to the documentation.