Container Slow when used with enroot #2671

OliEfr · 2025-05-22T14:38:05Z

OliEfr
May 22, 2025

Hi all,

I'd like to run an IsaacLab container on a cluster that uses enroot. I build the image with ./docker/container.py start, which gives me a Docker container.

When I run docker run --entrypoint tail --gpus all isaac-lab-base -f /dev/null on my local machine, I get the usual training speed. (Note: overriding the entrypoint is important. Otherwise, I think IsaacSim starts inside the container and my training runs are 3x slower.)

However, when I start the same container with enroot on the cluster, training is ~3x slower, even when I override the container's entrypoint.

The speed on the cluster should really be the same as on my local machine (RTX 4090 locally, H100 on the cluster). I made sure that I have GPU access both locally and remotely. If I train a plain PyTorch network, I get the same training speed on both.

I have spent about 1.5 days on this and tried every option I could find in Slurm, Docker, and enroot.

Has anyone had a similar experience, or does anyone have an idea what the issue could be?

RandomOakForest · 2025-05-22T15:40:27Z

RandomOakForest
May 22, 2025
Maintainer

Thank you for posting this. Replacing an entrypoint is not typically an easy task. Will review this with the team and follow up.

0 replies

OliEfr · 2025-05-22T15:57:09Z

OliEfr
May 22, 2025
Author

Thanks for the swift response! Replacing the entry point like above seems to work smoothly and is quite common, as far as I understand. Maybe I'm missing something though.

0 replies

RandomOakForest · 2025-05-22T18:16:51Z

RandomOakForest
May 22, 2025
Maintainer

@hhansen-bdai for vis. Thanks for any help with this.

0 replies

OliEfr · 2025-05-23T10:06:22Z

OliEfr
May 23, 2025
Author

We implemented our own minimal docker image for IsaacLab, and the computation is still ~3 times slower on the enroot cluster.

I noticed that when converting from docker to enroot, my enroot container is missing some configurations that are present in the docker container, such as the exported bash aliases in Dockerfile.base:

# aliasing isaaclab.sh and python for convenience
RUN echo "export ISAACLAB_PATH=${ISAACLAB_PATH}" >> ${HOME}/.bashrc && \
    echo "alias isaaclab=${ISAACLAB_PATH}/isaaclab.sh" >> ${HOME}/.bashrc && \
    echo "alias python=${ISAACLAB_PATH}/_isaac_sim/python.sh" >> ${HOME}/.bashrc && \
    echo "alias python3=${ISAACLAB_PATH}/_isaac_sim/python.sh" >> ${HOME}/.bashrc && \
    echo "alias pip='${ISAACLAB_PATH}/_isaac_sim/python.sh -m pip'" >> ${HOME}/.bashrc && \
    echo "alias pip3='${ISAACLAB_PATH}/_isaac_sim/python.sh -m pip'" >> ${HOME}/.bashrc && \
    echo "alias tensorboard='${ISAACLAB_PATH}/_isaac_sim/python.sh ${ISAACLAB_PATH}/_isaac_sim/tensorboard'" >> ${HOME}/.bashrc && \
    echo "export TZ=$(date +%Z)" >> ${HOME}/.bashrc

Maybe some other important settings from the IsaacSim base container are also missing in my enroot container?
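One way to look for such missing settings is to dump the environment in both containers (for example by redirecting the output of env to a file) and compare the two dumps. The following is a minimal sketch under that assumption; the file names docker_env.txt and enroot_env.txt are hypothetical, and it only covers environment variables, not shell aliases or files.

```python
import sys


def parse_env(text):
    """Parse `KEY=VALUE` lines into a dict, skipping lines without '='.

    Limitation: values that span several lines are not handled.
    """
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            env[key] = value
    return env


def diff_env(reference, candidate):
    """Return (missing, changed) keys of candidate relative to reference."""
    missing = sorted(k for k in reference if k not in candidate)
    changed = sorted(
        k for k in reference if k in candidate and reference[k] != candidate[k]
    )
    return missing, changed


# Hypothetical usage: python diff_env.py docker_env.txt enroot_env.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f:
        docker_env = parse_env(f.read())
    with open(sys.argv[2]) as f:
        enroot_env = parse_env(f.read())
    missing, changed = diff_env(docker_env, enroot_env)
    print("missing in enroot:", missing)
    print("different values:", changed)
```

Variables such as HOME or HOSTNAME will differ between hosts anyway, so only the unexpected entries are worth a closer look.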

Is there any way we can benchmark an Isaac-Sim simulation in a container on the enroot cluster? (That way we would know whether the issue is due to the Isaac-Sim container or the Isaac-Lab container.)
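In the absence of a dedicated benchmark, a rough comparison could come from timing the same workload in both containers with a small helper like the one below. This is only a sketch: steps_per_second and step_fn are hypothetical names, the helper does not call any Isaac-Sim or Isaac-Lab API, and step_fn stands for whatever single simulation or training step is being compared.

```python
import time


def steps_per_second(step_fn, warmup=10, steps=100):
    """Time repeated calls to step_fn and return the measured throughput.

    step_fn is a placeholder for one simulation or training step.
    For GPU work it should block until the step has finished (with PyTorch,
    for example, by calling torch.cuda.synchronize() at the end of the step);
    otherwise only the launch overhead is measured.
    """
    for _ in range(warmup):
        step_fn()  # warm-up iterations are not timed
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return steps / elapsed


if __name__ == "__main__":
    # Dummy CPU workload, used only to show how the helper is called.
    rate = steps_per_second(lambda: sum(range(100_000)), warmup=5, steps=50)
    print(f"{rate:.1f} steps/s")
```

Running the identical script with the identical step in the Isaac-Sim base container and in the Isaac-Lab container would show at which layer the slowdown first appears.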

0 replies

OliEfr · 2025-05-23T18:18:38Z

OliEfr
May 23, 2025
Author

I think that some instance of Isaac-Sim is starting in the background. While I can avoid that by overriding the entrypoint during docker run, I think that enroot somehow doesn't respect that, or our remote cluster setup executes other entrypoints or shell scripts that are present in the container before exposing the container to me.

I identified ENTRYPOINT ["/bin/sh", "-c", "/isaac-sim/runheadless.native.sh"] in the Isaac-Sim base container, which I think spawns an Isaac-Sim instance. When I empty that file in the Dockerfile and then rebuild the image, I get full speed and no Isaac-Sim instance in the background.

However, even with that file removed, the enroot container is still 3x slower. Might there be another file that starts Isaac-Sim in the background?
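To check from inside the running container whether something Isaac-Sim related was started in the background, the process list can be searched for matching command lines. The sketch below reads /proc directly, so it assumes a Linux container; find_processes is a hypothetical helper, and the keywords are simply taken from the entrypoint path above.

```python
import os


def find_processes(keywords, proc_root="/proc"):
    """Return sorted (pid, cmdline) pairs whose command line contains a keyword."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited in the meantime or is not readable
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if any(keyword in cmdline for keyword in keywords):
            matches.append((int(entry), cmdline))
    return sorted(matches)


if __name__ == "__main__" and os.path.isdir("/proc"):
    # Keywords are assumptions based on the entrypoint path quoted above.
    for pid, cmdline in find_processes(["isaac-sim", "runheadless"]):
        print(pid, cmdline)
```

An empty result right after the container starts would suggest that no such process is running, at least none visible under those names.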

0 replies

RandomOakForest · 2025-06-11T11:56:53Z

RandomOakForest
Jun 11, 2025
Maintainer

Thanks for following up. I'll move this to our Discussions for the team to follow up.

1 reply

OliEfr Jun 11, 2025
Author

Thanks! Our current working guess is that IsaacLab / IsaacSim is simply slower on cluster GPUs than on desktop GPUs. We ruled out many other possibilities.
