
Commit 2745d01

Merge pull request #1855 from oracle-devrel/update-nvidia-megatron

Use better shape count for NVIDIA megatron training.

2 parents 24993a9 + bc2cecf

File tree

1 file changed (+9, -2 lines)

  • cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke


cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/README.md

Lines changed: 9 additions & 2 deletions
```diff
@@ -8,7 +8,7 @@ on the Oracle Container Engine for Kubernetes (OKE) using
 Reference results from NVIDIA to train Llama 3 can be found on the
 [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama3-dgxc-benchmarking).
 
-Reviewed: 18.03.2025
+Reviewed: 01.07.2025
 
 # When to use this asset?
 
@@ -31,7 +31,14 @@ This guide is loosely based on the
 [to the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity),
 importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes.
 
-The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes.
+The configuration here assumes a minimum of 1 BM.GPU.H100.8 node for
+training with 8B parameters, and a minimum of 8 BM.GPU.H100.8 nodes for 70B
+parameters.
+
+If another shape is used, the NCCL and MPI parameters in the Kubernetes
+[configuration map](./files/training/templates/mpi.yaml) should be adapted
+using the same parameter values as the
+[performance testing scripts](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main/manifests/nccl-tests).
 
 - Ensure that the follwing setting is selected under the "OKE Cluster" section:
```
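The new paragraph points at NCCL and MPI parameters kept in a Kubernetes configuration map. As context for what "adapting the parameters" means, the sketch below shows the general shape of such a ConfigMap. The name `mpi-config` and every value shown are illustrative assumptions, not the repository's actual settings; the real, shape-specific values come from the referenced `mpi.yaml` template and the oci-hpc-oke performance testing scripts.

```yaml
# Hypothetical sketch only: a ConfigMap carrying NCCL environment
# variables for RDMA-connected GPU nodes. Values are placeholders;
# take the real ones from the performance testing scripts for your shape.
apiVersion: v1
kind: ConfigMap
metadata:
  name: mpi-config              # illustrative name
data:
  NCCL_IB_TIMEOUT: "22"         # RDMA completion timeout; shape-dependent
  NCCL_IB_QPS_PER_CONNECTION: "4"
  NCCL_IB_GID_INDEX: "3"        # GID index for RoCE; fabric-dependent
  NCCL_IB_HCA: "mlx5"           # which RDMA NICs NCCL may use; shape-dependent
```

Switching to another bare-metal GPU shape changes the NIC layout and fabric characteristics, which is why these values must be re-derived rather than copied.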
