Ansible resnet #233
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
# Distributing an SSH public key to multiple target hosts with Ansible | ||
<img width="829" alt="image" src="https://github.com/user-attachments/assets/ec938595-dee4-4f6e-8818-93b3a299020e"> | ||
|
||
## 0. Install Ansible | ||
|
||
```bash | ||
pip install ansible | ||
``` | ||
|
||
## 1. Create and encrypt the variables file | ||
|
||
Create a variables file, `vars.yml`, that contains the passwords: | ||
|
||
```yaml | ||
all: | ||
  hosts: | ||
    192.168.1.27: | ||
      ansible_user: myuser | ||
      ansible_password: mypassword | ||
    192.168.1.28: | ||
      ansible_user: myuser | ||
      ansible_password: mypassword | ||
``` | ||
|
||
Then encrypt the file with Ansible Vault: | ||
|
||
```bash | ||
ansible-vault encrypt vars.yml | ||
``` | ||
|
||
Notes: | ||
|
||
1. Running `ansible-vault` prompts you to set a password; remember it or store it somewhere safe. | ||
2. `vars.yml` is replaced in place by its encrypted version. | ||
|
||
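To confirm that the encryption step actually replaced the file, one option is to check its first line. The sketch below relies on the fact that vault-encrypted files begin with a `$ANSIBLE_VAULT;` header; the helper name `is_vault_encrypted` is hypothetical and not part of this PR.

```bash
#!/bin/bash
# Hypothetical helper: return 0 if the file starts with the Ansible Vault
# header, 1 if it does not or if the file is missing.
is_vault_encrypted() {
    local file="$1"
    [ -f "$file" ] || return 1
    head -n 1 "$file" | grep -q '^\$ANSIBLE_VAULT;'
}

# Example usage (vars.yml is the file name used in this guide):
# if is_vault_encrypted vars.yml; then
#     echo "vars.yml is encrypted"
# else
#     echo "vars.yml is still plain text" >&2
# fi
```

A check like this could run before committing, so that a plain-text `vars.yml` is not pushed by accident.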
## 2. Create the inventory file | ||
|
||
Create an inventory file named `inventory.ini`: | ||
|
||
```ini | ||
[all] | ||
node1 ansible_host=192.168.1.27 ansible_user=myuser | ||
node2 ansible_host=192.168.1.28 ansible_user=myuser | ||
``` | ||
|
||
Note: adjust the value of `ansible_user` to match your environment. | ||
|
||
## 3. Create the playbook | ||
|
||
If the file already exists, you can skip this step. | ||
|
||
Create a playbook named `distribute_ssh_key.yml`: | ||
|
||
```yaml | ||
--- | ||
- name: Distribute SSH key | ||
  hosts: all | ||
  vars_files: | ||
    - vars.yml | ||
  tasks: | ||
    - name: Create .ssh directory if it doesn't exist | ||
      file: | ||
        path: /home/{{ ansible_user }}/.ssh | ||
        state: directory | ||
        mode: '0700' | ||
        owner: "{{ ansible_user }}" | ||
        group: "{{ ansible_user }}" | ||
|
||
    - name: Copy the SSH key to the authorized_keys file | ||
      authorized_key: | ||
        user: "{{ ansible_user }}" | ||
        state: present | ||
        key: "{{ lookup('file', '/path/to/id_rsa.pub') }}" | ||
``` | ||
|
||
Note: `vars_files` is set to `vars.yml`. | ||
|
||
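Because `lookup('file', ...)` reads the key on the control node, the public key file has to exist locally before the playbook runs. The following is a minimal sketch of a pre-flight check; the function name `check_pubkey` and the list of accepted key types are assumptions for illustration, not part of this PR.

```bash
#!/bin/bash
# Hypothetical pre-flight check: return 0 if the path is a readable,
# non-empty file containing a line that looks like an OpenSSH public key.
check_pubkey() {
    local path="$1"
    [ -r "$path" ] && [ -s "$path" ] || return 1
    # Accept the common key types; extend the pattern if you use others.
    grep -Eq '^(ssh-(rsa|ed25519)|ecdsa-sha2-[a-z0-9-]+) ' "$path"
}

# Example usage (the path is a placeholder, as in the playbook):
# check_pubkey /path/to/id_rsa.pub || echo "public key missing or malformed" >&2
```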
## 4. Run the playbook | ||
|
||
Run the playbook with the following command, which also decrypts the variables file: | ||
Review comment: Doesn't this require the other servers to already have the main server's public key before it can run? Otherwise it fails with a connection error.
Reply: Yes. |
||
|
||
```bash | ||
ansible-playbook -i inventory.ini distribute_ssh_key.yml --ask-vault-pass | ||
``` | ||
Alternatively, run: | ||
|
||
```bash | ||
./dist_ssh_key.sh | ||
``` | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
ansible-playbook -i inventory.ini distribute_ssh_key.yml --ask-vault-pass |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
--- | ||
- name: Distribute SSH key | ||
  hosts: all | ||
  vars_files: | ||
    - vars.yml | ||
  tasks: | ||
    - name: Create .ssh directory if it doesn't exist | ||
      file: | ||
        path: /home/{{ ansible_user }}/.ssh | ||
        state: directory | ||
        mode: '0700' | ||
        owner: "{{ ansible_user }}" | ||
        group: "{{ ansible_user }}" | ||
|
||
    - name: Copy the SSH key to the authorized_keys file | ||
      authorized_key: | ||
        user: "{{ ansible_user }}" | ||
        state: present | ||
        key: "{{ lookup('file', '/home/username/.ssh/id_rsa.pub') }}" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
[all] | ||
of27 ansible_host=192.168.1.27 ansible_user=myuser | ||
of28 ansible_host=192.168.1.28 ansible_user=myuser |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
all: | ||
  hosts: | ||
    192.168.1.27: | ||
      ansible_user: myuser | ||
      ansible_password: mypassword | ||
    192.168.1.28: | ||
      ansible_user: myuser | ||
      ansible_password: mypassword |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Pulling or loading an image | ||
|
||
Note: the user needs Docker permissions on every machine. | ||
|
||
## Pulling an image | ||
|
||
Use this to pull an image directly from Docker Hub. | ||
|
||
Usage: `./pull.sh [image_tag]` | ||
|
||
Parameters: | ||
|
||
- image_tag (optional): the Docker image tag to pull, for example alpine:latest. If omitted, the default defined in the playbook is used. | ||
|
||
Examples: | ||
|
||
- Use the defaults: | ||
|
||
```bash | ||
./pull.sh | ||
``` | ||
Review comment: A note could be added here that Docker permissions are required. |
||
|
||
- Specify an image tag: | ||
|
||
```bash | ||
./pull.sh alpine:latest | ||
``` | ||
Review comment: Specifying a tag appears to time out for me.
Reply: I hit the same timeout, which is why the load + commit approach was developed. We will build a custom image later, so pull may not be needed. |
||
|
||
## Loading an image | ||
|
||
Use this when a tar file of a saved image already exists in a local shared directory; the image is imported with `docker load`. | ||
|
||
Usage: `./load.sh [image_file_path] [image_tag] [force_load]` | ||
|
||
Parameters: | ||
|
||
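- image_file_path (optional): path to the Docker image tar file to load. Defaults to `/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar`. | ||
- image_tag (optional): the Docker image tag to apply after loading. Defaults to `oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8`. | ||
- force_load (optional): whether to load the image even if the tag already exists (true or false). Defaults to false. | ||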
|
||
Examples: | ||
|
||
- Use the defaults: | ||
|
||
```bash | ||
./load.sh | ||
``` | ||
|
||
- Specify the image file path and tag: | ||
|
||
```bash | ||
./load.sh /path/to/shared/abc.tar myrepo/myimage:latest | ||
``` | ||
|
||
- Force loading the image: | ||
|
||
```bash | ||
./load.sh /path/to/shared/abc.tar myrepo/myimage:latest true | ||
``` | ||
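The way the three optional arguments fall back to their defaults can be sketched with Bash parameter expansion. This is an illustrative alternative to the `if [ -n ... ]` blocks in `load.sh`, not the script's actual code; the function name `resolve_load_args` is hypothetical, while the default values are the ones documented above.

```bash
#!/bin/bash
# Hypothetical helper: print the effective image path, image tag, and
# force_load flag, one per line. ${N:-default} uses the default when the
# argument is unset or empty, matching the -n tests in load.sh.
resolve_load_args() {
    local image_path="${1:-/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar}"
    local image_tag="${2:-oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8}"
    local force_load="${3:-false}"
    printf '%s\n%s\n%s\n' "$image_path" "$image_tag" "$force_load"
}

# Example usage:
# resolve_load_args /path/to/shared/abc.tar myrepo/myimage:latest true
```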
|
||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
#!/bin/bash | ||
|
||
if [ -n "$1" ]; then | ||
    docker_image_path=$1 | ||
else | ||
    docker_image_path="/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar" | ||
fi | ||
|
||
if [ -n "$2" ]; then | ||
    docker_image_tag=$2 | ||
else | ||
    docker_image_tag="oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8" | ||
fi | ||
|
||
if [ -n "$3" ]; then | ||
    force_load=$3 | ||
else | ||
    force_load=false | ||
fi | ||
|
||
ansible-playbook \ | ||
    -i ../inventory.ini \ | ||
    load_and_tag_docker_image.yml \ | ||
    -e "docker_image_path=$docker_image_path" \ | ||
    -e "docker_image_tag=$docker_image_tag" \ | ||
    -e "force_load=$force_load" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
--- | ||
- name: Load and tag Docker image | ||
  hosts: all | ||
  vars: | ||
    docker_image_path: "/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar" | ||
    docker_image_tag: "oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8" | ||
    force_load: false | ||
|
||
  # force_load arrives as a string when passed with -e key=value, so cast it with the bool filter. | ||
  tasks: | ||
    - name: Check if Docker image with the specified tag already exists | ||
      command: "docker images -q {{ docker_image_tag }}" | ||
      register: image_id | ||
      changed_when: false | ||
      when: not (force_load | bool) | ||
|
||
    - name: Load Docker image from tar file | ||
      command: "docker load -i {{ docker_image_path }}" | ||
      when: (force_load | bool) or image_id.stdout == "" | ||
      register: load_output | ||
|
||
    - name: Get image ID from load output | ||
      set_fact: | ||
        loaded_image_id: "{{ load_output.stdout_lines[-1] | regex_search('sha256:[0-9a-f]+') }}" | ||
      when: (force_load | bool) or image_id.stdout == "" | ||
|
||
    - name: Tag the loaded Docker image | ||
      command: "docker tag {{ loaded_image_id }} {{ docker_image_tag }}" | ||
      when: (force_load | bool) or image_id.stdout == "" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
#!/bin/bash | ||
|
||
if [ -n "$1" ]; then | ||
    ansible-playbook -i ../inventory.ini pull_docker_image.yml -e "docker_image=$1" | ||
else | ||
    ansible-playbook -i ../inventory.ini pull_docker_image.yml | ||
fi |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
--- | ||
- name: Pull specified Docker image | ||
  hosts: all | ||
  vars: | ||
    docker_image: "oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8" | ||
|
||
  tasks: | ||
    - name: Check if the Docker image is already present | ||
      command: "docker images -q {{ docker_image }}" | ||
      register: docker_image_id | ||
      changed_when: false | ||
|
||
    - name: Pull Docker image if not present | ||
      docker_image: | ||
        name: "{{ docker_image }}" | ||
        source: pull | ||
      when: docker_image_id.stdout == "" |
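The check-then-pull logic of this playbook can be tried on a single machine before rolling it out with Ansible. Below is a rough shell equivalent, assuming the `docker` CLI is on the PATH and the user has Docker permissions; the function name `ensure_image` is hypothetical.

```bash
#!/bin/bash
# Hypothetical single-host equivalent of pull_docker_image.yml:
# pull the image only when `docker images -q` prints no ID for it.
ensure_image() {
    local image="$1"
    if [ -n "$(docker images -q "$image")" ]; then
        echo "present: $image"
    else
        docker pull "$image"
    fi
}

# Example usage:
# ensure_image oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8
```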
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# Usage of run_dist_training.sh | ||
|
||
`run_dist_training.sh` is a Bash script that runs the `ansible-playbook` command to start distributed training. It accepts arguments that specify the Docker image and the source directory. | ||
|
||
## Usage | ||
|
||
```bash | ||
./run_dist_training.sh [docker_image] [src] | ||
``` | ||
|
||
## Parameters | ||
|
||
- `docker_image` (optional): name of the Docker image to use. Defaults to `oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8`. | ||
- `src` (optional): source directory to mount into the Docker container. Defaults to `/share_nfs/k85/models/Vision/classification/image/resnet50`. | ||
|
||
## Examples | ||
|
||
1. Run with the default values: | ||
|
||
```bash | ||
./run_dist_training.sh | ||
``` | ||
|
||
2. Run with a specific Docker image: | ||
|
||
```bash | ||
./run_dist_training.sh "my_custom_image:latest" | ||
``` | ||
|
||
3. Run with a specific Docker image and source directory: | ||
|
||
```bash | ||
./run_dist_training.sh "my_custom_image:latest" "/my/custom/src" | ||
``` | ||
|
||
## Notes | ||
|
||
If no arguments are given, the script uses the default Docker image and source directory. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
--- | ||
- name: Distributed Training Setup | ||
  hosts: all | ||
  vars: | ||
    device_num_per_node: 8 | ||
    num_nodes: "{{ groups['all'] | length }}" | ||
    master_addr: "{{ hostvars[groups['all'][0]].ansible_host }}" | ||
    docker_image: "oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8" | ||
    src: "/share_nfs/k85/models/Vision/classification/image/resnet50" | ||
|
||
  tasks: | ||
    - name: Set node rank | ||
      set_fact: | ||
        node_rank: "{{ groups['all'].index(inventory_hostname) }}" | ||
|
||
    - name: distributed training in Docker container | ||
      command: > | ||
        docker run --rm --gpus all | ||
        --runtime=nvidia --privileged | ||
        --network host --ipc=host | ||
        -v {{ src }}:/workspace | ||
        -w /workspace | ||
        {{ docker_image }} /bin/bash -c " | ||
          python3 -m oneflow.distributed.launch \ | ||
            --nproc_per_node {{ device_num_per_node }} \ | ||
            --nnodes {{ num_nodes }} \ | ||
            --node_rank {{ node_rank }} \ | ||
            --master_addr {{ master_addr }} \ | ||
            /workspace/train.py \ | ||
            --synthetic-data \ | ||
            --batches-per-epoch 1000 \ | ||
            --num-devices-per-node {{ device_num_per_node }} \ | ||
            --lr 1.536 \ | ||
            --num-epochs 1 \ | ||
            --train-batch-size 32 \ | ||
            --graph \ | ||
            --use-fp16 \ | ||
            --metric-local False \ | ||
            --metric-train-acc True \ | ||
            --use-gpu-decode \ | ||
            --channel-last \ | ||
            --skip-eval | ||
        " | ||
      register: output | ||
|
||
    - name: Display output | ||
      debug: | ||
        var: output.stdout | ||
|
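The playbook takes each node's rank from its position in `groups['all']` and uses the first host's `ansible_host` as the master address. To preview that mapping before launching a run, something like the sketch below could be used. It assumes a simple INI inventory with a single group and one host per line, as in `inventory.ini` from this PR; the function name `list_node_ranks` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print "<rank> <inventory_name> <address>" for each host
# line of a simple INI inventory. Rank 0 corresponds to master_addr.
list_node_ranks() {
    local inventory="$1"
    awk '
        /^[ \t]*$/ { next }              # skip blank lines
        /^[ \t]*(#|;|\[)/ { next }       # skip comments and [group] headers
        {
            host = $1
            addr = $1                    # fall back to the inventory name
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ansible_host=/) { addr = substr($i, 14) }
            }
            print rank++, host, addr
        }
    ' "$inventory"
}

# Example usage:
# list_node_ranks ../inventory.ini
```

Inventories with several groups, child groups, or host ranges are not handled by this sketch; `ansible-inventory -i ../inventory.ini --list` shows how Ansible itself resolves the inventory in those cases.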
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
#!/bin/bash | ||
|
||
DOCKER_IMAGE="oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8" | ||
SRC="/share_nfs/k85/models/Vision/classification/image/resnet50" | ||
|
||
if [ -n "$1" ]; then | ||
    DOCKER_IMAGE="$1" | ||
fi | ||
|
||
if [ -n "$2" ]; then | ||
    SRC="$2" | ||
fi | ||
|
||
# Run the ansible-playbook command | ||
ansible-playbook -i ../inventory.ini dist_training.yml -e "docker_image=${DOCKER_IMAGE}" -e "src=${SRC}" |