Commit 2590e63

Merged PR 32: Support IcM ticket creation for OFR nodes
Support IcM ticket creation for OFR nodes.

# Documents for testing new features

### Description:

Add a node-recycler service that submits IcM tickets for UA nodes and recycles those nodes.

### Test Actions:

1. Build the image for the node-recycler service.
2. Write the configuration for the node-recycler under alert-manager:

   ```yaml
   node-recycler:
     configured: False
     kusto:
       cluster: luciatrainingplatform
       database: ""
     icm:
       host: prod.microsofticm.com
       crt_name: cert.pem
       crt_path: /icm/cert.pem
       key_name: key.pem
       key_path: /icm/key.pem
       connector_id: ""
       routing_id: hpclucia://AzureHPC/Lucia/TrainingPlatform
     vmss:
       client_id: ""
       vmss_ids: "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmss}"
       validation_image: bench.azurecr.io/superbench:0.11-rocm6.3
   ```
3. Prepare the certificate files for the IcM connector.
4. Deploy the service to the cluster.
5. Mark some nodes as `triaged_hardware` in the node status table and add the corresponding action records to the node action table.
6. Check whether the service submits the corresponding IcM tickets and deallocates/starts the nodes.

### Expected Results:

1. The image builds successfully.
2. The service is deployed successfully.
3. Each affected node has the following statuses and actions added to the node status table and the node action table:
   * status: `ua`, action: `triaged_hardware-ua`
   * status: `deallocated_ua`, action: `ua-deallocated_ua`
   * status: `allocated_ua`, action: `deallocated_ua-allocated_ua`
   * status: `validating`, action: `allocated_ua-validating`
4. One superbench job named `superbench_uaback_auto_xxxxxxxx` exists in the cluster for each affected node.
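The expected results pair each new status with an action whose name appears to join the previous status and the new status with a hyphen. The sketch below shows how a tester might compute the records to look for. It is only an illustration: the `<previous>-<new>` naming rule is an assumption inferred from the list, and `RECYCLE_CHAIN`, `expected_records`, and `missing_records` are hypothetical names, not part of the node-recycler code.

```python
# Status chain taken from the test plan: a node starts as triaged_hardware
# and should end up in validating.
RECYCLE_CHAIN = [
    "triaged_hardware",
    "ua",
    "deallocated_ua",
    "allocated_ua",
    "validating",
]


def action_name(previous, new):
    """Assumed convention: '<previous status>-<new status>'."""
    return f"{previous}-{new}"


def expected_records(chain=RECYCLE_CHAIN):
    """Return the (status, action) pairs a fully recycled node should have."""
    return [(new, action_name(prev, new)) for prev, new in zip(chain, chain[1:])]


def missing_records(observed, chain=RECYCLE_CHAIN):
    """Return the expected (status, action) pairs that were not observed."""
    seen = set(observed)
    return [pair for pair in expected_records(chain) if pair not in seen]


if __name__ == "__main__":
    # Pretend only the first two transitions have been recorded so far.
    observed = [("ua", "triaged_hardware-ua"), ("deallocated_ua", "ua-deallocated_ua")]
    for status, action in missing_records(observed):
        print(f"still waiting for status={status} action={action}")
```

Feeding `missing_records` the rows queried from the status and action tables gives a quick pass/fail signal for step 3 of the expected results.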
1 parent 40e4f7d commit 2590e63

11 files changed

+584
-0
lines changed

.gitmodules

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+[submodule "src/alert-manager/src/node-recycler/python-icm"]
+    path = src/alert-manager/src/node-recycler/python-icm
+    url = https://dev.azure.com/msblox/python-icm/_git/python-icm

src/alert-manager/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -1,2 +1,4 @@
 # Dependency directories
 node_modules/
+
+deploy/icm-certs/

src/alert-manager/build/build-pre.sh

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 
 pushd $(dirname "$0") > /dev/null
 
+cp -arfT "../../kusto/sdk" "../src/node-recycler"
 cp -arfT "../../database-controller/sdk" "../src/job-status-change-notification/openpaidbsdk"
 
 popd > /dev/null
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+FROM mcr.microsoft.com/azurelinux/base/python:3.12
+
+WORKDIR /usr/src/app
+COPY ./src/node-recycler .
+
+RUN pip3 install -r requirements.txt
+
+ENTRYPOINT ["python3", "recycler.py"]

src/alert-manager/config/alert-manager.yaml

Lines changed: 17 additions & 0 deletions
@@ -23,6 +23,23 @@ job-status-change-notification:
   log-level: 'info'
   db-poller-max-db-connection: 10
   db-poller-interval-second: 600
+node-recycler:
+  configured: False
+  kusto:
+    cluster: luciatrainingplatform
+    database: ""
+  icm:
+    host: prod.microsofticm.com
+    crt_name: cert.pem
+    crt_path: /icm/cert.pem
+    key_name: key.pem
+    key_path: /icm/key.pem
+    connector_id: ""
+    routing_id: hpclucia://AzureHPC/Lucia/TrainingPlatform
+  vmss:
+    client_id: ""
+    vmss_ids: ""
+    validation_image: bench.azurecr.io/superbench:0.11-rocm6.3
 use-pylon: False
 repeat-interval: '24h'
 cert-expiration-checker:
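The deployment and start-script templates in this commit look these keys up by exact name, and several lookups have no `default(...)` fallback, so a misspelled key would likely render as an empty value rather than fail loudly. A minimal sketch of a pre-deployment check follows. It assumes the `node-recycler` section has already been loaded into a plain dict; `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the key list is transcribed from the template lookups shown in this diff.

```python
# (section, key) pairs the templates read without a default() fallback.
REQUIRED_KEYS = [
    ("kusto", "cluster"),
    ("kusto", "database"),
    ("icm", "host"),
    ("icm", "crt_path"),
    ("icm", "key_path"),
    ("icm", "connector_id"),
    ("icm", "routing_id"),
    ("vmss", "client_id"),
    ("vmss", "vmss_ids"),
    ("vmss", "validation_image"),
]


def missing_keys(node_recycler):
    """Return 'section.key' names that the templates expect but the config lacks.

    Whole sections are optional (the templates guard them with
    `{% if "<section>" in ... %}`), so only sections that are present are checked.
    """
    missing = []
    for section, key in REQUIRED_KEYS:
        values = node_recycler.get(section)
        if values is None:
            continue  # section omitted on purpose: nothing is rendered for it
        if key not in values:
            missing.append(f"{section}.{key}")
    return missing


if __name__ == "__main__":
    # Example config with one misspelled key and the vmss section omitted.
    config = {
        "configured": True,
        "kusto": {"cluster": "luciatrainingplatform", "databse": ""},
    }
    print(missing_keys(config))
```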

src/alert-manager/deploy/alert-manager-deployment.yaml.template

Lines changed: 69 additions & 0 deletions
@@ -193,6 +193,68 @@ spec:
         - name: config-volume
           mountPath: /app/config
       {% endif %}
+
+      {% if cluster_cfg["alert-manager"]["node-recycler"]["configured"] %}
+      - name: node-recycler
+        image: {{ cluster_cfg['cluster']['docker-registry']['prefix'] }}node-recycler:{{ cluster_cfg['cluster']['docker-registry']['tag'] }}
+        imagePullPolicy: Always
+        env:
+        - name: CLUSTER_ID
+          value: {{ cluster_cfg["cluster"]["common"]["cluster-id"] }}
+        - name: ENVIRONMENT
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["environment"] | default("prod") }}
+        - name: KUSTO_USER_ASSIGNED_CLIENT_ID
+          value: {{ cluster_cfg["alert-manager"]["kusto-user-assigned-client-id"] }}
+        {% if "kusto" in cluster_cfg["alert-manager"]["node-recycler"] %}
+        - name: LTP_KUSTO_CLUSTER
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["kusto"]["cluster"] }}
+        - name: LTP_KUSTO_DATABASE_NAME
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["kusto"]["database"] }}
+        - name: KUSTO_NODE_STATUS_TABLE_NAME
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["kusto"]["node-status-table"] | default("NodeStatusRecord") }}
+        - name: KUSTO_NODE_ACTION_TABLE_NAME
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["kusto"]["node-action-table"] | default("NodeActionRecord") }}
+        {% endif %}
+        {% if "icm" in cluster_cfg["alert-manager"]["node-recycler"] %}
+        - name: ICM_HOST
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["icm"]["host"] }}
+        - name: ICM_CERT_PATH
+          value: /etc/icm/certs/{{ cluster_cfg["alert-manager"]["node-recycler"]["icm"]["crt_name"] | default("cert.pem") }}
+        - name: ICM_KEY_PATH
+          value: /etc/icm/certs/{{ cluster_cfg["alert-manager"]["node-recycler"]["icm"]["key_name"] | default("key.pem") }}
+        - name: ICM_CONNECTOR_ID
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["icm"]["connector_id"] }}
+        - name: ICM_ROUTING_ID
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["icm"]["routing_id"] }}
+        {% endif %}
+        {% if "vmss" in cluster_cfg["alert-manager"]["node-recycler"] %}
+        - name: AZURE_CLIENT_ID
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["vmss"]["client_id"] }}
+        - name: LTP_VMSS_IDS
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["vmss"]["vmss_ids"] }}
+        - name: LTP_VALIDATION_IMAGE
+          value: {{ cluster_cfg["alert-manager"]["node-recycler"]["vmss"]["validation_image"] }}
+        {% endif %}
+        - name: REST_SERVER_URI
+          value: {{ cluster_cfg["rest-server"]["uri"] }}
+        - name: REST_SERVER_TOKEN
+          value: {{ cluster_cfg["alert-manager"]["pai-bearer-token"] }}
+        resources:
+          requests:
+            memory: "256Mi"
+            cpu: "100m"
+          limits:
+            memory: "512Mi"
+            cpu: "200m"
+        volumeMounts:
+        - name: config-volume
+          mountPath: /app/config
+        {% if "icm" in cluster_cfg["alert-manager"]["node-recycler"] %}
+        - name: icm-certs
+          mountPath: /etc/icm/certs
+        {% endif %}
+      {% endif %}
+
       imagePullSecrets:
       - name: {{ cluster_cfg["cluster"]["docker-registry"]["secret-name"] }}
       volumes:
@@ -212,6 +274,13 @@ spec:
             path: {{ template }}/subject.ejs
       {% endfor -%}
       {% endif %}
+      {% endif %}
+      {% if cluster_cfg["alert-manager"]["node-recycler"]["configured"] %}
+      {% if "icm" in cluster_cfg["alert-manager"]["node-recycler"] %}
+      - name: icm-certs
+        configMap:
+          name: icm-certs
+      {% endif %}
       {% endif %}
       - name: alertmanager
         emptyDir: {}
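The container receives all of its settings through the environment variables rendered above. As an illustration of how a service could consume them, the sketch below collects a few of those variables into one settings object. This is not the code in `recycler.py`, which this diff does not show: `RecyclerSettings` and `load_settings` are hypothetical names, and treating `LTP_VMSS_IDS` as a comma-separated list is an assumption based only on the plural name.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class RecyclerSettings:
    cluster_id: str
    environment: str
    icm_host: Optional[str]
    icm_cert_path: Optional[str]
    icm_key_path: Optional[str]
    vmss_ids: List[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> RecyclerSettings:
    """Build settings from environment variables named in the deployment template."""
    env = os.environ if env is None else env
    # Assumption: several VMSS resource IDs are separated by commas.
    raw_ids = env.get("LTP_VMSS_IDS", "")
    return RecyclerSettings(
        cluster_id=env["CLUSTER_ID"],  # always rendered, so treated as required
        environment=env.get("ENVIRONMENT", "prod"),  # mirrors the template default
        # The ICM_* variables exist only when the "icm" section is configured.
        icm_host=env.get("ICM_HOST"),
        icm_cert_path=env.get("ICM_CERT_PATH"),
        icm_key_path=env.get("ICM_KEY_PATH"),
        vmss_ids=[part.strip() for part in raw_ids.split(",") if part.strip()],
    )


if __name__ == "__main__":
    demo_env = {"CLUSTER_ID": "demo-cluster", "LTP_VMSS_IDS": "id-one, id-two"}
    print(load_settings(demo_env))
```

Passing a mapping instead of reading `os.environ` directly keeps the loader easy to exercise in a unit test.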

src/alert-manager/deploy/start.sh.template

Lines changed: 8 additions & 0 deletions
@@ -31,6 +31,14 @@ kubectl create configmap alert-templates \
 {% endif -%}
 {% endif -%}
 
+# create configmap for node-recycler
+{% if "icm" in cluster_cfg["alert-manager"]["node-recycler"] %}
+mkdir -p icm-certs/
+cp {{ cluster_cfg["alert-manager"]["node-recycler"]["icm"]["crt_path"] }} icm-certs/
+cp {{ cluster_cfg["alert-manager"]["node-recycler"]["icm"]["key_path"] }} icm-certs/
+kubectl create configmap icm-certs --from-file=icm-certs/ --dry-run=client -o yaml | kubectl apply --overwrite=true -f - || exit $?
+{% endif %}
+
 kubectl apply --overwrite=true -f rbac.yaml || exit $?
 kubectl apply --overwrite=true -f alert-manager-configmap.yaml || exit $?
 alert-manager-deployment.yaml || exit $?
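The start script copies the files at `crt_path` and `key_path` into `icm-certs/`, and `kubectl create configmap --from-file=<dir>` uses each file's base name as its key. The deployment template builds `ICM_CERT_PATH` and `ICM_KEY_PATH` from `crt_name` and `key_name` instead. The two therefore need to agree: the base name of each `*_path` has to match the corresponding `*_name`. The sketch below checks that. The function names are hypothetical, and the mount directory and default file names are copied from the templates in this commit.

```python
import posixpath

MOUNT_DIR = "/etc/icm/certs"  # mountPath of the icm-certs volume


def rendered_env_paths(icm):
    """Paths the deployment template would render for the container."""
    return {
        "ICM_CERT_PATH": posixpath.join(MOUNT_DIR, icm.get("crt_name", "cert.pem")),
        "ICM_KEY_PATH": posixpath.join(MOUNT_DIR, icm.get("key_name", "key.pem")),
    }


def configmap_keys(icm):
    """Keys the ConfigMap ends up with: base names of the copied files."""
    return {posixpath.basename(icm["crt_path"]), posixpath.basename(icm["key_path"])}


def unresolved_paths(icm):
    """Return rendered paths that would not exist inside the mounted volume."""
    keys = configmap_keys(icm)
    return [
        path
        for path in rendered_env_paths(icm).values()
        if posixpath.basename(path) not in keys
    ]


if __name__ == "__main__":
    icm = {
        "crt_name": "cert.pem",
        "crt_path": "/icm/prod-cert.pem",  # base name differs from crt_name
        "key_name": "key.pem",
        "key_path": "/icm/key.pem",
    }
    print(unresolved_paths(icm))
```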
Submodule python-icm added at bd61c6b
