[Feature] Event record for failed Pod creation #2250

Closed
1 of 2 tasks
Eikykun opened this issue Jul 16, 2024 · 5 comments

@Eikykun
Contributor

Eikykun commented Jul 16, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Recently, a RayCluster user configured an invalid label in the pod template. The only way I could discover the problem was through the RayOperator logs. Perhaps we could use the following methods to troubleshoot or avoid such issues more quickly:

  • Record relevant failure information using EventRecorder when pod creation fails.
  • Add validation logic for the pod template in the validating webhook.
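
To make the second bullet concrete, here is a minimal sketch of what label validation for a pod template could look like. It is not KubeRay's webhook code: `validatePodTemplateLabels` and `validateLabelKey` are hypothetical names, and the regular expressions restate the label syntax from the Kubernetes documentation using only the Go standard library. A real operator would more likely call the helpers in `k8s.io/apimachinery/pkg/util/validation` (for example `IsQualifiedName` and `IsValidLabelValue`).

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// Patterns restating the documented Kubernetes label syntax (an assumption of
// this sketch; the authoritative checks live in k8s.io/apimachinery).
var (
	// Value: empty, or alphanumeric at both ends with -, _, . allowed inside.
	labelValueRe = regexp.MustCompile(`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$`)
	// Name segment of a key: same character rules, but must not be empty.
	labelNameRe = regexp.MustCompile(`^([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$`)
	// Optional key prefix: a lowercase DNS subdomain.
	dnsSubdomainRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
)

// validateLabelKey checks an optional "prefix/" followed by a name segment.
func validateLabelKey(key string) error {
	name := key
	if i := strings.Index(key, "/"); i >= 0 {
		prefix := key[:i]
		name = key[i+1:]
		if len(prefix) == 0 || len(prefix) > 253 || !dnsSubdomainRe.MatchString(prefix) {
			return fmt.Errorf("label key %q: prefix must be a DNS subdomain of at most 253 characters", key)
		}
	}
	if len(name) == 0 || len(name) > 63 || !labelNameRe.MatchString(name) {
		return fmt.Errorf("label key %q: name must be 1-63 characters, alphanumeric at both ends", key)
	}
	return nil
}

// validatePodTemplateLabels (hypothetical helper) returns one error per
// invalid key or value, in sorted key order so the output is deterministic.
func validatePodTemplateLabels(labels map[string]string) []error {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var errs []error
	for _, k := range keys {
		if err := validateLabelKey(k); err != nil {
			errs = append(errs, err)
		}
		v := labels[k]
		if len(v) > 63 || !labelValueRe.MatchString(v) {
			errs = append(errs, fmt.Errorf("label %q: invalid value %q", k, v))
		}
	}
	return errs
}

func main() {
	// The space in "ml platform" is not allowed in a label value.
	labels := map[string]string{"app": "ray", "team": "ml platform"}
	for _, err := range validatePodTemplateLabels(labels) {
		fmt.Println(err)
	}
}
```

Running a check like this before the pod is submitted would surface the bad label at admission or reconcile time instead of only in the operator logs.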

Use case

RayCluster troubleshooting

Related issues

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@Eikykun added the enhancement and triage labels Jul 16, 2024
@Eikykun changed the title from "[Feature] Validation of Pod Template in RayCluster" to "[Feature] Event recorded for failed Pod creation" Jul 16, 2024
@Eikykun changed the title from "[Feature] Event recorded for failed Pod creation" to "[Feature] Event record for failed Pod creation" Jul 16, 2024
@andrewsykim
Member

So far in KubeRay we've tried to avoid validating webhooks for two reasons:

  1. Many users have company-wide policies disallowing webhooks
  2. Webhooks potentially introduce additional complexity and reliability issues

We do have a separate effort to improve RayCluster observability via status conditions: ray-project/enhancements#54

Maybe we can incorporate invalid Pod templates as part of this @kevin85421 @rueian

@andrewsykim
Member

Also just worth clarifying that we do have config for validating webhooks but it's optional and you need to manually deploy it like this: https://github.com/ray-project/kuberay/blob/master/ray-operator/Makefile#L132-L134

The validation logic is here: https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/ray/v1/raycluster_webhook.go#L51-L68

@andrewsykim
Member

Ah, I see you changed the issue to ask for events instead of validating webhooks. Related issue: #2189

@Eikykun
Contributor Author

Eikykun commented Jul 17, 2024

@andrewsykim. Haha. In fact, we didn't enable webhooks, so the ErrorEvent is sufficient for us to troubleshoot issues. Of course, using validating webhooks to verify the RayCluster would be even better if allowed.
A failure to create a pod is not always due to invalid YAML. Using Events can help us quickly diagnose the issue.

@kevin85421
Member

Closed by #2286
