
AIOps Interview Questions

🎯 Fundamentals

Q: What is AIOps?

A: AIOps (Artificial Intelligence for IT Operations) uses ML to:

Automate IT operations
Detect anomalies before users notice
Correlate events across systems
Suggest or execute remediation

Q: What are the key components of an AIOps platform?

A:

Data Ingestion: Metrics, logs, traces
Anomaly Detection: ML models for unusual behavior
Event Correlation: Connect related events
Root Cause Analysis: Identify problem source
Automated Remediation: Self-healing actions

📊 Anomaly Detection

Q: How do you detect anomalies in time-series data?

A:

Statistical: Z-score, IQR, ARIMA forecast residuals
ML-based: Isolation Forest, One-class SVM
Deep Learning: LSTM autoencoders, Transformers
Seasonal: STL decomposition, then a threshold on the residual
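
The two simplest statistical detectors in the list can be sketched with the Python standard library alone. This is a minimal illustration, not a production detector: the function names, the default thresholds (3.0 for the z-score, 1.5 for the IQR multiplier), and the sample latencies are all assumptions made for the example.

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def iqr_anomalies(values, k=1.5):
    """Return indices outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical latency samples in milliseconds with one obvious spike.
latencies_ms = [101, 99, 102, 98, 100, 103, 97, 100, 101, 99, 250, 100]
print(zscore_anomalies(latencies_ms))  # [10]
print(iqr_anomalies(latencies_ms))     # [10]
```

Note that the spike itself inflates the mean and standard deviation used by the z-score, so on short windows the IQR rule tends to be the more robust of the two.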

Q: What's the difference between supervised and unsupervised anomaly detection?

Supervised	Unsupervised
Needs labeled data	No labels needed
Detects known anomaly types	Can flag previously unseen anomalies
Framed as classification	Framed as clustering or density estimation
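
To make the contrast concrete, here is a small sketch of both approaches on the same data. The supervised function learns a cutoff from labels, while the unsupervised one uses only the data's own spread (median and median absolute deviation). The function names, the `k=3.5` cutoff, and the sample error rates are assumptions for illustration; real systems would typically use a proper classifier or a model such as Isolation Forest.

```python
import statistics


def fit_supervised_threshold(values, labels):
    """Supervised: choose the cutoff that classifies the labeled examples best."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(set(values)):
        # Predict "anomaly" when the value is at or above the cutoff.
        correct = sum((v >= cutoff) == bool(y) for v, y in zip(values, labels))
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


def unsupervised_flags(values, k=3.5):
    """Unsupervised: flag points far from the median, scaled by MAD. No labels used."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(v - med) / mad > k for v in values]


# Hypothetical error rates (percent) and labels (1 = known incident).
error_rates = [1, 2, 1, 3, 2, 9, 1, 10]
labels = [0, 0, 0, 0, 0, 1, 0, 1]

print(fit_supervised_threshold(error_rates, labels))  # 9
print(unsupervised_flags(error_rates))
```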

🔧 Automated Remediation

Q: How do you safely implement auto-remediation?

A:

Start with low-risk actions (restart, scale)
Implement safeguards (cooldowns, limits)
Require human approval for risky actions
Log every automated action comprehensively
Provide an easy rollback mechanism
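
The first four safeguards can be sketched as a small gate that every remediation request passes through. This is one possible shape under stated assumptions: the `RemediationGuard` class, its default cooldown (300 s) and hourly limit (3), and the action and target names are all hypothetical, and rollback is left out because it depends on the action being taken.

```python
import time
from dataclasses import dataclass, field

# Low-risk actions named in the list above; everything else needs approval.
LOW_RISK_ACTIONS = {"restart", "scale"}


@dataclass
class RemediationGuard:
    cooldown_seconds: float = 300.0
    max_actions_per_hour: int = 3
    history: dict = field(default_factory=dict)    # target -> timestamps of allowed actions
    audit_log: list = field(default_factory=list)  # every decision, allowed or not

    def decide(self, action, target, approved=False, now=None):
        now = time.time() if now is None else now
        # Keep only the actions taken on this target within the last hour.
        recent = [t for t in self.history.get(target, []) if now - t < 3600]
        if action not in LOW_RISK_ACTIONS and not approved:
            decision = "needs_approval"
        elif recent and now - recent[-1] < self.cooldown_seconds:
            decision = "cooldown"
        elif len(recent) >= self.max_actions_per_hour:
            decision = "rate_limited"
        else:
            decision = "allowed"
            recent.append(now)
        self.history[target] = recent
        self.audit_log.append((now, action, target, decision))
        return decision


guard = RemediationGuard()
print(guard.decide("restart", "web-1", now=0))         # allowed
print(guard.decide("restart", "web-1", now=60))        # cooldown
print(guard.decide("drop_database", "db-1", now=0))    # needs_approval
```

Passing `now` explicitly keeps the guard deterministic in tests; in production the default `time.time()` would be used.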

Q: Design an auto-scaling system using ML.

A:

Collect historical metrics (CPU, requests, latency)
Train model to predict future load
Proactively scale before demand spike
Continuously retrain on new data
Fall back to reactive scaling if prediction fails
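
The predict-then-fall-back flow can be sketched as follows. A naive trend extrapolation stands in for the trained model, and the function names, the `capacity_per_replica` figure, and the replica bounds are assumptions for illustration; retraining is omitted.

```python
import math


def forecast_next(history, window=3):
    """Naive stand-in for a trained model: last value plus the average recent step."""
    if len(history) < window + 1:
        return None  # not enough data to predict
    recent = history[-(window + 1):]
    steps = [b - a for a, b in zip(recent, recent[1:])]
    return recent[-1] + sum(steps) / len(steps)


def desired_replicas(history, capacity_per_replica=100.0, min_replicas=1, max_replicas=20):
    """Return (replica_count, mode), where mode is 'predictive' or 'reactive'."""
    if not history:
        return min_replicas, "reactive"
    predicted = forecast_next(history)
    if predicted is None or not math.isfinite(predicted) or predicted < 0:
        # Prediction unavailable or implausible: size for the current load only.
        load, mode = history[-1], "reactive"
    else:
        # Never size below the current load, even if the forecast says demand drops.
        load, mode = max(predicted, history[-1]), "predictive"
    replicas = math.ceil(load / capacity_per_replica)
    return max(min_replicas, min(max_replicas, replicas)), mode


# Hypothetical requests-per-second history.
print(desired_replicas([200, 300, 400, 500]))  # (6, 'predictive')
print(desired_replicas([250]))                 # (3, 'reactive')
```

Taking the maximum of the forecast and the current load is a deliberately conservative choice: the sketch scales up ahead of demand but leaves scale-down to observed load.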

🎯 Scenario Questions

Q: How would you reduce alert fatigue?

A:

ML-based alert correlation
Automatic severity adjustment
Deduplication and grouping
Context-aware routing
Feedback loops (was the alert actionable?)
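
Deduplication and grouping is the easiest item to illustrate. The sketch below collapses alerts that share a service and check name and arrive within a time window of each other. The alert fields (`service`, `check`, `ts`), the 300-second window, and the sample alerts are assumptions for the example, not a real alerting schema.

```python
def group_alerts(alerts, window_seconds=300):
    """Collapse alerts with the same (service, check) that arrive close together."""
    groups = []
    open_groups = {}  # (service, check) -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["check"])
        group = open_groups.get(key)
        if group is not None and alert["ts"] - group["last_ts"] <= window_seconds:
            # Duplicate of an ongoing incident: bump the counter instead of paging again.
            group["count"] += 1
            group["last_ts"] = alert["ts"]
        else:
            group = {
                "service": alert["service"],
                "check": alert["check"],
                "first_ts": alert["ts"],
                "last_ts": alert["ts"],
                "count": 1,
            }
            open_groups[key] = group
            groups.append(group)
    return groups


# Hypothetical alert stream: timestamps are seconds since some start point.
alerts = [
    {"service": "api", "check": "cpu", "ts": 0},
    {"service": "api", "check": "cpu", "ts": 60},
    {"service": "db", "check": "disk", "ts": 30},
    {"service": "api", "check": "cpu", "ts": 120},
    {"service": "api", "check": "cpu", "ts": 1000},
]
for g in group_alerts(alerts):
    print(g["service"], g["check"], g["count"])
```

Here five raw alerts become three notifications: one grouped CPU incident, one disk alert, and a later CPU alert that falls outside the window.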

Next: Return to Interview Overview.