
Istio Traffic Management: Advanced Routing and Resilience

Version: 1.0 · Last Updated: 2025-11-13 · Status: Production Ready · Audience: Platform Engineers, SRE, Developers

Complete guide to Istio traffic management capabilities including intelligent routing, circuit breaking, retries, timeouts, fault injection, and canary deployments for the Kagenti AI Agent Platform.



Overview

Purpose: Configure advanced traffic management capabilities in Istio to achieve intelligent routing, high availability, and gradual rollouts for AI agents and platform services.

What You Get:

  • ✅ Intelligent request routing (header-based, path-based, weight-based)
  • ✅ Automatic retries with exponential backoff
  • ✅ Request timeouts to prevent cascading failures
  • ✅ Circuit breaking to isolate failing services
  • ✅ Canary deployments for gradual rollouts
  • ✅ Fault injection for chaos engineering
  • ✅ Traffic mirroring for testing
  • ✅ Load balancing algorithms (round-robin, least-conn, consistent hashing)

Key Principle: Istio's traffic management operates at Layer 7 (HTTP), enabling intelligent routing decisions based on request content, unlike traditional load balancers that operate at Layer 4 (TCP).

Source: Based on Istio Traffic Management, Istio Best Practices


Traffic Routing

HTTPRoute vs VirtualService

Kagenti platform uses Gateway API HTTPRoute as the primary routing mechanism, with Istio VirtualService for advanced features.

Comparison:

Feature   | HTTPRoute (Gateway API)            | VirtualService (Istio)
Standard  | Kubernetes SIG standard            | Istio-specific
Use Case  | External ingress routing           | Service-to-service routing
Features  | Basic routing, rewrites, redirects | Advanced: retries, timeouts, fault injection
Future    | ✅ Recommended for new features    | ⚠️ Consider migrating to HTTPRoute

Recommendation: Use HTTPRoute for external ingress, VirtualService for internal service mesh routing.

Source: Gateway API Overview, Istio Traffic Management


HTTPRoute Examples

Basic Path-Based Routing

Route traffic based on URL path:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: agent-routes
  namespace: team1
spec:
  parentRefs:
  - name: kagenti-gateway
    namespace: istio-system
  hostnames:
  - "agents.localtest.me"
  rules:
  # Route /research to research-agent
  - matches:
    - path:
        type: PathPrefix
        value: /research
    backendRefs:
    - name: research-agent
      port: 8080

  # Route /code to code-agent
  - matches:
    - path:
        type: PathPrefix
        value: /code
    backendRefs:
    - name: code-agent
      port: 8080

  # Route /orchestrate to orchestrator
  - matches:
    - path:
        type: PathPrefix
        value: /orchestrate
    backendRefs:
    - name: orchestrator-agent
      port: 8080

Source: HTTPRoute Path Matching


Header-Based Routing

Route traffic based on HTTP headers (e.g., API version):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: versioned-api-routes
  namespace: team1
spec:
  parentRefs:
  - name: kagenti-gateway
    namespace: istio-system
  hostnames:
  - "api.localtest.me"
  rules:
  # Route v2 API requests to new backend
  - matches:
    - headers:
      - name: api-version
        value: v2
    backendRefs:
    - name: research-agent-v2
      port: 8080

  # Default to v1 API
  - backendRefs:
    - name: research-agent
      port: 8080

Use Case: API versioning, A/B testing, feature flags

Source: HTTPRoute Header Matching


VirtualService Examples

Service-to-Service Routing with Subset

Route traffic to different versions based on weight:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: research-agent-vs
  namespace: team1
spec:
  hosts:
  - research-agent.team1.svc.cluster.local
  http:
  - match:
    - headers:
        x-user-type:
          exact: "beta-tester"
    route:
    - destination:
        host: research-agent.team1.svc.cluster.local
        subset: v2
      weight: 100

  # Default traffic goes to v1
  - route:
    - destination:
        host: research-agent.team1.svc.cluster.local
        subset: v1
      weight: 100

DestinationRule (defines subsets):

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: research-agent-dr
  namespace: team1
spec:
  host: research-agent.team1.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

Source: Istio VirtualService


Resiliency Patterns

Automatic Retries

Configure retries for transient failures:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: research-agent-retry
  namespace: team1
spec:
  hosts:
  - research-agent.team1.svc.cluster.local
  http:
  - route:
    - destination:
        host: research-agent.team1.svc.cluster.local
    retries:
      attempts: 3               # Retry up to 3 times
      perTryTimeout: 2s         # Timeout per attempt
      retryOn: 5xx,reset,refused-stream,retriable-4xx

Retry Conditions:

  • 5xx: Server errors (500, 502, 503, 504)
  • reset: Connection reset
  • refused-stream: HTTP/2 REFUSED_STREAM
  • retriable-4xx: Retriable 4xx responses (currently only 409 Conflict)

Source: Istio Retries
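Between retries, Envoy waits with a randomized exponential backoff (25 ms base interval by default). A client-side sketch of the same policy in Python, for illustration only — the function name and the use of a generic exception as the "retriable condition" are ours:

```python
import random
import time

def call_with_retries(send, attempts=3, per_try_timeout=2.0, base_backoff=0.025):
    # Mirrors the VirtualService above: `attempts` retries on top of
    # the initial try, each bounded by `per_try_timeout`, with
    # full-jitter exponential backoff between tries.
    last_err = None
    for attempt in range(1 + attempts):
        try:
            return send(timeout=per_try_timeout)
        except Exception as err:  # stands in for 5xx / reset / refused-stream
            last_err = err
            time.sleep(random.uniform(0, base_backoff * (2 ** attempt)))
    raise last_err
```

A call that fails twice and then succeeds completes on the third try; four consecutive failures exhaust the budget and re-raise the last error.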


Request Timeout

Prevent long-running requests from blocking resources:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: research-agent-timeout
  namespace: team1
spec:
  hosts:
  - research-agent.team1.svc.cluster.local
  http:
  - route:
    - destination:
        host: research-agent.team1.svc.cluster.local
    timeout: 10s              # Total timeout for request

Use Case: Prevent slow AI inference from blocking other requests

Source: Istio Timeouts


Circuit Breaking

Connection Pool Limits

Limit concurrent connections to prevent overwhelming backend:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: research-agent-circuit-breaker
  namespace: team1
spec:
  host: research-agent.team1.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100       # Max TCP connections
      http:
        http1MaxPendingRequests: 10   # Max pending HTTP/1.1 requests
        http2MaxRequests: 100         # Max concurrent HTTP/2 requests
        maxRequestsPerConnection: 2   # Max requests per connection (HTTP/1.1)

Why:

  • Prevents resource exhaustion: Limits concurrent connections
  • Protects backend: Prevents overload during traffic spikes
  • Fast fail: Returns 503 immediately when circuit is open

Source: Istio Circuit Breaking
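The fast-fail behaviour can be pictured as a non-blocking admission gate: once the configured number of requests is in flight, new requests are rejected immediately rather than queued. A minimal Python sketch (the class name is ours, not an Istio API):

```python
import threading

class PendingLimit:
    # Loose model of http2MaxRequests / http1MaxPendingRequests: when
    # the limit is reached, reject at once (Envoy returns 503 with the
    # UO response flag) instead of waiting for a slot.
    def __init__(self, max_requests):
        self._slots = threading.Semaphore(max_requests)

    def try_acquire(self):
        # Non-blocking: an "open" circuit fails fast, it never queues.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()
```

With a limit of 2, a third concurrent request is rejected until one of the first two completes and releases its slot.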


Outlier Detection

Automatically remove failing instances from load balancing pool:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: research-agent-outlier
  namespace: team1
spec:
  host: research-agent.team1.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5     # Eject after 5 consecutive 5xx errors
      interval: 30s               # Check every 30 seconds
      baseEjectionTime: 30s       # Eject for at least 30 seconds
      maxEjectionPercent: 50      # Eject max 50% of instances
      minHealthPercent: 25        # Suspend ejection if fewer than 25% of instances remain healthy

How It Works:

  1. Istio tracks errors per backend instance
  2. After 5 consecutive errors, instance is ejected from pool
  3. Instance is ejected for 30 seconds (increases with repeated ejections)
  4. After ejection time, instance is re-added to pool

Source: Istio Outlier Detection
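The four steps above can be sketched as a small state machine. This toy model (ours, not Envoy code) keeps only the consecutive-error counter and the growing ejection time; the real implementation also runs on a timer (`interval`) and enforces `maxEjectionPercent` / `minHealthPercent` across the whole pool:

```python
class OutlierTracker:
    def __init__(self, threshold=5, base_ejection=30.0):
        self.threshold = threshold
        self.base_ejection = base_ejection
        self.errors = {}         # host -> consecutive error count
        self.ejections = {}      # host -> times ejected so far
        self.ejected_until = {}  # host -> time the current ejection ends

    def record(self, host, ok, now):
        if ok:
            self.errors[host] = 0  # any success resets the streak
            return
        self.errors[host] = self.errors.get(host, 0) + 1
        if self.errors[host] >= self.threshold:
            times = self.ejections.get(host, 0) + 1
            self.ejections[host] = times
            # ejection time grows with each repeat ejection
            self.ejected_until[host] = now + self.base_ejection * times
            self.errors[host] = 0

    def in_pool(self, host, now):
        return now >= self.ejected_until.get(host, 0.0)
```

Five consecutive failures eject a host for 30 seconds; a second streak would eject it for 60.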


Canary Deployments

Gradual Traffic Shift

Incrementally shift traffic from v1 to v2:

Step 1: Deploy v2 (0% traffic)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: research-agent-v2
  namespace: team1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: research-agent
      version: v2
  template:
    metadata:
      labels:
        app: research-agent
        version: v2
    spec:
      containers:
      - name: agent
        image: localhost:5000/research-agent:v0.0.16

Step 2: Route 10% traffic to v2

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: research-agent-canary
  namespace: team1
spec:
  hosts:
  - research-agent.team1.svc.cluster.local
  http:
  - route:
    - destination:
        host: research-agent.team1.svc.cluster.local
        subset: v1
      weight: 90      # 90% to v1
    - destination:
        host: research-agent.team1.svc.cluster.local
        subset: v2
      weight: 10      # 10% to v2 (canary)

Step 3: Monitor metrics

# Check error rate for v2
kubectl exec -n observability deploy/prometheus -- \
  promtool query instant 'http://localhost:9090' \
  'rate(istio_requests_total{destination_version="v2",response_code=~"5.."}[5m])'

# Check latency for v2
kubectl exec -n observability deploy/prometheus -- \
  promtool query instant 'http://localhost:9090' \
  'histogram_quantile(0.99, rate(istio_request_duration_milliseconds_bucket{destination_version="v2"}[5m]))'

Step 4: Gradually increase traffic (50%)

http:
- route:
  - destination:
      host: research-agent.team1.svc.cluster.local
      subset: v1
    weight: 50
  - destination:
      host: research-agent.team1.svc.cluster.local
      subset: v2
    weight: 50

Step 5: Complete migration (100% to v2)

http:
- route:
  - destination:
      host: research-agent.team1.svc.cluster.local
      subset: v2
    weight: 100

Step 6: Decommission v1

kubectl delete deployment research-agent-v1 -n team1

Source: Istio Traffic Shifting
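Under the hood, weighted routing is proportional random selection over subsets. A sketch of the 90/10 split above (function name ours; `rnd` is injectable only to make the behaviour testable):

```python
import random

def pick_subset(weights, rnd=random.random):
    # `weights` maps subset name -> weight, as in the VirtualService
    # route (e.g. {"v1": 90, "v2": 10}). Each request independently
    # lands in a subset with probability weight / total.
    point = rnd() * sum(weights.values())
    for subset, weight in weights.items():
        point -= weight
        if point < 0:
            return subset
    return subset  # guard against floating-point edge cases
```

Raising the canary is then just editing the weights: the selection logic never changes.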


Fault Injection

HTTP Delay Injection

Inject latency to test timeout handling:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: research-agent-delay
  namespace: team1
spec:
  hosts:
  - research-agent.team1.svc.cluster.local
  http:
  - fault:
      delay:
        percentage:
          value: 10.0         # Inject delay for 10% of requests
        fixedDelay: 5s        # Delay by 5 seconds
    route:
    - destination:
        host: research-agent.team1.svc.cluster.local

Use Case: Test how orchestrator handles slow agent responses

Source: Istio Fault Injection


HTTP Abort Injection

Inject HTTP errors to test error handling:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: research-agent-abort
  namespace: team1
spec:
  hosts:
  - research-agent.team1.svc.cluster.local
  http:
  - fault:
      abort:
        percentage:
          value: 5.0          # Inject error for 5% of requests
        httpStatus: 503       # Return HTTP 503
    route:
    - destination:
        host: research-agent.team1.svc.cluster.local

Use Case: Test retry logic and circuit breaker behavior


Traffic Mirroring

Mirror Production Traffic to Test Environment

Send copy of production traffic to test environment without affecting users:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: research-agent-mirror
  namespace: team1
spec:
  hosts:
  - research-agent.team1.svc.cluster.local
  http:
  - route:
    - destination:
        host: research-agent.team1.svc.cluster.local
        subset: v1
      weight: 100
    mirror:
      host: research-agent.team1.svc.cluster.local
      subset: v2-test         # Mirror to test version
    mirrorPercentage:
      value: 10.0             # Mirror 10% of traffic

How It Works:

  1. Primary request goes to v1 (production)
  2. Copy of 10% of requests goes to v2-test
  3. v2-test response is ignored (fire-and-forget)
  4. Users see only v1 response

Use Case: Test new agent version with real traffic without risk

Source: Istio Traffic Mirroring


Request Routing

URL Rewrite

Rewrite request path before routing:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: agent-rewrite
  namespace: team1
spec:
  parentRefs:
  - name: kagenti-gateway
    namespace: istio-system
  hostnames:
  - "api.localtest.me"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/research
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplacePrefixMatch
          replacePrefixMatch: /a2a/task    # Rewrite /v1/research to /a2a/task
    backendRefs:
    - name: research-agent
      port: 8080

Use Case: API versioning without changing agent implementation

Source: HTTPRoute URL Rewrite


Request Header Manipulation

Add, modify, or remove request headers:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: agent-headers
  namespace: team1
spec:
  parentRefs:
  - name: kagenti-gateway
    namespace: istio-system
  rules:
  - filters:
    - type: RequestHeaderModifier
      requestHeaderModifier:
        add:
        - name: X-Agent-Platform
          value: kagenti   # Values must be static strings; Gateway API cannot
                           # generate per-request values, and Envoy already
                           # injects x-request-id at the gateway
        set:
        - name: X-Forwarded-Proto
          value: https
        remove:
        - X-Internal-Debug     # Remove internal headers
    backendRefs:
    - name: research-agent
      port: 8080

Source: HTTPRoute Header Filters


Timeout and Retry

Combined Timeout and Retry Configuration

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: research-agent-resilient
  namespace: team1
spec:
  hosts:
  - research-agent.team1.svc.cluster.local
  http:
  - route:
    - destination:
        host: research-agent.team1.svc.cluster.local
    timeout: 15s              # Total timeout including retries
    retries:
      attempts: 3
      perTryTimeout: 5s       # 5s per try; route-level timeout caps the total
      retryOn: 5xx,reset,refused-stream

Calculation:

  • Route timeout: 15s (caps the entire request, retries included)
  • Per-try timeout: 5s
  • Tries: 1 initial attempt + 3 retries = 4
  • Maximum time: min(timeout, perTryTimeout × tries) = min(15s, 20s) = 15s

Source: Istio Timeout and Retry
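The budget math can be captured in a one-line helper (a sketch; backoff time between tries is ignored, so real worst cases are slightly longer):

```python
def max_request_time(timeout, per_try_timeout, attempts):
    # In Istio, `attempts` counts retries on top of the initial try,
    # and the route-level `timeout` caps the whole request.
    tries = 1 + attempts
    return min(timeout, per_try_timeout * tries)
```

For the config above, `max_request_time(15, 5, 3)` returns 15: the route timeout fires before the fourth try could finish.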


Circuit Breaking

Advanced Circuit Breaker Configuration

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: research-agent-advanced-cb
  namespace: team1
spec:
  host: research-agent.team1.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 30s
      http:
        http1MaxPendingRequests: 10
        http2MaxRequests: 100
        maxRequestsPerConnection: 2
        h2UpgradePolicy: UPGRADE    # Allow HTTP/1.1 to HTTP/2 upgrade

    outlierDetection:
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 25          # Suspend ejection below 25% healthy instances
      consecutiveGatewayErrors: 3   # Eject after 3 gateway errors (502, 503, 504)
      consecutive5xxErrors: 5       # Eject after 5 consecutive 5xx errors

Source: Istio Connection Pool


Load Balancing

Load Balancing Algorithms

Round Robin (default):

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: research-agent-lb-rr
  namespace: team1
spec:
  host: research-agent.team1.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN

Least Connections:

trafficPolicy:
  loadBalancer:
    simple: LEAST_CONN

Random:

trafficPolicy:
  loadBalancer:
    simple: RANDOM

Consistent Hash (session affinity):

trafficPolicy:
  loadBalancer:
    consistentHash:
      httpHeaderName: "x-user-id"    # Hash based on user ID header

Consistent Hash (Cookie-based):

trafficPolicy:
  loadBalancer:
    consistentHash:
      httpCookie:
        name: session-id
        ttl: 3600s

Source: Istio Load Balancing
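The idea behind `consistentHash` is that the same header value always lands on the same backend. A deliberately simplified sketch (Envoy actually uses ring/ketama or Maglev hashing, so that adding a host remaps only about 1/n of keys rather than nearly all of them as plain modulo does):

```python
import hashlib

def pick_backend(header_value, backends):
    # Sticky selection keyed on a header (e.g. x-user-id): identical
    # values hash to the same backend on every request.
    digest = hashlib.md5(header_value.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]
```

Useful for session affinity when agents keep per-user state in memory; stateless agents can stick with round-robin.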


Troubleshooting

Issue: High Error Rate After Traffic Shift

Symptoms: 5xx errors increase after canary deployment

Diagnosis:

# Check error rate by version
kubectl exec -n observability deploy/prometheus -- \
  promtool query instant 'http://localhost:9090' \
  'sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_version)'

# Check destination rule applied
kubectl get destinationrule research-agent-dr -n team1 -o yaml

Root Cause: v2 pods are not ready and receive traffic before initialization completes

Solution:

# Add readiness probe to v2 deployment
spec:
  template:
    spec:
      containers:
      - name: agent
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

Source: Kubernetes Probes


Issue: Circuit Breaker Not Triggering

Symptoms: Backend overloaded despite circuit breaker configuration

Diagnosis:

# Check connection pool metrics
kubectl exec -n observability deploy/prometheus -- \
  promtool query instant 'http://localhost:9090' \
  'istio_tcp_connections_opened_total{destination_service="research-agent.team1.svc.cluster.local"}'

# Check circuit breaker stats
istioctl proxy-config clusters deploy/orchestrator-agent -n team1 | grep research-agent

Root Cause: Connection pool limits too high

Solution:

# Lower connection limits
trafficPolicy:
  connectionPool:
    http:
      http2MaxRequests: 50    # Reduced from 100

Issue: Retries Causing Duplicate Requests

Symptoms: AI agent processes same request multiple times

Diagnosis:

# Check retry metrics
kubectl exec -n observability deploy/prometheus -- \
  promtool query instant 'http://localhost:9090' \
  'istio_requests_total{response_flags=~".*R.*"}'

Root Cause: Non-idempotent operations retried on transient errors

Solution:

# Restrict retryOn to failures that occur before the request reaches
# the application, so non-idempotent operations are never replayed
retries:
  attempts: 3
  retryOn: connect-failure,refused-stream
  retryRemoteLocalities: false    # Keep retries within the local zone

Or implement idempotency keys in agent:

from flask import Flask, request, jsonify

app = Flask(__name__)
_seen = {}  # idempotency key -> cached response (in-memory sketch)

@app.route('/a2a/task', methods=['POST'])
def create_task():
    key = request.headers.get('X-Idempotency-Key')
    if key and key in _seen:
        return _seen[key]                  # duplicate retry: replay cached response
    result = jsonify(status="accepted")    # ... process task ...
    if key:
        _seen[key] = result
    return result

Best Practices

1. Use Timeouts for All External Calls

✅ Good: Set timeout for agent-to-agent calls

http:
- route:
  - destination:
      host: research-agent.team1.svc.cluster.local
  timeout: 30s    # LLM inference can be slow

❌ Avoid: No timeout (blocks forever on hang)

http:
- route:
  - destination:
      host: research-agent.team1.svc.cluster.local
  # No timeout configured

Why: Prevents cascading failures when downstream service hangs


2. Implement Circuit Breaking for All Services

✅ Good: Circuit breaker with outlier detection

trafficPolicy:
  connectionPool:
    http:
      http2MaxRequests: 100
  outlierDetection:
    consecutiveErrors: 5
    baseEjectionTime: 30s

❌ Avoid: No circuit breaker

# No traffic policy configured

Why: Protects backend from overload and prevents resource exhaustion


3. Use Canary Deployments for Agent Updates

✅ Good: Gradual traffic shift with monitoring

http:
- route:
  - destination:
      host: research-agent.team1.svc.cluster.local
      subset: v1
    weight: 90
  - destination:
      host: research-agent.team1.svc.cluster.local
      subset: v2
    weight: 10    # Start with 10%

❌ Avoid: Big bang deployment (100% at once)

kubectl set image deploy/research-agent agent=research-agent:v2
# All traffic immediately goes to v2

Why: Limits blast radius if v2 has issues


4. Test Retries with Fault Injection

✅ Good: Test retry logic before production

# In test environment
fault:
  abort:
    percentage:
      value: 20.0
    httpStatus: 503

❌ Avoid: Assume retries work without testing

retries:
  attempts: 3
# Never tested with real failures

Why: Discovers retry configuration issues before production


Alternatives

Alternative 1: Linkerd

Process:

  • Install Linkerd control plane
  • Annotate pods for injection
  • Use ServiceProfile for retries/timeouts

Pros:

  • ✅ Simpler than Istio (smaller resource footprint)
  • ✅ Automatic retries for all HTTP requests
  • ✅ Great observability out-of-box

Cons:

  • ❌ Less feature-rich than Istio
  • ❌ No Gateway API support (uses Ingress)
  • ❌ Smaller ecosystem

When to Use: Resource-constrained environments, need simplicity

Source: Linkerd Traffic Management


Alternative 2: Consul Service Mesh

Process:

  • Deploy Consul agents
  • Configure service intentions
  • Use L7 traffic management

Pros:

  • ✅ Strong service discovery
  • ✅ Multi-datacenter support
  • ✅ Integrated with HashiCorp stack

Cons:

  • ❌ Requires Consul infrastructure
  • ❌ Less Kubernetes-native than Istio
  • ❌ Steeper learning curve

When to Use: Already using HashiCorp stack, multi-DC requirements

Source: Consul Service Mesh


Alternative 3: Application-Level Libraries (Resilience4j)

Process:

  • Add Resilience4j dependency to agent code
  • Configure circuit breaker, retry in code
  • No service mesh required

Pros:

  • ✅ No infrastructure overhead
  • ✅ Fine-grained control
  • ✅ Language-specific optimizations

Cons:

  • ❌ Must implement in every microservice
  • ❌ No unified observability
  • ❌ Requires code changes for updates

When to Use: Single-language environment, minimal infrastructure

Source: Resilience4j


Next Steps

Recommended Reading

  1. Istio Traffic Management Concepts - Deep dive into Istio routing
  2. Gateway API Guide - HTTPRoute configuration
  3. Istio Service Mesh Guide - Istio installation and basics

Suggested Improvements

  1. Implement Gradual Rollouts

    • Benefit: Safer deployments with automated rollback
    • Effort: 2-3 days (setup Flagger or Argo Rollouts)
    • Priority: High
  2. Add Traffic Mirroring for Testing

    • Benefit: Test new agent versions with real traffic
    • Effort: 1 day (configure VirtualService)
    • Priority: Medium
  3. Implement Service-Level Objectives (SLOs)

    • Benefit: Quantify service reliability
    • Effort: 3-5 days (define SLIs, implement monitoring)
    • Priority: High

Related Projects

  • Flagger: Automated canary deployments with metrics analysis
  • Argo Rollouts: Progressive delivery for Kubernetes

Version Information

  • Istio Version: 1.20+
  • Gateway API Version: v1
  • Kubernetes Version: 1.28+

Last Updated: 2025-11-13 · Document Version: 1.0 · Maintained By: Platform Engineering Team · License: Apache 2.0