# Operational Runbook
Common operational procedures and troubleshooting for Constellation Engine.
## Health Checks
### Verify Server is Running
```bash
curl http://localhost:8080/health
# Expected: {"status":"ok"}
```
### Check Readiness
```bash
curl http://localhost:8080/health/ready
# Expected: {"ready":true}
# During shutdown: {"ready":false,"reason":"System is draining"}
```
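For scripted checks (for example, in a deploy pipeline), a small polling loop can wait until the instance reports ready. A minimal sketch, assuming `curl` and `jq` are available on the host:

```bash
# Poll /health/ready every 2s until it reports {"ready":true}
until curl -sf http://localhost:8080/health/ready | jq -e '.ready == true' > /dev/null; do
  echo "waiting for readiness..."
  sleep 2
done
echo "instance is ready"
```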
### Detailed Diagnostics
```bash
curl -H "Authorization: Bearer YOUR_API_KEY" http://localhost:8080/health/detail
```
Returns scheduler stats, lifecycle state, and custom check results.
Requires `enableDetailEndpoint = true` in `HealthCheckConfig`.
## Common Issues
### High Compilation Latency
Symptoms: `/compile` or `/run` requests are slow.
Always check the cache hit rate first. A hit rate below 0.5 indicates that most requests are triggering full compilation, which is the most expensive operation.
Check cache hit rate:
```bash
curl http://localhost:8080/metrics | jq .cache
```
- `hitRate` < 0.5 — sources are changing frequently or the cache is too small
- `hitRate` > 0.8 — the cache is working; latency may be from large pipelines
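To see how the hit rate trends while you investigate, poll it periodically. A minimal sketch, assuming the metrics payload exposes the value at `.cache.hitRate` and that `watch` and `jq` are installed:

```bash
# Print the cache hit rate every 30 seconds
watch -n 30 "curl -s http://localhost:8080/metrics | jq .cache.hitRate"
```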
Actions:
- Ensure clients reuse compiled pipelines via `/execute` instead of recompiling
- Check pipeline complexity (large DAGs take longer)
- Compilation has a 30-second timeout; if it is hit, simplify the pipeline
### Rate Limit Errors (429)
Symptoms: Clients receive 429 Too Many Requests.
If multiple services share the same API key, they share the same rate limit bucket. Consider using separate API keys for each service to get independent rate budgets.
Diagnose:
- Check the `Retry-After` header in the 429 response
- Identify whether it's per-IP or per-API-key limiting
Actions:
- Increase `CONSTELLATION_RATE_LIMIT_RPM` for higher throughput
- Increase `CONSTELLATION_RATE_LIMIT_BURST` for spiky traffic
- Distribute clients across multiple IPs if one IP is a bottleneck
- Use different API keys for different services to get separate rate budgets
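On Kubernetes, the rate-limit variables can be adjusted without editing the manifest. A sketch with illustrative values (pick numbers for your own traffic profile):

```bash
# Raise the per-minute and burst limits; pods roll to pick up the new values
kubectl set env deployment/constellation \
  CONSTELLATION_RATE_LIMIT_RPM=600 \
  CONSTELLATION_RATE_LIMIT_BURST=100 \
  -n constellation
```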
### Request Body Too Large (413)
Symptoms: 413 PayloadTooLarge response.
Cause: Request body exceeds the 10MB limit on `/compile`, `/execute`, or `/run`.
Actions:
- Reduce source code size (split large programs)
- Reduce input data size
- The 10MB limit is a server-side safety measure and is not configurable via environment variable
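Before digging further, confirm the payload really exceeds the limit. A quick check against a hypothetical `request.json` (10MB = 10,485,760 bytes):

```bash
# Print the payload size in bytes; anything above 10485760 is rejected with 413
wc -c < request.json
```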
### Scheduler Queue Full
Symptoms: Tasks are rejected or queued for a long time.
Check scheduler stats:
```bash
curl http://localhost:8080/metrics | jq .scheduler
```
- `queuedCount` consistently high — more work than the scheduler can handle
- `starvationPromotions` increasing — low-priority tasks waiting too long
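A focused view of the two values above, assuming they are exposed under `.scheduler` in the metrics payload:

```bash
# Pull just the queue depth and starvation promotions from the metrics endpoint
curl -s http://localhost:8080/metrics | \
  jq '{queuedCount: .scheduler.queuedCount, starvationPromotions: .scheduler.starvationPromotions}'
```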
Actions:
- Increase `CONSTELLATION_SCHEDULER_MAX_CONCURRENCY`
- Scale out (add more instances)
- Reduce task concurrency from clients
- Increase `CONSTELLATION_SCHEDULER_STARVATION_TIMEOUT` if low-priority delays are acceptable
### Memory Issues
Symptoms: `OutOfMemoryError`, slow GC pauses, process killed by the OOM killer.
If the container is killed by the OOM killer without logging an `OutOfMemoryError`, Kubernetes resource limits may be too low. Check `kubectl describe pod` for an `OOMKilled` exit reason.
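To confirm the container was OOM-killed rather than failing inside the JVM:

```bash
# Look for an OOMKilled termination reason in the pod's last state
kubectl describe pod constellation-xyz123 -n constellation | grep -A 5 "Last State"
```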
Actions:
- Increase `JAVA_OPTS` heap size: `-Xmx2g` or higher
- Check compilation cache size (entries in `/metrics`)
- Check for large programs with many modules
- Enable G1GC for better latency: `-XX:+UseG1GC`
- Monitor heap usage with JMX or `jstat`
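One way to watch heap and GC behavior on a running pod, assuming `jstat` is present in the container image and the JVM runs as PID 1:

```bash
# Print GC utilization every 5 seconds (Ctrl-C to stop)
kubectl exec -it constellation-xyz123 -n constellation -- jstat -gcutil 1 5000
```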
### Authentication Failures (401/403)
Symptoms: Clients receive 401 Unauthorized or 403 Forbidden.
Diagnose:
- `401` — missing or malformed `Authorization: Bearer <key>` header
- `403` — key is valid but the role doesn't permit the HTTP method
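A quick way to see which failure you are hitting is to print only the status code of an authenticated request, for example against the detail endpoint:

```bash
# 200 = key and role are fine, 401 = missing/bad key, 403 = role lacks permission
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer YOUR_API_KEY" \
  http://localhost:8080/health/detail
```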
Actions:
- Verify the API key format: `Authorization: Bearer <key>`
- Check role permissions (ReadOnly can't POST, Execute can't DELETE)
- Verify the key was loaded at startup (check logs for "Loaded N API key(s)")
### WebSocket (LSP) Issues
Symptoms: LSP connection drops, messages lost, high latency.
Possible causes:
- Bounded queue full (100 messages) — client or server can't keep up
- Network interruption
- Proxy/load balancer closing idle WebSocket connections
Actions:
- Check server logs for "queue full, message dropped" warnings
- Configure proxy WebSocket timeout (increase from default)
- Ensure load balancer supports WebSocket upgrades
- Client should implement reconnection logic
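As a first diagnostic step, search the logs of all replicas for the queue-full warning mentioned above:

```bash
# Count dropped LSP messages across all pods
kubectl logs -l app=constellation -n constellation --tail=1000 | grep -ci "queue full"
```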
## Operational Procedures
### Rolling Restart (Kubernetes)
```bash
kubectl rollout restart deployment constellation -n constellation
```
This performs a zero-downtime restart:
- New pods start and become ready
- Old pods are drained (readiness probe returns 503)
- Old pods shut down after grace period
### Scale Up/Down
```bash
# Scale to 4 replicas
kubectl scale deployment constellation --replicas=4 -n constellation

# Check rollout status
kubectl rollout status deployment constellation -n constellation
```
### View Logs
```bash
# All pods
kubectl logs -l app=constellation -n constellation --tail=100

# Specific pod
kubectl logs constellation-xyz123 -n constellation -f
```
### Check Resource Usage
```bash
kubectl top pods -l app=constellation -n constellation
```
### Emergency: Force Restart
Force-deleting a pod immediately terminates all in-flight requests without graceful shutdown. Only use this when the pod is completely unresponsive and liveness probes are not triggering a restart.
If a pod is unresponsive and liveness probe hasn't killed it yet:
```bash
kubectl delete pod constellation-xyz123 -n constellation
```
The deployment controller will create a replacement.
## Performance Tuning
### Compilation Cache
The compilation cache avoids redundant parsing and type checking. Key metrics:
- `hitRate` > 0.8 is healthy for production workloads
- Cache is per-instance (not shared across replicas)
- Cache evicts on TTL expiry or when entries exceed limits
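Because the cache is per-instance, a healthy fleet-wide average can hide one cold replica. A sketch for checking each pod individually, assuming `curl` is available inside the container image and the hit rate is exposed at `.cache.hitRate`:

```bash
# Print the cache hit rate of every replica
for pod in $(kubectl get pods -l app=constellation -n constellation -o name); do
  rate=$(kubectl exec "$pod" -n constellation -- curl -s http://localhost:8080/metrics | jq .cache.hitRate)
  echo "$pod hitRate=$rate"
done
```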
### Scheduler Concurrency
For CPU-bound pipelines:
- Set `MAX_CONCURRENCY` to the number of CPU cores
- Leave burst room for I/O-bound tasks
For I/O-bound pipelines (external API calls):
- Set `MAX_CONCURRENCY` higher (2-4x CPU cores)
- Monitor queue depth to avoid excessive queuing
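A starting point for sizing, using the full variable name from the scheduler section; the 2x multiplier below is illustrative, not a recommendation:

```bash
# CPU-bound: one slot per core; I/O-bound: 2-4x cores (2x shown as an example)
cores=$(nproc)
export CONSTELLATION_SCHEDULER_MAX_CONCURRENCY=$(( cores * 2 ))
echo "scheduler concurrency set to $CONSTELLATION_SCHEDULER_MAX_CONCURRENCY"
```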
### JVM Tuning
```bash
JAVA_OPTS="-Xms512m -Xmx2g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```
- Heap sizing: Start with 1-2GB, increase if cache misses cause frequent compilation
- GC: G1GC recommended for production (good latency/throughput balance)
- GC pause target: 200ms is a reasonable starting point
## Alerting Thresholds
Start with these thresholds and adjust based on your workload. Pipeline execution times vary significantly: a 30s P99 latency may be normal for complex data processing pipelines but critical for real-time APIs.
| Metric | Warning | Critical |
|---|---|---|
| `health/ready` | N/A | Returns `false` |
| `health/live` | N/A | No response / 5xx |
| `cache.hitRate` | < 0.5 | < 0.2 |
| `scheduler.queuedCount` | > 50% of `maxConcurrency` | > 100% of `maxConcurrency` |
| Response latency (P99) | > 5s | > 30s |
| Error rate (5xx) | > 1% | > 5% |
| Memory usage | > 80% of limit | > 95% of limit |
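If a metrics pipeline isn't wired up yet, even a simple script against `/metrics` can cover the warning thresholds. A minimal sketch for the cache hit rate check, assuming the value is exposed at `.cache.hitRate`:

```bash
# Print a warning when the cache hit rate drops below the 0.5 warning threshold
rate=$(curl -s http://localhost:8080/metrics | jq .cache.hitRate)
awk -v r="$rate" 'BEGIN { exit !(r < 0.5) }' && echo "WARNING: cache.hitRate=$rate is below 0.5"
```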
## Next Steps
- Configuration — Environment variables and server settings
- Deployment — Kubernetes deployment and scaling
- Performance Tuning — Scheduler, caching, and JVM optimization
- Graceful Shutdown — Drain behavior during restarts
- JSON Logging — Structured logging for easier debugging