Monitoring and Observability

CoW Protocol Services provide comprehensive monitoring through several integrated systems.

Core Components

  • Prometheus metrics exposed on port 9586 by default
  • OpenTelemetry distributed tracing with OTLP/gRPC export
  • Structured JSON logging via the tracing crate
  • Runtime diagnostics including dynamic log filtering
  • jemalloc-based heap profiling capabilities

Metrics and Monitoring

Services automatically expose Prometheus-compatible metrics covering:
  • Auction overhead tracking by component and phase
  • Database query performance metrics
  • HTTP request rates and latencies
  • Service-specific counters (orders, quotes, settlements, solutions)
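Metrics are served in the standard Prometheus exposition format. As a sketch of what a scrape returns, the following uses a hypothetical sample (the actual metric names exposed by the services may differ); against a live service you would fetch `http://<host>:9586/metrics` instead:

```shell
# Hypothetical exposition-format sample; real metric names will differ.
cat <<'EOF' > /tmp/sample_metrics.txt
# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/v1/orders"} 1027
http_requests_total{method="POST",path="/api/v1/orders"} 341
EOF

# Drop the HELP/TYPE comment lines, leaving only the samples.
grep -v '^#' /tmp/sample_metrics.txt
```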

Scrape Configuration

The recommended scrape interval is every 15 seconds. Example Prometheus configuration:
scrape_configs:
  - job_name: 'cow-services'
    scrape_interval: 15s
    static_configs:
      - targets: ['orderbook:9586', 'autopilot:9586', 'driver:9586']
Kubernetes ServiceMonitor deployments are also supported.
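For Prometheus Operator setups, a ServiceMonitor equivalent to the scrape config above might look like this sketch; the resource name, label selector, and port name are assumptions that depend on how the services are deployed in your cluster:

```yaml
# Hypothetical ServiceMonitor; adjust selectors and port names to your deployment.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cow-services
spec:
  selector:
    matchLabels:
      app: cow-services
  endpoints:
    - port: metrics   # the named Service port exposing 9586
      interval: 15s
```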

Distributed Tracing

Services support OpenTelemetry tracing with OTLP export via gRPC, allowing trace propagation across service boundaries using W3C TraceContext headers. Configuration requires specifying the collector endpoint and trace level.
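OpenTelemetry SDKs conventionally read the OTLP endpoint from standard environment variables defined by the OpenTelemetry specification. Whether these services read the variables directly or take equivalent CLI flags is deployment-specific, so treat the following as a sketch:

```shell
# Standard OpenTelemetry env vars (per the OTel spec); the collector
# hostname here is an assumption. Port 4317 is the conventional OTLP/gRPC port.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
echo "$OTEL_EXPORTER_OTLP_ENDPOINT"
```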

Logging and Log Management

Structured JSON Logging

The system uses structured JSON logging with fields including:
  • Timestamp
  • Level
  • Target module
  • Message
  • Integrated trace IDs
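As an illustration of those fields, here is a hypothetical log line in the shape the `tracing` JSON formatter typically emits; the exact field names and nesting in the real output may differ:

```shell
# Hypothetical structured log line; field names are illustrative only.
log='{"timestamp":"2024-01-01T00:00:00Z","level":"INFO","target":"orderbook::api","fields":{"message":"order created"},"span":{"trace_id":"abc123"}}'

# Pull out the level field with grep (jq would be cleaner if available).
echo "$log" | grep -o '"level":"[A-Z]*"'
```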

Log Levels

Log levels range from TRACE (most verbose) to ERROR:
Level    Usage
TRACE    Detailed debugging information
DEBUG    Development and troubleshooting
INFO     Normal operational messages
WARN     Potential issues
ERROR    Failures requiring attention
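Filter directives typically follow the `tracing-subscriber` EnvFilter syntax: a default level plus per-module overrides. Assuming the services honor the conventional `RUST_LOG` environment variable (an assumption; check your deployment), a filter might look like:

```shell
# EnvFilter syntax: default level, then per-module overrides.
# The env var name (RUST_LOG) is an assumption about this deployment.
export RUST_LOG="info,orderbook=debug,sqlx=warn"
echo "$RUST_LOG"
```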

Runtime Log Filter Adjustment

Log filters can be adjusted at runtime through a UNIX socket, without restarting the service:
echo "orderbook=debug" | socat - UNIX-CONNECT:/tmp/log_filter_override_orderbook_12345.sock

Advanced Diagnostics

Heap Profiling

Services generate pprof-format heap dumps via UNIX sockets when jemalloc profiling is enabled. Analysis uses the Google pprof tool to identify memory allocation hotspots.
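Assuming a dump has already been written (the binary path and dump filename below are hypothetical), analysis follows the usual pprof workflow:

```shell
# Hypothetical paths; actual dump locations depend on service configuration.
pprof -top ./orderbook /tmp/heap_dump.pprof

# Or explore the profile interactively in a browser:
pprof -http=:8080 /tmp/heap_dump.pprof
```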

tokio-console

Available exclusively in playground environments for async task inspection. Not recommended for production due to significant memory overhead.

Health Checks and Alerting

Health Check Endpoints

Three endpoints support Kubernetes orchestration:
Endpoint    Purpose
/liveness   Service is running
/ready      Service can accept traffic
/startup    Service has completed initialization
Alerting

Set up alerts for:
  • Service availability: Alert if any service is down for more than 2 minutes
  • Error rates: Alert if error rate exceeds baseline
  • Connection pool exhaustion: Alert when pool usage exceeds 80%
  • Query latency: Alert when database queries exceed acceptable thresholds
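The availability alert above can be expressed as a Prometheus alerting rule. The `job` label matches the scrape configuration shown earlier; the rule group name and severity label are assumptions:

```yaml
# Sketch of a Prometheus alerting rule for the 2-minute availability check.
groups:
  - name: cow-services
    rules:
      - alert: ServiceDown
        expr: up{job="cow-services"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 2 minutes"
```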
Last modified on March 4, 2026