Webhook monitoring: what to log, measure, and alert on.

A webhook system is not reliable merely because requests usually return 200. You need enough delivery evidence to detect queue lag, explain failures, control retries, and resolve incidents without searching raw infrastructure logs.

// Illustrative monitoring event — not the schedule API response
{
  "id": "job_123",
  "targetHost": "api.example.com",
  "scheduledFor": "2026-06-11T14:00:00.000Z",
  "attempt": 2,
  "statusCode": 503,
  "durationMs": 842,
  "retryScheduledFor": "2026-06-11T14:02:00.000Z",
  "status": "RETRYING"
}

Per-attempt evidence

Record status code, latency, error category, response context, attempt number, and the final delivery state.

Queue health

Measure overdue scheduled jobs and stale processing locks, not only HTTP success rates after dispatch.

Actionable alerts

Alert on sustained failure patterns and delivery lag with enough context to identify the affected workflow.

Monitor the full delivery lifecycle

Webhook monitoring starts before the HTTP request. Track when the job was created, when it should run, when a worker claimed it, when each attempt started, and when the job reached a final state.

This timeline separates receiver failures from scheduler failures. A 500 response usually points to a problem at the receiving endpoint. A job that remains pending after its scheduled time points to queue, worker, credential, or deployment trouble on the sending side.
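The distinction can be expressed as a small check over a job's timeline. The sketch below is illustrative only: the function name, the "PENDING" status string, and the 60-second grace period are assumptions, not part of any Webhook Scheduler API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_job(
    status: str,
    scheduled_for: datetime,
    now: datetime,
    last_status_code: Optional[int] = None,
    grace: timedelta = timedelta(seconds=60),  # assumed tolerance
) -> str:
    """Return 'scheduler', 'receiver', or 'ok' for a single job."""
    # Still pending after its scheduled time: the job never left the queue.
    if status == "PENDING" and now > scheduled_for + grace:
        return "scheduler"
    # A request was sent and the receiver answered with a server error.
    if last_status_code is not None and last_status_code >= 500:
        return "receiver"
    return "ok"


scheduled = datetime(2026, 6, 11, 14, 0, tzinfo=timezone.utc)
five_minutes_later = scheduled + timedelta(minutes=5)
overdue = classify_job("PENDING", scheduled, five_minutes_later)
failing = classify_job("RETRYING", scheduled, five_minutes_later, 503)
```

A real classifier would likely also consider timeouts, DNS errors, and 4xx responses. This version only separates the two cases described above.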

What to log for every attempt

Store a stable job ID, target host, HTTP method, scheduled time, attempt number, status code, duration, error category, response snippet, retry decision, and final state. Keep timestamps in UTC and make the job ID searchable across application and provider logs.
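One possible shape for such a record is sketched below as a Python dataclass that serializes to a single JSON log line. The class name, field names, and the "upstream_5xx" error category are hypothetical. The values mirror the illustrative monitoring event shown earlier, and datetimes are assumed to be timezone-aware.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AttemptLog:
    job_id: str
    target_host: str
    method: str
    scheduled_for: datetime
    attempt: int
    status_code: Optional[int]      # None when no response was received
    duration_ms: int
    error_category: Optional[str]
    response_snippet: str           # already redacted and length-capped
    retry_scheduled_for: Optional[datetime]
    status: str

    def to_json(self) -> str:
        """Serialize to one JSON line with all timestamps in UTC."""
        record = asdict(self)
        for key, value in record.items():
            if isinstance(value, datetime):
                record[key] = value.astimezone(timezone.utc).isoformat()
        return json.dumps(record, sort_keys=True)


log = AttemptLog(
    job_id="job_123",
    target_host="api.example.com",
    method="POST",
    scheduled_for=datetime(2026, 6, 11, 14, 0, tzinfo=timezone.utc),
    attempt=2,
    status_code=503,
    duration_ms=842,
    error_category="upstream_5xx",
    response_snippet="Service Unavailable",
    retry_scheduled_for=datetime(2026, 6, 11, 14, 2, tzinfo=timezone.utc),
    status="RETRYING",
)
line = log.to_json()
```

Emitting one structured line per attempt keeps the job ID searchable with ordinary log tooling.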

Avoid logging secrets, authorization headers, full webhook URLs containing tokens, or unrestricted response bodies. Redact sensitive values and cap retained response data. The Webhook Scheduler security model describes the controls used for outbound delivery.
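A minimal redaction sketch follows, assuming a fixed list of sensitive header names and a 512-character cap. Both are placeholders to adjust for your own system, and the helper names are hypothetical.

```python
from typing import Dict
from urllib.parse import urlsplit

# Assumed list; extend it with any custom auth headers you use.
SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}
MAX_SNIPPET_CHARS = 512  # assumed retention cap


def safe_target(url: str) -> str:
    """Keep only scheme and host; drop credentials, path, and query."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.hostname}"


def redact_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Replace the values of sensitive headers, matching names case-insensitively."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def cap_snippet(body: str, limit: int = MAX_SNIPPET_CHARS) -> str:
    """Truncate a response body so logs never hold unbounded data."""
    if len(body) <= limit:
        return body
    return body[:limit] + "...[truncated]"
```

Logging only the host means a token embedded in a webhook path or query string never reaches the log store.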

Metrics that reveal real failures

Useful service metrics include delivery success rate, final failure rate, attempts per job, p50 and p95 latency, retry volume, queue lag, stale processing jobs, and time to final state. Break them down by target host and status family when diagnosing an incident.
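Two of these metrics are easy to get subtly wrong, so a small sketch may help. It assumes latencies are collected as a list of milliseconds and that queue lag means the age of the oldest job past its scheduled time. The function names are hypothetical, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
import math
from datetime import datetime, timezone
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def queue_lag_seconds(pending_scheduled_for: Sequence[datetime], now: datetime) -> float:
    """Age in seconds of the oldest overdue pending job, or 0.0 if none are overdue."""
    overdue = [(now - t).total_seconds() for t in pending_scheduled_for if t < now]
    return max(overdue, default=0.0)


latencies_ms = [100, 200, 300, 400, 842]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

now = datetime(2026, 6, 11, 14, 5, tzinfo=timezone.utc)
pending = [
    datetime(2026, 6, 11, 14, 0, tzinfo=timezone.utc),   # 5 minutes overdue
    datetime(2026, 6, 11, 14, 3, tzinfo=timezone.utc),   # 2 minutes overdue
    datetime(2026, 6, 11, 14, 10, tzinfo=timezone.utc),  # not yet due
]
lag = queue_lag_seconds(pending, now)
```

In production these values would normally come from a metrics backend rather than in-process lists. The point is only to pin down what each number means.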

A global success rate can hide one broken customer endpoint. Conversely, one noisy endpoint should not make the whole platform appear unavailable. Use both system-wide and workflow-level views.
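The effect is easy to reproduce. In the sketch below, which uses made-up hostnames and a hypothetical helper, the global success rate is 98% while one destination fails every delivery.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(final_jobs: Iterable[Tuple[str, bool]]) -> Tuple[float, Dict[str, float]]:
    """Return (overall rate, rate per target host) from (host, delivered) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for host, delivered in final_jobs:
        totals[host] += 1
        if delivered:
            successes[host] += 1
    if not totals:
        raise ValueError("no finished jobs to measure")
    overall = sum(successes.values()) / sum(totals.values())
    per_host = {host: successes[host] / totals[host] for host in totals}
    return overall, per_host


# 98 successful deliveries to one host, 2 failed deliveries to another.
jobs = [("a.example.com", True)] * 98 + [("b.example.com", False)] * 2
overall, per_host = success_rates(jobs)
```

Here `overall` looks healthy, yet `per_host["b.example.com"]` shows a destination that has not received a single delivery.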

Alerts worth waking someone for

Page on conditions that threaten scheduled delivery: persistent queue lag, workers unable to claim jobs, database or queue unavailability, a sharp rise in final failures, or retry scheduling failures. Use warning-level notifications for a single endpoint returning errors or a short-lived retry spike.
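One way to separate a sustained condition from a short-lived spike is to require several consecutive breaching samples before paging. The sketch below is an assumption-laden illustration: the function names, the 120-second threshold, and the three-sample window are placeholders, and it covers only queue lag and per-endpoint errors.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """True if the most recent min_consecutive samples all exceed the threshold."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False
    return all(sample > threshold for sample in samples[-min_consecutive:])


def alert_level(
    lag_samples: Sequence[float],
    lag_threshold_s: float,
    min_consecutive: int,
    failing_hosts: int = 0,
) -> str:
    """Return 'page', 'warn', or 'none' for the current evaluation window."""
    if sustained_breach(lag_samples, lag_threshold_s, min_consecutive):
        return "page"  # persistent queue lag threatens scheduled delivery
    latest_breach = bool(lag_samples) and lag_samples[-1] > lag_threshold_s
    if failing_hosts > 0 or latest_breach:
        return "warn"  # isolated endpoint errors or a short-lived spike
    return "none"
```

Most alerting systems offer an equivalent "for N minutes" clause. The logic is the same whichever tool evaluates it.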

Every alert should include the affected component, first observed time, current count, threshold, recent deployment context, and a link to the relevant job or dashboard view. Alerts without a next debugging step become background noise.
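Those fields can be enforced in code so that an alert cannot be built without a debugging link. The following is a sketch with hypothetical class and field names and a placeholder dashboard URL, not a description of any particular alerting tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Alert:
    component: str
    first_observed: datetime
    current_count: int
    threshold: int
    dashboard_url: str              # link to the relevant job or dashboard view
    recent_deploy: Optional[str] = None

    def __post_init__(self) -> None:
        # Refuse alerts that give the responder nowhere to start.
        if not self.dashboard_url:
            raise ValueError("alert needs a link to a job or dashboard view")

    def summary(self) -> str:
        deploy = self.recent_deploy or "none recorded"
        return (
            f"{self.component}: {self.current_count} (threshold {self.threshold}) "
            f"since {self.first_observed.isoformat()}; "
            f"last deploy: {deploy}; see {self.dashboard_url}"
        )


first_seen = datetime(2026, 6, 11, 14, 0, tzinfo=timezone.utc)
alert = Alert("queue-workers", first_seen, 42, 10, "https://dashboard.example.com/jobs/job_123")
text = alert.summary()
```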

A practical incident workflow

Start by checking queue lag and worker health. Then inspect a representative failed job, compare its attempts, and determine whether failures are global or isolated to one destination. If the endpoint is consistently failing, pause manual retries until the receiver is healthy.
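The "global or isolated" question can be approximated from the target hosts of recent failed attempts. The heuristic below is a rough sketch: the function name and the 80% dominance cutoff are assumptions, and it says nothing useful if the system only delivers to one destination.

```python
from collections import Counter
from typing import Iterable


def failure_scope(failed_hosts: Iterable[str], dominant_share: float = 0.8) -> str:
    """Return 'none', 'global', or 'isolated:<host>' for recent failed attempts."""
    counts = Counter(failed_hosts)
    if not counts:
        return "none"
    host, top = counts.most_common(1)[0]
    # If one destination accounts for most failures, treat the incident as isolated.
    if top / sum(counts.values()) >= dominant_share:
        return f"isolated:{host}"
    return "global"


mostly_one = ["a.example.com"] * 9 + ["b.example.com"]
spread_out = ["a.example.com", "b.example.com", "c.example.com", "d.example.com"]
```

An isolated result suggests contacting the receiver's owner. A global result suggests looking at workers, the queue, or a recent deployment first.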

Webhook Scheduler keeps scheduled jobs, attempts, status codes, response context, and recovery actions together in the webhook operations dashboard. For retry design, see the webhook retry logic guide.

FAQ

What should webhook logs include?

Include job and attempt IDs, target host, method, schedule and attempt timestamps, status code, latency, error category, retry decision, and final state. Do not store secrets in any of these fields.

Which webhook metrics should trigger alerts?

Alert on sustained queue lag, stale processing jobs, queue or database unavailability, retry scheduling failures, and meaningful increases in final delivery failures.

Is HTTP success rate enough for webhook monitoring?

No. Success rate does not reveal jobs that never left the queue, stale workers, excessive retries, or one destination failing inside a healthy global average.

Ship delayed webhooks without maintaining scheduling infrastructure.

Start with the free plan, test a real delivery, then upgrade when the workflow becomes production critical.