Per-attempt evidence
Record status code, latency, error category, response context, attempt number, and the final delivery state.
Webhook monitoring
A webhook system is not reliable merely because requests usually return 200. You need enough delivery evidence to detect queue lag, explain failures, control retries, and resolve incidents without searching raw infrastructure logs.
{
  "jobId": "job_123",
  "targetHost": "api.example.com",
  "scheduledFor": "2026-06-11T14:00:00.000Z",
  "attempt": 2,
  "statusCode": 503,
  "durationMs": 842,
  "retryScheduledFor": "2026-06-11T14:02:00.000Z",
  "finalState": "retrying"
}
Measure overdue scheduled jobs and stale processing locks, not only HTTP success rates after dispatch.
Alert on sustained failure patterns and delivery lag with enough context to identify the affected workflow.
Webhook monitoring starts before the HTTP request. Track when the job was created, when it should run, when a worker claimed it, when each attempt started, and when the job reached a final state.
This timeline separates receiver failures from scheduler failures. A 500 response usually points to the receiving endpoint. A job that remains pending after its scheduled time points to queue, worker, credential, or deployment trouble.
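The distinction between scheduler-side and receiver-side failures can be sketched as a small classifier over the job timeline. This is a minimal illustration, not Webhook Scheduler's implementation: the function name, the returned labels, and both thresholds are assumptions, and the `processing` state name is inferred from the stale processing locks mentioned above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical thresholds; tune them to your own scheduling guarantees.
OVERDUE_GRACE = timedelta(minutes=1)
STALE_LOCK_AFTER = timedelta(minutes=5)


def classify_job(
    state: str,
    scheduled_for: datetime,
    claimed_at: Optional[datetime],
    last_status_code: Optional[int],
    now: datetime,
) -> str:
    """Return a coarse diagnosis for a single job at time `now`."""
    if state == "pending" and now - scheduled_for > OVERDUE_GRACE:
        # Never claimed after its scheduled time: check queue, workers,
        # credentials, and recent deployments.
        return "scheduler:overdue"
    if (
        state == "processing"
        and claimed_at is not None
        and now - claimed_at > STALE_LOCK_AFTER
    ):
        # Claimed but never finished: the worker may have died holding the lock.
        return "scheduler:stale_lock"
    if last_status_code is not None and last_status_code >= 500:
        return "receiver:server_error"
    if last_status_code is not None and last_status_code >= 400:
        return "receiver:client_error"
    return "ok"


now = datetime(2026, 6, 11, 14, 10, tzinfo=timezone.utc)
scheduled = datetime(2026, 6, 11, 14, 0, tzinfo=timezone.utc)
print(classify_job("pending", scheduled, None, None, now))
```

The scheduler checks run first on purpose: a job that was never dispatched has no status code, so an HTTP-only view would miss it.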
Store a stable job ID, target host, HTTP method, scheduled time, attempt number, status code, duration, error category, response snippet, retry decision, and final state. Keep timestamps in UTC and make the job ID searchable across application and provider logs.
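One way to keep these fields consistent is a typed record that serializes to a single log line. The sketch below reuses the key names from the sample record where they exist; `method`, `errorCategory`, and `responseSnippet` are assumed names for fields the text lists but the sample does not show, and `utc_iso` and `to_log_line` are hypothetical helpers.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


def utc_iso(ts: datetime) -> str:
    """Format an aware datetime as UTC with millisecond precision and a Z suffix."""
    if ts.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before logging")
    utc = ts.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


@dataclass(frozen=True)
class AttemptRecord:
    # camelCase keys mirror the sample JSON record rather than Python convention.
    jobId: str
    targetHost: str
    method: str
    scheduledFor: str
    attempt: int
    statusCode: Optional[int]
    durationMs: int
    errorCategory: Optional[str]
    responseSnippet: str
    retryScheduledFor: Optional[str]
    finalState: str


def to_log_line(record: AttemptRecord) -> str:
    """Serialize with sorted keys so log lines diff cleanly."""
    return json.dumps(asdict(record), sort_keys=True)


# A +02:00 local time is normalized to UTC before it is stored.
local = datetime(2026, 6, 11, 16, 0, tzinfo=timezone(timedelta(hours=2)))
record = AttemptRecord(
    jobId="job_123", targetHost="api.example.com", method="POST",
    scheduledFor=utc_iso(local), attempt=2, statusCode=503, durationMs=842,
    errorCategory="server_error", responseSnippet="Service Unavailable",
    retryScheduledFor=utc_iso(local + timedelta(minutes=2)), finalState="retrying",
)
print(to_log_line(record))
```

Rejecting naive datetimes at the formatting boundary is a deliberate choice here: a timestamp without a timezone cannot be safely compared across application and provider logs.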
Avoid logging secrets, authorization headers, full webhook URLs containing tokens, or unrestricted response bodies. Redact sensitive values and cap retained response data. The Webhook Scheduler security model describes the controls used for outbound delivery.
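The redaction rules above might look like the following before a record is written. This is a sketch under assumptions: the header list is a starting point rather than a complete inventory, the 512-character cap is arbitrary, and all three helper names are hypothetical.

```python
from urllib.parse import urlsplit

# Assumed starting list; extend it with any custom auth headers you send.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "cookie", "x-api-key"}
MAX_SNIPPET_CHARS = 512  # hypothetical cap on retained response data


def redact_headers(headers: dict) -> dict:
    """Replace the values of sensitive headers, matching names case-insensitively."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def loggable_target(url: str) -> str:
    """Keep scheme and host only; userinfo, path, and query may carry tokens."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.hostname or 'unknown'}"


def cap_snippet(body: str, limit: int = MAX_SNIPPET_CHARS) -> str:
    """Truncate a response body to at most `limit` characters plus a marker."""
    return body if len(body) <= limit else body[:limit] + "...[truncated]"


print(loggable_target("https://user:pw@api.example.com/hooks/abc?token=secret"))
print(redact_headers({"Authorization": "Bearer abc", "Content-Type": "application/json"}))
```

Logging only the scheme and host trades some debugging detail for safety. If paths are needed for diagnosis, an allowlist of known non-secret path prefixes is a safer extension than logging the full URL.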
Useful service metrics include delivery success rate, final failure rate, attempts per job, p50 and p95 latency, retry volume, queue lag, stale processing jobs, and time to final state. Break them down by target host and status family when diagnosing an incident.
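Several of these metrics can be derived directly from per-attempt records. The sketch below assumes each record carries `jobId`, `statusCode`, `durationMs`, and `finalState` as in the sample record; the `delivered` and `failed` state names and the `no_response` family are assumptions, and `summarize` is a hypothetical helper. Percentiles use Python's standard `statistics.quantiles`, which needs at least two samples.

```python
from collections import Counter
from statistics import quantiles


def status_family(code):
    """Map 503 to '5xx'; a missing code means no HTTP response was received."""
    return "no_response" if code is None else f"{code // 100}xx"


def latency_percentiles(durations_ms):
    """Return (p50, p95) from 99 percentile cut points."""
    cuts = quantiles(durations_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def summarize(attempts):
    """Aggregate a list of per-attempt dicts into service-level metrics."""
    jobs = {a["jobId"] for a in attempts}
    delivered = {a["jobId"] for a in attempts if a["finalState"] == "delivered"}
    failed = {a["jobId"] for a in attempts if a["finalState"] == "failed"}
    p50, p95 = latency_percentiles([a["durationMs"] for a in attempts])
    return {
        "deliverySuccessRate": len(delivered) / len(jobs),
        "finalFailureRate": len(failed) / len(jobs),
        "attemptsPerJob": len(attempts) / len(jobs),
        "retryVolume": sum(a["finalState"] == "retrying" for a in attempts),
        "p50Ms": p50,
        "p95Ms": p95,
        "statusFamilies": Counter(status_family(a["statusCode"]) for a in attempts),
    }


attempts = [
    {"jobId": "job_1", "statusCode": 503, "durationMs": 800, "finalState": "retrying"},
    {"jobId": "job_1", "statusCode": 200, "durationMs": 120, "finalState": "delivered"},
    {"jobId": "job_2", "statusCode": 200, "durationMs": 100, "finalState": "delivered"},
    {"jobId": "job_3", "statusCode": 500, "durationMs": 900, "finalState": "failed"},
]
summary = summarize(attempts)
print(summary["attemptsPerJob"], dict(summary["statusFamilies"]))
```

To break the numbers down by target host, group the attempt records by host first and call `summarize` on each group.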
A global success rate can hide one broken customer endpoint. Conversely, one noisy endpoint should not make the whole platform appear unavailable. Use both system-wide and workflow-level views.
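A small worked example shows how a healthy global figure can mask a fully broken destination. The helper name and the input shape, one `(target_host, delivered)` pair per job that reached a final state, are assumptions made for illustration.

```python
from collections import defaultdict


def success_rates(final_jobs):
    """Return (overall_rate, per_host_rates) for jobs that reached a final state."""
    totals = defaultdict(lambda: [0, 0])  # host -> [delivered, total]
    for host, delivered in final_jobs:
        totals[host][0] += int(delivered)
        totals[host][1] += 1
    per_host = {host: d / t for host, (d, t) in totals.items()}
    all_delivered = sum(d for d, _ in totals.values())
    all_total = sum(t for _, t in totals.values())
    overall = all_delivered / all_total if all_total else None
    return overall, per_host


# 990 healthy deliveries to one host, 10 failures to another.
jobs = [("a.example.com", True)] * 990 + [("b.example.com", False)] * 10
overall, per_host = success_rates(jobs)
print(overall)                    # 0.99
print(per_host["b.example.com"])  # 0.0
```

The system-wide view reports 99% success while every delivery to the second host fails, which is why both views are needed.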
Page on conditions that threaten scheduled delivery: persistent queue lag, workers unable to claim jobs, database or queue unavailability, a sharp rise in final failures, or retry scheduling failures. Use warning-level notifications for a single endpoint returning errors or a short-lived retry spike.
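The page-versus-warning split can be expressed as a single routing function. Everything numeric below is a placeholder: the lag, duration, and failure-rate thresholds are assumptions to replace with values from your own baselines, and the `Signals` fields are hypothetical names for the conditions described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Signals:
    queue_lag_seconds: float
    lag_sustained_minutes: float
    workers_claiming: bool
    store_available: bool          # database or queue reachable
    final_failure_rate: float
    baseline_final_failure_rate: float
    retry_scheduling_errors: int
    failing_hosts: int             # endpoints currently returning errors
    retry_spike: bool              # short-lived rise in retries


def severity(s: Signals) -> str:
    """Return 'page', 'warning', or 'none' for the current signals."""
    platform_at_risk = (
        (s.queue_lag_seconds > 120 and s.lag_sustained_minutes >= 5)
        or not s.workers_claiming
        or not s.store_available
        or s.final_failure_rate > max(0.05, 3 * s.baseline_final_failure_rate)
        or s.retry_scheduling_errors > 0
    )
    if platform_at_risk:
        return "page"
    if s.failing_hosts >= 1 or s.retry_spike:
        return "warning"
    return "none"


healthy = Signals(
    queue_lag_seconds=3, lag_sustained_minutes=0, workers_claiming=True,
    store_available=True, final_failure_rate=0.01,
    baseline_final_failure_rate=0.01, retry_scheduling_errors=0,
    failing_hosts=0, retry_spike=False,
)
print(severity(healthy))                            # none
print(severity(replace(healthy, failing_hosts=1)))  # warning
```

Requiring lag to be both high and sustained keeps a brief burst of scheduled jobs from paging anyone.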
Every alert should include the affected component, first observed time, current count, threshold, recent deployment context, and a link to the relevant job or dashboard view. Alerts without a next debugging step become background noise.
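A lightweight check can reject alerts that lack this context before they are sent. The field names below are assumptions chosen to mirror the list above, and `missing_alert_fields` is a hypothetical helper, not part of any alerting product.

```python
# Assumed field names, one per item of context an alert should carry.
REQUIRED_ALERT_FIELDS = (
    "component",
    "firstObservedAt",
    "currentCount",
    "threshold",
    "recentDeployment",
    "link",
)


def missing_alert_fields(alert: dict) -> list:
    """Return required fields that are absent or empty; a count of 0 is valid."""
    return [
        field
        for field in REQUIRED_ALERT_FIELDS
        if alert.get(field) is None or alert.get(field) == ""
    ]


alert = {
    "component": "retry-scheduler",
    "firstObservedAt": "2026-06-11T14:02:00.000Z",
    "currentCount": 42,
    "threshold": 10,
    "recentDeployment": "",
}
print(missing_alert_fields(alert))  # ['recentDeployment', 'link']
```

Running the check in the alert-definition tests, rather than at send time, means an incomplete alert is caught at review and never silently dropped during an incident.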
Start by checking queue lag and worker health. Then inspect a representative failed job, compare its attempts, and determine whether failures are global or isolated to one destination. If the endpoint is consistently failing, pause manual retries until the receiver is healthy.
Webhook Scheduler keeps scheduled jobs, attempts, status codes, response context, and recovery actions together in the webhook operations dashboard. For retry design, see the webhook retry logic guide.
Include job and attempt IDs, target host, method, schedule and attempt timestamps, status code, latency, error category, retry decision, and final state without storing secrets.
Alert on sustained queue lag, stale processing jobs, queue or database unavailability, retry scheduling failures, and meaningful increases in final delivery failures.
Success rate alone is not enough. It does not reveal jobs that never left the queue, stale workers, excessive retries, or one destination failing inside a healthy global average.