Webhook Delivery Failures Troubleshooting Guide

Introduction: What webhook delivery failures look like

A webhook delivery failure happens when a provider sends an event to your endpoint and the request never arrives, returns a non-2xx response, or times out before your server acknowledges it. In dashboards like Stripe’s, this appears as failed attempts, retries, rising latency, or statuses such as pending, failed, and retried. Delivery records usually include a request ID and an HTTP status code, which tell you whether the problem is a 4xx, a 5xx, or a timeout.

These failures matter because webhooks often power billing updates, account changes, order fulfillment, and other customer workflows. When an event is missed, duplicated, or processed with stale data, automation breaks and support issues become hard to trace.

The first question is whether the problem is provider-side or server-side. Provider-side issues usually show up in delivery history before your app sees the request. Server-side issues usually appear in your logs, response codes, or infrastructure alerts.

The troubleshooting order is simple: check delivery logs, inspect the response code, verify the endpoint configuration, then test signatures, payload handling, and infrastructure. For deeper tactics, see the related guides on webhook debugging techniques and webhook endpoint setup.

Understand the webhook delivery flow

A webhook starts when the provider creates an event, queues it for delivery, sends a request to your endpoint URL, waits for your server to process it, then records the attempt. If you’re troubleshooting delivery failures, isolate the breakage at one of three layers: provider-side, network, or application-side.

Failures can happen before your code runs: DNS errors, TLS handshake issues, a reverse proxy or load balancer misroute, or a provider outage. They can also happen inside your app when an exception returns a 5xx response, or when slow processing causes a timeout.

Most providers treat 2xx responses as success and stop retry logic; 4xx responses, 5xx responses, and timeouts usually trigger more delivery attempts. Repeated failures can end in permanent failure or a dead-letter queue. For each attempt, check the request headers, request ID, and raw response details; the related guides on webhook debugging techniques and webhook event handling cover this in more depth.

Check the event delivery status and endpoint response

Start with the provider dashboard or API logs. In Stripe Dashboard, open the event delivery logs, then match the failed event by timestamp or request ID so you’re not chasing the wrong attempt. Compare delivery attempts to see whether the event is pending, successful, retrying, failed, or permanently failed.

Read the HTTP status codes before changing code: 2xx means the webhook was accepted, 4xx usually points to authentication, validation, or request-shape problems, and 5xx indicates a server-side failure. No response or a timeout means your endpoint did not acknowledge in time, even if the handler finished later. Check latency, response headers, and the response body for clues, and watch for slow database calls, blocked threads, or reverse proxy timeouts that can break delivery. Note the delivery timestamp and attempt count as well; the guides on webhook debugging techniques and the webhook QA checklist go deeper.

Verify configuration, authentication, and signature handling

Confirm the endpoint URL, path, and environment first. The provider may be sending webhooks to /webhooks/stripe while your app only listens on /api/webhooks, or the endpoint may still point to development instead of production. Compare the configured URL against your webhook endpoint setup.

A 401 Unauthorized or 403 Forbidden usually means the request reached your server but failed authentication, authorization, or signature verification. Check the signing secret, any bearer token, and whether the endpoint is enabled and subscribed to the right event types.

For HMAC-based signature verification, the provider signs the raw request body with the signing secret and sends the result in request headers. Your code must verify that header against the exact raw request body; parsing JSON first, normalizing whitespace, or mutating fields can break the HMAC check even when the payload is valid.
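The raw-body requirement can be illustrated with a minimal sketch. It assumes a hex-encoded HMAC-SHA256 over the body alone; real providers differ in algorithm, encoding, and header format, and the secret, payload, and function names here are hypothetical.

```python
import hashlib
import hmac
import json


def sign_body(secret: bytes, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw bytes (assumed scheme; providers differ)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Recompute the signature over the exact bytes and compare in constant time."""
    expected = sign_body(secret, raw_body)
    return hmac.compare_digest(expected.encode(), header_value.encode())


# Hypothetical secret and payload, for illustration only.
SECRET = b"example-signing-secret"
raw = b'{"id": "evt_1",  "amount": 100}\n'
header = sign_body(SECRET, raw)

# Parsing and re-serializing changes whitespace, so the bytes no longer match.
reserialized = json.dumps(json.loads(raw)).encode()

print(verify_signature(SECRET, raw, header))           # True
print(verify_signature(SECRET, reserialized, header))  # False
```

The second check fails even though both bodies decode to the same JSON object, which is why the handler should read and verify the body before any JSON middleware touches it.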

Also watch for secret rotation, timestamp tolerance failures, and missing event subscriptions. A rotated signing secret, an expired timestamp window, or an event type you never subscribed to can look like delivery failures, so compare your setup against the webhook QA checklist.
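Rotation and timestamp tolerance can be checked together. The sketch below assumes a hypothetical scheme in which the provider signs `timestamp.body` with HMAC-SHA256 and a 300-second window applies; the real signed-payload format and tolerance come from your provider's documentation.

```python
import hashlib
import hmac
import time
from typing import Iterable, Optional

TOLERANCE_SECONDS = 300  # assumed window; use the provider's documented value


def sign(secret: bytes, timestamp: int, raw_body: bytes) -> str:
    """Sign 'timestamp.body' (hypothetical signed-payload format)."""
    signed = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret, signed, hashlib.sha256).hexdigest()


def verify_with_rotation(
    secrets: Iterable[bytes],
    timestamp: int,
    raw_body: bytes,
    signature: str,
    now: Optional[float] = None,
) -> str:
    """Return 'ok', 'timestamp_outside_tolerance', or 'signature_mismatch'."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return "timestamp_outside_tolerance"
    # Accept the current secret and the previous one while a rotation is in progress.
    for secret in secrets:
        expected = sign(secret, timestamp, raw_body)
        if hmac.compare_digest(expected.encode(), signature.encode()):
            return "ok"
    return "signature_mismatch"
```

Returning a distinct reason for each failure makes the logs show whether a rejected delivery was caused by clock skew, a stale secret, or a genuinely bad signature.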

Inspect payload formatting, schema, and application errors

A 400 Bad Request usually means your server could not parse the JSON payload: malformed JSON, invalid encoding, missing required fields, or a body that arrives empty because middleware consumed the raw request body before your handler read it. A 422 Unprocessable Entity means the request is syntactically valid but fails schema validation or business rules, such as a missing customer_id or an unsupported enum value. A 500 Internal Server Error points to [application exceptions](https://docsfordevs.com/blog/webhook-error-handling-best-practices), dependency failures, or database errors inside your handler.

API versioning and event types also change payload shape. Providers such as Stripe and GitHub can add, rename, or nest fields across versions, so hard-coded parsers break when a handler assumes one fixed structure. Use structured logging to record key fields, store the raw payload only where it is safe to do so, and compare expected versus actual data.
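The split between parse errors and validation errors can be sketched as a small check that returns a status code and a reason. The required fields and event types below are hypothetical, and this sketch treats missing fields as 422; adapt both to your provider's schema and your own conventions.

```python
import json
from typing import Tuple

REQUIRED_FIELDS = ("id", "type", "customer_id")   # hypothetical schema
KNOWN_TYPES = {"invoice.paid", "invoice.failed"}  # hypothetical event types


def check_payload(raw_body: bytes) -> Tuple[int, str]:
    """Return (status_code, reason) for a raw webhook body."""
    if not raw_body:
        return 400, "empty body (was it consumed by middleware?)"
    try:
        event = json.loads(raw_body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return 400, f"unparseable body: {exc}"
    # From here on the body is valid JSON, so failures are validation errors.
    if not isinstance(event, dict):
        return 422, "expected a JSON object"
    missing = [field for field in REQUIRED_FIELDS if field not in event]
    if missing:
        return 422, "missing fields: " + ", ".join(missing)
    event_type = event["type"]
    if not isinstance(event_type, str) or event_type not in KNOWN_TYPES:
        return 422, f"unsupported event type: {event_type!r}"
    return 200, "ok"
```

Logging the returned reason alongside the event ID gives you the expected-versus-actual comparison without having to re-parse the payload later.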

Diagnose network, DNS, TLS, firewall, and WAF problems

If the provider cannot resolve your endpoint URL through DNS, it never reaches your server. Check for wrong A/AAAA records, stale CNAMEs, or propagation delays after a DNS change; local success from your laptop does not prove the provider sees the same answer. TLS can fail just as early: expired SSL certificates, hostname mismatches, missing intermediates, or unsupported ciphers will stop delivery before your app code runs.
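A quick DNS and certificate check can be scripted with the Python standard library. This is a rough sketch, not a full TLS audit: the function names are hypothetical, and it should run from a machine outside your network so the result reflects what the provider sees.

```python
import socket
import ssl
import time
from typing import List, Optional


def resolve_host(host: str, port: int = 443) -> List[str]:
    """Return the distinct IP addresses the resolver reports for host."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a certificate 'notAfter' string into days remaining."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def certificate_days_left(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Connect with default verification and report days until the cert expires.

    Raises ssl.SSLCertVerificationError on an expired certificate, a hostname
    mismatch, or an incomplete chain, which is itself the diagnosis.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"])


# Usage (performs real network calls; the hostname is a placeholder):
#   print(resolve_host("hooks.example.com"))
#   print(certificate_days_left("hooks.example.com"))
```

If `resolve_host` returns different addresses from different networks, or `certificate_days_left` raises a verification error, the delivery is failing before your application code is involved.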

A firewall, WAF, reverse proxy, or load balancer can also block, rewrite, or delay requests. Common examples include IP allowlists that exclude the provider, WAF rules that flag webhook headers, or proxy timeouts that typically surface as 504 Gateway Timeout (or, less often, 408 Request Timeout). Rate limiting at the edge can return 429 Too Many Requests, while a misrouted path can produce 404 Not Found even when the host is reachable.

Test from outside your network with curl, a webhook inspector, or provider delivery logs to confirm internet reachability and compare the external response with your local one. Verify that the provider can connect, negotiate TLS, and receive the same status your app returns; the guides on webhook debugging techniques and webhook reliability best practices cover this in more detail.

Common webhook failure scenarios by status code and how to reproduce locally

Use the status code as a fast diagnostic signal. 401 Unauthorized and 403 Forbidden usually point to authentication, signature, or permission problems. 404 Not Found usually means the provider is calling the wrong path or your router does not match the route. 408 Request Timeout means the endpoint did not acknowledge the request quickly enough. 422 Unprocessable Entity means the payload reached your app but failed validation. 429 Too Many Requests means your endpoint or a downstream dependency is rate limiting. Any 500 Internal Server Error or other 5xx response means your server failed while handling the event.

A healthy webhook endpoint should return a fast 2xx response after minimal validation. Verify the signature, check that the payload parses, enqueue work if needed, and return immediately. Do not wait for slow database writes, third-party API calls, or background jobs before acknowledging the delivery.
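The acknowledge-first pattern can be sketched without any web framework. The example below assumes signature verification has already produced a boolean, uses an in-process queue as a stand-in for a durable job queue, and returns plain status codes; all names are hypothetical.

```python
import logging
import queue
import threading
from typing import Callable

# In-process queue for illustration; production systems usually need a durable one.
work_queue: "queue.Queue[bytes]" = queue.Queue()


def handle_delivery(raw_body: bytes, signature_ok: bool) -> int:
    """Do minimal validation, enqueue the event, and acknowledge immediately."""
    if not signature_ok:
        return 401
    if not raw_body:
        return 400
    work_queue.put(raw_body)  # slow work happens later, off the request path
    return 200


def run_worker(process: Callable[[bytes], None]) -> None:
    """Drain the queue forever; intended to run in a background thread."""
    while True:
        body = work_queue.get()
        try:
            process(body)
        except Exception:
            # A failing job must not kill the worker; log it and continue.
            logging.exception("webhook job failed")
        finally:
            work_queue.task_done()


def start_worker(process: Callable[[bytes], None]) -> threading.Thread:
    """Start run_worker in a daemon thread and return the thread."""
    thread = threading.Thread(target=run_worker, args=(process,), daemon=True)
    thread.start()
    return thread
```

Because the request path only verifies, enqueues, and returns, the response time no longer depends on database writes or third-party calls made while processing the event.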

Quick status-code guide

- 401 Unauthorized: missing or invalid secret, signature header, bearer token, or shared key.
  - Likely fix: confirm the exact webhook secret for the environment and verify that HMAC verification uses the raw request body, not a parsed or reserialized payload.
- 403 Forbidden: the signature may be valid, but the request is blocked by IP allowlists, auth rules, CSRF middleware, WAF rules, or tenant permissions.
  - Likely fix: exempt the webhook route from browser auth and CSRF checks, and confirm any allowlist includes the provider’s source.
- 404 Not Found: wrong URL, missing trailing path segment, route mismatch, reverse proxy rewrite, or the app is listening on a different path.
  - Likely fix: compare the exact provider endpoint URL with the route registered in your app and proxy.
- 408 Request Timeout: your handler is too slow, a proxy times out, or the provider gives up waiting.
  - Likely fix: reduce work in the request path and return a 2xx quickly.
- 422 Unprocessable Entity: schema validation failed, required field missing, wrong enum, bad date format, or incompatible API version.
  - Likely fix: inspect the exact payload shape and align your validator with the provider’s event schema.
- 429 Too Many Requests: rate limiting at your app, gateway, or downstream service.
  - Likely fix: add backpressure, queueing, and retry-friendly handling with exponential backoff.
- 500 Internal Server Error: unhandled exceptions, null references, DB failures, dependency outages, or bad assumptions about payload shape.
  - Likely fix: inspect logs, reproduce with the same payload, and harden error handling around every external dependency.
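The guide above can be encoded as a small triage helper for scripts that scan delivery logs. This is only a sketch of the mapping in this section; the function name and hint wording are hypothetical, and `None` stands for a timeout or missing response.

```python
from typing import Optional

HINTS = {
    401: "check the signing secret and verify against the raw request body",
    403: "check allowlists, CSRF and auth middleware, and WAF rules",
    404: "compare the provider endpoint URL with the registered route",
    408: "reduce work in the request path and return a 2xx quickly",
    422: "align the validator with the provider's event schema",
    429: "add queueing and backpressure; retry with exponential backoff",
    500: "inspect logs and reproduce with the same payload",
}


def triage(status: Optional[int]) -> str:
    """Map a delivery attempt's status code to a first diagnostic step."""
    if status is None:
        return "timeout or no response: check latency, proxies, DNS, and TLS"
    if 200 <= status < 300:
        return "delivered: no action needed"
    if status in HINTS:
        return HINTS[status]
    if 500 <= status < 600:
        return HINTS[500]  # treat every other 5xx like a server-side failure
    if 400 <= status < 500:
        return "client error: inspect the request shape and endpoint configuration"
    return "unexpected status: inspect the raw response"
```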

Reproduce the failure locally with the real request shape

The fastest way to debug webhook delivery failures is to replay the exact request that failed. Keep the raw request body, headers, content-type, and signature conditions as close to production as possible. A payload that “looks similar” is not enough when signature verification, encoding, or middleware behavior differs.

Using `curl`

Save the provider’s payload exactly as delivered, then replay it with the original headers:

curl -i http://localhost:3000/webhooks/provider \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Webhook-Signature: <signature-value>' \
  --data-binary @payload.json

Use --data-binary so curl sends the file as-is. Avoid -d here: when reading from a file it strips carriage returns and newlines, which changes the bytes and breaks signature checks.

Using Postman

Postman works well when you need to inspect headers and iterate quickly.

1. Create a POST request to your local endpoint.
2. Paste the exact raw body into the Body tab.
3. Set the same content-type used in production.
4. Add the provider’s signature and delivery headers.
5. Send the request and compare the response status, logs, and any signature verification output.

If the provider signs the exact bytes, make sure Postman is not reformatting the body. For signature-sensitive webhooks, curl or a replay tool is often safer than manual editing.

Using replay tools or a webhook inspector

Tools such as a webhook inspector, request capture proxy, or provider replay feature let you resend the original delivery with the same headers and body. That is the best option when you need to reproduce a failure caused by encoding, compression, or middleware.

A good workflow is:

1. Capture the failing event in a webhook inspector.
2. Replay the same delivery to a local tunnel or staging endpoint.
3. Compare the request body, headers, and response code.
4. Adjust your handler until the replay succeeds.

Matching the request exactly

When you reproduce locally, match these fields first:

- Raw request body: use the exact bytes the provider sent.
- Headers: especially signature, event ID, delivery ID, and any timestamp header.
- content-type: application/json, form-encoded, or vendor-specific types can change parsing.
- HTTP method and path: POST to the exact route.
- Environment secret: development and production secrets are often different.
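If you prefer a script to curl, the same replay can be sketched with the Python standard library. The URL, header name, and file name below are placeholders taken from the curl example, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Dict


def build_replay(url: str, raw_body: bytes, headers: Dict[str, str]) -> urllib.request.Request:
    """Build a POST that carries the captured bytes unchanged."""
    # urllib normalizes header-name capitalization; HTTP header names are
    # case-insensitive, so this does not affect signature checks on the body.
    return urllib.request.Request(url, data=raw_body, headers=headers, method="POST")


def send_replay(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the replay and return the HTTP status, including 4xx and 5xx."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx responses arrive as exceptions in urllib


# Usage (placeholders; read the payload as bytes so nothing is re-encoded):
#   from pathlib import Path
#   body = Path("payload.json").read_bytes()
#   req = build_replay(
#       "http://localhost:3000/webhooks/provider",
#       body,
#       {"Content-Type": "application/json", "X-Webhook-Signature": "<signature-value>"},
#   )
#   print(send_replay(req))
```

Reading the file in binary mode and passing the bytes straight through avoids the re-serialization problems that break signature verification.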

Why it works in development but fails in production

Webhook handlers often pass locally and fail in production because the environments are not actually equivalent. Common differences include different webhook secrets or signing keys, a different endpoint URL or route prefix, proxies that rewrite paths or strip headers, TLS certificates that are valid in one environment but not the other, API version mismatches between sandbox and live traffic, middleware differences, firewall or WAF rules, and load balancer or reverse proxy timeouts.

A request that succeeds in development may fail in production because the provider signs a different body, sends a different event version, or hits a load balancer that changes timeout behavior. Compare the full request path from provider to app, not just the controller code.

Prevention tactics that reduce repeat failures

Use idempotency so retries do not create duplicate side effects. Return a fast 2xx once you have validated the request and recorded enough data to process it safely. Apply exponential backoff for retries, and treat repeated 429 or 5xx responses as a signal to slow down or queue work.
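One common way to get idempotency is to record each event ID under a uniqueness constraint and process an event only if the insert succeeds. The sketch below uses SQLite for illustration; the table and function names are hypothetical, and it assumes the provider sends a stable event ID on every retry.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store with one row per event ID that has been claimed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True the first time an event ID is seen, False on a retry."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists, so this delivery is a duplicate.
        return False
```

A handler would call `claim_event` before performing side effects and still return a 2xx for duplicates, so the provider stops retrying an event that has already been handled.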

Build strong observability around webhook handling:

- structured logging with event ID, delivery ID, status code, and latency
- monitoring for spikes in 4xx, 5xx, and timeout rates
- alerting when failures repeat or a route stops receiving successful deliveries
- dead-letter queue handling for events that fail after multiple retries
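The structured logging item can be sketched as one JSON line per delivery attempt. The field names and logger name are assumptions; match them to whatever your log pipeline already indexes.

```python
import json
import logging

logger = logging.getLogger("webhooks")  # hypothetical logger name


def delivery_log_line(
    event_id: str,
    delivery_id: str,
    status: int,
    started_at: float,
    finished_at: float,
) -> str:
    """Serialize one delivery attempt as a single JSON object."""
    record = {
        "event_id": event_id,
        "delivery_id": delivery_id,
        "status": status,
        "latency_ms": round((finished_at - started_at) * 1000, 1),
    }
    return json.dumps(record, sort_keys=True)


def log_delivery(
    event_id: str,
    delivery_id: str,
    status: int,
    started_at: float,
    finished_at: float,
) -> None:
    """Log failures at WARNING so alerting rules can key on the level."""
    line = delivery_log_line(event_id, delivery_id, status, started_at, finished_at)
    if status >= 400:
        logger.warning(line)
    else:
        logger.info(line)
```

With one JSON object per attempt, monitoring can count 4xx, 5xx, and slow responses per route without parsing free-form text.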

For a broader checklist, pair this section with the related guides on webhook debugging techniques, reliable webhook practices, webhook reliability best practices, the webhook QA checklist, and webhook event handling. The practical goal is simple: reproduce the exact failure, fix the root cause, and make the endpoint resilient enough that the same delivery does not fail twice.

Troubleshooting checklist

Use this checklist when a webhook fails in production but works in development:

- Confirm the provider’s event delivery logs show the failure and note the request ID.
- Check whether the status is a 2xx response, 4xx response, 5xx response, or timeout.
- Verify the endpoint URL, route, and API versioning match the live environment.
- Inspect request headers, content-type, and the raw request body.
- Verify the webhook signature against the signing secret using HMAC.
- Check for DNS, TLS, and SSL certificate issues.
- Review firewall, WAF, reverse proxy, and load balancer rules.
- Look for rate limiting and 429 Too Many Requests responses.
- Reproduce the request locally with curl, Postman, or a webhook inspector.
- Confirm your handler uses schema validation, idempotency, retry logic, and exponential backoff.
- Add structured logging, monitoring, alerting, and a dead-letter queue for repeated failures.
- Check the provider’s status page if the failure appears provider-side.

When to use exponential backoff for retries

Use exponential backoff when failures are temporary and likely to clear on their own, such as transient 429 Too Many Requests, short-lived dependency failures, or brief network instability. Backoff is useful when the provider or your own queue should retry later instead of hammering an overloaded service.

Do not use aggressive immediate retries for every failure. If the problem is a bad endpoint URL, invalid signature, malformed JSON, or a permanent 404 Not Found, repeated retries only create noise. Pair backoff with idempotency so a retried delivery does not create duplicate records or duplicate side effects.
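The retry policy described above can be sketched as two small functions: one decides whether a failure is worth retrying, the other produces capped exponential delays with full jitter. The retryable status set, base delay, and cap are assumptions to tune for your system.

```python
import random
from typing import List, Optional

# Assumed set of transient statuses; permanent errors such as 404 are excluded.
RETRYABLE = {408, 429, 500, 502, 503, 504}


def should_retry(status: Optional[int]) -> bool:
    """Retry timeouts (None) and transient statuses; skip permanent failures."""
    return status is None or status in RETRYABLE


def backoff_delays(
    attempts: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one delay per attempt: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter spreads retries out so clients do not retry in lockstep.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

For example, `backoff_delays(5)` yields five delays whose upper bounds are 1, 2, 4, 8, and 16 seconds, while `should_retry(404)` returns False so a misconfigured URL is surfaced instead of retried.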

Summary

Most webhook delivery failures come down to one of five areas: the provider could not reach your endpoint, the endpoint returned the wrong HTTP status code, signature verification failed, the payload did not match your schema, or infrastructure blocked the request. Start with delivery logs, then check the response code, then reproduce the exact request locally. If you make the handler fast, idempotent, observable, and tolerant of retries, webhook delivery becomes much easier to trust.