Have Checks: Always Include Checks for Important Functions
Earlier this year at Uptrip, we had a failure that stayed invisible for weeks.
Card creation was working, but card image generation was failing after a Linux dependency upgrade. The tricky part was that everything looked fine on the surface: the CLI error was already silenced in Sentry, core flows were still running, and no dashboard clearly highlighted the issue. We discovered it by accident while doing unrelated admin work and noticing missing images.
We had a similar pattern with one OAuth provider: it breaks every few months for reasons outside our control, and if we don’t detect it early, users become the first monitoring system. That is the wrong order.
The lesson is simple: if a function is business-critical, logs and exception tracking are not enough. You need periodic checks that verify outcomes, not just process execution. For us, that means regularly checking whether logins are actually happening for key providers, whether cards created in a period actually end up with generated images, and a few other checks related to Web3 and marketplace flows.
Have Checks for Important Functions
The approach we settled on is boring on purpose: run periodic outcome checks directly against production data.
If a provider login is critical, we check that there is at least one successful login in an expected time window. If card image generation is critical, we check that cards created in a recent period have images attached after a reasonable processing delay. We apply the same pattern to critical Web3 and marketplace flows: define the expected outcome, query for missing outcomes, and alert only when the gap is real.
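One of these checks can be sketched in a few lines. This is a hypothetical illustration, not our production code: the `cards` table, its `created_at` and `image_url` columns, and the injected `run_query` helper are all assumed names, and the window and grace values are placeholders.

```python
from datetime import datetime, timedelta, timezone

# Assumed knobs: look at cards created in the last 6 hours, but only
# those old enough that async image generation should have finished.
WINDOW = timedelta(hours=6)
GRACE = timedelta(minutes=30)

def cards_missing_images(run_query):
    """Return ids of cards old enough to have an image, but without one."""
    now = datetime.now(timezone.utc)
    return run_query(
        """
        SELECT id FROM cards
        WHERE created_at >= %s        -- inside the recent window
          AND created_at <  %s        -- past the processing grace period
          AND image_url IS NULL
        """,
        (now - WINDOW, now - GRACE),
    )
```

The shape is the point: define the expected outcome ("card has an image"), query only for records where that outcome is overdue, and alert on what comes back.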
We briefly considered relying on two other approaches as our primary monitoring:
- Real-time error monitoring only (Sentry/logs)
- Synthetic checks that simulate user flows
Both are useful, and we still use them, but neither is enough on its own for these failures: error tracking can silence exactly the signal you need (as our Sentry setup did), and synthetic checks cover only a narrow path. Periodic data checks answer the question we actually care about: did the business outcome happen?
Frequency, Thresholds, and Alerting
Frequency should match business impact, not engineering convenience.
For Uptrip, we run high-impact checks every few hours (like card image generation) and lower-frequency checks daily (like provider-specific login health when traffic is lower). The exact schedule matters less than consistency and clear ownership.
Thresholds are where most teams create noise by accident. “Zero events” is not always a failure, especially in low-traffic windows. We define thresholds using expected volume and timing behavior:
- Check a recent time window where activity is expected.
- Add a grace period for asynchronous processing.
- Alert only when the missing outcome persists after the grace period.
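The three rules above reduce to a small decision function. A minimal sketch, with the three inputs passed in as booleans and counts rather than computed here:

```python
def should_alert(outcomes_in_window: int,
                 activity_expected: bool,
                 grace_elapsed: bool) -> bool:
    """Apply the threshold rules: expected window, grace period, persistence."""
    if not activity_expected:
        return False   # quiet window: zero events is not a failure
    if not grace_elapsed:
        return False   # async processing may still be running
    return outcomes_in_window == 0
```

Keeping the decision separate from the query makes the false-alarm rules explicit and easy to test on their own.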
Alerts should also be actionable. Each alert should say:
- What outcome is missing
- Which time window is affected
- Which query/check failed
- Who owns the first response
If the alert does not help someone take the first action in minutes, it is not ready.
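Those four fields can simply be the alert's schema. A sketch, with illustrative field values; the structure, not the names, is what we enforce:

```python
from dataclasses import dataclass

@dataclass
class OutcomeAlert:
    missing_outcome: str   # what outcome is missing
    window: str            # which time window is affected
    check_name: str        # which query/check failed
    owner: str             # who owns the first response

    def message(self) -> str:
        return (f"[{self.check_name}] {self.missing_outcome} "
                f"in {self.window}; first responder: {self.owner}")
```

An alert built this way cannot be fired without an owner or a window, which is most of what "actionable" means in practice.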
Edge Cases and False Alerts
The quality of a checker is mostly about how well it avoids false alarms.
In our case, three edge cases mattered:
- Low-volume windows
- Processing delay versus real failure
- Query correctness across multiple tables and columns
Low-volume windows are simple to describe and easy to get wrong. If no one is expected to log in with a specific provider during a quiet period, “zero logins” should not wake people up. We solve this by evaluating checks only in windows where activity is expected, based on other activity markers.
Processing delay is another common source of noisy alerts. A card without an image right after creation is not always a bug; sometimes the image pipeline is still running. We add a grace period, then check only records older than that threshold.
The most important edge case for us was query correctness. A straightforward query for provider logins looked valid but produced wrong alerts because the real “successful login” signal was spread across multiple tables and columns. We had to model the check around our actual data semantics, not around the most convenient SQL statement.
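To make the pitfall concrete, here is a hypothetical version of the problem (the `auth_attempts` and `sessions` tables and their columns are invented for this example, not our schema). Counting rows in an attempts table looks like a login check, but only a join against the table where the outcome is actually recorded reflects the real success signal:

```python
# Naive: counts handshakes, not successful logins.
NAIVE_LOGINS = """
SELECT COUNT(*) FROM auth_attempts
WHERE provider = %(provider)s AND created_at >= %(since)s
"""

# Correct for this assumed schema: success only exists once a session row
# is written and established, which lives in a different table.
SUCCESSFUL_LOGINS = """
SELECT COUNT(*)
FROM auth_attempts a
JOIN sessions s ON s.attempt_id = a.id
WHERE a.provider = %(provider)s
  AND a.created_at >= %(since)s
  AND s.established_at IS NOT NULL   -- the real success signal
"""
```

The naive query is shorter and runs fine, which is exactly why it survives code review while producing wrong alerts.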
If the checker does not reflect how your data is truly written, your alerts become just another source of noise in the ops channel.
Conclusion
The main shift for us was simple: stop treating “no visible error” as proof that critical flows are healthy.
For important functions, we now define expected business outcomes and verify them on a schedule. For Uptrip, that includes provider logins, card image generation, and other critical Web3 and marketplace outcomes. The checks are straightforward, but the details matter: frequency, grace periods, and accurate data queries.
Periodic checks are not fancy monitoring. They are a practical way to discover silent failures before customers do.