Using Production Telemetry to Strengthen Automated Test Validation

A green CI pipeline doesn’t automatically mean release, and many teams are already aware of that.

Why? Because the software can pass every single test with flying colors, and still, there’s zero guarantee that it won’t be a complete failure in production.

The reason for this is a huge gap between controlled testing and what happens in the world of real users (e.g., unpredictable user behavior, unstable infrastructure, etc.).

Naturally, you’d want to close that gap, but how? You need production telemetry, which means logs, request metrics, trace data, and interactions from actual human users. It’s how you become familiar with how users really interact with the software, and without this information, work gets a lot harder.

The argument is pretty simple here.

Production data gives you proof of what happens – under real-world (non-ideal) conditions – and when quality assurance (QA) teams study this data, they can figure out whether their tests match what happens in the real world.

Why Passing Tests Still Fail in Production

The frustrating truth is that your tests all pass, yet the production system is on fire. How can that be?

The starting point is the environment, because test environments are clean and quiet – controlled setups where variables are limited and behavior is fully predictable. The data is synthetic, and the services are mocked.

It’s the complete opposite in production, where things get messy, and users use the software in ways nobody could have predicted.

There’s also a difference in data. Look at what tests use – neat datasets.

But real users will type in gibberish and click in random places, producing situations you didn’t plan for. Tests, meanwhile, stick to the things you expected to happen, because that’s what they were written to check.

What’s rarely shown is what happens INSIDE the system when it’s put under proper stress (e.g., memory running low, services failing in a chain reaction like dominoes, etc.). A lot of bugs will stay sort of hidden until they hit those huge traffic numbers or long runtimes.

This basically means that quick, non-extreme test runs will almost never catch these types of issues.

The point of this isn’t to blame the tests because they do exactly what you ask them to do. But you can’t predict everything, which is a problem. Production telemetry, though? It shows precisely what you miss, and if you can see it, then you can fix it.


Converting Production Telemetry into Actionable Test Improvements

To make validation truly stronger, you need a shift in mindset. Telemetry isn’t just something that alerts you when things are breaking – it’s structured evidence of how the system ACTUALLY behaves. Seen through this lens, logs, metrics, and traces become a roadmap for improving tests.

Take structured logs, for example.

Analyze them, and you can see recurring exceptions or warning patterns you’d never catch in staging. Map these exceptions back to existing test cases, and you can see whether those failures were anticipated or missed entirely.
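As a small sketch of what that log analysis can look like (the JSON log format and field names here are illustrative, not tied to any particular logging tool), a few lines of Python can surface exception types that keep recurring in production:

```python
import json
from collections import Counter

# Hypothetical structured log lines (one JSON record per line), as exported
# from a production logging pipeline.
log_lines = [
    '{"level": "ERROR", "exception": "TimeoutError", "endpoint": "/forecast"}',
    '{"level": "WARN",  "exception": null, "endpoint": "/login"}',
    '{"level": "ERROR", "exception": "KeyError", "endpoint": "/forecast"}',
    '{"level": "ERROR", "exception": "TimeoutError", "endpoint": "/forecast"}',
]

def recurring_exceptions(lines, min_count=2):
    """Count exception types in structured logs and return those that recur."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("exception"):  # skip records with no exception
            counts[record["exception"]] += 1
    return {exc: n for exc, n in counts.items() if n >= min_count}

print(recurring_exceptions(log_lines))  # {'TimeoutError': 2}
```

Each exception type that shows up here but has no corresponding test case is a candidate for a new regression test.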

From there, QA teams can create new regression tests that are based on actual error payloads from production (properly sanitized, of course).

Examples of new regression tests we can derive from production telemetry:

  • Replay tests based on sanitized real production request payloads
  • Latency injection against external dependencies
  • Soak tests, designed to mirror real production uptime durations
  • Failure chain simulations (across dependent services)
  • Backpressure tests
  • Queue saturation tests

Metrics are another layer of insight.

Performance tests often use assumptions about how fast a service should respond. Production metrics either validate those assumptions or challenge them. Those ‘assumptions’ are typically called service level objectives (a.k.a. SLOs).

SLOs are clearly defined targets for latency, error rates, and availability.

Production metrics either validate those targets or expose gaps between ‘what’s expected’ and ‘what happens in the real world’.

Examples of production telemetry signals that should indicate new test cases are required:

  • p50/p95/p99 response time
  • tail latency while under peak load
  • error rate percentage by endpoint
  • 5xx vs 4xx distribution
  • timeout rate
  • retry rate
  • RPS under peak traffic
  • concurrent active sessions
  • CPU saturation
  • garbage collection pauses
  • memory usage
  • external API latency variance

For instance, when systems rely on time-sensitive external services, like weather APIs with hourly and daily forecast endpoints, production data might show timeout spikes or response-time variability that mocked tests never simulated.

And that can be the guide to build resilience tests that accurately depict real-world dependency behavior.
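One way to sketch such a resilience test is with Python’s `unittest.mock`, injecting the timeout behavior telemetry revealed into the dependency call. The `fetch_forecast` client and its degraded-response shape are hypothetical, not from any real API:

```python
from unittest import mock

def fetch_forecast(http_get, city, timeout_s=2.0):
    """Fetch a forecast, degrading gracefully when the dependency times out."""
    try:
        return http_get(f"/forecast/{city}", timeout=timeout_s)
    except TimeoutError:
        # Fallback observed requirements: serve a degraded response, don't crash.
        return {"city": city, "forecast": None, "degraded": True}

def test_timeout_injection():
    # Inject the failure mode production telemetry showed us: the external
    # API sometimes exceeds its deadline instead of failing fast.
    flaky_get = mock.Mock(side_effect=TimeoutError("deadline exceeded"))
    result = fetch_forecast(flaky_get, "oslo")
    assert result["degraded"] is True
    assert result["forecast"] is None

test_timeout_injection()
```

The mock’s behavior is deliberately modeled on what the metrics showed, so the test exercises the dependency as it actually behaves, not as its documentation promises.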

Distributed tracing is also very important because it shows the flow and timing of a request across services, which is what lets us see what went wrong. It won’t always show the root cause, but the data it provides points us to where the issue is.

It’s sort of observational and system-wide ‘debugging’, except it’s not interactive.

But that’s very technical. Another thing we can’t do without is paying attention to what users actually interact with, and how. For instance, one page might get heavy traffic every day, meaning we should focus more testing resources on pages like that instead of pages that barely get any traffic.

What this does is create a loop between operations and QA:

anomaly happens → it’s analyzed → (test) results are evaluated → tests are improved

Operations detect the anomaly and analyze it. If test coverage gaps are suspected, QA teams may join at the analysis stage. This is also where QA teams normally come in – test results evaluation. After evaluating, QA teams improve the tests (operations may assist here with tooling/instrumentation), and the cycle repeats.

And this loop helps reduce the number of recurring instances of that particular anomaly.

Conclusion

Tests are unbiased, and they’re polite. They’ll do what you ask them to do. Nothing more. Nothing less. It’s simple, orderly, and organized.

In comparison, production is pure chaos.

Things get clicked in strange, unintended ways, the internet slows down or breaks for no obvious reason, users leave halfway through without you ever knowing why, and users even end up doing things that never crossed anyone’s mind, leading to exploits and crashes. Then the third-party service your entire app depends on stops working, and the servers get overloaded.

Automated tests have to go hand in hand with production telemetry because it’s the only way to see what happens when things get messy.

The real win is that QA teams have a way of using all this data to build something stronger and more resilient.
