
Real user data is the only way to reliably test your new release

April 29, 2013

Rewrite projects are inherently full of risk, not least because they often involve faithfully reproducing the external behaviour of an existing system. End users rely not only on the functionality you intended to provide, but may also depend upon bugs or accidental conveniences in the output or behaviour. The only way to know that your new system looks identical from the outside is to run it against existing inputs and compare the outputs.

On a recent project, we successfully deployed a new implementation of an external API - identical from the outside, but completely rebuilt underneath. In order to verify the changes before go-live, we intercepted all live user requests and, in addition to sending them to the existing system ("V1"), we asynchronously fired them at the replacement system ("V2"). Here's a basic diagram of the setup we put together:

When a user sends a V1 request to the system, it first hits an intercepting proxy, which publishes the request (message body, requested URL and headers) to a pub/sub channel. It then passes the request through to V1 as usual, and the V1 response is returned. The response is also published to the pub/sub channel.
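The real proxy was evented (more on that below), but purely to show what gets published and when, here is a minimal sketch of the same flow in Python - Flask, requests and redis-py stand in for the actual components, and the hostnames and channel name are illustrative:

    import json
    import redis
    import requests
    from flask import Flask, Response, request

    app = Flask(__name__)
    channel = redis.Redis()
    V1_URL = "http://v1.internal"  # placeholder address for the existing system

    @app.route("/", defaults={"path": ""}, methods=["GET", "POST"])
    @app.route("/<path:path>", methods=["GET", "POST"])
    def proxy(path):
        headers = {k: v for k, v in request.headers if k.lower() != "host"}

        # Publish the incoming request (body, URL and headers) to the channel
        channel.publish("requests", json.dumps({
            "kind": "v1-request",
            "method": request.method,
            "url": request.full_path,
            "headers": headers,
            "body": request.get_data(as_text=True),
        }))

        # Pass the request through to V1 as usual
        upstream = requests.request(request.method, V1_URL + request.path,
                                    params=request.args, headers=headers,
                                    data=request.get_data(), timeout=30)

        # Publish the V1 response to the same channel
        channel.publish("requests", json.dumps({
            "kind": "v1-response",
            "status": upstream.status_code,
            "headers": dict(upstream.headers),
            "body": upstream.text,
        }))

        # Return the V1 response to the user untouched
        return Response(upstream.content, status=upstream.status_code)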

Elsewhere, a simple worker script subscribes to the pub/sub channel. If a V1 request is received, it is persisted and then fired into V2 - the response to which is also persisted. The subscriber also listens for V1 responses on the pub/sub channel and persists those into the same document - meaning that for every real V1 request, we end up with a single document holding the V1 response alongside the equivalent V2 response. These can then be compared to verify that the new system gives the same answers as the existing one.
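A sketch of that worker, again in Python: MongoDB is assumed here as the document store, the message shape matches the proxy sketch above, and documents are keyed on an X-Request-Id header - how that id gets onto the request and comes back on the V1 response is the subject of the next section.

    import json
    import redis
    import requests
    from pymongo import MongoClient

    V2_URL = "http://v2.internal"            # placeholder address for the replacement system
    docs = MongoClient().shadow.comparisons  # one document per real V1 request

    pubsub = redis.Redis().pubsub()
    pubsub.subscribe("requests")

    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        msg = json.loads(message["data"])
        rid = msg["headers"].get("X-Request-Id")

        if msg["kind"] == "v1-request":
            # Persist the V1 request, replay it against V2 and persist V2's answer
            v2 = requests.request(msg["method"], V2_URL + msg["url"],
                                  headers=msg["headers"], data=msg["body"], timeout=30)
            docs.update_one({"_id": rid},
                            {"$set": {"v1_request": msg,
                                      "v2_response": {"status": v2.status_code,
                                                      "body": v2.text}}},
                            upsert=True)
        elif msg["kind"] == "v1-response":
            # Persist the real V1 response into the same document for later comparison
            docs.update_one({"_id": rid}, {"$set": {"v1_response": msg}}, upsert=True)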

Matching up requests and responses

One of the more challenging problems in this setup was correlating each V1 response with the request that produced it. Using an evented proxy was the right choice, but it meant that the proxy did not know which request a given response belonged to - two concurrent requests would not necessarily complete in the order in which they started.

This had no impact on the end user, but for verification it was vital to be able to see the V1 request/response pair together with the V2 equivalents. In the end we solved this problem by adding an additional header to the request before it was sent to V1 - `X-Request-Id`. On the V1 side, the webserver (Apache) was configured to 'echo' this header:

Header echo ^X-Request-Id

… meaning that any `X-Request-Id` header received in a request would be repeated on the response, which the proxy would then publish to the subscriber. The subscriber was then able to persist the V1 response to the same document as the V1 request.
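On the proxy side, that just means generating a unique id and attaching it to the outgoing headers before the request is forwarded - roughly like this (using a UUID is an assumption about how the ids might be generated, not a record of the original implementation):

    import uuid

    def tag_request(headers):
        # Attach a unique id before the request is forwarded to V1. Apache's
        # "Header echo" then repeats it on the V1 response, so the subscriber can
        # file the request and the response under the same document id.
        tagged = dict(headers)
        tagged["X-Request-Id"] = str(uuid.uuid4())
        return tagged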

Handling failure

One of the good things about this setup is that almost all parts of both the parallelisation and the V2 system can be broken without impacting the user. If V2 itself is down, the only impact is on the channel subscriber.

If the subscriber is down, no messages are handled. The impact of this is a reduction in the amount of test data we have to analyse - it doesn't affect real users. In our case, we used Redis pub/sub which does not persist messages - so if nobody is listening to the channel, the message is lost. It would be trivial, however, to change the publishing so it persisted messages onto a Redis list and used pub/sub to notify of new contents. The subscriber would then delete the list entry when it had successfully dealt with it - meaning that a newly-started subscriber could catch up on messages that were sent whilst no subscriber was running.
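That variant might look something like this - a minimal sketch, with the list and channel names made up for illustration:

    import json
    import redis

    r = redis.Redis()

    def publish(payload):
        # Persist the message on a list, then notify subscribers that there is new work
        r.lpush("requests:queue", json.dumps(payload))
        r.publish("requests:notify", "new")

    def drain(handle):
        # Work through the oldest entries, deleting each one only once it has been
        # handled, so a crash part-way leaves the message on the list for next time
        while True:
            raw = r.lindex("requests:queue", -1)
            if raw is None:
                break
            handle(json.loads(raw))
            r.rpop("requests:queue")

    # A newly-started subscriber first catches up on anything that arrived while it
    # was down, then drains again whenever the notification channel fires.
    pubsub = r.pubsub()
    pubsub.subscribe("requests:notify")
    drain(print)  # print stands in for the real message handler
    for message in pubsub.listen():
        if message["type"] == "message":
            drain(print)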

The biggest risk to the end user experience was Redis being down or inaccessible to the proxy - as the publishing process was synchronous and blocking. As it happened, we had an nginx router set up in front of the proxy already. This meant we could set nginx to try calling the proxy first, and if it failed, to instead call V1 directly. Now, both Redis and the proxy could be inaccessible and the end user experience would still not be impacted, apart from perhaps a small increase in response time.
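With nginx already in front, that fallback can be expressed as an upstream block with the proxy as the primary and V1 as a backup - along these lines, with placeholder hostnames and ports rather than our actual configuration:

    upstream api_backend {
        server intercepting-proxy:8080;   # try the recording proxy first
        server v1-backend:8080 backup;    # fall straight back to V1 if the proxy fails
    }

    server {
        listen 80;
        location / {
            # Treat errors and timeouts from the proxy as a reason to try the backup
            proxy_next_upstream error timeout http_502 http_503;
            proxy_pass http://api_backend;
        }
    }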

Creating a Firehose

Testing our new system with real data revealed a good number of bugs and missing features which, had they been found on launch day, would have been cause for a rollback. In the event, the switch-over to V2 was painless, and we had tens of thousands of real requests to prove that it would work. And, because V2 had been live in the production environment for over two months, we knew it could handle the load of real requests, as well as their varying contents.

As ever, though, there were unforeseen benefits to this method - particularly in creating the Redis channel. Suddenly, it was easy to subscribe to a firehose of our requests, which could be fed into monitoring systems and dashboards, collected for use as benchmarking data, or simply watched by a curious eye in a sanitised format during the day. The more we did this, and the more oddities we observed on both our side and on implementers', the more we grew our knowledge of our ecosystem and how it could be improved.
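Watching the firehose was as simple as subscribing to the same channel - something like this, with the sanitisation left as a stub:

    import json
    import redis

    pubsub = redis.Redis().pubsub()
    pubsub.subscribe("requests")

    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        msg = json.loads(message["data"])
        # Print a one-line, sanitised view of each live request as it happens
        print(msg.get("kind"), msg.get("method", ""), msg.get("url", ""))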