Sam's World Of No

True stories about startups, technology & agile.

Real User Data Is the Only Way to Reliably Test Your New Release

Rewrite projects are inherently full of risk, not least because they often involve faithfully reproducing the external behaviour of an existing system. End users are not only reliant on your expected functionality, but may also depend upon bugs or accidental conveniences in the output or behaviour. The only way you can know that your new system looks identical from the outside is by running it against existing inputs and comparing the outputs.

On a recent project, we successfully deployed a new implementation of an external API - identical to the outside world but completely re-built underneath. In order to verify the changes before go-live, we intercepted all live user requests and, in addition to sending them to the existing system (“V1”), we asynchronously fired them at the replacement system (“V2”). Here’s a basic diagram of the setup we put together:

When a user sends a V1 request to the system, it first hits an intercepting proxy which publishes the request (message body, requested url and headers) to a pub/sub channel. It then passes through the request to V1 as usual, and the V1 response is returned. The response is also published to the pub/sub channel.
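As a rough Ruby sketch of the publishing side (the channel name and message shape here are my assumptions, not the production code), the proxy can serialise each request and response into a JSON message before forwarding:

```ruby
require "json"

# Hypothetical message builders for the pub/sub channel. The proxy
# would publish these before/after forwarding to V1, e.g. with redis-rb:
#   redis.publish("v1-traffic", request_message(url, headers, body))
#   redis.publish("v1-traffic", response_message(status, headers, body))
def request_message(url, headers, body)
  JSON.dump(type: "v1_request", url: url, headers: headers, body: body)
end

def response_message(status, headers, body)
  JSON.dump(type: "v1_response", status: status, headers: headers, body: body)
end
```

Keeping the messages self-describing (a `type` field plus the full headers) means the subscriber needs no state beyond what arrives on the channel.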

Elsewhere, a simple worker script subscribes to the pub/sub channel. If a V1 request is received, it is persisted and then fired into V2, whose response is again persisted. The subscriber also listens for V1 responses on the pub/sub channel, and persists those into the same document - meaning that for every real V1 request, we now have a single document that shows the equivalent V2 response. These can then be verified to ensure that the new system gives the same answers as the existing one.

Matching up requests and responses

One of the more challenging problems to solve in this setup was correlating the V1 response to the V1 request. Using an evented proxy was the right choice, but it meant that the proxy did not know which request a given response was for - as two concurrent requests would not necessarily complete in the order in which they started.

This had no impact for the end user, but for verification it was vital to be able to see the V1 request/response set together with the V2 equivalents. In the end we solved this problem by adding an additional header to the request before it was sent to V1 - X-Request-Id. On the V1 side, the webserver (Apache) was configured to ‘echo’ this header:

Header echo ^X-Request-Id

… meaning that it would repeat the same header received in the request in the response, which the proxy would then publish to the subscriber. The subscriber was then able to persist the V1 response to the same document as the V1 request.
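The correlation logic on the subscriber side can then be quite small. This is a hypothetical reconstruction (the in-memory hash stands in for whatever document store was actually used), keyed on the echoed `X-Request-Id`:

```ruby
require "json"

# Sketch of the subscriber's correlation step: one document per
# X-Request-Id, accumulating the V1 request and V1 response.
class TrafficRecorder
  attr_reader :documents

  def initialize
    @documents = Hash.new { |h, k| h[k] = {} }
  end

  def handle(raw)
    msg = JSON.parse(raw)
    id  = msg.dig("headers", "X-Request-Id")
    return unless id

    case msg["type"]
    when "v1_request"
      @documents[id]["v1_request"] = msg
      # ...here the real worker would also replay the request against
      # V2 and persist the V2 response into the same document.
    when "v1_response"
      @documents[id]["v1_response"] = msg
    end
  end
end
```

Because both halves carry the same header, arrival order stops mattering - the evented proxy can interleave responses freely.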

Handling failure

One of the good things about this setup is that almost all parts of both the parallelisation and the V2 system can be broken without impacting the user. If V2 itself is down, the only impact is on the channel subscriber.

If the subscriber is down, no messages are handled. The impact of this is a reduction in the amount of test data we have to analyse - it doesn’t affect real users. In our case, we used Redis pub/sub which does not persist messages - so if nobody is listening to the channel, the message is lost. It would be trivial, however, to change the publishing so it persisted messages onto a Redis list and used pub/sub to notify of new contents. The subscriber would then delete the list entry when it had successfully dealt with it - meaning that a newly-started subscriber could catch up on messages that were sent whilst no subscriber was running.
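That durable variant is a well-known Redis pattern: push onto a list, use pub/sub only as a wake-up signal. A sketch (function and key names are mine; `redis` is anything responding to `rpush`/`publish`/`lpop`, i.e. the real redis-rb client in production or a stub in tests):

```ruby
# Durable publish: the list entry survives until a subscriber consumes
# it; the pub/sub message is just a notification and may be lost safely.
def publish_durably(redis, channel, message)
  redis.rpush("#{channel}:queue", message) # persisted until consumed
  redis.publish(channel, "ping")           # wake any live subscriber
end

# On startup (or on each notification), a subscriber drains the list,
# catching up on anything published while it was down.
def drain_queue(redis, channel)
  drained = []
  while (msg = redis.lpop("#{channel}:queue"))
    drained << msg # a real worker would persist each before popping on
  end
  drained
end
```

A newly started subscriber simply calls `drain_queue` first, picking up whatever accumulated in its absence.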

The biggest risk to the end user experience was Redis being down or inaccessible to the proxy - as the publishing process was synchronous and blocking. As it happened, we had an nginx router set up in front of the proxy already. This meant we could set nginx to try calling the proxy first, and if it failed, to instead call V1 directly. Now, both Redis and the proxy could be inaccessible and the end user experience would still not be impacted, apart from perhaps a small increase in response time.
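In nginx terms, that fallback can be expressed with a `backup` upstream. A hypothetical fragment (hostnames and timeouts invented for illustration):

```nginx
upstream api {
    # Try the intercepting proxy first...
    server proxy.internal:8080 max_fails=1 fail_timeout=5s;
    # ...and fall back to calling V1 directly if it is unreachable.
    server v1.internal:80 backup;
}

server {
    listen 80;
    location / {
        proxy_pass http://api;
        proxy_next_upstream error timeout;
    }
}
```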

Creating a Firehose

Testing our new system with real data revealed a good number of bugs and missing features which, had they been found on launch day, would have been cause for a rollback. In the event, the switchover to V2 was painless and easy, and we had tens of thousands of real requests to prove that it would work. And, because V2 had been live in the production environment for over two months, we knew it could handle the load of real requests, as well as their varying contents.

As ever, though, there were unforeseen benefits to this method - particularly in creating the Redis channel. Suddenly, it was easy to subscribe to a firehose of our requests, which could be fed into monitoring systems and dashboards, collected to use as benchmarking data, or simply watched by a curious eye in a sanitised format during the day. The more we did this, and the more oddities we observed on both our side and implementers', the more we grew our knowledge of our ecosystem and how it could be improved.

Focus Daily Standups on Value, Not Activity or Individuals

The daily standup (or daily scrum) is the bedrock of an agile process - and it should be kept enjoyable and useful, because that’s the only way that it’ll definitely continue to happen. A frequent complaint about this meeting, which erodes both its enjoyability and its utility, is a simple one - that it simply takes too long.

There are many reasons why the meeting drags on. It’s important that the team leaves totally in sync with one another - and sometimes this can take more than a few minutes, especially if an ad-hoc planning session emerges.

In many cases, I believe that the standard Scrum-style format of this meeting lays an unhelpful foundation from which many bad habits are often built. I therefore advocate moving teams away from this activity- and individual-focused meeting towards a value-based or “right-to-left” standup.

The focus on activity and individuals

The standard, Scrum-style standup is well known - standing around your card wall, each member of the team answers the following three questions:

  1. What did I do yesterday?
  2. What will I do today?
  3. Is there anything standing in my way?

These questions are quickly learnt and are a good first step when adopting agile - they encourage clarity, commitment to one another and give an opportunity to state publicly if you are blocked (and are often the prompt for you to work out that you’re blocked!).

On a large agile team of ~7 people, though, going through these questions can start to take a while, and the whole meeting can take 20 minutes or more. I believe this is because the focus on activity creates a desire and often a necessity to talk not about your contribution to the team goal and the value you have delivered, but instead to ensure that you justify your previous day’s work. Simply put, people often seem to be spending time proving that they “worked very hard” yesterday, and will work similarly hard today.

The impact of this is a meeting with content focused on individual achievements, rather than an assessment of the state of projects and a synchronisation of team members. If you’re trying to justify yourself, being brief is the last thing on your mind - talking for longer is better.

When you look around other team members in this sort of meeting, you will find them staring at the floor and zoned out, or if they have yet to speak, compiling their own list of achievements to shout about. It’s boring, it wastes time and the team leaves the meeting not much wiser than when they arrived.

A focus on value - the “right-to-left” standup

A value-based standup puts the focus of the standup meeting where it belongs - on the value that the work delivers.

The format is similar but different - the team still stands around their card wall, but they do not go through each individual and discuss their activity. Instead, a facilitator goes through each card on the wall, starting from the right-most card (the one that is nearest to being complete and thus providing business value), and simply asks: what is the next step required to move this card further right, and who is doing it?

Sometimes the answer is that the task is blocked, and if this block is immutable, that’s fine - move on to the next most valuable card (the next right-most card) and repeat the process.

Because the topic of conversation is always the work, each team member is engaged throughout the meeting, rather than only during the part where they are speaking, and the focus at all times is how the team will realise the most value today. The meeting is information-rich and fast-paced - meaning it’s not only shorter but also more enjoyable, which was the whole point.

Team members still make a personal commitment to each other about the work they will do today (one of the best parts of the activity-based standup), and there is no requirement that a manager or agile coach is always the facilitator, meaning the team can be self-organised and different team members are easily empowered to own the process. And of course, there’s something to be said for the variety of different people running the meeting - anything to keep it interesting!

Start small

I’ve successfully introduced this format to a number of teams, but it’s always been after at least a month or two of following the activity-based standup structure first. If your team’s standup is currently non-existent or is more of a nascent activity-based model, I’d suggest implementing activity-based standups first and then moving onto value-based soon after.

Can anybody share their experiences of going straight to a value-based standup?

Continuous Deployment With Feature Flagging

An important part of our product development process is being able to deploy new versions of features to only a portion of our users, so that we can gather feedback and improve whilst not disrupting the workflows of our entire customer base.

We do this using feature flags - this mixes perfectly with both continuous deployment and having a customer-focused process, and in this post I’ll discuss how we did it and the lessons we’ve learned.

Rollout gem

We chose to use the rollout gem to fulfil this requirement - it’s simple, lightweight and has just the right amount of features. For a detailed introduction to rollout, check out the Railscast Pro episode.

We were keen to allow our product managers to activate features for certain users using a UI - rather than developers activating users through a CLI. For this, we used rollout ui, which is a simple rails engine/sinatra app that interfaces with the rollout redis stores. Throw a config.ru into a directory, deploy to Heroku, and you’re ready to rock. Check out the interface above - again, simple and just enough features to be useful.
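The deployment really is that small. A hypothetical config.ru, following rollout_ui’s documented setup (the constant names here are assumptions - verify them against the gem’s README):

```ruby
# config.ru - hypothetical sketch for running rollout_ui standalone.
require "redis"
require "rollout"
require "rollout_ui"

# Point the UI at the same Redis store your application's rollout uses.
RolloutUi.wrap(Rollout.new(Redis.new(url: ENV["REDIS_URL"])))

run RolloutUi::Server
```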

Integrating into the application

So, we start with a test:
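The original snippet hasn’t survived here, but a hypothetical Cucumber scenario in that spirit (feature name, user and wording all invented) might read:

```gherkin
Feature: Flagged new dashboard

  Scenario: A user with the flag enabled sees the new version
    Given the "new_dashboard" feature is enabled for "alice@example.com"
    When "alice@example.com" visits the dashboard
    Then she should see the new dashboard
```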

The step implementation is pretty straightforward - simple, generic steps that call methods from a helper.

… and the helper:
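Again the original gist is gone; here is a hypothetical reconstruction, assuming a global `$rollout` instance and the rollout 2.x API (`features` and `deactivate` - older versions of the gem differ):

```ruby
# features/support/rollout_helper.rb - hypothetical sketch.
module RolloutHelper
  # Enable a feature globally, or for one specific user.
  def enable_feature(feature, user = nil)
    user ? $rollout.activate_user(feature, user) : $rollout.activate(feature)
  end

  # Deactivate every known feature flag.
  def disable_all_features
    $rollout.features.each { |feature| $rollout.deactivate(feature) }
  end
end

# In the Cucumber support files you would mix this into the world and
# clear flags between scenarios, just as you truncate the database:
#   World(RolloutHelper)
#   After { disable_all_features }
```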

As you can see, we’re disabling all rollout features after each run - this is just as essential as clearing out your primary transactional database.

In terms of the implementation, we were keen to avoid sprinkling conditional logic/renders/routes throughout our application - so as to preserve code quality and to make the code easy to remove later. As such, we utilised the decorator pattern to decorate the subject objects with methods that could be used in the view:

… and the different versions of the decorator would provide different view artefacts:
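The original code isn’t shown here, but a self-contained sketch of the idea might look like this - class, feature and partial names are all invented, and the factory picks a decorator per user based on the flag:

```ruby
require "delegate"

# Hypothetical decorator setup: the base class chooses which version
# to wrap the subject in, keeping the conditional in exactly one place.
class DashboardDecorator < SimpleDelegator
  def self.decorate(subject, user)
    if $rollout.active?(:new_dashboard, user)
      NewDashboardDecorator.new(subject)
    else
      OldDashboardDecorator.new(subject)
    end
  end
end

class OldDashboardDecorator < DashboardDecorator
  def summary_partial
    "dashboards/v1_summary"
  end
end

class NewDashboardDecorator < DashboardDecorator
  def summary_partial
    "dashboards/v2_summary"
  end
end
```

The view then renders `decorated.summary_partial` without knowing which version it has, and deleting a finished flag means deleting one decorator class rather than hunting down scattered conditionals.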