Sam's World Of No

True stories about startups, technology & agile.

Real User Data Is the Only Way to Reliably Test Your New Release

Rewrite projects are inherently full of risk, not least because they often involve faithfully reproducing the external behaviour of an existing system. End users are not only reliant on your expected functionality, but may also depend upon bugs or accidental conveniences in the output or behaviour. The only way you can know that your new system looks identical to the outside is by running it against existing inputs and comparing the outputs.

On a recent project, we successfully deployed a new implementation of an external API - identical to the outside world but completely re-built underneath. In order to verify the changes before go-live, we intercepted all live user requests and, in addition to sending them to the existing system (“V1”), we asynchronously fired them at the replacement system (“V2”). Here’s a basic diagram of the setup we put together:

When a user sends a V1 request to the system, it first hits an intercepting proxy which publishes the request (message body, requested url and headers) to a pub/sub channel. It then passes through the request to V1 as usual, and the V1 response is returned. The response is also published to the pub/sub channel.

Elsewhere, a simple worker script subscribes to the pub/sub channel. If a V1 request is received, it is persisted and then fired into V2 - the response of which is again persisted. The subscriber also listens for V1 responses on the pub/sub channel, and persists those into the same document - meaning that for every real V1 request, we now have a single document that shows the equivalent V2 response. These can then be verified to ensure that the new system gives the same answers as the existing.
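The subscriber side of this flow can be sketched in a few lines of Ruby. This is an illustration under stated assumptions rather than the production worker: the class name ShadowRecorder, the JSON message shape (type, id, payload) and the in-memory hash standing in for the document store are all hypothetical, and the V2 call is injected as a plain callable so the example runs without a network.

```ruby
require "json"

# Hypothetical subscriber: keeps one document per request id, holding the V1
# request, the V1 response and the V2 response side by side.
class ShadowRecorder
  attr_reader :documents

  # v2_client is any callable that takes a request hash and returns a
  # response hash; a real worker would make an HTTP call to V2 here.
  def initialize(v2_client)
    @v2_client = v2_client
    @documents = Hash.new { |hash, id| hash[id] = {} }
  end

  # Called for every message received from the pub/sub channel.
  def handle(raw_message)
    message = JSON.parse(raw_message)
    document = @documents[message.fetch("id")]

    case message.fetch("type")
    when "v1_request"
      document["v1_request"] = message.fetch("payload")
      document["v2_response"] = @v2_client.call(message.fetch("payload"))
    when "v1_response"
      document["v1_response"] = message.fetch("payload")
    end
  end

  # Ids of complete documents where V2 disagreed with V1.
  def mismatches
    @documents.select do |_id, doc|
      doc.key?("v1_response") && doc.key?("v2_response") &&
        doc["v1_response"] != doc["v2_response"]
    end.keys
  end
end

# A fake V2 that upper-cases the body, so one request matches and one does not.
fake_v2 = ->(request) { { "status" => 200, "body" => request["body"].upcase } }
recorder = ShadowRecorder.new(fake_v2)
recorder.handle(JSON.generate({ type: "v1_request", id: "a", payload: { body: "hi" } }))
recorder.handle(JSON.generate({ type: "v1_response", id: "a", payload: { status: 200, body: "HI" } }))
recorder.handle(JSON.generate({ type: "v1_request", id: "b", payload: { body: "yo" } }))
recorder.handle(JSON.generate({ type: "v1_response", id: "b", payload: { status: 200, body: "yo" } }))
puts recorder.mismatches.inspect
```

Comparing whole responses for equality is the simplest possible check; in practice you would probably want to ignore fields that legitimately differ between the two systems, such as timestamps.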

Matching up requests and responses

One of the more challenging problems to solve in this setup was correlating the V1 response to the V1 request. Using an evented proxy was the right choice, but it meant that the proxy did not know which request a given response was for - as two concurrent requests would not necessarily complete in the order in which they started.

This had no impact for the end user, but for verification it was vital to be able to see the V1 request/response set together with the V2 equivalents. In the end we solved this problem by adding an additional header to the request before it was sent to V1 - X-Request-Id. On the V1 side, the webserver (Apache) was configured to ‘echo’ this header:

Header echo ^X-Request-Id

… meaning that it would repeat the same header received in the request in the response, which the proxy would then publish to the subscriber. The subscriber was then able to persist the V1 response to the same document as the V1 request.
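On the proxy side, the tagging and lookup might look something like the following Ruby sketch. The helper names tag_request and correlation_id are hypothetical, as is the choice of a UUID for the id; the only detail taken from the setup above is the X-Request-Id header itself.

```ruby
require "securerandom"

REQUEST_ID_HEADER = "X-Request-Id"

# Returns a copy of the request headers with a fresh correlation id added.
def tag_request(headers)
  headers.merge(REQUEST_ID_HEADER => SecureRandom.uuid)
end

# Finds the echoed id in the response headers, ignoring header-name case.
def correlation_id(response_headers)
  _name, value = response_headers.find { |name, _| name.casecmp?(REQUEST_ID_HEADER) }
  value
end

# In-flight requests keyed by id, so responses can arrive in any order.
in_flight = {}
first  = tag_request({ "Accept" => "application/json" })
second = tag_request({ "Accept" => "application/json" })
in_flight[correlation_id(first)]  = "/orders/1"
in_flight[correlation_id(second)] = "/orders/2"

# The second request completes first; its echoed header still identifies it.
echoed = { "x-request-id" => second[REQUEST_ID_HEADER] }
puts in_flight.fetch(correlation_id(echoed)) # prints "/orders/2"
```

The case-insensitive lookup is there because HTTP header names are not case-sensitive, so the echoed header may not come back with the same capitalisation it was sent with.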

Handling failure

One of the good things about this setup is that almost all parts of both the parallelisation and the V2 system can be broken without impacting the user. If V2 itself is down, the only impact is on the channel subscriber.

If the subscriber is down, no messages are handled. The impact of this is a reduction in the amount of test data we have to analyse - it doesn’t affect real users. In our case, we used Redis pub/sub which does not persist messages - so if nobody is listening to the channel, the message is lost. It would be trivial, however, to change the publishing so it persisted messages onto a Redis list and used pub/sub to notify of new contents. The subscriber would then delete the list entry when it had successfully dealt with it - meaning that a newly-started subscriber could catch up on messages that were sent whilst no subscriber was running.
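The list-plus-notification idea can be sketched without Redis at all. The Ruby class below is a hypothetical in-memory stand-in: an array plays the part of the Redis list and a set of blocks plays the part of pub/sub. In a real version the three list operations would presumably map onto Redis commands such as RPUSH, LINDEX and LPOP, but that mapping is an assumption and is not exercised here.

```ruby
# In-memory stand-in for a Redis list with pub/sub notifications.
class DurableChannel
  def initialize
    @pending = []
    @listeners = []
  end

  def subscribe(&listener)
    @listeners << listener
  end

  # Store the message first, then notify whoever happens to be listening.
  def publish(message)
    @pending << message
    @listeners.each(&:call)
  end

  # Process messages oldest-first. An entry is deleted only after the block
  # returns without raising, so a crash leaves it queued for the next run.
  def drain
    until @pending.empty?
      yield @pending.first
      @pending.shift
    end
  end

  def backlog
    @pending.size
  end
end

channel = DurableChannel.new
channel.publish("request 1") # nobody is subscribed yet
channel.publish("request 2")

handled = []
# A newly started subscriber first catches up, then listens for new entries.
channel.drain { |message| handled << message }
channel.subscribe { channel.drain { |message| handled << message } }
channel.publish("request 3")
puts handled.inspect
```

Note that with more than one subscriber running, peeking and then deleting is no longer safe, as two workers could pick up the same entry; this sketch assumes a single worker.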

The biggest risk to the end user experience was Redis being down or inaccessible to the proxy - as the publishing process was synchronous and blocking. As it happened, we had an nginx router set up in front of the proxy already. This meant we could set nginx to try calling the proxy first, and if it failed, to instead call V1 directly. Now, both Redis and the proxy could be inaccessible and the end user experience would still not be impacted, apart from perhaps a small increase in response time.
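In our case nginx did this work, but the routing decision itself is simple enough to show in Ruby. The sketch below is only an illustration of the fallback logic, not nginx configuration: the route helper is hypothetical, each upstream is modelled as a callable, and a raised error stands in for a connection failure or timeout.

```ruby
# Try each upstream in order and return the first successful response.
def route(request, upstreams)
  last_error = nil
  upstreams.each do |upstream|
    begin
      return upstream.call(request)
    rescue StandardError => error
      # Remember the failure and fall through to the next upstream.
      last_error = error
    end
  end
  raise(last_error || ArgumentError.new("no upstreams configured"))
end

broken_proxy = ->(_request) { raise IOError, "Redis unreachable" }
v1_direct    = ->(request) { "V1 handled #{request}" }

puts route("GET /orders", [broken_proxy, v1_direct]) # prints "V1 handled GET /orders"
```

The small increase in response time mentioned above corresponds to the time spent waiting for the first upstream to fail before the second one is tried.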

Creating a Firehose

Testing our new system with real data revealed a good many bugs and missing features which, if they had been found on launch day, would have been cause for a rollback. In the event, the switch over to V2 was painless and easy, and we had tens of thousands of real requests to prove that it would work. And, because V2 had been live in the production environment for over two months, we knew it could handle the load of real requests, as well as their varying contents.

As ever, though, there were unforeseen benefits to this method - particularly in creating the Redis channel. Suddenly, it was easy to subscribe to a firehose of our requests which could be fed into monitoring systems and dashboards, collected to use as benchmarking data or simply watched by a curious eye in a sanitised format during the day. The more we did this, and the more oddities we observed on both our side and our implementers’, the more we grew our knowledge of our ecosystem and how it could be improved.

Focus Daily Standups on Value, Not Activity or Individuals

The daily standup (or daily scrum) is the bedrock of an agile process - and it should be kept enjoyable and useful, because that’s the only way that it’ll definitely continue to happen. A frequent complaint about this meeting, which erodes both its enjoyability and its utility, is a simple one - that it takes too long.

There are many reasons why the meeting goes on. It’s important that a team leaves totally in sync with one another - and sometimes this can take more than a few minutes, especially if an ad-hoc planning session emerges.

In many cases, I believe that the standard Scrum-style format of this meeting lays an unhelpful foundation on which many bad habits are built. I therefore advocate moving teams away from this activity- and individual-focused meeting towards a value-based or “right-to-left” standup.

The focus on activity and individuals

The standard, Scrum-style standup is well known - standing around your card wall, each member of the team answers the following three questions:

  1. What did I do yesterday?
  2. What will I do today?
  3. Is there anything standing in my way?

These questions are quickly learnt and are a good first step when adopting agile - they encourage clarity, commitment to one another and give an opportunity to state publicly if you are blocked (and are often the prompt for you to work out that you’re blocked!).

On a large agile team of ~7 people, though, going through these questions can start to take a while, and the whole meeting can take 20 minutes or more. I believe this is because the focus on activity creates a desire, and often a necessity, to talk not about your contribution to the team goal and the value you have delivered, but instead to justify your previous day’s work. Simply put, people often seem to be spending time proving that they “worked very hard” yesterday, and will work similarly hard today.

The impact of this is a meeting with content focused on individual achievements, rather than an assessment of the state of projects and a synchronisation of team members. If you’re trying to justify yourself, being brief is the last thing on your mind - talking for longer is better.

When you look around other team members in this sort of meeting, you will find them staring at the floor and zoned out, or if they have yet to speak, compiling their own list of achievements to shout about. It’s boring, it wastes time and the team leaves the meeting not much wiser than when they arrived.

A focus on value - the “right-to-left” standup

A value-based standup puts the focus of the standup meeting where it belongs - on the value that the work delivers.

The format is similar but different - the team still stands around their card wall, but they do not go through each individual and discuss their activity. Instead, a facilitator goes through each card on the wall, starting from the right-most card (the one that is nearest to being complete and thus providing business value), and simply asks: what is the next step required to move this card further right, and who is doing it?

Sometimes the answer is that the task is blocked, and if this block can’t be removed, that’s fine - move on to the next most valuable card (the next right-most card) and repeat the process.

Because the topic of conversation is always the work, each team member is engaged throughout the meeting, rather than only during the part where they are speaking, and the focus at all times is how the team will realise the most value today. The meeting is information-rich and fast-paced - meaning it’s not only shorter but also more enjoyable, which was the whole point.

Team members still make a personal commitment to each other about the work they will do today (one of the best parts of the activity-based standup), and there is no requirement that a manager or agile coach is always the facilitator, meaning the team can be self-organised and different team members are easily empowered to own the process. And of course, there’s something to be said for the variety of different people running the meeting - anything to keep it interesting!

Start small

I’ve successfully introduced this format to a number of teams, but it’s always been after at least a month or two of following the activity-based standup structure first. If your team’s standup is currently non-existent or is more of a nascent activity-based model, I’d suggest implementing activity-based standups first and then moving onto value-based soon after.

Can anybody share their experiences of going straight to a value-based standup?

Continuous Deployment With Feature Flagging

An important part of our product development process is being able to deploy new versions of features to only a portion of our users, so that we can gather feedback and improve whilst not disrupting the workflows of our entire customer base.

We do this using feature flags - this mixes perfectly with both continuous deployment and having a customer-focused process, and in this post I’ll discuss how we did it and the lessons we’ve learned.

Rollout gem

We chose to use the rollout gem to fulfil this requirement - it’s simple, lightweight and has just the right amount of features. For a detailed introduction to rollout, check out the Railscast Pro episode.

We were keen to allow our product managers to activate features for certain users using a UI - rather than developers activating users through a CLI. For this, we used rollout ui, which is a simple Rails engine/Sinatra app that interfaces with the rollout Redis stores. Throw a config.ru into a directory and deploy to Heroku, and you’re ready to rock. Check out the interface above - again, simple and just enough features to be useful.

Integrating into the application

So, we start with a test:

The step implementation is pretty straightforward - simple, generic features that call methods from a helper.

… and the helper:

As you can see, we’re disabling all rollout features after each run - this is just as essential as clearing out your primary transactional database.
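The reset idea can be illustrated with a small stand-in for the flag store. Everything in this Ruby sketch is hypothetical: FeatureFlags is an in-memory class written for the example, not the rollout gem, and its method names are chosen for readability rather than copied from any library. The point is only that whatever one scenario switches on must be switched off before the next one starts.

```ruby
require "set"

# Hypothetical in-memory stand-in for a Redis-backed flag store.
class FeatureFlags
  def initialize
    @users_by_feature = Hash.new { |hash, feature| hash[feature] = Set.new }
  end

  def activate_user(feature, user_id)
    @users_by_feature[feature] << user_id
  end

  def active?(feature, user_id)
    @users_by_feature.key?(feature) && @users_by_feature[feature].include?(user_id)
  end

  # Intended to be called from the test suite's after-each hook.
  def deactivate_all
    @users_by_feature.clear
  end
end

flags = FeatureFlags.new
flags.activate_user(:new_navigation, 42)  # scenario 1 turns the feature on
flags.deactivate_all                      # after-each hook runs
puts flags.active?(:new_navigation, 42)   # scenario 2 starts clean: prints "false"
```

Without the reset, the second scenario would silently inherit the first one's flags and could pass or fail depending on the order in which the tests ran.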

In terms of the implementation, we were keen to avoid sprinkling conditional logic/renders/routes throughout our application - so as to preserve code quality and to make the code easy to remove later. As such, we utilised the decorator pattern to decorate the subject objects with methods that could be used in the view:

… and the different versions of the decorator would provide different view artefacts:

Here, you can see that we implement two methods:

  • navigation_partial, which renders navigation links, and
  • show_path, which returns the path to the current resource.

In version 2 of the decorator, these methods are totally different:

… as you can see, we’ve retired the navigation partial entirely (but the view doesn’t need to know that, as this method will just return nil), and we’ve moved the resource to a RESTful location. By using this approach, the controller and view both stay entirely clear of conditional logic.
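A stripped-down Ruby sketch of this pattern follows. The method names navigation_partial and show_path come from the description above; everything else is a hypothetical stand-in for the purposes of the example, including the Project struct, the decorator class names, the partial name, the paths and the decorate helper.

```ruby
# Hypothetical resource; a real application would decorate a model object.
Project = Struct.new(:id)

class ProjectDecoratorV1
  def initialize(project)
    @project = project
  end

  def navigation_partial
    "projects/navigation"
  end

  def show_path
    "/project?id=#{@project.id}"
  end
end

class ProjectDecoratorV2
  def initialize(project)
    @project = project
  end

  # The navigation has been retired, so the view has nothing to render.
  def navigation_partial
    nil
  end

  # The resource has moved to a RESTful location.
  def show_path
    "/projects/#{@project.id}"
  end
end

# The single place where the feature flag is consulted.
def decorate(project, new_version_active)
  decorator = new_version_active ? ProjectDecoratorV2 : ProjectDecoratorV1
  decorator.new(project)
end

puts decorate(Project.new(7), false).show_path # prints "/project?id=7"
puts decorate(Project.new(7), true).show_path  # prints "/projects/7"
```

Because both decorators answer the same two messages, the view can call them without knowing which version it has been handed, and removing version 1 later means deleting one class and one branch.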

Code is temporary

Next time you’re agonising over the smallest detail of a piece of code you’ve written, remember that all code is temporary and, just like your project, will one day not exist. In the case of code written to support and test dual-running during feature rollout, its lifetime is often only a couple of weeks.

Our policy is that once a feature is ready to be released to all customers, then we make it the ‘default’ version of that feature, and remove any dual running. Using separate decorators and controllers rather than hundreds of if statements dotted around your codebase means that when a feature is ready to be deployed to the entire user base, you can remove this code easily and with confidence.

Integrating into our process

To represent feature flagging within our value stream, we added a ‘beta’ column as the penultimate state for a given piece of work. Not all tasks/features stay in this column (some skip it), but having the column there gives us visibility into the features that are currently in this state.

The customer is the focus of our process - our first column after the icebox is ‘Customer Conversations’, which represents the initial requirements refinement that happens with customers. In this context, customers means ‘end user’ - requirements gathering, business analysis and liaising with internal stakeholders follows this stage and refines it further. Having the ‘beta’ column at the other end of the value stream is a great way of understanding the success, in customer terms, of the work we’ve done to address their use cases.

Check out the sketch adjacent to see these portions of the board in action.

Lessons learned

Beta does not mean ‘done’

It’s easy to get a feature to beta and leave it there for way too long, especially if it won’t need active development during the feedback part of the cycle. The ‘beta’ column gives us great visibility into this, and we are always very clear about when a feature is in a beta state when talking to stakeholders. This makes it clear that there is still work to be done, and we’re not finished yet.

Feedback and internal releases

Although there are up-front benefits to this approach, it is not ‘free’ in terms of effort, so it’s important to make sure that you actually get the feedback from customers - email them and call them up, and if you’re releasing a feature for internal review, chase down your most fussy users and watch them use what you’ve released.

Redis reliability and hosting

In terms of technical implementation, this is the biggest question mark. Redis cloud/SaaS hosting is very much in its infancy, and we’ve found even the better-known providers to be far from ‘highly-available’ - by adding redis as a first-class dependency to your application, you’re introducing a reliance on your redis host that may not have existed before.

Deploying code and releasing features are not the same thing

The change in thinking that’s required for this approach is pretty simple - to un-learn the idea that pushing code to a production system is the only way to release new features to users, and that the two actions are inseparable. In fact, releasing features need not be a binary on/off state - it can be a sliding scale where you can roll things out gradually (and if they’re not quite right, roll them back).
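One common way to implement that sliding scale is to hash each user into a fixed bucket from 0 to 99 and compare the bucket with the current rollout percentage. The Ruby sketch below shows the idea; the helper name rolled_out_to? and the choice of CRC32 as the hash are assumptions made for the example, not a description of how any particular library does it.

```ruby
require "zlib"

# True when the user falls inside the rollout percentage. Hashing the id
# keeps each user in the same bucket from one request to the next.
def rolled_out_to?(user_id, percentage)
  Zlib.crc32(user_id.to_s) % 100 < percentage
end

user_ids = (1..1000).to_a
at_20 = user_ids.select { |id| rolled_out_to?(id, 20) }
at_50 = user_ids.select { |id| rolled_out_to?(id, 50) }

# Raising the percentage only ever adds users; nobody already included drops out.
puts (at_20 - at_50).empty?
```

Rolling back is the same operation in reverse: lowering the percentage removes the users in the highest buckets first, and setting it to zero switches the feature off for everyone.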

As a development team, it’s great to be able to keep code for new features in your master branch, knowing that they’re protected by a feature switch and won’t be released to users without your knowledge. Once you can do this, you can use branching solely for grouping changes to code together, rather than having to ‘hold’ work in branches because it can’t be deployed yet.

I love this idea, because it means that feature rollout benefits the engineering practices of a development team at the same time as providing a useful function to the wider business.

PS: There’s a talk!

This blog post is also a talk - most recently given at LRUG. The talk gives additional background into other approaches to continuous deployment, and also discusses examples of how we’ve used this approach. You can view my slides on Speaker Deck, or watch the video on Vimeo.

Implementing Company-Wide Agile

We’ve been looking recently at how we can use agile techniques not only within the product team, but also across the whole company, and over the last 6 months we’ve evolved our process. Always opting for a daily standup, we started with having the leaders from each team get together every morning - and later extended the attendance to the entire company. We followed a Scrum-style standup (“What did I do yesterday, what will I do today, what’s standing in my way”) and loved the way it kick-started our day.

Although we liked the inclusive nature of having the entire company together each morning, these meetings were not sufficiently information-rich to warrant the twenty minutes they would take, so we decided to review again. We wanted a process which would help us stay aligned on what was important, but not be so costly in terms of time that it felt like a burden.

About 6 weeks ago, we rolled out our new process. The biggest change was that we now focus only on strategic goals (measurable company-wide aspirations), and do not cover operational concerns at all. Every team has daily operational responsibilities - whether it be onboarding new customers, monitoring our inbound leads or keeping servers alive - but the focus of our company-wide agile process is moving the company forward on our bigger targets.

The day-to-day process goes something like this:

Monday Planning Meeting

We generally focus on two or three strategic goals at a time. Every Monday, the owners of those goals present their current status to the whole company, and add post-its representing the tasks we need to complete to move us forward to the card wall that lives in the board room. We also use this opportunity to provide context for the work and to explain any terminology that might not be commonly understood.

The card wall is simple - with horizontal ‘streams’ for each goal, split vertically into columns for ‘Pending’, ‘In Progress’, ‘Ready’ and ‘Done’ - straightforward and easy to understand. Multi-disciplinary teams work together on the goals, bringing together marketing, support, product, sales and analytics differently depending on the job in hand.

The sales team also present their weekly target and anticipated revenue, as well as their progress against their monthly target.

It’s a high-bandwidth meeting involving the whole company, and usually takes around 15 minutes. For the value we all get from it in terms of shared understanding, alignment and inclusivity on what everybody is working on, it’s worth every second.

Daily Standups

The daily standup happens around the card wall. We focus on tasks, not individuals, so rather than going around each person in the group, we go around the board, top to bottom. We update our progress, celebrating any completed work and highlighting any blockers. Even though there are 20+ people there, the whole thing takes about 5 minutes - and it’s still a great way to kick off the day.

Friday Retrospective and Knowledge Share

On Friday afternoon we take time to review the week’s progress (whilst also reviewing the taste of beer!). The owners of the goals present their status, and update us on the metrics we’re using to measure. The sales team report back on their weekly revenues. Across any of these, if progress is not what we’d hoped, we look at why and suggest how we might do better next week.

This is then followed by a general retrospective session, usually facilitated by Henry, our co-CEO. This is a great opportunity to talk honestly about what’s been happening that week, what we’re happy about and what we want to do better at - it’s also where we review our process and suggest improvements. Each retrospective we have is better than the last, as we get better at articulating feedback and comparing how we’re feeling with the same point the week before.

We often follow the retrospective with a knowledge share, where somebody will give a presentation on a subject relevant to the week - perhaps the output of some analysis, an explanation of a new feature, or a deep-dive into a change in the sales operation. This is our opportunity to learn more about work that we might not otherwise have known about - and is my highlight of the week.

Improvements

We’re by no means done and are still thinking about what the next steps are. As with a lot of agile processes, one of our biggest challenges is representing our work in a way that gives enough detail without giving too much - a topic that, as with many things, may be solved by focusing on the smallest possible discrete unit of business value.

Using State to Migrate Users Between Systems

The Problem

This week at work, we finished replacing our billing system (a bespoke implementation based upon the defunct invoicing gem) with integrations with Freshbooks and Recurly. These two systems would become the point of truth for everything invoicing - and our apps would access this information via a new component, Money Bags.

When it came to moving two years’ worth of invoicing data over into Money Bags, we were (as ever) very keen to avoid a ‘big bang’ approach - especially as it wasn’t going to be trivial to move the data over, and each customer would need to be set up in both external and internal services. In short, bunging the whole thing in a migration was not going to cut it, and we needed a smoother, less risky way to migrate our customers over - hopefully without them even noticing.

The Solution

Our solution was to put in place a few simple switches that would allow us to have, at any given point, some customers on the legacy system, and some customers on the new system. We gave each customer object knowledge of the point that customer was at in the migration - one of three states: old system (version ‘1’), migration completed and moved to new system (version ‘2’) or migration in progress (version ‘1-migrating’).

The implementation wasn’t sophisticated - just a varchar on the customers table. When the customer logged in and viewed their account, they would be directed to a different controller depending on which version of the billing system they were using. When it came to migrating a customer, we even gave our support team ownership of the process - this made sense because they are the people that most need to know where a given customer is in the migration. A simple page in the admin app allowed the support team to view each customer’s migration state, and also to kick off migrations and watch them run.

If the customer tried to view their account whilst they were being migrated, we let them know with a simple message that they were currently being migrated, and to get in touch with support if they had any questions. Support, of course, were in the position where they knew exactly what was going on.
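The routing described above amounts to a lookup from the stored state to the thing that should serve the page. A minimal Ruby sketch, with the three state values taken from the description and everything else hypothetical (the Customer struct, the billing_version attribute name and the controller names):

```ruby
# Hypothetical customer record; the state lives in a single string column.
Customer = Struct.new(:name, :billing_version)

# Maps the stored state to the controller that should serve the account page.
ACCOUNT_CONTROLLERS = {
  "1"           => "LegacyBillingController",
  "1-migrating" => "MigrationNoticeController",
  "2"           => "MoneyBagsBillingController"
}.freeze

def account_controller_for(customer)
  ACCOUNT_CONTROLLERS.fetch(customer.billing_version) do |version|
    # An unexpected value is a bug, so fail loudly rather than guess.
    raise ArgumentError, "unknown billing version: #{version.inspect}"
  end
end

puts account_controller_for(Customer.new("Acme", "1-migrating")) # prints "MigrationNoticeController"
```

Keeping the mapping in one place means the admin page, the customer-facing app and the migration process can all agree on what each state means.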

The migration state was not only used when presenting information to the end users - it was also used by the migration process to decide which customers to migrate in any given run. This meant there was complete cohesion between teams and components around where a given customer was.

Sometimes the migrations didn’t work perfectly. We invoice monthly and date maths is complicated, and there were a few edge cases in the data we hadn’t anticipated. Because we were using a stateful approach, however, these were easy to handle - if it was a complex issue we could roll the customer back to the legacy system, and if it was a simple fix we could simply run their migration again once we’d applied the patch. Either way, the customer was totally isolated from any problems we were having.
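A migration run built on that state might look roughly like the following Ruby sketch. It is an illustration only: the Customer struct and the run_migrations helper are hypothetical, and the real work of setting a customer up in the external services is represented by the block passed in.

```ruby
# Hypothetical customer record; the state lives in a single string column.
Customer = Struct.new(:name, :billing_version)

# Migrates every customer still on the legacy system. The block performs the
# real work and may raise if it hits a problem.
def run_migrations(customers)
  customers.select { |customer| customer.billing_version == "1" }.each do |customer|
    customer.billing_version = "1-migrating"
    begin
      yield customer
      customer.billing_version = "2"
    rescue StandardError
      # Roll back so the customer keeps using the legacy system and is
      # picked up again on the next run.
      customer.billing_version = "1"
    end
  end
end

customers = [Customer.new("Acme", "1"), Customer.new("Globex", "1"), Customer.new("Initech", "2")]
run_migrations(customers) do |customer|
  raise "unexpected invoice date" if customer.name == "Globex"
end
puts customers.map { |customer| "#{customer.name}: #{customer.billing_version}" }
```

Because the run selects on the state it also writes, re-running it after a fix needs no extra bookkeeping: failed customers are back in state ‘1’ and migrated ones are skipped.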

Next Time

This is the second time I’ve done a project like this. Compared to the approach of having the web app only support one of the two systems, and requiring all users to be migrated before the system can be deployed, this approach was way better. I’m sure that even with weeks of planning and doing test runs of the big bang migration, you’d still be caught out on live. This way, we shipped way earlier and, as any problems that occurred wouldn’t be affecting the customer, we could fix issues in a stress-free way.

If I was to do this migration again, my main change would be to have greater granularity in the states we assigned to customers. Because the migration process was composed of several steps, it would have been very useful at times to understand exactly where a failure had occurred, and to then be able to roll back only as far as necessary. Because we only had one state to indicate that a customer was currently being migrated, we would have to roll the whole way back if there was an error - this meant that some migrations took longer than they could have done.
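One way that finer granularity could work is to record the last completed step as part of the state, so that a retry resumes from there instead of starting over. The Ruby sketch below is speculative: the step names, the ‘1-migrating:step’ state format and the resume_migration helper are all invented for the example, and it assumes any step named in the state is one of the known steps.

```ruby
# Hypothetical step names, for illustration only.
MIGRATION_STEPS = %w[create_external_account copy_invoices verify_totals].freeze

# Runs the remaining steps in order, starting after the last completed one.
# Returns the new state: "2" on success, "1-migrating:<step>" naming the last
# step that completed, or "1" if nothing completed at all.
def resume_migration(state)
  completed = state.start_with?("1-migrating:") ? state.split(":", 2).last : nil
  start_at = completed ? MIGRATION_STEPS.index(completed) + 1 : 0

  MIGRATION_STEPS.drop(start_at).each do |step|
    begin
      yield step
      completed = step
    rescue StandardError
      return completed ? "1-migrating:#{completed}" : "1"
    end
  end
  "2"
end

# The first attempt fails on the final step; the retry repeats only that step.
state = resume_migration("1") { |step| raise "totals differ" if step == "verify_totals" }
puts state # prints "1-migrating:copy_invoices"
state = resume_migration(state) { |step| puts "running #{step}" }
puts state # prints "2"
```

This only helps if each step is safe to leave in place while the customer waits for a retry, which would need checking against the real external services.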

Overall, however, it was a success. My reward? Twofold - my task this week is to rip out the thousands of lines of code within the old billing system, but mostly, we’ve shown yet again that every project can be made simpler, easier and more delightful for all concerned.