Improving Monitoring and Alerting at the UK’s Government Digital Service

In 2011, the Government Digital Service was a small team focused on delivering new services to users as quickly as possible. By 2017, it had grown into a large organisation running mature services. Unfortunately, technical debt had built up, slowing teams down and causing problems.

I was part of a team focused on improving the way teams monitored the health of their service and dealt with alerts when things went wrong.

A dashboard the team built for alerting and monitoring

THE PROBLEM

The UK’s Government Digital Service is seen as a world leader in research and design. It introduced agile ways of working to the government and transformed online public services. It did this by empowering small teams, giving them clear goals and letting them decide how they wanted to work.

As GDS had grown, different teams had emerged, all using different tooling. This meant a high cost for developers changing teams. The speed at which services had been established had also left a high level of technical debt in the way tooling was set up. This was slowing teams down.

There was also a general wariness towards ‘fixing the plumbing’: a migration from one hosting platform to AWS, predicted to take 6 months, had already taken over 18 months and was still going. This meant teams were living with the tech debt rather than looking for change.

A new department, ‘Reliability Engineering’, had been created to solve this. It had two aims:

1. Provide a standardised set of tooling teams could use.

2. Help teams work faster by removing technical debt.

Part of this focused on how teams monitor the health of their services and alert when things go wrong. This involves deciding which bits of your service to monitor, deciding when an alert should go off, and how to work out what needs fixing. Doing this well avoids broken services and problems for users, as well as saving time and stress for product teams.

I was working on several teams, including the Monitoring and Observing Team. Their role was to improve how teams monitored the health of their service and dealt with alerts. The proposal was to use Prometheus (https://prometheus.io) as the standard tool across GDS.
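To make that concrete, here is a minimal sketch of the kind of instrumentation a team would add once it adopted Prometheus. It uses the official Python client library; the service name, metric names, labels and port are illustrative, not anything a GDS team actually ran.

```python
# A minimal sketch of instrumenting a service for Prometheus.
# Uses the official prometheus_client library (pip install prometheus-client).
# Metric names, labels, and the port are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Count requests, labelled by outcome, so error rates can be derived later.
REQUESTS = Counter(
    "myservice_http_requests_total",
    "Total HTTP requests handled by the service",
    ["status"],
)

# Track how long requests take, which feeds latency dashboards and alerts.
LATENCY = Histogram(
    "myservice_request_duration_seconds",
    "Time spent handling a request",
)

def handle_request():
    """Stand-in for real request handling."""
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.1))
    status = "500" if random.random() < 0.01 else "200"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    # Expose the metrics on /metrics for the Prometheus server to scrape.
    start_http_server(8000)
    while True:
        handle_request()
```

Once every service exposes metrics in this common format, dashboards and alerts can be built against a single query language (PromQL) rather than a different tool per team.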


KEY GOALS

As a newly formed team, we wanted to understand whether there was a need for what we were offering, and what the barriers to adoption might be. There had been no initial research to find out whether the problem we had been created to solve was a real problem for teams. This was the first thing we needed to find out.

We also needed to work out the best way of setting up our offering. The assumption was that teams would prefer a SaaS version that our team managed. We wanted to confirm this, as it would be difficult to change later.


THE TEAM

The team consisted of a Product Manager, eight Site Reliability Engineers, a Delivery Manager, and me. I joined as the first User Researcher three months after the team had been set up.

MY ROLE

I managed the research from start to finish, including recruitment, and also ran research and research skills training for other teams. I was one of two researchers among 80 people in the Reliability Engineering department, so I tried to make sure the research I was doing was useful not just for the team, but for the department as a whole.


UNDERSTANDING THE USER

Assumption setting with the team

This project was interesting in that the users were all internal. We had two key user groups: Product Managers and Back End Developers on teams inside GDS.

Product Managers were chosen because they would make the decision to prioritise the work involved in transitioning tooling. If we didn’t understand their needs, we might have a technically perfect solution that was never used.

Back End Developers would be the people using the alerting and monitoring, so we needed to understand whether what we were building worked for them. If it didn’t, we would be spending large amounts of time and money only to potentially make the situation worse.

We focused on understanding how teams were currently managing alerts and monitoring, as well as the perceived pros and cons of their current setup. We then looked at the perceived value of moving to Prometheus, and the potential challenges involved.

For Product Managers, we also looked at how well they currently understood the value of monitoring and alerting.


HIGH LEVEL TIMELINE

This work took place over two fortnight-long sprints. The first sprint covered an assumptions workshop, deciding on the research questions, and starting recruitment. The second sprint covered the interviews, group analysis as a team, and the findings playback.

I did a total of 12 60-minute depth interviews. This represented a third of the total population, so while the sample was smaller than I would normally like for this type of research, it felt representative.

I presented the work to multiple teams and senior stakeholders over the following weeks as a result of widespread interest.


BREAKING DOWN THE PROCESS 

I was new to the team, and they had already started building things, but weren’t sure if they were moving in the right direction, or if anybody wanted what they were offering.

Firstly, I ran an assumptions workshop. This focused on getting everybody to write down their assumptions: about their users, about the product, and about the direction we were taking. This was slightly challenging, as it involved helping people understand what counted as an assumption, and that some of the things they thought were true were merely assumptions.

This gave us an ordered list of the riskiest assumptions we were making as a team. The most important assumption was that people wanted to improve their monitoring. If this was false, we had no market.

Working from the list of assumptions, I sat down with the Product Manager and the Lead Developer to work out what information would be most useful to them.

These two steps were crucial for a number of reasons. Firstly, they helped me understand the team’s current knowledge, so I didn’t come back with things they already knew.

Secondly, they helped the team engage with the research process, which meant the findings were more likely to be taken on board. Finally, they set expectations about what was possible, so nobody was disappointed by what the research could and couldn’t tell them.

Recruitment

All of the participants were internal users, and I managed the recruitment myself.

Being unable to offer incentives meant I spent a lot of time finding the right people to speak to. This involved lots of emails, walking around to find people, and asking for introductions. Previous experience had taught me that direct introductions are the best way of signing people up.

This approach was time-consuming, and often involved several rounds of introductions, but it meant I was able to recruit everybody I approached.

Dealing with anonymity

Internal research also brings its own challenges around participant anonymity. If you have key teams to target, and each team has only one Product Manager, your own team can generally work out who you are speaking to. I also wanted team members to attend sessions, which made anonymity effectively impossible.

I had both ethical and data-quality concerns about this, so I took steps to mitigate it.

Firstly, I was upfront with participants that I couldn’t offer them full anonymity. Secondly, I explained my concerns to the team, and we agreed on the best way of dealing with it: people who attended the sessions didn’t tell other team members who we had spoken to, sign-up sheets recorded ‘role’ rather than ‘team’, and I asked people not to try to guess the teams or users behind quotes. Finally, I offered all participants the chance to review the findings beforehand and check that they felt comfortable with what I was saying.

As a result, I got very honest answers from users and was confident in the data we collected. There were also no repercussions for any users, and no complaints about being misrepresented.

Understanding users’ context

For the interviews, I asked users to bring their laptops so that they could show us their setup as they talked. This made a huge difference to the level of detail I got, and let the developers watching ask more detailed, technical questions.

I also obtained screenshots of the dashboards different teams were using to display alerts and monitoring, as well as the various tools they used. I used these to build a ‘wall of dashboards’, which was a really useful reference for the team when making decisions about the alerts and metrics we should offer.

A team explaining their alerting and monitoring setup

Key findings

The biggest priority for teams is metrics that tell them about user impact

My team had been thinking about metrics they could provide in terms of the technical aspects of the service. By contrast, teams were thinking about alerting in terms of avoiding impact on the end user. This helped change the language we used to talk about alerting to teams.

An example of a team’s dashboard
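To illustrate the difference in framing (this example is mine, not taken from any GDS dashboard), the sketch below contrasts a purely technical signal with one expressed in terms of user impact. The PromQL expressions assume standard node_exporter metrics and a conventional http_requests_total counter; real metric names vary by team.

```python
# Illustrative contrast between a "technical" metric and a "user impact" metric.
# Assumes node_exporter metrics and a conventional http_requests_total counter;
# the exact metric names depend on each team's instrumentation.

# Technical framing: how busy are the machines?
CPU_USAGE = '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))'

# User-impact framing: what fraction of requests are currently failing?
ERROR_RATIO = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m]))'
)

if __name__ == "__main__":
    for name, expr in [("cpu usage", CPU_USAGE), ("error ratio", ERROR_RATIO)]:
        print(f"{name}: {expr}")
```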

The alert thresholds teams used were mostly arbitrary

We had wanted to understand how teams decided on alert thresholds (the point at which an alert goes off). I learnt that they were mostly a stab in the dark at first, and then tweaked based on whether people felt alerts were going off too often, or after a major problem.
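To show why those thresholds were so easy to guess at, here is a sketch of what a threshold amounts to in practice: a single number compared against a query result. It reuses the hypothetical latency histogram from the earlier instrumentation sketch and queries Prometheus over its HTTP API; the server URL, the expression and the 0.5-second threshold are all illustrative.

```python
# Sketch of what an alert threshold boils down to: one number compared against
# a query result. Uses Prometheus's HTTP query API (GET /api/v1/query).
# The server URL, the expression, and the threshold value are illustrative only.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical

# 95th-percentile request latency over the last 5 minutes, using the
# hypothetical histogram from the earlier instrumentation sketch.
QUERY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(myservice_request_duration_seconds_bucket[5m])))"
)

# The "arbitrary" part: this number usually started life as a guess and was
# only revisited when alerts fired too often or a real incident slipped through.
LATENCY_THRESHOLD_SECONDS = 0.5

def latency_threshold_breached() -> bool:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:  # no data, nothing to compare against
        return False
    current = float(result[0]["value"][1])
    return current > LATENCY_THRESHOLD_SECONDS

if __name__ == "__main__":
    print("Would alert:", latency_threshold_breached())
```

In a real setup the same comparison would live in a Prometheus alerting rule evaluated by the server rather than in a script, but the threshold is still just a number somebody has to pick, which is why tuning it well matters.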

People were regularly ignoring alerts

A consequence of this was that people were ignoring alerts, because the alerts weren’t set up properly or weren’t useful. This is a big problem because it reduces the overall effectiveness of the alerting and monitoring.

People were also reluctant to get rid of alerts, or didn’t feel able to, even though those alerts were causing lots of problems, including people being woken up in the night for no reason.

Teams didn’t want us to manage their monitoring because it created a layer of abstraction

There had been an assumption that teams would be happy to let another team manage their alerting. This was not the case. Teams were worried about adding in longer feedback loops, and making it harder to debug things. This was seen as a huge downside to adoption.

Most teams recognised there were problems with their setup

Most teams were in a situation similar to these canoeists: they knew they had problems with the way they managed alerting and monitoring, but had become used to things being slow and broken. They weren’t making any effort to improve things, despite it holding them back and wasting time and effort. (I used this gif when presenting to add some lightness to what might otherwise have been seen as criticism. It became something people referred back to repeatedly.)

Teams were looking for ‘good enough’ rather than perfect

Partly, this was because teams had a more pragmatic approach to alerting. My team was focused on providing a high-quality solution, but other teams just wanted something that was good enough. This reduced the perceived value of our solution, which was seen as overkill.

Impact of the research

The key finding from this research was the concern people had about having somebody else manage their DevOps tooling. Taking on the management of the tooling had been seen as a huge win for the Reliability Engineering team, and a key driver of adoption of the services it was offering.

Discovering that teams were generally negative about letting these services be managed by another team shifted the strategy of the entire programme.

The research also helped the team realise that they thought and cared about monitoring a lot more than the people on other teams did. People didn’t know what a good alert looked like, so they didn’t understand the impact one could have. We gave a lot of presentations to teams, making sure we talked about the benefits using the language users had used in the research sessions.

As a team, we realised that we needed to do three things:

  1. Shift to an offering that could be owned by teams rather than managed by us
  2. Start helping teams understand what good looked like
  3. Do as much of the transition work as possible

Lessons Learned

Validating product-market fit is scary but important

One of the things I learnt is that even though it’s scary to question whether the thing you have been told to build is actually wanted, doing so is incredibly valuable. I was lucky that the team was open to asking this question early on.

While we were unable to fully change the direction, we helped change the scope and, as a team, were able to push back much more quickly. As a result of this work, I’ve tried to get teams to engage with this question as early as possible.

You need to talk about your findings repeatedly

I also learnt the value of talking about the same research repeatedly and presenting it multiple times. Previously, I’d felt pressure to continually deliver new findings. Because we were doing research internally, the number of people I could speak to was limited, and I had to be careful not to exhaust their goodwill. That meant I had to make sure the research really sank in, which meant repeating it, constantly referring back to it, presenting it to lots of teams, and shouting about it.

I presented it to lots of teams as I believed they would find it useful. I later had other teams proactively asking me to present it to them. 

Repetition helped the research have impact. I now focus on getting as much as I can out of every piece of research I conduct.

