A Chaos Test too far…

By Kathryn Downes & Arjun Gadhia

It was a warm summer’s afternoon. We were looking forward to an after-work beverage in the sunshine at The Market Porter. What better time, we thought, to do some harmless chaos testing in our staging environment?

Our team has been building Spark, an in-house content management system for creating and publishing digital content. The Financial Times has around 500 journalists around the world, across a dozen news desks. We have worked closely with a couple of desks since an early prototype, and to date we have around 80 users (journalists, editors, subeditors, picture and data journalists, etc.) who publish between 10 and 20 stories a day using Spark.

We wrote and edited this very blog post collaboratively in Spark!

Spark is a collaborative, web-based tool written in React and Node.js, and deployed to Heroku. The editor is built on the open-source library ProseMirror, the content is stored in MongoDB, and collaboration happens over WebSockets, with a Redis caching layer for the changes coming through.

As the team grew and the tool became more fully fledged, we became aware of knowledge silos starting to form. Inspired by our colleagues in Customer Products, we decided to host our very own Documentation Day. This was a fun (there were drinks and snacks) and productive way to both spread knowledge and decrease our operational risk.

Spark team intensely writing documentation

One of the things we learned from this exercise was that nobody in the team knew how to back up and restore the database. So the following day, two of us decided to go through the exercise in our staging environment and document the process.

This mini-chaos test taught us two valuable things that we wouldn’t have known otherwise:

  1. The database was set to back up every 24 hours. That interval wasn’t acceptable for us or for editorial.
  2. Restoring the back-up caused a few minutes of downtime (understandable), and even after it finished our apps needed restarting (this was less obvious and would have bitten us in an out-of-hours scenario).
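
To make the first point concrete, here is a minimal sketch of the arithmetic: the work a restore throws away is everything written since the last scheduled back-up, so the back-up interval is also the worst-case loss. The function names, the date and the times are illustrative assumptions, not values taken from Spark.

```python
from datetime import datetime, timedelta


def last_backup_before(moment: datetime, first_backup: datetime,
                       interval: timedelta) -> datetime:
    """Return the latest scheduled back-up at or before `moment`."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if moment < first_backup:
        raise ValueError("moment is earlier than the first back-up")
    completed = (moment - first_backup) // interval  # whole intervals elapsed
    return first_backup + completed * interval


def data_loss_window(moment: datetime, first_backup: datetime,
                     interval: timedelta) -> timedelta:
    """How much work a restore triggered at `moment` would throw away."""
    return moment - last_backup_before(moment, first_backup, interval)


# Hypothetical schedule: first back-up at 05:00, restore triggered at 17:30.
# The calendar date is arbitrary.
first = datetime(2019, 7, 1, 5, 0)
restore_at = datetime(2019, 7, 1, 17, 30)

print(data_loss_window(restore_at, first, timedelta(hours=24)))  # 12:30:00
print(data_loss_window(restore_at, first, timedelta(hours=2)))   # 0:30:00
```

Under these assumed times, a daily schedule loses most of a working day, while a two-hourly schedule loses at most two hours.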

So all in all, a raging success. We increased our production back-up frequency to every two hours, and then as the day came to a close we quickly ran through the process again in staging so we could write it up.

In staging.

Staging…

It wasn’t staging.

It was production.

Holy moly, Production?!! But restoring to a back-up causes downtime?! And the last back-up was at 5am?!

Yuup :scream::headinhands:

OK, this was bad. Spark was down and we had lost all the journalists’ changes from that day. This included important articles that were due to be published in the next couple of hours. The journalists were, understandably, going to be very annoyed.

We also faced losing all of our users’ trust and confidence that we had been working so hard to gain over the last few months. We would be back to square one if we didn’t get this fixed fast.

So, what did we do?

Instinctively, we wanted to try to stop the restore. Our staging test had shown us that the process took around 15 minutes, but we couldn’t find any way to cancel it in the UI. We fired off an email to our MongoDB provider, but it was too late to stop it. It was looking grim: the sinking feeling was sinking deeper, and any glimmer of hope of restoring the content was fading fast.

“Can we take a back-up of Redis?” someone shouted. Despite the high stress and adrenaline levels, the team somehow managed to recall that we keep a temporary cache of our articles in Redis, to speed up our collaborative editing feature. This cache has a short time-to-live, so it was imperative that we pulled it down as soon as possible!
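
Why the rush? Entries in a time-to-live cache simply vanish once they expire, so a snapshot only captures what is still alive when it runs. The sketch below illustrates the idea and is not Spark's actual code. `FakeCache` is a hypothetical in-memory stand-in, the `article:*` key pattern is an assumption, and the method names `scan_iter`, `get` and `setex` mirror those of the redis-py client so that `snapshot_cache` could in principle be pointed at a real one (which returns bytes unless created with `decode_responses=True`).

```python
import fnmatch
from typing import Callable, Dict, Iterator, Optional, Tuple


def snapshot_cache(client, pattern: str) -> Dict[str, str]:
    """Copy every live key matching `pattern` into a plain dict."""
    snapshot: Dict[str, str] = {}
    for key in client.scan_iter(match=pattern):
        value = client.get(key)
        if value is not None:  # the key may have expired since the scan
            snapshot[key] = value
    return snapshot


class FakeCache:
    """In-memory stand-in for a cache with per-key expiry (not real Redis)."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def setex(self, name: str, ttl_seconds: float, value: str) -> None:
        # Store the value together with the moment it stops being readable.
        self._entries[name] = (value, self._clock() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]

    def scan_iter(self, match: str = "*") -> Iterator[str]:
        for name in list(self._entries):
            if fnmatch.fnmatchcase(name, match) and self.get(name) is not None:
                yield name


# A controllable clock (seconds) keeps the demonstration deterministic.
now = [0.0]
cache = FakeCache(clock=lambda: now[0])
cache.setex("article:1", 600, "draft one")    # hypothetical 10-minute TTL
cache.setex("article:2", 3600, "draft two")   # hypothetical 1-hour TTL

now[0] = 900.0  # 15 minutes later, the first draft has already expired
late_snapshot = snapshot_cache(cache, "article:*")
print(sorted(late_snapshot))  # ['article:2']
```

A snapshot taken at time zero would have held both drafts. Fifteen minutes later only one survives, which is why the copy had to be taken straight away.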

The restore had completed and, as we had learnt to do earlier in the day, we restarted the app. Spark was now showing articles from 5am that morning. We had already told our stakeholders what was happening and, to their credit, they were doing a fantastic job of protecting us from the inevitable questions trickling in from users. They managed to compile a list of the missing articles from FT.com, and prioritised them based on urgency.

With a local Redis in place, we confirmed that all 37 missing articles were there! The data is stored in a slightly different format in Redis than in Mongo, but we had enough to be able to manually recreate all the content. It was a slow and finicky process, but together we managed to bring everything back exactly the way it was, and most of our users were none the wiser! :sweat_smile:

It was late. We were mentally and emotionally drained. It was time for the Market Porter.

So, thankfully, the crisis was averted. Phew! The incident was thoroughly stressful to deal with, but when we reflected on it afterwards there were a lot of positives to take from it. We had learned a great deal about what we did and didn’t know, and as a result were able to identify areas for improvement. It also demonstrated that we had been able to pull together and work really effectively as one team. We felt proud of how we responded.

These are some of the things we learnt and some of the changes we suggested:

  • We now back-up the database much more frequently.
  • Confusing Heroku database names had made the mistake too easy to make, so we changed them to be super clear: STAGING and PRODUCTION.
  • We updated our runbook to include instructions on data recovery, and improved our on-boarding wiki after realising that some newer team members didn’t have everything installed, e.g. a Mongo client.
  • Having one person take charge of coordinating the response to a significant incident is useful. They assigned tasks and liaised with our stakeholders/users, which let the other engineers focus on the fix.
  • We didn’t just need engineers to help us recover from this incident. The input and knowledge of our editorial stakeholders was vital, and good communication with all members of the team, not just the technical ones, was really important.
  • Chaos testing is really useful and we plan to do more of it in the future!
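
Clear names, as in the second point above, can be paired with a guard in tooling. The following is a hypothetical sketch of one common pattern, not something the post says the team built: destructive operations against a protected environment go ahead only if the operator types the environment name back. The `PROTECTED` set, the function name and the confirmation phrase are all assumptions for illustration.

```python
# Environments where a restore must be confirmed explicitly (assumed names).
PROTECTED = {"PRODUCTION"}


def may_restore(target: str, typed_confirmation: str) -> bool:
    """Allow a restore only if `target` is unprotected or correctly confirmed."""
    env = target.strip().upper()
    if env not in PROTECTED:
        return True
    # The operator has to type the environment name back, which makes it
    # much harder to restore the wrong database on autopilot.
    return typed_confirmation.strip() == f"restore {env}"


print(may_restore("STAGING", ""))                       # True
print(may_restore("PRODUCTION", ""))                    # False
print(may_restore("PRODUCTION", "restore PRODUCTION"))  # True
```

The check costs a few seconds on the rare occasions a production restore is intended, and blocks the case where someone believes they are in staging.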

Fortunately, we were able to respond quickly and fix this mistake, learning a lot and improving the app in the process. However, we all know that any one of us (even a Principal Engineer :grimacing:) could make some other mistake in the future. We are only human!

At the FT we are lucky to work in a culture of no blame, where we recognise the complexity of our work and accept the inevitability of things going wrong. You are even encouraged to make mistakes, since they provide such rich learning experiences. When something does go wrong, we are supportive rather than rushing to pin blame on one individual. Try to look past the inevitable sinking feeling and look for opportunities for improvement.

Kathryn Downes — Engineer, Internal Products

Arjun Gadhia — Principal Engineer, Internal Products

P.S. We’re hiring!

