Crates.io incident report for 2020-02-20

栏目: IT技术 · 发布时间: 5年前

内容简介：On 2020-02-20 at 21:28 UTC we received a report from a user of crates.io that their crate was not available on the index even after 10 minutes since the upload. This was a bug in the crates.io webapp exposed by a GitHub outage.In some corner cases the code

On 2020-02-20 at 21:28 UTC we received a report from a user of crates.io that their crate was not available on the index even after 10 minutes since the upload. This was a bug in the crates.io webapp exposed by a GitHub outage.

Root cause of the outage

In some corner cases the code that uploads new commits to the GitHub repository of the index was returning a successful status even though the push itself failed. The bug caused the job scheduler to think the upload was actually successful, causing the job to be removed from the queue and producing a data loss.

The outage was caused by that bug, triggered by an unexpected response during the GitHub outage happening at the same time.

Resolution

The team analyzed the code of the background job uploading commits to the index, and found a possible cause of the misreported success. A team member wrote the fix , another one reviewed it and we then deployed the patch directly to production.

At the same time, once we saw the index started to be updated again, we removed the broken entries in the database manually and asked the reporter to upload their crates again.

Affected crates

Postmortem

Deploying the change took way longer than expected: there were changes landed in master but waiting to be deployed on production, increasing the length of the build process and the risks of the deploy. In the future we should deploy hotfixes by branching off the current deployed commit, and cherry-picking the fix on top of that. We should also strive to reduce the amount of time PRs sit in master without being live.

Nobody was paged due to this incident, as our monitoring and alerting system wasn’t able to catch the problem: we have monitoring in place for jobs failing to execute, but in this case the job was mistakenly marked as correct. We should implement periodic checks that ensure the database and the index are correctly synchronized.

We were lucky that two members of the team with access to both the support email and the production environment were online during the outage: without paging available we could’ve noticed it way later than we did.

During the incident investigation we also found that our logging was not good enough to properly diagnose the problem: there is no message logged when a commit is pushed to the index, nor when a background job is executed. Also, the API call to publish new crates doesn’t include the crate name in its line. We should enhance our logging capabilities to find the root cause of issues quickly during future incidents.

Timeline of events

It took 1 hour and 31 minutes from the start of the incident to the deploy of the fix.

2020-02-20

21:17 UTC: the authors of kaze , wasmer-runtime-core and wasmer-win-exception-handler published them on crates.io
21:28 UTC: the author of wasmer-runtime-core and wasmer-win-exception-handler reports the issue to help@crates.io
21:31 UTC: GitHub updates their status page to report an outage
21:33 UTC: Pietro notices the support mail, pings Sean on Discord, Sean starts investigating
21:35 UTC: Pietro got back to the author saying that the team was investigating
21:37 UTC: Sean and Pietro find the symptoms of the incident
21:50 UTC: Sean finds a possible cause for the bug
22:01 UTC: Sean deletes the affected versions from the database
22:09 UTC: Sean opens up PR 2207 with the fix
22:16 UTC: GitHub updates their status page to say the problem is fixed
22:17 UTC: Pietro asks for changes on the PR
22:20 UTC: Sean addresses Pietro’s concerns in the PR
22:23 UTC: PR merged, Sean deploys it directly to master
22:48 UTC: Fix deployed on production

2020-02-21

kaze
kaze

Action items

#2226 : Add simple logging when we start the index publishing process.
#2227 : Add a periodic job that checks the index and the database for consistency, paging the on-call person if there are any mismatches. The job will need to account for the crates not yet published on the index but in the queue.
#2228 : Include the crate name in the HTTP log entry for the publish API call.
#2229 : Add in-depth logging for swirl background jobs, with information such as the job name or the parameters.
#2230 : Investigate whether we want to implement a self-healing feature to automatically synchronize the index in cases of mismatch.

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

Crates.io incident report for 2020-02-20

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

Node即学即用

[英] Tom Hughes-Croucher、[英] Mike Wilson / 郑达韡 / 人民邮电出版社 / 2013-2 / 39.00元

《Node即学即用》由休斯－克劳奇、威尔逊编著，《Node即学即用》讲解如何用Node构建可扩展因特网应用，是全面的实用指南，除了详细介绍Node提供的API外，还用大量篇幅介绍了服务器事件驱动开发的重要概念。内容涉及跨服务器的并发连接、非阻塞I/O和事件驱动的编程、如何支持各种数据库和数据存储工具、NodeAPI的使用示例等。适合对JavaScript及编程有一定程度了解的读者阅读。一起来看看《Node即学即用》这本书的介绍吧!

码农工具