Introducing the GitHub Availability Report

栏目: IT技术 · 发布时间: 4年前

内容简介:Historically, GitHub has published post-incident reviews for major incidents that impact service availability. Whether we’re sharing new investments to infrastructure or detailing site downtimes, our belief is that we can collectively grow as an industry b

Historically, GitHub has published post-incident reviews for major incidents that impact service availability. Whether we’re sharing new investments to infrastructure or detailing site downtimes, our belief is that we can collectively grow as an industry by learning from one another. This month, we’re excited to introduce the GitHub Availability Report.

What can you expect?

On the first Wednesday of each month, we’ll publish a report describing GitHub’s availability, including a description of any incidents that may have occurred and update you on how we are evolving our engineering systems and practices in response. You should expect these updates to include a summary of what happened, as well as a technical explanation for incidents where we believe the occurrence was novel and contains information that helps engineers around the world learn how to improve product operations at scale.

Why are we doing this?

Availability and performance are a core feature, including how GitHub responds to service disruptions. We strive to engineer systems that are highly available and fault-tolerant and we expect that most of these monthly updates will recap periods of time where GitHub was >99% available. When things don’t go as planned, rather than waiting to share information about particularly interesting incidents, we want to describe all of the events that may impact you. Our hope is that by increasing our transparency and sharing what we’ve learned, rather than simply reporting minutes of downtime on a status page, everyone can learn from our experiences. At GitHub, we take the trust you place in us very seriously, and we hope this is a way for you to help hold us accountable for continuously improving our operational excellence as well as our product functionality.

Availability Report for May and June

In May and June, we experienced four distinct incidents resulting in a lack of availability or degraded service for GitHub.com.

May 5 00:45 UTC (lasting for two hours and 24 minutes)

During the incident, a shared database table’s auto-incrementing ID column exceeded the size that can be represented by the MySQL Integer type (Rails int(11)): 2147483647. When we attempted to insert larger integers into the column, the database rejected the value and Rails raised an ActiveModel::RangeError, which resulted in 500s from our API endpoint.

This impacted GitHub apps that rely on getting installation tokens. The top affected GitHub apps internally included Actions, Pages, and Dependabot.

GitHub’s monitoring systems currently alert when tables hit 70% of the primary key size used. We are now extending our test frameworks to include a linter in place for int / bigint foreign key mismatches.

May 22 16:41 UTC (lasting for five hours and nine minutes)

During a planned maintenance operation (failing over a MySQL primary instance) we experienced a novel crash in the mysqld process on the newly promoted MySQL primary server. To mitigate the impact of the crash, we manually redirected traffic back to the original primary. However, the crashed MySQL primary had already served approximately six seconds of write traffic. At this point, a restore of replicas from the new primary was initiated which took approximately four hours with a further hour for cluster reconfiguration to re-enable full read capacity. For a period of approximately five hours, users may have observed delays before data written to the affected database cluster were visible in the web interface and API.

We’ve run multiple internal gameday exercises in response to ensure a higher degree of preparedness for similar topology inconsistencies and will continue to exercise our automated failover systems to reduce recovery time.

June 19 08:52 UTC (lasting for 51 minutes)

Changes to better instrument A/B experimentation for UI improvements introduced an unknown dependency on the presence of a specific, dynamically generated file that is served by a separate application.

During an application deployment, the file failed to be generated on a significant proportion of the application deployments due to a high retrieval rate being rate limited by the upstream application. This resulted in site-wide application errors for a percentage of users enrolled in the experiment. Upon detection, we were able to disable the requirement on this file which restored service to all users.

Going forward, configuration for A/B and multivariate experiments will be cached internally to ensure successful propagation of dependencies.

June 29 12:03 UTC (lasting for two hours and 29 minutes)

As part of maintenance, the database team rolled out an updated version of ProxySQL on Monday, June 22. A week later, the primary MySQL node on one of our main database clusters failed and was replaced automatically by a new host. Within seconds, the newly promoted primary crashed. Orchestrator ’s anti-flapping mechanism prevented a subsequent automatic failover. After we recovered service manually, the new primary became CPU starved and crashed again. A new primary was promoted which also crashed shortly thereafter. To recover, we rolled back to the previous version of ProxySQL and disabled a change in our application that had required the new ProxySQL version. When this completed, we were able to allow writes on the primary node without it crashing.

We are analyzing application logs, MySQL core dumps, and our internal telemetry as part of continued investigation into the CPU starvation issue to avoid similar failure modes going forward.

In summary

As an organization we continue to invest heavily in reliability. We treat each incident discussed here as an invaluable opportunity from which to learn and grow. Our systems and processes continue to evolve based on these learnings and we look forward to sharing our progress in future updates.

Please follow our status page for real time updates and watch our blog for next month’s availability report.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

人件

人件

Tom DeMarco、Timothy Lister / UML China / 清华大学出版社 / 2003-6 / 35.00元

《人件(第2版)》专门讨论了软件开发和维护的团队管理问题,并向人们的传统认识提出了挑战。作者汤姆·迪马可,蒂姆·李斯特在书中推崇人本管理思想,指出知识型企业的核心是人,而不是技术。《人件(第2版)》于1987年首次出版后,曾在西方引起了轰动,被誉为“对美国软件业影响最大的一本书”。《人件(第2版)》还对大中型组织中的软件开发团队如何运作进行了深入探讨。《人件》已成为软件图书中的经典之作。它和《人月......一起来看看 《人件》 这本书的介绍吧!

URL 编码/解码
URL 编码/解码

URL 编码/解码

MD5 加密
MD5 加密

MD5 加密工具

XML、JSON 在线转换
XML、JSON 在线转换

在线XML、JSON转换工具