Team Structure for Software Reliability Within Your Organization

Category: IT Technology · Published: 4 years ago


How well positioned is your team to ship reliable software? What are the different roles in engineering that impact reliability, and how do you optimize the ratio of software engineers to SREs to DevOps within teams? These questions can be hard to answer in a quantifiable way, but projecting different scenarios using systems thinking can help. Will Larson’s blog post Modeling Reliability does just that, and serves as inspiration for this article. In this article, we will abstract away the mathematical formulas for a broader audience to intuitively understand team construction for reliability.

Systems thinking is a methodology for analyzing complex ideas. It allows us to zoom out to a higher level and examine the interactions between components within the system. This can help us simplify the complexity and see the bigger picture. The best way to explain what we mean is through an example, so let’s dive right in.

Reliability in the simplest possible system 

First, we need to define a goal for our system. The goal we set for our reliability system is maintaining stability in functionality while we continue to ship new features. Your company may have a different definition, and that would impact your system model.

To tell how stable our functionality is, we need to measure how many occurrences of instability there are (in other words, the number of incidents). As we continuously ship new features, some will unfortunately result in incidents due to unforeseen dependencies or vulnerabilities. When our team resolves an incident, it becomes mitigated.

In this example, we don’t care what the features are or how many there are. We just care that some of them are defective and become incidents. Similarly, once we start mitigating incidents, the main metric we’re evaluating is the rate (speed and quantity) at which we mitigate them.


This is the simplest model that lets us understand how we’re doing with reliability. At a minimum, if we instrument the number of active incidents, the defect rate, and the mitigation rate, we can quantify how reliable our system is. From there we can develop a program for handling incidents that lowers the defect rate and raises the mitigation rate until both fall within ranges we find acceptable.
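The three quantities above can be read as a stock (active incidents) with one inflow (defects) and one outflow (mitigations). The following is a minimal Python sketch of that reading, not the authors' own model. The function name, the weekly time step, and every numeric value are illustrative assumptions.

```python
def simulate_incidents(weeks, features_per_week, defect_rate, mitigation_rate, active=0.0):
    """Return the number of active incidents at the end of each week.

    defect_rate:     assumed fraction of shipped features that become incidents
    mitigation_rate: assumed fraction of open incidents resolved each week
    """
    history = []
    for _ in range(weeks):
        new_incidents = features_per_week * defect_rate  # inflow
        mitigated = active * mitigation_rate             # outflow
        active = active + new_incidents - mitigated
        history.append(active)
    return history


if __name__ == "__main__":
    history = simulate_incidents(weeks=52, features_per_week=20,
                                 defect_rate=0.10, mitigation_rate=0.50)
    print(f"active incidents after one year: {history[-1]:.2f}")
```

In this toy model the stock settles where inflow equals outflow, that is, at new incidents per week divided by the mitigation rate (2 / 0.5 = 4 with the values above). Lowering the defect rate or raising the mitigation rate both reduce that level.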


Modeling the change rate of features and deploys

To scope our system more accurately, we also take features and deploys into consideration. The rate of new deploys isn’t constant, partly because the number of developers in a company isn’t constant. As our company grows, the quantity of developers also grows, as (hopefully) do investments in tooling and automation to improve innovation velocity. With these changes, the quantity of features increases.


However, as new services get onboarded and operational complexity grows within a system, the volume of incidents increases at a rate that is likely much faster than the rate of hiring additional engineers who can handle them. This means you’ll reach a bottleneck where you cannot ship more features because your engineers will be spending all their time fixing ongoing problems.  

In fact, this becomes a huge issue in Gene Kim’s “The Phoenix Project.” In this novel, an engineer named Brent becomes a massive bottleneck for the critical application Phoenix. Brent is often so busy with mitigation work and helping others that he has little to no time to devote to the project itself. One of the book’s most important statements is that you can only move as quickly as your bottleneck allows, and that any improvement you make to the system is next to worthless if it doesn’t help the bottleneck. In this context, the bottleneck will likely be engineering hours. You simply won’t have enough time both to fix the issues and to ship new features.
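To make the engineering-hours bottleneck concrete, here is a small Python sketch under stated assumptions: headcount grows linearly, incident volume compounds each quarter, and each incident costs a fixed number of hours. None of these numbers or names come from the article; they are hypothetical and chosen only to show the shape of the problem.

```python
def feature_hours(engineers, incidents, hours_per_engineer=500.0, hours_per_incident=40.0):
    """Hours left for feature work in one quarter after incident response."""
    total = engineers * hours_per_engineer
    firefighting = min(total, incidents * hours_per_incident)
    return total - firefighting


def first_saturated_quarter(max_quarters=40, engineers=10, hires_per_quarter=2,
                            incidents=20.0, incident_growth=1.25):
    """Return the first quarter with no hours left for features, or None."""
    for quarter in range(max_quarters):
        if feature_hours(engineers, incidents) == 0:
            return quarter
        engineers += hires_per_quarter  # linear headcount growth (assumed)
        incidents *= incident_growth    # compounding incident growth (assumed)
    return None


if __name__ == "__main__":
    print("first quarter with zero feature hours:", first_saturated_quarter())
```

With these made-up defaults the model runs out of feature hours in quarter 15, even though the team has grown from 10 to 40 engineers by then. If incident volume stays flat (`incident_growth=1.0`), the bottleneck never appears within the simulated horizon.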

The two types of reliability work

So how do we prevent incidents from overwhelming the system? By adding components to our system to lower the defect rate and to increase the mitigation rate. As a company grows bigger, dedicated staff would likely be required for this effort. That’s where SREs come in.  


There are two main types of reliability work. The first is mitigation, a linear fix that’s often referred to as firefighting: you fix problems as they come. The second is change management, a non-linear fix that proactively reduces the defect rate through projects like migrating to better tools and refactoring spaghetti code. While SREs support both types of work, they should spend more time on the latter. Google’s SRE handbook states that SREs should allocate approximately 50% of their time to this sort of project work for the company to really see improvements in reliability.
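One way to see why mitigation is called linear and change management non-linear is to let SRE time do two different things in a toy model: mitigation time raises this week's mitigation rate, while project time permanently shrinks the defect rate, so its effect compounds. The Python sketch below assumes exactly that. The function, its parameters, and all values are hypothetical and are not taken from the article or from Google's handbook.

```python
def run(weeks, project_share, defect_rate=0.10, features_per_week=20,
        base_mitigation=0.3, sre_mitigation_boost=0.4, defect_reduction_per_week=0.05):
    """Simulate active incidents for a given split of SRE time.

    project_share: fraction of SRE time spent on change management (0.0 to 1.0);
                   the remainder goes to mitigation (firefighting).
    Returns (active incidents at the end, final defect rate).
    """
    active = 0.0
    for _ in range(weeks):
        # Mitigation work helps only while it is being done (linear effect).
        mitigation_rate = base_mitigation + sre_mitigation_boost * (1 - project_share)
        # Project work lowers the defect rate for good (compounding effect).
        defect_rate *= 1 - defect_reduction_per_week * project_share
        active += features_per_week * defect_rate - active * mitigation_rate
    return active, defect_rate


if __name__ == "__main__":
    for share in (0.0, 0.5):
        active, rate = run(weeks=52, project_share=share)
        print(f"project share {share:.0%}: active={active:.2f}, defect rate={rate:.3f}")
```

Under these assumptions, the 50% split has more open incidents than pure firefighting in the second week, because less time goes to mitigation, but fewer after a year, because the defect rate keeps falling.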

Long term, SREs encourage the entire engineering organization to adopt better reliability practices, thus creating a more effective team.  

However, it’s also important to note that if a system becomes unstable for a long period of time with no effort to improve, SREs will likely start withdrawing from the application. If developers don’t make an effort to help prioritize reliability, the SRE may hand back the pager.

Because reliability is a team sport, SRE is a highly cultural endeavor, which requires buy-in across many stakeholders.  

Conclusion: Kickstarting your SRE team

To balance your mix of developers and SREs properly so you can effectively lower defect rates, two things need to happen:

  1. You need to hire. It’s not enough to have a fixed number of SREs, because as you hire more devs, the number of defects will keep increasing significantly even with a reduced defect rate. To keep the growing number of defects under control, you need to design an SRE team whose size is a defined ratio of the overall engineering organization, say one to twenty.
  2. Things need to get worse before they get better. In the beginning of SRE implementation, you’ll need to discuss reasonable expectations for your team. For a while, incidents will likely increase before they get better, and team members shouldn’t feel discouraged about this. An increase in incidents occurs when technical debt is being drained. When SREs begin draining technical debt, latent incidents are exposed. These incidents are like landmines; you don’t know they even exist until you step on them. Latent incidents are extremely common in legacy systems. This can be a huge pitfall for great teams. If incidents spike due to remediation efforts, it doesn’t mean that the team is failing. In fact, it’s the opposite. The team is exposing and fixing dangerous issues. By letting your team know ahead of time that things will likely be worse before they get better, you can avoid this issue and boost your system’s reliability plus your team’s morale.
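The hiring argument in the first point is simple arithmetic, sketched below in Python. It reads "one to twenty" as one SRE per twenty engineers, which is an interpretation rather than something the article spells out. The helper names and the sample figures are hypothetical.

```python
import math


def sres_needed(engineers, engineers_per_sre=20):
    """SRE headcount for a fixed ratio (assumed: one SRE per twenty engineers)."""
    return math.ceil(engineers / engineers_per_sre)


def weekly_defects(devs, features_per_dev, defect_rate):
    """Expected defects per week: total features shipped times the defect rate."""
    return devs * features_per_dev * defect_rate


if __name__ == "__main__":
    # Hypothetical before/after: headcount doubles while the defect rate improves.
    before = weekly_defects(devs=50, features_per_dev=2, defect_rate=0.08)
    after = weekly_defects(devs=100, features_per_dev=2, defect_rate=0.05)
    print(f"defects per week: {before:.1f} -> {after:.1f}")
    print(f"SREs at 50 engineers: {sres_needed(50)}, at 100 engineers: {sres_needed(100)}")
```

In this example the defect rate drops from 8% to 5%, yet total defects per week still rise because twice as many developers are shipping. Tying SRE headcount to engineering headcount keeps capacity growing alongside that volume.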

With the tools above, you can start thinking about how to allocate investments across your system to optimize for reliability by adding SREs to your organization. Of course, as with all system models, the exact boundaries are not clear-cut, and you may want an expanded scope to capture a fuller picture. Boundaries will likely change over time and will need to be reevaluated periodically. However, we hope this framework provides some high-level ideas on how you can quantify the impact of SRE on your team specifically.



Written by: Owen Wang and Hannah Culver

Edited by: Ancy Dow and Charlie Taylor

