Seamless branch deploys with Kubernetes



Basecamp’s newest product HEY has lived on Kubernetes since development first began. While our applications are majestic monoliths, a product like HEY has numerous supporting services that run alongside the main app, like our mail pipeline (Postfix and friends), Resque (and Resque Scheduler), and nginx, making Kubernetes a great orchestration option for us.

As you work on code changes or new feature additions for an application, you naturally want to test them somewhere — either in a unique environment or in production via feature flags. For our other applications like Basecamp 3, we make this happen via a series of numbered environments called betas (beta1 through betaX). A beta environment is essentially a mini production environment — it uses the production database but everything else (app services, Resque, Redis) is separate. In Basecamp 3’s case, we have a claim system via an internal chatbot that shows the status of each beta environment (here, none of them are claimed):

[Screenshot: our internal chatbot showing the beta claim status (prior to starting work on HEY, we were running 8 beta environments for BC3)]

Our existing beta setup is fine, but what if we could do something better with the new capabilities afforded to us by Kubernetes? Indeed we can! After reading about GitHub’s branch-lab setup, I was inspired to come up with a better solution for beta environments than our existing claims system. The result is what’s in use today for HEY: a system that (almost) immediately deploys any branch to a branch-specific endpoint that you can access right away to test your changes, without having to use the claims system or talk to anyone else (along with an independent job-processing fleet and Redis instance to support the environment).

Let’s walk through the developer workflow

  • A dev is working on a feature addition to the app, aptly named new-feature.
  • They make their changes in a branch (called new-feature) and push them to GitHub, which automatically triggers a CI run in Buildkite:
[Screenshot: the CI run for the branch in Buildkite]
  • The first step in the CI pipeline builds the base Docker image for the app (all later steps depend on it). If the dev hasn’t made a change to Gemfile/Gemfile.lock, this step takes ~8 seconds. Once that’s complete, it’s off to the races for the remaining steps, but most importantly for this blog post: Beta Deploy.
  • The “Beta Deploy” step runs bin/deploy within the built base image, making a POST to GitHub’s Deployments API. In the repository settings for our app, we’ve configured a webhook that responds solely to deployment events — it’s connected to a separate Buildkite pipeline. When GitHub receives a new deployment request, it sends a webhook over to Buildkite, causing another build to be queued that handles the actual deploy (known as the deploy build).
  • The “deploy build” is responsible for building the remainder of the images needed to run the app (nginx, etc.) and actually carrying out the Helm upgrades to both the main app chart and the accompanying Redis chart that supports Resque and other Redis needs of the branch deploy (a rough sketch of those upgrades follows this list):
[Screenshot: the deploy build pipeline in Buildkite]
  • From there, Kubernetes starts creating the deployments, statefulsets, services, and ingresses needed for the branch, and a minute or two later the developer can access their beta at https://new-feature.corp.com. (If this isn’t the first time the branch has been deployed, there’s no initializing step and the deploy just swaps the images running in the deployment.)
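
To make that concrete, here’s a rough sketch of the Helm side of a deploy build for a branch called new-feature. The chart paths, release names, values files, and namespace layout here are illustrative rather than our actual configuration:

BRANCH="new-feature"
SHA="$(git rev-parse --short HEAD)"

# Redis release that backs Resque and other Redis needs for this branch
helm upgrade --install "${BRANCH}-redis" ./charts/redis \
  --namespace "beta-${BRANCH}" --create-namespace \
  --values config/beta/redis-values.yml

# Main app release, pointed at the image built earlier in CI
# (values keys like image.tag are placeholders for whatever the chart exposes)
helm upgrade --install "${BRANCH}" ./charts/app \
  --namespace "beta-${BRANCH}" \
  --values config/beta/values.yml \
  --set image.tag="${SHA}" \
  --set ingress.hostname="${BRANCH}.corp.com" \
  --wait

Because helm upgrade --install is idempotent, the same command handles both the very first deploy of a branch and every subsequent one.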

What if a developer wants to manage the deploy from their local machine instead of having to check Buildkite? No problem, the same bin/deploy script that’s used in CI works just fine locally:

$ bin/deploy beta
[✔] Queueing deploy
[✔] Waiting for the deploy build to complete : https://buildkite.com/new-company/great-new-app-deploys/builds/13819
[✔] Kubernetes deploy complete, waiting for Pumas to restart

Deploy success! App URL: https://new-feature.corp.com

(bin/deploy also takes care of verifying that the base image has already been built for the commit being deployed. If it hasn’t, it’ll wait for the initial CI build to make it past that step before continuing on to queueing the deploy.)
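
Under the hood, queueing a deploy is just a POST to GitHub’s Deployments API. Here’s a minimal sketch of that call with curl; the repository name, token variable, and payload values are placeholders rather than our exact setup:

# Create a deployment for the branch; GitHub responds by firing the
# "deployment" webhook that the separate Buildkite deploy pipeline listens for.
curl -s -X POST \
  -H "Authorization: token ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github.v3+json" \
  https://api.github.com/repos/new-company/great-new-app/deployments \
  -d '{
        "ref": "new-feature",
        "environment": "beta",
        "auto_merge": false,
        "required_contexts": []
      }'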

Remove the blanket!

Sweet, so the developer workflow is easy enough, but there’s got to be more going on under the covers, right? Yes, a lot. But first, story time.

HEY runs on Amazon EKS — AWS’ managed Kubernetes product. While we wanted to use Kubernetes, we don’t currently have enough bandwidth on the operations team to run a bare-metal Kubernetes setup (or to rely on something like Kops on AWS), so we’re more than happy to pay AWS a few dollars a month to manage our cluster masters for us.

While EKS is a managed service and relatively integrated with AWS, you still need a few other pieces installed to do things like create Application Load Balancers (what we use for the front end of HEY) and manage Route53 records. For those two pieces, we rely on the aws-alb-ingress-controller and external-dns projects.
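
If you’re curious what wiring those in looks like, both projects ship Helm charts. A hand-wavy sketch follows; the chart repositories, names, and values are illustrative (both projects have moved and renamed things over time), so check each project’s docs rather than copying this verbatim:

# Illustrative only: chart locations and value names change over time.
helm repo add incubator https://charts.helm.sh/incubator
helm repo add bitnami https://charts.bitnami.com/bitnami

# alb-ingress-controller watches Ingresses and provisions ALBs for them
helm upgrade --install alb-ingress incubator/aws-alb-ingress-controller \
  --namespace kube-system \
  --set clusterName=my-eks-cluster

# external-dns watches the same resources and manages Route53 records
helm upgrade --install external-dns bitnami/external-dns \
  --namespace kube-system \
  --set provider=aws \
  --set policy=sync \
  --set "domainFilters={corp.com}"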

Inside the app Helm chart we have two Ingress resources (one external, and one internal for cross-region traffic that stays within the AWS network). They carry all of the right annotations to tell alb-ingress-controller to spin up an ALB with the proper settings (health checks so that instances are marked healthy/unhealthy, HTTP→HTTPS redirection at the load balancer level, and the proper SSL certificate from AWS Certificate Manager), and also to let external-dns know that we need some DNS records created for this new ALB. Those annotations look something like this:

Annotations:
  kubernetes.io/ingress.class:                             alb
  alb.ingress.kubernetes.io/listen-ports:                  [{"HTTP": 80},{"HTTPS": 443}]
  alb.ingress.kubernetes.io/scheme:                        internet-facing
  alb.ingress.kubernetes.io/ssl-policy:                    ELBSecurityPolicy-TLS-1-2-2017-01
  alb.ingress.kubernetes.io/actions.ssl-redirect:          {"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}
  alb.ingress.kubernetes.io/certificate-arn:               arn:aws:acm:us-east-1:############:certificate/########-####-####-####-############
  external-dns.alpha.kubernetes.io/hostname:               new-feature.us-east-1.corp.com.,new-feature.corp.com.

alb-ingress-controller and external-dns are both Kubernetes controllers and constantly watch cluster resources for annotations that they know how to handle. In this case, external-dns knows that it shouldn’t create a record for this Ingress resource until it has been issued an Address, which alb-ingress-controller will take care of in its own control loop. Once an ALB has been provisioned, alb-ingress-controller tells the Kubernetes API that this Ingress has an Address, and external-dns carries on creating the appropriate records in the appropriate Route53 zones: in this case, an ALIAS record pointing to the Ingress’ Address and a TXT ownership record, both in a Route53 zone (in the same AWS account as our EKS cluster) that has been delegated from the main app domain just for these branch deploys.
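
You can watch that handoff happen from the outside. Assuming an Ingress and namespace named after the branch (the names here are hypothetical), something like this shows the ALB hostname once alb-ingress-controller has assigned it, and then the record external-dns created for it:

# The Ingress' Address (an ALB hostname) appears once alb-ingress-controller
# has finished provisioning the load balancer
kubectl get ingress new-feature-external --namespace beta-new-feature \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# Once external-dns has done its part, the branch hostname resolves to an
# ALIAS of that ALB
dig +short new-feature.corp.com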

These things cost money, right, what about the clean-up!?

Totally, and at the velocity that our developers are working on this app, it can rack up a small bill in EC2 spot instance and ALB costs if we have 20-30 of these branches deployed at once and running all the time! We have two methods of cleaning up branch deploys:

  • a GitHub Actions-triggered clean-up run
  • a daily clean-up run

Both of these run the same code each time, but they’re targeting different things. The GitHub Actions-triggered run goes after deploys for branches that have just been deleted — it is triggered whenever a delete event occurs in the repository. The daily clean-up run goes after deploys that are more than five days old (we do this by comparing the current time with the last deployed time from Helm). We’ve experimented with different lifespans for branch deploys, but five days works for us — three is too short, seven is too long; it’s a balance.

When a branch is found and marked for deletion, the clean-up build runs the appropriate helm delete commands against the main app release and the associated Redis release, causing a cascade in which the branch’s Kubernetes resources are cleaned up and deleted, the ALBs are de-provisioned, and external-dns removes the records it created (we run external-dns in full-sync mode so that it can delete records that it owns).
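
In spirit, each clean-up boils down to a pair of helm deletes per branch. A simplified sketch, with release and namespace names that are illustrative and assuming Helm 3:

BRANCH="new-feature"

# The daily run figures out what's stale by comparing each release's
# "last deployed" time (visible via `helm list -o json`) against a
# five-day cutoff; the GitHub Actions run gets the branch name straight
# from the delete event instead.

# Deleting the releases removes the branch's deployments, statefulsets,
# services, and ingresses; alb-ingress-controller then de-provisions the
# ALBs and external-dns (running in sync mode) removes the records it owns.
helm delete "${BRANCH}" --namespace "beta-${BRANCH}"
helm delete "${BRANCH}-redis" --namespace "beta-${BRANCH}"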

Other bits

  • We’ve also run this setup using Jetstack’s cert-manager to issue certs from Let’s Encrypt for each branch deploy, but dropped it in favor of wildcard certs managed in AWS Certificate Manager because hell hath no fury like me opening my inbox every day to find 15 cert expiration emails in it. It also added several extra minutes to the provisioning timeline for new branches — rather than just waiting for the ALB to be provisioned and the new DNS records to propagate, you also had to wait for the certificate verification record to be created and propagate, for Let’s Encrypt to issue your cert, etc etc etc.
  • DNS propagation can take a while, even once you remove the costly certificate issuance step. This was particularly noticeable if you used bin/deploy locally, because the last step of the script is to hit the endpoint for your deploy over and over again until it’s healthy. This meant that you could end up caching an empty DNS result, since external-dns may not have created the record yet (likely, in fact, for new branches). We mitigate this by setting a low negative-caching TTL on the Route53 zone that we use for these deploys (sketched after this list).
  • There’s a hard limit on the number of security groups that you can attach to an ENI, and there’s only so much tweaking you can do with AWS support to maximize the number of ALBs that you can have attached to the nodes in an EKS cluster. For us this means limiting the number of branch deploys in a cluster to 30. HOWEVER, I have a stretch goal to fix this by writing a custom controller that will play off of alb-ingress-controller and create host-based routing rules on a single ALB that can serve all beta instances. This would increase the number of deploys per cluster to 95-ish per ALB (since an ALB has a limit on the number of rules attached to it), and reduce the cost of the entire setup significantly, because each ALB costs a minimum of $16/month and each deploy currently has two ALBs (one external and one internal).
  • We re-use the same Helm chart for production, beta, and staging — the only changes are the database endpoints (between production/beta and staging), some resource requests, and a few environment variables. Each branch deploy is its own Helm release.
  • We use this setup to run a full mail pipeline for each branch deploy, too. This makes it easy for devs to test their changes if they involve mail processing, allowing them to send mail to <their username>@new-feature.corp.com and have it appear in their account as if they sent it through the production mail pipeline.
  • Relying on GitHub’s Deployments API means that we get nice touches in PRs like this (the deployment-status call behind that link is sketched below):
[Screenshot: a pull request showing the deployment status, complete with a direct link to the temporary deploy environment]
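
A deployment gets that link when a status with an environment_url is posted against it. A hedged sketch of what such a call looks like, with the deployment ID, repository name, and URLs as placeholders:

# Mark deployment 123456 as successful and point GitHub at the branch's URL;
# environment_url is what GitHub renders as the "View deployment" button.
curl -s -X POST \
  -H "Authorization: token ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github.v3+json" \
  https://api.github.com/repos/new-company/great-new-app/deployments/123456/statuses \
  -d '{
        "state": "success",
        "environment_url": "https://new-feature.corp.com",
        "log_url": "https://buildkite.com/new-company/great-new-app-deploys/builds/13819"
      }'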
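
As for the negative-caching TTL mentioned above: in Route53 that’s governed by the zone’s SOA record (its TTL and the final field of its value), so lowering it is just an UPSERT of that record. A sketch, with the zone ID and name-server value as placeholders; fetch the existing SOA first and only change the TTL and the final number:

# Lower the TTL resolvers use to cache NXDOMAIN answers for the branch zone
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "corp.com.",
        "Type": "SOA",
        "TTL": 60,
        "ResourceRecords": [{
          "Value": "ns-0000.awsdns-00.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60"
        }]
      }
    }]
  }'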

If you’re interested in HEY, check out hey.com and learn about our take on email.

Blake is a Senior System Administrator on Basecamp’s Operations team who spends most of his time working with Kubernetes and AWS in some capacity. When he’s not deep in YAML, he’s out mountain biking. If you have questions, send them over on Twitter – @t3rabytes.

