Reading time ~7 minutes

Automated Github Backups with ECS and S3

Architecture
- Docker Image and Python Logic
- Terraform and Infrastructure Setup
Usage
Security Considerations
How Much Does this Cost?
Show Me the Code
Conclusions

Over the past couple of weeks, I started thinking a bit more about adding resiliency to my personal projects and accounts. You can follow my entire thought process on Twitter (see the embedded Tweet below), but in this blog post I’m going to focus on Github.

Following AWS, the second most critical service for my projects is Github: 90% of my code is stored there (mostly in private projects), and I have to admit I never took a backup of this data.

So I finally decided to set some time aside to set up an automated process to backup my Github account, and I ended up relying on ECS (Fargate) and S3 Glacier. This blog explains the architecture and implications of the final setup I decided to go with.

In the past couple of weeks, I started thinking a bit more about adding resiliency into my personal projects/accounts. A thread 🧵
— Marco Lancini (@lancinimarco) June 21, 2021

Architecture

At a high level, this is how the final setup looks like:

Backups of my Github account are taken via an ECS on Fargate Task Definition, with execution triggered periodically by a CloudWatch Event Rule, and secrets (i.e., the Github PAT) pulled from Parameter Store.
The data fetched from Github is zipped and uploaded to an S3 bucket, where it will transition to Glacier after one day.
Notifications are sent via SNS for every task starting and/or stopping, as well as for every new object created in the destination S3 bucket.

Let’s see what all of this means, and let’s analyse the different components in more detail.

Docker Image and Python Logic

Let’s start by talking about the Docker Image which hosts the actual application logic in charge of the backup.

The logic is based on python-github-backup, a Python script that can be used to backup an entire organization or repository (including issues and wikis in the most appropriate format), and which I’ve customized for my use case.

In particular, I’ve added the following:

Fetch the Github Personal Access Token (PAT) and target user from environment variables.
Zip the final output folder.
Upload the Zip file to an S3 Bucket.

This customized version of python-github-backup is then packaged as a Docker image, and stored in an ECR repository within one of my AWS accounts. The image is automatically built and pushed to ECR via Github Actions.

FROM python:3.9.5-slim-buster

RUN addgroup --gid 11111 app
RUN adduser --shell /bin/false --no-create-home --uid 11111 --gid 11111 app

RUN apt-get update \
  && apt-get install -y --no-install-recommends git \
  && apt-get purge -y --auto-remove \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /src
COPY docker/python-github-backup/python-github-backup /src
RUN pip install -e .

RUN chown -R app:app /src
USER app

ENTRYPOINT ["github-backup"]

The code for recreating this Docker image, alongside the customised python script, can be found on Github at:
github.com/marco-lancini/utils/tree/main/docker/python-github-backup

Terraform and Infrastructure Setup

The rest of the components you can see in the “Architecture” diagram above are managed via Terraform. I ended up creating a module which can be used to create:

An ECR repository where to store the Docker image of the customised python-github-backup script.

A destination S3 bucket with a lifecycle policy which transitions objects to Glacier after 1 day.

A Systems Manager Parameter Store where to store the Github PAT.

A dedicated VPC, and a subnet with an Internet Gateway (IGW) attached to it, to allow for egress traffic.
An ECS Cluster on Fargate cluster, in the dedicated VPC.

An ECS Task Definition, with execution triggered periodically (cron) by a CloudWatch Event Rule, and secrets pulled from Parameter Store.

For notifications:
- A dedicated SNS Topic.
- A CloudWatch Event Rule to alert on every ECS Task starting (RUNNING) and/or stopping (STOPPED).
CloudWatch Event Rule
- An S3 Event Notification to alert on every new object created in the destination bucket.
S3 Event Notification

This Terraform module can be found on Github at:
github.com/marco-lancini/utils/tree/main/terraform/aws-github-backups

📙 The CloudSec Engineer is out now!

The CloudSec Engineer is a practical guide on how to enter, establish yourself, and thrive in the Cloud Security industry as an individual contributor.

You can head over to CloudSecBooks.com to find more information about the book and its contents.

Usage

Run the Terraform module above, which will setup all the necessary components
Create a Personal Access Token in Github, and assign the following scopes to it:
- repo
- read:org
- read:user
Store the Github’s PAT in the Parameter Store:
- Name: GITHUB_PAT_BACKUP
- Description: Github Personal Access Token to grant access to the org
- Type: SecureString
- KMS: default
Build the custom Docker image and upload it to ECR. You could automate this via your CI/CD pipeline, or, otherwise, you could push it manually with a script similar to the one below:

#! /bin/bash

AWS_ACCOUNT_ID="XXXXX"
AWS_REGION="XXXXX"
IMAGE_NAME="python-github-backup"
IMAGE_VERSION="latest"
ECR_REPO="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${IMAGE_NAME}"

# AUTHENTICATE TO ECR
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

# BUILD IMAGE
docker build -t ${IMAGE_NAME} .

# TAG IMAGE
docker tag ${IMAGE_NAME}:${IMAGE_VERSION} ${ECR_REPO}:${IMAGE_VERSION}

# PUSH IMAGE
docker push ${ECR_REPO}:${IMAGE_VERSION}

Wait till the first day of the next month (or run a Task manually), to have your Github backup stored into S3!

Security Considerations

Dependencies

Code related to all my personal projects is stored within a single monorepo, and all (well, the majority) of dependencies are vendorised (I briefly touched about this in “My Blogging Stack”, but this will probably warrant another post on its own).

This setup is no exception: I initially reviewed python-github-backup, tailored it to my needs, and now Github Actions builds the Docker image from the custom copy within the monorepo.

At the same time, the terraform module leverages 2 other external modules: umotif-public/ecs-fargate/aws and umotif-public/ecs-fargate-scheduled-task/aws. Although the public module I released on Github uses the upstream versions, the module I use internally refers to local vendorised copies of these modules.

Secrets Management

This is where this solution could be improved, in my opinion.

For my use case, I decided to store Github’s PAT in Parameter Store instead of Secrets Manager mainly from a pricing point of view, with Parameter Store not incurring in additional charges for Standard parameters.

For me, this is a “good enough” tradeoff for now, but I understand Secrets Manager could be seen as a more reliable solution for storing Github’s PAT.

Storage Reliability

For handling backups, I decided to have a dedicated AWS account.

Another improvement could involve setting up cross-account backups, via AWS Backup, to replicate the data stored in S3 into another account. This data, though, already exists in two places already (the live data in Github, and the backup in S3) so it seems an overkill for now.

Other two options worth looking into could be S3 Object Lock and Glacier Vault Lock.

How Much Does this Cost?

Since I’ve just deployed this solution, I don’t have enough historical data to show you exactly how much I spent on it.

What I can do, though, is to use the AWS Pricing Calculator to give you an estimate:

Service	Monthly Forecast ($)	First 12 months Forecast ($)
S3 Glacier	0.05	0.60
ECR	0.0098	0.12
Parameter Store	0	0
CloudWatch	0	0
Total	0.0598	0.72

As you can see, the biggest entry, as expected, will be storage: I expect to have ~1GB generated each month, for a total of ~12GB concurrently stored in Glacier when at full regime (since the retaining period for each backup is 1 year).

Show Me the Code

As briefly mentioned, both the custom Docker image and the Terraform module needed to recreate the different components of the architecture are available on Github:

The code for recreating the Docker image, alongside the customised python script, can be found at: github.com/marco-lancini/utils/tree/main/docker/python-github-backup
The Terraform module can be found at: github.com/marco-lancini/utils/tree/main/terraform/aws-github-backups

Conclusions

In this post I outlined architecture and implications of an automated process aiming to backup a Github account, relying on ECS Fargate and S3 Glacier.

The next service I want to tackle, since it is where I store the majority of my personal data, is GDrive. ~~Expect another blog post (with code) hopefully soon.~~ I blogged about it at: “Automated GDrive Backups with ECS and S3”.

I hope you found this post useful and interesting, and I’m keen to get feedback on it! If you find the information shared was useful, if something is missing, or if you have ideas on how to improve it, please let me know on Twitter.

About

Automated Github Backups with ECS and S3

Architecture

Docker Image and Python Logic

Terraform and Infrastructure Setup

📙 The CloudSec Engineer is out now!

Usage

Security Considerations

Dependencies

Secrets Management

Storage Reliability

How Much Does this Cost?

Show Me the Code

Conclusions

Subscribe to CloudSecList

Marco Lancini

About

The CloudSec Engineer

CloudSec* Projects

Collections

Must Read

Recent Articles

Tags