Over the past couple of weeks, I have been thinking about adding resiliency to my personal projects and accounts. You can follow my entire thought process on Twitter, but in this blog post I'm going to focus on Github.

Following AWS, the second most critical service for my projects is Github: 90% of my code is stored there (mostly in private repositories), and I have to admit I had never taken a backup of this data.

So I finally set some time aside to set up an automated process to back up my Github account, and I ended up relying on ECS (Fargate) and S3 Glacier. This blog post explains the architecture and implications of the final setup I decided to go with.


At a high level, this is what the final setup looks like:

  • Backups of my Github account are taken via an ECS (Fargate) task, with execution triggered periodically by a CloudWatch Event Rule, and secrets (i.e., the Github PAT) pulled from Parameter Store.
  • The data fetched from Github is zipped and uploaded to an S3 bucket, where it will transition to Glacier after one day.
  • Notifications are sent via SNS for every task starting and/or stopping, as well as for every new object created in the destination S3 bucket.
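Conceptually, the work each scheduled task performs boils down to three steps. Here is a minimal sketch; the variable values, bucket name, and paths are illustrative placeholders (not the actual ones used in my setup), and the exact flags depend on your version of python-github-backup:

```shell
# Illustrative placeholders -- not the actual names used in my setup
GITHUB_USER="your-github-username"
GITHUB_TOKEN="ghp_xxxxxxxxxxxx"
BACKUP_DIR="/tmp/github-backup"
BUCKET="my-github-backups"
ARCHIVE="github-backup-$(date +%Y-%m-%d).tar.gz"

# 1. Pull repositories (including issues and wikis) down locally
github-backup "${GITHUB_USER}" -t "${GITHUB_TOKEN}" -o "${BACKUP_DIR}" --all --private

# 2. Zip the fetched data
tar -czf "/tmp/${ARCHIVE}" -C "${BACKUP_DIR}" .

# 3. Upload to S3; the bucket's lifecycle rule transitions it to Glacier after one day
aws s3 cp "/tmp/${ARCHIVE}" "s3://${BUCKET}/${ARCHIVE}"
```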
Github Backups with ECS - Architecture

Let’s see what all of this means, and let’s analyse the different components in more detail.

Docker Image and Python Logic

Let’s start by talking about the Docker Image which hosts the actual application logic in charge of the backup.

The logic is based on python-github-backup, a Python script that can be used to back up an entire organization or repository (including issues and wikis in the most appropriate format), and which I’ve customised for my use case.

In particular, I’ve added the following:

This customised version of python-github-backup is then packaged as a Docker image and stored in an ECR repository within one of my AWS accounts. The image is automatically built and pushed to ECR via Github Actions.

FROM python:3.9.5-slim-buster

RUN addgroup --gid 11111 app
RUN adduser --shell /bin/false --no-create-home --uid 11111 --gid 11111 app

RUN apt-get update \
  && apt-get install -y --no-install-recommends git \
  && apt-get purge -y --auto-remove \
  && rm -rf /var/lib/apt/lists/*

COPY docker/python-github-backup/python-github-backup /src
WORKDIR /src
RUN pip install -e .

RUN chown -R app:app /src
USER app

ENTRYPOINT ["github-backup"]
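Before wiring the image into ECS, it can be smoke-tested locally. The tag below is arbitrary, and since the ENTRYPOINT is github-backup, any arguments after the image name are passed straight through to the script:

```shell
# Build from the repo root (tag name is arbitrary)
IMAGE_TAG="github-backup:local"
docker build -t "${IMAGE_TAG}" .

# Arguments after the image name go straight to the github-backup entrypoint
docker run --rm "${IMAGE_TAG}" --help
```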
The code for recreating this Docker image, alongside the customised python script, can be found on Github at:

Terraform and Infrastructure Setup

The rest of the components you can see in the “Architecture” diagram above are managed via Terraform. I ended up creating a module which can be used to create:

  • An ECR repository in which to store the Docker image of the customised python-github-backup script.
  • An S3 bucket, with a lifecycle policy that transitions backups to Glacier after one day.
  • A Systems Manager Parameter Store entry, holding the Github PAT.
  • An ECS Cluster, and the ECS (Fargate) Task Definition executing the backup.
  • For notifications:
    • A dedicated SNS Topic.
    • A CloudWatch Event Rule to alert on every ECS Task starting (RUNNING) and/or stopping (STOPPED).
    • An S3 Event Notification for every new object created in the destination bucket.
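As an example of what the lifecycle part of the module expresses, the equivalent configuration via the AWS CLI would look something like the snippet below. The bucket name is a placeholder; the 1-day transition and the 365-day expiration mirror the retention choices described in this post:

```shell
# Placeholder bucket name
BUCKET="my-github-backups"

# Transition objects to Glacier after 1 day, expire them after 365 days
LIFECYCLE='{
  "Rules": [{
    "ID": "glacier-after-one-day",
    "Status": "Enabled",
    "Filter": {},
    "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
    "Expiration": {"Days": 365}
  }]
}'

aws s3api put-bucket-lifecycle-configuration \
  --bucket "${BUCKET}" \
  --lifecycle-configuration "${LIFECYCLE}"
```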
This Terraform module can be found on Github at:



  • Run the Terraform module above, which will set up all the necessary components.
  • Create a Personal Access Token in Github, and assign the following scopes to it:
    • repo
    • read:org
    • read:user
  • Store the Github PAT in the Parameter Store:
    • Description: Github Personal Access Token to grant access to the org
    • Type: SecureString
    • KMS: default
  • Build the custom Docker image and upload it to ECR. You could automate this via your CI/CD pipeline, or push it manually with a script similar to the one below:
#!/bin/bash

# Expected environment variables (set these before running):
#   AWS_REGION, AWS_ACCOUNT_ID, IMAGE_NAME, ECR_REPO, IMAGE_VERSION

# Authenticate Docker against the ECR registry
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

# Build the image, tag it for ECR, and push it
docker build -t ${IMAGE_NAME} .
docker tag ${IMAGE_NAME} ${ECR_REPO}:${IMAGE_VERSION}
docker push ${ECR_REPO}:${IMAGE_VERSION}
  • Wait until the first day of the next month (or run a Task manually) to have your Github backup stored in S3!
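Both of the manual steps above can also be performed from the AWS CLI. The parameter name, cluster, task definition, and subnet below are placeholders (the module creates its own names); the SecureString type and description mirror the values listed above:

```shell
# Placeholder names -- the module creates its own
PARAM_NAME="/github-backup/pat"
CLUSTER="github-backup"

# Store the PAT as a SecureString (default KMS key)
aws ssm put-parameter \
  --name "${PARAM_NAME}" \
  --description "Github Personal Access Token to grant access to the org" \
  --type "SecureString" \
  --value "ghp_xxxxxxxxxxxx"

# Trigger a one-off run instead of waiting for the schedule
aws ecs run-task \
  --cluster "${CLUSTER}" \
  --task-definition "github-backup" \
  --launch-type "FARGATE" \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],assignPublicIp=ENABLED}'
```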

Security Considerations


Vendorised Dependencies

Code related to all my personal projects is stored within a single monorepo, and all (well, the majority) of its dependencies are vendorised (I briefly touched on this in “My Blogging Stack”, but it will probably warrant a post of its own).

This setup is no exception: I initially reviewed python-github-backup, tailored it to my needs, and now Github Actions builds the Docker image from the custom copy within the monorepo.

At the same time, the Terraform module leverages two other external modules: umotif-public/ecs-fargate/aws and umotif-public/ecs-fargate-scheduled-task/aws. Although the public module I released on Github uses the upstream versions, the module I use internally refers to local vendorised copies of these modules.

Secrets Management

This is where this solution could be improved, in my opinion.

For my use case, I decided to store the Github PAT in Parameter Store instead of Secrets Manager mainly for pricing reasons, since Parameter Store does not incur additional charges for Standard parameters.

For me, this is a “good enough” tradeoff for now, but I understand Secrets Manager could be seen as a more robust solution for storing the Github PAT.
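For reference, swapping to Secrets Manager would be mostly mechanical. The secret name below is a placeholder; the resulting secret ARN can then be referenced from the Task Definition's secrets block just like a Parameter Store ARN, the main difference being that each secret costs $0.40/month plus API-call charges:

```shell
# Placeholder secret name
SECRET_NAME="github-backup/pat"

aws secretsmanager create-secret \
  --name "${SECRET_NAME}" \
  --description "Github Personal Access Token" \
  --secret-string "ghp_xxxxxxxxxxxx"
```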

Storage Reliability

For handling backups, I decided to have a dedicated AWS account.

Another improvement could involve setting up cross-account backups via AWS Backup, to replicate the data stored in S3 into another account. This data, though, already exists in two places (the live data in Github, and the backup in S3), so it seems overkill for now.

Two other options worth looking into could be S3 Object Lock and Glacier Vault Lock.
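To give an idea of what the Object Lock route involves: it can only be enabled at bucket-creation time. A hedged sketch (bucket name, region, and retention values are hypothetical, and COMPLIANCE mode makes deletions impossible until the retention expires):

```shell
# Hypothetical names/values
BUCKET="my-github-backups-locked"
REGION="eu-west-1"   # example region

# Object Lock can only be enabled when the bucket is created
aws s3api create-bucket \
  --bucket "${BUCKET}" \
  --object-lock-enabled-for-bucket \
  --create-bucket-configuration LocationConstraint="${REGION}"

# Default retention: objects cannot be deleted or overwritten for 365 days
LOCK_CONFIG='{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":365}}}'
aws s3api put-object-lock-configuration \
  --bucket "${BUCKET}" \
  --object-lock-configuration "${LOCK_CONFIG}"
```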

How Much Does this Cost?

Since I’ve just deployed this solution, I don’t have enough historical data to show you exactly how much I spent on it.

What I can do, though, is use the AWS Pricing Calculator to give you an estimate:

Service          Monthly Forecast ($)   First 12 months Forecast ($)
S3 Glacier       0.05                   0.60
ECR              0.0098                 0.12
Parameter Store  0                      0
CloudWatch       0                      0
Total            0.0598                 0.72

As you can see, the biggest entry, as expected, will be storage: I expect ~1GB to be generated each month, for a total of ~12GB concurrently stored in Glacier at steady state (since the retention period for each backup is 1 year).
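As a quick sanity check on that figure: Glacier storage is priced at roughly $0.004 per GB-month in most regions, so ~12GB held concurrently lines up with the ~$0.05/month estimate in the table:

```shell
# ~12 GB stored concurrently x ~$0.004 per GB-month (Glacier storage price)
MONTHLY=$(awk 'BEGIN { printf "%.3f", 12 * 0.004 }')
echo "${MONTHLY}"   # prints 0.048
```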

Show Me the Code

As briefly mentioned, both the custom Docker image and the Terraform module needed to recreate the different components of the architecture are available on Github:


Conclusions

In this post I outlined the architecture and implications of an automated process to back up a Github account, relying on ECS Fargate and S3 Glacier.

The next service I want to tackle, since it is where I store the majority of my personal data, is GDrive. I have since blogged about it (with code) at: “Automated GDrive Backups with ECS and S3”.

I hope you found this post useful and interesting, and I’m keen to get feedback on it! If you found the information useful, if something is missing, or if you have ideas on how to improve it, please let me know on Twitter.