Back Up GitHub and GitLab Repositories Using Golang

Amit Saha

Issue #279, July 2017

Want to learn Golang and build something useful? Learn how to write a tool to back up your GitHub and GitLab repositories.

GitHub and GitLab are two popular Git repository hosting services that are used to host and manage open-source projects. They also have become an easy way for content creators to be able to invite others to share and collaborate without needing to have their own infrastructure setup.

Using hosted services that you don't manage yourself, however, comes with a downside. Systems fail, services go down and disks crash. Content hosted on remote services can simply vanish. Wouldn't it be nice if you could have an easy way to back up your git repositories periodically into a place you control?

If you follow along with this article, you will write a Golang program to back up git repositories from https://github.com and https://about.gitlab.com (including custom GitLab installations). Being familiar with Golang basics will be helpful, but not required. Let's get started!

Hello Golang

The latest stable release of Golang at the time of this writing is 1.8. The package name is usually golang, but if your Linux distro doesn't have this release, you can download the Golang compiler and other tools for Linux from https://golang.org/dl. Once downloaded, extract it to /usr/local:


$ sudo tar -C /usr/local -xzf <filename-from-above>
$ export PATH=$PATH:/usr/local/go/bin

With PATH updated (add the export line to your shell profile to make it permanent), running go version should show the following:

$ go version
go version go1.8 linux/amd64

Let's write your first program. Listing 1 shows a program that expects a -name flag (or argument) when run and prints a greeting using the specified name. Compile and run the program as follows:

$ go build listing1.go
$ ./listing1 -name "Amit"
Hello Amit

$ ./listing1
2017/02/18 22:48:25 Please specify your name using -name
$ echo $?
1

If you don't specify the -name argument, the program prints a message and exits with a non-zero exit code. You can combine both compiling and running the program using go run:

$ go run listing1.go -name Amit
2017/03/04 23:08:11 Hello Amit

The first line in the program declares the package for the program. The main package is special, and any executable Go program must live in the main package. Next, the program imports two packages from the Golang standard library using the import statement:

import (
	"flag"
	"log"
)

The "flag" package is used to handle command-line arguments to programs, and the "log" package is used for logging.

Next, the program defines the main() function where the program execution starts:

func main() {
    name := flag.String("name", "", "Your Name")
    flag.Parse()

    if len(*name) != 0 {
        log.Printf("Hello %s", *name)
    } else {
        log.Fatal("Please specify your name using -name")
    }
}

Unlike other functions you'll write, the main function takes no arguments and returns nothing. The first statement in the main() function above defines a string flag, "name", with a default value of an empty string and "Your Name" as the help message. The return value of flag.String() is a string pointer stored in the variable name. The := operator is shorthand for declaring a variable whose type is inferred from the value being assigned to it. In this case, it is of type *string, a pointer to a string value.

The Parse() function parses the flags and makes the specified flag values available via the returned pointer. If a value has been provided to the "-name" flag when executing the program, the value will be stored in "name" and is accessible via *name (recall that name is a string pointer). Hence, you can check whether the length of the string referred to via name is non-zero, and if so, print a greeting via the Printf() function of the log package. If, however, no value was specified, you use the Fatal() function to print a message. The Fatal() function prints the specified message and terminates the program execution.

Structures, Slices and Maps

The program shown in Listing 2 demonstrates the following different things:

  • Defining a struct data type.

  • Creating a map.

  • Creating a slice and iterating over it.

  • Defining a user-defined function.

At the beginning, you define a new struct data type Repository as follows:

type Repository struct {
    GitURL string
    Name   string
}

The structure Repository has two members: GitURL and Name, both of type string. You can define a variable of this structure type using r := Repository{"git+ssh://git.mydomain.com/myrepo", "myrepo"}. You can choose to leave one or both members out when defining a structure variable. For example, you can leave the GitURL unset using r := Repository{Name: "myrepo"}, or you even can leave both out. When you leave a member unset, the value defaults to the zero value for that type—0 for int, empty string for string type.
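The initialization forms described above can be tried out with a short sketch. The describe() helper is a hypothetical name introduced only for illustration; it is not part of the article's listings.

```go
package main

import "fmt"

// Repository mirrors the struct defined in the article.
type Repository struct {
	GitURL string
	Name   string
}

// describe is a hypothetical helper that shows the value of each field.
func describe(r Repository) string {
	return fmt.Sprintf("GitURL=%q Name=%q", r.GitURL, r.Name)
}

func main() {
	full := Repository{"git+ssh://git.mydomain.com/myrepo", "myrepo"} // positional form
	partial := Repository{Name: "myrepo"}                             // GitURL is left as ""
	var empty Repository                                              // both fields are ""

	fmt.Println(describe(full))
	fmt.Println(describe(partial))
	fmt.Println(describe(empty))

	// Structs whose fields are all comparable can be compared with ==.
	fmt.Println(empty == Repository{})
}
```

The final comparison against Repository{} is the same check that the main() function of Listing 2 relies on to skip empty values.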

Next, you define a function, getRepo, which takes an integer as argument and returns a value of type Repository:

func getRepo(id int) Repository {
    repos := map[int]Repository{
        1: Repository{GitURL: "git+ssh://github.com/amitsaha/gitbackup", Name: "gitbackup"},
        2: Repository{GitURL: "git+ssh://github.com/amitsaha/lj_gitbackup", Name: "lj_gitbackup"},
    }

    return repos[id]
}

In the getRepo() function, you create a map or a hash table of key-value pairs—the key being an integer and a value of type Repository. The map is initialized with two key-value pairs.

The function returns the Repository, which corresponds to the specified integer. If a specified key is not found in a map, a zero value of the value's type is returned. In this case, if an integer other than 1 or 2 is supplied, a value of type Repository is returned with both the members set to empty strings.
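Because a missing key yields a zero value, the caller cannot tell "not found" apart from a stored empty Repository without an extra check. Go's two-value map lookup reports this directly. The sketch below assumes a hypothetical variant of getRepo(), named lookupRepo() here, that passes the second value on to its caller.

```go
package main

import "fmt"

type Repository struct {
	GitURL string
	Name   string
}

// lookupRepo is a hypothetical variant of getRepo() that also reports
// whether the key was present in the map.
func lookupRepo(id int) (Repository, bool) {
	repos := map[int]Repository{
		1: {GitURL: "git+ssh://github.com/amitsaha/gitbackup", Name: "gitbackup"},
		2: {GitURL: "git+ssh://github.com/amitsaha/lj_gitbackup", Name: "lj_gitbackup"},
	}
	r, ok := repos[id] // ok is false when id is not a key in the map
	return r, ok
}

func main() {
	for _, id := range []int{1, 3} {
		if r, ok := lookupRepo(id); ok {
			fmt.Println("found", r.Name)
		} else {
			fmt.Println("no repository with id", id)
		}
	}
}
```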

Next, you define a function backUp(), which accepts a pointer to a variable of type Repository as an argument and prints the Name of the repository. In the final program, this function actually will create a backup of a repository.

Finally, there is the main() function:

func main() {
    var repositories []Repository
    repositories = append(repositories, getRepo(1))
    repositories = append(repositories, getRepo(2))
    repositories = append(repositories, getRepo(3))

    for _, r := range repositories {
        if (Repository{}) != r {
            backUp(&r)
        }
    }
}

In the first statement, you create a slice, repositories, that will store elements of type Repository. A slice in Golang is a dynamically sized array, similar to a list in Python. You then call the getRepo() function to obtain a repository corresponding to the key 1 and store the returned value in the repositories slice using the append() function. You do the same in the next two statements. When you call the getRepo() function with the key 3, you get back an empty value of type Repository.

You then use a for loop with the range clause to iterate over the elements of the slice, repositories. The range clause yields the index of each element along with the element itself; the index is not needed here, so it is discarded by assigning it to the blank identifier _, and the element is referred to via the r variable. You check whether the element is an empty Repository value, and if it isn't, you call the backUp() function, passing the address of the element. Strictly speaking, there is no need to pass the element's address here; you could have passed the element's value itself. However, passing by address avoids copying the structure, which is a good practice when a structure has a large number of members.

When you build and run this program, you'll see the following output:

$ go run listing2.go
2017/02/19 19:44:32 Backing up gitbackup
2017/02/19 19:44:32 Backing up lj_gitbackup

Goroutines and Channels

Consider the previous program (Listing 2). You call the backUp() function with every repo in the repositories serially. When you actually create a backup of a large number of repositories, doing them serially can be slow. Since each repository backup is independent of any other, they can be run in parallel. Golang makes it really easy to have multiple simultaneous units of execution in a program using goroutines.

A goroutine is what other programming languages refer to as lightweight threads or green threads. By default, a Golang program is said to be executing in a main goroutine, which can spawn other goroutines. A main goroutine can wait for all the spawned goroutines to finish before finishing up using a variable of WaitGroup type, as you'll see next.

Listing 3 modifies the previous program such that the backUp() function is called in a goroutine. The main() function declares a variable, wg of type WaitGroup defined in the sync package, and then sets up a deferred call to the Wait() function of this variable. The defer statement is used to execute any function just before the current function returns. Thus, you ensure that you wait for all the goroutines to finish before exiting the program.

The other primary change in the main() function is how you call the backUp() function. Instead of calling this function directly, you call it in a new goroutine as follows:


wg.Add(1)
go func(r Repository) {
	backUp(&r, &wg)
}(r)

You call the Add() function with an argument 1 to indicate that you'll be creating a new goroutine that you want to wait for before you exit. Then, you define an anonymous function taking an argument, r of type Repository, which calls the function backUp() with an additional argument, a reference to the variable, wg—the WaitGroup variable declared earlier.
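Put together, the pattern looks roughly like the following sketch. It is not Listing 3 itself: the backUpAll() wrapper and the backedUp counter are hypothetical additions that make the result easy to check, and Wait() is called explicitly rather than deferred.

```go
package main

import (
	"log"
	"sync"
	"sync/atomic"
)

type Repository struct {
	GitURL string
	Name   string
}

// backedUp is a hypothetical counter, present only so the result can be checked.
var backedUp int32

// backUp only logs here. The deferred Done() tells the WaitGroup that this
// goroutine has finished.
func backUp(r *Repository, wg *sync.WaitGroup) {
	defer wg.Done()
	log.Printf("Backing up %s", r.Name)
	atomic.AddInt32(&backedUp, 1)
}

// backUpAll is a hypothetical wrapper: it starts one goroutine per
// repository, waits for all of them and returns how many completed.
func backUpAll(repositories []Repository) int {
	atomic.StoreInt32(&backedUp, 0)
	var wg sync.WaitGroup
	for _, r := range repositories {
		wg.Add(1) // one more goroutine to wait for
		go func(r Repository) {
			backUp(&r, &wg)
		}(r)
	}
	wg.Wait() // blocks until every goroutine has called Done()
	return int(atomic.LoadInt32(&backedUp))
}

func main() {
	repos := []Repository{{Name: "gitbackup"}, {Name: "lj_gitbackup"}}
	log.Printf("backed up %d repositories", backUpAll(repos))
}
```

Passing r as an argument to the anonymous function gives each goroutine its own copy of the repository, which is why the function takes a parameter instead of referring to the loop variable directly.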

Consider the scenario where you have a large number of elements in your repositories slice, which is a very realistic scenario for this backup tool. Spawning a goroutine for each element in the slice can easily lead to an uncontrolled number of goroutines running concurrently. This can lead to the program hitting per-process memory and file-descriptor limits imposed by the operating system.

Thus, you would want to regulate the maximum number of goroutines spawned by the program and spawn a new goroutine only when the ones executing have finished. Channels in Golang allow you to achieve this and other synchronization operations among goroutines. Listing 4 shows how you can regulate the maximum number of goroutines spawned.

You create a channel of capacity 5 and use it to implement a token system. The channel is created using make:

tokens := make(chan bool, 5)

The above statement creates a “buffered channel”—a channel with a capacity of 5 and that can store only values of type “bool”. If a buffered channel is full, writes to it will block, and if a channel is empty, reads from it will block. This property allows you to implement your token system.
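The blocking behavior can be observed without actually blocking by attempting the write inside a select statement with a default case. The tryAcquire() helper below is a hypothetical name used only for this illustration.

```go
package main

import "fmt"

// tryAcquire is a hypothetical helper: it attempts a non-blocking write and
// reports whether there was room in the channel.
func tryAcquire(tokens chan bool) bool {
	select {
	case tokens <- true:
		return true
	default:
		return false // the channel is full; a plain write would block here
	}
}

func main() {
	tokens := make(chan bool, 2)

	fmt.Println(tryAcquire(tokens), tryAcquire(tokens), tryAcquire(tokens))
	// prints: true true false

	<-tokens // reading frees one slot
	fmt.Println(tryAcquire(tokens), len(tokens), cap(tokens))
	// prints: true 2 2
}
```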

Before you can spawn a goroutine, you write a boolean value, true, into the channel (“taking” a token) and then read it back once the goroutine is done (“releasing” the token). If the channel is full, it means the maximum number of goroutines are already running and, hence, your attempt to write will block and a new goroutine will not be spawned. The write operation is performed via:


tokens <- true

After the control is returned from the backUp() function, you read a value from the channel and, hence, release the token:


<-tokens

The above mechanism ensures that you never have more than five goroutines running simultaneously, and each goroutine releases its token before it exits so that the next goroutine may run. The file, listing5.go in the GitHub repository mentioned at the end of the article uses the runtime package to print the number of goroutines running using this mechanism, essentially allowing you to verify your implementation.
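One way to convince yourself that the limit holds is to record the highest number of jobs in flight. The sketch below is not Listing 4 or listing5.go; runLimited() is a hypothetical helper, and the Sleep() call stands in for the real clone or pull.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited is a hypothetical helper that runs n jobs with at most limit
// of them in flight and returns the highest concurrency it observed.
func runLimited(n, limit int) int {
	tokens := make(chan bool, limit)
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0

	for i := 0; i < n; i++ {
		tokens <- true // take a token; blocks while limit jobs are running
		wg.Add(1)
		go func() {
			defer wg.Done()

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			time.Sleep(10 * time.Millisecond) // stand-in for a git clone or pull

			mu.Lock()
			running--
			mu.Unlock()
			<-tokens // release the token so another job can start
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak concurrency:", runLimited(20, 5))
}
```

The token is taken in the loop before the goroutine starts and released as the last step inside it, so the number of goroutines doing work can never exceed the channel's capacity.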

gitbackup—Backing Up GitHub and GitLab Repositories

In the example programs so far, I haven't explored using any third-party packages. Whereas Golang's built-in tools completely support having an application using third-party repositories, you'll use a tool called gb for developing your “gitbackup” project. One main reason I like gb is how it's really easy to fetch and update third-party dependencies via its “vendor” plugin. It also does away with the need to have your go application in your GOPATH, a requirement that the built-in go tools assume.

Next, you'll fetch and build gb:

$ go get github.com/constabulary/gb/...

The compiled binary gb is placed in the directory $GOPATH/bin. You'll add $GOPATH/bin to the $PATH environment variable and start a new shell session and type in gb:

$ gb
gb, a project based build tool for the Go programming language.

Usage:

     gb command [arguments]
     ..

Next, install the gb-vendor plugin:

$ go get github.com/constabulary/gb/cmd/gb-vendor

gb works on the notion of projects. A project has an “src” subdirectory inside it, with one or more packages in their own sub-directories. Clone the “gitbackup” project from https://github.com/amitsaha/gitbackup, and you will notice the following directory structure:

$ tree -L 1 gitbackup
gitbackup
|--src
|  |--gitbackup
|     |--main.go
|     |--main_test.go
|     |--remote.go
..

The “gitbackup” application is composed of only a single package, “gitbackup”, and it has two program files and unit tests. Let's take a look at the remote.go file first. Right at the beginning, you import third-party packages in addition to a few from the standard library:

  • github.com/google/go-github/github: this is the Golang interface to the GitHub API.

  • golang.org/x/oauth2: used to send authenticated requests to the GitHub API.

  • github.com/xanzy/go-gitlab: Golang interface to the GitLab API.

You define a struct of type Response, which matches the Response structure implemented by both the GitHub and GitLab libraries above. The struct Repository describes each repository that you fetch from either GitLab or GitHub. It has two string fields: GitURL, representing the git clone URL of the repository, and Name, the name of the repository.

The NewClient() function accepts the service name (github or gitlab) as a parameter and returns the corresponding client, which then will be used to interface with the service. The return type of this function is interface{}, a special Golang type indicating that this function can return a value of any type. Depending on the service name specified, it either will be of type *github.Client or *gitlab.Client. If a different service name is specified, it will return nil. To be able to fetch your list of repositories before you can back them up, you will need to specify an access token via an environment variable.

The token for GitLab is specified via the GITLAB_TOKEN environment variable and for GitHub via the GITHUB_TOKEN environment variable. In this function, you check if the correct environment variable has been specified using the Getenv() function from the os package. The function returns the value of the environment variable if specified and an empty string if the specified environment variable wasn't found. If the corresponding environment variable isn't found, you log a message and exit using the Fatal() function from the log package.
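The environment lookup can be sketched as follows. This is an illustration, not the code in remote.go: tokenFor() is a hypothetical helper, it returns an error where the article's code calls log.Fatal(), and it takes the lookup function as a parameter so that it can be exercised without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// tokenFor is a hypothetical helper that maps a service name to its
// environment variable and returns the token stored there.
func tokenFor(service string, getenv func(string) string) (string, error) {
	var envVar string
	switch service {
	case "github":
		envVar = "GITHUB_TOKEN"
	case "gitlab":
		envVar = "GITLAB_TOKEN"
	default:
		return "", errors.New("unknown service: " + service)
	}

	// Getenv returns "" both for an unset and for an empty variable.
	token := getenv(envVar)
	if token == "" {
		return "", errors.New(envVar + " environment variable not set")
	}
	return token, nil
}

func main() {
	if _, err := tokenFor("github", os.Getenv); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("found a GitHub token")
}
```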

The NewClient() function is used by the getRepositories() function, which returns a slice of Repository objects obtained via an API call to the service. There are two conditional blocks in the function to account for the two supported services. The first conditional block handles repository listing for GitHub via the Repositories.List() function implemented by the github.com/google/go-github package. The first argument to this function is the GitHub user name whose repositories you want to fetch. If you leave it as an empty string, it returns the repositories of the currently authenticated user. The second argument to this function is a value of type github.RepositoryListOptions, which allows you to specify the type of repositories you want returned via the Type field. The call to the function Repositories.List() is as follows:

repos, resp, err := client.(*github.Client).Repositories.List("", &options)

Recall that the NewClient() function returns a value of type interface{}, which is an empty interface. Hence, if you attempt to make your function call as client.Repositories.List(), the compiler will complain with an error message:

# gitbackup
remote.go:70: client.Repositories undefined (type interface {} is interface with no methods)

So, you need to perform a “type assertion” through which you get access to the underlying value of client, which is either of the *github.Client or *gitlab.Client type.
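The mechanics of a type assertion can be shown without the real client libraries. In the sketch below, githubClient and gitlabClient are hypothetical stand-ins for *github.Client and *gitlab.Client, and newStubClient() is a stand-in for NewClient().

```go
package main

import "fmt"

// githubClient and gitlabClient are hypothetical stand-ins for
// *github.Client and *gitlab.Client.
type githubClient struct{}
type gitlabClient struct{}

// newStubClient returns a value of any type, like NewClient() in the article.
func newStubClient(service string) interface{} {
	switch service {
	case "github":
		return &githubClient{}
	case "gitlab":
		return &gitlabClient{}
	}
	return nil
}

// serviceOf recovers the concrete type behind the interface{} value
// using a type switch.
func serviceOf(client interface{}) string {
	switch client.(type) {
	case *githubClient:
		return "github"
	case *gitlabClient:
		return "gitlab"
	default:
		return "unknown"
	}
}

func main() {
	client := newStubClient("github")

	// The two-value form sets ok to false instead of panicking on a mismatch.
	if _, ok := client.(*gitlabClient); !ok {
		fmt.Println("not a GitLab client")
	}
	fmt.Println(serviceOf(client), serviceOf(newStubClient("other")))
}
```

A single-value assertion such as client.(*github.Client) panics at run time if the underlying type is different, so it is only safe when the surrounding code has already established which service is in use.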

You query the list of repositories from the service in an infinite loop indicated by the for loop:

for {
	// This is an infinite loop
}

The Repositories.List() function returns three values: the first is a list of repositories, the second is an object of type Response, and the third is an error value. If the function call was successful, the value of err is nil. You then iterate over each of the returned objects, create a Repository object containing the two fields you care about and append it to the slice repositories. Once you have exhausted the list of repositories returned, you check whether the NextPage field of the resp object is equal to 0. If it is, you know there isn't anything else to read; you break from the loop and return from the function with the list of repositories you have so far. If you have a non-zero value, you have more repositories, so you set the Page field in the ListOptions structure to this value:

options.ListOptions.Page = resp.NextPage
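The shape of this loop can be illustrated without calling the real API. In the sketch below, page and listPage() are hypothetical stand-ins for the Response type and the Repositories.List() call; only the NextPage convention is taken from the description above.

```go
package main

import "fmt"

// page is a hypothetical stand-in for the Response type: NextPage is 0
// when there are no more results.
type page struct {
	NextPage int
}

// listPage is a hypothetical fake API that serves names two at a time.
func listPage(all []string, pageNum int) ([]string, page) {
	const perPage = 2
	if pageNum < 1 {
		pageNum = 1
	}
	start := (pageNum - 1) * perPage
	if start >= len(all) {
		return nil, page{}
	}
	end := start + perPage
	next := pageNum + 1
	if end >= len(all) {
		end = len(all)
		next = 0 // this was the last page
	}
	return all[start:end], page{NextPage: next}
}

// fetchAll keeps requesting pages until NextPage is 0.
func fetchAll(all []string) []string {
	var names []string
	pageNum := 1
	for {
		items, resp := listPage(all, pageNum)
		names = append(names, items...)
		if resp.NextPage == 0 {
			break
		}
		pageNum = resp.NextPage
	}
	return names
}

func main() {
	fmt.Println(fetchAll([]string{"a", "b", "c", "d", "e"}))
}
```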

The handler for the “gitlab” service is almost the same as the “github” service with one additional detail. “gitlab” is an open-source project, and you can have a custom installation running on your own host. You can handle it here via this code:

if len(gitlabUrl) != 0 {
    gitlabUrlPath, err := url.Parse(gitlabUrl)
    if err != nil {
        log.Fatalf("Invalid gitlab URL: %s", gitlabUrl)
    }
    gitlabUrlPath.Path = path.Join(gitlabUrlPath.Path, "api/v3")
    client.(*gitlab.Client).SetBaseURL(gitlabUrlPath.String())
}

If the value in gitlabUrl is a non-empty string, you assume that you need to query the GitLab hosted at this URL. You attempt to parse it first using the Parse() function from the “url” package and exit with an error message if the parsing fails. The GitLab API lives at <DNS of gitlab installation>/api/v3, so you update the Path field of the parsed URL and then call the function SetBaseURL() of the *gitlab.Client to set this as the base URL.
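The URL handling on its own can be sketched as below. The apiBaseURL() helper is a hypothetical name, it returns an error instead of exiting, and the example host and /gitlab prefix are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// apiBaseURL is a hypothetical helper that appends api/v3 to the path of
// the given GitLab URL.
func apiBaseURL(gitlabURL string) (string, error) {
	u, err := url.Parse(gitlabURL)
	if err != nil {
		return "", fmt.Errorf("invalid gitlab URL %q: %v", gitlabURL, err)
	}
	// path.Join cleans the result, so a trailing slash on the input
	// does not produce a double slash.
	u.Path = path.Join(u.Path, "api/v3")
	return u.String(), nil
}

func main() {
	base, err := apiBaseURL("https://git.mydomain.com/gitlab")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(base)
}
```

Note that url.Parse() treats an input without a scheme, such as git.mydomain.com, as a path rather than a host, so it is worth checking how the value you pass is actually parsed.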

Next, let's look at the main.go file. First though, you should learn where “gitbackup” creates the backup of the git repositories. You can pass the location via the -backupdir flag. If not specified, it defaults to $HOME/.gitbackup. Let's refer to it as BACKUP_DIR. The repositories are backed up in BACKUP_DIR/gitlab/ or BACKUP_DIR/github. If a repository is not found in BACKUP_DIR/<service_name>/<repo>, you know you'll have to make a new clone of the repository (git clone). If the repository exists, you update it (git pull). This operation is performed in the backUp() function in main.go:

func backUp(backupDir string, repo *Repository, wg *sync.WaitGroup) {
    defer wg.Done()

    repoDir := path.Join(backupDir, repo.Name)
    _, err := os.Stat(repoDir)

    if err == nil {
        log.Printf("%s exists, updating. \n", repo.Name)
        cmd := exec.Command("git", "-C", repoDir, "pull")
        err = cmd.Run()
        if err != nil {
            log.Printf("Error pulling %s: %v\n", repo.GitURL, err)
        }
    } else {
        log.Printf("Cloning %s \n", repo.Name)
        cmd := exec.Command("git", "clone", repo.GitURL, repoDir)
        err := cmd.Run()
        if err != nil {
            log.Printf("Error cloning %s: %v", repo.Name, err)
        }
    }
}

The function takes three arguments: the first is a string that points to the location of the backup directory, followed by a reference to a Repository object and a reference to a WaitGroup. You set up a deferred call to Done() on the WaitGroup. The next two lines then check whether the repository already exists in the backup directory using the Stat() function in the os package. This function will return a nil error value if the directory exists, so you execute the git pull command by using the Command() function from the exec package. If the directory doesn't exist, you execute a git clone command instead.
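The decision between cloning and pulling can be isolated into a small function. The gitArgs() helper below is a hypothetical name; it returns the arguments that would be passed to git instead of running the command.

```go
package main

import (
	"fmt"
	"os"
)

// gitArgs is a hypothetical helper that chooses the git arguments based on
// whether repoDir already exists.
func gitArgs(gitURL, repoDir string) []string {
	if _, err := os.Stat(repoDir); err == nil {
		// The directory exists: update the existing clone in place.
		return []string{"-C", repoDir, "pull"}
	}
	// Stat failed, so assume there is no clone yet.
	return []string{"clone", gitURL, repoDir}
}

func main() {
	fmt.Println(gitArgs("git+ssh://git.mydomain.com/myrepo", "."))
	fmt.Println(gitArgs("git+ssh://git.mydomain.com/myrepo", "no-such-directory-here"))
}
```

The returned slice could then be handed to exec.Command("git", args...). Stat() can also fail for reasons other than a missing directory, such as a permissions problem; os.IsNotExist(err) distinguishes the two cases if that matters for your setup.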

The main() function sets up the flags for the “gitbackup” program:

  • backupdir: the backup directory. If not specified, it defaults to $HOME/.gitbackup.

  • github.repoType: GitHub repo types to back up; all will back up all of your repositories. Other options are owner and member.

  • gitlab.projectVisibility: visibility level of GitLab projects to clone. It defaults to internal, which refers to projects that can be cloned by any logged in user. Other options are public and private.

  • gitlab.url: DNS of the GitLab service. If you are creating a backup of your repositories on a custom GitLab installation, you can just specify this and ignore specifying the “service” option.

  • service: the service name for the Git service from which you are backing up your repositories. Currently, it recognizes “gitlab” and “github”.
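The flags listed above could be declared along the following lines. This is a sketch rather than the code in main.go: the options struct and the parseOptions() function are hypothetical, and a separate FlagSet is used so that the parsing can be called with any argument list. The flag names, defaults and help strings are taken from the program's -help output shown later in this article.

```go
package main

import (
	"flag"
	"fmt"
)

// options is a hypothetical struct that collects the flag values.
type options struct {
	backupDir, repoType, visibility, gitlabURL, service string
}

// parseOptions is a hypothetical helper that parses the given arguments.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("gitbackup", flag.ContinueOnError)
	fs.StringVar(&o.backupDir, "backupdir", "", "Backup directory")
	fs.StringVar(&o.repoType, "github.repoType", "all", "Repo types to backup (all, owner, member)")
	fs.StringVar(&o.visibility, "gitlab.projectVisibility", "internal", "Visibility level of Projects to clone")
	fs.StringVar(&o.gitlabURL, "gitlab.url", "", "DNS of the GitLab service")
	fs.StringVar(&o.service, "service", "", "Git Hosted Service Name (github/gitlab)")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseOptions([]string{"-service", "github"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```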

In the main() function, if the backupdir is not specified, you default to use the $HOME/.gitbackup/<service_name> directory. To find the home directory, use the package github.com/mitchellh/go-homedir. In either case, you create the directory tree using the MkdirAll() function if it doesn't exist.

You then call the getRepositories() function defined in remote.go to fetch the list of repositories you want to back up. Limit the maximum number of concurrent clones to 20 by using the token system I described earlier.

Let's now build and run the project from the clone of the “gitbackup” repository you created earlier:

$ pwd
/Users/amit/work/github.com/amitsaha/gitbackup
$ gb build
..
$ ./bin/gitbackup -help
Usage of ./bin/gitbackup:
  -backupdir string
        Backup directory
  -github.repoType string
    	Repo types to backup (all, owner, member) (default "all")
  -gitlab.projectVisibility string
    	Visibility level of Projects to clone (default "internal")
  -gitlab.url string
    	DNS of the GitLab service
  -service string
    	Git Hosted Service Name (github/gitlab)

Before you can back up repositories from either GitHub or GitLab, you need to obtain an access token for each. To be able to back up a GitHub repository, obtain a GitHub personal access token from https://github.com/settings/tokens/new with only the “repo” scope. For GitLab, you can get an access token from https://<location of gitlab>/profile/personal_access_tokens with the “api” scope.

The following command will back up all repositories from github:

$ GITHUB_TOKEN=my$token ./bin/gitbackup -service github

Similarly, to back up repositories from a GitLab installation to a custom location, do this:

$ GITLAB_TOKEN=my$token ./bin/gitbackup -gitlab.url git.mydomain.com -backupdir /mnt/disk/gitbackup

See the README at https://github.com/amitsaha/gitbackup to learn more, and I welcome improvements to it via pull requests. In the time between the writing of this article and its publication, gitbackup has changed a bit. The code discussed in this article is available in the tag https://github.com/amitsaha/gitbackup/releases/tag/lj-0.1. To learn about the changes since this tag in the current version of the repository, see my blog post at echorand.me/notes-on-using-golang-to-write-gitbackup.html.

Conclusion

I covered some key Golang features in this article and applied them to write a tool to back up repositories from GitHub and GitLab. Along the way, I explored interfaces, goroutines and channels, passing command-line arguments via flags and working with third-party packages.

The code listings discussed in the article are available at https://github.com/amitsaha/lj_gitbackup. See the Resources section to learn more about Golang, GitHub and the GitLab API.

Amit Saha is a software engineer and the author of Doing Math with Python (No Starch Press). He blogs at echorand.me and can be reached via email at amitsaha.in@gmail.com.