Week 1: Introduction to Version Control

Last updated: Dec 09, 2022

Learning Objectives:

Describe the concept of version control and why it is important to use
Utilize the diff and patch commands to automate differentiating and editing files
Explain what Git is and the benefits of using it
Install Git on a local machine
Utilize Git to create and clone repositories, add code, check the status of code, and commit code

Course Introduction

This course focuses on how to keep track of the different versions of your code and configuration files using version control systems or VCS.

In this course, we'll introduce you to a popular VCS called Git, and show you some of the ways you can use it.

We'll also go through how to set up an account with the service called GitHub, so that you can create your very own remote repositories to store your code and configuration.

By the end of this course, you'll be able to store your code's history in Git, and collaborate with others on GitHub, where you'll also start creating your own portfolio.

On top of search results, here are some great Git resources available online:

Pro Git: This book (available online and in print) covers all the fundamentals of how Git works and how to use it. Refer to it if you want to learn more about the subjects that we cover throughout the course.
Git tutorial: This tutorial includes a very brief reference of all Git commands available. You can use it to quickly review the commands that you need to use.

Before Version Control

Intro to Module 1: Version Control

When trying to manage change in IT, it's super important to have detailed historical information for your organization's configuration files and automation code.

This lets the administrators see what was modified and when, which can be critical to troubleshooting.

It also provides a documentation trail that will let future IT specialists know why the infrastructure is the way it is, and it provides a mechanism for undoing a change completely.

Keeping Historical Copies

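Have you ever worked on a project that was developing over time? A simple way to keep earlier versions is to save a copy of the files before each change, but that approach has a few drawbacks.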

First, you need to remember to make the copy.

Second, you usually make a copy of the whole thing, even if you're only changing one small part.

And third, even if you're emailing your changes to your colleagues, it might be hard to figure out at the end who did what, and more importantly, why they did it.

The principle behind version control is the same as keeping historical copies: it lets us keep track of the changes in our files.

Diffing Files

We can use the diff command line tool to take two files, or even two directories, and show the differences between them in a few formats.

Example:

We have two files, rearrange1.py and rearrange2.py, which contain two different versions of the same function.

When we call the diff command: diff rearrange1.py rearrange2.py

We get only the lines that are different between the two files.

See the symbols at the beginning of each of those lines? The "<" symbol tells us that the first line was removed from the first file, and the ">" symbol tells us that the second line was added to the second file. In other words, the old line got replaced by the new one.
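As a rough illustration of this default output format, the sketch below compares two tiny files. The names greet_old.py and greet_new.py and their contents are made up for this example; they are not the rearrange scripts discussed above.

```shell
# Work in a scratch directory so nothing else is touched.
cd "$(mktemp -d)"

cat > greet_old.py <<'EOF'
def greet(name):
    return "Hello " + name
EOF

cat > greet_new.py <<'EOF'
def greet(name):
    return "Hello, " + name + "!"
EOF

# diff exits with status 1 when the files differ. That is not an error,
# so "|| true" keeps a strict script (set -e) from stopping here.
diff greet_old.py greet_new.py || true
# 2c2
# <     return "Hello " + name
# ---
# >     return "Hello, " + name + "!"
```

The 2c2 line says that line 2 of the first file was changed into line 2 of the second file.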

Example:

Here there are more changes going on. We can see that diff splits the changes into two separate sections.

The section that starts with 5c5,6 shows a line in the first file that was replaced by two different lines in the second file. The numbers at the beginning of this section indicate the line numbers in the first and second files. The c in between the numbers means that a line was changed.

The section that starts with 11a13,15 shows three lines that are new in the second file. The a stands for added, but that block looks a bit strange, doesn't it? It seems like we're adding a return and an if condition but no body for the if block. What's up with that? To understand this better, we can use the -u flag to tell diff to show the differences in another format.

This unified format is pretty different from the one that we saw before. It shows the changed lines together with some context, using the "-" sign to mark lines that were removed, and the "+" sign to mark lines that were added. The extra context lets us better know what's going on with the change that we're diffing. We can see that the new file actually has a completely new if block that's part of a chain of conditionals that look very similar, and that's why with the diff output that we saw before, it was a little confusing which lines had been added.
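To see the unified format on something small, here is a sketch with two hypothetical files (checks_old.py and checks_new.py are invented for this example). The new file gains one extra if block in a chain of similar conditionals, which is the situation described above.

```shell
cd "$(mktemp -d)"

cat > checks_old.py <<'EOF'
def check(value):
    if value < 0:
        return "negative"
    return "ok"
EOF

cat > checks_new.py <<'EOF'
def check(value):
    if value < 0:
        return "negative"
    if value == 0:
        return "zero"
    return "ok"
EOF

# -u prints removed lines with "-", added lines with "+",
# and unchanged context lines with a leading space.
diff -u checks_old.py checks_new.py || true
```

After the two header lines naming the files, the output should contain a single section headed @@ -1,4 +1,6 @@ in which the two new lines are marked with "+" and the surrounding lines appear as context.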

There are a lot of tools out there to compare files. Diff is the most popular one, but not the only one available. For example, wdiff highlights the words that have changed in a file instead of working line by line like diff does.

To help us even more, there are a bunch of graphical tools that display files side by side and highlight the differences by using color. Some examples of this include: meld, KDiff3, or vimdiff.

Applying Changes

Imagine a colleague sends you a script with a bug and asks you to help fix the issue. To make the change clear, you could send them a diff with the change so that they can see what the modified code looks like.

To do this, we typically use a command line like: diff -u old_file new_file > change.diff

As a reminder, the greater than sign redirects the output of the diff command to a file. So with this command, we're generating a file called change.diff with the output of the diff -u command.

The generated file is usually referred to as a diff file or sometimes a patch file. It includes all the changes between the old file and the new one, plus the additional context needed to understand the changes and to apply those changes back to the original file.

There's a command called patch that takes a file generated by diff and applies the changes to the original file. How do we do that? We'll pass the name of the file that we want to patch as the first parameter to the command and then we'll provide the diff file through standard input, for example: patch old_file < change.diff

We get one single line that says the file was patched, which means that we've successfully applied the changes.

You might be wondering, why go through all this trouble diffing, and patching, and not just send the whole file instead?

There are a few reasons for this:

The main reason is that the original code could have changed. By using a diff instead of the whole file, we can clearly see what they changed, no matter which version they were using. The patch command can detect that there were changes made to the file and will do its best to apply the diff anyway. It won't always succeed but in many cases it will.
Another reason is structure. In this case we're patching a single small file. But sometimes, you might be modifying a bunch of large files inside of a huge project. Say you are changing four files in a project tree that contains 100 different files, arranged in different directories according to what they do. If you were to send the whole files, you'd need to specify where those files were supposed to be placed. As we called out, we can diff whole directory structures and in that case the diff file can specify where each changed file should be without having to do any manual juggling.
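The directory case can be sketched as follows. All directory and file names here (project_old, project_new, colleague_copy, and the files inside them) are hypothetical. The sketch only modifies existing files; files that exist on one side only would need extra diff options.

```shell
cd "$(mktemp -d)"

# Two copies of a small project tree: the one we received and our edited one.
mkdir -p project_old/checks project_old/docs
echo "threshold = 10" > project_old/checks/disk.cfg
echo "Usage notes"    > project_old/docs/README.txt
cp -r project_old project_new
echo "threshold = 20"       > project_new/checks/disk.cfg
echo "Usage notes, revised" > project_new/docs/README.txt

# -r recurses into subdirectories; each change is labelled with its file path.
diff -ru project_old project_new > changes.diff || true

# The colleague applies the diff from inside their own copy of the tree.
# -p1 strips the leading "project_old/" or "project_new/" path component.
cp -r project_old colleague_copy
(cd colleague_copy && patch -p1 < ../changes.diff)

# No output here means the colleague's tree now matches ours.
diff -r colleague_copy project_new
```

Because the paths travel inside the diff file, nobody has to say by hand where each changed file belongs.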

Practical Application of diff and patch

Imagine this: a colleague is asking for our help with fixing a script named disk_usage.py.

Before we change anything, let's make a couple of copies of the script. We'll add _original to one copy, which we'll keep unmodified and use for comparison, and _fixed to the other copy, which we'll use to prepare our fix.

After making the changes, we need to send the fix to our colleague as a diff so that they can patch their script.

By calling patch with the diff file, we've applied the changes that were necessary to fix the bugs. Let's check that disk_usage.py now executes successfully.
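The whole round trip might look roughly like this. The contents of disk_usage.py below are a stand-in invented for the sketch, and the sed line stands in for editing disk_usage_fixed.py by hand; only the file naming follows the description above.

```shell
cd "$(mktemp -d)"

# A stand-in for the colleague's buggy script (hypothetical contents).
cat > disk_usage.py <<'EOF'
#!/usr/bin/env python3
import shutil

def check_disk_usage(disk, min_percent):
    du = shutil.disk_usage(disk)
    percent_free = du.free / du.total
    return percent_free >= min_percent
EOF

# Keep one untouched copy for comparison and one copy to edit.
cp disk_usage.py disk_usage_original.py
cp disk_usage.py disk_usage_fixed.py

# Stand-in for the manual fix: turn the fraction into a percentage.
sed 's|= du.free|= 100 * du.free|' disk_usage_original.py > disk_usage_fixed.py

# Our side: record the fix as a diff file to send to the colleague.
diff -u disk_usage_original.py disk_usage_fixed.py > disk_usage.diff || true

# Colleague's side: apply the diff to their own copy of the script.
patch disk_usage.py < disk_usage.diff
```

After the last command, disk_usage.py should have the same contents as disk_usage_fixed.py.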

But this is still a very manual process, which is where version control systems can really help.

diff and patch Cheat Sheet

diff

diff is used to find differences between two files. On its own, it’s a bit hard to use; instead, use it with diff -u to find lines which differ in two files:

diff -u

diff -u is used to compare two files, line by line, and show the differing lines together with a few lines of surrounding context in a single output. See below:

~$ cat menu1.txt 
Menu1:

Apples
Bananas
Oranges
Pears

~$ cat menu2.txt 
Menu:

Apples
Bananas
Grapes
Strawberries

~$ diff -u menu1.txt menu2.txt 
--- menu1.txt   2019-12-16 18:46:13.794879924 +0900
+++ menu2.txt   2019-12-16 18:46:42.090995670 +0900
@@ -1,6 +1,6 @@
-Menu1:
+Menu:
 
 Apples
 Bananas
-Oranges
-Pears
+Grapes
+Strawberries

Patch

Patch is useful for applying file differences. See the below example, which compares two files. The comparison is saved as a .diff file, which is then patched to the original file!

~$ cat hello_world.txt 
Hello World
~$ cat hello_world_long.txt 
Hello World

It's a wonderful day!
~$ diff -u hello_world.txt hello_world_long.txt 
--- hello_world.txt     2019-12-16 19:24:12.556102821 +0900
+++ hello_world_long.txt        2019-12-16 19:24:38.944207773 +0900
@@ -1 +1,3 @@
 Hello World
+
+It's a wonderful day!
~$ diff -u hello_world.txt hello_world_long.txt > hello_world.diff
~$ patch < hello_world.diff 
patching file hello_world.txt
~$ cat hello_world.txt 
Hello World

It's a wonderful day!

There are some other interesting patch and diff commands, such as patch -p1 and diff -r.

Version Control Systems

What is version control?

A Version Control System keeps track of the changes that we make to our files. By using a VCS, we can know when the changes were made and who made them.
It also lets us easily revert a change if it turned out not to be a good idea.
It makes collaboration easier by allowing us to merge changes from lots of different sources.
A version control system can be used to store much more than just code. We can use it to store configuration files, documentation, data files, or any other content that we may need to track.

Version Control and Automation

If anything breaks after you change your code, you can rely on the VCS to tell you what the file looked like before the change.

You can then revert to the old version quickly, so you can fix the problem fast and figure out what went wrong later. This functionality enhances the reliability of systems you operate.

Because of the audit trail provided by the VCS, you know exactly what version of the code to roll back to, which reduces the time it takes to fix the problem.

It's generally better to quickly roll back first and stop errors before spending time figuring out what went wrong.

Figuring out the bug might take up valuable time or worse, your first attempt at a solution can have its own bugs.

What is Git?

Git is a VCS created in 2005 by Linus Torvalds, the developer who started the Linux kernel.
Git is a free, open source system available for installation on Unix-based platforms, Windows and macOS.
Linus originally created Git to help manage the task of developing the Linux kernel. This was difficult because a lot of geographically distributed programmers were collaborating to write a whole bunch of code.
Git has a distributed architecture. This means that every person contributing to a repository has a full copy of the repository on their own development machine.
If you want to collaborate with others, it usually makes sense to set up a repository on a server to act as a kind of hub for everyone to interact with.
But Git doesn't rely on any kind of centralized server to provide control or organization for its workflow. Git can work as a standalone program, as a server, and as a client. This means that you can use Git on a single machine without even having a network connection.
Git clients can communicate with Git servers over the network using HTTP, SSH or Git's own special protocol.
You can use Git to track private work that you keep to yourself, or you can share your work with others by hosting your code on public servers like GitHub, GitLab or others.
A commit is a collection of edits which has been submitted to the version control system for safe keeping.

When looking for information online you might notice that the official Git website is called http://git-scm.com. SCM is actually another acronym similar to VCS. It stands for Source Control Management.

There are other VCS programs like Subversion or Mercurial.

Installing Git

The first step is to check whether you already have it installed. You can do this by running git --version. If you're running a version number higher than 2.20, then you can just use that one. If you get an error message or an older version number, you'll need to install the current version.
If you use a package management system like apt or yum on Linux, Chocolatey on Windows, or Homebrew on macOS, you can just install Git through that. If you don't use a package management system, then you can download the latest executable installer from the official website and deploy it on your computer.
1. On Linux, installing and using Git is pretty straightforward. You can install it with the command apt install git or yum install git (usually run with administrator privileges), and after that, you'll have Git installed and ready to use.
2. On macOS, you can even have it installed when you run git --version. If Git isn't installed, this command will ask you if you want to install it and then download it and install it for you. Alternatively, you can also download it from the website and install it by following the prompts. Once it's installed, you'll be able to use it from the command line just like any other tool.
3. On Windows, after downloading and executing the installer, you'll need to go through a bunch of different configuration options. These options come with preselected defaults that it usually makes sense to just keep. Pay attention to the editor question though. You'll probably want to change the editor to one that you feel comfortable with, like Notepad++ or Atom. One interesting thing about the Windows installation is that it comes preloaded with an environment called MinGW64. This environment lets us operate on Windows with the same commands and tools available on Linux. So you can practice some Linux command line tools on your Windows machine. After installing Git on your Windows machine, you'll be able to use Git from this Linux-style command line. If you selected the default option for the path environment question, you'll be able to also run it from the PowerShell command line (an optional video will talk about the available options and when you might want to select something different from the default).

Throughout this course, we'll talk about how to do things from the command line.

Some integrated development environments or IDEs let us interact with Git through graphical interfaces. It's fine to use those if you feel more comfortable with them.

The first step on the way to using Git is to install it! The directions in the Git documentation are pretty thorough and helpful; check them out for the best method of getting Git onto your platform of choice.

Using Git

First Steps with Git

Let's start by setting some basic configuration.

Remember when we said that a VCS tracks who made which changes? For this to work, we need to tell Git who we are. We can do this by running git config --global user.email followed by our email address, and git config --global user.name followed by our name.

We use the --global flag to state that we want to set this value for all Git repositories that we use. We could also set different values for different repositories.
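A minimal sketch of those two commands follows. The email address and name are placeholders. The first line is an assumption added for safe experimentation: it points HOME at a scratch directory so the trial run does not change your real settings.

```shell
# Point HOME at a scratch directory so this trial run does not touch your
# real ~/.gitconfig. Leave this line out to change your actual settings.
export HOME="$(mktemp -d)"

git config --global user.email "me@example.com"
git config --global user.name "My Name"

# Read the values back to confirm what Git stored.
git config --global --get user.email
git config --global --get user.name
```

Running the same commands without --global inside a repository sets the values for that repository only.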

With that done, there are two ways to start working with a git repository:

We can create one from scratch using the git init command or we can use the git clone command to make a copy of a repository that already exists somewhere else. We'll talk about remote repositories later in the course.

For now, let's start by creating a new directory and then a Git repository inside that directory. So when we run git init, we initialize an empty Git repository in the current directory. The message that we get mentions a directory called .git.

We can check that this directory exists using the ls -la command, which also lists files that start with a dot.

We can also use the ls -l .git/ command to look inside of it and see the many different things it contains.

This is called a Git directory. You can think of it as a database for your Git project that stores the changes and the change history. We can see it contains a bunch of different files and directories. We won't touch any of these files directly, we'll always interact with them through Git commands. So whenever you clone a repository, this git directory is copied to your computer.

Whenever you run git init to create a new repository like we just did, a new git directory is initialized.
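The steps above can be sketched as follows. The directory name checks is an assumption made for this example, and the sketch runs inside a scratch directory.

```shell
cd "$(mktemp -d)"
mkdir checks
cd checks

# Prints a message that names the new .git directory.
git init

# -a includes entries whose names start with a dot, such as .git
ls -la

# Peek inside the Git directory (look, but do not edit it by hand).
ls -l .git/
```

The exact list of entries inside .git varies between Git versions, so treat what you see as informational.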

The area outside the Git directory is the working tree. The working tree is the current version of your project. You can think of it like a workbench or a sandbox where you perform all the modifications you want to your files. This working tree will contain all the files that are currently tracked by Git and any new files that we haven't yet added to the list of tracked files.

Right now our working tree is empty. Let's change that by copying the disk_usage.py file into our current directory.

We now have a file in our working tree, but it's currently untracked by Git.

To make Git track our file, we'll add it to the project using the git add command passing the file that we want as a parameter. With that, we've added our file to the staging area.

The staging area (index) is a file maintained by Git that contains all of the information about what files and changes are going to go into your next commit.

We can use the git status command to get some information about the current working tree and pending changes. We see that our new file is marked to be committed, which means that our change is currently in the staging area. To get it committed into the .git directory, we run the git commit command.

When we run this command, we tell Git that we want to save our changes. It opens a text editor where we can enter a commit message. If you want, you can change the editor used to your preferred editor. In our case, this computer has nano configured as a default editor.

The text that we get tells us that we need to write a commit message and that the change to be committed is the new file that we've added.

For now, let's enter a simple description of what we did, which was to add this one file, and then exit the editor, saving our commit message. With that, we've created our first Git commit.
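Put together, the first commit might be sketched like this. The one-line disk_usage.py is a stand-in for the real script, the identity values are placeholders set for this repository only, and the sketch passes the message with -m instead of opening an editor so that it can run unattended.

```shell
cd "$(mktemp -d)"
git init

# Identity for this repository only (placeholder values).
git config user.name "My Name"
git config user.email "me@example.com"

# Stand-in for copying the real disk_usage.py into the working tree.
echo 'print("checking disk usage")' > disk_usage.py

git status              # disk_usage.py is listed as untracked
git add disk_usage.py
git status              # now listed under changes to be committed

# -m supplies the commit message on the command line.
git commit -m "Add new disk_usage check"
git status              # nothing left to commit
```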

Tracking Files

Any Git project will consist of three sections: the Git directory, the working tree, and the staging area.

The Git directory contains the history of all the files and changes.
The working tree contains the current state of the project, including any changes that we've made.
And the staging area contains the changes that have been marked to be included in the next commit.

So it might be helpful to think about Git as representing your project, which is the code and associated files, as a series of snapshots.

Each time you make a commit, Git records a new snapshot of the state of your project at that moment. It's a picture of exactly how all these files looked at a certain moment in time.

Combined, these snapshots make up the history of your project, and it's information that gets stored in the Git directory.

When we operate with Git, our files can be either tracked or untracked.

Tracked files are part of the snapshots, while untracked files aren't a part of snapshots yet. This is the usual case for new files.

Each tracked file can be in one of three main states: modified, staged or committed.

If a file is in the modified state, it means that we've made changes to it that we haven't committed yet. The changes could be adding, modifying or deleting the contents of the file. But Git won't store any changes until we add them to the staging area.
Next step is to stage those changes. When we do this, our modified files become staged files. In other words, the changes to those files are ready to be committed to the project. All files that are staged will be part of the next snapshot we take.
Finally, when a file gets committed, the changes made to it are safely stored in a snapshot in the Git directory.

This means that typically a file tracked by Git:

Will first be modified when we change it in any way.
Then it becomes staged when we mark those changes for tracking.
And finally it will get committed when we store those changes in the VCS.

Example Git repo:

First, let's check the contents of the current working tree using ls -l, and then the current status of our files using the git status command.

When we run git status, Git tells us a bunch of things, including that we're on the master branch. For now, notice how it says that there's nothing to commit and that the working tree is clean. Let's modify a file to change that. For example, we'll just add periods at the end of the message that our script presents to the user.

So, now that we've made the change, let's call git status again and see the new output. Again, Git tells us a lot of things, including giving us some tips for commands that we might want to use.

See how the file we changed is now marked as modified? And that it's currently not staged for commit?

Let's change that by running the git add command, passing the disk_usage.py file as a parameter. When we call git add, we're telling Git that we want to add the current changes in that file to the list of changes to be committed. This means that our file is currently part of the staging area, and it will be committed once we run the next Git command, git commit.

In this case, instead of opening up an editor, let's pass the commit message using the -m flag, stating that we added periods at the end of the sentences.

So, we've now committed our staged changes. This creates a new snapshot in the Git directory. The command shows us some stats for the change made.

Let's do one last status check. We see that once again, we have no changes to commit, because the change we made has gone through the full cycle of modified, staged and committed.

So to sum up:

We work on modified files in our working tree.
When they're ready, we stage these files by adding them to the staging area.
Finally, we commit the changes sitting in our staging area, which takes a snapshot of those files and stores them in the database that lives in the Git directory.
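The cycle summarized above can be watched with the compact form of the status command. In this sketch the file contents and identity values are placeholders; in the two-character code printed by git status --short, the first column reflects the staging area and the second the working tree.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "My Name"          # placeholder identity, this repo only
git config user.email "me@example.com"
echo 'print("Everything ok")' > disk_usage.py
git add disk_usage.py
git commit -q -m "Add disk_usage.py"

# Modified: edit the tracked file in the working tree.
echo 'print("Everything ok.")' > disk_usage.py
git status --short      # " M disk_usage.py" (changed, not staged)

# Staged: mark the change for the next commit.
git add disk_usage.py
git status --short      # "M  disk_usage.py" (staged)

# Committed: store the snapshot in the Git directory.
git commit -m "Add periods to the end of sentences"
git status --short      # prints nothing: the working tree is clean
```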

The Basic Git Workflow

We saw that each repository will have a Git directory, a working tree, and a staging area.

And we called out that files can be in three different states, modified, staged, and committed.

Let's review these concepts one more time by looking at the normal workflow when operating with Git on a day to day basis.

First, all the files we want to manage with Git must be a part of a Git repository.

We initialize a new repository by running the git init command in any file system directory.

For example, let's use the mkdir command to create a directory called scripts, and then change into it and initialize an empty Git repository in it.

Our shiny new Git repository can now be used to track changes to files inside of it. But before jumping into that, let's check out our current configuration by using the git config -l command.

For now, pay special attention to the user.email and the user.name lines, which we touched on briefly in an earlier video.

This information will appear in public commit logs if you use a shared repository. For privacy reasons, you might want to use different identities when dealing with your private work and when submitting code to public repositories. We'll include more details about changing this information in our next reading.

Okay, our repo is ready to work, but it's currently empty. Let's create a file in it, we'll start with a basic skeleton for a Python script, which will help us demonstrate the Git workflow.

As with any Python script, we'll start with the shebang line. For now, we'll add an empty main function, which we'll fill in later. And at the end, we'll just call this main function.

This is a script that we'll want to execute, so let's make it executable. And then let's check the status of our repo using the git status command. As we called out before, when we create a new file in a repository, it starts off as untracked. We can make all kinds of changes to the file, but until we tell Git to track it, Git won't do anything with an untracked file.

The git add command will immediately move a new file from untracked to staged status. And as we'll see later, it will also change a file in the modified state to the staged state.

Remember that when a file is staged, it means it's been added to the staging area and it's ready to be committed to the Git repository.

To initiate a commit of staged files, we issue the git commit command. When we do this, Git will only commit the changes that have been added to the staging area; untracked files or modified files that weren't staged will be ignored.

Calling git commit with no parameters will launch a text editor. This will open whatever has been set as your default editor. If the default editor is not the one you'd like to use, there are a bunch of ways to change it.

For now, let's edit our message with nano, which is the current default for this computer. We'll say that our change is creating an empty all_checks.py file, then save and exit.

Voila! We've just recorded a snapshot of the code in our project, which is stored in the Git directory.

Remember that every time we commit changes, we take another snapshot, which is annotated with a commit message that we can review later.
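As a sketch, the new-file path through this workflow could look like the following. The identity values are placeholders set for this repository only, and the commit message is passed with -m rather than typed into nano so that the example can run unattended.

```shell
cd "$(mktemp -d)"
mkdir scripts
cd scripts
git init
git config user.name "My Name"          # placeholder identity, this repo only
git config user.email "me@example.com"

# List the configuration Git sees, including user.name and user.email.
git config -l

# A skeleton script: shebang line, empty main(), and a call to main().
cat > all_checks.py <<'EOF'
#!/usr/bin/env python3

def main():
    pass

main()
EOF
chmod +x all_checks.py

git status                  # all_checks.py starts off untracked
git add all_checks.py       # untracked -> staged
git commit -m "Create an empty all_checks.py"
```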

Okay, that's how we add new files, but usually we'll modify existing ones. So let's add a bit more content to our script to see that in action. We'll add a function called check_reboot, that will check if the computer is pending a reboot. To do that, we'll check if the /run/reboot-required file exists. This is a file that's created on our computer when some software requires a reboot. And of course, since we're using os.path.exists, we need to add import os to our script.

All right, we've added a function to our file. Let's check the current status using git status again. Our file's modified, but not staged.

To stage our changes, we need to call git add once again. Then we have to call git commit to store those changes to the Git directory. This time, we'll use the other way of setting the commit message.

We'll call git commit -m, and then pass the commit message that we want to use. So in this case, we'll say that we've added the check_reboot function.

With that, we've demonstrated the basic Git workflow. We make changes to our files, stage them with git add, and commit them with git commit.
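The modify, stage and commit half of the workflow might be sketched as below. The first few lines only rebuild a repository with the skeleton script already committed; the identity values are placeholders, and the body of check_reboot is one plausible reading of the description above rather than the course's exact code.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "My Name"          # placeholder identity, this repo only
git config user.email "me@example.com"
printf '#!/usr/bin/env python3\n\ndef main():\n    pass\n\nmain()\n' > all_checks.py
git add all_checks.py
git commit -q -m "Create an empty all_checks.py"

# Modify the tracked file: add import os and the new function.
cat > all_checks.py <<'EOF'
#!/usr/bin/env python3

import os

def check_reboot():
    """Returns True if the computer has a pending reboot."""
    return os.path.exists("/run/reboot-required")

def main():
    pass

main()
EOF

git status                  # all_checks.py is modified but not staged
git add all_checks.py       # modified -> staged
git commit -m "Add the check_reboot function"
git log --oneline           # one line per commit, newest first
```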

The commit will abort in the case of an empty commit message.

Anatomy of a Commit Message

Writing a clear, informative commit message is important when you use a VCS. Future you, or other developers or IT specialists who might read the commit message later on, will really appreciate the contextual information as they try to figure out some of the parts of the code or configuration.

It can be helpful to keep your audience in mind when you write commit messages:

What would someone reading a message weeks or months from now want to know about the changes you've made?
What might be especially important or tricky to understand about them?
Is there extra information that might help the reader out, like links to design documents or tickets in your ticketing system?

Similarly to how style guides exist for writing code, your company might have specific rules for you to follow when you write commit messages.

Even if they don't, it's good to use a few general guidelines to make sure your commit messages are as clear and useful as possible.

A commit message is generally broken up into a few sections:

The first line is usually kept to about 50 characters or less. The line contains a short description of what the commit changes are about.
After the first line comes an empty line.
The rest of the text is usually wrapped so that each line stays under 72 characters. This text is intended to provide a detailed explanation of what's going on with the change. It can reference bugs or issues that will be fixed with the change. It can also include links to more information when relevant.

When you run the git commit command, Git will open up your configured text editor so you can write your commit message.

The line limits can be annoying, but they help make the commit message more digestible for the reader.

There's a Git command used to display these commit messages called git log. This command won't do any line wrapping for us, which means that if we don't stick to the recommended line wrapping, long commit messages will run off the edge of the screen and be difficult to read.
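Here is a hedged example of a message that follows these guidelines, committed from a script. The wording of the message, the placeholder file, and the identity values are all invented for illustration. The -F - option makes git commit read the message from standard input, and the awk check at the end is just one possible way to test the length guidelines.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "My Name"          # placeholder identity, this repo only
git config user.email "me@example.com"
echo '# placeholder' > all_checks.py
git add all_checks.py

# "-F -" reads the commit message from standard input.
git commit -q -F - <<'EOF'
Add check_reboot function to all_checks.py

Some package updates only take effect after the computer restarts.
This change adds a check_reboot function that reports whether the
/run/reboot-required file exists, so other checks can build on it.
EOF

# Print the stored message and flag any line that breaks the guidelines.
git log -1 --format=%B | awk '
  NR == 1 && length($0) > 50 { print "subject longer than 50 characters" }
  NR == 2 && length($0) > 0  { print "second line should be empty" }
  NR > 2  && length($0) > 72 { print "line " NR " longer than 72 characters" }
'
```

The awk command prints nothing when the message respects all three guidelines.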

The first thing listed for each commit is its identifier, which is a long string of letters and numbers that uniquely identifies each commit.
The first commit in the list also says that the HEAD indicator is pointing to the master branch.
For each commit, we see the name and the email of the person who made the commit which is indicated as the author.
Then we get the date and time the commit was made.
Finally the commit message is displayed.
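To see these fields on a real history, the sketch below creates two small commits and then prints the log. The file name notes.txt, the commit messages, and the identity values are placeholders. The second git log call uses format placeholders to pick out the same fields one per line.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "My Name"          # placeholder identity, this repo only
git config user.email "me@example.com"
echo "first" > notes.txt
git add notes.txt
git commit -q -m "Add notes.txt"
echo "second" >> notes.txt
git add notes.txt
git commit -q -m "Extend notes.txt"

# Default output: commit identifier, Author, Date, then the message,
# with the newest commit listed first.
git log

# The same fields for the newest commit, one per line:
# %H full identifier, %an and %ae author name and email, %ad date, %s subject.
git log -1 --format='%H%n%an <%ae>%n%ad%n%s'
```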

Initial Git Cheat Sheet