
U-M Carpentries Curriculum

Introduction to the Workshop

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • What is The Carpentries?

  • What will the workshop cover?

  • What else do I need to know about the workshop?

Objectives
  • Introduce The Carpentries.

  • Go over logistics.

  • Introduce the workshop goals.

What is The Carpentries?

The Carpentries is a global organization whose mission is to teach researchers, and others, the basics of coding so that you can use it in your own work. We believe everyone can learn to code, and that a lot of you will find it very useful for things such as data analysis and plotting.

Our workshops are targeted to absolute beginners, and we expect that you have zero coding experience coming in. That being said, you’re welcome to attend a workshop if you already have a coding background but want to learn more!

To provide an inclusive learning environment, we follow The Carpentries Code of Conduct. We expect that instructors, helpers, and learners abide by this code of conduct, including practicing the following behaviors:

  • Use welcoming and inclusive language.

  • Be respectful of different viewpoints and experiences.

  • Gracefully accept constructive criticism.

  • Focus on what is best for the community.

  • Show empathy towards other community members.

You can report any violations to the Code of Conduct by filling out this form.

Introducing the instructors and helpers

Now that you know a little about The Carpentries as an organization, the instructors and helpers will introduce themselves and what they’ll be teaching/helping with.

The etherpad & introducing participants

Now it’s time for the participants to introduce themselves. Instead of doing so verbally, participants will use the etherpad to write out their introductions. We use the etherpad to take communal notes during the workshop; feel free to add your own notes there whenever you’d like. Go to the etherpad and write down your name, role, affiliation, and work/research area.

The “goal” of the workshop

Now that we all know each other, let’s learn a bit more about why we’re here. Our goal is to write a report to the United Nations on the relationship between GDP, life expectancy, and CO2 emissions. In other words, we are going to analyze how a country’s economic strength may be related to its public health status and its climate pollution.

To get to that point, we’ll need to learn how to manage data, make plots, and generate reports. The next section discusses in more detail exactly what we will cover.

What will the workshop cover?

This workshop will introduce you to some of the programs used every day in computational workflows in fields as diverse as microbiology, statistics, neuroscience, genetics, and the social and behavioral sciences (such as psychology, economics, and public health), among many others.

A workflow is a set of steps to read data, analyze it, and produce numerical and graphical results that support an assertion or hypothesis, all encapsulated in a set of computer files that can be rerun from scratch on the same data to obtain the same results. This is highly desirable in situations where the same work is done repeatedly – think of processing data from an annual survey, or results from a high-throughput sequencer on a new sample. It is also desirable for reproducibility, which enables you and other people to look at what you did and produce the same results later on. It is increasingly common for people to publish scientific articles along with the data and computer code that generated the results discussed within them.

The programs to be introduced are:

  1. Python, JupyterLab: a general-purpose programming language and an interface to it. We’ll use these tools to manage data and make pretty plots!
  2. Git: a program to help you keep track of changes to your programs over time.
  3. GitHub: a web application that makes sharing your programs and working on them with others much easier. It can also be used to generate a citable reference to your computer code.
  4. The Unix shell (command line): a tool that is extremely useful for managing both data and program files and chaining together discrete steps in your workflow (automation).

We will not try to make you an expert or even proficient with any of them, but we hope to demonstrate the basics of controlling your code, automating your work, and creating reproducible programs. We also hope to provide you with some fundamentals that you can incorporate in your own work.

At the end, we provide links to resources you can use to learn about these topics in more depth than this workshop can provide.

Asking questions and getting help

One last note before we get into the workshop.

If you have general questions about a topic, please raise your hand (in person or virtually) to ask it. Virtually, you can also ask the question in the chat. The instructor will definitely be willing to answer!

For more specific nitty-gritty questions about issues you’re having individually, we use sticky notes (in person) or Zoom buttons (red x/green check) to indicate whether you are on track or need help. We’ll use these throughout the workshop to help us determine when you need help with a specific issue (a helper will come help), whether our pace is too fast, and whether you are finished with exercises. If you indicate that you need help because, for instance, you get an error in your code (e.g. red sticky/Zoom button), a helper will message you and (if you’re virtual) possibly go to a breakout room with you to help you figure things out. Feel free to also call helpers over with a hand wave or a message if we don’t see your sticky!

Other miscellaneous things

If you’re in person, we’ll tell you where the bathrooms are! If you’re virtual we hope you know. :) Let us know if there are any accommodations we can provide to help make your learning experience better!

Key Points

  • We follow The Carpentries Code of Conduct.

  • Our goal is to generate a shareable and reproducible report by the end of the workshop.

  • This lesson content is targeted to absolute beginners with no coding experience.


Python for Plotting

Overview

Teaching: 120 min
Exercises: 30 min
Questions
  • What are Python and JupyterLab?

  • How do I read data into Python?

  • How can I use Python to create and save professional data visualizations?

Objectives
  • To become oriented with Python and JupyterLab.

  • To be able to read in data from csv files.

  • To create plots with both discrete and continuous variables.

  • To understand transforming and plotting data using the seaborn library.

  • To be able to modify a plot’s color, theme, and axis labels.

  • To be able to save plots to a local directory.

Contents

  1. Introduction to Python and JupyterLab
  2. Python basics
  3. Loading and reviewing data
  4. Understanding commands
  5. Creating our first plot
  6. Plotting for data exploration
  7. Bonus
  8. Glossary of terms

Bonus: why learn to program?

Share why you’re interested in learning how to code.

Solution:

There are lots of different reasons, including to perform data analysis and generate figures. I’m sure you have more specific reasons for why you’d like to learn!

Introduction to Python and JupyterLab

Back to top

In this session we will be testing the hypothesis that a country’s life expectancy is related to the total value of its finished goods and services, also known as the Gross Domestic Product (GDP). To test this hypothesis, we’ll need two things: data and a platform to analyze the data.

You already downloaded the data. But what platform will we use to analyze the data? We have many options!

We could try a spreadsheet program like Microsoft Excel or Google Sheets, but these offer limited flexibility and don’t easily allow for things that are critical to “reproducible” research, like easily sharing the steps used to explore and make changes to the original data.

Instead, we’ll use a programming language to test our hypothesis. Today we will use Python, but we could have also used R for the same reasons we chose Python (and we teach workshops for both languages). Both Python and R are freely available, the instructions you use to do the analysis are easily shared, and by using reproducible practices, it’s straightforward to add more data or to change settings like colors or the size of a plotting symbol.

But why Python and not R?

There’s no great reason. Although there are subtle differences between the languages, it’s ultimately a matter of personal preference. Both are powerful and popular languages that have well-developed and welcoming communities of scientists who use them. As you learn more about Python, you may find things that are annoying in Python that aren’t so annoying in R; the same could be said of learning R. If the community you work in uses Python, then you’re in the right place.

To run Python, all you really need is the Python program, which is available for computers running the Windows, Mac OS X, or Linux operating systems. In this workshop, we will use Anaconda, a popular Python distribution bundled with other popular tools (e.g., many Python data science libraries). We will use JupyterLab (which comes with Anaconda) as the integrated development environment (IDE) for writing and running code, managing projects, getting help, and much more.

Bonus Exercise: Can you think of a reason you might not want to use JupyterLab?

Solution:

On some high-performance computer systems (e.g. Amazon Web Services) you typically can’t open a graphical interface like JupyterLab. If you’re at the University of Michigan and have access to Great Lakes, then you might want to learn more about resources to run JupyterLab on Great Lakes.

To get started, we’ll spend a little time getting familiar with the JupyterLab interface. When we start JupyterLab, on the left side there’s a collapsible sidebar that contains a file browser where we can see all the files and directories on our system.

On the right side is the main work area where we can write code, see the outputs, and do other things. Now let’s create a new Jupyter notebook by clicking the “Python 3” button (under the “Notebook” category) on the “Launcher” tab.

Now we have created a new Jupyter notebook called Untitled.ipynb. The file name extension ipynb indicates it’s a notebook file. In case you are interested, it stands for “IPython Notebook”, which is the former name for Jupyter Notebook.

Let’s give it a more meaningful file name, gdp_population.ipynb. To rename a file, we can right-click it in the file browser and then click “Rename”.

A notebook is composed of “cells”. You can add more cells by clicking the plus “+” button from the toolbar at the top of the notebook.


Python basics

Back to top

Arithmetic operators

At a minimum, we can use Python as a calculator.

If we type the following into a cell, and click the run button (the triangle-shaped button that looks like a play button), we will see the output under the cell.

Another, quicker way to run the code in the selected cell is to press Ctrl+Enter (Windows) or Command+Return (macOS) on your keyboard.

Addition

2 + 3
5

Subtraction

2 - 3
-1

Multiplication

2 * 3
6

Division

2 / 3
0.6666666666666666

Exponentiation

One thing that you might need to be a little careful about is exponentiation. If you have used Microsoft Excel, MATLAB, R, or some other programming languages, the operator for exponentiation is the caret ^ symbol. Let’s see if that works in Python.

2 ^ 3
1

Hmm. That’s not what we expected. It turns out in Python (and a few other languages), the caret symbol is used for another operation called bitwise exclusive OR.
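
In case you’re curious, XOR compares the binary representations of the two numbers bit by bit, setting a bit in the result wherever the inputs differ. A quick sketch using the built-in bin() function:

bin(2)   # '0b10'
bin(3)   # '0b11'
2 ^ 3    # only the last bit differs, so the result is 0b01, i.e. 1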

In Python we use double asterisks ** for exponentiation.

2 ** 3
8

Order of operations

We can also use parentheses to specify what operations should be resolved first. For example, to convert 60 degrees Fahrenheit to Celsius, we can do:

5 / 9 * (60 - 32)
15.555555555555555

Assignment operator

In Python we can use a = symbol, which is called the assignment operator, to assign values on the right to objects on the left.

Let’s assign a number to a variable called “age”.

age = 26

When we run the cell, it seems nothing happened. But that’s only because we didn’t ask Python to display anything in the output after the assignment operation. We can call the Python built-in function print() to display information in the output.

We can also use another Python built-in function type() to check the type of an object, in this case, the variable called “age”. And we can see the type is “int”, standing for integers.

age = 26
print(age)
print(type(age))
26
<class 'int'>

Let’s create another variable called “pi”, and assign it a value of 3.1415. We can see that this time the variable has a type of “float”, for floating-point number, or a number with a decimal point.

pi = 3.1415
print(pi)
print(type(pi))
3.1415
<class 'float'>

We can also assign string or text values to a variable. Let’s create a variable called “name”, and assign it the value “Ben”.

name = Ben
print(name)
NameError: name 'Ben' is not defined

We got an error message. As it turns out, to make it work in Python we need to wrap any string values in quotation marks. We can use either single quotes ' or double quotes "; we just need to use the same kind of quotes at the beginning and end of the string. We can also see that the variable has a type of “str”, standing for strings.

name = "Ben"
print(name)
print(type(name))
Ben
<class 'str'>

Single vs Double Quotes

Python supports using either single quotes ' or double quotes " to specify strings. There are no set rules on which one you should use.

  • Some Python style guides suggest using single quotes for shorter strings (the technical term is string literals), as they are a little easier to type and read, and using double quotes for strings that are likely to contain single-quote characters as part of the string itself (such as strings containing natural language, e.g. "I'll be there.").
  • Other Python style guides suggest being consistent with your choice of string quote character within a file. Pick ' or " and stick with it.
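
As a quick illustration, both lines below create exactly the same string; the double-quoted version just avoids having to escape the apostrophe:

greeting1 = "I'll be there."    # double quotes: no escaping needed
greeting2 = 'I\'ll be there.'   # single quotes: the apostrophe must be escaped
print(greeting1 == greeting2)   # True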

Assigning values to objects

Try to assign values to some objects and observe each object after you have assigned a new value. What do you notice?

name = "Ben"
print(name)

name = "Harry Potter"
print(name)

Solution

When we assign a value to an object, the object stores that value so we can access it later. However, if we store a new value in an object we have already created (like when we stored “Harry Potter” in the name object), it replaces the old value.

Guidelines on naming objects

  • You want your object names to be explicit and not too long.
  • They cannot start with a number (2x is not valid, but x2 is).
  • Python is case sensitive, so for example, weight_kg is different from Weight_kg.
  • You cannot use spaces in the name.
  • There are some names that cannot be used because they are the names of fundamental keywords in Python (e.g., if, else, for; run help("keywords") for a complete list, or see the example after this list). You may also notice these keywords change to a different color once you type them (a feature called “syntax highlighting”).
  • It’s best to avoid dots (.) within names. Dots have a special meaning (methods) in Python and other programming languages.
  • It is recommended to use nouns for object names and verbs for function names.
  • Be consistent in the styling of your code, such as where you put spaces, how you name objects, etc. Using a consistent coding style makes your code clearer to read for your future self and your collaborators. The official Python naming conventions can be found here.
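
If you’d like to see the full list of reserved keywords from within a notebook, one option (a small sketch using the standard library’s keyword module) is:

import keyword

# kwlist is the list of all reserved words in your Python version
print(keyword.kwlist)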

Bonus Exercise: Bad names for objects

Try to assign values to some new objects. What do you notice? After running all four lines of code below, what value do you think the object Flower holds?

1number = 3
Flower = "marigold"
flower = "rose"
favorite number = 12

Solution

Notice that we get an error when we try to assign values to 1number and favorite number. This is because we cannot start an object name with a numeral and we cannot have spaces in object names. The object Flower still holds “marigold.” This is because Python is case-sensitive, so running flower = "rose" does NOT change the Flower object. This can get confusing, and is why we generally avoid having objects with the same name and different capitalization.

Data structures

Python lists

Rather than storing a single value to an object, we can also store multiple values into a single object called a list. A Python list is indicated with a pair of square brackets [], and different items are separated by a comma. For example, we can have a list of numbers, or a list of strings.

squares = [1, 4, 9, 16, 25]
print(squares)

names = ["Sara", "Tom", "Jerry", "Emma"]
print(names)

We can also check the type of the object by calling the type() function.

type(names)
list

An item from a list can be accessed by its position using the square bracket notation. Say we want to get the first name, “Sara”, from the list. We might try:

names[1]
'Tom'

That’s not what we expected. Python uses something called 0-based indexing. In other words, it starts counting from 0 rather than 1. If we want to get the first item from the list, we should use an index of 0. Let’s try that.

names[0]
'Sara'

Now see if you can get the last name from the list.

Solutions:

names[3]

A cool thing in Python is it also supports negative indexing. If we just want the last item in a list, we can pass the index of -1.

names[-1]

Python dictionaries

Python lists allow us to organize items by their position. Sometimes we want to organize items by their “keys”. This is when a Python dictionary comes in handy.

A Python dictionary is indicated with a pair of curly brackets {} and composed of entries of key-value pairs. The key and value are connected via a colon :, and different entries are separated by a comma ,. For example, let’s create a dictionary of capitals. We can separate the entries in multiple lines to make it a little easier to read, especially when we have many entries. In Python we can break lines inside braces (e.g., (), [], {}) without breaking the code. This is a common technique people use to avoid long lines and make their code a little more readable.

capitals = {"France": "Paris",
            "USA": "Washington DC",
            "Germany": "Berlin",
            "Canada": "Ottawa"}

We can check the type of the object by calling the type() function.

type(capitals)
dict

An entry from a dictionary can be accessed by its key using the square bracket notation. Say we want to get the capital of the USA. We can do:

capitals["USA"]
'Washington DC'

Now see if you can get the capital of another country.

Solutions:

capitals["Canada"]
'Ottawa'
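
Dictionaries are mutable: the same square bracket notation can be used to add a new entry or update an existing one. A quick sketch (the extra country below is just for illustration):

capitals["Japan"] = "Tokyo"    # adds a new key-value pair
capitals["USA"] = "Washington, D.C."    # updates the value for an existing key
print(capitals["Japan"])
Tokyo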

Calling functions

So far we have used two Python built-in functions: print() to print some values on the screen, and type() to show the type of an object. The way we called these functions is to first type the name of the function, followed by a pair of parentheses. Many functions require additional pieces of information to do their job. We call these additional values “arguments” or “parameters”. We pass the arguments to a function by placing values in between the parentheses. A function takes in these arguments and does a bunch of “magic” behind the scenes to output something we’re interested in.

Do all functions need arguments? Let’s test some other functions.

It is common that we may want to use a function from a module. In this case we will need to first import the module to our Python session. We do that by using the import keyword followed by the module’s name. To call a function from a module, we type the name of the imported module, followed by a dot ., followed by the name of the function that we wish to call.

Below we import the operating system module and call the function getcwd() to get the current working directory.

import os
os.getcwd()
'/Users/fredfeng/Desktop/teaching/workshops/um-carpentries/intro-curriculum-python/_episodes_ipynb'

Sometimes a function resides inside a submodule; we can specify the submodule using the dot notation. In the example below, we call the today() function, which is located in the date submodule inside the datetime module that we imported.

import datetime
datetime.date.today()
datetime.date(2023, 11, 4)

While some functions, like those above, don’t need any arguments, in other functions we may want to use multiple arguments. When we’re using multiple arguments, we separate the arguments with commas. For example, we can use the print() function to print two strings:

print("My name is", name)
My name is Harry Potter

Pro-tip

Each function has a help page that documents what a function does, what arguments it expects and what it will return. You can bring up the help page a few different ways. You can type ? followed by the function name, for example, ?print. A help document should pop up.

You can also place the mouse cursor next to a function and press Shift+Tab to see its help doc.
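
A third option is the built-in help() function, which prints the documentation for whatever object you pass it. For example:

help(round)   # prints the help page for the built-in round() function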

Learning more about functions

Look up the function round(). What does it do? What will you get as output for the following lines of code?

round(3.1415)
round(3.1415, 3)

Solution

The round() function rounds a number to a given precision. By default, it rounds the number to an integer (in our example above, to 3). If you give it a second number, it rounds it to that number of digits (in our example above, to 3.142).

Notice how in this example, we didn’t include any argument names. But you can use argument names if you want:

round(number=3.1415, ndigits=3)

Position of the arguments in functions

Which of the following lines of code will give you an output of 3.14? For the one(s) that don’t give you 3.14, what do they give you?

round(number=3.1415)
round(number=3.1415, ndigits=2)
round(ndigits=2, number=3.1415)
round(2, 3.1415)

Solution

The 2nd and 3rd lines will give you 3.14 because the arguments are named, and when you use names the order doesn’t matter. The 1st line will give you 3 because by default round() rounds to the nearest integer. The 4th line will actually give you an error: since the arguments aren’t named, they are matched by position (number=2, ndigits=3.1415), and ndigits must be an integer.

Sometimes it is helpful - or even necessary - to include the argument names, but often we can skip them if we pass the argument values in the right order. If all this function stuff sounds confusing, don’t worry! We’ll see a bunch of examples as we go that will make things clearer.

Comments

Sometimes we may want to write notes in our code to help us remember what it is doing, without Python treating those notes as code to evaluate. That’s where comments come in! Anything after a # sign in your code will be ignored by Python. For example, let’s say we wanted to make a note of what each of the functions we just used does:

datetime.date.today()   # returns today's date
os.getcwd()    # returns our current working directory

Other times we may want to temporarily disable some code without deleting it. We can comment out lines of code by placing a # sign at the beginning of each line.

A handy keyboard shortcut for this is to place the cursor on the line you wish to comment out, then press Ctrl+/ (Windows) or Command+/ (macOS) to toggle between commented and uncommented. To comment or uncomment multiple lines, first select all the lines, then use the same keyboard shortcut.

Loading and reviewing data

Back to top

Data objects

Above we introduced Python lists and dictionaries. There are other ways to store data in Python. A very common one is a table-like structure with rows and columns. We will refer to these objects generally as “data objects”. If you’ve used pandas before, you may be used to calling them “DataFrames”.
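
To give a feel for what such an object looks like, here is a minimal sketch that builds a tiny DataFrame by hand with the pandas library (we’ll introduce pandas properly in a moment; the column names below are made up for illustration):

import pandas as pd

# a DataFrame can be built from a dictionary mapping column names to values
df = pd.DataFrame({"country": ["France", "Germany"],
                   "capital": ["Paris", "Berlin"]})
print(df)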

Understanding commands

The first thing we usually do when starting a new notebook is to import the libraries we will need later into the Python session. In general, we need to install a library before we can import it. If you followed the setup instructions and installed Anaconda, some common data science libraries are already installed.

Here we can go ahead and import them using the import keyword followed by the name of the library. It’s common to give a library an alias or nickname, so we can type less when calling the library later. The alias is created by using the keyword as. By convention, numpy’s alias is np, and pandas’s alias is pd. Technically you can use whatever alias you want, but please don’t :)

import numpy as np
import pandas as pd
pd.read_csv()
TypeError: read_csv() missing 1 required positional argument: 'filepath_or_buffer'

We get an error message. Don’t panic! Error messages pop up all the time, and can be super helpful in debugging code.

In this case, the message tells us the function that we called is “missing 1 required positional argument: ‘filepath_or_buffer’”

If we think about it, we haven’t told the function which CSV file to read. Let’s tell the function where to find the CSV file by passing a file path to it as a string.

gapminder_1997 = pd.read_csv("gapminder_1997.csv")

gapminder_1997
                country       pop continent  lifeExp     gdpPercap
0           Afghanistan  22227415      Asia   41.763    635.341351
1               Albania   3428038    Europe   72.950   3193.054604
2               Algeria  29072015    Africa   69.152   4797.295051
3                Angola   9875024    Africa   40.963   2277.140884
4             Argentina  36203463  Americas   73.275  10967.281950
..                  ...       ...       ...      ...           ...
137             Vietnam  76048996      Asia   70.672   1385.896769
138  West Bank and Gaza   2826046      Asia   71.096   7110.667619
139          Yemen Rep.  15826497      Asia   58.020   2117.484526
140              Zambia   9417789    Africa   40.238   1071.353818
141            Zimbabwe  11404948    Africa   46.809    792.449960

[142 rows x 5 columns]

The read_csv() function took the file path we provided, did who-knows-what behind the scenes, and then outputted a table with the data stored in that CSV file. All that, with one short line of code!

We can check the type of the variable by calling the Python built-in function type.

type(gapminder_1997)
pandas.core.frame.DataFrame

In pandas terms, gapminder_1997 is a DataFrame: a named object that references or stores some data. In this case, gapminder_1997 stores a specific table of data.
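
Just as we accessed a dictionary entry by its key, we can pull out a single column of a DataFrame by its name using square brackets. For example (both lines below use standard pandas functionality):

gapminder_1997["lifeExp"]          # the life expectancy column, as a pandas Series
gapminder_1997["lifeExp"].mean()   # Series come with handy methods, like mean()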

Reading in an Excel file

Say you have an Excel file and not a CSV - how would you read that in? Hint: Use the Internet to help you figure it out!

Solution

Pandas comes with the read_excel() function, which returns the same kind of output as read_csv().
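
A minimal sketch of how the call would look, assuming a hypothetical file named gapminder_1997.xlsx in your working directory (depending on your setup, you may also need an Excel reader engine such as openpyxl installed):

gapminder_1997 = pd.read_excel("gapminder_1997.xlsx")   # hypothetical file name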

Creating our first plot

Back to top

We will mostly use the seaborn library to make our plots. Seaborn is a popular Python data visualization library. We will use the seaborn objects interface.

We first import the seaborn module.

All plots start by calling the Plot() function. In a Jupyter notebook cell type the following:

Note that we wrap the code in a pair of parentheses so we can improve readability by vertically aligning the methods that we will apply to the plot later. The parentheses make sure the code does not break when we put each method on a new line.

import seaborn.objects as so

(
    so.Plot(gapminder_1997)
)

plot of chunk DataOnly

What we’ve done is call the Plot() function to instantiate a Plot object and tell it we will be using the data from gapminder_1997, the DataFrame that we loaded from the CSV file.

So we’ve made a plot object; now we need to start telling it what we actually want to draw in this plot. The elements of a plot have a bunch of visual properties such as an x and y position, a point size, a color, etc. When creating a data visualization, we map a variable in our dataset to a visual property in our plot.

To create our plot, we need to map variables from our data gapminder_1997 to the visual properties using the Plot() function. Since we have already told Plot that we are using the data in gapminder_1997, we can access its columns using the data frame’s column names. (Remember, Python is case-sensitive, so we have to be careful to match the column names exactly!)

We are interested in whether there is a relationship between GDP and life expectancy, so let’s start by telling our plot object that we want to map the GDP values to the x axis, and the life expectancy to the y axis of the plot.

(
    so.Plot(gapminder_1997, x='gdpPercap', y='lifeExp')
)

plot of chunk DataOnly

Excellent. We’ve now told our plot where the x and y values are coming from and what they stand for. But we haven’t told our plot how we want it to draw the data.

There are different types of marks, for example dots, bars, lines, areas, and bands. We tell our plot what to draw by adding a layer of the visualization in terms of a mark. We will talk about many different marks today, but for our first plot, let’s draw our data using the “dot” mark for each value in the data set. To do this, we apply the add() method to our plot and pass so.Dot() as the mark.

(
    so.Plot(gapminder_1997, x='gdpPercap', y='lifeExp')
    .add(so.Dot())
)

plot of chunk DataOnly

We can add labels for the axes and title by applying the label() method to our plot.

(
    so.Plot(gapminder_1997, x='gdpPercap', y='lifeExp')
    .add(so.Dot())
    .label(x="GDP Per Capita")
)

plot of chunk DataOnly

Give the y axis a nice label.

Solution

(
    so.Plot(gapminder_1997, x='gdpPercap', y='lifeExp')
    .add(so.Dot())
    .label(x="GDP Per Capita", 
           y="Life Expectancy")
)

plot of chunk FirstPlotAddY

Now it finally looks like a proper plot! We can now see a trend in the data. It looks like countries with a larger GDP tend to have a higher life expectancy. Let’s add a title to our plot to make that clearer. We can specify that using the same label() method, but this time we will use the title argument.

(
    so.Plot(gapminder_1997, x='gdpPercap', y='lifeExp')
    .add(so.Dot())
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title="Do people in wealthy countries live longer?")
)

plot of chunk DataOnly

No one can deny we’ve made a very handsome plot! But now looking at the data, we might be curious about learning more about the points at the extremes of the data. We know that we have two more pieces of data in gapminder_1997 that we haven’t used yet. Maybe we are curious if the different continents show different patterns in GDP and life expectancy. One thing we could do is use a different color for each of the continents. It is possible to map data values to various graphical properties. In this case let’s map the continent to the color property.

(
    so.Plot(gapminder_1997, 
            x='gdpPercap', 
            y='lifeExp', 
            color='continent')
    .add(so.Dot())
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title = "Do people in wealthy countries live longer?")
)

plot of chunk DataOnly

Here we can see that in 1997 the African countries had much lower life expectancy than many other continents. Notice that when we add a mapping for color, seaborn automatically provides a legend for us. It took care of assigning different colors to each of our unique values of the continent variable. The colors that seaborn uses are determined by the color “palette”. If needed, we can change the default color palette. Let’s change the colors to make them a bit prettier.

The code below allows us to select a color palette. Seaborn is built on top of Matplotlib and supports all the color palettes from the Matplotlib colormaps. You can also learn more about the seaborn color palettes from here.

import seaborn as sns
sns.color_palette()

sns.color_palette('flare')
sns.color_palette('Reds')
sns.color_palette('Set1')

We can change the color palettes by applying the scale() method to the plot. The scale() method specifies how the data should be mapped to visual properties, and in this case, how the categorical variable “continent” should be mapped to different colors of the dot marks.

(
    so.Plot(gapminder_1997, 
            x='gdpPercap', 
            y='lifeExp', 
            color='continent')
    .add(so.Dot())
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title="Do people in wealthy countries live longer?")
    .scale(color='Set1')
)

plot of chunk DataOnly

Seaborn also supports passing a list of custom colors to the color argument of the scale() method. For example, we can use ColorBrewer to pick a list of colors of our choice, and pass it to the scale() method.

(
    so.Plot(gapminder_1997, 
            x='gdpPercap', 
            y='lifeExp', 
            color='continent')
    .add(so.Dot())
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title="Do people in wealthy countries live longer?")
    .scale(color=['#1b9e77','#d95f02','#7570b3','#e7298a','#66a61e'])
)

plot of chunk DataOnly

Since we have the data for the population of each country, we might be curious what effect population might have on life expectancy and GDP per capita. Do you think larger countries will have a longer or shorter life expectancy? Let’s find out by mapping the population of each country to another visual property: the size of the dot marks.

(
    so.Plot(gapminder_1997, 
            x='gdpPercap', 
            y='lifeExp', 
            color='continent',
            pointsize='pop')
    .add(so.Dot())
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title="Do people in wealthy countries live longer?")
    .scale(color='Set1')
)

plot of chunk DataOnly

We got another legend here for size, which is nice, but the values look a bit unwieldy with so many digits. Let’s assign a new column in our data called pop_million by dividing the population by 1,000,000, and label it “Population (in millions)”.

Note for large numbers such as 1000000, it’s easy to mis-count the number of digits when typing or reading it. One cool thing in Python is we can use the underscore _ as a separator to make large numbers easier to read. For example: 1_000_000.
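
We’ll create the new column with the DataFrame’s assign() method, which returns a copy of the data with the extra column added, leaving the original DataFrame unchanged. A minimal sketch of the idea on its own:

gapminder_1997_mil = gapminder_1997.assign(pop_million=gapminder_1997['pop'] / 1_000_000)
gapminder_1997_mil['pop_million'].head()   # the population, now in millions

Below, we do the same thing inline, inside the call to Plot():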

(
    so.Plot(gapminder_1997.assign(pop_million=gapminder_1997['pop']/1_000_000), 
            x='gdpPercap', 
            y='lifeExp', 
            color='continent',
            pointsize='pop_million')
    .add(so.Dot())
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title="Do people in wealthy countries live longer?",
           pointsize='Population (in millions)'
          )
    .scale(color='Set1')
)

plot of chunk DataOnly

We can further fine-tune how the population should be mapped to the point size using the scale() method. In this case, let’s set the output range of the point size to 2 to 18.

As you can see, some of the marks are on top of each other, making it hard to see some of them (this is called “overplotting” in data visualization). Let’s also reduce the opacity of the dots by setting the alpha property of the Dot mark.

(
    so.Plot(gapminder_1997.assign(pop_million=gapminder_1997['pop']/1_000_000), 
            x='gdpPercap', 
            y='lifeExp', 
            color='continent',
            pointsize='pop_million')
    .add(so.Dot(alpha=.5))
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title="Do people in wealthy countries live longer?",
           pointsize='Population (in millions)'
          )
    .scale(color='Set1', pointsize=(2, 18))
)

plot of chunk DataOnly


Changing marker type

Instead of (or in addition to) color, change the shape of the points so each continent has a different marker type. (I’m not saying this is a great thing to do - it’s just for practice!) Feel free to check the documentation of the Plot() function.

Solution

You’ll want to specify the marker argument in the Plot() function:

(
    so.Plot(gapminder_1997.assign(pop_million=gapminder_1997['pop']/1_000_000), 
            x='gdpPercap', 
            y='lifeExp', 
            color='continent',
            marker='continent',
            pointsize='pop_million')
    .add(so.Dot(alpha=.5))
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title="Do people in wealthy countries live longer?",
           pointsize='Population (in millions)'
          )
    .scale(color='Set1', pointsize=(2, 18))
)

plot of chunk Shape

Plotting for data exploration

Back to top

Many datasets are much more complex than the example we used for the first plot. How can we find meaningful insights in complex data and create visualizations to convey those insights?

Importing datasets

Back to top

In the first plot, we looked at a smaller slice of a large dataset. To gain a better understanding of the kinds of patterns we might observe in our own data, we will now use the full dataset, which is stored in a file called “gapminder_data.csv”.

To start, we will read in the data to a pandas DataFrame.

Read in your own data

What argument should be provided in the below code to read in the full dataset?

gapminder = pd.read_csv()

Solution

gapminder = pd.read_csv("gapminder_data.csv")

Let’s take a look at the full dataset. Pandas offers a way to select the top few rows of a data frame by applying the head() method to the data frame. Try it out!

gapminder.head()
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106

Notice that this dataset has an additional column “year” compared to the smaller dataset we started with.

Predicting seaborn outputs

Now that we have the full dataset read into our Python session, let’s plot the data placing our new “year” variable on the x axis and life expectancy on the y axis. We’ve provided the code below. Notice that we’ve left off the labels so there’s not as much code to work with. Before running the code, read through it and see if you can predict what the plot output will look like. Then run the code and check to see if you were right!

(
    so.Plot(data=gapminder, 
            x='year', 
            y='lifeExp',
            color='continent')
    .add(so.Dot())
)

plot of chunk PlotFullGapminder

Hmm, the plot we created in the last exercise isn’t very clear. What’s going on? Since the dataset is more complex, the plotting options we used for the smaller dataset aren’t as useful for interpreting these data. Luckily, we can add additional attributes to our plots that will make patterns more apparent. For example, we can generate a different type of plot - perhaps a line plot - and assign attributes for columns where we might expect to see patterns.

Let’s review the columns and the types of data stored in our dataset to decide how we should group things together. We can apply the pandas info() method to get the summary information of the data frame.

gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   year       1704 non-null   int64  
 2   pop        1704 non-null   float64
 3   continent  1704 non-null   object 
 4   lifeExp    1704 non-null   float64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB

So, what do we see? The data frame has 1,704 entries (rows) and 6 columns. The “Dtype” shows the data type of each column.

What kind of data do we see?
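
If we want to poke at the data a bit more, pandas offers some quick ways to summarize it. A small sketch (both methods below are standard pandas):

gapminder['continent'].unique()   # the distinct values in the continent column
gapminder.describe()              # summary statistics for the numeric columns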

Our plot has a lot of points arranged in vertical columns, which makes it hard to see trends over time. A better way to view the data showing changes over time is to use lines. Let’s try changing the mark from dot to line and see what happens.

(
    so.Plot(data=gapminder, 
            x='year', 
            y='lifeExp',
            color='continent')
    .add(so.Line())
)

plot of chunk GapMinderLinePlotBad

Hmm. This doesn’t look right. By setting the color value, we got a line for each continent, but we really wanted a line for each country. We need to tell seaborn that we want to connect the values for each country instead. To do this, we need to specify the group argument of the Plot() function.

(
    so.Plot(data=gapminder, 
            x='year', 
            y='lifeExp',
            group='country',
            color='continent')
    .add(so.Line())
)

plot of chunk GapMinderLinePlot

Sometimes plots like this are called “spaghetti plots” because all the lines look like a bunch of wet noodles.

Bonus Exercise: More line plots

Now create your own line plot comparing population and life expectancy! Looking at your plot, can you guess which two countries have experienced massive change in population from 1952-2007?

Solution

(
    so.Plot(data=gapminder, 
            x='pop', 
            y='lifeExp',
            group='country',
            color='continent')
    .add(so.Line())
)

plot of chunk gapminderMoreLines (China and India are the two Asian countries that have experienced massive population growth from 1952-2007.)

Categorical Plots

Back to top

So far we’ve looked at plots where both the x and y values are numerical values on a continuous scale (e.g., life expectancy, GDP per capita, year, population, etc.). But sometimes we may want to visualize categorical data (e.g., continents).

We’ve previously used the categorical values of the continent column to color in our points and lines. But now let’s try moving that variable to the x axis. Let’s say we are curious about comparing the distribution of the life expectancy values for each of the different continents for the gapminder_1997 data.

Let’s map the continent to the x axis and the life expectancy to the y axis. Let’s use the dot marks to represent the data.

(
    so.Plot(gapminder_1997, 
            x='continent', 
            y='lifeExp')
    .add(so.Dot())
)

plot of chunk GapMinderLinePlot

We see that there is some overplotting, as countries from the same continent are aligned vertically like a strip of kebab, making it hard to see the dots in some dense areas. The seaborn objects interface leaves it to us to specify how we would like the overplotting to be handled. A common treatment is to spread (or “jitter”) the dots within each group by adding a little random displacement along the categorical axis. The result is sometimes called a “jitter plot”.

Here we can simply add so.Jitter().

(
    so.Plot(gapminder_1997, 
            x='continent', 
            y='lifeExp')
    .add(so.Dot(), so.Jitter())
)

plot of chunk GapMinderLinePlot

We can control the amount of jitter by setting the width argument. Let’s also change the size and opacity of the dots.

(
    so.Plot(gapminder_1997, 
            x='continent', 
            y='lifeExp')
    .add(so.Dot(pointsize=10, alpha=.5), so.Jitter(width=.8))
)

plot of chunk GapMinderLinePlot

Lastly, let’s further map the continents to the color of the dots.

(
    so.Plot(gapminder_1997, 
            x='continent', 
            y='lifeExp', 
            color='continent')
    .add(so.Dot(pointsize=10, alpha=.5), so.Jitter(width=.8))
)

plot of chunk GapMinderLinePlot

This type of visualization makes it easy to compare the distribution (e.g., range, spread) of values across groups.

Bonus Exercise: Other categorical plots

Let’s plot the range of the life expectancy for each continent in terms of its mean plus/minus one standard deviation.

Example solution

(
    so.Plot(gapminder_1997, 
            x='continent', 
            y='lifeExp', 
            color='continent')
    .add(so.Range(), so.Est(func='mean', errorbar='sd'))
    .add(so.Dot(), so.Agg())
)

plot of chunk GapViol

Univariate Plots

Back to top

We jumped right into making plots with multiple columns. But what if we wanted to take a look at just one column? In that case, we only need to specify a mapping for x and choose an appropriate mark. Let’s start with a histogram to see the range and spread of life expectancy.

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Bars(), so.Hist())
)

plot of chunk GapLifeHist

Histograms can look very different depending on the number of bins you decide to draw. By default, seaborn chooses the number of bins automatically based on the data. Let’s try setting a specific value by explicitly passing a bins argument to so.Hist().

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Bars(), so.Hist(bins=20))
)

plot of chunk GapLifeHistBins

You can try different values like 5 or 50 to see how the plot changes.

Sometimes we don’t really care about the total number of bins, but rather the bin width and end points. For example, we may want the bins at 40-42, 42-44, 44-46, and so on. In this case, we can set the binwidth and binrange arguments of so.Hist().

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Bars(), so.Hist(binwidth=5, binrange=(0, 100)))
)

plot of chunk GapLifeHistBins

Changing the aggregate statistics

By default the y axis shows the number of observations in each bin, that is, stat='count'. Sometimes we are more interested in other aggregate statistics rather than count, such as the percentage of the observations in each bin. Check the documentation of so.Hist and see what other aggregate statistics are offered, and change the histogram to show the percentages instead.

Solution

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Bars(), so.Hist(stat='percent', binwidth=5, binrange=(0, 100)))
)

plot of chunk GapLifeHistBins

If we want to see a break-down of the life expectancy distribution by each continent, we can add the continent to the color property.

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp', 
            color='continent')
    .add(so.Bars(), so.Hist(stat='percent', binwidth=5, binrange=(0, 100)))
)

plot of chunk GapLifeHistBins

Hmm, it looks like the bins for each continent are on top of each other. It’s not very easy to see the distributions. Again, we can tell seaborn how overplotting should be handled. In this case we can use so.Stack() to stack the bins. This type of chart is often called a “stacked bar chart”.

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp', 
            color='continent')
    .add(so.Bars(), so.Hist(stat='percent', binwidth=5, binrange=(0, 100)), so.Stack())
)

plot of chunk GapLifeHistBins

Other than the histogram, we can also use kernel density estimation, a smoothing technique that captures the general shape of the distribution of a continuous variable.

We can add a line so.Line() that represents the kernel density estimates so.KDE().

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Line(), so.KDE())
)

plot of chunk GapLifeHistBins

Alternatively, we can also add an area so.Area() that represents the kernel density estimates so.KDE().

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Area(), so.KDE())
)

plot of chunk GapLifeHistBins

If we want to see the kernel density estimates for each continent, we can map continents to the color in the plot function.

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp',
            color='continent')
    .add(so.Area(), so.KDE())
)

plot of chunk GapLifeHistBins

We can overlay multiple visualization layers on the same plot. Here let’s combine the histogram and the kernel density estimate. Note we will need to change the stat argument of so.Hist() to 'density', so that the y axis values of the histogram are comparable with the kernel density.

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Bars(), so.Hist(stat='density', binwidth=5, binrange=(0, 100)))
    .add(so.Line(), so.KDE())
)

plot of chunk GapLifeHistBins

Lastly, we can make a few further improvements to the plot.

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Bars(), so.Hist(stat='density', binwidth=5, binrange=(0, 100)), label='Histogram')
    .add(so.Line(color='red', linewidth=4, alpha=.7), so.KDE(), label='Kernel density')
    .label(x="Life expectancy", y="Density")
    .layout(size=(9, 4))
)

plot of chunk GapLifeHistBins

Facets

Back to top

If you have a lot of different columns to try to plot or have distinguishable subgroups in your data, a powerful plotting technique called faceting might come in handy. When you facet your plot, you basically make a bunch of smaller plots and combine them together into a single image. Luckily, seaborn makes this very easy. Let’s start with the “spaghetti plot” that we made earlier.

(
    so.Plot(data=gapminder, 
            x='year', 
            y='lifeExp',
            group='country',
            color='continent')
    .add(so.Line())
)

plot of chunk GapMinderLinePlot

Rather than having all the countries in a single plot, this time let’s draw a separate box (a “subplot”) for countries in each continent. We can do this by applying the facet() method to the plot.

(
    so.Plot(data=gapminder, 
            x='year', 
            y='lifeExp',
            group='country',
            color='continent')
    .add(so.Line())
    .facet('continent')
)

plot of chunk GapFacetWrap

Note now we have a separate subplot for countries in each continent. This type of faceted plot is sometimes called “small multiples”.

Note all five subplots are in one row. If we want, we can “wrap” the subplots across a two-dimensional grid. For example, if we want the subplots to have a maximum of 3 columns, we can do the following.

(
    so.Plot(data=gapminder, 
            x='year', 
            y='lifeExp',
            group='country',
            color='continent')
    .add(so.Line())
    .facet('continent', wrap=3)
)

plot of chunk GapFacetWrap

By default, the facet() method will place the subplots along the columns of the grid. If we want to place the subplots along the rows (it’s probably not a good idea in this example as we want to compare the life expectancies), we can set row='continent' when applying facet to the plot.

(
    so.Plot(data=gapminder, 
            x='year', 
            y='lifeExp',
            group='country',
            color='continent')
    .add(so.Line())
    .facet(row='continent')
)

plot of chunk GapFacetWrap

Saving plots

Back to top

We’ve made a bunch of plots today, but we never talked about how to share them with our friends who aren’t running Python! It’s wise to keep all the code we used to draw the plot, but sometimes we need to make a PNG or PDF version of the plot so we can share it with our colleagues or post it to our Instagram story.

We can save a plot by applying the save() method to the plot.

(
    so.Plot(data=gapminder, 
            x='year', 
            y='lifeExp',
            group='country',
            color='continent')
    .add(so.Line())
    .facet('continent', wrap=3)
    .save("awesome_plot.png", bbox_inches='tight', dpi=200)
)

Saving a plot

Try rerunning one of your plots and then saving it using save. Find and open the plot to see if it worked!

Example solution

(
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Bars(), so.Hist(stat='density', binwidth=5, binrange=(0, 100)), label='Histogram')
    .add(so.Line(color='red', linewidth=4, alpha=.7), so.KDE(), label='Kernel density')
    .label(x="Life expectancy", y="Density")
    .layout(size=(9, 4))
    .save("another_awesome_plot.png", bbox_inches='tight', dpi=200)
)

Check your current working directory to find the plot!

We also might want to just temporarily save a plot while we’re using Python, so that we can come back to it later. Luckily, a plot is just an object, like any other object we’ve been working with! Let’s try storing our histogram from earlier in an object called hist_plot.

hist_plot = (
    so.Plot(data=gapminder_1997, 
            x='lifeExp')
    .add(so.Bars(), so.Hist(stat='density', binwidth=5, binrange=(0, 100)))
)

Now if we want to see our plot again, we can just run:

hist_plot

plot of chunk outputViolinPlot

We can also add changes to the plot. Let’s say we want to add another layer of the kernel density estimation.

hist_plot.add(so.Line(color='red'), so.KDE())

plot of chunk violinPlotBWTheme

Watch out! Adding the layer does not change the hist_plot object! If we want to change the object, we need to store our changes:

hist_plot = hist_plot.add(so.Line(color='red'), so.KDE())

Bonus Exercise: Create and save a plot

Now try it yourself! Create your own plot using so.Plot(), store it in an object named my_plot, and save the plot using the save() method.

Example solution

(
    so.Plot(gapminder_1997, 
            x='gdpPercap', 
            y='lifeExp', 
            color='continent')
    .add(so.Dot())
    .label(x="GDP Per Capita", 
           y="Life Expectancy", 
           title = "Do people in wealthy countries live longer?")
    .save("my_awesome_plot.png", bbox_inches='tight', dpi=200)
)

Bonus

Creating complex plots

Animated plots

Back to bonus

Sometimes it can be cool (and useful) to create animated graphs, like this famous one by Hans Rosling using the Gapminder dataset that plots GDP vs. Life Expectancy over time. Let’s try to recreate this plot!

The seaborn library that we have used so far does not support animated plots. We will use a different visualization library in Python called Plotly - a popular library for making interactive visualizations.

Plotly is already pre-installed with Anaconda. All we need to do is to import the library.

import plotly.express as px

(
    px.scatter(data_frame=gapminder, 
               x='gdpPercap',
               y='lifeExp', 
               size='pop', 
               animation_frame='year', 
               hover_name='country', 
               color='continent', 
               height=600, 
               size_max=80)
)

plot of chunk hansGraphAnimated

Awesome! This is looking sweet! Let’s make sure we understand the code above:

  1. The animation_frame argument of the plotting function tells it which variable should be different in each frame of our animation: in this case, we want each frame to be a different year.
  2. There are quite a few more parameters that give us control over the plot. Feel free to check out more options in the documentation of the px.scatter() function.

So we’ve made this cool animated plot - how do we save it? We can apply the write_html() method to save the plot to a standalone HTML file.

(
    px.scatter(data_frame=gapminder, 
               x='gdpPercap',
               y='lifeExp', 
               size='pop', 
               animation_frame='year', 
               hover_name='country', 
               color='continent', 
               height=600, 
               size_max=80)
    .write_html("./hansAnimatedPlot.html")
)

Glossary of terms

Back to top

Key Points

  • Python is a free general purpose programming language used by many for reproducible data analysis.

  • Use the pandas library’s read_csv() function to read tabular data.

  • Use Python library seaborn to create and save data visualizations.


The Unix Shell

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • What is a command shell and why would I use one?

  • How can I move around on my computer?

  • How can I see what files and directories I have?

  • How can I specify the location of a file or directory on my computer?

  • How can I create, copy, and delete files and directories?

  • How can I edit files?

Objectives
  • Explain how the shell relates to users’ programs.

  • Explain when and why command-line interfaces should be used instead of graphical interfaces.

  • Construct absolute and relative paths that identify specific files and directories.

  • Demonstrate the use of tab completion and explain its advantages.

  • Create a directory hierarchy that matches a given diagram.

  • Create files in the directory hierarchy using an editor or by copying and renaming existing files.

  • Delete, copy, and move specified files and/or directories.

Contents

  1. Introducing the Shell
  2. Working with files and directories
  3. Glossary of terms

Introducing the Shell

Back to top

Motivation

Back to top

Usually you move around your computer and run programs through graphical user interfaces (GUIs) - for example, Finder on Mac and File Explorer on Windows. These GUIs are convenient because you can use your mouse to navigate to different folders and open different files. However, there are some things you simply can’t do from these GUIs.

The Unix Shell (or the command line) allows you to do everything you would do through Finder/Explorer, and a lot more. But it’s so scary! I thought so at first, too. Since then, I’ve learned that it’s just another way to navigate your computer and run programs, and it can be super useful for your work. For instance, you can use it to combine existing tools into a pipeline to automate analyses, you can write a script to do things for you and improve reproducibility, you can interact with remote machines and supercomputers that are far away from you, and sometimes it’s the only option for the program you want to run.

We’re going to use it to:

  1. Organize our Python code and plots from the Python plotting lesson.
  2. Perform version control using git during the rest of the workshop.

What the Shell looks like

Back to top

When you open up the terminal for the first time, it can look pretty scary - it’s basically just a blank screen. Don’t worry - we’ll take you through how to use it step by step.

The first line of the shell shows a prompt - the shell is waiting for input. When you’re following along in the lesson, don’t type the prompt when typing commands. To make the prompt the same for all of us, run this command:

PS1='$ '

Tree Structure

Back to top

The first thing we need to learn when using the shell is how to get around our computer. The shell folder (directory) structure is the same file structure as you’re used to. We call the way that different directories are nested the “directory tree”. You start at the root directory (/) and you can move “up” and “down” the tree. Here’s an example:
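For instance, a simplified tree might look like this (the folder names here are illustrative; your own machine will differ):

/
├── home
│   └── USERNAME
│       ├── Desktop
│       ├── Documents
│       └── Downloads
└── usr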

Now that we understand directory trees a bit, let’s check it out from the command line. We can see where we are by using the command pwd which stands for “print working directory”, or the directory we are currently in:

pwd
/home/USERNAME

Congrats! You just ran your first command from the command line. The output is a file path to a location (a directory) on your computer.

The output will look a little different depending on what operating system you’re using:

Let’s check to see what’s in your home directory using the ls command, which lists all of the files in your working directory:

ls
Desktop     Downloads   Movies      Pictures
Documents   Library     Music       Public

You should see some files and directories you’re familiar with such as Documents and Desktop.

If you make a typo, don’t worry. If the shell can’t find a command you type, it will show you a helpful error message.

ks
ks: command not found

This error message tells us the command we tried to run, ks, is not a command that is recognized, letting us know we might have made a mistake when typing.

Man and Help

Back to top

Often we’ll want to learn more about how to use a certain command such as ls. There are several different ways you can learn more about a specific command.

Some commands have additional information that can be found by using the -h or --help flags. This will print brief documentation for the command:

man -h
man --help

Other commands, such as ls, don’t have help flags, but have manual pages with more information. We can navigate the manual page using the man command to view the description of a command and its options. For example, if you want to know more about the navigation options of ls you can type man ls on the command line:

man ls

On the manual page for ls, we see a section titled options. These options, also called flags, are like arguments in Python functions, and allow us to customize how ls runs.

To get out of the man page, press q.

Sometimes, commands will have multiple flags that we want to use at the same time. For example, ls has a flag -F that displays a slash after all directories, as well as a flag -a that includes hidden files and directories (ones that begin with a .). There are two ways to run ls using both of these flags:

ls -F -a
ls -Fa

Note that when we use the -a flag, we see a . and a .. in the directory. The . corresponds to the current directory we are in and the .. corresponds to the directory directly above us in the directory tree. We’ll learn more about why this is useful in a bit.

Using the Manual Pages

Use man to open the manual for the command ls.

What flags would you use to…

  1. Print files in order of size?
  2. Print files in order of the last time they were edited?
  3. Print more information about the files?
  4. Print more information about the files with unit suffixes?
  5. Print files in order of size AND also print more information about the files?

Solution

  1. ls -S
  2. ls -t
  3. ls -l
  4. ls -lh
  5. ls -lS

Next, let’s move to our Desktop. To do this, we use cd to change directories.

Run the following command:

cd Desktop

Let’s see if we’re in the right place:

pwd
/home/USERNAME/Desktop

We just moved down the directory tree into the Desktop directory.

What files and directories do you have on your Desktop? How can you check?

ls
list.txt
un-report
notes.pdf
Untitled.png

Your Desktop will likely look different, but the important thing is that you see the folder we worked in for the Python plotting lesson. Is the un-report directory listed on your Desktop?

How can we get into the un-report directory?

cd un-report

We just went down the directory tree again.
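Tip: when typing that last command, you could have saved keystrokes with tab completion (one of this lesson’s objectives). Type the first few letters of a name and press Tab, and the shell fills in the rest as long as the match is unambiguous. The lines below are a sketch of what happens, not literal input:

cd un          # press Tab instead of Enter...
cd un-report/  # ...and the shell completes the directory name for you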

Let’s see what files are in un-report:

ls
awesome_plot.png
awesome_hist_plot.png
gapminder_1997.csv
gapminder_data.csv
gdp_population.ipynb

Is it what you expect? Are the files you made in the Python plotting lesson there?

Now let’s move back up the directory tree. First, let’s try this command:

cd Desktop
cd: Desktop: No such file or directory

This doesn’t work because the Desktop directory is not within the directory that we are currently in.

To move up the directory tree, you can use .., which is the parent of the current directory:

cd ..
pwd
/home/USERNAME/Desktop

Everything that we’ve been doing is working with file paths. We tell the computer where we want to go using cd plus the file path. We can also tell the computer what files we want to list by giving a file path to ls:

ls un-report
awesome_plot.png
awesome_hist_plot.png
gapminder_1997.csv
gapminder_data.csv
gdp_population.ipynb
ls ..
list.txt
un-report
notes.pdf
Untitled.png

What happens if you just type cd without a file path?

cd
pwd
/home/USERNAME

It takes you back to your home directory!

To get back to your project directory, you can use the following command:

cd Desktop/un-report

We have been using relative paths, meaning the path is interpreted starting from your current working directory.

You can also use the absolute path, or the entire path from the root directory. What’s listed when you use the pwd command is the absolute path:

pwd

You can also use ~ for the path to your home directory:

cd ~
pwd
/home/USERNAME
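To make the distinction concrete, both of the commands below move into the same directory, assuming you start from your home directory and use the layout from this lesson:

cd Desktop/un-report                    # relative path, interpreted from the current directory
cd /home/USERNAME/Desktop/un-report     # absolute path, interpreted from the root directory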

Absolute vs Relative Paths

Starting from /Users/amanda/data, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?

  1. cd .
  2. cd /
  3. cd /home/amanda
  4. cd ../..
  5. cd ~
  6. cd home
  7. cd ~/data/..
  8. cd
  9. cd ..

Solution

  1. No: . stands for the current directory.
  2. No: / stands for the root directory.
  3. No: Amanda’s home directory is /Users/amanda.
  4. No: this goes up two levels, i.e. ends in /Users.
  5. Yes: ~ stands for the user’s home directory, in this case /Users/amanda.
  6. No: this would navigate into a directory home in the current directory if it exists.
  7. Yes: unnecessarily complicated, but correct.
  8. Yes: shortcut to go back to the user’s home directory.
  9. Yes: goes up one level.

Working with files and directories

Back to top

Now that we know how to move around your computer using the command line, our next step is to organize the project that we started in the Python plotting lesson. You might ask: why would I use the command line when I could just use the GUI? My best response is that if you ever need to use a high-performance computing cluster (such as Great Lakes at the University of Michigan), you’ll have no other option. You might also come to like it more than clicking around to get places once you get comfortable, because it’s a lot faster!

First, let’s make sure we’re in the right directory (the un-report directory):

pwd
/home/USERNAME/Desktop/un-report

If you’re not there, cd to the correct place.

Next, let’s remind ourselves what files are in this directory:

ls
awesome_plot.png
awesome_hist_plot.png
gapminder_1997.csv
gapminder_data.csv
gdp_population.ipynb


You can see that right now all of our files are in our main directory. However, it can start to get crazy if you have too many different files of different types all in one place! We’re going to create a better project directory structure that will help us organize our files. This is really important, particularly for larger projects. If you’re interested in learning more about structuring computational biology projects in particular, here is a useful article.

What do you think would be a good way to organize our files?

One way is the following:

.
├── code
│   └── gdp_population.ipynb
├── data
│   ├── gapminder_1997.csv
│   └── gapminder_data.csv
└── figures
    ├── awesome_plot.png
    └── awesome_hist_plot.png

The Jupyter notebook goes in the code directory, the gapminder datasets go in the data directory, and the figures go in the figures directory. This way, all of the files are organized into a clearer overall structure.

A few notes about naming files and directories:

So how do we make our directory structure look like this?

First, we need to make a new directory. Let’s start with the code directory. To do this, we use the command mkdir plus the name of the directory we want to make:

mkdir code

Now, let’s see if that directory exists now:

ls
awesome_plot.png
awesome_hist_plot.png
code
gapminder_1997.csv
gapminder_data.csv

How can we check to see if there’s anything in the code directory?

ls code

Nothing in there yet, which is expected since we just made the directory.

The next step is to move the gdp_population.ipynb file into the code directory. To do this, we use the mv command. The first argument after mv is the file you want to move, and the second argument is the place you want to move it:

mv gdp_population.ipynb code

Okay, let’s see what’s in our current directory now:

ls
awesome_plot.png
awesome_hist_plot.png
code
gapminder_1997.csv
gapminder_data.csv

gdp_population.ipynb is no longer there! Where did it go? Let’s check the code directory, where we moved it to:

ls code
gdp_population.ipynb

There it is!
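The objectives also mention copying and deleting files. Here is a minimal sketch using cp and rm; the backup file name is just an illustration:

cp code/gdp_population.ipynb code/gdp_population_backup.ipynb   # copy the file under a new name
ls code
rm code/gdp_population_backup.ipynb                             # delete the copy - there is no trash bin!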

Creating directories and moving files

Create a data directory and move gapminder_data.csv and gapminder_1997.csv into the newly created data directory.

Solution

From the un-report directory:

mkdir data
mv gapminder_data.csv data
mv gapminder_1997.csv data

Okay, now we have the code and data in the right place. But we have several figures that should still be in their own directory.

First, let’s make a figures directory:

mkdir figures

Next, we have to move the figures. But we have so many figures! It’d be annoying to move them one at a time. Thankfully, we can use a wildcard to move them all at once. Wildcards are used to match files and directories to patterns.

One example of a wildcard is the asterisk, *. This special character is interpreted as “multiple characters of any kind”.

Let’s see how we can use a wildcard to list only files with the extension .png:

ls *png
awesome_plot.png
awesome_hist_plot.png

See how only the files ending in .png were listed? The shell expands the wildcard into a list of matching file names before running the command - we’ll see this expansion directly in a moment. Can you guess how we move all of these files at once to the figures directory?

mv *png figures
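If you want to see exactly what a wildcard expands to, echo will print the expanded list without doing anything else. Now that the figures have moved, we match them in their new location:

echo figures/*png
figures/awesome_hist_plot.png figures/awesome_plot.png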

We can also use the wildcard to list all of the files in all of the directories:

ls *
code:
gdp_population.ipynb

data:
gapminder_1997.csv  gapminder_data.csv

figures:
awesome_plot.png    awesome_hist_plot.png

This output shows each directory name, followed by its contents on the next line. As you can see, all of the files are now in the right place!

Working with Wildcards

Suppose we are in a directory containing the following files:

cubane.pdb
ethane.pdb
methane.pdb
octane.pdb
pentane.pdb
propane.pdb
README.md

What would be the output of the following commands?

  1. ls *
  2. ls *.pdb
  3. ls *ethane.pdb
  4. ls *ane
  5. ls p*

Solution

  1. cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb README.md
  2. cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
  3. ethane.pdb methane.pdb
  4. None: none of the file names end in ane (they all end in .pdb). This would have listed files if ls *ane* were used instead.
  5. pentane.pdb propane.pdb

Viewing Files

Back to top

To view and navigate the contents of a file we can use the command less. This will open a full screen view of the file.

For instance, we can run the command less on our gapminder_data.csv file:

less data/gapminder_data.csv

To navigate, press spacebar to scroll to the next page and b to scroll up to the previous page. You can also use the up and down arrows to scroll line-by-line. Note that less defaults to line wrapping, meaning that any lines longer than the width of the screen will be wrapped to the next line. To exit less, press the letter q.

One particularly useful flag for less is -S which cuts off really long lines (rather than having the text wrap around):

less -S data/gapminder_data.csv

Navigation works the same as before, but now long lines extend off the screen instead of wrapping, and you can use the left and right arrow keys to scroll horizontally. As always, press q to exit.

Note that not all file types can be viewed with less. While we can open PDFs and Excel spreadsheets easily with programs on our computer, less doesn’t render them well on the command line. For example, if we try to less a .png file we will see a warning.

less figures/awesome_plot.png
figures/awesome_plot.png may be a binary file.  See it anyway?

If we say “yes”, less will render the file but it will appear as a seemingly random display of characters that won’t make much sense to us.

Glossary of terms

Back to top

Key Points

  • A shell is a program whose primary purpose is to read commands and run other programs.

  • Tab completion can help you save a lot of time and frustration.

  • The shell’s main advantages are its support for automating repetitive tasks and its capacity to access network machines.

  • Information is stored in files, which are stored in directories (folders).

  • Directories nested in other directories form a directory tree.

  • cd [path] changes the current working directory.

  • ls [path] prints a listing of a specific file or directory.

  • ls lists the current working directory.

  • pwd prints the user’s current working directory.

  • / is the root directory of the whole file system.

  • A relative path specifies a location starting from the current location.

  • An absolute path specifies a location from the root of the file system.

  • Directory names in a path are separated with / on Unix, but \ on Windows.

  • .. means ‘the directory above the current one’; . on its own means ‘the current directory’.

  • cp [old] [new] copies a file.

  • mkdir [path] creates a new directory.

  • mv [old] [new] moves (renames) a file or directory.

  • rm [path] removes (deletes) a file.

  • * matches zero or more characters in a filename.

  • The shell does not have a trash bin — once something is deleted, it’s really gone.


Intro to Git & GitHub

Overview

Teaching: 90 min
Exercises: 60 min
Questions
  • What is version control and why should I use it?

  • How do I get set up to use Git?

  • How do I share my changes with others on the web?

  • How can I use version control to collaborate with other people?

Objectives
  • Explain what version control is and why it’s useful.

  • Configure git the first time it is used on a computer.

  • Learn the basic git workflow.

  • Push, pull, or clone a remote repository.

  • Describe the basic collaborative workflow with GitHub.

Contents

  1. Background
  2. Setting up git
  3. Creating a Repository
  4. Tracking Changes
  5. Intro to GitHub
  6. Collaborating with GitHub
  7. BONUS

Background

Back to top

We’ll start by exploring how version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, automated version control is much better than this situation:


“Piled Higher and Deeper” by Jorge Cham, http://www.phdcomics.com

We’ve all been in this situation before: it seems ridiculous to have multiple nearly-identical versions of the same document. Some word processors let us deal with this a little better, such as Microsoft Word’s Track Changes, Google Docs’ version history, or LibreOffice’s Recording and Displaying Changes.

Version control systems start with a base version of the document and then record changes you make each step of the way. You can think of it as a recording of your progress: you can rewind to start at the base document and play back each change you made, eventually arriving at your more recent version.

Changes Are Saved Sequentially

Once you think of changes as separate from the document itself, you can then think about “playing back” different sets of changes on the base document, ultimately resulting in different versions of that document. For example, two users can make independent sets of changes on the same document.

Different Versions Can be Saved

Unless multiple users make changes to the same section of the document - a conflict - you can incorporate two sets of changes into the same base document.

Multiple Versions Can be Merged

A version control system is a tool that keeps track of these changes for us, effectively creating different versions of our files. It allows us to decide which changes will be made to the next version (each record of these changes is called a commit), and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers, facilitating collaboration among different people.

Paper Writing

  • Imagine you drafted an excellent paragraph for a paper you are writing, but later ruin it. How would you retrieve the excellent version of your conclusion? Is it even possible?

  • Imagine you have 5 co-authors. How would you manage the changes and comments they make to your paper? If you use LibreOffice Writer or Microsoft Word, what happens if you accept changes made using the Track Changes option? Do you have a history of those changes?

Solution

  • Recovering the excellent version is only possible if you created a copy of the old version of the paper. The danger of losing good versions often leads to the problematic workflow illustrated in the PhD Comics cartoon at the top of this page.

  • Collaborative writing with traditional word processors is cumbersome. Either every collaborator has to work on a document sequentially (slowing down the process of writing), or you have to send out a version to all collaborators and manually merge their comments into your document. The ‘track changes’ or ‘record changes’ option can highlight changes for you and simplifies merging, but as soon as you accept changes you will lose their history. You will then no longer know who suggested that change, why it was suggested, or when it was merged into the rest of the document. Even online word processors like Google Docs or Microsoft Office Online do not fully resolve these problems.

Setting up Git

Back to top

When we use Git on a new computer for the first time, we need to configure a few things. Below are a few examples of configurations we will set as we get started with Git:

On a command line, Git commands are written as git verb options, where verb is what we actually want to do and options is additional optional information which may be needed for the verb. So here is how Riley sets up their new laptop:

$ git config --global user.name "Riley Shor"
$ git config --global user.email "Riley.Shor@fake.email.address"

Please use your own name and email address instead of Riley’s. This user name and email will be associated with your subsequent Git activity, which means that any changes pushed to GitHub, BitBucket, GitLab or another Git host server in a later lesson will include this information.

For these lessons, we will be interacting with GitHub and so the email address used should be the same as the one used when setting up your GitHub account. If you are concerned about privacy, please review GitHub’s instructions for keeping your email address private.

GitHub, GitLab, & BitBucket

GitHub, GitLab, & BitBucket are websites where you can store your git repositories, share them with the world, and collaborate with others. You can think of them like email applications. You may have a gmail address, and you can choose to manage your email through one of many services such as the Gmail app, Microsoft Outlook, Apple’s Mail app, etc. They have different interfaces and features, but all of them allow you to manage your email. Similarly, GitHub, GitLab, & BitBucket have different interfaces and features, but they all allow you to store, share, and collaborate with others on your git repos.

Line Endings

As with other keys, when you hit Return on your keyboard, your computer encodes this input as a character. Different operating systems use different character(s) to represent the end of a line. (You may also hear these referred to as newlines or line breaks.) Because Git uses these characters to compare files, it may cause unexpected issues when editing a file on different machines. Though it is beyond the scope of this lesson, you can read more about this issue in the Pro Git book.

You can change the way Git recognizes and encodes line endings using the core.autocrlf command to git config. The following settings are recommended:

On macOS and Linux:

$ git config --global core.autocrlf input

And on Windows:

$ git config --global core.autocrlf true

Editing Files

Back to top


Beyond viewing the content of files, we may want to be able to edit or write files on the command line. There are many different text editors you can use to edit files on the command line, but we will talk about nano since it is a bit easier to learn. To edit a file with nano, type nano file.txt. If the file exists, nano will open it; if not, nano will create it. One nice feature of nano is that it has a cheat sheet along the bottom with some common commands you’ll need. When you are ready to save (write) your file, type Ctrl+O. A prompt for the file name to write to will appear along the bottom; to keep the name as it is, hit Enter, or change the name first and then hit Enter. To exit nano, press Ctrl+X. If you forget to save before exiting, no worries - nano will prompt you to save the file first.
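As a quick sketch of that workflow (the file name here is just an example):

nano draft.txt    # opens draft.txt, creating it if it doesn't exist
                  # type your text, then:
                  #   Ctrl+O, Enter  - save the file
                  #   Ctrl+X         - exit nano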

Riley also has to set their favorite text editor, nano.

$ git config --global core.editor "nano -w"

If you have a different preferred text editor, it is possible to reconfigure the text editor for Git to other editors whenever you want to change it. Vim is the default editor. If you did not change your editor and are stuck in Vim, the following instructions will help you exit.

Exiting Vim

Note that Vim is the default editor for many programs. If you haven’t used Vim before and wish to exit a session without saving your changes, press Esc then type :q! and hit Return. If you want to save your changes and quit, press Esc then type :wq and hit Return.

The four commands we just ran above only need to be run once: the flag --global tells Git to use the settings for every project, in your user account, on this computer.

You can check your settings at any time:

$ git config --list

You can change your configuration as many times as you want: use the same commands to choose another editor or update your email address.
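For example, to switch to a new email address later (the address here is made up):

$ git config --global user.email "riley.shor@new.fake.email.address"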

Proxy

In some networks you need to use a proxy. If this is the case, you may also need to tell Git about the proxy:

$ git config --global http.proxy proxy-url
$ git config --global https.proxy proxy-url

To disable the proxy, use

$ git config --global --unset http.proxy
$ git config --global --unset https.proxy

Git Help and Manual

Always remember that if you forget a git command, you can access the list of commands by using -h and access the Git manual by using --help :

$ git config -h
$ git config --help

While viewing the manual, remember the : is a prompt waiting for commands and you can press Q to exit the manual.

Creating a Repository

Back to top

Once Git is configured, we can start using it.

First, let’s make sure we are in our un-report directory, if not we need to move into that directory:

$ pwd
/home/USERNAME/Desktop/un-report

To get back to your un-report directory you can use the following command:

Mac/git-bash:

cd ~/Desktop/un-report

On the Windows Subsystem for Linux:

cd /mnt/c/Users/USERNAME/Desktop/un-report

What is currently in our directory?

$ ls
code    data    figures

Now we tell Git to make un-report a repository – a place where Git can store versions of our files:

$ git init

It is important to note that git init will create a repository that includes subdirectories and their files—there is no need to create separate repositories nested within the un-report repository, whether subdirectories are present from the beginning or added later. Also, note that the creation of the un-report directory and its initialization as a repository are completely separate processes.

If we use ls to show the directory’s contents, it appears that nothing has changed:

$ ls

But if we add the -a flag to show everything, we can see that Git has created a hidden directory within un-report called .git:

$ ls -a
.	..	.git	code	data	figures

Git uses this special subdirectory to store all the information about the project, including all files and sub-directories located within the project’s directory. If we ever delete the .git subdirectory, we will lose the project’s history.

We can check that everything is set up correctly by asking Git to tell us the status of our project:

$ git status
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	code/
	data/
	figures/

nothing added to commit but untracked files present (use "git add" to track)

If you are using a different version of git, the exact wording of the output might be slightly different.

Places to Create Git Repositories

Along with tracking information about un-report (the project we have already created), Riley would also like to track information about countries. Despite our concerns, Riley creates a countries project inside their un-report project with the following sequence of commands:

$ cd ~/Desktop   # return to Desktop directory
$ cd un-report     # go into un-report directory, which is already a Git repository
$ ls -a          # ensure the .git subdirectory is still present in the un-report directory
$ mkdir countries    # make a subdirectory un-report/countries
$ cd countries       # go into countries subdirectory
$ git init       # make the countries subdirectory a Git repository
$ ls -a          # ensure the .git subdirectory is present indicating we have created a new Git repository

Is the git init command, run inside the countries subdirectory, required for tracking files stored in the countries subdirectory?

Solution

No. Riley does not need to make the countries subdirectory a Git repository because the un-report repository will track all files, sub-directories, and subdirectory files under the un-report directory. Thus, in order to track all information about countries, Riley only needed to add the countries subdirectory to the un-report directory.

Additionally, Git repositories can interfere with each other if they are “nested”: the outer repository will try to version-control the inner repository. Therefore, it’s best to create each new Git repository in a separate directory. To be sure that there is no conflicting repository in the directory, check the output of git status. If it looks like the following, you are good to go to create a new repository as shown above:

$ git status
fatal: Not a git repository (or any of the parent directories): .git

Correcting git init Mistakes

We explain to Riley how a nested repository is redundant and may cause confusion down the road. Riley would like to remove the nested repository. How can Riley undo their last git init in the countries subdirectory?

Solution – USE WITH CAUTION!

Background

Removing files from a git repository needs to be done with caution. To remove a file from both the repository and your working directory, use

$ git rm filename

(To stop tracking a file while keeping it in your working directory, use git rm --cached filename.) The file being removed has to be in sync with the branch head, with no updates. If there are updates, the file can be removed by force using the -f option. Similarly, a directory can be removed from git using git rm -r dirname or git rm -rf dirname.

Solution

Git keeps all of its files in the .git directory. To recover from this little mistake, Riley can just remove the .git folder in the countries subdirectory by running the following command from inside the un-report directory:

$ rm -rf countries/.git

But be careful! Running this command in the wrong directory will remove the entire Git history of a project you might want to keep. Therefore, always check your current directory using the command pwd.

Tracking Changes

Back to top

Let’s make sure we’re still in the right directory. You should be in the un-report directory.

$ cd ~/Desktop/un-report

Let’s create a file called notes.txt. We’ll write some notes about the plot we have made so far – later we’ll add more details about the project. We’ll use nano to edit the file; you can use whatever text editor you like.

$ nano notes.txt

Type the text below into the notes.txt file:

We plotted life expectancy over time.

Let’s first verify that the file was properly created by running the list command (ls):

$ ls
code    data    figures    notes.txt

notes.txt contains a single line, which we can see by running:

$ cat notes.txt
We plotted life expectancy over time.

If we check the status of our project again, Git tells us that it’s noticed the new file:

$ git status
On branch main

No commits yet

Untracked files:
   (use "git add <file>..." to include in what will be committed)

	notes.txt

nothing added to commit but untracked files present (use "git add" to track)

The “untracked files” message means that there’s a file in the directory that Git isn’t keeping track of. We can tell Git to track a file using git add:

$ git add notes.txt

and then check that the right thing happened:

$ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   notes.txt

Git now knows that it’s supposed to keep track of notes.txt, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:

$ git commit -m "Start notes on analysis"
[main (root-commit) f22b25e] Start notes on analysis
 1 file changed, 1 insertion(+)
 create mode 100644 notes.txt

When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is f22b25e. Your commit may have another identifier.

We use the -m flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.

Good commit messages start with a brief (<50 characters) statement about the changes made in the commit. Generally, the message should complete the sentence “If applied, this commit will [your message]”. If you want to go into more detail, add a blank line between the summary line and your additional notes. Use this additional space to explain why you made changes and/or what their impact will be.
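For example, a hypothetical commit message following this convention might look like:

Add summary statistics to report notes

Recording the mean life expectancy per continent lets us cite
exact numbers in the report instead of reading them off the plot.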

If we run git status now:

$ git status
On branch main
nothing to commit, working directory clean

it tells us everything is up to date. If we want to know what we’ve done recently, we can ask Git to show us the project’s history using git log:

$ git log
commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Riley Shor <Riley.Shor@fake.email.address>
Date:   Thu Aug 22 09:51:46 2020 -0400

    Start notes on analysis

git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.

Where Are My Changes?

If we run ls at this point, we will still see just one file called notes.txt. That’s because Git saves information about files’ history in the special .git directory mentioned earlier so that our filesystem doesn’t become cluttered (and so that we can’t accidentally edit or delete an old version).

Now suppose Riley adds more information to the file. (Again, we’ll edit with nano and then cat the file to show its contents; you may use a different editor, and don’t need to cat.)

$ nano notes.txt
$ cat notes.txt
We plotted life expectancy over time.
Each point represents a country.

When we run git status now, it tells us that a file it already knows about has been modified:

$ git status
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   notes.txt

no changes added to commit (use "git add" and/or "git commit -a")

The last line is the key phrase: “no changes added to commit”. We have changed this file, but we haven’t told Git we will want to save those changes (which we do with git add) nor have we saved them (which we do with git commit). So let’s do that now. It is good practice to always review our changes before saving them. We do this using git diff. This shows us the differences between the current state of the file and the most recently saved version:

$ git diff
diff --git a/notes.txt b/notes.txt
index df0654a..315bf3a 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1 +1,2 @@
 We plotted life expectancy over time.
+Each point represents a country.

The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. If we break it down into pieces:

  1. The first line tells us that Git is producing output similar to the Unix diff command comparing the old and new versions of the file.
  2. The second line tells exactly which versions of the file Git is comparing; df0654a and 315bf3a are unique computer-generated labels for those versions.
  3. The third and fourth lines once again show the name of the file being changed.
  4. The remaining lines are the most interesting, they show us the actual differences and the lines on which they occur. In particular, the + marker in the first column shows where we added a line.

After reviewing our change, it’s time to commit it:

$ git commit -m "Add information on points"
$ git status
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   notes.txt

no changes added to commit (use "git add" and/or "git commit -a")

Whoops: Git won’t commit because we didn’t use git add first. Let’s fix that:

$ git add notes.txt
$ git commit -m "Add information on points"
[main 34961b1] Add information on points
 1 file changed, 1 insertion(+)

Git insists that we add files to the set we want to commit before actually committing anything. This allows us to commit our changes in stages and capture changes in logical portions rather than only large batches. For example, suppose we’re adding a few citations to relevant research to our thesis. We might want to commit those additions, and the corresponding bibliography entries, but not commit some of our work drafting the conclusion (which we haven’t finished yet).
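As a sketch of that thesis example (the file names are hypothetical):

$ git add thesis.txt references.bib      # stage only the citation-related changes
$ git commit -m "Add citations to methods section"
# conclusion.txt still has unfinished edits, so we leave it unstaged for now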

To allow for this, Git has a special staging area where it keeps track of things that have been added to the current changeset but not yet committed.

Staging Area

If you think of Git as taking snapshots of changes over the life of a project, git add specifies what will go in a snapshot (putting things in the staging area), and git commit then actually takes the snapshot, and makes a permanent record of it (as a commit). If you don’t have anything staged when you type git commit, Git will prompt you to use git commit -a or git commit --all, which is kind of like gathering everyone to take a group photo! However, it’s almost always better to explicitly add things to the staging area, because you might commit changes you forgot you made. (Going back to the group photo simile, you might get an extra with incomplete makeup walking on the stage for the picture because you used -a!) Try to stage things manually, or you might find yourself searching for “how to undo a commit” more than you would like! We’ll show you how to do this a little later in this lesson.

The Git Staging Area

Let’s watch as our changes to a file move from our editor to the staging area and into long-term storage. First, we’ll add another line to the file:

$ nano notes.txt
$ cat notes.txt
We plotted life expectancy over time.
Each point represents a country.
Continents are grouped by color.
$ git diff
diff --git a/notes.txt b/notes.txt
index 315bf3a..b36abfd 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1,2 +1,3 @@
 We plotted life expectancy over time.
 Each point represents a country.
+Continents are grouped by color.

So far, so good: we’ve added one line to the end of the file (shown with a + in the first column). Now let’s put that change in the staging area and see what git diff reports:

$ git add notes.txt
$ git diff

There is no output: as far as Git can tell, there’s no difference between what it’s been asked to save permanently and what’s currently in the directory. However, if we do this:

$ git diff --staged
diff --git a/notes.txt b/notes.txt
index 315bf3a..b36abfd 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1,2 +1,3 @@
 We plotted life expectancy over time.
 Each point represents a country.
+Continents are grouped by color.

it shows us the difference between the last committed change and what’s in the staging area. Let’s save our changes:

$ git commit -m "Add note about point color"
[main 005937f] Add note about point color
 1 file changed, 1 insertion(+)

check our status:

$ git status
On branch main
nothing to commit, working directory clean

and look at the history of what we’ve done so far:

$ git log
commit 005937fbe2a98fb83f0ade869025dc2636b4dad5
Author: Riley Shor <Riley.Shor@fake.email.address>
Date:   Thu Aug 22 10:14:07 2020 -0400

    Add note about point color

commit 34961b159c27df3b475cfe4415d94a6d1fcd064d
Author: Riley Shor <Riley.Shor@fake.email.address>
Date:   Thu Aug 22 10:07:21 2020 -0400

    Add information on points

commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Riley Shor <Riley.Shor@fake.email.address>
Date:   Thu Aug 22 09:51:46 2020 -0400

    Start notes on analysis

Word-based diffing

Sometimes, e.g. in the case of text documents, a line-wise diff is too coarse. That is where the --color-words option of git diff comes in very useful, as it highlights the changed words using colors.

Paging the Log

When the output of git log is too long to fit in your screen, git uses a program to split it into pages of the size of your screen. When this “pager” is called, you will notice that the last line in your screen is a :, instead of your usual prompt.

  • To get out of the pager, press Q.
  • To move to the next page, press Spacebar.
  • To search for some_word in all pages, press / and type some_word. Navigate through matches pressing N.

Limit Log Size

To avoid having git log cover your entire terminal screen, you can limit the number of commits that Git lists by using -N, where N is the number of commits that you want to view. For example, if you only want information from the last commit you can use:

$ git log -1
commit 005937fbe2a98fb83f0ade869025dc2636b4dad5
Author: Riley Shor <Riley.Shor@fake.email.address>
Date:   Thu Aug 22 10:14:07 2020 -0400

   Add note about point color

You can also reduce the quantity of information using the --oneline option:

$ git log --oneline
005937f Add note about point color
34961b1 Add information on points
f22b25e Start notes on analysis

You can also combine the --oneline option with others. One useful combination adds --graph to display the commit history as a text-based graph and to indicate which commits are associated with the current HEAD, the current branch main, or other Git references:

$ git log --oneline --graph
* 005937f (HEAD -> main) Add note about point color
* 34961b1 Add information on points
* f22b25e Start notes on analysis

Directories

Two important facts you should know about directories in Git.

  1. Git does not track directories on their own, only files within them. Try it for yourself:

    $ mkdir analysis
    $ git status
    $ git add analysis
    $ git status
    

    Note, our newly created empty directory analysis does not appear in the list of untracked files even if we explicitly add it (via git add) to our repository. This is the reason why you will sometimes see .gitkeep files in otherwise empty directories. Unlike .gitignore, these files are not special and their sole purpose is to populate a directory so that Git adds it to the repository. In fact, you can name such files anything you like.

  2. If you create a directory in your Git repository and populate it with files, you can add all files in the directory at once by:

    git add <directory-with-files>
    

    Try it for yourself:

    $ touch analysis/file-1.txt analysis/file-2.txt
    $ git status
    $ git add analysis
    $ git status
    

    Note: the touch command creates blank text files that you can later edit with your preferred text editor.

    Before moving on, we will commit these changes.

    $ git commit -m "Create blank text files"
    

To recap, when we want to add changes to our repository, we first need to add the changed files to the staging area (git add) and then commit the staged changes to the repository (git commit):

The Git Commit Workflow

Choosing a Commit Message

Which of the following commit messages would be most appropriate for the last commit made to notes.txt?

  1. “Changes”
  2. “Added line ‘Continents are grouped by color.’ to notes.txt”
  3. “Describe grouping”

Solution

Answer 1 is not descriptive enough, and the purpose of the commit is unclear. Answer 2 is redundant: using “git diff” already shows what changed in this commit. Answer 3 is good: short, descriptive, and imperative.

Committing Changes to Git

Which command(s) below would save the changes of myfile.txt to my local Git repository?

  1. $ git commit -m "my recent changes"
    
  2. $ git init myfile.txt
    $ git commit -m "my recent changes"
    
  3. $ git add myfile.txt
    $ git commit -m "my recent changes"
    
  4. $ git commit -m myfile.txt "my recent changes"
    

Solution

  1. Would only create a commit if files have already been staged.
  2. Would try to create a new repository.
  3. Is correct: first add the file to the staging area, then commit.
  4. Would try to commit a file “my recent changes” with the message myfile.txt.

Committing Multiple Files

The staging area can hold changes from any number of files that you want to commit as a single snapshot.

  1. Add some text to notes.txt noting your decision to consider writing a manuscript.
  2. Create a new file manuscript.txt with your initial thoughts.
  3. Add changes from both files to the staging area, and commit those changes.

Solution

First we make our changes to the notes.txt and manuscript.txt files:

$ nano notes.txt
$ cat notes.txt
Maybe I should start with a draft manuscript.
$ nano manuscript.txt
$ cat manuscript.txt
This is where I will write an awesome manuscript.

Now you can add both files to the staging area. We can do that in one line:

$ git add notes.txt manuscript.txt

Or with multiple commands:

$ git add notes.txt
$ git add manuscript.txt

Now the files are ready to commit. You can check that using git status. If you are ready to commit use:

$ git commit -m "Note plans to start a draft manuscript"
[main cc127c2] Note plans to start a draft manuscript
 2 files changed, 2 insertions(+)
 create mode 100644 manuscript.txt

workshop Repository

  • Create a new Git repository on your computer called workshop.
  • Write three lines about what you have learned about Python and bash in a file called notes.txt, and commit your changes
  • Modify one line, add a fourth line
  • Display the differences between its updated state and its original state.

Solution

If needed, move out of the un-report folder:

$ cd ..

Create a new folder called workshop and ‘move’ into it:

$ mkdir workshop
$ cd workshop

Initialise git:

$ git init

Create your file notes.txt using nano or another text editor. Once in place, add and commit it to the repository:

$ git add notes.txt
$ git commit -m "Add notes file"

Modify the file as described (modify one line, add a fourth line). To display the differences between its updated state and its original state, use git diff:

$ git diff notes.txt

Intro to GitHub

Back to top

Now that you’ve created a git repo and gotten the hang of the basic git workflow, it’s time to share your repo with the world. Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, Bitbucket or GitLab to hold those main copies.

Let’s start by sharing the changes we’ve made to our current project with the world. Log in to GitHub, then click on the icon in the top right corner to create a new repository called un-report.

Creating a Repository on GitHub (Step 1)

Name your repository un-report and then click Create Repository.

Important options

Since this repository will be connected to a local repository, it needs to be empty. Leave “Initialize this repository with a README” unchecked, and keep “None” as options for both “Add .gitignore” and “Add a license.” See the “GitHub License and README files” exercise below for a full explanation of why the repository needs to be empty.

In the screenshots below, the Owner is ‘mkuzak’ and the Repository name is ‘planets’. You should instead see your own username for the Owner and you should name the repository un-report.

Creating a Repository on GitHub (Step 2)

As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository:

Creating a Repository on GitHub (Step 3)

This effectively does the following on GitHub’s servers:

$ mkdir un-report
$ cd un-report
$ git init

If you remember back to when we added and committed our earlier work on notes.txt, we had a diagram of the local repository which looked like this:

The Local Repository with Git Staging Area

Now that we have two repositories, we need a diagram like this:

Freshly-Made GitHub Repository

Note that our local repository still contains our earlier work on notes.txt, but the remote repository on GitHub appears empty as it doesn’t contain any files yet.

Linking a local repository to GitHub

The next step is to connect the two repositories. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it:

Where to Find Repository URL on GitHub

Copy that URL from the browser, go into the local un-report repository, and run this command:

$ git remote add origin https://github.com/USERNAME/un-report.git

Make sure to replace USERNAME with your actual GitHub username so it will use the correct URL for your repository; that should be the only difference.

origin is a local name used to refer to the remote repository. It could be called anything, but origin is a convention that is often used by default in git and GitHub, so it’s helpful to stick with this unless there’s a reason not to.
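Should you ever want a different name, remotes can be renamed. A hypothetical example:

$ git remote rename origin github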

We can check that the command has worked by running git remote -v:

$ git remote -v
origin   https://github.com/USERNAME/un-report.git (push)
origin   https://github.com/USERNAME/un-report.git (fetch)

Now we want to send our local git information to GitHub. While the default for code you put on GitHub is that anyone can view or make copies of your code, in order to make changes to your repository, you need to be able to log in so GitHub can recognize you as someone who is authorized to make changes.

Setting up your GitHub Personal Access Token (PAT)

When you use the GitHub website, you need to log in with a username and password. By default, only you will be able to make any changes to the repositories you create. In order to perform git commands on your own computer that interact with GitHub, we need a way to tell GitHub who you are. Rather than requiring you to type your password every time, you can identify yourself with a personal access token (PAT). Let’s first tell git that we would like it to remember our credentials so we don’t have to constantly retype them. At the command line type:

git config --global credential.helper store

Like the previous git config commands we ran before, this tells git to store our account information so it doesn’t have to ask us for it every time we use git on the command line.

The information git stores is your personal access token (PAT). These tokens are basically a secret word that only you know that allows you to access all your stuff. Think of these tokens like a key to your house. You never want to hand over the keys to your house to someone you don’t trust. But as long as you hang on to that key, you are free to access all your stuff.

What’s the difference between passwords and PATs?

You might be wondering why we can’t just type a password to login and need to use a PAT instead. There are a few reasons:

  • Human-created passwords may be easy to guess and are often reused across many sites. You don’t want to make it easy for someone to copy your keys, nor is it safe to have just one key that can unlock everything you own (your house, your car, your secret money vault, etc.)
  • PATs are generated by computers, for computers. PATs are much longer than most human-created passwords and have random combinations of letters and characters that are very difficult to guess
  • A user can generate multiple PATs for the same account for different uses with different permissions
  • GitHub now requires the use of PATs when using HTTPS (so we don’t really have a choice). Overall, PATs are more secure if you also keep them private.

To create a PAT, you’ll need to be logged in to GitHub. Click your profile icon in the top right corner and choose “Settings” from the dropdown. On the main settings page there is a long list of options on the left. Scroll down until you find “Developer Settings”. Next you should see three options: “GitHub Apps”, “OAuth Apps”, and “Personal access tokens”. We want to create a token, so click on the last link. You should now see a link to “Generate a personal access token”. Click that. (You should now be at https://github.com/settings/tokens/new)

On the “New personal access token” form, the first field you see is for “Note.” You can actually create multiple tokens. This note field helps you remember what the token was for. It’s a good idea to create one per computer you use, so the note field would be something like “work-laptop”, “home-macbook”, or “greatlakes-project”. Next you will see an option for “Expiration.” Since your tokens are like the keys to your house, it’s a good idea that if you forget about your tokens, they just stop working after a while so no one else can misuse them. When your tokens expire, you can just generate a new one. GitHub recommends you choose an expiration date, so we can just choose “90 days” or whatever is appropriate for you. (Note: You will have to repeat this process of generating a new PAT when an existing PAT expires.)

Finally we must choose the “scopes” associated with this token. Just like you may have different keys on your key chain to different rooms, you can choose which of the GitHub “doors” your token can unlock. For now, choose the checkboxes next to “repo” and “user” (each of these main checkboxes will also select multiple sub-checkboxes which is what we want). In the future if you need a token with more access to GitHub features, you can create a new one. It’s best to choose the minimum set of permissions you need just in case anyone else were to get ahold of your token.

Finally, press the “Generate token” button at the bottom. You will see your token in a green box on that page. It will be a long string of numbers and letters starting with “ghp_”. There is an icon at the end of the token that will copy that special value to your clipboard. We will use this as your password when logging in during the next step.

Pushing changes to github

Now that we’ve set up the remote server information and have generated a personal access token, we are ready to send our data to GitHub. This command will push the changes from our local repository to the repository on GitHub:

$ git push origin main

When it asks you for your username, use your GitHub username, and when it asks you for a password, paste in the token that we just created. Then you should see something like the following output:

Enumerating objects: 16, done.
Counting objects: 100% (16/16), done.
Delta compression using up to 8 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (16/16), 1.45 KiB | 372.00 KiB/s, done.
Total 16 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), done.
To https://github.com/USERNAME/un-report.git
 * [new branch]      main -> main

Our local and remote repositories are now in this state:

GitHub Repository After First Push

The ‘-u’ Flag

You may see a -u option used with git push in some documentation. This option is synonymous with the --set-upstream-to option for the git branch command, and is used to associate the current branch with a remote branch so that the git pull command can be used without any arguments. To do this, simply use git push -u origin main once the remote has been set up.

We can pull changes from the remote repository to the local one as well:

$ git pull origin main
From https://github.com/USERNAME/un-report
 * branch            main     -> FETCH_HEAD
Already up-to-date.

Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.

GitHub GUI

Browse to your un-report repository on GitHub. Under the Code tab, find and click on the text that says “XX commits” (where “XX” is some number). Hover over, and click on, the three buttons to the right of each commit. What information can you gather/explore from these buttons? How would you get that same information in the shell?

Solution

The left-most button (with the picture of a clipboard) copies the full identifier of the commit to the clipboard. In the shell, git log will show you the full commit identifier for each commit.

When you click on the middle button, you’ll see all of the changes that were made in that particular commit. Green shaded lines indicate additions and red ones removals. In the shell we can do the same thing with git diff. In particular, git diff ID1..ID2 where ID1 and ID2 are commit identifiers (e.g. git diff a3bf1e5..041e637) will show the differences between those two commits.

The right-most button lets you view all of the files in the repository at the time of that commit. To do this in the shell, we’d need to checkout the repository at that particular time. We can do this with git checkout ID where ID is the identifier of the commit we want to look at. If we do this, we need to remember to put the repository back to the right state afterwards!

Uploading files directly in GitHub browser

GitHub also allows you to skip the command line and upload files directly to your repository without leaving the browser. There are two options: you can click the “Upload files” button in the toolbar at the top of the file tree, or you can drag and drop files from your desktop onto the file tree. You can read more about this on this GitHub page.

Push vs. Commit

In this lesson, we introduced the “git push” command. How is “git push” different from “git commit”?

Solution

When we push changes, we’re interacting with a remote repository to update it with the changes we’ve made locally (often this corresponds to sharing the changes we’ve made with others). Commit only updates your local repository.

GitHub License and README files

In this section we learned about creating a remote repository on GitHub, but when you initialized your GitHub repo, you didn’t add a readme or a license file. If you had, what do you think would have happened when you tried to link your local and remote repositories?

Solution

In this case, the pull would fail because the two repositories have unrelated histories. When GitHub creates a readme file, it performs a commit in the remote repository. When you try to pull the remote repository to your local repository, Git detects that they have histories that do not share a common origin and refuses to merge.

$ git pull origin main
warning: no common commits
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/USERNAME/un-report
 * branch            main     -> FETCH_HEAD
 * [new branch]      main     -> origin/main
fatal: refusing to merge unrelated histories

You can force Git to merge the two repositories with the option --allow-unrelated-histories. Be careful when you use this option, and carefully examine the contents of the local and remote repositories before merging.

$ git pull --allow-unrelated-histories origin main
From https://github.com/USERNAME/un-report
 * branch            main     -> FETCH_HEAD
Merge made by the 'recursive' strategy.
notes.txt | 1 +
1 file changed, 1 insertion(+)
create mode 100644 notes.txt

Collaborating with GitHub

Back to top

For the next step, get into pairs. One person will be the “Owner” and the other will be the “Collaborator”. The goal is for the Collaborator to add changes to the Owner’s repository. We will switch roles at the end, so both people will play both Owner and Collaborator.

Practicing By Yourself

If you’re working through this lesson on your own, you can carry on by opening a second terminal window. This window will represent your partner, working on another computer. You won’t need to give anyone access on GitHub, because both ‘partners’ are you.

The Owner needs to give the Collaborator access. On GitHub, click the settings button on the right, select Manage access, click Invite a collaborator, and then enter your partner’s username.

Adding Collaborators on GitHub

To accept access to the Owner’s repo, the Collaborator needs to go to https://github.com/notifications. Once there they can accept access to the Owner’s repo.

Next, the Collaborator needs to download a copy of the Owner’s repository to their machine. This is called “cloning a repo”. To clone the Owner’s repo into their Desktop folder, the Collaborator enters:

$ git clone https://github.com/USERNAME/un-report.git ~/Desktop/USERNAME-un-report

Replace USERNAME with the Owner’s username.

The Collaborator can now make a change in their clone of the Owner’s repository, exactly the same way as we’ve been doing before:

$ cd ~/Desktop/USERNAME-un-report
$ nano notes.txt
$ cat notes.txt

You can write anything you like. Now might be a good time to list the dependencies of the project – the tools and packages that are needed to run the code.

Dependencies:
- Python
- pandas
- seaborn
$ git add notes.txt
$ git commit -m "List dependencies"
 1 file changed, 1 insertion(+)
 create mode 100644 notes.txt

Then push the change to the Owner’s repository on GitHub:

$ git push origin main
Enumerating objects: 4, done.
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 306 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/USERNAME/un-report.git
   9272da5..29aba7c  main -> main

Note that we didn’t have to create a remote called origin: Git uses this name by default when we clone a repository. (This is why origin was a sensible choice earlier when we were setting up remotes by hand.)

Take a look at the Owner’s repository on its GitHub website now (you may need to refresh your browser.) You should be able to see the new commit made by the Collaborator.

To download the Collaborator’s changes from GitHub, the Owner now enters:

$ git pull origin main
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 3 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/USERNAME/un-report
 * branch            main     -> FETCH_HEAD
   9272da5..29aba7c  main     -> origin/main
Updating 9272da5..29aba7c
Fast-forward
 notes.txt | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 notes.txt

Now the three repositories (Owner’s local, Collaborator’s local, and Owner’s on GitHub) are back in sync!

A Basic Collaborative Workflow

In practice, it is good to be sure that you have an updated version of the repository you are collaborating on, so you should git pull before making your changes. The basic collaborative workflow would be:

  • update your local repo with git pull,
  • make your changes and stage them with git add,
  • commit your changes with git commit -m, and
  • upload the changes to GitHub with git push

It is better to make many commits with smaller changes rather than one commit with massive changes: small commits are easier to read and review.
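Putting those steps together, one round of the workflow might look like the sketch below (the file name and commit message are just examples):

$ git pull origin main
$ nano notes.txt
$ git add notes.txt
$ git commit -m "Add a note about the analysis"
$ git push origin main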

Switch Roles and Repeat

Switch roles and repeat the whole process.

Review Changes

The Owner pushed commits to the repository without giving any information to the Collaborator. How can the Collaborator find out what has changed on GitHub?

Solution

On GitHub, the Collaborator can go to the repository and click on “commits” to view the most recent commits pushed to the repository.

github-commits

Comment Changes in GitHub

The Collaborator has some questions about one line change made by the Owner and has some suggestions to propose.

With GitHub, it is possible to comment on the diff of a commit. From the main repository page, click on “commits”, and click on a recent commit. Hover your mouse over a line of code, and a blue plus icon will appear; click it to open a comment window.

The Collaborator posts comments and suggestions using the GitHub interface.

comment-icon

Version History, Backup, and Version Control

Some backup software (e.g. Time Machine on macOS, Google Drive) can keep a history of the versions of your files. They also allow you to recover specific versions. How is this functionality different from version control? What are some of the benefits of using version control, Git and GitHub?

Solution

Automated backup software gives you less control over how often backups are created and it is often difficult to compare changes between backups. However, Git has a steeper learning curve than backup software. Advantages of using Git and GitHub for version control include:

  • Great control over which files to include in commits and when to make commits.
  • Very popular way to collaborate on code and analysis projects among programmers, data scientists, and researchers.
  • Free and open source.
  • GitHub allows you to share your project with the world and accept contributions from outside collaborators.

Some more about remotes

In this episode and the previous one, our local repository has had a single “remote”, called origin. A remote is a copy of the repository that is hosted somewhere else, that we can push to and pull from, and there’s no reason that you have to work with only one. For example, on some large projects you might have your own copy in your own GitHub account (you’d probably call this origin) and also the main “upstream” project repository (let’s call this upstream for the sake of examples). You would pull from upstream from time to time to get the latest updates that other people have committed.

Remember that the name you give to a remote only exists locally. It’s an alias that you choose - whether origin, or upstream, or fred - and not something intrinsic to the remote repository.

The git remote family of commands is used to set up and alter the remotes associated with a repository. Here are some of the most useful ones:

  • git remote -v lists all the remotes that are configured (we already used this in the last episode)
  • git remote add [name] [url] is used to add a new remote
  • git remote remove [name] removes a remote. Note that it doesn’t affect the remote repository at all - it just removes the link to it from the local repo.
  • git remote set-url [name] [newurl] changes the URL that is associated with the remote. This is useful if it has moved, e.g. to a different GitHub account, or from GitHub to a different hosting service. Or, if we made a typo when adding it!
  • git remote rename [oldname] [newname] changes the local alias by which a remote is known - its name. For example, one could use this to change upstream to fred.
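For example, here is a hypothetical sketch of setting up and using an upstream remote (the URL is made up for illustration):

$ git remote add upstream https://github.com/UPSTREAM-OWNER/un-report.git
$ git remote -v    # now lists both origin and upstream
$ git pull upstream main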

Bonus

Back to top

Exploring history

We can refer to commits by the identifiers shown in git log. You can also refer to the most recent commit of the working directory by using the identifier HEAD.

We’ve been adding one line at a time to notes.txt, so it’s easy to track our progress by looking at the file; let’s do that using HEAD. Before we start, let’s make a change to notes.txt, adding yet another line.

$ nano notes.txt
$ cat notes.txt
We plotted life expectancy over time.
Each point represents a country.
Continents are grouped by color.
An ill-considered change.

Now, let’s see what we get.

$ git diff HEAD notes.txt
diff --git a/notes.txt b/notes.txt
index b36abfd..0848c8d 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1,3 +1,4 @@
 We plotted life expectancy over time.
 Each point represents a country.
 Continents are grouped by color.
+An ill-considered change.

which is the same as what you would get if you leave out HEAD (try it). The real goodness in all this is when you can refer to previous commits. We do that by adding ~1 (where “~” is “tilde”, pronounced [til-duh]) to refer to the commit one before HEAD.

$ git diff HEAD~1 notes.txt

If we want to see the differences between older commits we can use git diff again, but with the notation HEAD~1, HEAD~2, and so on, to refer to them:

$ git diff HEAD~3 notes.txt
diff --git a/notes.txt b/notes.txt
index df0654a..b36abfd 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1 +1,4 @@
 We plotted life expectancy over time.
+Each point represents a country.
+Continents are grouped by color.
+An ill-considered change.

We could also use git show, which shows us the changes made in an older commit along with its commit message, rather than the differences between a commit and our working directory that we see by using git diff.

$ git show HEAD~3 notes.txt
commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Riley Shor <Riley.Shor@fake.email.address>
Date:   Thu Aug 22 09:51:46 2020 -0400

    Make a change that I'll regret later

diff --git a/notes.txt b/notes.txt
new file mode 100644
index 0000000..df0654a
--- /dev/null
+++ b/notes.txt
@@ -0,0 +1 @@
+We plotted life expectancy over time.

In this way, we can build up a chain of commits. The most recent end of the chain is referred to as HEAD; we can refer to previous commits using the ~ notation, so HEAD~1 means “the previous commit”, while HEAD~123 goes back 123 commits from where we are now.

We can also refer to commits using those long strings of digits and letters that git log displays. These are unique IDs for the changes, and “unique” really does mean unique: every change to any set of files on any computer has a unique 40-character identifier. Our first commit was given the ID f22b25e3233b4645dabd0d81e651fe074bd8e73b, so let’s try this:

$ git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b notes.txt
diff --git a/notes.txt b/notes.txt
index df0654a..93a3e13 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1 +1,4 @@
 We plotted life expectancy over time.
+Each point represents a country.
+Continents are grouped by color.
+An ill-considered change.

That’s the right answer, but typing out random 40-character strings is annoying, so Git lets us use just the first few characters (typically seven for normal size projects):

$ git diff f22b25e notes.txt
diff --git a/notes.txt b/notes.txt
index df0654a..93a3e13 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1 +1,4 @@
 We plotted life expectancy over time.
+Each point represents a country.
+Continents are grouped by color.
+An ill-considered change.

All right! So we can save changes to files and see what we’ve changed. Now, how can we restore older versions of things? Let’s suppose we change our mind about the last update to notes.txt (the “ill-considered change”).

git status now tells us that the file has been changed, but those changes haven’t been staged:

$ git status
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   notes.txt

no changes added to commit (use "git add" and/or "git commit -a")

We can put things back the way they were by using git checkout:

$ git checkout HEAD notes.txt
$ cat notes.txt
We plotted life expectancy over time.
Each point represents a country.
Continents are grouped by color.

As you might guess from its name, git checkout checks out (i.e., restores) an old version of a file. In this case, we’re telling Git that we want to recover the version of the file recorded in HEAD, which is the last saved commit. If we want to go back even further, we can use a commit identifier instead:

$ git checkout f22b25e notes.txt
$ cat notes.txt
We plotted life expectancy over time.
$ git status
On branch main
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    modified:   notes.txt

Notice that the changes are currently in the staging area. Again, we can put things back the way they were by using git checkout:

$ git checkout HEAD notes.txt

Don’t Lose Your HEAD

Above we used

$ git checkout f22b25e notes.txt

to revert notes.txt to its state after the commit f22b25e. But be careful! The checkout command has other important functionalities, and Git will misunderstand your intentions if you are not accurate with the typing. For example, suppose you forget notes.txt in the previous command:

$ git checkout f22b25e
Note: checking out 'f22b25e'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

 git checkout -b <new-branch-name>

HEAD is now at f22b25e Make a change that I'll regret later

The “detached HEAD” is like “look, but don’t touch” here, so you shouldn’t make any changes in this state. After investigating your repo’s past state, reattach your HEAD with git checkout main.

It’s important to remember that we must use the commit number that identifies the state of the repository before the change we’re trying to undo. A common mistake is to use the number of the commit in which we made the change we’re trying to discard. In the example below, we want to retrieve the state from before the most recent commit (HEAD~1), which is commit f22b25e:

Git Checkout

We have now reverted our file to the latest version without the error, while keeping the history, including the commit that introduced it.

Simplifying the Common Case

If you read the output of git status carefully, you’ll see that it includes this hint:

(use "git checkout -- <file>..." to discard changes in working directory)

As it says, git checkout without a version identifier restores files to the state saved in HEAD. The double dash -- is needed to separate the names of the files being recovered from the command itself: without it, Git would try to use the name of the file as the commit identifier.
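For example, to throw away uncommitted edits to notes.txt (a quick sketch; be careful, as discarded changes cannot be recovered):

$ git checkout -- notes.txt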

The fact that files can be reverted one by one tends to change the way people organize their work. If everything is in one large document, it’s hard (but not impossible) to undo changes to the introduction without also undoing changes made later to the conclusion. If the introduction and conclusion are stored in separate files, on the other hand, moving backward and forward in time becomes much easier.

Recovering Older Versions of a File

Jennifer has made changes to the Python script that she has been working on for weeks, and the modifications she made this morning “broke” the script so that it no longer runs. She has spent ~1 hour trying to fix it, with no luck…

Luckily, she has been keeping track of her project’s versions using Git! Which commands below will let her recover the last committed version of her Python script called data_cruncher.py?

  1. $ git checkout HEAD

  2. $ git checkout HEAD data_cruncher.py

  3. $ git checkout HEAD~1 data_cruncher.py

  4. $ git checkout <unique ID of last commit> data_cruncher.py

  5. Both 2 and 4

Solution

The answer is (5), both 2 and 4.

The checkout command restores files from the repository, overwriting the files in your working directory. Answers 2 and 4 both restore the latest version in the repository of the file data_cruncher.py. Answer 2 uses HEAD to indicate the latest, whereas answer 4 uses the unique ID of the last commit, which is what HEAD means.

Answer 3 gets the version of data_cruncher.py from the commit before HEAD, which is NOT what we wanted.

Answer 1 can be dangerous! Without a filename, git checkout will restore all files in the current directory (and all directories below it) to their state at the commit specified. This command will restore data_cruncher.py to the latest commit version, but it will also restore any other files that are changed to that version, erasing any changes you may have made to those files! As discussed above, you are left in a detached HEAD state, and you don’t want to be there.

Undoing changes

Back to top

Reverting a Commit

Jennifer is collaborating on her Python script with her colleagues and realizes her last commit to the project’s repository contained an error, and she wants to undo it. git revert [erroneous commit ID] will create a new commit that reverses Jennifer’s erroneous commit. git revert is therefore different from git checkout [commit ID]: git checkout returns the files within the local repository to a previous state, whereas git revert reverses changes committed to the local and project repositories.
Below are the right steps and explanations for Jennifer to use git revert. What is the missing command in step 1 below?

  1. ________ # Look at the git history of the project to find the commit ID

  2. Copy the ID (the first few characters of the ID, e.g. 0b1d055).

  3. git revert [commit ID]

  4. Type in the new commit message.

  5. Save and close

Solution

Use git log to look at the git history to find the commit ID.
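Putting the steps together, the fix might look like this sketch (0b1d055 is just the example ID from step 2):

$ git log --oneline      # find the ID of the erroneous commit
$ git revert 0b1d055     # opens your editor for the new commit message; save and close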

Understanding Workflow and History

What is the output of the last command in

$ echo "Here are my notes from the workshop." > notes.txt
$ git add notes.txt
$ echo "I learned the unix shell, git & github, and the Python programming language." >> notes.txt
$ git commit -m "Create workshop notes"
$ git checkout HEAD notes.txt
$ cat notes.txt #this will print the contents of notes.txt to the screen
  1. I learned the unix shell, git & github, and the Python programming language.
    
  2. Here are my notes from the workshop.
    
  3. Here are my notes from the workshop.
    I learned the unix shell, git & github, and the Python programming language.
    
  4. Error because you have changed notes.txt without committing the changes
    

Solution

The answer is 2.

The command git add notes.txt places the current version of notes.txt into the staging area. The changes to the file from the second echo command are only applied to the working copy, not the version in the staging area.

So, when git commit -m "Create workshop notes" is executed, the version of notes.txt committed to the repository is the one from the staging area and has only one line.

At this time, the working copy still has the second line (and git status will show that the file is modified). However, git checkout HEAD notes.txt replaces the working copy with the most recently committed version of notes.txt.

So, cat notes.txt will output

 Here are my notes from the workshop.

Checking Understanding of git diff

Consider this command: git diff HEAD~3 notes.txt. What do you predict this command will do if you execute it? What happens when you do execute it? Why?

Solution

The diff will show the difference between the current version of notes.txt and the version that existed 3 commits ago.

Try another command, git diff [ID] notes.txt, where [ID] is replaced with the unique identifier for your most recent commit. What do you think will happen, and what does happen?

Solution

The diff will show the difference between the current version of notes.txt and the version that existed in the commit [ID].

Getting Rid of Staged Changes

git checkout can be used to restore a previous commit when unstaged changes have been made, but will it also work for changes that have been staged but not committed? Make a change to notes.txt, add that change, and use git checkout to see if you can remove your change.

Solution

git checkout notes.txt does not work for this purpose. Instead, use the restore command with the --staged flag: git restore --staged notes.txt. This moves the change out of the staging area but leaves it in your working copy, where you can then discard it with git checkout notes.txt.

Explore and Summarize Histories

Exploring history is an important part of Git, and often it is a challenge to find the right commit ID, especially if the commit is from several months ago.

Imagine the analysis project has more than 50 files. You would like to find a commit that modifies some specific text in notes.txt. When you type git log, a very long list appears. How can you narrow down the search?

Recall that the git diff command allows us to explore one specific file, e.g., git diff notes.txt. We can apply a similar idea here.

$ git log notes.txt

Unfortunately some of these commit messages are very ambiguous, e.g., update files. How can you search through these commits?

Both git diff and git log are very useful and they summarize a different part of the history for you. Is it possible to combine both? Let’s try the following:

$ git log --patch notes.txt

You should get a long list of output, and you should be able to see both commit messages and the difference between each commit.

Question: What does the following command do?

$ git log --patch HEAD~9 *.txt

Key Points

  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.


Python for Data Analysis

Overview

Teaching: 150 min
Exercises: 30 min
Questions
  • How can I summarize my data in Python?

  • How can Python help make my research more reproducible?

  • How can I combine two datasets from different sources?

  • How can data tidying facilitate answering analysis questions?

Objectives
  • To become familiar with the common methods of the Python pandas library.

  • To be able to use pandas to prepare data for analysis.

  • To be able to combine two different data sources using joins.

  • To be able to create plots and summary tables to answer analysis questions.

Contents

  1. Getting started
  2. An introduction to data analysis with pandas
  3. Cleaning up data
  4. Joining data frames
  5. Analyzing combined data
  6. Finishing with Git and GitHub
  7. Bonus exercises

Getting Started

Yesterday we spent a lot of time making plots in Python using the seaborn library. Visualizing data using plots is a very powerful skill in Python, but what if we would like to work with only a subset of our data? Or clean up messy data, calculate summary statistics, create a new variable, or join two datasets together? There are several different methods for doing this in Python, and we will touch on a few today using the fast and powerful pandas library.

Reading in the data

We will start by reading in the complete gapminder dataset that we used yesterday into our fresh new Jupyter notebook. Let’s type the code into a cell: gapminder = pd.read_csv("./data/gapminder_data.csv")

Exercise

If we run the cell now, we’ll see an error message saying that “name ‘pd’ is not defined”. Hint: Libraries…

Solution

What this means is that Python did not recognize the pd part of the code and thus cannot find the read_csv() function we are trying to call. This usually happens when we try to use a function from a library that we have not yet imported. It is a very common error message that you will probably see again when using Python. Remember that you need to import any libraries you want to use each time you start a new notebook. The read_csv() function comes from the pandas library, so we will import pandas and run the code again.

Now that we know what’s wrong, we will use the read_csv() function from the pandas library. Import the pandas library (along with another common library, numpy) and read in the gapminder dataset using the code below.

import numpy as np
import pandas as pd

gapminder = pd.read_csv("./data/gapminder_data.csv")
gapminder     # this line is just to show the data in the Jupyter notebook output
          country  year         pop continent  lifeExp   gdpPercap
0     Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1     Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2     Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3     Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4     Afghanistan  1972  13079460.0      Asia   36.088  739.981106
...           ...   ...         ...       ...      ...         ...
1699     Zimbabwe  1987   9216418.0    Africa   62.351  706.157306
1700     Zimbabwe  1992  10704340.0    Africa   60.377  693.420786
1701     Zimbabwe  1997  11404948.0    Africa   46.809  792.449960
1702     Zimbabwe  2002  11926563.0    Africa   39.989  672.038623
1703     Zimbabwe  2007  12311143.0    Africa   43.487  469.709298

[1704 rows x 6 columns]

The output above gives us an overview of the data with its first and last few rows, the names of the columns, and the numbers of rows and columns.

If we want more information, we can apply the info() method to a data frame to print some basic information about it. In Python we use the dot notation to apply a method to an object.

Note: When applying a method, we always need to follow the method name by a pair of parentheses, even if we are not passing any arguments to the method.

gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   year       1704 non-null   int64  
 2   pop        1704 non-null   float64
 3   continent  1704 non-null   object 
 4   lifeExp    1704 non-null   float64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB

Sometimes (especially when our data has many rows) we just want to take a look at the first few rows of the data. We can apply the head() method to select the first few rows of a data frame.

gapminder.head()
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106

By default, the head() method selects the first 5 rows of the data frame. You can change the number of rows by passing a number as an argument to the method. For example, we can use the code below to select the first 3 rows.

gapminder.head(3)
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710

Similarly, we can apply the tail() method to select the last few rows of a data frame.

gapminder.tail()
       country  year         pop continent  lifeExp   gdpPercap
1699  Zimbabwe  1987   9216418.0    Africa   62.351  706.157306
1700  Zimbabwe  1992  10704340.0    Africa   60.377  693.420786
1701  Zimbabwe  1997  11404948.0    Africa   46.809  792.449960
1702  Zimbabwe  2002  11926563.0    Africa   39.989  672.038623
1703  Zimbabwe  2007  12311143.0    Africa   43.487  469.709298

Now we have the tools necessary to work through this lesson.

An introduction to data analysis with pandas

Get stats fast with describe()

Back to top

Pandas has a handy method describe() that will generate the summary statistics of the data.

gapminder.describe()
             year           pop      lifeExp      gdpPercap
count  1704.00000  1.704000e+03  1704.000000    1704.000000
mean   1979.50000  2.960121e+07    59.474439    7215.327081
std      17.26533  1.061579e+08    12.917107    9857.454543
min    1952.00000  6.001100e+04    23.599000     241.165876
25%    1965.75000  2.793664e+06    48.198000    1202.060309
50%    1979.50000  7.023596e+06    60.712500    3531.846988
75%    1993.25000  1.958522e+07    70.845500    9325.462346
max    2007.00000  1.318683e+09    82.603000  113523.132900

The output above shows the summary (or descriptive) statistics for the four numerical columns in our data.

If we are interested in specific columns with specific statistics, we can also apply the agg() method to aggregate a column based on some aggregation functions.

Let’s say we would like to know the mean life expectancy in the dataset.

(
    gapminder
    .agg({'lifeExp' : 'mean'})
)
lifeExp    59.474439
dtype: float64

Other aggregation functions for common descriptive statistics include median, min, max, std (for standard deviation), and var (for variance).
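We can also ask agg() for several of these statistics at once by passing a list of function names. A small sketch using the same data:

(
    gapminder
    .agg({'lifeExp' : ['mean', 'median', 'std']})
)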

Narrow down rows with query()

Back to top

Let’s take a look at the value we just calculated, which tells us the mean life expectancy for all rows in the data was 59.47. That seems a bit low, doesn’t it? What’s going on?

Well, remember the dataset contains rows from many different years and many different countries. It’s likely that life expectancy has increased over time, so it may not make sense to average over all the years at once.

Use the max() method to find the most recent year in the data set.

Practice getting descriptive statistics

Find the most recent year in the dataset.

Solution:

(
    gapminder['year']
    .max()
)
2007

So we see that the most recent year in the dataset is 2007. Let’s calculate the life expectancy for all countries for only that year. To do that, we will apply the query() method to only use the rows for that year before calculating the mean life expectancy.

(
    gapminder
    .query("year == 2007")
    .agg({'lifeExp' : 'mean'})
)
lifeExp    67.007423
dtype: float64

Querying the dataset

What is the mean GDP per capita for the first year in the dataset? Hint: The data frame has a column called “gdpPercap”.

Solution

Identify the earliest year in our dataset by applying the agg method.

(
    gapminder
    .agg({'year' : 'min'})
)
year    1952
dtype: int64

We see here that the first year in the dataset is 1952. Query the data to only include year 1952, and determine the mean GDP per capita.

(
    gapminder
    .query("year == 1952")
    .agg({'gdpPercap' : 'mean'})
)
gdpPercap    3725.276046
dtype: float64

By chaining the two methods query() and agg() we were able to calculate the mean GDP per capita in the year 1952.

Notice how the method chaining allows us to combine two simple steps into a more complicated data extraction: we took the data, kept only the rows for 1952, and then took the mean of the gdpPercap column. The string argument we pass to query() needs to be an expression that evaluates to True or False for each row. We use == (double equals) when testing whether two values are equal, and we use = (single equals) when assigning values. Try changing the code above to use query("year = 1952") and see what happens.

Other common Python comparison operators

We can also use the operator == to evaluate if two strings are the same. For example, the code below returns all the rows from the United States.

(
    gapminder
    .query("country == 'United States'")
)
            country  year          pop continent  lifeExp    gdpPercap
1608  United States  1952  157553000.0  Americas   68.440  13990.48208
1609  United States  1957  171984000.0  Americas   69.490  14847.12712
1610  United States  1962  186538000.0  Americas   70.210  16173.14586
1611  United States  1967  198712000.0  Americas   70.760  19530.36557
1612  United States  1972  209896000.0  Americas   71.340  21806.03594
1613  United States  1977  220239000.0  Americas   73.380  24072.63213
1614  United States  1982  232187835.0  Americas   74.650  25009.55914
1615  United States  1987  242803533.0  Americas   75.020  29884.35041
1616  United States  1992  256894189.0  Americas   76.090  32003.93224
1617  United States  1997  272911760.0  Americas   76.810  35767.43303
1618  United States  2002  287675526.0  Americas   77.310  39097.09955
1619  United States  2007  301139947.0  Americas   78.242  42951.65309

Note: In a query() expression, any string values (e.g., United States in the code above) need to be wrapped with quotation marks.

Note: In a query() expression, any column name that does not include special characters (e.g., a white space) does not need to be wrapped in anything. However, if a column name does include special characters, the name needs to be wrapped in a pair of backticks `` (the key above the Tab key on your keyboard).
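As a hypothetical sketch, suppose our data had a column named life exp (with a space; here we create one with rename() just for illustration). The query would then need backticks:

(
    gapminder
    .rename(columns={'lifeExp' : 'life exp'})    # hypothetical column name containing a space
    .query("`life exp` > 80")
)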

Oftentimes we may wish to query the data based on more than a single criterion. In a query() expression we can combine multiple criteria with the Python logical operators “and” and “or”. For example, the code below returns all the rows that are from the United States and after 2000.

(
    gapminder
    .query("country == 'United States' and year > 2000")
)
            country  year          pop continent  lifeExp    gdpPercap
1618  United States  2002  287675526.0  Americas   77.310  39097.09955
1619  United States  2007  301139947.0  Americas   78.242  42951.65309

Note that if the criteria are all combined with “and”, we can instead chain multiple query() methods. The code below generates the same results as above.

(
    gapminder
    .query("country == 'United States'")
    .query("year > 2000")
)

Sometimes we may wish to query the data based on whether a value is from a list or not. In a query() expression we can use the Python membership operator in to achieve that. For example, the code below returns all the rows from a list of countries (the United States and Canada).

(
    gapminder
    .query("country in ['United States', 'Canada']")
)
            country  year          pop continent  lifeExp    gdpPercap
240          Canada  1952   14785584.0  Americas   68.750  11367.16112
241          Canada  1957   17010154.0  Americas   69.960  12489.95006
242          Canada  1962   18985849.0  Americas   71.300  13462.48555
243          Canada  1967   20819767.0  Americas   72.130  16076.58803
244          Canada  1972   22284500.0  Americas   72.880  18970.57086
245          Canada  1977   23796400.0  Americas   74.210  22090.88306
246          Canada  1982   25201900.0  Americas   75.760  22898.79214
247          Canada  1987   26549700.0  Americas   76.860  26626.51503
248          Canada  1992   28523502.0  Americas   77.950  26342.88426
249          Canada  1997   30305843.0  Americas   78.610  28954.92589
250          Canada  2002   31902268.0  Americas   79.770  33328.96507
251          Canada  2007   33390141.0  Americas   80.653  36319.23501
1608  United States  1952  157553000.0  Americas   68.440  13990.48208
1609  United States  1957  171984000.0  Americas   69.490  14847.12712
1610  United States  1962  186538000.0  Americas   70.210  16173.14586
1611  United States  1967  198712000.0  Americas   70.760  19530.36557
1612  United States  1972  209896000.0  Americas   71.340  21806.03594
1613  United States  1977  220239000.0  Americas   73.380  24072.63213
1614  United States  1982  232187835.0  Americas   74.650  25009.55914
1615  United States  1987  242803533.0  Americas   75.020  29884.35041
1616  United States  1992  256894189.0  Americas   76.090  32003.93224
1617  United States  1997  272911760.0  Americas   76.810  35767.43303
1618  United States  2002  287675526.0  Americas   77.310  39097.09955
1619  United States  2007  301139947.0  Americas   78.242  42951.65309

In a query() expression we can refer to variables in the environment by prefixing them with an ‘@’ character. For example, the code below generates the same results as above.

country_list = ['United States', 'Canada']

(
    gapminder
    .query("country in @country_list")
)

Lastly, we can use the not in operator to evaluate if a value is not in a list. For example, the code below returns all the rows for 2007 in the Americas except for the United States and Canada.

(
    gapminder
    .query("year == 2007")
    .query("continent == 'Americas'")
    .query("country not in ['United States', 'Canada']")
)
                  country  year          pop continent  lifeExp     gdpPercap
59              Argentina  2007   40301927.0  Americas   75.320  12779.379640
143               Bolivia  2007    9119152.0  Americas   65.554   3822.137084
179                Brazil  2007  190010647.0  Americas   72.390   9065.800825
287                 Chile  2007   16284741.0  Americas   78.553  13171.638850
311              Colombia  2007   44227550.0  Americas   72.889   7006.580419
359            Costa Rica  2007    4133884.0  Americas   78.782   9645.061420
395                  Cuba  2007   11416987.0  Americas   78.273   8948.102923
443    Dominican Republic  2007    9319622.0  Americas   72.235   6025.374752
455               Ecuador  2007   13755680.0  Americas   74.994   6873.262326
479           El Salvador  2007    6939688.0  Americas   71.878   5728.353514
611             Guatemala  2007   12572928.0  Americas   70.259   5186.050003
647                 Haiti  2007    8502814.0  Americas   60.916   1201.637154
659              Honduras  2007    7483763.0  Americas   70.198   3548.330846
791               Jamaica  2007    2780132.0  Americas   72.567   7320.880262
995                Mexico  2007  108700891.0  Americas   76.195  11977.574960
1115            Nicaragua  2007    5675356.0  Americas   72.899   2749.320965
1187               Panama  2007    3242173.0  Americas   75.537   9809.185636
1199             Paraguay  2007    6667147.0  Americas   71.752   4172.838464
1211                 Peru  2007   28674757.0  Americas   71.421   7408.905561
1259          Puerto Rico  2007    3942491.0  Americas   78.746  19328.709010
1559  Trinidad and Tobago  2007    1056608.0  Americas   69.819  18008.509240
1631              Uruguay  2007    3447496.0  Americas   76.384  10611.462990
1643            Venezuela  2007   26084662.0  Americas   73.747  11415.805690

Grouping rows using groupby()

Back to top

We see that the life expectancy in 2007 is much larger than the value we got using all of the rows. It seems life expectancy is increasing, which is good news. But now we might be interested in calculating the mean for each year. Rather than running a separate query() for each year, we can instead use the groupby() method. This method tells pandas to treat the rows in logical groups, so rather than aggregating over all the rows, we will get one summary value for each group. “Group by” is often referred to as split-apply-combine.

(
    gapminder
    .groupby('year')
)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x157f8d2b0>

If we just apply the groupby() method to our data frame, we only get a data object called “DataFrameGroupBy”. This is because the method only groups the data based on our specification, in this case, by year. We have not yet specified what aggregation functions we wish to apply to each of the groups.

We can take a closer look at the data object returned by groupby() through its indices property.

(
    gapminder
    .groupby('year')
    .indices
)
{1952: array([   0,   12,   24,   36,   48,   60,   72,   84,   96,  108,  120,
         132,  144,  156,  168,  180,  192,  204,  216,  228,  240,  252,
         264,  276,  288,  300,  312,  324,  336,  348,  360,  372,  384,
         396,  408,  420,  432,  444,  456,  468,  480,  492,  504,  516,
         528,  540,  552,  564,  576,  588,  600,  612,  624,  636,  648,
         660,  672,  684,  696,  708,  720,  732,  744,  756,  768,  780,
         792,  804,  816,  828,  840,  852,  864,  876,  888,  900,  912,
         924,  936,  948,  960,  972,  984,  996, 1008, 1020, 1032, 1044,
        1056, 1068, 1080, 1092, 1104, 1116, 1128, 1140, 1152, 1164, 1176,
        1188, 1200, 1212, 1224, 1236, 1248, 1260, 1272, 1284, 1296, 1308,
        1320, 1332, 1344, 1356, 1368, 1380, 1392, 1404, 1416, 1428, 1440,
        1452, 1464, 1476, 1488, 1500, 1512, 1524, 1536, 1548, 1560, 1572,
        1584, 1596, 1608, 1620, 1632, 1644, 1656, 1668, 1680, 1692]),
...
2007: array([  11,   23,   35,   47,   59,   71,   83,   95,  107,  119,  131,
         143,  155,  167,  179,  191,  203,  215,  227,  239,  251,  263,
         275,  287,  299,  311,  323,  335,  347,  359,  371,  383,  395,
         407,  419,  431,  443,  455,  467,  479,  491,  503,  515,  527,
         539,  551,  563,  575,  587,  599,  611,  623,  635,  647,  659,
         671,  683,  695,  707,  719,  731,  743,  755,  767,  779,  791,
         803,  815,  827,  839,  851,  863,  875,  887,  899,  911,  923,
         935,  947,  959,  971,  983,  995, 1007, 1019, 1031, 1043, 1055,
        1067, 1079, 1091, 1103, 1115, 1127, 1139, 1151, 1163, 1175, 1187,
        1199, 1211, 1223, 1235, 1247, 1259, 1271, 1283, 1295, 1307, 1319,
        1331, 1343, 1355, 1367, 1379, 1391, 1403, 1415, 1427, 1439, 1451,
        1463, 1475, 1487, 1499, 1511, 1523, 1535, 1547, 1559, 1571, 1583,
        1595, 1607, 1619, 1631, 1643, 1655, 1667, 1679, 1691, 1703])}

It shows the indices of each group (i.e., each year in this case).

We can double-check the indices of the first group by manually querying the data for year 1952. The first column in the output shows the indices.

(
    gapminder
    .query("year == 1952")
)
                 country  year         pop continent  lifeExp    gdpPercap
0            Afghanistan  1952   8425333.0      Asia   28.801   779.445314
12               Albania  1952   1282697.0    Europe   55.230  1601.056136
24               Algeria  1952   9279525.0    Africa   43.077  2449.008185
36                Angola  1952   4232095.0    Africa   30.015  3520.610273
48             Argentina  1952  17876956.0  Americas   62.485  5911.315053
...                  ...   ...         ...       ...      ...          ...
1644             Vietnam  1952  26246839.0      Asia   40.412   605.066492
1656  West Bank and Gaza  1952   1030585.0      Asia   43.160  1515.592329
1668          Yemen Rep.  1952   4963829.0      Asia   32.548   781.717576
1680              Zambia  1952   2672000.0    Africa   42.038  1147.388831
1692            Zimbabwe  1952   3080907.0    Africa   48.451   406.884115

In practice, we will just trust that groupby() does its job, and apply the aggregation functions of interest by calling agg() after the groupby().

(
    gapminder
    .groupby('year')
    .agg({'lifeExp' : 'mean'})
)
        lifeExp
year           
1952  49.057620
1957  51.507401
1962  53.609249
1967  55.678290
1972  57.647386
1977  59.570157
1982  61.533197
1987  63.212613
1992  64.160338
1997  65.014676
2002  65.694923
2007  67.007423

The groupby() method expects you to pass in the name of a column (or a list of columns) in your data.
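For example, passing a list of column names gives one group for every combination of their values. A sketch that computes the mean life expectancy for each continent in each year:

(
    gapminder
    .groupby(['continent', 'year'])
    .agg({'lifeExp' : 'mean'})
)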

Grouping the data

Try calculating the mean life expectancy by continent.

Solution

(
    gapminder
    .groupby('continent')
    .agg({'lifeExp' : 'mean'})
)
             lifeExp
continent           
Africa     48.865330
Americas   64.658737
Asia       60.064903
Europe     71.903686
Oceania    74.326208

By chaining the two methods groupby() and agg() we are able to calculate the mean life expectancy by continent.

Sometimes we may wish to apply more than one aggregation function. For example, we may want to know the mean and minimum life expectancy by continent. To do so, we can pass agg() a list of aggregation functions.

(
    gapminder
    .groupby('continent')
    .agg({'lifeExp' : ['mean', 'min']})
)
             lifeExp        
                mean     min
continent                   
Africa     48.865330  23.599
Americas   64.658737  37.579
Asia       60.064903  28.801
Europe     71.903686  43.585
Oceania    74.326208  69.120

Sort data with sort_values()

The sort_values() method allows us to sort our data by some value. Let’s use the full gapminder data. We will take the mean value for each continent in 2007 and then sort the result so the continents with the longest life expectancy are on top. Which continent would you guess has the highest life expectancy, before running the code?

(
    gapminder
    .query("year == 2007")
    .groupby('continent')
    .agg({'lifeExp' : 'mean'})
    .sort_values('lifeExp', ascending=False)
)
             lifeExp
continent           
Oceania    80.719500
Europe     77.648600
Americas   73.608120
Asia       70.728485
Africa     54.806038

Notice we passed the argument ascending=False to the sort_values() method to sort the values in a descending order so the largest values are on top. The default is to put the smallest values on top.
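You can also sort by more than one column by passing a list of names. The sketch below sorts the 2007 rows alphabetically by continent, and then by life expectancy within each continent:

(
    gapminder
    .query("year == 2007")
    .sort_values(['continent', 'lifeExp'])
)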

Make new variables with assign()

Back to top

Sometimes we want to create a new column in our data. We can use the pandas assign() method to assign new columns to a data frame.

We have a column for the population and the GDP per capita. If we wanted to get the total GDP, we could multiply the per capita GDP values by the total population. Below is what the code would look like:

Here we use a lambda function, which lets us refer to the data frame being processed (df) inside assign().

(
    gapminder
    .assign(gdp=lambda df: df['pop'] * df['gdpPercap'])
)
          country  year         pop continent  lifeExp   gdpPercap           gdp
0     Afghanistan  1952   8425333.0      Asia   28.801  779.445314  6.567086e+09
1     Afghanistan  1957   9240934.0      Asia   30.332  820.853030  7.585449e+09
2     Afghanistan  1962  10267083.0      Asia   31.997  853.100710  8.758856e+09
3     Afghanistan  1967  11537966.0      Asia   34.020  836.197138  9.648014e+09
4     Afghanistan  1972  13079460.0      Asia   36.088  739.981106  9.678553e+09
...           ...   ...         ...       ...      ...         ...           ...
1699     Zimbabwe  1987   9216418.0    Africa   62.351  706.157306  6.508241e+09
1700     Zimbabwe  1992  10704340.0    Africa   60.377  693.420786  7.422612e+09
1701     Zimbabwe  1997  11404948.0    Africa   46.809  792.449960  9.037851e+09
1702     Zimbabwe  2002  11926563.0    Africa   39.989  672.038623  8.015111e+09
1703     Zimbabwe  2007  12311143.0    Africa   43.487  469.709298  5.782658e+09

[1704 rows x 7 columns]

This will add a new column called “gdp” to our data. Inside assign() we provide the new column’s name in front of an equals sign, and we use the existing column names as if they were regular values that we want to perform mathematical operations on.
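If the data frame is already stored in a variable, the same column can be created without a lambda by referring to that variable directly. A sketch equivalent to the code above:

gapminder.assign(gdp=gapminder['pop'] * gapminder['gdpPercap'])

The lambda form is handy inside longer method chains, because it refers to the intermediate data frame being processed rather than the original variable.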

Assigning multiple columns

We can also assign multiple columns by separating them with a comma inside assign(). Try making a new column for this data frame called popInMillions that is the population in millions.

Solution:

(
    gapminder
    .assign(gdp=lambda df: df['pop'] * df['gdpPercap'],
            popInMillions=lambda df: df['pop'] / 1_000_000)
)
          country  year         pop continent  lifeExp   gdpPercap           gdp  popInMillions
0     Afghanistan  1952   8425333.0      Asia   28.801  779.445314  6.567086e+09       8.425333
1     Afghanistan  1957   9240934.0      Asia   30.332  820.853030  7.585449e+09       9.240934
2     Afghanistan  1962  10267083.0      Asia   31.997  853.100710  8.758856e+09      10.267083
3     Afghanistan  1967  11537966.0      Asia   34.020  836.197138  9.648014e+09      11.537966
4     Afghanistan  1972  13079460.0      Asia   36.088  739.981106  9.678553e+09      13.079460
...           ...   ...         ...       ...      ...         ...           ...            ...
1699     Zimbabwe  1987   9216418.0    Africa   62.351  706.157306  6.508241e+09       9.216418
1700     Zimbabwe  1992  10704340.0    Africa   60.377  693.420786  7.422612e+09      10.704340
1701     Zimbabwe  1997  11404948.0    Africa   46.809  792.449960  9.037851e+09      11.404948
1702     Zimbabwe  2002  11926563.0    Africa   39.989  672.038623  8.015111e+09      11.926563
1703     Zimbabwe  2007  12311143.0    Africa   43.487  469.709298  5.782658e+09      12.311143

[1704 rows x 8 columns]

Subset columns

Back to top

Sometimes we may want to select a subset of columns from our data based on the column names. If we want to select a single column, we can use the square bracket [] notation. For example, if we want to select the population column from our data, we can do:

gapminder['pop']
0        8425333.0
1        9240934.0
2       10267083.0
3       11537966.0
4       13079460.0
           ...    
1699     9216418.0
1700    10704340.0
1701    11404948.0
1702    11926563.0
1703    12311143.0
Name: pop, Length: 1704, dtype: float64

If we want to select multiple columns, we can pass a list of column names into (another) pair of square brackets.

gapminder[['pop', 'year']]
             pop  year
0      8425333.0  1952
1      9240934.0  1957
2     10267083.0  1962
3     11537966.0  1967
4     13079460.0  1972
...          ...   ...
1699   9216418.0  1987
1700  10704340.0  1992
1701  11404948.0  1997
1702  11926563.0  2002
1703  12311143.0  2007

[1704 rows x 2 columns]

Note: There are two nested pairs of square brackets in the code above. The outer square brackets are the notation for selecting columns from a data frame by name. The inner square brackets define a Python list that contains the column names. Try removing one pair of brackets and see what happens.

Another way to select columns is to use the filter() method. The code below gives the same output as the above.

(
    gapminder
    .filter(['pop', 'year'])
)

We can also apply the drop() method to drop/remove particular columns. For example, if we want everything but the continent and population columns, we can do:

(
    gapminder
    .drop(columns=['continent', 'pop'])
)
          country  year  lifeExp   gdpPercap
0     Afghanistan  1952   28.801  779.445314
1     Afghanistan  1957   30.332  820.853030
2     Afghanistan  1962   31.997  853.100710
3     Afghanistan  1967   34.020  836.197138
4     Afghanistan  1972   36.088  739.981106
...           ...   ...      ...         ...
1699     Zimbabwe  1987   62.351  706.157306
1700     Zimbabwe  1992   60.377  693.420786
1701     Zimbabwe  1997   46.809  792.449960
1702     Zimbabwe  2002   39.989  672.038623
1703     Zimbabwe  2007   43.487  469.709298

[1704 rows x 4 columns]

Selecting columns

Create a data frame with only the country, continent, year, and lifeExp columns.

Solution:

There are multiple ways to do this exercise. Here are two different possibilities.

(
    gapminder
    .filter(['country', 'continent', 'year', 'lifeExp'])
)
          country continent  year  lifeExp
0     Afghanistan      Asia  1952   28.801
1     Afghanistan      Asia  1957   30.332
2     Afghanistan      Asia  1962   31.997
3     Afghanistan      Asia  1967   34.020
4     Afghanistan      Asia  1972   36.088
...           ...       ...   ...      ...
1699     Zimbabwe    Africa  1987   62.351
1700     Zimbabwe    Africa  1992   60.377
1701     Zimbabwe    Africa  1997   46.809
1702     Zimbabwe    Africa  2002   39.989
1703     Zimbabwe    Africa  2007   43.487

[1704 rows x 4 columns]
(
    gapminder
    .drop(columns=['pop', 'gdpPercap'])
)
          country  year continent  lifeExp
0     Afghanistan  1952      Asia   28.801
1     Afghanistan  1957      Asia   30.332
2     Afghanistan  1962      Asia   31.997
3     Afghanistan  1967      Asia   34.020
4     Afghanistan  1972      Asia   36.088
...           ...   ...       ...      ...
1699     Zimbabwe  1987    Africa   62.351
1700     Zimbabwe  1992    Africa   60.377
1701     Zimbabwe  1997    Africa   46.809
1702     Zimbabwe  2002    Africa   39.989
1703     Zimbabwe  2007    Africa   43.487

[1704 rows x 4 columns]

Bonus: Additional features of the filter() method

The filter() method can also select columns whose names match a pattern, which comes in handy if you are working with a dataset that has a lot of columns. For example, let’s say we want to select all the columns whose names contain the letter “e”. You can do that with:

(
    gapminder
    .filter(like='e')
)
      year continent  lifeExp   gdpPercap
0     1952      Asia   28.801  779.445314
1     1957      Asia   30.332  820.853030
2     1962      Asia   31.997  853.100710
3     1967      Asia   34.020  836.197138
4     1972      Asia   36.088  739.981106
...    ...       ...      ...         ...
1699  1987    Africa   62.351  706.157306
1700  1992    Africa   60.377  693.420786
1701  1997    Africa   46.809  792.449960
1702  2002    Africa   39.989  672.038623
1703  2007    Africa   43.487  469.709298

[1704 rows x 4 columns]

This returns the four columns we are interested in.

Applying filter() with regular expression

For those of you who know regular expressions (pattern matching in text), the filter() method also supports them. For example, let’s say we want to select all the columns that start with the letter “c”. We can do that with:

Solution

(
    gapminder
    .filter(regex='^c')
)
          country continent
0     Afghanistan      Asia
1     Afghanistan      Asia
2     Afghanistan      Asia
3     Afghanistan      Asia
4     Afghanistan      Asia
...           ...       ...
1699     Zimbabwe    Africa
1700     Zimbabwe    Africa
1701     Zimbabwe    Africa
1702     Zimbabwe    Africa
1703     Zimbabwe    Africa

[1704 rows x 2 columns]

Similarly, if we want to select all the columns that end with the letter “p”, we can do that with:

Solution

(
    gapminder
    .filter(regex='p$')
)
             pop  lifeExp   gdpPercap
0      8425333.0   28.801  779.445314
1      9240934.0   30.332  820.853030
2     10267083.0   31.997  853.100710
3     11537966.0   34.020  836.197138
4     13079460.0   36.088  739.981106
...          ...      ...         ...
1699   9216418.0   62.351  706.157306
1700  10704340.0   60.377  693.420786
1701  11404948.0   46.809  792.449960
1702  11926563.0   39.989  672.038623
1703  12311143.0   43.487  469.709298

[1704 rows x 3 columns]

Changing the shape of the data

Back to top

Data comes in many shapes and sizes, and one way we classify data is as either “wide” or “long.” Data that is “long” has one row per observation. The gapminder data is in a long format: we have one row for each country for each year, and each different measurement for that country is in a different column. We might describe this data as “tidy” because it is easy to work with in pandas and seaborn. As tidy as it may be, sometimes we may want our data in a “wide” format. Typically, in “wide” format each row represents a group of observations and each value is placed in a different column rather than a different row. For example, maybe we want only one row per country and want to spread the life expectancy values into different columns (one for each year).

The pandas methods pivot() and melt() make it easy to switch between the two formats.

(
    gapminder
    .filter(['country', 'continent', 'year', 'lifeExp'])
    .pivot(columns='year', 
           index=['country', 'continent'], 
           values='lifeExp')
)
year                            1952    1957    1962    1967    1972    1977    1982    1987    1992    1997    2002    2007
country            continent                                                                                                
Afghanistan        Asia       28.801  30.332  31.997  34.020  36.088  38.438  39.854  40.822  41.674  41.763  42.129  43.828
Albania            Europe     55.230  59.280  64.820  66.220  67.690  68.930  70.420  72.000  71.581  72.950  75.651  76.423
Algeria            Africa     43.077  45.685  48.303  51.407  54.518  58.014  61.368  65.799  67.744  69.152  70.994  72.301
Angola             Africa     30.015  31.999  34.000  35.985  37.928  39.483  39.942  39.906  40.647  40.963  41.003  42.731
Argentina          Americas   62.485  64.399  65.142  65.634  67.065  68.481  69.942  70.774  71.868  73.275  74.340  75.320
...                              ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
Vietnam            Asia       40.412  42.887  45.363  47.838  50.254  55.764  58.816  62.820  67.662  70.672  73.017  74.249
West Bank and Gaza Asia       43.160  45.671  48.127  51.631  56.532  60.765  64.406  67.046  69.718  71.096  72.370  73.422
Yemen Rep.         Asia       32.548  33.970  35.180  36.984  39.848  44.175  49.113  52.922  55.599  58.020  60.308  62.698
Zambia             Africa     42.038  44.077  46.023  47.768  50.107  51.386  51.821  50.821  46.100  40.238  39.193  42.384
Zimbabwe           Africa     48.451  50.469  52.358  53.995  55.635  57.674  60.363  62.351  60.377  46.809  39.989  43.487

[142 rows x 12 columns]

Notice here that we tell pivot() to take the names for our new columns from the year variable (the columns argument) and the values to populate those columns from the lifeExp variable (the values argument). The resulting table has one new column per year, with the country and continent combinations dictating the rows.

The pandas melt() method allows us to “melt” a table from wide format to long format. The code below converts our wide table back to the long format.

(
    gapminder
    .filter(['country', 'continent', 'year', 'lifeExp'])
    .pivot(columns='year', 
           index=['country', 'continent'], 
           values='lifeExp')
    .reset_index()
    .melt(id_vars=['country', 'continent'],
          value_name='lifeExp')
)
                 country continent  year  lifeExp
0            Afghanistan      Asia  1952   28.801
1                Albania    Europe  1952   55.230
2                Algeria    Africa  1952   43.077
3                 Angola    Africa  1952   30.015
4              Argentina  Americas  1952   62.485
...                  ...       ...   ...      ...
1699             Vietnam      Asia  2007   74.249
1700  West Bank and Gaza      Asia  2007   73.422
1701          Yemen Rep.      Asia  2007   62.698
1702              Zambia    Africa  2007   42.384
1703            Zimbabwe    Africa  2007   43.487

[1704 rows x 4 columns]

Before we move on to more data cleaning, let’s create the final gapminder data frame we will be working with for the rest of the lesson!

Final Americas 2007 gapminder dataset

  • Read in the gapminder_data.csv file.
  • Filter the data to include only the year 2007 and the continent “Americas”.
  • Drop the year and continent columns from the data frame.
  • Save the new data frame into a variable called gapminder_2007.

Solution:

gapminder_2007 = (
    gapminder
    .query("year == 2007 and continent == 'Americas'")
    .drop(columns=['year', 'continent'])
)
gapminder_2007
                  country          pop  lifeExp     gdpPercap
59              Argentina   40301927.0   75.320  12779.379640
143               Bolivia    9119152.0   65.554   3822.137084
179                Brazil  190010647.0   72.390   9065.800825
251                Canada   33390141.0   80.653  36319.235010
287                 Chile   16284741.0   78.553  13171.638850
311              Colombia   44227550.0   72.889   7006.580419
359            Costa Rica    4133884.0   78.782   9645.061420
395                  Cuba   11416987.0   78.273   8948.102923
443    Dominican Republic    9319622.0   72.235   6025.374752
455               Ecuador   13755680.0   74.994   6873.262326
479           El Salvador    6939688.0   71.878   5728.353514
611             Guatemala   12572928.0   70.259   5186.050003
647                 Haiti    8502814.0   60.916   1201.637154
659              Honduras    7483763.0   70.198   3548.330846
791               Jamaica    2780132.0   72.567   7320.880262
995                Mexico  108700891.0   76.195  11977.574960
1115            Nicaragua    5675356.0   72.899   2749.320965
1187               Panama    3242173.0   75.537   9809.185636
1199             Paraguay    6667147.0   71.752   4172.838464
1211                 Peru   28674757.0   71.421   7408.905561
1259          Puerto Rico    3942491.0   78.746  19328.709010
1559  Trinidad and Tobago    1056608.0   69.819  18008.509240
1619        United States  301139947.0   78.242  42951.653090
1631              Uruguay    3447496.0   76.384  10611.462990
1643            Venezuela   26084662.0   73.747  11415.805690

Awesome! This is the data frame we will be using later on in this lesson.

Reviewing Git and GitHub

Now that we have our gapminder data prepared, let’s use what we learned about git and GitHub in the previous lesson to add, commit, and push our changes.

Open Terminal/Git Bash if you do not have it open already. First, we’ll need to navigate to our un-report directory.

Let’s start by printing our current working directory and listing the items in the directory, to see where we are.

pwd
ls

Now, we’ll navigate to the un-report directory.

cd ~/Desktop/un-report  
ls

To start, let’s pull to make sure our local repository is up to date.

git status
git pull

Now let’s add and commit our changes.

git status
git add "gapminder_data_analysis.ipynb"
git status
git commit -m "Create data analysis file"

Finally, let’s check our commits and then push the commits to GitHub.

git status
git log --oneline
git push 
git status

Cleaning up data

Back to top

Researchers often pull data from several sources, and the process of making datasets compatible with one another and preparing them for analysis can be a large undertaking. Luckily, pandas has many functions that help us do exactly that. We’ve been working with the gapminder dataset, which contains population and GDP data by year. In this section, we’ll practice cleaning and preparing a second dataset containing CO2 emissions data by country and year, sourced from the UN.

It’s always good to go into data cleaning with a clear goal in mind. Here, we’d like to prepare the CO2 UN data to be compatible with our gapminder data so we can directly compare GDP to CO2 emissions. To make this work, we’d like a data frame that contains a column with the country name, and columns for different ways of measuring CO2 emissions. We will also want the data to be collected as close to 2007 as possible (the last year we have data for in gapminder). Let’s start with reading the data using pandas’s read_csv() function.

pd.read_csv("./data/co2-un-data.csv")
                      T24 CO2 emission estimates Unnamed: 2                                         Unnamed: 3 Unnamed: 4 Unnamed: 5  \
0     Region/Country/Area                    NaN       Year                                             Series      Value  Footnotes   
1                       8                Albania       1975  Emissions (thousand metric tons of carbon diox...  4338.3340        NaN   
2                       8                Albania       1985  Emissions (thousand metric tons of carbon diox...  6929.9260        NaN   
3                       8                Albania       1995  Emissions (thousand metric tons of carbon diox...  1848.5490        NaN   
4                       8                Albania       2005  Emissions (thousand metric tons of carbon diox...  3825.1840        NaN   
...                   ...                    ...        ...                                                ...        ...        ...   
2128                  716               Zimbabwe       2005  Emissions per capita (metric tons of carbon di...     0.7940        NaN   
2129                  716               Zimbabwe       2010  Emissions per capita (metric tons of carbon di...     0.6720        NaN   
2130                  716               Zimbabwe       2015  Emissions per capita (metric tons of carbon di...     0.7490        NaN   
2131                  716               Zimbabwe       2016  Emissions per capita (metric tons of carbon di...     0.6420        NaN   
2132                  716               Zimbabwe       2017  Emissions per capita (metric tons of carbon di...     0.5880        NaN   

                                             Unnamed: 6  
0                                                Source  
1     International Energy Agency, IEA World Energy ...  
2     International Energy Agency, IEA World Energy ...  
3     International Energy Agency, IEA World Energy ...  
4     International Energy Agency, IEA World Energy ...  
...                                                 ...  
2128  International Energy Agency, IEA World Energy ...  
2129  International Energy Agency, IEA World Energy ...  
2130  International Energy Agency, IEA World Energy ...  
2131  International Energy Agency, IEA World Energy ...  
2132  International Energy Agency, IEA World Energy ...  

[2133 rows x 7 columns]

Looking at the table output above, we can see that there appear to be two rows at the top of the file that contain information about the table rather than data. The first is a header that tells us the table number and its name; ideally, we’d skip it. We can do this using the skiprows argument of read_csv(), giving it the number of rows to skip.

pd.read_csv("./data/co2-un-data.csv", skiprows=1)
      Region/Country/Area Unnamed: 1  Year                                             Series     Value Footnotes  \
0                       8    Albania  1975  Emissions (thousand metric tons of carbon diox...  4338.334       NaN   
1                       8    Albania  1985  Emissions (thousand metric tons of carbon diox...  6929.926       NaN   
2                       8    Albania  1995  Emissions (thousand metric tons of carbon diox...  1848.549       NaN   
3                       8    Albania  2005  Emissions (thousand metric tons of carbon diox...  3825.184       NaN   
4                       8    Albania  2010  Emissions (thousand metric tons of carbon diox...  3930.295       NaN   
...                   ...        ...   ...                                                ...       ...       ...   
2127                  716   Zimbabwe  2005  Emissions per capita (metric tons of carbon di...     0.794       NaN   
2128                  716   Zimbabwe  2010  Emissions per capita (metric tons of carbon di...     0.672       NaN   
2129                  716   Zimbabwe  2015  Emissions per capita (metric tons of carbon di...     0.749       NaN   
2130                  716   Zimbabwe  2016  Emissions per capita (metric tons of carbon di...     0.642       NaN   
2131                  716   Zimbabwe  2017  Emissions per capita (metric tons of carbon di...     0.588       NaN   

                                                 Source  
0     International Energy Agency, IEA World Energy ...  
1     International Energy Agency, IEA World Energy ...  
2     International Energy Agency, IEA World Energy ...  
3     International Energy Agency, IEA World Energy ...  
4     International Energy Agency, IEA World Energy ...  
...                                                 ...  
2127  International Energy Agency, IEA World Energy ...  
2128  International Energy Agency, IEA World Energy ...  
2129  International Energy Agency, IEA World Energy ...  
2130  International Energy Agency, IEA World Energy ...  
2131  International Energy Agency, IEA World Energy ...  

[2132 rows x 7 columns]

Now the output table looks better.

Another thing we can do is tell the read_csv() function what the column names should be with the names argument, which takes the column names we want as a Python list. If we do this, then we need to skip 2 rows, including the row with the original column headings. Let’s also save this data frame to co2_emissions_dirty so that we don’t have to re-read the file every time we clean it further.

co2_emissions_dirty = (
    pd.read_csv("./data/co2-un-data.csv", skiprows=2,
                names=['region', 'country', 'year', 'series', 'value', 'footnotes', 'source'])
)
co2_emissions_dirty
      region   country  year                                             series     value footnotes  \
0          8   Albania  1975  Emissions (thousand metric tons of carbon diox...  4338.334       NaN   
1          8   Albania  1985  Emissions (thousand metric tons of carbon diox...  6929.926       NaN   
2          8   Albania  1995  Emissions (thousand metric tons of carbon diox...  1848.549       NaN   
3          8   Albania  2005  Emissions (thousand metric tons of carbon diox...  3825.184       NaN   
4          8   Albania  2010  Emissions (thousand metric tons of carbon diox...  3930.295       NaN   
...      ...       ...   ...                                                ...       ...       ...   
2127     716  Zimbabwe  2005  Emissions per capita (metric tons of carbon di...     0.794       NaN   
2128     716  Zimbabwe  2010  Emissions per capita (metric tons of carbon di...     0.672       NaN   
2129     716  Zimbabwe  2015  Emissions per capita (metric tons of carbon di...     0.749       NaN   
2130     716  Zimbabwe  2016  Emissions per capita (metric tons of carbon di...     0.642       NaN   
2131     716  Zimbabwe  2017  Emissions per capita (metric tons of carbon di...     0.588       NaN   

                                                 source  
0     International Energy Agency, IEA World Energy ...  
1     International Energy Agency, IEA World Energy ...  
2     International Energy Agency, IEA World Energy ...  
3     International Energy Agency, IEA World Energy ...  
4     International Energy Agency, IEA World Energy ...  
...                                                 ...  
2127  International Energy Agency, IEA World Energy ...  
2128  International Energy Agency, IEA World Energy ...  
2129  International Energy Agency, IEA World Energy ...  
2130  International Energy Agency, IEA World Energy ...  
2131  International Energy Agency, IEA World Energy ...  

[2132 rows x 7 columns]

Bonus: Another way to deal with the column names

Many data analysts prefer their column names to be in all lower case. We can apply the rename() method with str.lower to convert all of the column names to lower case.

(
    pd.read_csv("./data/co2-un-data.csv", skiprows=1)
    .rename(columns=str.lower)
)
      region/country/area unnamed: 1  year                                             series     value footnotes  \
0                       8    Albania  1975  Emissions (thousand metric tons of carbon diox...  4338.334       NaN   
1                       8    Albania  1985  Emissions (thousand metric tons of carbon diox...  6929.926       NaN   
2                       8    Albania  1995  Emissions (thousand metric tons of carbon diox...  1848.549       NaN   
3                       8    Albania  2005  Emissions (thousand metric tons of carbon diox...  3825.184       NaN   
4                       8    Albania  2010  Emissions (thousand metric tons of carbon diox...  3930.295       NaN   
...                   ...        ...   ...                                                ...       ...       ...   
2127                  716   Zimbabwe  2005  Emissions per capita (metric tons of carbon di...     0.794       NaN   
2128                  716   Zimbabwe  2010  Emissions per capita (metric tons of carbon di...     0.672       NaN   
2129                  716   Zimbabwe  2015  Emissions per capita (metric tons of carbon di...     0.749       NaN   
2130                  716   Zimbabwe  2016  Emissions per capita (metric tons of carbon di...     0.642       NaN   
2131                  716   Zimbabwe  2017  Emissions per capita (metric tons of carbon di...     0.588       NaN   

                                                 source  
0     International Energy Agency, IEA World Energy ...  
1     International Energy Agency, IEA World Energy ...  
2     International Energy Agency, IEA World Energy ...  
3     International Energy Agency, IEA World Energy ...  
4     International Energy Agency, IEA World Energy ...  
...                                                 ...  
2127  International Energy Agency, IEA World Energy ...  
2128  International Energy Agency, IEA World Energy ...  
2129  International Energy Agency, IEA World Energy ...  
2130  International Energy Agency, IEA World Energy ...  
2131  International Energy Agency, IEA World Energy ...  

[2132 rows x 7 columns]

We previously saw how we can subset columns from a data frame using the filter() method. There are a lot of columns with extraneous information in this dataset, so let’s subset out the columns we are interested in.

Reviewing selecting columns

Select the country, year, series, and value columns from our dataset.

Solution:

(
    co2_emissions_dirty
    .filter(['country', 'year', 'series', 'value'])
)
       country  year                                             series     value
0      Albania  1975  Emissions (thousand metric tons of carbon diox...  4338.334
1      Albania  1985  Emissions (thousand metric tons of carbon diox...  6929.926
2      Albania  1995  Emissions (thousand metric tons of carbon diox...  1848.549
3      Albania  2005  Emissions (thousand metric tons of carbon diox...  3825.184
4      Albania  2010  Emissions (thousand metric tons of carbon diox...  3930.295
...        ...   ...                                                ...       ...
2127  Zimbabwe  2005  Emissions per capita (metric tons of carbon di...     0.794
2128  Zimbabwe  2010  Emissions per capita (metric tons of carbon di...     0.672
2129  Zimbabwe  2015  Emissions per capita (metric tons of carbon di...     0.749
2130  Zimbabwe  2016  Emissions per capita (metric tons of carbon di...     0.642
2131  Zimbabwe  2017  Emissions per capita (metric tons of carbon di...     0.588

[2132 rows x 4 columns]

The series column has two methods of quantifying CO2 emissions: “Emissions (thousand metric tons of carbon dioxide)” and “Emissions per capita (metric tons of carbon dioxide)”. Those are long titles that we’d like to shorten to make them easier to work with. We can shorten them to “emissions_total” and “emissions_percap” by applying the pandas replace() method. When using replace(), we need to tell it which column we want to replace values in, and then map each old value (e.g. “Emissions (thousand metric tons of carbon dioxide)”) to its new value (e.g. “emissions_total”).

(
    co2_emissions_dirty
    .filter(['country', 'year', 'series', 'value'])
    .replace({'series': {"Emissions (thousand metric tons of carbon dioxide)":"emissions_total",
                         "Emissions per capita (metric tons of carbon dioxide)":"emissions_percap"}, 
             })
)
       country  year            series     value
0      Albania  1975   emissions_total  4338.334
1      Albania  1985   emissions_total  6929.926
2      Albania  1995   emissions_total  1848.549
3      Albania  2005   emissions_total  3825.184
4      Albania  2010   emissions_total  3930.295
...        ...   ...               ...       ...
2127  Zimbabwe  2005  emissions_percap     0.794
2128  Zimbabwe  2010  emissions_percap     0.672
2129  Zimbabwe  2015  emissions_percap     0.749
2130  Zimbabwe  2016  emissions_percap     0.642
2131  Zimbabwe  2017  emissions_percap     0.588

[2132 rows x 4 columns]

Recall that we’d like to have separate columns for the two ways the CO2 emissions are measured. To achieve this, we’ll apply the pivot() method that we used previously. The column whose values become the new column names is “series” (the columns argument), and the values that fill those columns come from “value” (the values argument).

(
    co2_emissions_dirty
    .filter(['country', 'year', 'series', 'value'])
    .replace({'series': {"Emissions (thousand metric tons of carbon dioxide)":"emissions_total",
                         "Emissions per capita (metric tons of carbon dioxide)":"emissions_percap"}, 
             })
    .pivot(index=['country', 'year'], columns='series', values='value')
    .reset_index()
)
series   country  year  emissions_percap  emissions_total
0        Albania  1975             1.804         4338.334
1        Albania  1985             2.337         6929.926
2        Albania  1995             0.580         1848.549
3        Albania  2005             1.270         3825.184
4        Albania  2010             1.349         3930.295
...          ...   ...               ...              ...
1061    Zimbabwe  2005             0.794        10272.774
1062    Zimbabwe  2010             0.672         9464.714
1063    Zimbabwe  2015             0.749        11822.362
1064    Zimbabwe  2016             0.642        10368.900
1065    Zimbabwe  2017             0.588         9714.938

[1066 rows x 4 columns]

Excellent! The last step before we can join this data frame is to pull out the data for the year closest to 2007, so we can make a more direct comparison to the most recent data we have from gapminder. For the sake of time, we’ll just tell you that we want data from 2005.

Bonus: How did we determine that 2005 is the closest year to 2007?

We want to make sure we pick a year that is close to 2007, but also a year that has a decent amount of data to work with. One useful tool is the value_counts() method, which tells us how many times each value appears in a column of a data frame. Let’s use this method on the year column to see which years we have data for and whether a good number of countries are represented in each year.

(
    co2_emissions_dirty
    .filter(['country', 'year', 'series', 'value'])
    .replace({'series': {"Emissions (thousand metric tons of carbon dioxide)> ":"emissions_total",
                         "Emissions per capita (metric tons of carbon dioxide)> ":"emissions_percap"}, 
             })
    .pivot(index=['country', 'year'], columns='series', values='value')
    .reset_index()
    .value_counts(['year'])
    .sort_index()
)
year
1975    111
1985    113
1995    136
2005    140
2010    140
2015    142
2016    142
2017    142
Name: count, dtype: int64

It looks like we have data for 140 countries in both 2005 and 2010. We chose 2005 because it is closer to 2007.

Filtering rows and removing columns

Filter the data to keep only rows from 2005 and then drop the year column. (Since we will have data from only one year, the column is now redundant.)

Solution:

(
    co2_emissions_dirty
    .filter(['country', 'year', 'series', 'value'])
    .replace({'series': {"Emissions (thousand metric tons of carbon dioxide)> > ":"emissions_total",
                         "Emissions per capita (metric tons of carbon dioxide)> > ":"emissions_percap"}, 
             })
    .pivot(index=['country', 'year'], columns='series', values='value')
    .reset_index()
    .query("year == 2005")
    .drop(columns='year')
)
series                     country  emissions_percap  emissions_total
3                          Albania             1.270         3825.184
11                         Algeria             2.327        77474.130
19                          Angola             0.314         6146.691
27                       Argentina             3.819       149476.040
33                         Armenia             1.385         4129.845
...                            ...               ...              ...
1029    Venezuela (Boliv. Rep. of)             5.141       137701.548
1037                      Viet Nam             0.940        79230.185
1045                         Yemen             0.915        18836.222
1053                        Zambia             0.176         2120.692
1061                      Zimbabwe             0.794        10272.774

[140 rows x 3 columns]

Finally, let’s go ahead and assign the output of this code chunk, which is the cleaned data frame, to a variable name:

co2_emissions = (
    co2_emissions_dirty
    .filter(['country', 'year', 'series', 'value'])
    .replace({'series': {'Emissions (thousand metric tons of carbon dioxide)':'emissions_total',
                         'Emissions per capita (metric tons of carbon dioxide)':'emissions_percap'}, 
             })
    .pivot(index=['country', 'year'], columns='series', values='value')
    .reset_index()
    .query("year == 2005")
    .drop(columns='year')
)

Joining data frames

Back to top

Now we’re ready to join our CO2 emissions data to the gapminder data. Previously we saw that we could query the gapminder data for the Americas in 2007 and save the filtered result in a new data frame:

gapminder_2007 = (
    gapminder
    .query("year == 2007 and continent == 'Americas'")
    .drop(columns=['year', 'continent'])
)

Look at the data in co2_emissions and gapminder_2007. If you had to merge these two data frames together, which column would you use to merge them? If you said “country” - good job!

We’ll call country our “key”. Can you think of any problems we might run into when we merge the two data frames? We might not have CO2 emissions data for all of the countries in the gapminder dataset, and vice versa. Also, a country might be represented in both data frames but not under the same name in both places. As an example, write down the name of the country that the University of Michigan is in - we’ll come back to your answer shortly!

pandas has a number of tools for joining data frames together, depending on what we want to do with rows for countries that are not represented in both data frames. Here we’ll be using an “inner join” and an “outer join”.

In an “inner join”, the new data frame contains only the rows whose key is found in both data frames. This is a very commonly used join.

Bonus: Other pandas join methods

There are other types of join too. For a left join, if a key is present in the left-hand data frame, it will appear in the output, even if it is not found in the right-hand data frame. For a right join, the opposite is true. For an outer (or full) join, all possible keys are included in the output data frame.
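
To illustrate, here is a minimal sketch with two toy data frames (the names left, right, x, and y are our own, not part of the lesson data):

import pandas as pd

left = pd.DataFrame({'country': ['Albania', 'Algeria'], 'x': [1, 2]})
right = pd.DataFrame({'country': ['Algeria', 'Angola'], 'y': [3, 4]})

print(left.merge(right, how='inner', on='country'))  # Algeria only
print(left.merge(right, how='left', on='country'))   # Albania (y is NaN) and Algeria
print(left.merge(right, how='right', on='country'))  # Algeria and Angola (x is NaN)
print(left.merge(right, how='outer', on='country'))  # all three countries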

Let’s give the merge() method a try.

(
    gapminder_2007
    .merge(co2_emissions, how='inner', on='country')
)
                country          pop  lifeExp     gdpPercap  emissions_percap  emissions_total
0             Argentina   40301927.0   75.320  12779.379640             3.819       149476.040
1                Brazil  190010647.0   72.390   9065.800825             1.667       311623.799
2                Canada   33390141.0   80.653  36319.235010            16.762       540431.495
3                 Chile   16284741.0   78.553  13171.638850             3.343        54434.634
4              Colombia   44227550.0   72.889   7006.580419             1.238        53585.300
5            Costa Rica    4133884.0   78.782   9645.061420             1.286         5463.059
6                  Cuba   11416987.0   78.273   8948.102923             2.220        25051.431
7    Dominican Republic    9319622.0   72.235   6025.374752             1.897        17522.139
8               Ecuador   13755680.0   74.994   6873.262326             1.742        23926.725
9           El Salvador    6939688.0   71.878   5728.353514             1.037         6252.815
10            Guatemala   12572928.0   70.259   5186.050003             0.811        10621.597
11                Haiti    8502814.0   60.916   1201.637154             0.214         1980.992
12             Honduras    7483763.0   70.198   3548.330846             0.976         7192.737
13              Jamaica    2780132.0   72.567   7320.880262             3.746        10281.648
14               Mexico  108700891.0   76.195  11977.574960             3.854       412385.135
15            Nicaragua    5675356.0   72.899   2749.320965             0.750         4032.083
16               Panama    3242173.0   75.537   9809.185636             2.035         6776.118
17             Paraguay    6667147.0   71.752   4172.838464             0.599         3472.665
18                 Peru   28674757.0   71.421   7408.905561             1.037        28632.888
19  Trinidad and Tobago    1056608.0   69.819  18008.509240            13.243        17175.823
20              Uruguay    3447496.0   76.384  10611.462990             1.549         5151.871

Do you see that we now have data from both data frames joined together?

One thing to notice is that the gapminder data had 25 rows, but the output of our join has only 21. Let’s investigate: there must have been countries in the gapminder data that did not appear in our CO2 emissions data.

Let’s do another merge, this time with an outer join. If we set the indicator argument to True, it adds a new column called _merge to the merged data, whose value indicates whether a particular record appeared in the left data frame only (left_only), the right data frame only (right_only), or both. We can then query for the keys on the left that are missing from the data frame on the right.

(
    gapminder_2007
    .merge(co2_emissions, how='outer', on='country', indicator=True)
    .query("_merge == 'left_only'")
)
          country          pop  lifeExp     gdpPercap  emissions_percap  emissions_total     _merge
1         Bolivia    9119152.0   65.554   3822.137084               NaN              NaN  left_only
20    Puerto Rico    3942491.0   78.746  19328.709010               NaN              NaN  left_only
22  United States  301139947.0   78.242  42951.653090               NaN              NaN  left_only
24      Venezuela   26084662.0   73.747  11415.805690               NaN              NaN  left_only

We can see that the CO2 emission data were missing for Bolivia, Puerto Rico, United States, and Venezuela.

We can query the CO2 emission data to check if there are records containing these names.

Note that we can split a long string by adding a backslash \ (called a line continuation character) at the end of each line. The string continues on the next line as if it were written on a single line.
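
For instance, a tiny standalone example (the variable name s is our own):

s = "the first half of the string \
and the second half"
print(s)  # the first half of the string and the second half

Here is the query: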

(
    co2_emissions
    .query("country.str.contains('Bolivia') or \
            country.str.contains('Puerto Rico') or \
            country.str.contains('United States') or \
            country.str.contains('Venezuela')")
)
series                     country  emissions_percap  emissions_total
101     Bolivia (Plurin. State of)             0.984         8975.809
1007      United States of America            19.268      5703220.175
1029    Venezuela (Boliv. Rep. of)             5.141       137701.548

From the outputs above we can see that Bolivia, the United States, and Venezuela are called different things in the CO2 emissions data. Puerto Rico isn’t a country; it’s part of the United States. We can apply the replace() method to these country names in the CO2 emissions data so that the names for Bolivia, the United States, and Venezuela match those in the gapminder data.

(
    co2_emissions
    .replace({'country':{'Bolivia (Plurin. State of)':'Bolivia',
                         'United States of America':'United States',
                         'Venezuela (Boliv. Rep. of)':'Venezuela'}
             })
)
series    country  emissions_percap  emissions_total
3         Albania             1.270         3825.184
11        Algeria             2.327        77474.130
19         Angola             0.314         6146.691
27      Argentina             3.819       149476.040
33        Armenia             1.385         4129.845
...           ...               ...              ...
1029    Venezuela             5.141       137701.548
1037     Viet Nam             0.940        79230.185
1045        Yemen             0.915        18836.222
1053       Zambia             0.176         2120.692
1061     Zimbabwe             0.794        10272.774

[140 rows x 3 columns]
(
    gapminder_2007
    .merge(co2_emissions.replace({'country':{'Bolivia (Plurin. State of)':'Bolivia',
                                  'United States of America':'United States',
                                  'Venezuela (Boliv. Rep. of)':'Venezuela'}
                                 }),
           how='outer', on='country', indicator=True)
    .query("_merge == 'left_only'")
)
        country        pop  lifeExp    gdpPercap  emissions_percap  emissions_total     _merge
20  Puerto Rico  3942491.0   78.746  19328.70901               NaN              NaN  left_only

Now we see that replacing the country names enabled the join for all of those countries, and we are left with only Puerto Rico. In the next exercise, let’s replace the name Puerto Rico with United States in the gapminder data and then use the groupby() method to aggregate the data. We’ll use the population data to weight the life expectancy and GDP values.

In the gapminder data, let’s first replace the name Puerto Rico with United States.

(
    gapminder_2007
    .replace({'country':{'Puerto Rico':'United States'}})
)
                  country          pop  lifeExp     gdpPercap
59              Argentina   40301927.0   75.320  12779.379640
143               Bolivia    9119152.0   65.554   3822.137084
179                Brazil  190010647.0   72.390   9065.800825
251                Canada   33390141.0   80.653  36319.235010
287                 Chile   16284741.0   78.553  13171.638850
311              Colombia   44227550.0   72.889   7006.580419
359            Costa Rica    4133884.0   78.782   9645.061420
395                  Cuba   11416987.0   78.273   8948.102923
443    Dominican Republic    9319622.0   72.235   6025.374752
455               Ecuador   13755680.0   74.994   6873.262326
479           El Salvador    6939688.0   71.878   5728.353514
611             Guatemala   12572928.0   70.259   5186.050003
647                 Haiti    8502814.0   60.916   1201.637154
659              Honduras    7483763.0   70.198   3548.330846
791               Jamaica    2780132.0   72.567   7320.880262
995                Mexico  108700891.0   76.195  11977.574960
1115            Nicaragua    5675356.0   72.899   2749.320965
1187               Panama    3242173.0   75.537   9809.185636
1199             Paraguay    6667147.0   71.752   4172.838464
1211                 Peru   28674757.0   71.421   7408.905561
1259        United States    3942491.0   78.746  19328.709010
1559  Trinidad and Tobago    1056608.0   69.819  18008.509240
1619        United States  301139947.0   78.242  42951.653090
1631              Uruguay    3447496.0   76.384  10611.462990
1643            Venezuela   26084662.0   73.747  11415.805690

Now we have to group Puerto Rico and the US together, aggregating and recalculating the values for all of the other columns. This is a little tricky - we will need a population-weighted mean of lifeExp and gdpPercap: for each country, we multiply each value by its population, sum, and divide by the total population.

(
    gapminder_2007
    .replace({'country':{'Puerto Rico':'United States'}})
    .groupby('country')
    .apply(lambda df: pd.Series({'pop': np.sum(df['pop']),
                                 'gdpPercap': np.sum(df['gdpPercap'] * df['pop']) / np.sum(df['pop']),
                                 'lifeExp': np.sum(df['lifeExp'] * df['pop']) / np.sum(df['pop']),
                                }))
)
                             pop     gdpPercap    lifeExp
country                                                  
Argentina             40301927.0  12779.379640  75.320000
Bolivia                9119152.0   3822.137084  65.554000
Brazil               190010647.0   9065.800825  72.390000
Canada                33390141.0  36319.235010  80.653000
Chile                 16284741.0  13171.638850  78.553000
Colombia              44227550.0   7006.580419  72.889000
Costa Rica             4133884.0   9645.061420  78.782000
Cuba                  11416987.0   8948.102923  78.273000
Dominican Republic     9319622.0   6025.374752  72.235000
Ecuador               13755680.0   6873.262326  74.994000
El Salvador            6939688.0   5728.353514  71.878000
Guatemala             12572928.0   5186.050003  70.259000
Haiti                  8502814.0   1201.637154  60.916000
Honduras               7483763.0   3548.330846  70.198000
Jamaica                2780132.0   7320.880262  72.567000
Mexico               108700891.0  11977.574960  76.195000
Nicaragua              5675356.0   2749.320965  72.899000
Panama                 3242173.0   9809.185636  75.537000
Paraguay               6667147.0   4172.838464  71.752000
Peru                  28674757.0   7408.905561  71.421000
Trinidad and Tobago    1056608.0  18008.509240  69.819000
United States        305082438.0  42646.380702  78.248513
Uruguay                3447496.0  10611.462990  76.384000
Venezuela             26084662.0  11415.805690  73.747000
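
As a quick sanity check, we can recompute the combined United States row by hand from the two rows that went into it (this assumes np is the usual numpy import used in the code above):

# Population-weighted life expectancy for Puerto Rico + United States
pops = np.array([3942491.0, 301139947.0])   # Puerto Rico, United States
life = np.array([78.746, 78.242])
print(pops.sum())                            # 305082438.0
print((life * pops).sum() / pops.sum())      # ~78.2485, matching the table above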

Let’s check to see if it worked!

(
    gapminder_2007
    .replace({'country':{'Puerto Rico': 'United States'}})
    .groupby('country')
    .apply(lambda df: pd.Series({'pop': np.sum(df['pop']),
                                 'gdpPercap': np.sum(df['gdpPercap'] * df['pop']) / np.sum(df['pop']),
                                 'lifeExp': np.sum(df['lifeExp'] * df['pop']) / np.sum(df['pop']),
                                }))
    .merge(co2_emissions.replace({'country': {"Bolivia (Plurin. State of)":"Bolivia",
                                              "United States of America":"United States",
                                              "Venezuela (Boliv. Rep. of)":"Venezuela"}}),
           how='outer', on='country', indicator=True)
    .query("_merge == 'left_only'")
)
Empty DataFrame
Columns: [country, pop, gdpPercap, lifeExp, emissions_percap, emissions_total, _merge]
Index: []

Now the output above returns an empty data frame, which tells us that we have reconciled all of the keys from the gapminder data with the data in the CO2 emission data.

Finally, let’s merge the data with an inner join to create a new data frame.

gapminder_co2 = (
    gapminder_2007
    .replace({'country':{'Puerto Rico': 'United States'}})
    .groupby('country')
    .apply(lambda df: pd.Series({'pop': np.sum(df['pop']),
                                 'gdpPercap': np.sum(df['gdpPercap'] * df['pop']) / np.sum(df['pop']),
                                 'lifeExp': np.sum(df['lifeExp'] * df['pop']) / np.sum(df['pop']),
                                }))
    .merge(co2_emissions.replace({'country': {"Bolivia (Plurin. State of)":"Bolivia",
                                              "United States of America":"United States",
                                              "Venezuela (Boliv. Rep. of)":"Venezuela"}}),
           how='inner', on='country')
)

One last thing! What if we’re interested in distinguishing between countries in North America and South America? We want to create two groups - Canada, the United States, and Mexico in one and the other countries in another.

We can apply the assign() method to add a new column and use the numpy function np.where() to help us define the region.

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'))
)
                country          pop     gdpPercap    lifeExp  emissions_percap  emissions_total region
0             Argentina   40301927.0  12779.379640  75.320000             3.819       149476.040  south
1               Bolivia    9119152.0   3822.137084  65.554000             0.984         8975.809  south
2                Brazil  190010647.0   9065.800825  72.390000             1.667       311623.799  south
3                Canada   33390141.0  36319.235010  80.653000            16.762       540431.495  north
4                 Chile   16284741.0  13171.638850  78.553000             3.343        54434.634  south
5              Colombia   44227550.0   7006.580419  72.889000             1.238        53585.300  south
6            Costa Rica    4133884.0   9645.061420  78.782000             1.286         5463.059  south
7                  Cuba   11416987.0   8948.102923  78.273000             2.220        25051.431  south
8    Dominican Republic    9319622.0   6025.374752  72.235000             1.897        17522.139  south
9               Ecuador   13755680.0   6873.262326  74.994000             1.742        23926.725  south
10          El Salvador    6939688.0   5728.353514  71.878000             1.037         6252.815  south
11            Guatemala   12572928.0   5186.050003  70.259000             0.811        10621.597  south
12                Haiti    8502814.0   1201.637154  60.916000             0.214         1980.992  south
13             Honduras    7483763.0   3548.330846  70.198000             0.976         7192.737  south
14              Jamaica    2780132.0   7320.880262  72.567000             3.746        10281.648  south
15               Mexico  108700891.0  11977.574960  76.195000             3.854       412385.135  north
16            Nicaragua    5675356.0   2749.320965  72.899000             0.750         4032.083  south
17               Panama    3242173.0   9809.185636  75.537000             2.035         6776.118  south
18             Paraguay    6667147.0   4172.838464  71.752000             0.599         3472.665  south
19                 Peru   28674757.0   7408.905561  71.421000             1.037        28632.888  south
20  Trinidad and Tobago    1056608.0  18008.509240  69.819000            13.243        17175.823  south
21        United States  305082438.0  42646.380702  78.248513            19.268      5703220.175  north
22              Uruguay    3447496.0  10611.462990  76.384000             1.549         5151.871  south
23            Venezuela   26084662.0  11415.805690  73.747000             5.141       137701.548  south

Let’s look at the output - see how the Canada, US, and Mexico rows are all labeled as “north” and everything else is labeled as “south”.

We have reached our data cleaning goals! One of the best aspects of doing all of these steps in Python code is that our efforts are reproducible and the raw data are left untouched. With good documentation of our data cleaning and analysis steps, we could easily share our work with another researcher who would be able to repeat what we’ve done. However, it’s also nice to have a saved CSV copy of our clean data. That way we can access it later without needing to redo our data cleaning, and we can also share the cleaned data with collaborators. We can apply the to_csv() method to a data frame to save it to a CSV file.

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'))
    .to_csv("./data/gapminder_co2.csv")
)
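
By default, to_csv() also writes the data frame’s index as an unnamed first column in the file. If you’d rather not save it, you can pass index=False; here is a variant of the call above:

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'))
    .to_csv("./data/gapminder_co2.csv", index=False)  # omit the row index from the saved file
)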

Great! Now we can move on to the analysis.

Analyzing combined data

Back to top

For our analysis, we have two questions we’d like to answer. First, is there a relationship between the GDP of a country and the amount of CO2 emitted (per capita)? Second, Canada, the United States, and Mexico account for nearly half of the population of the Americas. What percent of the total CO2 production do they account for?

To answer the first question, we’ll plot the CO2 emitted (on a per capita basis) against the GDP (on a per capita basis) using a scatter plot:

import seaborn.objects as so

(
    so.Plot(gapminder_co2, x='gdpPercap', y='emissions_percap')
    .add(so.Dot())
    .label(x="GDP (per capita)",
           y="CO2 emitted (per capita)",
           title="There is a strong association between a nation's GDP \nand the amount of CO2 it produces")
)

plot of chunk PlotPercapCO2vsGDP

Tip: Notice we used the \n in our title to get a new line to prevent it from getting cut off.

To help clarify the association, we can add a fitted line representing a 3rd order polynomial regression model.

(
    so.Plot(gapminder_co2, x='gdpPercap', y='emissions_percap')
    .add(so.Dot(), label='data')
    .add(so.Line(color='red'), so.PolyFit(order=3), label='model')
    .label(x="GDP (per capita)",
           y="CO2 emitted (per capita)",
           title="There is a strong association between a nation's GDP \nand the amount of CO2 it produces")
)

plot of chunk PlotPercapCO2vsGDPSmooth

We can force the line to be straight using order=1 as an argument to so.PolyFit.

(
    so.Plot(gapminder_co2, x='gdpPercap', y='emissions_percap')
    .add(so.Dot(), label='data')
    .add(so.Line(color='red'), so.PolyFit(order=1), label='model')
    .label(x="GDP (per capita)",
           y="CO2 emitted (per capita)",
           title="There is a strong association between a nation's GDP \nand the amount of CO2 it produces")
)

plot of chunk PlotPercapCO2vsGDP1SmoothLm

In addition, we see that only two or three countries have very high GDP/emissions, while the rest of the countries are clustered in the lower ranges of the axes. To make the relationship easier to see, we can set the x and y axes to a logarithmic scale. Lastly, we can also add a text layer that displays the country names next to the markers.

(
    so.Plot(gapminder_co2, x='gdpPercap', y='emissions_percap', text='country')
    .add(so.Dot(alpha=.8, pointsize=8))
    .add(so.Text(color='gray', valign='bottom', fontsize=10))
    .scale(x='log', y='log')
    .label(x="GDP (per capita)",
           y="CO2 emitted (per capita)",
           title="There is a strong association between a nation's GDP \nand the amount of CO2 it produces")
    .limit(x=(None, 70_000), y=(None, 30))
)

plot of chunk PlotPercapCO2vsGDP1SmoothLm

To answer our first question: as the title of our plot indicates, there is indeed a strong association between a nation’s GDP and the amount of CO2 it produces.

For the second question, we want to create two groups - Canada, the United States, and Mexico in one and the other countries in another.

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'))
)
                country          pop     gdpPercap    lifeExp  emissions_percap  emissions_total region
0             Argentina   40301927.0  12779.379640  75.320000             3.819       149476.040  south
1               Bolivia    9119152.0   3822.137084  65.554000             0.984         8975.809  south
2                Brazil  190010647.0   9065.800825  72.390000             1.667       311623.799  south
3                Canada   33390141.0  36319.235010  80.653000            16.762       540431.495  north
4                 Chile   16284741.0  13171.638850  78.553000             3.343        54434.634  south
5              Colombia   44227550.0   7006.580419  72.889000             1.238        53585.300  south
6            Costa Rica    4133884.0   9645.061420  78.782000             1.286         5463.059  south
7                  Cuba   11416987.0   8948.102923  78.273000             2.220        25051.431  south
8    Dominican Republic    9319622.0   6025.374752  72.235000             1.897        17522.139  south
9               Ecuador   13755680.0   6873.262326  74.994000             1.742        23926.725  south
10          El Salvador    6939688.0   5728.353514  71.878000             1.037         6252.815  south
11            Guatemala   12572928.0   5186.050003  70.259000             0.811        10621.597  south
12                Haiti    8502814.0   1201.637154  60.916000             0.214         1980.992  south
13             Honduras    7483763.0   3548.330846  70.198000             0.976         7192.737  south
14              Jamaica    2780132.0   7320.880262  72.567000             3.746        10281.648  south
15               Mexico  108700891.0  11977.574960  76.195000             3.854       412385.135  north
16            Nicaragua    5675356.0   2749.320965  72.899000             0.750         4032.083  south
17               Panama    3242173.0   9809.185636  75.537000             2.035         6776.118  south
18             Paraguay    6667147.0   4172.838464  71.752000             0.599         3472.665  south
19                 Peru   28674757.0   7408.905561  71.421000             1.037        28632.888  south
20  Trinidad and Tobago    1056608.0  18008.509240  69.819000            13.243        17175.823  south
21        United States  305082438.0  42646.380702  78.248513            19.268      5703220.175  north
22              Uruguay    3447496.0  10611.462990  76.384000             1.549         5151.871  south
23            Venezuela   26084662.0  11415.805690  73.747000             5.141       137701.548  south

Now we can use this column with the groupby() method to sum the emissions and population by region.

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'))
    .groupby('region')[["emissions_total", "pop"]]
    .sum()
)
        emissions_total          pop
region                              
north       6656036.805  447173470.0
south        889331.721  451697714.0

We see that although Canada, the United States, and Mexico account for close to half the population of the Americas, they account for roughly 88% of the CO2 emitted. We just did this math quickly by plugging the numbers from our table into the console. Can we make that a little more reproducible by adding percentage columns for population and total emissions to our data before summarizing?
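
That quick console math looks something like this:

# Share of total emissions from the "north" group, using the numbers in the table above
print(6656036.805 / (6656036.805 + 889331.721) * 100)  # ~88.2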

Map plots

The plotly library also has useful functions to draw your data on a map. There are lots of different ways to draw maps, but here’s a quick example of making a choropleth map using the gapminder data. We will plot each country with a color indicating the life expectancy in 1997.

In order for the map function px.choropleth() to understand the countries in the gapminder data, we need to first convert the country names to standard 3-letter country codes.

NOTE: we haven’t learned how to modify the data in this way yet, but we’ll learn about that in the next lesson. Just take for granted that it works for now :)

import plotly.express as px  # plotly's high-level interface (may already be imported earlier in the lesson)

(
    gapminder_1997
    .replace({'country' : {'United States' : 'United States of America',
                           'United Kingdom' : 'United Kingdom of Great Britain and Northern Ireland',
                          }})
    .merge(pd.read_csv("./data/country-iso.csv")
           .rename(columns={'name' : 'country'}),
           on='country', how='inner')
    .pipe(px.choropleth, 
          locations='alpha-3',
          color='lifeExp',
          hover_name='country',
          hover_data=['lifeExp', 'pop'])
)

plot of chunk mapPlots

Notice that this map helps to show that we actually have some gaps in the data. We are missing observations for countries like Russia and many countries in central Africa. Thus, it’s important to acknowledge that any patterns or trends we see in the data might not apply to those regions.

Finishing with Git and GitHub

Back to top

Awesome work! Let’s make sure it doesn’t go to waste. Time to add, commit, and push our changes to GitHub again - do you remember how?

Changing directories

Print your current working directory and list the items in the directory to check where you are. If you are not in the un-report directory, navigate there.

Solution:

pwd  
ls 
cd ~/Desktop/un-report  
ls

Reviewing git and GitHub

Pull to make sure our local repository is up to date. Then add, commit, and push your commits to GitHub. Don’t forget to check your git status periodically to make sure everything is going as expected!

Solution:

git status 
git pull
git status 
git add "gapminder_data_analysis.ipynb"  
git status 
git commit -m "Create data analysis file"  
git status 
git log --oneline
git push
git status 

Bonus exercises

Back to top

Calculating percent

What percentage of the population and CO2 emissions in the Americas does the United States make up? What percentage of the population and CO2 emissions does North America make up?

Solution

Create new columns using assign() that calculate the percentage of the total pop and emissions_total values that each country represents.

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'), 
            emissions_total_perc=lambda df: df['emissions_total']/df['emissions_total'].sum()*100,
            pop_perc=lambda df: df['pop']/df['pop'].sum()*100)
)
                country          pop     gdpPercap    lifeExp  emissions_percap  emissions_total region  emissions_total_perc   pop_perc
0             Argentina   40301927.0  12779.379640  75.320000             3.819       149476.040  south              1.981030   4.483615
1               Bolivia    9119152.0   3822.137084  65.554000             0.984         8975.809  south              0.118958   1.014512
2                Brazil  190010647.0   9065.800825  72.390000             1.667       311623.799  south              4.130001  21.138807
3                Canada   33390141.0  36319.235010  80.653000            16.762       540431.495  north              7.162427   3.714675
4                 Chile   16284741.0  13171.638850  78.553000             3.343        54434.634  south              0.721431   1.811688
5              Colombia   44227550.0   7006.580419  72.889000             1.238        53585.300  south              0.710175   4.920344
6            Costa Rica    4133884.0   9645.061420  78.782000             1.286         5463.059  south              0.072403   0.459897
7                  Cuba   11416987.0   8948.102923  78.273000             2.220        25051.431  south              0.332011   1.270147
8    Dominican Republic    9319622.0   6025.374752  72.235000             1.897        17522.139  south              0.232224   1.036814
9               Ecuador   13755680.0   6873.262326  74.994000             1.742        23926.725  south              0.317105   1.530328
10          El Salvador    6939688.0   5728.353514  71.878000             1.037         6252.815  south              0.082870   0.772045
11            Guatemala   12572928.0   5186.050003  70.259000             0.811        10621.597  south              0.140770   1.398746
12                Haiti    8502814.0   1201.637154  60.916000             0.214         1980.992  south              0.026254   0.945944
13             Honduras    7483763.0   3548.330846  70.198000             0.976         7192.737  south              0.095327   0.832573
14              Jamaica    2780132.0   7320.880262  72.567000             3.746        10281.648  south              0.136264   0.309291
15               Mexico  108700891.0  11977.574960  76.195000             3.854       412385.135  north              5.465407  12.093044
16            Nicaragua    5675356.0   2749.320965  72.899000             0.750         4032.083  south              0.053438   0.631387
17               Panama    3242173.0   9809.185636  75.537000             2.035         6776.118  south              0.089805   0.360694
18             Paraguay    6667147.0   4172.838464  71.752000             0.599         3472.665  south              0.046024   0.741724
19                 Peru   28674757.0   7408.905561  71.421000             1.037        28632.888  south              0.379476   3.190085
20  Trinidad and Tobago    1056608.0  18008.509240  69.819000            13.243        17175.823  south              0.227634   0.117548
21        United States  305082438.0  42646.380702  78.248513            19.268      5703220.175  north             75.585707  33.940618
22              Uruguay    3447496.0  10611.462990  76.384000             1.549         5151.871  south              0.068279   0.383536
23            Venezuela   26084662.0  11415.805690  73.747000             5.141       137701.548  south              1.824981   2.901936

This table shows that the United States makes up about 34% of the population of the Americas, but accounts for about 76% of total emissions. Now let’s take a look at population and emissions for North and South America:

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'), 
            emissions_total_perc=lambda df: df['emissions_total']/df['emissions_total'].sum()*100,
            pop_perc=lambda df: df['pop']/df['pop'].sum()*100)
    .groupby('region')
    .agg({'emissions_total_perc' : 'sum', 
          'pop_perc' : 'sum'})
)
        emissions_total_perc   pop_perc
region                                 
north              88.213542  49.748337
south              11.786458  50.251663

CO2 bar plot

Create a bar plot of the percent of total emissions for each country, colored by region (North or South America).

Solution

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'), 
            emissions_total_perc=lambda df: df['emissions_total']/df['emissions_total'].sum()*100,
            pop_perc=lambda df: df['pop']/df['pop'].sum()*100)
    .pipe(so.Plot, x='country', y='emissions_total_perc', color='region')
    .add(so.Bar())
)

[Plot: bar chart of percent of total emissions by country, colored by region]

Now switch the x and y axes to make the country names more readable.

Solution

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'), 
            emissions_total_perc=lambda df: df['emissions_total']/df['emissions_total'].sum()*100,
            pop_perc=lambda df: df['pop']/df['pop'].sum()*100)
    .pipe(so.Plot, x='emissions_total_perc', y='country', color='region')
    .add(so.Bar())
)

[Plot: horizontal bar chart of percent of total emissions by country, colored by region]

Reorder the bars in descending order. Hint: what method did we use earlier to sort data values?

Solution

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'), 
            emissions_total_perc=lambda df: df['emissions_total']/df['emissions_total'].sum()*100,
            pop_perc=lambda df: df['pop']/df['pop'].sum()*100)
    .sort_values('emissions_total_perc', ascending=False)
    .pipe(so.Plot, x='emissions_total_perc', y='country', color='region')
    .add(so.Bar())
)

[Plot: horizontal bar chart of percent of total emissions by country, sorted in descending order]

Practice making it look pretty!

low emissions

Find the 3 countries with the lowest per capita emissions.

Solution

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'), 
            emissions_total_perc=lambda df: df['emissions_total']/df['emissions_total'].sum()*100,
            pop_perc=lambda df: df['pop']/df['pop'].sum()*100,)
    .sort_values('emissions_percap', ascending=True)[['country', 'emissions_percap']]
    .head(3)
)
      country  emissions_percap
12      Haiti             0.214
18   Paraguay             0.599
16  Nicaragua             0.750

Create a bar chart for the per capita emissions for just those three countries.

Solution

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'), 
            emissions_total_perc=lambda df: df['emissions_total']/df['emissions_total'].sum()*100,
            pop_perc=lambda df: df['pop']/df['pop'].sum()*100,)
    .query("country in ['Haiti', 'Paraguay', 'Nicaragua']")
    .pipe(so.Plot, x='country', y='emissions_percap')
    .add(so.Bar())
)

[Plot: bar chart of per capita emissions for Haiti, Paraguay, and Nicaragua]

Reorder them in descending order.

Solution

(
    gapminder_co2
    .assign(region=lambda df: np.where(df['country'].isin(['Canada', 'United States', 'Mexico']), 'north', 'south'), 
            emissions_total_perc=lambda df: df['emissions_total']/df['emissions_total'].sum()*100,
            pop_perc=lambda df: df['pop']/df['pop'].sum()*100)
    .query("country in ['Haiti', 'Paraguay', 'Nicaragua']")
    .sort_values('emissions_percap', ascending=False)
    .pipe(so.Plot, x='country', y='emissions_percap')
    .add(so.Bar())
)

[Plot: bar chart of per capita emissions for the three countries, in descending order]

Key Points

  • Library importing is an important first step in preparing a Python environment.

  • Data analysis in Python facilitates reproducible research.

  • There are many useful methods in the pandas library that can aid in data analysis.

  • Assessing data source and structure is an important first step in analysis.

  • Preparing data for analysis can take significant effort and planning.


Jupyter Notebook and Markdown

Overview

Teaching: 45 min
Exercises: 30 min
Questions
  • How can I make reproducible reports using Jupyter notebook?

  • How do I format the notebook using Markdown?

Objectives
  • To create a Jupyter Notebook that combines text, code, and figures.

  • To use Markdown to format our notebook.

  • To be aware of the various report formats that can be rendered from Jupyter Notebook.

  • To practice using the Unix Shell, GitHub, and Python through paired programming exercises.

Contents

  1. Why use Jupyter Notebook?
  2. Creating a notebook directory
  3. Basic components of a Jupyter Notebook
  4. Exporting Jupyter notebook
  5. Beyond JupyterLab and Jupyter notebook
  6. Integrating it all together: Paired exercise

Recall that our goal is to generate a report to the United Nations on how a country’s life expectancy is related to GDP.

Discussion

How do you usually share data analyses with your collaborators? Many people share them through a Word or PDF document, a spreadsheet, slides, a graphic, etc.

Why use Jupyter Notebook?

Back to top

In a Jupyter Notebook, you can incorporate ordinary text (e.g., experimental methods, analysis and discussion of results) alongside code and figures! This is useful for writing reproducible reports and publications, and for sharing work with collaborators. Because the code is embedded in the notebook, the tables and figures are reproducible: anyone can run the code and get the same results. If you find an error or want to add more to the report, you can just re-run the document and you’ll have updated tables and figures. This concept of combining text and code is called literate programming. To do this we use Jupyter Notebook, which combines Markdown (a simple syntax for formatting plain text) with Python code. A Jupyter Notebook can be exported as HTML, PDF, or other document formats that we can share with others.

Creating a notebook directory

Back to top

To get started, let’s use the Unix Shell to create a directory within un-report called notebooks, where we will save the notebooks for our UN report. First, open the Unix Shell and cd to un-report:

pwd
/home/USERNAME/Desktop/un-report
mkdir notebooks

Basic components of a Jupyter notebook

Back to top

Creating a Jupyter Notebook

Now that we have a better understanding of what we can use Jupyter notebooks for, let’s start writing a notebook!

We can create a Jupyter notebook file in the same way that we did in the previous lessons.

A Jupyter notebook is composed of cells. So far we have only used “code” cells, the default cell type, for writing and executing code. In addition to code cells, Jupyter notebook also supports “Markdown” cells for including text, images, tables, and other content in a notebook.

Introduction to Markdown

Markdown is a simple way to create formatted text.

Let’s convert the first cell in the notebook from a code cell to a Markdown cell using the dropdown menu in the toolbar near the top of the notebook.

We can create headers and subheaders using one or more pound signs # followed by a space. For example, we can add the following headers. We can run the Markdown cell the same way as a code cell to see a rendered (formatted) version of the Markdown text we just typed in.

# UM Carpentries Workshop - Python
## Day 2: Jupyter Notebook and Markdown
### 2023-12-12

OK, now that we know how to make headers, let’s practice some more Markdown syntax.

In JupyterLab, click the Help menu, then click Markdown Reference and read through the Markdown syntax. Then go through the “10 minute Markdown tutorial”.
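
For a quick taste, here are a few other common pieces of Markdown syntax (all standard Markdown, so they should render the same in any Markdown cell):

*italic text* and **bold text** and `inline code`

- a bulleted list item
- another item

1. a numbered list item

[a link to the Jupyter website](https://jupyter.org)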

Exporting Jupyter notebook

Back to top

We can export a notebook to other formats by clicking the File menu, then Save and Export Notebook As….

We can save a notebook as an HTML file if we want to publish it on the web. We can even export it as presentation slides. To do so, first open the right sidebar (the icon with two gears), then assign a “Slide Type” to each cell, which allows us to control whether and how each cell is included in the slides.
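
The same exports can also be done from the command line with jupyter nbconvert, which ships with Jupyter (the notebook name below is just an example):

# export a notebook to HTML
jupyter nbconvert --to html my-notebook.ipynb

# export the same notebook to presentation slides
jupyter nbconvert --to slides my-notebook.ipynb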

Beyond JupyterLab and Jupyter notebook

Back to top

Jupyter Notebook is a great and popular tool for learning programming and for exploratory data analysis, among other things. However, Jupyter notebooks have some drawbacks that may or may not be important to you.

Google Colab

Google Colaboratory, or Colab for short, is an online Jupyter Notebook service that requires no setup on your own computer, and it allows multiple people to co-edit the same notebook. Think of it as Google Docs for Jupyter notebooks. Note that it requires you to sign in to a Google account.

Python’s .py file

Currently it is not easy to do version control with Jupyter notebooks, because a notebook’s raw file is JSON with many details that are not easy for humans to read (e.g., when checking the differences between two versions).

The good news is that we can also directly write and run Python files, which have the file extension .py. A Python file is a plain text file, which makes version control straightforward.

For example, we can create a Python file from the Launcher tab by clicking the Python File button, and then copy and paste our code into it. We can run a Python file by opening a terminal (also available inside JupyterLab from the Launcher tab) and typing python followed by the name of the file we wish to run, for example, python abc.py.
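
For instance, a minimal abc.py (reusing the gapminder file from earlier lessons) might contain:

# abc.py: a minimal example script
import pandas as pd

gapminder = pd.read_csv("./data/gapminder_data.csv")
print(gapminder.head())

Running python abc.py in the terminal then prints the first five rows of the dataset.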

ProTip: In JupyterLab we can drag the terminal to the bottom of the main work area, so that we can see both our Python file and the terminal.

[Screenshot: a Python file and a terminal shown together in JupyterLab]

Integrating it all together: Paired exercise

Back to top

You’ve learned so much in the past two days - how to use the Unix Shell to move around your computer, how to use git for version control and GitHub for collaborating with others on code, how to make pretty plots and do data analysis in Python, and how to incorporate it all into a Jupyter Notebook. Now, you’re going to work in pairs to practice everything you learned. Ideally, you’ll have the same pair as for the git/GitHub lesson. Don’t worry - if you have questions, the instructor and helpers are here to help you out!

Only one of the people in your pair is going to create the Jupyter Notebook file. The other person is going to collaborate with that person using GitHub. So the first step is to choose one person in your pair to create/host the Jupyter Notebook file.

For the person who is going to host the new Jupyter Notebook file:

  1. Make a new Jupyter Notebook file in the notebooks directory
  2. Give it an informative title.

For the person who is going to collaborate with the host of the Jupyter Notebook file:

If you don’t already have your partner’s GitHub repo cloned from the git/GitHub lesson, clone their repo to your Desktop under the name USERNAME-un-report. If you don’t remember how to do this, you can review the git lesson.

The way you will collaborate with each other is as follows:

  1. For each exercise, both people will be thinking about how to answer the question, but only one person will be writing the code. This is called paired programming.
  2. Once you have completed 3 exercises, the person working on the exercises will add, commit, and push the changes to GitHub (see the command sketch after this list).
  3. Then the other person will pull the changes from GitHub.
  4. The person who pulled changes will code for the next exercise.
  5. Repeat the process for as many exercises as you can finish in the remaining time.
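
As a reminder from the git lesson, each hand-off looks roughly like this (the commit message is just an example):

# person who just coded: stage, commit, and push the changes
git add .
git commit -m "Add solutions for exercises 1-3"
git push

# partner: pull the changes before coding the next exercise
git pull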

Don’t worry if you don’t finish all of the exercises, and it’s not a race between groups! This is just a way for you to practice what you’ve learned. Also, you can switch off more or less frequently depending on how much you want to practice pushing and pulling to/from GitHub.

One note: It may be helpful to copy and paste the questions into the Jupyter Notebook file as you go.

Exercises using the gapminder data

Back to top

First we’re going to start out with a few questions about the gapminder dataset.

[1] The very first step is to read in the gapminder dataset, so do that first!

Solution

import numpy as np
import pandas as pd

gapminder = pd.read_csv("./data/gapminder_data.csv")
print(gapminder.head())
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106

Investigating population over time.

Back to top

[2] Make a scatter plot of year vs. population, separated into a plot for each continent. Hint: you can apply the facet() method to the plot to separate it into multiple plots.

Solution

import seaborn.objects as so

(
    so.Plot(gapminder, x='year', y='pop')
    .add(so.Dot())
    .facet('continent', wrap=3)
)

[Plot: scatter plots of population over time, faceted by continent]

[3] It seems like there are 2 outliers - which countries are those?

Solution

(
    gapminder
    .query("pop > 1e9")
    ['country']
    .unique()
)
array(['China', 'India'], dtype=object)

[4] Plot year vs. population separated into a plot for each continent but excluding the 2 outlier countries.

Solution

(
    gapminder
    .query("country not in ['China', 'India']")
    .pipe(so.Plot, x='year', y='pop')
    .add(so.Dot())
    .facet('continent', wrap=3)
)

[Plot: population over time faceted by continent, excluding China and India]

Bonus questions: come back to these if you have time at the end

Back to top

[5] It’s hard to see which country is which here. Can you change the scatter plot to a line plot so we can get a better sense of trends over time? Hint: This website has more information: https://www.r-graph-gallery.com/line-chart-several-groups-ggplot2.html

Solution

(
    gapminder
    .query("country not in ['China', 'India']")
    .pipe(so.Plot, x='year', y='pop', group='country')
    .add(so.Line())
    .facet('continent', wrap=3)
)

[Plot: line plots of population over time per country, faceted by continent]

Looking into life expectancy a bit more.

Back to top

[6] What country had the highest life expectancy in 1982? Hint: You can apply the max() method to a column when setting up your query.

Solution

(
    gapminder
    .query("year == 1982")
    .query("lifeExp == lifeExp.max()")
)
    country  year          pop continent  lifeExp    gdpPercap
798   Japan  1982  118454974.0      Asia    77.11  19384.10571

[7] Now, do the same thing but for all years! Hint: You can use the groupby() method and then apply a custom function using the apply() method. You can apply the idxmax() method to a column to find the index that has the maximum value.

Solution

(
    gapminder
    .groupby('year')
    .apply(lambda x: x.loc[x['lifeExp'].idxmax()])
)
      country  year          pop continent  lifeExp     gdpPercap
year                                                             
1952   Norway  1952    3327728.0    Europe   72.670  10095.421720
1957  Iceland  1957     165110.0    Europe   73.470   9244.001412
1962  Iceland  1962     182053.0    Europe   73.680  10350.159060
1967   Sweden  1967    7867931.0    Europe   74.160  15258.296970
1972   Sweden  1972    8122293.0    Europe   74.720  17832.024640
1977  Iceland  1977     221823.0    Europe   76.110  19654.962470
1982    Japan  1982  118454974.0      Asia   77.110  19384.105710
1987    Japan  1987  122091325.0      Asia   78.670  22375.941890
1992    Japan  1992  124329269.0      Asia   79.360  26824.895110
1997    Japan  1997  125956499.0      Asia   80.690  28816.584990
2002    Japan  2002  127065841.0      Asia   82.000  28604.591900
2007    Japan  2007  127467972.0      Asia   82.603  31656.068060

[8] Make a jitter plot for the life expectancies of the countries in Asia for each year (year is the x axis, life expectancy is the y axis). Also fix the x and y axis labels.

Solution

(
    gapminder
    .query("continent == 'Asia'")
    .pipe(so.Plot, x='year', y='lifeExp')
    .add(so.Dot(alpha=.7), so.Jitter(.5))
)

[Plot: jittered life expectancy by year for countries in Asia]

Bonus questions: come back to these if you have time at the end

Back to top

[9] What are the outliers in life expectancy in Asia for each year (lower life expectancy)?

Solution

(
    gapminder
    .query("continent == 'Asia'")
    .groupby('year')
    .apply(lambda x: x.loc[x['lifeExp'].idxmin()])
)
          country  year         pop continent  lifeExp   gdpPercap
year                                                              
1952  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1957  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
1962  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
1967  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
1972  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
1977     Cambodia  1977   6978607.0      Asia   31.220  524.972183
1982  Afghanistan  1982  12881816.0      Asia   39.854  978.011439
1987  Afghanistan  1987  13867957.0      Asia   40.822  852.395945
1992  Afghanistan  1992  16317921.0      Asia   41.674  649.341395
1997  Afghanistan  1997  22227415.0      Asia   41.763  635.341351
2002  Afghanistan  2002  25268405.0      Asia   42.129  726.734055
2007  Afghanistan  2007  31889923.0      Asia   43.828  974.580338

[10] Make a plot that shows the range (i.e., mean plus/minus standard deviation) for the life expectancies of the countries over time for each continent. Try to fix the x and y axis labels and text, too. Feel free to change the theme if you’d like.

Solution

(
    gapminder
    .pipe(so.Plot, x='year', y='lifeExp')
    .add(so.Range(), so.Est(func='mean', errorbar='sd'))
    .add(so.Dot(), so.Agg())
    .facet('continent', wrap=3)
)

[Plot: mean plus/minus standard deviation of life expectancy over time, faceted by continent]

[11] Which country has had the greatest increase in life expectancy from 1952 to 2007? Hint: You might want to use the pivot() method to get your data in a format with columns for: country, 1952 life expectancy, 2007 life expectancy, and the difference between 2007 and 1952 life expectancy.

Solution

(
    gapminder
    .query("year in [1952, 2007]")
    .pivot(index='country', columns='year', values='lifeExp')
    .assign(diff=lambda x: x[2007] - x[1952])
    .query("diff == diff.max()")
)
year       1952   2007    diff
country                       
Oman     37.578  75.64  38.062

[12] What countries had a decrease in life expectancy from 1952 to 2007?

Solution

(
    gapminder
    .query("year in [1952, 2007]")
    .pivot(index='country', columns='year', values='lifeExp')
    .assign(diff=lambda x: x[2007] - x[1952])
    .query("diff < 0")
)
year         1952    2007   diff
country                         
Swaziland  41.407  39.613 -1.794
Zimbabwe   48.451  43.487 -4.964

Exercises integrating a new dataset

Back to top

If you finished the questions involving the gapminder dataset (bonus questions are optional), move on to these questions next. Note that we don’t expect you to finish all of these! You can also use them as practice after the workshop if you’d like.

Now that you’ve practiced what you’ve learned with the gapminder data, you’re going to try using what we’ve learned to explore a new dataset.

Preview of the data

Back to top

This dataset has information on the gross domestic expenditure on research and development (R&D) for different countries. We’re going to use it to practice the data analysis workflow that you learned over the course of the workshop.

Data: Gross domestic expenditure on research and development (R & D)

Data source: UN data, under “Science and technology”

Data path: data/rnd-un-data.csv

Raw CSV file:

T27,Gross domestic expenditure on research and development (R&D),,,,,
Region/Country/Area,,Year,Series,Value,Footnotes,Source
8,Albania,2008,Gross domestic expenditure on R & D: as a percentage of GDP (%),0.1541,Partial data.,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed June 2020."
8,Albania,2008,Gross domestic expenditure on R & D: Business enterprises (%),3.2603,Partial data.,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed June 2020."
...

Reading in and cleaning the data

Back to top

[1] First, read in the data. Note that you need to skip the first line of the file because that’s just a title for the whole dataset (see above). Also rename the columns to something more informative (as you learned, there are lots of ways to do this, and different preferences - feel free to use whichever method you want!).

Solution

(
    pd.read_csv("./data/rnd-un-data.csv", skiprows=1)
    .rename(columns={'Unnamed: 1' : 'country'})
    .rename(columns=str.lower)
)
      region/country/area  country  year                                             series    value                  footnotes  \
0                       8  Albania  2008  Gross domestic expenditure on R & D: as a perc...   0.1541              Partial data.   
1                       8  Albania  2008  Gross domestic expenditure on R & D: Business ...   3.2603              Partial data.   
2                       8  Albania  2008  Gross domestic expenditure on R & D: Governmen...  80.8046              Partial data.   
3                       8  Albania  2008  Gross domestic expenditure on R & D: Higher ed...   8.5680              Partial data.   
4                       8  Albania  2008  Gross domestic expenditure on R & D: Funds fro...   7.3672              Partial data.   
...                   ...      ...   ...                                                ...      ...                        ...   
2415                  894   Zambia  2008  Gross domestic expenditure on R & D: as a perc...   0.2782  Break in the time series.   
2416                  894   Zambia  2008  Gross domestic expenditure on R & D: Business ...   3.2277  Break in the time series.   
2417                  894   Zambia  2008  Gross domestic expenditure on R & D: Governmen...  94.8311  Break in the time series.   
2418                  894   Zambia  2008  Gross domestic expenditure on R & D: Private n...   0.3226  Break in the time series.   
2419                  894   Zambia  2008  Gross domestic expenditure on R & D: Funds fro...   1.6187  Break in the time series.   

                                                 source  
0     United Nations Educational, Scientific and Cul...  
1     United Nations Educational, Scientific and Cul...  
2     United Nations Educational, Scientific and Cul...  
3     United Nations Educational, Scientific and Cul...  
4     United Nations Educational, Scientific and Cul...  
...                                                 ...  
2415  United Nations Educational, Scientific and Cul...  
2416  United Nations Educational, Scientific and Cul...  
2417  United Nations Educational, Scientific and Cul...  
2418  United Nations Educational, Scientific and Cul...  
2419  United Nations Educational, Scientific and Cul...  

[2420 rows x 7 columns]

[2] Next, take a look at the “series” column (or whatever you renamed it to), and make the titles shorter and with no spaces to make them easier to work with.

Solution

First let’s take a look at what unique values this column contains.

(
    pd.read_csv("./data/rnd-un-data.csv", skiprows=1)
    .rename(columns={'Unnamed: 1' : 'country'})
    .rename(columns=str.lower)
    ['series'].unique()
)
['Gross domestic expenditure on R & D: as a percentage of GDP (%)'
 'Gross domestic expenditure on R & D: Business enterprises (%)'
 'Gross domestic expenditure on R & D: Government (%)'
 'Gross domestic expenditure on R & D: Higher education (%)'
 'Gross domestic expenditure on R & D: Funds from abroad (%)'
 'Gross domestic expenditure on R & D: Not distributed (%)'
 'Gross domestic expenditure on R & D: Private non-profit (%)']

Now let’s replace them with shorter values, and assign the result to a data frame called rnd.

rnd = (
    pd.read_csv("./data/rnd-un-data.csv", skiprows=1)
    .rename(columns={'Unnamed: 1' : 'country'})
    .rename(columns=str.lower)
    .replace({'series' : {'Gross domestic expenditure on R & D: as a percentage of GDP (%)' : 'gdp_pct',
                          'Gross domestic expenditure on R & D: Business enterprises (%)' : 'business',
                          'Gross domestic expenditure on R & D: Government (%)' : 'government',
                          'Gross domestic expenditure on R & D: Higher education (%)' : 'higher_ed',
                          'Gross domestic expenditure on R & D: Funds from abroad (%)' : 'abroad',
                          'Gross domestic expenditure on R & D: Not distributed (%)' : 'not_distributed',
                          'Gross domestic expenditure on R & D: Private non-profit (%)' : 'non_profit',}})
)
print(rnd)
      region/country/area  country  year      series    value                  footnotes                                             source
0                       8  Albania  2008     gdp_pct   0.1541              Partial data.  United Nations Educational, Scientific and Cul...
1                       8  Albania  2008    business   3.2603              Partial data.  United Nations Educational, Scientific and Cul...
2                       8  Albania  2008  government  80.8046              Partial data.  United Nations Educational, Scientific and Cul...
3                       8  Albania  2008   higher_ed   8.5680              Partial data.  United Nations Educational, Scientific and Cul...
4                       8  Albania  2008      abroad   7.3672              Partial data.  United Nations Educational, Scientific and Cul...
...                   ...      ...   ...         ...      ...                        ...                                                ...
2415                  894   Zambia  2008     gdp_pct   0.2782  Break in the time series.  United Nations Educational, Scientific and Cul...
2416                  894   Zambia  2008    business   3.2277  Break in the time series.  United Nations Educational, Scientific and Cul...
2417                  894   Zambia  2008  government  94.8311  Break in the time series.  United Nations Educational, Scientific and Cul...
2418                  894   Zambia  2008  non_profit   0.3226  Break in the time series.  United Nations Educational, Scientific and Cul...
2419                  894   Zambia  2008      abroad   1.6187  Break in the time series.  United Nations Educational, Scientific and Cul...

[2420 rows x 7 columns]

[3] Next, make a column for each of the data types in the “series” column (or whatever you renamed it to). This should give you the following columns: country name, year, expenditure in general, % of funds from business, % of funds from government, % of funds from higher ed, % of funds from non-profit, % of funds from abroad, % of funds from non-specified sources.

Solution

(
    rnd
    .pivot(columns='series', values='value', index=['country', 'year'])
    .reset_index()
)
series         country  year  abroad  business  gdp_pct  government  higher_ed  non_profit  not_distributed
0              Albania  2008  7.3672    3.2603   0.1541     80.8046      8.568         NaN              NaN
1              Algeria  2005     NaN       NaN   0.0660         NaN        NaN         NaN              NaN
2              Algeria  2017  0.0246    6.7441   0.5424     93.1311        NaN         NaN           0.0312
3       American Samoa  2005     NaN       NaN   0.3647         NaN        NaN         NaN              NaN
4       American Samoa  2006     NaN       NaN   0.3931         NaN        NaN         NaN              NaN
..                 ...   ...     ...       ...      ...         ...        ...         ...              ...
543           Viet Nam  2002  6.3300   18.0600   0.1927     74.1100        NaN         NaN           0.8400
544           Viet Nam  2015  2.8893   58.0950   0.4411     33.0259        NaN         NaN           5.0416
545           Viet Nam  2017  4.4946   64.1201   0.5267     26.9304        NaN         NaN           3.0523
546             Zambia  2005     NaN       NaN   0.0249         NaN        NaN         NaN              NaN
547             Zambia  2008  1.6187    3.2277   0.2782     94.8311        NaN      0.3226              NaN

[548 rows x 9 columns]

Note that there is a lot of missing data.

Now we have our data set up in a way that makes it easier to work with. Feel free to clean up the data more before moving on to the next step if you’d like.

Plotting with the R & D dataset

Back to top

[4] Plot the distribution of percent expenditure using a histogram.

Solution

import seaborn.objects as so

(
    rnd
    .pivot(columns='series', values='value', index=['country', 'year'])
    .reset_index()
    .pipe(so.Plot, x='gdp_pct')
    .add(so.Bars(), so.Hist(bins=30))
)

[Plot: histogram of R&D expenditure as a percent of GDP]

[5] Plot the R&D expenditure by year (discrete x vs continuous y) using a scatter plot. Feel free to try to make the plot more legible if you want.

Solution

(
    rnd
    .pivot(columns='series', values='value', index=['country', 'year'])
    .reset_index()
    .pipe(so.Plot, x='year', y='gdp_pct')
    .add(so.Dot(alpha=.5))
)

[Plot: scatter plot of R&D expenditure by year]

[6] Plot the R&D expenditure by year (discrete x vs continuous y) using a jitter plot.

Solution

(
    rnd
    .pivot(columns='series', values='value', index=['country', 'year'])
    .reset_index()
    .pipe(so.Plot, x='year', y='gdp_pct')
    .add(so.Dot(alpha=.5), so.Jitter(.5))
)

[Plot: jitter plot of R&D expenditure by year]

Combining the CO2 and R&D datasets

Back to top

Now we’re going to work with the CO2 and R&D datasets together.

Unfortunately, the two datasets don’t cover exactly the same years.

[7] First, read in the CO2 dataset. You can use the code from the Python for data analysis lesson to clean the CO2 data.

Solution

# read in and clean CO2 data

co2 = (
    pd.read_csv("./data/co2-un-data.csv", skiprows=2,
                names=['region', 'country', 'year', 'series', 'value', 'footnotes', 'source'])
    .filter(['country', 'year', 'series', 'value'])
    .replace({'series': {"Emissions (thousand metric tons of carbon dioxide)":"emissions_total",
                         "Emissions per capita (metric tons of carbon dioxide)":"emissions_percap"}, 
             })
    .pivot(index=['country', 'year'], columns='series', values='value')
    .reset_index()
)

print(co2)
series   country  year  emissions_percap  emissions_total
0        Albania  1975             1.804         4338.334
1        Albania  1985             2.337         6929.926
2        Albania  1995             0.580         1848.549
3        Albania  2005             1.270         3825.184
4        Albania  2010             1.349         3930.295
...          ...   ...               ...              ...
1061    Zimbabwe  2005             0.794        10272.774
1062    Zimbabwe  2010             0.672         9464.714
1063    Zimbabwe  2015             0.749        11822.362
1064    Zimbabwe  2016             0.642        10368.900
1065    Zimbabwe  2017             0.588         9714.938

[1066 rows x 4 columns]

[8] Merge the CO2 dataset and the R&D dataset together. Keep only the following columns: country, year, total CO2 emissions, CO2 emissions per capita, and percent of GDP used for R&D.

Solution

(
    co2
    .merge(rnd, how='outer', on=['country', 'year'])
    .filter(['country', 'year', 'emissions_total', 'emissions_percap', 'gdp_pct'])
)
                         country  year  emissions_total  emissions_percap  gdp_pct
0                        Albania  1975         4338.334             1.804      NaN
1                        Albania  1985         6929.926             2.337      NaN
2                        Albania  1995         1848.549             0.580      NaN
3                        Albania  2005         3825.184             1.270      NaN
4                        Albania  2010         3930.295             1.349      NaN
...                          ...   ...              ...               ...      ...
1276                     Uruguay  2011              NaN               NaN   0.3487
1277                  Uzbekistan  2018              NaN               NaN   0.1298
1278  Venezuela (Boliv. Rep. of)  2014              NaN               NaN   0.3371
1279                    Viet Nam  2002              NaN               NaN   0.1927
1280                      Zambia  2008              NaN               NaN   0.2782

[1281 rows x 5 columns]

[9] BONUS: After merging the datasets, there is some missing data. How many NaNs are present in each column of the merged dataset?

Solution

(
    co2
    .merge(rnd, how='outer', on=['country', 'year'])
    .filter(['country', 'year', 'emissions_total', 'emissions_percap', 'gdp_pct'])
    .isnull().sum()
)
country               0
year                  0
emissions_total     215
emissions_percap    215
gdp_pct             737
dtype: int64

[10] You might have noticed that we don’t have both CO2 data and R&D data for all years. Drop the rows in the merged dataset for which the CO2 or R&D values are missing. Save the result to a data frame called co2_rnd. HINT: Search the internet for the pandas method dropna() to help you here.

Solution

co2_rnd = (
    co2
    .merge(rnd, how='outer', on=['country', 'year'])
    .filter(['country', 'year', 'emissions_total', 'emissions_percap', 'gdp_pct'])
    .dropna()
)

print(co2_rnd)
                         country  year  emissions_total  emissions_percap  gdp_pct
11                       Algeria  2005        77474.130             2.327   0.0660
15                       Algeria  2017       130493.653             3.158   0.5424
22                        Angola  2016        21458.342             0.745   0.0323
27                     Argentina  2005       149476.040             3.819   0.4207
28                     Argentina  2010       173768.538             4.215   0.5610
...                          ...   ...              ...               ...      ...
1029  Venezuela (Boliv. Rep. of)  2005       137701.548             5.141   0.1891
1030  Venezuela (Boliv. Rep. of)  2010       171468.892             5.907   0.1882
1039                    Viet Nam  2015       182588.799             1.951   0.4411
1041                    Viet Nam  2017       191243.601             2.002   0.5267
1053                      Zambia  2005         2120.692             0.176   0.0249

[331 rows x 5 columns]

[11] How many countries by year do you have after dropping the rows with missing values? HINT: You can use the groupby() method to help you out.

Solution

(
    co2_rnd
    .groupby('year')
    .agg({'country' : 'count'})
)
      country
year         
2005       83
2010       86
2015       94
2016       11
2017       57

Plotting with the CO2 and R&D datasets together

Back to top

[12] Plot R&D expenditure vs. CO2 emission per capita for each country using a scatter plot.

Solution

(
    so.Plot(co2_rnd, x='gdp_pct', y='emissions_percap')
    .add(so.Dots())
)

[Plot: R&D expenditure vs. CO2 emissions per capita]

[13] Next, facet the above plot by year.

Solution

(
    so.Plot(co2_rnd, x='gdp_pct', y='emissions_percap')
    .add(so.Dots())
    .facet('year', wrap=3)
)

[Plot: R&D expenditure vs. CO2 emissions per capita, faceted by year]

[14] Identify the countries that have five years of records for both CO2 emissions and R&D.

Solution

print(
    co2_rnd
    .groupby('country')
    .agg({'year' : 'count'})
    .query('year == 5')
)
            year
country         
Azerbaijan     5
Cuba           5
Panama         5

BONUS

[14] For the countries you identified, plot the percent of GDP spent on R&D and the per capita CO2 emissions over time on the same plot. Color the two different values differently.

Solution

(
    co2_rnd
    .query("country in ['Azerbaijan','Cuba','Panama']")
    .pipe(so.Plot, x='year')
    .add(so.Line(color='red', marker='o'), y='emissions_percap', label='CO2 per capita')
    .add(so.Line(marker='o'), y='gdp_pct', label='GDP %')
    .facet('country')
    .label(x="", y="Value")
)

[Plot: CO2 emissions per capita and R&D percent of GDP over time for Azerbaijan, Cuba, and Panama]

Bonus questions

Back to top

[15] For the R&D dataset, each country can have data for one or multiple years. What is the range of the numbers of yearly records for each country?

Solution

(
    rnd
    .groupby('country')
    .agg(year_count=('year', 'count'))
    .agg(['min', 'max'])
)
     year_count
min           1
max           8

[16] Continuing from the previous question: how many countries are there for each value within that range? (e.g., 10 countries have two different years of records and 20 have five different years)

Solution

(
    rnd
    .groupby('country')
    .agg(year_count=('year', 'count'))
    .groupby('year_count')
    .agg(country_count=('year_count', 'count'))
)
            country_count
year_count               
1                      22
2                      16
3                      19
4                      39
5                      37
6                      10
7                       4
8                       1

[17] Create a Jupyter Notebook with some of the information from these exercises. Decide exactly what you want to focus your notebook on, and then also perform additional analyses to include in your notebook. Also make sure your plots are legible and understandable!

Solution

Use the info from the Jupyter Notebook lesson to create a pretty notebook.

Key Points

  • Jupyter Notebook is an easy way to create a report that integrates text, code, and figures.

  • A Jupyter Notebook can be exported to HTML, PDF, and other formats.


Conclusion

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • What do I do after the workshop to apply what I learned and keep learning more?

  • Where can I learn more coding skills?

  • How do I deal with coding errors (i.e. debug)?

  • What resources are there at the University of Michigan?

  • What other coding concepts should I learn?

Objectives
  • Learn how to get help with code via the Internet and at the University of Michigan.

  • Learn about other coding concepts that would be good to learn in the future.

Where to go from here? Departing on your own coding journey

Learning and debugging throughout the data programming process.

We have come to the end of this workshop. You learned some basic procedures for importing, managing, visualizing and reporting your data.

As you continue on your coding journey, two things will happen:

  1. You will encounter bugs and need to figure out how to solve them (“debugging”), and
  2. You will want to learn new data processing and analysis techniques.

As we complete the course, we want to share with you some tips and tricks that have helped us on our own programming journeys.

Writing code at the University of Michigan

There are many local opportunities at the University of Michigan or around the Ann Arbor campus to find coding support, learn new programming skills, and connect with other users.

Get help and connect

Dealing with coding errors

Even well-seasoned coders run into bugs all the time. Here are some strategies programmers use to deal with coding errors:

Debugging code

If searching for your particular code problem hasn’t turned up a solution, you may have to do a bit of debugging. Debugging is the process of finding exactly what caused your error and changing only what is necessary to fix it. There are many strategies for debugging code. Consider checking out the following resources to learn more.
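
One simple but effective technique is to temporarily add print() calls or assert statements to check your assumptions at each step. A minimal sketch (the function and checks here are hypothetical):

import pandas as pd

def mean_life_exp(df):
    # print intermediate values to see where things go wrong
    print("columns:", df.columns.tolist())
    print("number of rows:", len(df))
    # fail early, with a clear message, if an assumption is violated
    assert "lifeExp" in df.columns, "expected a lifeExp column"
    return df["lifeExp"].mean()

gapminder = pd.read_csv("./data/gapminder_data.csv")
print(mean_life_exp(gapminder))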

Asking strangers for help

If you are unable to determine what’s wrong with your own code, the internet offers several possible ways to get help: asking questions on programming websites, interacting with developers on GitHub, chatting with other programmers on Slack, or reaching out on Twitter. If you’re intimidated by asking people on the internet, you can also reach out to people at the University of Michigan; you don’t have to do this all on your own. However, there are some important things to keep in mind when asking questions, whether to people on the internet or to people at the university, and a few simple tips can increase your chances of getting the support you need.

Learning new code

Free open-source programming languages such as Bash, Git and Python are constantly evolving. As you try out new data processing and analysis techniques, you will continue to learn new coding logic, concepts, functions, and libraries. Widely available user tools and documentation are a main benefit of free open-source software.

In the following, we list some strategies and resources we find useful. As you move forward, you are likely to identify other resources that fit your own learning style.

General

Cheat Sheets

A good collection of cheat sheets to print out and hang at your desk.

Free learning platforms available at U-M

Some important advanced coding concepts that you will want to learn if you continue coding a lot

There are some coding concepts that we did not have time to cover in this workshop, but that are important to learn as you continue on your journey and begin to perform more sophisticated data analysis projects. While we have not created resources for these topics, we provide some links to where you can learn more. Note that these are more advanced coding topics; you should become comfortable with what you learned in the workshop before trying to delve deeper into these other concepts. However, you’ll likely come across situations where one of these will be useful, and that’s when you should learn it!

We’ve provided some links below, but feel free to search for other explanations and tutorials as well.

Python coding topics

Some more advanced Python coding topics include:

Domain-specific analyses

We encourage you to investigate domain-specific libraries and software that will help you perform specific tasks related to your own research. The best way to find these libraries is to either ask other people in your field and/or search for specific tasks that you would like to perform. If you’d like to perform the task in Python, include that in your search (e.g. “find pairwise distances for DNA sequences in Python” will help you find the Python library biopython which has a number of tools for computational molecular biology in Python.)
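
For instance, once you have found and installed such a library, a quick first experiment might look like this (a minimal sketch using biopython, mentioned above; the sequences are made up):

# score a simple pairwise alignment between two short DNA sequences
from Bio import Align

aligner = Align.PairwiseAligner()
score = aligner.score("ACGTACGT", "ACGTACGA")
print(score)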

High-performance computing clusters

If you’re performing computationally-intensive analyses, you’ll likely want to use a high-performance computing cluster. At the University of Michigan, many of us work on Great Lakes for much of our research. It can be a bit overwhelming at first, so try to find someone to help you learn the ropes. Sometimes there are also workshops where you can learn more.

Git/GitHub

If you start using Git/GitHub more frequently, it’s useful to learn how to create branches to work on new parts of your analysis. When you’re confident that it works, you can then merge the contents of the branch back into your “main” branch.
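
A minimal sketch of that branching workflow (the branch name is just an example):

# create and switch to a new branch
git checkout -b new-analysis

# ...edit files, then commit as usual...
git add .
git commit -m "Try a new analysis"

# switch back to main and merge the branch in
git checkout main
git merge new-analysis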

Key Points

  • When it comes to trying to figure out how to code something, and debugging, Internet searching is your best friend.

  • There are several resources at the University of Michigan that you can take advantage of if you need help with your code.

  • We didn’t have time to cover all important coding concepts in this workshop, so definitely continue trying to learn more once you get comfortable with the material we covered.

  • There are often packages and tools that you can leverage to perform domain-specific analyses, so search for them!