Introduction

I’ve gotten a couple of questions about how to get started learning some of the more high-powered tools like R and python and how to know when to use these tools over something simple like Google Sheets. In most cases, I would recommend using Google Sheets when you can, but sometimes (as in the example below) it would be incredibly tedious to try to do the analysis by hand.

In this document, I work through an example for which producing the indicator by hand or in Google Sheets would be very difficult. I collect and clean the data for the Carbon Capture Indicator in the Land Category of the Sustainability Pillar.
This example is great to learn from because it covers most of the issues you might run into and uses a lot of the main tools you’d need to know to be able to manipulate and clean data to produce a useable indicator in R. I’ve written this tutorial to be accessible to anyone on the SSPI team, but if anything is unclear, please reach out to me so I can fix it.

I’ll probably also upload a regressions tutorial when I rerun the outcome variable regressions in a few days, so look out for that.

Crash Course on R Fundamentals

Computers store objects in memory and run programs which can manipulate those objects into new objects. Our goal is to learn some useful commands and workflows for effectively manipulating dataframe objects. The language we’ll be using to communicate which manipulations we’d like the computer to execute is called R.1

R is a programming language designed for data manipulation and analysis, and it is typically run in RStudio, an Integrated Development Environment (IDE) designed for data manipulation. RStudio will provide us a nice location to edit and run our code and to see the outputs of our manipulations in real time. To start up, you’ll download R and RStudio following the instructions at the end of the document.2

R files (files ending in .R) store lines of R code. Running such a file executes the code, performing whatever actions author of the file has told R to do.

R Markdown files (files ending in .rmd) are super nifty special files which contain chunks of code you can run like an R file along with writing used to communicate what’s going on in the code or the larger analysis.3 When you “knit” an .rmd file, the code in the chunks runs, producing whatever data manipulations, tables, and graphs you’ve instructed R to do, then the written document is compiled and returned. You’re actually reading the output of an R Markdown file right now. For the SSPI, .rmd files are great for communicating the idea of the kinds of more complex analysis we’d be using R for in a way that’s readable for people who aren’t used to writing code.

R contains objects and functions. Typical objects for us will be values, vectors, and dataframes:

We will call R functions on these different types of objects to achieve the results we’re looking for. Functions take in arguments (objects to act on) and return objects.

Below, we’ll see many examples of the types of objects described above and we’ll get to know many of the standard functions used to manipulate these objects.

One last note: R has a weird notation for variable assignment. Most languages would say x = 5 to make a variable x which has value 5. You can do this in R, but convention is to use the <- operator in R to assign variables. That is, instead of saying x = 5 in R, you’d say x <- 5. Weird, right?

Setting Up the Document

At the top of an R Markdown file, we set up the file we’re going to use. We call the special library function to import standard libraries used in R. A library is a collection of code and documentations written by someone for other users. This code defines some useful objects and functions we will use below. Calling library will set up this outside code so we can use it as a base atop which we will do our own work.

The last step is to set up our working directory. This is the folder on our computer in which we have our data stored. Every time you import data in R, you’ll have to remember to set your working directory. If ever you don’t know what your working directory is, you can call getwd() to figure it out. To set your own working directory, find the filepath to the folder with your data on your computer, and call the setwd function on a string containing the filepath.4

knitr::opts_chunk$set(echo = TRUE) # this line tells RMD to display the code
library(dplyr) # standardlibrary for manipulating data in R
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) # standard library for plotting data in R 
library(tidyr) # standard library we'll use to "spread" the data (see below)
library(rmarkdown) # a library for making pretty tables for the document
# set the working directory (the folder on my computer where i'm keeping the data and stuff)
setwd("/Users/tristanmisko/Documents/Berkeley/ECON/SSPI\ Research/Tutorials/2022-10-07\ R\ Data\ Manipulation\ Tutorial")

Importing the Data

To start, we import the list of SSPI countries so that we don’t have to type the name of each country by hand. I created a nifty little .csv file for this from the main data file online. You can easily create .csv file using File > Download > Comma Separated Values (.csv) on Google Sheets.

To import data, we call the read.csv function. The string "SSPI_countries.csv" is the filename of the data we want to import. The read.csv function takes the argument "SSPI_countries.csv" and returns a dataframe object containing the data from that file. We “pass” the dataframe returned by read.csv to the variable SSPI_countries using the assignment operator <-. Now, whenever we type SSPI_countries below, R knows we’re talking about the dataframe we pulled down.

#"<-" is R's funny notation for "=".  I think "=" also works but I'm an R purist
SSPI_countries <- read.csv("SSPI_countries.csv")
# look at the first few rows of SSPI_countries dataframe object 
paged_table(SSPI_countries)

Notice that the SSPI_countries dataframe has a lot of whitespace in it because I was sloppy in making the SSPI_countries.csv file. To get rid of that, we’re going to select the second and third column of the dataframe. We use the weird R notation c(2,3) to create (hence the c) the vector containing 2 and 3. For our purposes, a vector is just a list of values. Each column and row of a dataframe is a vector, and there are special ways to access each of those in R. More on that later.

The function select takes two arguments: our dataframe SSPI_countries and the vector containing 2 and 3. Calling select on the vector containing 2 and 3 tells R to select the data that is in the second and third column of our dataframe, and to return a dataframe containing only that data in a new dataframe.5 We then assign that returned value to the variable SSPI_countries.

If that seems a little weird at first, that’s because it is. Variable names behave differently on the left and right sides of the assignment operator <-. Remember: everything to the right is evaluated (i.e. all the calculations and function calls and returns) before anything on the left side. In the first line of code below, evaluating the right side means that we access the original dataframe SSPI_countries in memory (i.e. the one with all the whitespace), we do our selection (which returns a new dataframe without any of the whitespace), then we tell R to make SSPI_countries “point at” the clean new dataframe instead of the old one. Below this line, the variable SSPI_countries has in a sense “forgotten about” the old dataframe with all the whitespace—it’s no longer in the computer’s memory—and we only remember the clean new dataframe. We make use of this kind of operation (called “overwriting a variable”) constantly in this file, so it’s worth taking a second to make sure that’s clear before moving on.

# use the select function from the dplyr library to eliminate the whitespace
SSPI_countries <- select(SSPI_countries, c(2,3))
#see how we removed the whitespace
paged_table(SSPI_countries)

Now we’re going to pull in the data we need for the indicator. We use the nrow and ncol functions to see that the dataframe has 7329 rows and 14 columns. We view the names of the columns with colnames and decide that we really only need the columns "Area", "Year", and "Value". We then use our friend the select function to pick only these columns. One nice thing about select is that it can use indices (like c(2,3) above) or column names from our dataframe (like below), and R will figure out which we mean.
We save our selection over the original carbon dataframe and look at the result.

# import the data
carbon <- read.csv("FAOSTAT_data_en_9-28-2022.csv")
# check the dimensions
c(nrow(carbon), ncol(carbon))
## [1] 7329   14
# check column names to select the ones we're going to need
colnames(carbon)
##  [1] "Domain.Code"      "Domain"           "Area.Code"        "Area"            
##  [5] "Element.Code"     "Element"          "Item.Code"        "Item"            
##  [9] "Year.Code"        "Year"             "Unit"             "Value"           
## [13] "Flag"             "Flag.Description"
#select the right columns
carbon <- select(carbon, c("Area", "Year", "Value"))
paged_table(carbon)

We’ll also pull in the forest extent data using the same method. We can overwrite the colnames of the dataframe with whatever we want by reassigning colnames(forest) to a vector of the appropriate length containing our desired names. We can use select with a negative condition too: below we select the columns which are not (!) FL_2010, FL_2016, FL_2017, and FL_2019.

# import the forest data
forest <- read.csv("fra2020-extentOfForest.csv")
#look at the column names
colnames(forest)
##  [1] "X"                    "Forest...1000.ha.."   "Forest...1000.ha...1"
##  [4] "Forest...1000.ha...2" "Forest...1000.ha...3" "Forest...1000.ha...4"
##  [7] "Forest...1000.ha...5" "Forest...1000.ha...6" "Forest...1000.ha...7"
## [10] "Forest...1000.ha...8"
#rename them looking at the online database so we can tell what we're talking aobut
colnames(forest) <- c("Area", "FL_1990", "FL_2000", "FL_2010", "FL_2015", "FL_2016", "FL_2017", "FL_2018", "FL_2019", "FL_2020")
# select only the potentially relevant years
forest <- forest %>% select(!c("FL_2010", "FL_2016", "FL_2017", "FL_2019"))
paged_table(forest)

Carbon Stock Data

Cleaning and Filtering the Data

Next up, we’re going to filter down the carbon data to only the SSPI countries using the SSPI_countries dataframe we imported earlier. To achieve this, we use the aptly named function filter from the dplyr library. The filter function requires a logical vector to work, i.e. for each row in the data it needs to know whether that row is in (TRUE) or out (FALSE).

For our purposes, we want to know whether the Area value in the dataframe (which indicates the country name) is in our list of SSPI Countries. We use the special vector operator %in%, which checks the element on the left of %in% is contained in the vector on the right of %in%. We make the vector countries using the special $ notation for dataframes. The notation SSPI_countries$Country tells R to look into the SSPI_countries dataframe, grab the column Country, and stick the result into a vector, which we’ve chosen to call countries.
This vector has nrow(SSPI_countries)=49 entries, one for each row of the dataframe.

There are a couple of subtleties to watch out for when doing this with country names. Usually we prefer to use the three letter country codes standardized by the UN when we can because some countries have multiple names. In this data, there are no country codes, so we have to make do with names. The usual multiple name suspects in the SSPI are Korea, Rep. (aka. South Korea, Republic of Korea, or simply Korea), Russian Federation (often just listed as Russia), the United States (which sometimes goes by United States of America, US, and USA depending on the dataset), and Slovak Republic (aka. Slovakia). It’s always important to check that you’ve got all the countries once you’ve done a big filtering operation. To do this, we tell R to list out all of the unique Area names after filtering down to the countries whose names are contained in the vector countries.

# make a vector containing the SSPI_country Country names
countries <- SSPI_countries$Country
# list out the unique country names in the filtered datset
unique(filter(carbon, Area %in% countries)$Area)
##  [1] "Argentina"            "Australia"            "Austria"             
##  [4] "Belgium"              "Brazil"               "Canada"              
##  [7] "Chile"                "China"                "Colombia"            
## [10] "Denmark"              "Estonia"              "Finland"             
## [13] "France"               "Germany"              "Greece"              
## [16] "Hungary"              "Iceland"              "India"               
## [19] "Indonesia"            "Ireland"              "Israel"              
## [22] "Italy"                "Japan"                "Latvia"              
## [25] "Lithuania"            "Luxembourg"           "Mexico"              
## [28] "Netherlands"          "New Zealand"          "Norway"              
## [31] "Poland"               "Portugal"             "Russian Federation"  
## [34] "Saudi Arabia"         "Singapore"            "Slovenia"            
## [37] "South Africa"         "Spain"                "Sweden"              
## [40] "Switzerland"          "United Arab Emirates" "Uruguay"

We can see that we’re missing seven countries. As we find below, we have the usual suspects and a few others have alternate spellings. For these, we look through the data and see how they a