Preface

This document provides an introduction to producing high quality, customizable maps entirely within the R language. R has been used for mapping/GIS work for quite some time, but in recent years there have been numerous developments within the R community that streamline, simplify, and improve the geospatial capabilities of the language. With the improved analytic tools came improvements to cartography, which is what will be covered here. As a forewarning: as of the time of writing, some components discussed here are still in development and will go through changes. However, the foundations - which are what this tutorial is about - are unlikely to change significantly.

As a final note, you do not need to “know R” to get something out of this tutorial. Likewise, those already familiar with R will not have to learn an entirely new syntax just to work with geospatial data. One of the great things about these new developments is that mapping and other geospatial operations now use the same tidy/dplyr syntax you're already used to (which is also very clean and intuitive for new users to pick up). If you're familiar with the basic concepts of coding in any language, you'll do just fine.

Why use R for mapping?

There are a number of reasons for researchers to do mapping in R. The first is that it is a natural extension for someone already doing statistical work in R: surprisingly little effort/code is needed to transform existing data structures into map-ready data frames. Speaking of data frames, mapping in R inherits one of R's biggest strengths: keeping clean/tidy data in a data frame structure. If you're unfamiliar with the concept of data frames (where rows are observations and columns are attributes/values/features), don't worry, there are plenty of examples coming up.

The other main draw of a command-line/programming approach to mapping is reproducibility. GUI-based software such as ArcGIS and QGIS, while extremely powerful, does not offer easy ways for users to share and collaborate on projects. Using software such as R allows users to share and replicate code, so it's easy to receive feedback and make changes, and future users do not have to manually redo lots of operations just to tweak the output. The open-source, collaborative nature of R is one of the reasons it is lauded, and using R for mapping is no different. Let's get into it.

The Data Frame

This section is an introduction to using data frames in R. If you’re familiar with them already, feel free to skip.

As mentioned in the section above, the data frame is an extremely tidy and clean way of storing data for statistical operations. The concept is very simple: each row is an observation, and each column is an attribute about that observation. Let’s look at an example:

name <- c('Max', 'Paul', 'Jillian', 'Isa')
relationship <- c('Self', 'Dad', 'Mom', 'Dog')
age <- c(24, 62, 60, 9)
sex <- c('M', 'M', 'F', 'M')
right_handed <- c(TRUE, TRUE, FALSE, NA)
df <- data.frame(name, relationship, age, sex, right_handed, stringsAsFactors = FALSE)
head(df)
##      name relationship age sex right_handed
## 1     Max         Self  24   M         TRUE
## 2    Paul          Dad  62   M         TRUE
## 3 Jillian          Mom  60   F        FALSE
## 4     Isa          Dog   9   M           NA

For those new to R, the <- syntax is the assignment operator (usually = in other languages), and c(x,y) is the syntax to create a vector of multiple values.

As you can see, this data frame contains information about my family members. Each member is a row, while each column contains some piece of information about that member. Data frames can contain variables of any data type, as long as each column contains only one type (you can't have a column that holds both ints and strings, for example). The four main data types covered in this tutorial are

  • Strings (Characters)
  • Numbers (Floats/ints/doubles etc)
  • Boolean (True/False)
  • Factors

Each of those data types is shown in the code above. Names, relationships, and sexes are all input as strings (noted by the quotation marks around the text). Age is input as a number, specifically the data type double. Right handed is input as a Boolean, represented in R as all-caps TRUE or FALSE (or the shorthand T or F). In addition, this is the first appearance of the dreaded NA, or missing value. NAs can cause unexpected or unwanted results in many R operations, so it's always good to be aware of when and where they are present in your data. Many R functions accept a very handy na.rm = TRUE argument to remove NAs before making calculations, so keep that in mind if you're getting odd results due to NAs.
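For instance, here is a minimal illustration (a small stand-alone vector, not part of the family data above) of how a single NA propagates through a calculation and how na.rm = TRUE handles it:

ages_with_na <- c(24, 62, 60, NA)
mean(ages_with_na)               # returns NA because one value is missing
mean(ages_with_na, na.rm = TRUE) # drops the NA first and returns roughly 48.67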


The final data type worth talking about is the factor. Factors cause a lot of confusion for people new to R who don't know the traps associated with them. By definition, factors represent variables that are categorical; that is, they take on a set of fixed, non-continuous values. By default, R reads strings into data frames as factors, which is rarely what the user wants. For example, reading in the data without the stringsAsFactors argument (which defaults to TRUE), df <- data.frame(name, relationship, age, sex, right_handed), yields the following data frame (viewed in the upper right of RStudio):

This doesn't make any sense, since name and relationship are not categorical (sex, however, is, since it takes on two values: M and F). The way we get around this is by including stringsAsFactors = FALSE in the data frame construction, which keeps all 3 string variables as plain character columns rather than factors. To get sex back to a factor, we run:

df$sex<-as.factor(df$sex)

which overwrites the sex column of the data frame with the factor conversion. We can verify this worked by looking at the environment window:

or verifying with:

class(df$sex)
## [1] "factor"

One final thing worth mentioning is that you can easily read in data from your favorite file type directly into a data frame. For example:

new_df<-read.csv('src/data/example_data.csv', stringsAsFactors = F)
new_df$eye_color<-as.factor(new_df$eye_color) 
new_df
##      name location num_legs eye_color
## 1     Max   Boston        2     brown
## 2    Paul   Boston        2     brown
## 3 Jillian New York        2     green
## 4     Isa New York        4      blue

Geometrical Data - sf package

There are two data structures commonly used for geometric/spatial data in R: sp and sf. The short version is that sp structures were created back in 2005 and have received tremendous support over the past 13 years; however, their data structures and methods are archaic compared to more modern techniques. sf is a more recently developed package that provides data frame structures for geometric data, allowing the user to apply familiar R operations to geographic data frames. External support for sf is growing constantly, and it will eventually supplant sp as the data structure of choice. Hence, everything in this tutorial uses sf. You can install it into R via the code below. Also note that I use a lot of other packages throughout the tutorial. If you ever see a library(word), that means I am loading a package. Every package I use can be installed with the same code as below, just changing 'sf' to the package name.

install.packages('sf')
library(sf)

Before we start mapping, it’s important to know the different types of geometric data. In this basic tutorial, I’m only going to cover the big 3:

  • Points
  • Lines
  • Polygons

but there are many more. Points represent specific locations in space (like a city or a landmark), lines are one-dimensional, ordered collections of points with a start and an end (like a river or coastline), and polygons are 2D shapes that are fully enclosed (the start and end point are the same; think state boundaries). When you're thinking about creating a data-driven map, it is important to consider how your data fits into these 3 objects. If you're making a map of rivers in a state, you're going to need a polygon for the state boundary and lines for the rivers. If you want to include houses along the rivers, those would be points. A quick sketch of how each type is built is shown below.
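To make those three types concrete, here is a minimal sketch (the coordinates are arbitrary and purely for illustration) of how each one can be built by hand with sf's constructor functions:

library(sf)

# A point: a single x/y location
pt <- st_point(c(-71.06, 42.36))

# A line: an ordered matrix of x/y coordinates with a start and an end
ln <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))

# A polygon: a list of rings; the outer ring must close (first and last coordinates match)
pg <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))

st_geometry_type(pt)   # reports POINT; ln and pg report LINESTRING and POLYGON

In practice you will rarely build geometries by hand like this - they usually come from a file or a package - but it shows what each type is made of.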

One final point to make about points/lines/polygons is that you might encounter multipoint/multiline/multipolygon data types. These are used when a single row contains more than one point/line/polygon. For example, states such as Alaska and Massachusetts have islands, so their state boundaries are represented by multiple polygons, one for each land body.


Time to get working with some examples. I'm going to use data that's publicly available in the spData package (don't worry, its datasets are updated for sf).

install.packages('spData')
library(spData)

After it installs, grab the us_states polygon set from it with

usa <- spData::us_states
colnames(usa)
## [1] "GEOID"        "NAME"         "REGION"       "AREA"        
## [5] "total_pop_10" "total_pop_15" "geometry"

We can see this data frame has 7 columns. The first 6 describe some things about the state: name, area, population, etc. The last column is the important one. Let's take a look at the first observation in our data frame:

GEOID NAME    REGION AREA          total_pop_10 total_pop_15 geometry
01    Alabama South  133709.3 km^2 4712651      4830620      list(list(c(-88.200064, -88.202959, -87.428613,
  -86.862147, -85.605165, -85.470472, -85.304489, -85.1844, -85.122189, ..., 34.029199, 34.892202, 34.995634)))

It's got quite a lot to say. The numbers in the geometry column are data describing the border of Alabama. These are not necessarily coordinates (more on that later), but rather numbers that tell sf exactly where this observation sits in space. This geometry column is the core component that sets sf/spatial data structures apart from regular data frames. You can think of an sf object as a combination of a regular data frame (the non-geometry columns) plus the geometry column. You'll see in the geocoding section how to build an sf object from your own data frame.
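You can see this dual nature directly in the object's class, which for this dataset should report something along the lines of:

class(usa)
## [1] "sf"         "data.frame"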

Speaking of this geometry column, how can we tell what type of geometry it actually is? You can probably guess which of point/line/polygon the US states would be, but we can see for sure by calling the geometry column like so (I only check the first observation for brevity's sake):

head(usa$geometry,1)
## Geometry set for 1 feature 
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.47323 ymin: 30.24971 xmax: -84.89184 ymax: 35.00803
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## MULTIPOLYGON (((-88.20006 34.99563, -88.20296 3...

Keep this call in the back of your mind, because it's a great way to see a good amount of information about the geometry. We can see it is a multipolygon, is 2-dimensional, and has a bbox (bounding box) and a projection/EPSG code.
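If you only want one of those pieces, sf also exposes individual accessor functions; for example (a quick sketch, not run above, that pulls out the same information):

st_bbox(usa)       # the bounding box of the whole layer
st_crs(usa)$epsg   # should return 4269, matching the printout above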


Projections

Turning a 3D, sphere-like object such as the Earth into a 2D representation is very hard. There are a number of different ways to project geometries onto a flat surface, each with its own pros and cons. For example:

There are many, many more ways to project the world. Moreover, the Earth is not a perfect sphere, so different regions of the world often use their own projections. Thus there is no 'one projection fits all', and you have to evaluate which projection to use depending on where in the world you're mapping. A very useful tool for this is Projection Wizard, which allows you to select an area of the world and returns a number of projection strings for you to use.

Generally for the USA, you will want to use one of two projections:

The projection you use should be based on the type and location of your analysis. Some projections are designed to preserve area, some to preserve distance, and some to preserve shape. No single projection preserves everything, so you have to think about what you're trying to represent with your map and choose a projection that fits it. For example, let's say I give every county a score pulled from a normal distribution and want to show the area covered by counties that scored above 1. A map of that might look like this:

However, this is not a good choice of projection. Since what we're trying to show is area (rather than % of area), it doesn't make sense to use the Mercator projection, which does not preserve area and instead stretches polygons out toward the poles. States closer to the pole thus appear bigger than they actually are, which misleads the audience. For something like this, we would want to use an equal-area projection, such as the Albers:

While this example was about area, a similar thought process applies to things like distance and shape. If you're plotting distances between Heathrow Airport and other airports in the EU, you would want a projection that keeps those distances correct, perhaps slightly at the expense of less accurate shapes/areas. CUNY has a good resource/explanation of various projection types and their usage.


Transforming projections

Transforming data between projections is incredibly easy using the st_transform function, but first let's talk about how projections are represented. The most widely used system for documenting projections is Proj4, which uses character strings detailing the type of projection and its parameters. Remember the Coordinate Reference System (CRS) we saw when looking at the geometry column? We can check it out again using st_crs:

st_crs(usa)
## Coordinate Reference System:
##   EPSG: 4269 
##   proj4string: "+proj=longlat +datum=NAD83 +no_defs"

There are a lot of complicated arguments that can go into a proj4string, but this case is pretty simple: it is a plain longitude/latitude (unprojected) CRS using the NAD83 datum. The EPSG code is a shorthand way to reference a particular projection, so inputting 4269 as the EPSG code would return this exact same CRS. Just remember that it is easy to go from EPSG to proj4 but not the other way around, as there can be more information embedded in the strings.
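You can check this equivalence yourself by asking sf what a bare EPSG code expands to; it should print the same CRS we just saw:

st_crs(4269)
## Coordinate Reference System:
##   EPSG: 4269 
##   proj4string: "+proj=longlat +datum=NAD83 +no_defs"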

Projections can be a bit complicated when it comes to the USA, however. Since two states are geographically distant from the main 48, it can be challenging to project all 50 states without having to stretch the map just to include Hawaii. Because of this, there are some convenience tools in R that shift AK and HI to roughly where Mexico normally is. Since this is an exception to a standard workflow, I'll go through an example of how to transform projections and do basic plotting with New England, then I'll show the USA tricks.

To start, I'm going to use the geometry file of the continental US from spData; however, it's very easy to load your own geometries from file types such as shapefiles or GeoJSON with the st_read() function. Shapefiles are one of the most common ways to store geometric data, and it's possible to download, open, and load the data all within the R console. Since the USA is a popular country to plot, though, it is readily available in existing packages. Once I have it loaded, I can easily filter it down to just the New England states (filter comes from dplyr, loaded below), then take a quick look at its geometry with the plot function:

 library(dplyr)  # filter(), left_join(), select(), and %>% used throughout come from dplyr
 ne <- filter(usa, NAME %in% c('Maine','New Hampshire','Vermont','Massachusetts',
                               'Connecticut','Rhode Island'))
 plot(ne$geometry)
 title(main = 'Mercator')
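As an aside, if your geometries lived on disk instead of in a package, the equivalent first step would look something like the sketch below (the file path and object name are placeholders, not files that ship with this tutorial):

# Hypothetical example: read a shapefile (or GeoJSON) from disk into an sf data frame
my_shapes <- st_read('path/to/my_boundaries.shp')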

To transform to a different projection, all you need to do is call st_transform with an EPSG code or a proj4 string. Here's New England with an equal-area projection (the proj4 string below specifies a Lambert azimuthal equal-area):

 ne_trans <- st_transform(ne, crs='+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs')
 plot(ne_trans$geometry)
 title(main = 'Equi-area')


Plotting the USA

As mentioned above, Alaska and Hawaii are a PITA for plotting, because you often get something that looks like this:

# alaska and hawaii are separate sf objects that also ship with spData
full_us <- rbind(us_states,
                 st_transform(alaska, st_crs(us_states)),
                 st_transform(hawaii, st_crs(us_states)))
plot(full_us$geometry)

Not only is the graphic stretched due to AK and HI, the projection makes Alaska look about half the size of the rest of the US. There is no painless fix for this with your own shapefiles, but luckily someone has done all the work for us in the albersusa package:

library(albersusa)
alb_usa <- albersusa::usa_sf('laea')
plot(alb_usa$geometry)
title(main = 'Albers')

This is the same Lambert azimuthal equal-area ('laea') projection I demoed before. The usa_sf function from the package lets you specify a projection from c("longlat", "laea", "lcc", "eqdc", "aeqd"), and that's it! I'd recommend the package for any kind of mapping of the United States as a whole.
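If you want a different look, just swap in one of the other codes from that list; for example, a quick sketch (not one of the figures shown here) using the Lambert conformal conic option:

plot(albersusa::usa_sf('lcc')$geometry)
title(main = 'Lambert conformal conic')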

Creating your own spatial points

I'm not going to cover how to make your own custom polygons/lines, as it's a task few people need to do. What a lot of people do need, however, is to plot a bunch of points on an existing geometry. Often these points don't come with coordinates, but instead with addresses or city names. I'm going to grab an example file here:

loc_data <- read.csv('src/data/location_data.csv')
head(loc_data)
##             city state count
## 1       Abington    PA    20
## 2         Berlin    NJ    17
## 3 Haddon Heights    NJ    10
## 4        Allegan    MI    20
## 5       Amarillo    TX    38
## 6     Enterprise    MS     2

Here we have a bunch of cities, states, and some count metric. We want to plot these cities on a map, so they need to be converted to a 'point' geometry object. Let's start by getting the coordinates through geocoding, using the tmaptools package.

suppressMessages(library(tmaptools))
output = tmaptools::geocode_OSM('Boston,MA')
output
## $query
## [1] "Boston,MA"
## 
## $coords
##         x         y 
## -71.05957  42.36048 
## 
## $bbox
##         min       max
## x -71.19126 -70.80449
## y  42.22765  42.39698

What this code does is query OpenStreetMap with the given location string, then read back information about that location. Other packages/functions can be used to query other geocoding services, the most popular being Google's.

Now we're going to use this method to convert our entire loc_data data frame. First we need to combine the city and state columns with a paste function. Then we can feed that entire column into geocode_OSM, and it will return a data frame with the lons/lats:

loc_data$loc <- paste(loc_data$city, loc_data$state, sep = ',')
output<-tmaptools::geocode_OSM(loc_data$loc)
head(output)
##   X             query      lat        lon  lat_min  lat_max    lon_min
## 1 1       Abington,PA 40.12067  -75.11795 40.08067 40.16067  -75.15795
## 2 2         Berlin,NJ 39.79123  -74.92905 39.77601 39.80915  -74.96611
## 3 3 Haddon Heights,NJ 39.87734  -75.06462 39.86910 39.88988  -75.08548
## 4 4        Allegan,MI 42.58672  -85.88615 42.41894 42.76893  -86.27441
## 5 5       Amarillo,TX 35.20722 -101.83382 35.10946 35.29447 -101.95620
## 6 6     Enterprise,MS 34.46705  -89.15729 34.44705 34.48705  -89.17729
##      lon_max
## 1  -75.07795
## 2  -74.90796
## 3  -75.04603
## 4  -85.54305
## 5 -101.65368
## 6  -89.13729

This is the output of the function. As you can see, it returns a bunch of info, although we're really only after the lon/lat. So now let's merge the datasets together by doing a left_join between the original loc_data and this new output, then drop the columns we don't want.

loc_data <- left_join(loc_data, output, by = c('loc' = 'query'))

# Select all columns that do not end with '_min' or '_max'
loc_data <- loc_data %>% 
            select(-ends_with('_min')) %>%
            select(-ends_with('_max')) 

head(loc_data)
##   X           city state count               loc      lat        lon
## 1 1       Abington    PA    20       Abington,PA 40.12067  -75.11795
## 2 2         Berlin    NJ    17         Berlin,NJ 39.79123  -74.92905
## 3 3 Haddon Heights    NJ    10 Haddon Heights,NJ 39.87734  -75.06462
## 4 4        Allegan    MI    20        Allegan,MI 42.58672  -85.88615
## 5 5       Amarillo    TX    38       Amarillo,TX 35.20722 -101.83382
## 6 6     Enterprise    MS     2     Enterprise,MS 34.46705  -89.15729

Almost done! Now we just have to convert our data frame into an sf object using the coordinate columns. How hard can it be?

loc_geo <- st_as_sf(loc_data[complete.cases(loc_data),], coords = c('lon','lat'), crs = 4326)
plot(loc_geo$geometry)

Woohoo! There are our points. Let's combine them with our map:

plot(alb_usa$geometry)
plot(loc_geo$geometry, add = T)

Hmm, something’s not right. We see one point in the middle of South Dakota, but none of the others. Let’s investigate the most common cause of things not aligning…

st_crs(alb_usa)
## Coordinate Reference System:
##   No EPSG code
##   proj4string: "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"
st_crs(loc_geo)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"

Aha! Remember, when you're plotting things together, they MUST have the same CRS. That's an easy fix with our transform knowledge: here I just give the points the CRS of alb_usa, so I know for sure that the two layers have the exact same CRS.

loc_geo = st_transform(loc_geo, st_crs(alb_usa))
plot(alb_usa$geometry)
plot(loc_geo$geometry, add = T)


Good/High Quality Maps

Whew, congrats on getting this far. Now that you have a basic intro to geometries, conversions, coordinates, and more, it's time to do some real mapping. The native plot functions used in the previous sections are good for quick, easy visualizations, but we must turn to external packages for higher quality plotting. If you ask any R user what their favorite package for visualizations is, they will say ggplot2 100% of the time. Unfortunately, at the time of writing, ggplot's sf module, geom_sf, is still in development. You can use it by installing the GitHub/development version of ggplot2, but in this tutorial I'm going to present an alternative package called tmap. tmap does a very good ggplot impersonation, and ggplot veterans should be able to pick it up very quickly.

There are two main types of plots I'll cover: points and choropleths. Point-based maps would be something like the map at the end of the previous section, but with the points colored/sized by some value. Choropleths color each entire polygon based on some categorical or continuous scale. If you're a bit confused by that wording, don't worry, because you're about to plot one yourself.

Let's take a look at our USA data frame and see if there's any sample data to plot:

head(alb_usa)
## Simple feature collection with 6 features and 13 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -2031905 ymin: -1470717 xmax: 2295505 ymax: 67481.2
## epsg (SRID):    NA
## proj4string:    +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
##        geo_id fips_state                 name lsad census_area iso_3166_2
## 1 0400000US04         04              Arizona       113594.084         AZ
## 2 0400000US05         05             Arkansas        52035.477         AR
## 3 0400000US06         06           California       155779.220         CA
## 4 0400000US08         08             Colorado       103641.888         CO
## 5 0400000US09         09          Connecticut         4842.355         CT
## 6 0400000US11         11 District of Columbia           61.048         DC
##     census pop_estimataes_base pop_2010 pop_2011 pop_2012 pop_2013
## 1  6392017             6392310  6411999  6472867  6556236  6634997
## 2  2915918             2915958  2922297  2938430  2949300  2958765
## 3 37253956            37254503 37336011 37701901 38062780 38431393
## 4  5029196             5029324  5048575  5119661  5191709  5272086
## 5  3574097             3574096  3579345  3590537  3594362  3599341
## 6   601723              601767   605210   620427   635040   649111
##   pop_2014                       geometry
## 1  6731484 MULTIPOLYGON (((-1111066 -8...
## 2  2966369 MULTIPOLYGON (((557903.1 -1...
## 3 38802500 MULTIPOLYGON (((-1853480 -9...
## 4  5355866 MULTIPOLYGON (((-613452.9 -...
## 5  3596677 MULTIPOLYGON (((2226838 519...
## 6   658893 MULTIPOLYGON (((1960720 -41...

We've got some population and some area data here to work with. Since California/TX/NY are likely going to be outliers in raw counts, let's make a new variable that's the percentage change in population from 2013 to 2014 - something a bit more comparable between states.

alb_usa$popdelt <- round((alb_usa$pop_2014 / alb_usa$pop_2013 - 1) * 100, 2)
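As a quick sanity check before plotting (output not reproduced here), you can peek at the new variable's range; per the legend later on, the values run from roughly -0.5 to 2.5:

summary(alb_usa$popdelt)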

Perfect, let's get tmapping.

library(tmap)
tm_shape(alb_usa) +
  tm_polygons() 

Similar to ggplot, the initial call in tmap is the tm_shape function, which takes whatever spatial object your data is in. Don't worry if the data is split into multiple objects (for example, polygons in one and points in another); you can easily add multiple sets of data to a single map. From there, you can chain on various options to customize your plot: sizes, colors, the world's your oyster. As an example, let's throw in the cities from the loc_geo data:

tm_shape(alb_usa) + 
  tm_polygons() +
  tm_shape(loc_geo) +
  tm_bubbles()

It's all just one long chained expression. Now let's start making it pretty. The variable we created earlier shows population change from '13 to '14. That belongs to the polygon dataset, so we're going to customize tm_polygons by adding a col argument telling tmap to color the polygons by that variable.

tm_shape(alb_usa) + 
  tm_polygons(col = 'popdelt')

And you're done! Well, not completely, but that simple addition gave the map a lot of life. Let's add some more cool stuff (I would HIGHLY encourage you to check out ?tm_layout, as there are a TON of options that will likely suit whatever your needs are):

tm_shape(alb_usa) + 
  tm_polygons(col = 'popdelt', title = '% Change') +
  tm_layout(inner.margins = c(.03,.03,.03,.03),
            main.title = "% Change in Population '13-'14",
            main.title.position = 'center',
            bg.color = "gray95",
            legend.position = c('right','bottom'),
            legend.frame = T) 


The next thing you might be asking about is colors. The biggest piece of advice I will give you is to never try to make your own color palette. Trust me, it sounds easy, and you might think that because you match your belt and shoes every day you have enough experience. However, picking the right colors to represent your data is like finding a needle in a haystack (there are a LOT of colors out there and most of them are garbage). Instead, why not let other people do the work for you? I recommend using ColorBrewer. You can find the website at colorbrewer2.org, but don't click that link just yet. There's an even easier way (assuming you work in RStudio). Behold:

library(shinyjs)
library(tmaptools)
tmaptools::palette_explorer()

This is an amazing tool for picking colors in the family you want. Say, for instance, I want my states to be shades of orange instead of green. Also, instead of having 6 divisions like I currently have, I want more. tmap is very smart in that it looks at the values of the data you're plotting and automatically chooses a clean set of breaks. You can override this with the breaks = parameter, but I would highly advise against it.

Back to the colors. Let's think: is my data sequential, categorical, or diverging? Well, my numbers are a continuous range from what looks like -.5 to 2.5, so we'll be using a sequential palette. I set the slider to 10 and click on 'tmap layer function code'. The colors I want are labeled OrRd, so all I have to do is add the palette and n parameters to my tm_polygons function.

tm_shape(alb_usa) + 
  tm_polygons(col = 'popdelt', title = '% Change', palette = 'OrRd', n=10) +
  tm_layout(inner.margins = c(.03,.03,.03,.03),
            main.title = "% Change in Population '13-'14",
            main.title.position = 'center',
            bg.color = "gray95",
            legend.position = c('right','bottom'),
            legend.frame = T) 

And voila! Your very own color palette! Perhaps it's time to add our points back on? I'm also going to set the colors back to 6 divisions, if that's okay with you.

tm_shape(alb_usa) + 
  tm_polygons(col = 'popdelt', title = '% Change', palette = 'OrRd', n=6) +
  tm_shape(loc_geo) + 
  tm_bubbles(col = '#b3de69', size = .30) +
  tm_layout(inner.margins = c(.03,.03,.03,.03),
            main.title = "% Change in Population '13-'14",
            main.title.position = 'center',
            bg.color = "gray95",
            legend.position = c('right','bottom'),
            legend.frame = T) 

Cool! This is a great (near) final product; it just needs a label for the points. The first legend was very easy; is the second one as well? The answer is generally yes, but here unfortunately not (at least not that I'm aware of). You see, tmap automatically creates its fantastic legends based on some kind of variation in the data. Recall that for our states polygons, this variation came from the col property and the popdelt variable. With our points, there is no variation; they are just notable cities. The way we get around this is by assigning our label as a variable to every city, then telling tmap to color by that label variable. Since there is only one value, all the points will be the same color.

loc_geo$label <- "Max's Cities"
tm_shape(alb_usa) + 
  tm_polygons(col = 'popdelt', title = '% Change', palette = 'OrRd', n=6) +
  tm_shape(loc_geo) + 
  tm_bubbles(col = 'label', palette = '#b3de69', size = .25, title.col='') +
  tm_layout(inner.margins = c(.03,.03,.03,.03),
            main.title = "% Change in Population '13-'14",
            main.title.position = 'center',
            bg.color = "gray95",
            legend.position = c('right','bottom'),
            legend.frame = T
            )

A bit of work, but no big deal. An important thing to note here is that when you're plotting bubbles, the title parameter changes a bit. Since bubbles can vary by both color (as in this example, even though there's only one color) and size, the legend needs to know which kind of variation you're labeling. Here, since we're varying by color, we specify the legend title using title.col =, which I leave empty since the label we put in the data frame is descriptive enough.

To vary the size of the bubbles, we'll need some kind of metric. My cities also had a count column, if I recall correctly; let's take a quick look at its distribution:

plot(loc_geo$count)

Looks like the majority of values are under 150-200, with some outliers. Let's trim the dataset, then scale the city bubbles by this count metric:

loc_geo_sm <- filter(loc_geo, count <= 200)
tm_shape(alb_usa) + 
  tm_polygons(col = 'popdelt', title = '% Change', palette = 'OrRd', n=6) +
  tm_shape(loc_geo_sm) + 
  tm_bubbles(size = 'count', col = '#b3de69', title.size = 'Count') +
  tm_layout(inner.margins = c(.03,.03,.03,.03),
            main.title = "% Change in Population '13-'14",
            main.title.position = 'center',
            bg.color = "gray95",
            legend.position = c('right','bottom'),
            legend.frame = T
            )

And now we have a final, publication-quality map! If you're concerned about Florida being covered by the legend, don't fret; there are a number of ways to deal with that. You could move the legend outside the plot box, or convert it to horizontal and place it below FL/TX. However, a simple solution is to just change the right inner margin from .03 to a slightly higher value (shown below is .067).
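For reference, that tweak is just the last element of inner.margins (the order is bottom, left, top, right), so the call is the same as before with one value changed:

tm_shape(alb_usa) + 
  tm_polygons(col = 'popdelt', title = '% Change', palette = 'OrRd', n=6) +
  tm_shape(loc_geo_sm) + 
  tm_bubbles(size = 'count', col = '#b3de69', title.size = 'Count') +
  tm_layout(inner.margins = c(.03,.03,.03,.067),   # extra room on the right keeps FL clear of the legend
            main.title = "% Change in Population '13-'14",
            main.title.position = 'center',
            bg.color = "gray95",
            legend.position = c('right','bottom'),
            legend.frame = T
            )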


Odds and Ends

Here are some final, common geographic elements that map makers should at least consider when creating a visualization. You'll notice that in the previous section I excluded things like long/lat lines, scale bars, etc. This is because when using a modified projection that has Alaska and Hawaii tucked into the bottom left, such geographic features can be misleading. Ergo, for these examples, I'm going to filter Hawaii and Alaska out of my geometry file:

us_cont <- filter(alb_usa, !name %in% c('Alaska','Hawaii'))
tm_shape(us_cont) + 
  tm_polygons(col = 'popdelt', title = '% Change', palette = 'OrRd', n=6) +
  tm_layout(inner.margins = c(.03,.03,.03,.03),
          main.title = "% Change in Population '13-'14",
          main.title.position = 'center',
          bg.color = "gray95",
          legend.position = c('right','bottom'),
          legend.frame = T
          )

Adding scale bars and compasses is quick and easy. They are separate functions in tmap, so just add them in before the layout call. Of course, these are heavily customizable; check ?tm_compass and ?tm_scale_bar for more info.

tm_shape(us_cont) + 
  tm_polygons(col = 'popdelt', title = '% Change', palette = 'OrRd', n=6) +
  tm_compass(position = c("left", "bottom"), type = '4star') +
  tm_scale_bar(position = c("left", "bottom")) + 
  tm_layout(inner.margins = c(.03,.03,.03,.03),
          main.title = "% Change in Population '13-'14",
          main.title.position = 'center',
          bg.color = "gray95",
          legend.position = c('right','bottom'),
          legend.frame = T
          )

Finally, long/lat lines come from the tm_grid function. One important thing to note: layers are drawn in the order they are called. So in the previous example, calling the bubbles before the polygons would leave you unable to see them (they would be drawn underneath the states). Same thing here: if you're drawing grid lines, make sure you call them first. Also make sure the projection = argument of your grid call matches the projection of your polygons!

tm_shape(us_cont) + 
  tm_grid(projection = '+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0') +
  tm_polygons(col = 'popdelt', title = '% Change', palette = 'OrRd', n=6) +
  tm_compass(position = c("left", "bottom")) +
  tm_scale_bar(position = c("left", "bottom")) + 
  tm_layout(inner.margins = c(.03,.03,.03,.03),
            main.title = "% Change in Population '13-'14",
            main.title.position = 'center',
            bg.color = "gray95",
            legend.position = c('right','bottom'),
            legend.frame = T
  )