This is my second guide/tutorial for mapping in R, this one being more of an introduction to spatial data workflow and interactive mapping in leaflet. If you need an introduction to spatial data in R, or just a quick refresher, my introductory tutorial is located here.
Unlike static mapping where the end result is an image that can be inserted into anywhere (and can even be printed out), interactive maps sacrifice some of this flexibility in exchange for a cleaner and more informative user experience. Having mouseovers, on click effects, and filtration options allow users to customize your visualizations to their needs, as well as opening up new avenues for you to express your data.
In R, interactive visualizations usually take on one of two final forms: html or Shiny-based output. This means that you cannot ‘copy and paste’ an interactive map into a .docx or even a pdf. If you intend to use an interactive map for a report, a presentation, a product, make sure that that medium is a vessel that can handle the maps. Regardless of what the final output is, I highly recommend constructing your map in RMarkdown, since you’re likely going to have to use it to compile your output eventually.
HTML files are easy to construct and easy to distribute. Your map can be written in pure R code within an .rmd file and pandoc will have no issue converting it to HTML. HTML files are nice because that can be viewed on any computer, hosted on any server, and require little to no formatting if coming from an .rmd file. However, the catch is that the map cannot talk to any other elements in the file. This means that having custom radio buttons, drop down menus, specific on click functions, etc are impossible in native R. There are ways of manually injecting javascript code into your map, but it’s unintuitive, requires you to learn another language, and there are already good R methods out there in Shiny. Everything written in this document is all .html, so everything you read here can easily be replicated in portable file, but if you want more customizable interaction, Shiny is the next step up.
Shiny is an extension of R that allows users to build external applications that can be embedded into a wide variety of things. From “static” documents, to websites, to slideshows, the ceiling for Shiny is very high. Shiny allows users to add a wide range of “javascript-like” abilities to your map, such as event handlers, input options, and many more, in an intuitive syntax. The cost of this is that a shiny based app needs to be ‘run’, as opposed to HTML which can be ‘viewed’. This means that if you want to run the Shiny app, you need to have R installed on your computer. While this is not an issue for you the reader, keep in mind who the audience is for your map and whether or not they have R readily available.
This tutorial is constructed as ‘a day in the life of a relatively simple project’. The goal is to build a map showing all of the hospitals in Massachusetts, then construct a network between each hospital and its 5 closest competitors. While the final product at the bottom of this page is a nice, pretty map, the first half demonstrates some spatial data collection and transformation methods that could be useful for people who are unfamiliar with spatial data manipulation in R. I would recommend reading the entire thing, but if you just want to map, feel free to skip to the Beginning with Leaflet section.
Before beginning, I find it useful to plan out how it is I am going to get to my final result. For this project, I see it as:
The first step in our task is to get the locations of the hospitals. Wikipedia seems to have a good list of what we want. However if we are going to scrape the names, there are a couple of issues. First, the town names get in the way both locationally and the fact that they are coded as the same HTML element. Second, there are a number of hospitals that are listed as ‘Closed’, or that have labels that are not part of the name. While not impossible to scrape, it would be a total PITA. For this example, I decided to instead use the MA listing in US Hospital Finder. While I’m not entirely certain this is a complete list, it works well for a small project like this.
For the scraping component, I use the html_nodes() function in the rvest package. This allows for smooth, dplyr syntax to funnel through the html elements until I find the links that I want, and then the text that contains the hyperlinks.
site<-read_html('http://www.ushospitalfinder.com/hospitals-in/Massachusetts') # Reads in the html
names<-html_nodes(site, 'div')[3] %>% # Selects the div with the list of names
html_nodes('a') %>% # Selects all the links in the div
html_text() %>% # Selects the text out of each link
head(-1) # Removes the last element 'Submit a new hospital'
# Compiles the name list into a data frame
hospitals<-data.frame(names, stringsAsFactors = F)
Now that we have a column of names, we need to geocode up some latitudes and longitudes. In the last tutorial, I taught you how to geocode using Open Street Map. The perk to OSM is that it does not limit you to 2500 queries a day like google does. The con is that google’s api is much more detailed and can do many more things, such as travel time calculation, traffic congestion, etc. In this context, the ‘detail’ in google’s API is the fact that it can get a lng/lat from a name of a location (such as a hospital), while OSM requires an address. Thus for this tutorial, I’ll be using google to geocode the list of names we just created.
The most common way to do google geocoding is through the ggmap package. However, this route requires no api key and can cause issues over large networks, like those at offices. A development version that allows the user to input an API key is currently available, but until it gets put on CRAN, I recommend using the googleway package instead.
To geocode in google you need a google maps api key. This allows google to identify who’s using their server. A personal one is free to acquire, but limits you to 2500 queries a day. Since we only have 115 hospitals, that’s not a problem. Once you have your key, save it in a variable to be passed into the geocode function (my variable is called my_key but is not displayed in this tutorial for obvious reasons).
The next hurdle to overcome is the way to geocode all 116 names at once. OSM (and ggmap’s geocode function), both allow you to input an entire list of names and have the output be a data frame. Unfortunately, googleway::google_geocode only accepts strings of length 1 as inputs. This means we need to make a function that will take in a name as an input, do the geocoding, and then return the variables we’re after. From there, we can then use map, apply, or a for loop to call that function on every name in our hospitals dataframe.
#This function takes in a name of a hospital and returns a named list with its latitude, longitude, and address.
geo_func <- function (name){
#Calls the geocode function from googleway. Adds a 'MA' to the address since other hospitals might have the same name.
#This encourages google to find/prioritize Massachusetts hospitals.
ret<- google_geocode(paste0(name, ', MA'), key = my_key)
#Parses the returned object for the information we're after
lat <-ret$results$geometry$location$lat
lng <-ret$results$geometry$location$lng
address<-ret$results$formatted_address
#Creates a named list to be returned
return(list('lat' = lat, 'lng' = lng, 'address' = address))
}
Now we can map this function to our column of names. Nesting it within a mutate() we create a column called L where each cell is a list with the 3 elements returned from our geocode function. From there, we can then map through that list column to create three columns for each of the lng/lat/address. Finally, we drop the unneeded list column
hospitals<-hospitals %>%
mutate(L = map(names, geo_func)) %>%
mutate(lat = map(.$L, 'lat'), lng = map(.$L, 'lng'), address = map(.$L, 'address')) %>%
select(-L)
head(hospitals)
## names lat lng
## 1 Adcare Hospital of Worcester 42.27633 -71.79483
## 2 Clinton Hospital 42.42742 -71.69301
## 3 M I T Medical Department 42.36121, 42.36009 -71.08673, -71.09416
## 4 Saint Vincent Hospital 42.26498 -71.79642
## 5 Anna Jaques Hospital 42.8144 -70.89092
## 6 Cooley Dickinson Hospital 42.3305 -72.65313
## address
## 1 107 Lincoln St, Worcester, MA 01605, USA
## 2 201 Highland St, Clinton, MA 01510, USA
## 3 25 Carleton St, Cambridge, MA 02142, USA, 77 Massachusetts Ave, Cambridge, MA 02139, USA
## 4 123 Summer St, Worcester, MA 01608, USA
## 5 25 Highland Ave, Newburyport, MA 01950, USA
## 6 30 Locust St, Northampton, MA 01060, USA
Voila! We now have our data frame ready to be converted into a spatial object of points. Although MIT seems a bit weird… why is it of length 2?
filter(hospitals, names == 'M I T Medical Department')
## names lat lng
## 1 M I T Medical Department 42.36121, 42.36009 -71.08673, -71.09416
## address
## 1 25 Carleton St, Cambridge, MA 02142, USA, 77 Massachusetts Ave, Cambridge, MA 02139, USA
What happened here is that when we called the geocode for ‘M I T Medical Department, MA’, google returned multiple addresses. However, we only want to use 1 for each location. For actual research, this should be evaluated on a case by case basis but for the ease of this tutorial, I’m just going to take the first result for each location with multiple addresses. To do this, I’m going to use the unnest function to separate the nestest list of returns into separate rows so that MIT now looks like this:
hospitals %>%
unnest(lat, lng, address) %>%
filter(names == 'M I T Medical Department')
## names lat lng
## 1 M I T Medical Department 42.36121 -71.08673
## 2 M I T Medical Department 42.36009 -71.09416
## address
## 1 25 Carleton St, Cambridge, MA 02142, USA
## 2 77 Massachusetts Ave, Cambridge, MA 02139, USA
And now to do a slice so that only the first row of each duplicate remains
hospitals<-hospitals %>%
unnest(lat, lng, address) %>%
group_by(names)%>%
slice(1L)
filter(hospitals, names == 'M I T Medical Department')
## # A tibble: 1 x 4
## # Groups: names [1]
## names lat lng address
## <chr> <dbl> <dbl> <chr>
## 1 M I T Medical Department 42.4 -71.1 25 Carleton St, Cambridge, MA 0214~
Now that we have one row and one observation for each hospital, we can now convert to an sf object.
hosp_geo<-st_as_sf(hospitals, coords = c('lng','lat'),
crs = 4326)
This is very easy. Calling st_distance between the same object performs a row-wise distance calculation between each combination of observations. However, we need to be careful here. If you remember the section on projections from the first tutorial, there are certain projections that are designed to preserve distance on a 2d coordinate system (at the expense of other metrics, such as area). These projections are usually conic in nature and look something like this when depicting the entire world:
Since we’re going to be calculating the distances between hospitals, we want to transform our hosp_geo spatial object to have one of these projections. Since all of our data is in Massachusetts, we can use a special coordinate system constructed just for the state which is called the ‘Massachusetts State Plane Coordinate System’ (which, as you can see on their website, is a version of the Lambert Conformal Conic projection modified specifically for Massachusetts). Each state has at least one of these projections, but larger states have 2 or more and caution must be used when calculating distances between planes. Luckily for us, Massachusetts only has one. We can see the differences in distance calculations below:
original_hosp <- hosp_geo
xform_hosp <- st_transform(hosp_geo, crs = 26986)
#Mercator Projection Distances:
head(as.data.frame(st_distance(original_hosp, original_hosp))[1:6])
## V1 V2 V3 V4 V5 V6
## 1 0.00 m 95315.71 m 53149.84 m 56146.580 m 56458.512 m 48378.14 m
## 2 95315.71 m 0.00 m 106180.98 m 54860.418 m 58295.545 m 110925.69 m
## 3 53149.84 m 106180.98 m 0.00 m 51348.062 m 48093.689 m 101514.43 m
## 4 56146.58 m 54860.42 m 51348.06 m 0.000 m 3959.395 m 93162.48 m
## 5 56458.51 m 58295.55 m 48093.69 m 3959.395 m 0.000 m 95048.35 m
## 6 48378.14 m 110925.69 m 101514.43 m 93162.477 m 95048.350 m 0.00 m
#Mass State Plane Projection Distances:
head(as.data.frame(st_distance(xform_hosp, xform_hosp))[1:6])
## V1 V2 V3 V4 V5 V6
## 1 0.00 m 95314.41 m 53148.12 m 56144.703 m 56456.590 m 48376.88 m
## 2 95314.41 m 0.00 m 106178.75 m 54859.841 m 58294.841 m 110926.06 m
## 3 53148.12 m 106178.75 m 0.00 m 51346.391 m 48092.124 m 101511.44 m
## 4 56144.70 m 54859.84 m 51346.39 m 0.000 m 3959.265 m 93160.26 m
## 5 56456.59 m 58294.84 m 48092.12 m 3959.265 m 0.000 m 95045.98 m
## 6 48376.88 m 110926.06 m 101511.44 m 93160.260 m 95045.978 m 0.00 m
We can see from this small sample of distances that the original projection’s distances were off by up to 3 meters. While that may not mean much, if we were calculating distances from Mass to California, the errors would be much much larger. Plus it’s always a good idea to employ best practices, especially when all you have to do is add one line of code.
Looking at our new distance matrix, we still have some problems. These ‘V1’, ‘V2’, etc variables are meaningless in our context. We can make this data frame a bit more practical by changing the column names to the names of the hospitals they correspond to, as well as adding another column containing those same names in the same order. Then we would have both row and column names with the values referring to the distance between those two hospitals
dist_matrix<-as.data.frame(st_distance(xform_hosp,xform_hosp)) %>% #Converts to dataframe
#Sets column names
`colnames<-`(hospitals$names) %>%
#Adds a column containing names so that each row now also has a name
cbind(name = hospitals$names)
dist_matrix[1:6,c(1:6,ncol(dist_matrix))] %>% kable()
| Adcare Hospital of Worcester | Anna Jaques Hospital | Arbour-Fuller Hospital | Arbour H R I Hospital | Arbour Hospital | Athol Memorial Hospital | name |
|---|---|---|---|---|---|---|
| 0.00 m | 95314.41 m | 53148.12 m | 56144.703 m | 56456.590 m | 48376.88 m | Adcare Hospital of Worcester |
| 95314.41 m | 0.00 m | 106178.75 m | 54859.841 m | 58294.841 m | 110926.06 m | Anna Jaques Hospital |
| 53148.12 m | 106178.75 m | 0.00 m | 51346.391 m | 48092.124 m | 101511.44 m | Arbour-Fuller Hospital |
| 56144.70 m | 54859.84 m | 51346.39 m | 0.000 m | 3959.265 m | 93160.26 m | Arbour H R I Hospital |
| 56456.59 m | 58294.84 m | 48092.12 m | 3959.265 m | 0.000 m | 95045.98 m | Arbour Hospital |
| 48376.88 m | 110926.06 m | 101511.44 m | 93160.260 m | 95045.978 m | 0.00 m | Athol Memorial Hospital |
Here’s where things get a bit nightmarish: we need the bottom 5 distance values for each hospital, along with the name of the other hospital they’re neighbored to. We could do this either row wise (filter rows to the row that matches the origin hospital, then look at column names/values), but for this task doing this column wise is much easier. Similar to what we did to geocode, we’re going to make a function that takes in the origin hospital and returns a dataframe with the origin hospital, the neighbor/target hospital, and the distance values.
### Returns a list of the 5 closest hospitals to a given one
### 'origin' is the origin hospital, and 'name' is the column of names in the matrix
close_checker <- function(origin){
dist_matrix%>%
#Selects only the origin hospital distance and the name columns
select(origin, name) %>%
#Filters to the bottom 6 distances (includes distance between hospital and itself)
top_n(-6, !!as.name(origin)) %>%
#Drops the instance where the distance between a hospital and itself is 0
filter(name != origin) %>%
#Orders the results
arrange(!!as.name(origin)) %>%
#Adds a column with the origin hospital name (passed into the function)
cbind(hospital = origin) %>%
#Renames columns
`colnames<-`(c('Distance','Target','Hospital')) %>%
return()
}
close_checker(hospitals$names[1])
## Distance Target Hospital
## 1 1267.666 m Saint Vincent Hospital Adcare Hospital of Worcester
## 2 1881.093 m Worcester State Hospital Adcare Hospital of Worcester
## 3 2751.840 m UMass Memorial Medical Center Adcare Hospital of Worcester
## 4 3232.114 m Veterans Affairs Med Center Adcare Hospital of Worcester
## 5 3763.125 m Fairlawn Rehab Hospital Adcare Hospital of Worcester
Perfect. In the end, we’re going to have one dataframe containing all of the hospitals’ closest 5. From there, we need to create an sf object with a line for each origin:target pair. Another instance of making a function to map…
#Function creates a linestring between two hospitals, origin and target
linestring_func <- function(origin, target){
hosp_geo %>%
# Filters to the two names we are connecting
filter(names == origin | names == target) %>%
# Extracts the raw coordinates of the two points
st_coordinates() %>%
# Creates a linestring
st_linestring() %>%
return()
}
Important: Note that this function filiters the original hosp_geo object that has the Mercator projection and not the transformed one (xform_hosp) with the Mass State Plane projection. This is because our final plot is going to be on a leaflet map, which uses the Mercator projection, so we need to plot spatial objects that also have this same projection. This means not only are we going to eventually be plotting the hosp_geo object, we want to make sure that our lines are also constructed from that object so that they will have the same reference system.
We are now ready to map our functions! Remember, our workflow will iterate through the list of names of hospitals. For each hospital, our close_checker function will create a table with the hospital and it’s 5 nearest neighbors. Then our linestring_func will create a linestring for each combination.
#Starting point is names of all the hospitals
locs <- hospitals$names %>%
#Maps through all of them and applies our close_checker() function
map(~(.x) %>% close_checker()) %>%
#Binds all of the length 5 dataframes with eachother
invoke(rbind,.) %>%
#Creates a column of linestrings
mutate(ls = map2(Hospital,Target,linestring_func))
head(locs)
## Distance Target Hospital
## 1 1267.666 m Saint Vincent Hospital Adcare Hospital of Worcester
## 2 1881.093 m Worcester State Hospital Adcare Hospital of Worcester
## 3 2751.840 m UMass Memorial Medical Center Adcare Hospital of Worcester
## 4 3232.114 m Veterans Affairs Med Center Adcare Hospital of Worcester
## 5 3763.125 m Fairlawn Rehab Hospital Adcare Hospital of Worcester
## 6 13749.879 m Merrimack Valley Hospital Anna Jaques Hospital
## ls
## 1 -71.79483, -71.79642, 42.27633, 42.26498
## 2 -71.79483, -71.77213, 42.27633, 42.27791
## 3 -71.79483, -71.76153, 42.27633, 42.27771
## 4 -71.79483, -71.76745, 42.27633, 42.29714
## 5 -71.79483, -71.83487, 42.27633, 42.26009
## 6 -70.89092, -71.04518, 42.81440, 42.76530
Looks good! The final step is to convert this into a sf object using the linestrings as its geometry
lines<- st_sf(origin = locs$Hospital,
target=locs$Target,
distance = locs$Distance,
geometry = st_sfc(locs$ls, crs = 4326),
dim = 'XY',
row.names = NULL)
And we now have lines ready to plot with their distance, origin, and destination recorded as variables.
In R, it’s very easy to download files straight from the web and load them into the console. I’ll be using county-level MA data, so I download and unzip the geometry using the following code.
counties_path <- "./src/data/ma_counties"
if(!dir.exists(counties_path)){dir.create(counties_path)}
download.file(url = 'http://download.massgis.digital.mass.gov/shapefiles/state/counties.zip',
destfile = paste0(counties_path,'/ma_counties.zip'))
unzip(zipfile = paste0(counties_path,'/ma_counties.zip'),
exdir = './src/data/ma_counties')
Now to read it in. After, lets do a quick plot of all 3 files to double check that they’re all in the same CRS and that there were no outrageous errors.
ma<-st_read(paste0(counties_path,'/COUNTIESSURVEY_POLYM.shp'))
## Reading layer `COUNTIESSURVEY_POLYM' from data source `C:\Users\mpohlman\Documents\rgistutorial\src\data\ma_counties\COUNTIESSURVEY_POLYM.shp' using driver `ESRI Shapefile'
## Simple feature collection with 14 features and 13 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 33863.73 ymin: 777606.4 xmax: 330837 ymax: 959743
## epsg (SRID): NA
## proj4string: +proj=lcc +lat_1=41.71666666666667 +lat_2=42.68333333333333 +lat_0=41 +lon_0=-71.5 +x_0=200000 +y_0=750000 +datum=NAD83 +units=m +no_defs
ma <- st_transform(ma, crs = st_crs(lines))
hosp_geo <-st_as_sf(hospitals, coords = c('lng','lat'),
crs = st_crs(lines))
plot(ma$geometry)
plot(lines$geometry, add = T)
plot(hosp_geo$geometry, add =T)
Now lets get started with leaflet. General leaflet syntax follows the traditional tidyverse piping %>% operations on a map object. The simplest of these need to be initialized with a lat/lng/zoom, and a choice of basemap via the addProviderTiles() command. (Or just addTiles for the default OSM map). One of the super cool things about leaflet is that there are a ton of basemaps you can easily import to your map. Here’s a nice tool to view all of the different basemaps in leaflet. The code on the website is for Javascript, but you can easily pass them into the addProviderTiles() operator:
map <- leaflet() %>%
setView(lng = -71.5820372, lat = 42.1770196, zoom = 8) %>%
addProviderTiles(providers$OpenTopoMap)
map
Now lets add our county level shape file to a very clean background using addPolygons (hint: most things in leaflet begin with ‘add’)
map <- leaflet() %>%
setView(lng = -71.5820372, lat = 42.1770196, zoom = 8) %>%
addProviderTiles(providers$Esri.WorldGrayCanvas) %>%
addPolygons(data = ma)
map