Scraping disaggregated election data from multiple websites using R and rvest

A walkthrough using the 2018 Russian presidential election

Precinct-level electoral data can be very valuable to social and data scientists; in my own work, this kind of data is essential for performing election-forensic analysis of the results in order to detect possible electoral manipulation. Unfortunately, governments in both democracies and non-democracies rarely consolidate precinct-level data in one downloadable file; more often, these results are scattered across the websites of disparate election commissions or other government bodies. For example, the Russian Central Election Commission no longer directly reports precinct-level data, reporting results only at the district level as shown below.

The page notes there are 6 precincts here, but reports only aggregate data.

Below, you’ll find a template for scraping such data using the rvest package for navigating HTML in R, along with the stringi package for transliterating non-Latin characters (in this case, Cyrillic) into Latin.

This script scrapes the precinct-level election results, region by region, using rvest to navigate to the appropriate webpages in nested for-loops. It automatically navigates Russia’s Central Election Commission website and the websites of the regional election commissions (as of summer 2019) to extract precinct-level election results for each electoral district in every region of Russia, and appends these lists into one long table of precinct-level results. The result is a .csv file of electoral data suitable for conducting election-forensic analysis.

Of course, there are lots of other cases where you might want to scrape disaggregated data! For researchers interested in Russian electoral data, the example given below could easily be modified to extract different data from the CEC / regional commission websites. More broadly, it can also be used as a template for scraping other kinds of routine data from multiple sources. In either case, to tweak the template to capture the data you want, you will need to change the CSS selectors that tell rvest how to navigate the HTML and which elements of the webpage to grab. This can be done (if not exactly easily, then at least feasibly) using the Chrome plug-in SelectorGadget, which suggests a CSS selector that identifies the object you want. Fair warning: for complicated websites, it can take some trial and error to find the right selector. It can also be helpful to use Chrome’s developer tools to understand the structure of the page you are trying to navigate in R.
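
To make the idea of a CSS selector concrete, here is a minimal sketch that runs against a small inline HTML snippet rather than a live website. The markup and the example.org URLs are made up for illustration; only the selector "select option" and the rvest functions are the ones used in this walkthrough.

```r
library(rvest)

# A tiny stand-in for a page with a drop-down menu (hypothetical markup and URLs)
page <- read_html('
  <select name="regions">
    <option value="">---</option>
    <option value="http://example.org/region1">Region One</option>
    <option value="http://example.org/region2">Region Two</option>
  </select>')

# "select option" matches every <option> element nested inside a <select>
opts <- html_nodes(page, "select option")

# Pull the link stored in each option's value attribute, and its visible text
option_links <- html_attr(opts, "value")
option_text  <- html_text(opts)
```

If a selector matches nothing, html_nodes() returns an empty set rather than an error, so checking length(opts) is a quick way to test a candidate selector.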

Finally, this code is suited to cases where the sites being scraped largely follow the same format. In this case, the regional election commission websites are all generally modelled after that of the Central Election Commission; consequently, it’s relatively easy to routinize the data scraping process. For situations where websites differ considerably (say, a system like the US, where there is no central election authority), this template will be less useful.

With all that said, on to the code!

Walkthrough

The first code chunk loads the necessary libraries, sets up a container for the tables, and loads an html session. The session loads the top-level page for the 2018 presidential election in Russia. This top-level page contains links to each region-level page in a drop-down menu. As a first step, we will need to collect each of the links for the region pages contained in the drop-down menu, so that we can navigate to those URLs.

Throughout this example, I have commented out the for-loops, so that the code runs only once in RMarkdown. To run the full code, simply remove the `#` comment markers from the lines that open and close the loops.

library(rvest)
library(stringi)
library(tidyr)
library(data.table)
library(plyr)   #Load plyr before dplyr so that plyr does not mask dplyr's functions
library(dplyr)

rm(list=ls())

#Create a container for the full, precinct-level results

table.list.big <- list() 

#Begin the HTML session at the top-level page

outer.session <- html_session("http://www.vybory.izbirkom.ru/region/izbirkom?action=show&global=1&vrn=100100084849062&region=0&prver=0&pronetvd=null",
                              encoding="windows-1251")   #This is the Cyrillic encoding used by this particular webpage

#Use read_html() %>% html_nodes() to save the items in the drop-down menu
#"select option" is a CSS selector that tells rvest to grab all the options in the drop-down menu

nodes <- outer.session %>% read_html(encoding="windows-1251") %>% html_nodes("select option")  

#Create a dataframe that contains the text (i.e. the region's name) and the URL from the drop-down menu nodes

nodes_regions_df <- data_frame(
  link=xml_attr(nodes, "value"),
  name=xml_text(nodes)
)
nodes_regions_df <- nodes_regions_df[-1,] #Remove first, empty row

The next segment begins the for loop that moves through the URLs for the pages containing region-level results, using the data frame we just created. Each region-level page contains another drop-down menu, which leads to the results for each electoral district in the region. As above, we want to create a data frame that contains these links, so that they can be navigated to via another for loop.

Since the CEC no longer collates precinct-level election results, we need to navigate to the regional election commission for each region in order to collect those results. The code segment below gets us there in the following sequence:

  1. Go to the region-level results on the CEC page
  2. Go to the first district-level results page for that region (still on the CEC website)
  3. From there, jump to the district-level results page on the regional election commission website
  4. Navigate to the top-level results page of the regional election commission website
  5. Navigate to the full results page for the region

#Create a counter, m, for the regions

m <- 1  

#for(m in 1:nrow(nodes_regions_df)){
  
  #Navigate to link m in the region-level dataframe and save the district-level URLs in a new dataframe (nodes_district_df)
  
  s <- html_session(url=as.character(nodes_regions_df[m,1]),
                    encoding="windows-1251")                    #Selects the region
  nodes_district <- s %>% read_html(encoding="windows-1251") %>% html_nodes("select option") 
  nodes_district_df <- data_frame(
    link=xml_attr(nodes_district, "value"),
    name=xml_text(nodes_district)
  )
  nodes_district_df <- nodes_district_df[-1,]

  #Create a container for district results
  
  table.list.district <- list()
 
  #Jump to the first district URL under the region
  
    s <- jump_to(s, as.character(nodes_district_df[1,1])) 
    
    #On each district page, the sixth link jumps to the Regional Election Commission page
    
    s <- follow_link(s, 6) 
      
    #Grab the link for the main page of the regional election commission
    #Do this by counting tables from the bottom of the page--the correct link is in the table that is 3rd from bottom
    #Select all links in that table
    
    
    region.main.link <- s %>% read_html(encoding="windows-1251") %>% 
      html_nodes("table:nth-last-child(3)") %>% html_nodes("a") %>% html_attr("href") 
    
    #Jump to the first link in that table; First link leads to the region main page
    
    s <- jump_to(s, url = as.character(region.main.link[1])) 
    
    #The HTML session is now the top-level page on the regional election commission page
    #Read the page and collect URLs as a table
    
    page.table <- s %>% read_html(encoding="windows-1251") %>% 
      html_nodes("table:nth-last-child(3)") %>% html_nodes("a") %>% html_attr("href")
    
    #The final URL on each page is the landing page we want for the next for loop: "Svodnaya tablitsa of election results"
    #Save that URL as region.link
    
    region.link <- tail(page.table, n = 1)  #The last link on the page leads to the landing page we want
      
    #Navigate the session to that URL  
      
    s <- jump_to(s, as.character(region.link))  #The session lands on target page for the next loop

As you go through your code with rvest, it can be helpful to check where you are in R visually using a web browser. In this case, we can check the current session URL by calling the variable s, copying the URL, and opening it in a browser.

s
## <session> http://www.adygei.vybory.izbirkom.ru/region/region/adygei?action=show&root=1000001&tvd=100100084849067&vrn=100100084849062&region=1&global=true&sub_region=1&prver=0&pronetvd=null&vibid=100100084849067&type=227
##   Status: 200
##   Type:   text/html;charset=Windows-1251
##   Size:   35222

Here, the current session when m = 1 is set at what I call the landing page for the region: the page that we can use to cycle through all of the districts in the region in order to scrape precinct-level data. It looks like this:

The links circled in red lead to the district pages where precinct-level data is presented; these are the ones we are after in the inner for-loop.

The next code segment sets up the elements necessary for the inner-most for loop; the loop that will cycle through the electoral districts in region ‘m’, using the landing page, and scrape the precinct-level results. It sets up a container for the precinct-level results, and saves the region name, transliterates it into Latin characters from Cyrillic, and finds the number of districts that the for loop will need to cycle through.
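
A minimal sketch of that setup step follows. The object names table.list, region.name and n.districts are the ones the inner loop relies on. Everything else is an assumption on my part: that the region name sits in the second column of nodes_regions_df, and that the number of districts can be found by counting the links matched by "div td a" on the landing page. The stand-in data frame and inline HTML are hypothetical, and are there only so the sketch runs on its own.

```r
library(rvest)
library(stringi)

# Stand-ins so the sketch runs on its own; in the walkthrough these would be
# the objects created earlier (nodes_regions_df, m, and the session s)
nodes_regions_df <- data.frame(
  link = "http://example.org/region1",   # hypothetical URL
  name = "Республика Адыгея (Адыгея)",
  stringsAsFactors = FALSE
)
m <- 1
landing.page <- read_html('
  <div><table><tr>
    <td><a href="d1">District 1</a></td>
    <td><a href="d2">District 2</a></td>
    <td><a href="d3">District 3</a></td>
  </tr></table></div>')   # hypothetical markup; with a live session, read the page from s

# Container for the precinct-level tables, one entry per district
table.list <- list()

# Region name from the drop-down text, transliterated from Cyrillic to Latin
region.name <- stri_trans_general(as.character(nodes_regions_df[m, 2]), "latin")

# Assumed approach: number of districts = number of links in the results table
n.districts <- length(html_nodes(landing.page, "div td a"))
```

On the real landing page the selector may need adjusting so that it counts only the district links.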

Next is the inner-most for loop, which gathers the precinct data. Prior to starting this loop, s is an HTML session of a page that shows district-level results and links to more detailed precinct-level data for each of these districts. The loop steps through each district i from 1 to `n.districts` and scrapes the precinct-level data therein.

##Begin inner for-loop
    i <- 1 #The district counter

   # for(i in 1:n.districts){
      
      #Create a variable for the i-th link in the table
      
      css.location <- paste("div td:nth-child(", i, ") a", sep="")
      
      #Follow the link to the district page, where precinct (UIK) level results are presented
      #Save and transliterate the district name
      
      page <- s %>% follow_link(css = css.location) %>% read_html(encoding="windows-1251") 
      district.name <- page  %>% html_node("table:nth-child(5)") %>%
        html_text()
      district.name <- substr(district.name, start=38, stop=(nchar(as.character(district.name)))-4)
      district.name <- stri_trans_general(district.name, 'latin')
      
      #Create a table to contain the precinct-level results, and scrape the table text
     
      
      table.long <-   page %>%
        html_nodes(" div b , div table") %>% html_text()
      
      #Convert to numeric makes non-numeral heading (i.e. extra text) rows become NA
      
      table.long <- as.numeric(table.long)  
      
      #Remove those NA rows, giving us the numerical results table
      table.long <- na.omit(table.long)    

      #At this point the scraped table is a vector; we now need to parse it back into rows and columns
      #The vector has been read in from left to right and top to bottom; that is, row-by-row
      #For this election, each page has 20 rows corresponding to variables like turnout and a candidate's vote-share
      #Divide the vector by 20 to get the number of columns in the table (for now, columns=precincts)
      #NOTE: The value for the number of rows will vary from election to election!
      
      #Create a container with 20 rows and the appropriate number of columns (number of precincts) to hold the data
      table.temp <- matrix(NA, nrow = 20, ncol = (length(table.long) / 20))
      
      #Fill in the container with data from the vector of results using a for loop
      #The loop below says that for each row in the container, fill the cells with as many entries from the vector as there are columns on the webpage. Then update the column/precinct counter by the number and move to the next row. 
      
      r <- 1  #Column/precinct counter
      q <- 1  #Row/variable counter (i.e. variables on the webpage (vote total, etc))
      for(q in 1:20){
        table.temp[q,] <- table.long[r:(r+(ncol(table.temp)-1))]
        r <- r + (ncol(table.temp))
      }
      
      #Transpose the table so that it is in the correct format: columns are variables and rows are precincts
      
      t.table.temp <- data.table(t(table.temp))
      
      #Add the saved id variables
      
      table.ids <- matrix(NA, nrow=nrow(t.table.temp), ncol=2)
      table.ids[,1] <- district.name
      table.ids[,2] <- region.name
      
      t.table.temp <- cbind(t.table.temp, table.ids)
      
      #Insert the resulting table into the list of tables with precinct data for each district in this region
      
      table.list[[i]] <- t.table.temp
 #   }  #The close bracket for the inner for-loop
    
 #Convert the list of district tables into one single table for the region, name the precincts, and save with `m` as a counter for the regions
    region.table <- rbindlist(table.list)
    precinct.names <- matrix(
      paste("UIK", seq(1:nrow(region.table)), sep="-"), nrow=nrow(region.table), ncol=1)
   
    region.table <- cbind(precinct.names, region.table)
  
  # Uncomment below to write csv to preferred destination  
  # write.csv(region.table, file=paste0("C:/filepath_", m, ".csv"))
    
   #Pause for 3 seconds before continuing the loop, to avoid over-taxing the server or making the authorities mad
    Sys.sleep(3)
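
The reshaping step in the loop above (filling a container row by row with counters) can also be written in a single call, since matrix() can fill by row. The sketch below uses a toy vector with 3 variables and 4 precincts in place of the real scraped data; n.vars is a name I have introduced for illustration, and would be 20 for the 2018 pages.

```r
# Toy vector standing in for table.long: 3 variables (rows on the webpage)
# by 4 precincts (columns on the webpage), read row-by-row
n.vars <- 3
table.long <- c(101, 102, 103, 104,   # variable 1 for precincts 1-4
                 51,  52,  53,  54,   # variable 2
                 11,  12,  13,  14)   # variable 3

# Fail early if the page did not have the expected number of rows
stopifnot(length(table.long) %% n.vars == 0)

# byrow = TRUE fills the matrix row by row, matching the order of the scrape
table.temp <- matrix(table.long, nrow = n.vars, byrow = TRUE)

# Transpose so that rows are precincts and columns are variables
t.table.temp <- t(table.temp)
```

The length check is worth keeping in either version: if a page has a different number of rows, the division by 20 no longer gives a whole number of precincts, and the values would be assigned to the wrong cells.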

The last code segment is thankfully short and simple. What remains is to save region.table into the large table that will contain results for each precinct in the country, and to close the outer for-loop.

 #Save region.table for region `m` as entry `m` in the large table
  
  table.list.big[[m]] <- region.table

 #Notify user that the region is complete

  print(paste("Region", region.name, "complete.", sep=" "))
## [1] "Region Respublika Adygeâ (Adygeâ) complete."
# } #The final brace, closes the outer for-loop

#Combine the list of region tables into one large table and save it
#Note that each user will have to rename the variables using their preferred nomenclature

#  Uncomment code below to write csv to preferred destination
  
#   write.csv(rbindlist(table.list.big), "C:/filepath.csv", row.names=FALSE)

Final notes

That’s it!

Once you un-comment the for-loop commands, this code will automatically work its way through the region pages and scrape the election data. In some cases, where region pages are slightly different, it will throw errors. These errors are relatively simple to address individually by changing the CSS selectors for the affected region, and so are not covered in this walkthrough.
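
One way to keep a full run going when a single region page misbehaves is to wrap the loop body in tryCatch() and record which regions failed, so they can be revisited afterwards. This is only a sketch: scrape_region() is a hypothetical stand-in for the body of the outer loop, not part of the script above, and here it simply simulates a failure for one region.

```r
# Hypothetical stand-in for the body of the outer loop: returns a table for
# region m, or stops with an error if a selector does not match
scrape_region <- function(m) {
  if (m == 2) stop("selector matched nothing")   # simulate one odd region
  data.frame(region = m, votes = 100 * m)
}

table.list.big <- list()
failed.regions <- c()

for (m in 1:3) {
  result <- tryCatch(
    scrape_region(m),
    error = function(e) {
      # Report the problem and return NULL instead of halting the whole run
      message("Region ", m, " failed: ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(result)) {
    failed.regions <- c(failed.regions, m)
  } else {
    # Index by name so that a skipped region does not leave a hole in the list
    table.list.big[[as.character(m)]] <- result
  }
}
```

After the loop, failed.regions holds the values of m that need a closer look at their selectors.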

If you are interested in the 2016 or 2018 Russian election data, and especially if this code no longer works due to changes in the CEC / regional commission websites, please contact me! I’ll be happy to provide the data scraped by this code.