Data visualization in R: line charts

My previous post was about how we can download data from Statistic Denmark. Here, we will use these data for learning to plot line charts in R using ggplot2. In the way, we will see:

  • How to download data from Statistic Denmark.
  • How to clean a dataframe and manipulate it to calculate the values we would like to plot.
  • Generate line charts with ggplot2.
  • Make multi-panel plots.
  • Tune the look of a plot for producing beautiful charts.

We will use the following libraries:

library(danstat)
library(tidyverse)
library(furrr)
library(forcats)
library(dint)

The first think I do when I produce a figure is to think about what I would like to tell with it. In this case I would like to see the population change (in percentage) in each Danish municipality from 2008 to 2020, both by the total population and by separating the data in persons of Danish origin and foreigners (i.e immigrants and their descendants).

Load data

We need the Population at the first day of the quarter by region, sex, age (5 years age groups), ancestry and country of origin - table FOLK1C.

id_table <- "FOLK1C"
var_pop <- get_table_metadata(table_id = id_table, variables_only = TRUE)
  
# loop by quarter for getting the data 
  steps <- function(quarter){
      var_values <- list(id_region, id_ancestry, quarter)
      var_input <- purrr::map2(.x = var_codes, .y = var_values, .f = ~list(code = .x, values = .y))
      get_data(id_table, variables = var_input)
  }

# Codes for var_input
  var_codes <- c("OMRÅDE", "HERKOMST", "Tid")
  
  # Values for var_input
  ## Region: Denmark
  id_region <- as.numeric(var_pop$values[[1]]$id)
  id_region <- id_region[id_region > 100]
  ## Ancestry
  id_ancestry <- NA
  ## Quarters
  id_quarter <- var_pop$values[[6]]$id   # Select all quarters
  # Parallel process with {future}
  plan(multiprocess)  
  pop_LAU <- id_quarter %>% future_map(steps)
  pop_LAU <- bind_rows(pop_LAU)
  plan("default")
  
  pop_LAU
## # A tibble: 20,592 x 4
##    OMRÅDE        HERKOMST                 TID    INDHOLD
##    <chr>         <chr>                    <chr>    <dbl>
##  1 Copenhagen    Total                    2008Q1  509861
##  2 Copenhagen    Persons of Danish origin 2008Q1  405954
##  3 Copenhagen    Immigrants               2008Q1   77114
##  4 Copenhagen    Descendant               2008Q1   26793
##  5 Frederiksberg Total                    2008Q1   93444
##  6 Frederiksberg Persons of Danish origin 2008Q1   79257
##  7 Frederiksberg Immigrants               2008Q1   11524
##  8 Frederiksberg Descendant               2008Q1    2663
##  9 Dragør        Total                    2008Q1   13261
## 10 Dragør        Persons of Danish origin 2008Q1   12432
## # ... with 20,582 more rows

Clean and prepare the data

As we can see, the column names are in Danish. Therefore, we would need to rename them (i.e. OMRÅDE = LAU_NAME, HERKOMST = ancestry, TID = date, and INDHOLD = pop). We can use the function rename (from dplyr). For this, we put first the new name of the column (e.g. LAU_NAME) and then we link it with the name of the old column (i.e. OMRÅDE). See the first part of the following code. Then, we would need to convert TID from a character format to dates. As the data represent the Population at the first day of the quarter, our objective is therefore to transform the value (e.g. “2008Q1”) to the first day of the quarter (e.g. “2008-01-01”). We can do that with the dint package. First, we need to convert the value to a year-quarter date format (e.g. “2008-Q1”) using the function as_date_yq, which take a value (i.e. 20081) representing the year (2008) and the quarter (i.e. 1, 2, 3, 4 for the first, second, third, or four quarter, respectively). This is why I remove the Q from the value using gsub. Once we have the year-quarter we past it to first_of_quarter for calculating the first day of a quarter. Finally, I have change a level of ancestry (i.e. Persons of Danish origin -> Danish) and transform the dataframe to its wide format putting each value in a column (i.e. Total, Danish, Immigrant, and Descendant).

pop_LAU <- pop_LAU %>% 
  # translate columns names into English
  rename(LAU_NAME = OMRÅDE,
         ancestry = HERKOMST,
         date = TID, 
         pop = INDHOLD) %>% 
  # Transform date
  mutate(date = gsub("Q", "", date),
         date = as_date_yq(as.integer(date)),
         date = first_of_quarter(date)) %>% 
  # Change "Persons of Danish origin"
  mutate(ancestry = ifelse(ancestry == "Persons of Danish origin",
                           "Danish",
                           ancestry)) %>% 
  # Pivot data to wide format
  pivot_wider(c(LAU_NAME, date), names_from = ancestry, values_from = pop)

pop_LAU
## # A tibble: 5,148 x 6
##    LAU_NAME      date        Total Danish Immigrants Descendant
##    <chr>         <date>      <dbl>  <dbl>      <dbl>      <dbl>
##  1 Copenhagen    2008-01-01 509861 405954      77114      26793
##  2 Frederiksberg 2008-01-01  93444  79257      11524       2663
##  3 Dragør        2008-01-01  13261  12432        733         96
##  4 Tårnby        2008-01-01  40016  36267       2889        860
##  5 Albertslund   2008-01-01  27602  20718       4408       2476
##  6 Ballerup      2008-01-01  47116  41708       3746       1662
##  7 Brøndby       2008-01-01  33831  25743       5146       2942
##  8 Gentofte      2008-01-01  68913  61568       6451        894
##  9 Gladsaxe      2008-01-01  62562  54605       5848       2109
## 10 Glostrup      2008-01-01  20673  18327       1640        706
## # ... with 5,138 more rows

Then, we would need to divide the population in two categories, Danes and foreigners (the sum of immigrants and their descendants). The function mutate is used to generate this new variable, and then I use select to get only the columns I will use for the figure.

pop_LAU <- pop_LAU %>% 
  mutate(Foreign = Immigrants + Descendant) %>% 
  select(-Immigrants, -Descendant) 

pop_LAU
## # A tibble: 5,148 x 5
##    LAU_NAME      date        Total Danish Foreign
##    <chr>         <date>      <dbl>  <dbl>   <dbl>
##  1 Copenhagen    2008-01-01 509861 405954  103907
##  2 Frederiksberg 2008-01-01  93444  79257   14187
##  3 Dragør        2008-01-01  13261  12432     829
##  4 Tårnby        2008-01-01  40016  36267    3749
##  5 Albertslund   2008-01-01  27602  20718    6884
##  6 Ballerup      2008-01-01  47116  41708    5408
##  7 Brøndby       2008-01-01  33831  25743    8088
##  8 Gentofte      2008-01-01  68913  61568    7345
##  9 Gladsaxe      2008-01-01  62562  54605    7957
## 10 Glostrup      2008-01-01  20673  18327    2346
## # ... with 5,138 more rows

The next step is to calculate, for each municipality, the percentage of change of the total, Danish, and foreign population using 2008 as baseline. First, I transform the data to its long format (pivot_longer) for having one row for each measurement. Then, I calculate the percentage for each group of people (i.e ancestry = Danish, Foreign, Total) and each municipality (i.e. LAU_NAME). I use mutate for doing that but the data must be grouped (group_by) to be sure the formula is applied separately in each group. Once we have finished with the calculations, it is recommended to ungroup the data for not getting unexpected behaviours in future calculations with the variables. The final step of the chunk is to customize the labels of the variable we are going to use in the legend of the figure (i.e. ancestry).

pop_LAU <- pop_LAU %>% 
  # Pivot to long format 
  pivot_longer(-c(LAU_NAME, date), names_to = "ancestry", values_to = "pop") %>% 
  # Percentage of change (baseline 2008-Q1)
  group_by(LAU_NAME, ancestry) %>% 
  arrange(LAU_NAME, date) %>% 
  mutate(pop_pct_2008 = (pop/first(pop) - 1) * 100) %>% 
  ungroup() %>% 
  # Sort ancestry 
  mutate(ancestry = factor(ancestry, 
                           levels = c("Danish",
                                      "Foreign",
                                      "Total"),
                           labels = c("Danish",
                                      "Foreign (Immigrants + Descendant)",
                                      "Total"))) 

pop_LAU
## # A tibble: 15,444 x 5
##    LAU_NAME date       ancestry                            pop pop_pct_2008
##    <chr>    <date>     <fct>                             <dbl>        <dbl>
##  1 Aabenraa 2008-01-01 Total                             60189       0     
##  2 Aabenraa 2008-01-01 Danish                            54165       0     
##  3 Aabenraa 2008-01-01 Foreign (Immigrants + Descendant)  6024       0     
##  4 Aabenraa 2008-04-01 Total                             60283       0.156 
##  5 Aabenraa 2008-04-01 Danish                            54202       0.0683
##  6 Aabenraa 2008-04-01 Foreign (Immigrants + Descendant)  6081       0.946 
##  7 Aabenraa 2008-07-01 Total                             60427       0.395 
##  8 Aabenraa 2008-07-01 Danish                            54318       0.282 
##  9 Aabenraa 2008-07-01 Foreign (Immigrants + Descendant)  6109       1.41  
## 10 Aabenraa 2008-10-01 Total                             60368       0.297 
## # ... with 15,434 more rows

Finally, since we are more interested in the process of plotting line chards in R and not the comparison between the 98 municipalities, I have simplified the dataset to the main four municipalities (i.e. Aarhus, Aalborg, Odense, and Copenhagen).

  big_cities <- c("Aarhus", "Aalborg", "Odense", "Copenhagen")
  pop_LAU <- pop_LAU %>% filter(LAU_NAME %in% big_cities)
  
  pop_LAU
## # A tibble: 624 x 5
##    LAU_NAME date       ancestry                             pop pop_pct_2008
##    <chr>    <date>     <fct>                              <dbl>        <dbl>
##  1 Aalborg  2008-01-01 Total                             195145       0     
##  2 Aalborg  2008-01-01 Danish                            181159       0     
##  3 Aalborg  2008-01-01 Foreign (Immigrants + Descendant)  13986       0     
##  4 Aalborg  2008-04-01 Total                             195145       0     
##  5 Aalborg  2008-04-01 Danish                            180993      -0.0916
##  6 Aalborg  2008-04-01 Foreign (Immigrants + Descendant)  14152       1.19  
##  7 Aalborg  2008-07-01 Total                             195048      -0.0497
##  8 Aalborg  2008-07-01 Danish                            180960      -0.110 
##  9 Aalborg  2008-07-01 Foreign (Immigrants + Descendant)  14088       0.729 
## 10 Aalborg  2008-10-01 Total                             196138       0.509 
## # ... with 614 more rows

Plot line charts

Now that we have the data, we can plot the line chart using the function geom_line from ggplot2 package. For plotting a graph with ggplot2 we first need to call a new ggplot object with the function ggplot, and then we add more information to this object step by step using the operator +. I see it as layers where I define the information I want to plot and its format, so when I add geom_line in the code I am telling R that the data should be connected with a line. If I were also interested, for example, in adding each data as a points in the graph, I would add another line with geom_point.

There are three ways to initialize a ggplot object with ggplot, I would recommend you to read the help to understand the process. Here, I am using the same data and the same set of aesthetics in all plots so I define then in this step. We need a dataframe (pop_LAU) and also to specify what data are going to be plotted in the axes (i.e. x = years and y = population change in percentage). Then, we also need to define the parameter (colour) into the aesthetic for plotting each group of data (i.e. ancestry = Danish, Foreign, and Total population) with different colours. Note that this variable is printed in the legend, with the same order and labels we defined before. Finally, we can plot the variation by municipality (LAU_NAME) using the function facet_wrap.

ggplot(data = pop_LAU,
       mapping = aes(x = date,
                     y = pop_pct_2008,
                     colour = ancestry)) +
  geom_line(size = 1) +
  facet_wrap(~LAU_NAME, ncol = 2)

We already have the plot; however, there are some aspects we can clearly improve on it. For example, the municipalities in the graph are sorted alphabetically but I think the figure could be more informative if we sort them base on the total population in 2020 (even more if we would like to plot the 98 municipalities in the same plot). A way to do that is to use the forcats:fct_reorder2 function before calculating the percentage of population change. It takes a factor variable (i.e. LAU_NAME) and arranges it base on the value of other variables (Total = total population and date). The argument last2 finds the last value of y (i.e. total population) when sorted by x (date).

pop_LAU <- pop_LAU %>% 
  # Pivot data to wide format
  pivot_wider(c(LAU_NAME, date), names_from = ancestry, values_from = pop) %>%
  # Sort LAUs by Total population in 2020-Q4
  mutate(LAU_NAME = factor(LAU_NAME), 
         LAU_NAME = fct_reorder2(.f = LAU_NAME,
                                 .x = date,
                                 .y = Total,
                                 .fun = last2)) %>% 
  # Long format
  pivot_longer(-c(LAU_NAME, date), names_to = "ancestry", values_to = "pop") %>% 
  # Percentage of change (baseline 2008-Q1)
  group_by(LAU_NAME, ancestry) %>% 
  arrange(LAU_NAME, date) %>% 
  mutate(pop_pct_2008 = (pop/first(pop) - 1) * 100) %>% 
  ungroup()

Therefore when we plot the graph we can see that Copenhagen appears first in it, then Aarhus, Aalborg, and Odense. In this sense, we can see the importance of each municipality (in terms of population) just only by looking for its position in the chard. Here, with only four municipalities, it may not be very useful but with all the 98 municipalities it can make a different in the interpretation (e.g. easily detect urban and rural areas).

p <- ggplot(data = pop_LAU,
       mapping = aes(x = date,
                     y = pop_pct_2008,
                     colour = ancestry)) +
  geom_line(size = 1) +
  facet_wrap(~LAU_NAME, ncol = 2)

p

There are still some things we can improve in the figure. First, the legend text is too large ant it seems more adequate to put the legend at the bottom. Then, we can change the text of the x axis to clearly annotate that it is in percentage. Regarding the horizontal axis, it is clear that it represents years so we may not need to name it. Furthermore, it can be useful to only plot the last to digits of the year. I have also change the colours of the lines for using a colour blind friendly palette. Finally, we can add titles and captions for explaining what is presented in the graph.

p1 <- p +
  # change default theme
  theme_bw() +
  # Put legend at the bottom of the plot
  theme(legend.position = "bottom") +
  # Format y axis
  scale_y_continuous(name = "[%]") +
  # Format x axis
  scale_x_date(name = "", date_breaks = "3 year", date_labels = "%y") +
  # Change line colours
  scale_colour_manual(name = "Ancestry",
                      values = c("#009E73", "#D55E00", "#000000")) +
  # add titles
  labs(title = "Population change by LAU and ancestry (2008 - 2020)",
       subtitle = "Municipalities sorted by total population in 2020",
       caption = "Data source: Statistics Denmark\nAuthors: J. Elio (@Elio_Javi), C. Keßler, H.S. Hansen. Aalborg University, Department of Planning") 

p1

Now we have a nice figure but we can tune it a bit more by defining the non-data components of the plot with theme. Here, I show you an example of a customize plot. However, we can make changes on almost everything and I would recommend you to play a bit with it to get a better idea of the different options we have and to find your personal preferences.

p1 +
  theme(plot.title = element_text(size = 14, colour = "darkblue", face = "bold"),
        plot.caption = element_text(size = 8, colour = "grey25"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 10),
        legend.title = element_text(size = 10, face = "bold"),
        legend.text = element_text(size = 10),
        plot.background = element_rect(fill = "linen", colour = NA),
        panel.background = element_rect(fill = "grey85", colour = NA),
        legend.background = element_rect(fill = "linen", color = NA),
        legend.key = element_rect(fill = "linen"),
        strip.text = element_text(size = 12, color = "white", face = "bold.italic"),
        strip.background = element_rect(color = "black", fill = "#FC4E07", linetype = "solid")
        )

Notes

I have created this post during my work as postdoctoral researcher at Aalborg University, in the project “Global flows of migrants and their impact on north European welfare states - FLOW”.

It is not endorsed by the university or the project, and it is not maintained. All the data I use here are public, and my only aim is that the post serves for learning R. For more information about migration and the project outcomes please visit the project’s website: https://www.flow.aau.dk.

Javier Elío
Javier Elío
Associate Professor

My research interests include environmental sciences and data analysis.

Related