Relational dataSession 8  
1 / 16

Relational data

It's rare to find all the data you need for an analysis in a single table.

Typically, you'll have to link two (or more) tables together by matching on common "key" variable(s).

We use joins in SQL or R, VLOOKUPs in Excel

2 / 16

Relational data

Here, we'll focus on left (outer) joins.

The syntax is similar for other types of joins.

3 / 16

left_join()

Keep structure of table x

…and match to observations in table y

"key" variable (common to both tables)

x %>% 
  left_join(y, by = "id")

source: https://github.com/gadenbuie/tidyexplain

4 / 16

Relational Data

We're going to join two tables one with cases of tuberculosis by country, one with population by country.

From this new table we could derive a rate.

5 / 16

Please Import

tb_cases.csv

tb_pop.csv

and

tb_new_table.csv

6 / 16

left_join

Keep the original structure of the tb_cases data frame

…then match to rows in tb_pop

based on "country" value

tb_cases %>% 
  left_join(tb_pop, by = "country")

7 / 16

Duplicates!

For every country value in tb_cases, there are 4 in tb_pop

## # A tibble: 64 × 5
##    country     year.x  cases year.y        pop
##    <chr>        <dbl>  <dbl>  <dbl>      <dbl>
##  1 Afghanistan   1999    745   1999   19987071
##  2 Afghanistan   1999    745   2000   20595360
##  3 Afghanistan   1999    745   2001   21347782
##  4 Afghanistan   1999    745   2002   22202806
##  5 Brazil        1999  37737   1999  172006362
##  6 Brazil        1999  37737   2000  174504898
##  7 Brazil        1999  37737   2001  176968205
##  8 Brazil        1999  37737   2002  179393768
##  9 China         1999 212258   1999 1272915272
## 10 China         1999 212258   2000 1280428583
## # … with 54 more rows

8 / 16

Join on multiple rows

match on two (or more) variables

tb_cases %>% 
  left_join(tb_pop, by = c("country", "year"))

9 / 16

Join on multiple rows

match on two (or more) variables

tb_cases %>% 
  left_join(tb_pop, by = c("country", "year"))

You may recall from an earlier slide

You can create a vector with function: c() for concatenate/combine

c(100, 80, 200)
c("beds", "staff", "patients")

9 / 16

Default combine

Dplyr join verbs find the same named columns and list in the output what it has joined on ## Joining, by = c("country", "year")

This can be copied from the console

tb_cases %>% 
  left_join(tb_pop)

## # A tibble: 16 × 4
##    country      year  cases        pop
##    <chr>       <dbl>  <dbl>      <dbl>
##  1 Afghanistan  1999    745   19987071
##  2 Brazil       1999  37737  172006362
##  3 China        1999 212258 1272915272
##  4 Denmark      1999    170    5319410
##  5 Afghanistan  2000   2666   20595360
##  6 Brazil       2000  80488  174504898
##  7 China        2000 213766 1280428583
##  8 Denmark      2000    171    5338283
##  9 Afghanistan  2001   4639   21347782
## 10 Brazil       2001  37491  176968205
## 11 China        2001 212766 1287890449
## 12 Denmark      2001    124    5354684
## 13 Afghanistan  2002   6509   22202806
## 14 Brazil       2002  40723  179393768
## 15 China        2002 194972 1295322020
## 16 Denmark      2002    135    5368994

10 / 16

Joining with different names

If two tables have different names for same variable:

tb_cases %>% 
  left_join(tb_new_table, by = c("country" = "Place",
                              "year" = "Yr"))

tb_new_table is the same table as tb_cases but with a column Joined.

11 / 16

Other joins

Like SQL, dplyr has left_join, inner_join and full_join.

Joins that are slightly different to SQL are:

12 / 16

Other joins

Like SQL, dplyr has left_join, inner_join and full_join.

Joins that are slightly different to SQL are:

Semi-join

All rows from x where there are matching values in y, keeping just columns from x.

12 / 16

Use Case

Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test.

Join to the tb_new_table and bring back only those records where the column FirstLetter is 'A' but nothing from tb_new_table.

tb_cases %>% 
  semi_join(tb_new_table %>% 
              filter(FirstLetter == "A"), by = c("country" = "Place",
                              "year" = "Yr"))

## # A tibble: 4 × 3
##   country      year cases
##   <chr>       <dbl> <dbl>
## 1 Afghanistan  1999   745
## 2 Afghanistan  2000  2666
## 3 Afghanistan  2001  4639
## 4 Afghanistan  2002  6509

13 / 16

Anti-join

All rows from x where there are not matching values in y, keeping just columns from x.

14 / 16

Use Case

In text mining, exclude 'stop words' (but, and, or) which are kept in a separate table for reference.

Join to the tb_new_table and exclude only those records where the column FirstLetter is 'A' .

tb_cases %>% 
  anti_join(tb_new_table %>% 
              filter(FirstLetter == "A"), by = c("country" = "Place",
                              "year" = "Yr"))

## # A tibble: 12 × 3
##    country  year  cases
##    <chr>   <dbl>  <dbl>
##  1 Brazil   1999  37737
##  2 China    1999 212258
##  3 Denmark  1999    170
##  4 Brazil   2000  80488
##  5 China    2000 213766
##  6 Denmark  2000    171
##  7 Brazil   2001  37491
##  8 China    2001 212766
##  9 Denmark  2001    124
## 10 Brazil   2002  40723
## 11 China    2002 194972
## 12 Denmark  2002    135

15 / 16

This work is licensed as

Creative Commons
Attribution
ShareAlike 4.0
International
To view a copy of this license, visit
https://creativecommons.org/licenses/by/4.0/

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Relational data

Session 8

Relational data

Relational data

left_join()

Relational Data

Please Import

left_join

Duplicates!

Join on multiple rows

Join on multiple rows

You may recall from an earlier slide

Default combine

Joining with different names

Other joins

Other joins

Semi-join

Use Case

Anti-join

Use Case

This work is licensed as

Relational data

Help