+ - 0:00:00
Notes for current slide
Notes for next slide

Relational data


Session 8

1 / 16

Relational data

It's rare to find all the data you need for an analysis in a single table.


Typically, you'll have to link two (or more) tables together by matching on common "key" variable(s).

We use joins in SQL or R, VLOOKUPs in Excel

2 / 16

Relational data

Here, we'll focus on left (outer) joins.


The syntax is similar for other types of joins.

3 / 16

left_join()

Keep structure of table x

…and match to observations in table y

"key" variable (common to both tables)

x %>%
left_join(y, by = "id")

source: https://github.com/gadenbuie/tidyexplain

4 / 16

Relational Data

We're going to join two tables one with cases of tuberculosis by country, one with population by country.


From this new table we could derive a rate.

5 / 16

Please Import

tb_cases.csv

tb_pop.csv

and

tb_new_table.csv

6 / 16

left_join

Keep the original structure of the tb_cases data frame

…then match to rows in tb_pop

based on "country" value

tb_cases %>%
left_join(tb_pop, by = "country")
7 / 16

Duplicates!

For every country value in tb_cases, there are 4 in tb_pop

## # A tibble: 64 × 5
## country year.x cases year.y pop
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 1999 19987071
## 2 Afghanistan 1999 745 2000 20595360
## 3 Afghanistan 1999 745 2001 21347782
## 4 Afghanistan 1999 745 2002 22202806
## 5 Brazil 1999 37737 1999 172006362
## 6 Brazil 1999 37737 2000 174504898
## 7 Brazil 1999 37737 2001 176968205
## 8 Brazil 1999 37737 2002 179393768
## 9 China 1999 212258 1999 1272915272
## 10 China 1999 212258 2000 1280428583
## # … with 54 more rows
8 / 16

Join on multiple rows

match on two (or more) variables

tb_cases %>%
left_join(tb_pop, by = c("country", "year"))
9 / 16

Join on multiple rows

match on two (or more) variables

tb_cases %>%
left_join(tb_pop, by = c("country", "year"))



You may recall from an earlier slide

You can create a vector with function: c() for concatenate/combine

c(100, 80, 200)
c("beds", "staff", "patients")
9 / 16

Default combine

Dplyr join verbs find the same named columns and list in the output what it has joined on ## Joining, by = c("country", "year")

This can be copied from the console

tb_cases %>%
left_join(tb_pop)
## # A tibble: 16 × 4
## country year cases pop
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Denmark 1999 170 5319410
## 5 Afghanistan 2000 2666 20595360
## 6 Brazil 2000 80488 174504898
## 7 China 2000 213766 1280428583
## 8 Denmark 2000 171 5338283
## 9 Afghanistan 2001 4639 21347782
## 10 Brazil 2001 37491 176968205
## 11 China 2001 212766 1287890449
## 12 Denmark 2001 124 5354684
## 13 Afghanistan 2002 6509 22202806
## 14 Brazil 2002 40723 179393768
## 15 China 2002 194972 1295322020
## 16 Denmark 2002 135 5368994
10 / 16

Joining with different names

If two tables have different names for same variable:

tb_cases %>%
left_join(tb_new_table, by = c("country" = "Place",
"year" = "Yr"))

tb_new_table is the same table as tb_cases but with a column Joined.

11 / 16

Other joins

Like SQL, dplyr has left_join, inner_join and full_join.

Joins that are slightly different to SQL are:

12 / 16

Other joins

Like SQL, dplyr has left_join, inner_join and full_join.

Joins that are slightly different to SQL are:

Semi-join

All rows from x where there are matching values in y, keeping just columns from x.

12 / 16

Use Case

Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test.

Join to the tb_new_table and bring back only those records where the column FirstLetter is 'A' but nothing from tb_new_table.

tb_cases %>%
semi_join(tb_new_table %>%
filter(FirstLetter == "A"), by = c("country" = "Place",
"year" = "Yr"))
## # A tibble: 4 × 3
## country year cases
## <chr> <dbl> <dbl>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Afghanistan 2001 4639
## 4 Afghanistan 2002 6509
13 / 16

Anti-join

All rows from x where there are not matching values in y, keeping just columns from x.

14 / 16

Use Case

In text mining, exclude 'stop words' (but, and, or) which are kept in a separate table for reference.

Join to the tb_new_table and exclude only those records where the column FirstLetter is 'A' .

tb_cases %>%
anti_join(tb_new_table %>%
filter(FirstLetter == "A"), by = c("country" = "Place",
"year" = "Yr"))
## # A tibble: 12 × 3
## country year cases
## <chr> <dbl> <dbl>
## 1 Brazil 1999 37737
## 2 China 1999 212258
## 3 Denmark 1999 170
## 4 Brazil 2000 80488
## 5 China 2000 213766
## 6 Denmark 2000 171
## 7 Brazil 2001 37491
## 8 China 2001 212766
## 9 Denmark 2001 124
## 10 Brazil 2002 40723
## 11 China 2002 194972
## 12 Denmark 2002 135
15 / 16

This work is licensed as


Creative Commons
Attribution
ShareAlike 4.0
International
To view a copy of this license, visit
https://creativecommons.org/licenses/by/4.0/

16 / 16

Relational data

It's rare to find all the data you need for an analysis in a single table.


Typically, you'll have to link two (or more) tables together by matching on common "key" variable(s).

We use joins in SQL or R, VLOOKUPs in Excel

2 / 16
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow