class: title-slide, left, bottom # Relational data ---- ## **Session 8** ### ### .right-column[ ] --- # Relational data It's rare to find all the data you need for an analysis in a single table. </br> Typically, you'll have to link two (or more) tables together by matching on common "key" variable(s). .green[We use joins in SQL or R, VLOOKUPs in Excel] --- # Relational data Here, we'll focus on left (outer) joins. </br> The syntax is similar for other types of joins. --- # left_join() Keep structure of table x …and match to observations in table y "key" variable (common to both tables) ```r x %>% left_join(y, by = "id") ``` <img class="center" src="data:image/png;base64,#img/session08/left-join.gif" width="40%"/> source: https://github.com/gadenbuie/tidyexplain --- # Relational Data We're going to join two tables one with cases of tuberculosis by country, one with population by country. </br> From this new table we could derive a rate. --- class: center, middle # Please Import .blue[tb_cases.csv] .blue[tb_pop.csv] and .blue[tb_new_table.csv] --- # left_join Keep the original structure of the tb_cases data frame …then match to rows in tb_pop based on "country" value ```r tb_cases %>% left_join(tb_pop, by = "country") ``` --- # Duplicates! For every country value in tb_cases, there are 4 in tb_pop ``` ## # A tibble: 64 × 5 ## country year.x cases year.y pop ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 1999 19987071 ## 2 Afghanistan 1999 745 2000 20595360 ## 3 Afghanistan 1999 745 2001 21347782 ## 4 Afghanistan 1999 745 2002 22202806 ## 5 Brazil 1999 37737 1999 172006362 ## 6 Brazil 1999 37737 2000 174504898 ## 7 Brazil 1999 37737 2001 176968205 ## 8 Brazil 1999 37737 2002 179393768 ## 9 China 1999 212258 1999 1272915272 ## 10 China 1999 212258 2000 1280428583 ## # … with 54 more rows ``` --- # Join on multiple rows match on two (or more) variables ```r tb_cases %>% left_join(tb_pop, by = c("country", "year")) ``` -- </br> </br> ### You may recall from an earlier slide You can create a vector with function: c() for concatenate/combine ```r c(100, 80, 200) c("beds", "staff", "patients") ``` --- # Default combine Dplyr join verbs find the same named columns and list in the output what it has joined on ## Joining, by = c("country", "year") This can be copied from the console ```r tb_cases %>% left_join(tb_pop) ``` ``` ## # A tibble: 16 × 4 ## country year cases pop ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Brazil 1999 37737 172006362 ## 3 China 1999 212258 1272915272 ## 4 Denmark 1999 170 5319410 ## 5 Afghanistan 2000 2666 20595360 ## 6 Brazil 2000 80488 174504898 ## 7 China 2000 213766 1280428583 ## 8 Denmark 2000 171 5338283 ## 9 Afghanistan 2001 4639 21347782 ## 10 Brazil 2001 37491 176968205 ## 11 China 2001 212766 1287890449 ## 12 Denmark 2001 124 5354684 ## 13 Afghanistan 2002 6509 22202806 ## 14 Brazil 2002 40723 179393768 ## 15 China 2002 194972 1295322020 ## 16 Denmark 2002 135 5368994 ``` --- # Joining with different names If two tables have .blue[different] names for .blue[same] variable: ```r tb_cases %>% left_join(tb_new_table, by = c("country" = "Place", "year" = "Yr")) ``` tb_new_table is the same table as tb_cases but with a column Joined. --- # Other joins Like SQL, dplyr has left_join, inner_join and full_join. Joins that are slightly different to SQL are: -- ### [Semi-join](https://stackoverflow.com/questions/42249690/what-is-semi-join-in-database) All rows from x where there are matching values in y, keeping just columns from x. <img class="center" src="data:image/png;base64,#img/session08/semi-join.gif" width="40%"/> --- # Use Case Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test. Join to the tb_new_table and bring back only those records where the column FirstLetter is 'A' but nothing from tb_new_table. ```r tb_cases %>% semi_join(tb_new_table %>% filter(FirstLetter == "A"), by = c("country" = "Place", "year" = "Yr")) ``` ``` ## # A tibble: 4 × 3 ## country year cases ## <chr> <dbl> <dbl> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Afghanistan 2001 4639 ## 4 Afghanistan 2002 6509 ``` --- ### [Anti-join](https://stackoverflow.com/questions/28702960/find-complement-of-a-data-frame-anti-join) All rows from x where there are not matching values in y, keeping just columns from x. <img class="center" src="data:image/png;base64,#img/session08/anti-join.gif" width="40%"/> --- # Use Case In text mining, exclude 'stop words' (but, and, or) which are kept in a separate table for reference. Join to the tb_new_table and exclude only those records where the column FirstLetter is 'A' . ```r tb_cases %>% anti_join(tb_new_table %>% filter(FirstLetter == "A"), by = c("country" = "Place", "year" = "Yr")) ``` ``` ## # A tibble: 12 × 3 ## country year cases ## <chr> <dbl> <dbl> ## 1 Brazil 1999 37737 ## 2 China 1999 212258 ## 3 Denmark 1999 170 ## 4 Brazil 2000 80488 ## 5 China 2000 213766 ## 6 Denmark 2000 171 ## 7 Brazil 2001 37491 ## 8 China 2001 212766 ## 9 Denmark 2001 124 ## 10 Brazil 2002 40723 ## 11 China 2002 194972 ## 12 Denmark 2002 135 ``` --- #### This work is licensed as </br> Creative Commons </br> Attribution </br> ShareAlike 4.0 </br> International </br> To view a copy of this license, visit </br> https://creativecommons.org/licenses/by/4.0/