class: title-slide, left, bottom <img src="data:image/png;base64,#img/session06/dplyr_wrangling.PNG" width="40%"/> # Data wrangling with dplyr ---- ## **Session 6** ### ### .right-column[ ] .footnote[Artwork by @allison_horst] --- class: center, middle # Wrangling </br> Reshaping or transforming data </br> into a format which is easier to work with </br> (…for later visualisation, modelling, </br> or computing of statistics…) --- # A note on "tidy" data Tidyverse functions work best with tidy data: 1. Each variable forms a column. 2. Each observation forms a row. </br> (Broadly, this means long rather than wide tables) --- class: center, middle # The tool: dplyr package .blue[(dee-ply-r)] is a language for data manipulation </br> </br> Most wrangling puzzles can be solved with </br> knowledge of just 5 dplyr verbs .blue[(5 functions)]. </br> </br> These verbs will be the subject of this session. --- # Project 2: ## Exploring Mental Health (MH) Inpatient Capacity -- </br> The following is some analysis of Mental Health inpatient capacity in England. </br> As part of this, we will be looking at the changes in the number (and occupancy) of MH beds available. -- ### Background Maintaining clinical effectiveness and safety when a ward is fully occupied is a serious challenge for staff. </br> Inappropriate out of area placements have an added cost and also mean patients are separated from their social support networks. --- # The Data: </br> KH03 returns (bed numbers and occupancy) by organisation, published by NHS England. </br> Scraped from the NHSE statistics website: </br> https://www.england.nhs.uk/statistics/statistical-work-areas/bed-availability-and-occupancy/bed-data-overnight/ .blue[partially cleaned] --- # Start a new script File/New script or shortcut keys <kbd>Ctrl + Shift + N </kbd> .left-column[ Load the data beds_data.csv ] .right-column[ <img class="center" src="data:image/png;base64,#img/session06/new-script.PNG"/> ] --- # Less friendly csvs <img class="center" src="data:image/png;base64,#img/session06/beds-data-wizard.PNG"/> --- # Less friendly csvs <img class="center" src="data:image/png;base64,#img/session06/beds-data-wizard-2.png"/> --- # Less friendly csvs Note you will have to move the cursor to another area for the number to have an effect. <img class="center" src="data:image/png;base64,#img/session06/beds-data-wizard-skip.PNG"/> --- # Less friendly csvs <img class="center" src="data:image/png;base64,#img/session06/beds-data-wizard-date.png"/> --- # Less friendly csvs <img class="center" src="data:image/png;base64,#img/session06/beds-data-date-wizard.PNG"/> --- # Look at the data This is real data so there are real issues (which we'll work with) <table> <thead> <tr> <th style="text-align:left;"> date </th> <th style="text-align:left;"> org_code </th> <th style="text-align:left;"> org_name </th> <th style="text-align:right;"> beds_av </th> <th style="text-align:right;"> occ_av </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1A </td> <td style="text-align:left;"> Worcestershire Health And Care </td> <td style="text-align:right;"> 129 </td> <td style="text-align:right;"> 117 </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1C </td> <td style="text-align:left;"> Solent </td> <td style="text-align:right;"> 105 </td> <td style="text-align:right;"> 82 </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1E </td> <td style="text-align:left;"> Staffordshire And Stoke On Trent Partnership </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1F </td> <td style="text-align:left;"> Isle Of Wight </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 42 </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1H </td> <td style="text-align:left;"> Barts Health </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1J </td> <td style="text-align:left;"> Gloucestershire Care Services </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> </tbody> </table> --- # Dplyr 5 key verbs will help us gain a deeper understanding of our data sets. Note summarise() can also be spelt summarize() ```r dplyr::arrange() dplyr::filter() dplyr::mutate() dplyr::group_by() dplyr::summarise() ``` --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Starts with an...] </br> potato then </br> peel then </br> slice into medium sized pieces then </br> boil for 25 minutes then </br> mash --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Start with an R object] </br> **potato** then </br> peel then </br> slice into medium sized pieces then </br> boil for 25 minutes then </br> mash --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Start with an R object] </br> **potato** then </br> **peel()** then </br> slice into medium sized pieces then </br> boil for 25 minutes then </br> mash --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Start with an R object] </br> **potato** then </br> **peel()** then </br> **slice(size = "medium")** then </br> boil for 25 minutes then </br> mash --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Start with an R object] </br> **potato** then </br> **peel()** then </br> **slice(size = "medium")** then </br> **boil(time = 25)** then </br> .blue[output = mashed potato] --- # Building with steps .grey[These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato:] .blue[input object] </br> **potato** %>% </br> **peel()** %>% </br> **slice(size = "medium")** %>% </br> **boil(time = 25)** .blue[output = mashed potato] --- # Tidyverse syntax .blue[input object] data_frame .blue[then] </br>.green[do_this(]rules.green[)] .blue[then] </br>.green[do_this(]rules.green[)] .blue[output = new data frame] --- # Tidyverse syntax </br> data_frame .blue[%>%] </br>.green[do_this(]rules.green[)] .blue[%>%] </br>.green[do_this(]rules.green[)] --- # Tidyverse syntax </br> Combine simple pieces to solve complex puzzles data_frame .blue[%>%] </br>.green[do_this(]rules.green[)] .blue[%>%] </br>.green[do_this(]rules.green[)] --- class: inverse, middle, center #Using dplyr --- class: inverse, center, middle # Q1. Which organisation provided the highest number of Mental Health (MH) beds? --- .blue[# 1. arrange] Reorder rows based on select variable </br> <img class="center" src="data:image/png;base64,#img/session06/arrange-verb-code.PNG" width="40%"/> ```r beds_data %>% arrange(beds_av) ``` -- </br> Shortcut for %>% = <kbd>Ctrl + Shift + m</kbd> --- .blue[# 1. arrange] Reorder rows based on select variable... defaults to ascending order ```r beds_data %>% arrange(beds_av) ``` ``` ## # A tibble: 4,558 × 5 ## date org_code org_name beds_av occ_av ## <date> <chr> <chr> <dbl> <dbl> ## 1 2018-09-01 RAL Royal Free London 2 2 ## 2 2018-06-01 RAL Royal Free London 4 4 ## 3 2015-09-01 RAL Royal Free London 5 2 ## 4 2013-09-01 RAL Royal Free London 6 1 ## 5 2013-12-01 RAL Royal Free London 6 3 ## 6 2014-03-01 RAL Royal Free London 6 2 ## 7 2014-06-01 RAL Royal Free London 6 3 ## 8 2014-09-01 RAL Royal Free London 6 2 ## 9 2014-12-01 RAL Royal Free London 6 2 ## 10 2015-03-01 RAL Royal Free London 6 3 ## # … with 4,548 more rows ``` --- .blue[# 1. arrange] We need descending order: ```r beds_data %>% * arrange(desc(beds_av)) ``` ``` ## # A tibble: 4,558 × 5 ## date org_code org_name beds_av occ_av ## <date> <chr> <chr> <dbl> <dbl> ## 1 2013-09-01 RHA Nottinghamshire Healthcare 1050 929 ## 2 2013-12-01 RHA Nottinghamshire Healthcare 1002 888 ## 3 2014-03-01 RHA Nottinghamshire Healthcare 993 875 ## 4 2014-06-01 RHA Nottinghamshire Healthcare 991 857 ## 5 2015-03-01 RHA Nottinghamshire Healthcare 977 821 ## 6 2014-09-01 RHA Nottinghamshire Healthcare 971 844 ## 7 2014-12-01 RHA Nottinghamshire Healthcare 939 829 ## 8 2015-12-01 RWK East London 925 757 ## 9 2015-09-01 RWK East London 919 762 ## 10 2016-09-01 RWK East London 916 770 ## # … with 4,548 more rows ``` </br> desc() works for text and numeric variables --- class: inverse, center, middle # Q2. Which 2 organisations provided the highest number of MH beds in September 2018? --- class: center, middle # .darkgrey[Q2. Which 2 organisations provided the .blue[highest number] of MH beds in September 2018?] .pull-left[ We'll use arrange() as before] --- class: center, middle # .darkgrey[Q2. Which 2 organisations provided the .blue[highest number] of MH beds .blue[in September 2018?]] .pull-left[ .darkgrey[We'll use arrange() as before] </br> </br> But we require only observations with this date ] --- .blue[# 2. filter] pick observations by their value <img class="center" src="data:image/png;base64,#img/session06/filter-verb-code.PNG" width="40%"/> -- ## == is a test of equality ```r beds_data %>% * filter(date == "2018-09-01") ``` The expression inside brackets should return TRUE or FALSE. We are choosing rows where this expression is TRUE. To exclude and test where the expression is NOT equal `!=` ```r beds_data %>% * filter(date != "2018-09-01") ``` --- .blue[# 2. filter] pick observations by their value ```r beds_data %>% arrange(desc(beds_av)) %>% * filter(date == "2018-09-01") ``` --- class: inverse, center, middle # Q3. Which 5 organisations had the highest percentage bed occupancy in September 2018? --- class: center, middle # .darkgrey[Q3. Which 5 organisations had the .blue[highest] percentage bed occupancy in September 2018?] We'll use arrange() as before --- class: center, middle # .darkgrey[Q3. Which 5 organisations had the .blue[highest] percentage bed occupancy in .blue[September 2018?]] .darkgrey[We'll use arrange() as before] </br> We'll use filter() as before --- class: center, middle # .darkgrey[Q3. Which 5 organisations had the highest .blue[percentage bed occupancy] in September 2018?] .darkgrey[We'll use arrange() as before] </br> .darkgrey[We'll use filter() as before] </br> We don't have this variable... -- but we can create it --- .blue[# 3. mutate] create .green[new variables] from .green[existing ones] ```r beds_data %>% * mutate(perc_occ = occ_av / beds_av) ``` </br> perc_occ is a new named column from existing occ_av and beds_av columns -- ## = is NOT a test of equality --- .blue[# 3. mutate] create .green[new variables] from .green[existing ones] ```r beds_data %>% * mutate(perc_occ = occ_av / beds_av) %>% filter(date == "2018-09-01") %>% * arrange(desc(perc_occ)) ``` We can refer to variables we've just created above --- class: inverse, center, middle # Q4. What was the mean number of beds, (across all trusts) for each value of date? --- class: center, middle # .darkgrey[Q4. What was the .blue[mean number of beds], (across all trusts) for each value of date?] Let's first look at how we'd produce summary statistics like a mean... --- .blue[# 4. summarise] collapse many values into a single summary value ```r beds_data %>% * summarise(means_beds = mean(beds_av)) ``` </br> Similar in syntax to mutate: means_beds is the new column name </br> mean() function uses the existing column beds_av -- The result is NA ``` ## # A tibble: 1 × 1 ## means_beds ## <dbl> ## 1 NA ``` --- .blue[# 4. summarise] collapse many values into a single summary value ```r beds_data %>% summarise(means_beds = mean(beds_av, * na.rm = TRUE)) ``` </br> We'll need to remove NA values to get a suitable mean. TRUE can also be T --- .blue[# 4. summarise] collapse many values into a single summary value ```r beds_data %>% summarise(means_beds = mean(beds_av, * na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## means_beds ## <dbl> ## 1 300. ``` This code produces a single summary value for the whole dataset --- class: inverse, middle, center # Q4. What was the mean number of beds (across all trusts) for each value of date? --- class: middle, center #.darkgrey[Q4. What was the .blue[mean number of beds] (across all trusts) for each value of date?] .blue[Now we know how to use summarise...] --- class: middle, center #.darkgrey[Q4. What was the mean number of beds (across all trusts) .blue[for each value of date?]] .darkgrey[Now we know how to use summarise...] </br> .blue[We'll produce a summary value for **each value of date**] --- .blue[# 5. group_by] For each group... ```r beds_data %>% * group_by(date) %>% summarise(mean_beds = mean(beds_av, na.rm = TRUE)) ``` For each value of date... </br> something must follow a group_by() as it does nothing to the output alone. The change occurs behind the scenes. --- .blue[# 5. group_by] For each group... .darkgrey[summarise() produces a single summary value] ```r beds_data %>% * group_by(date) %>% summarise(mean_beds = mean(beds_av, na.rm = TRUE)) ``` ``` ## # A tibble: 21 × 2 ## date mean_beds ## <date> <dbl> ## 1 2013-09-01 324. ## 2 2013-12-01 322. ## 3 2014-03-01 319. ## 4 2014-06-01 320. ## 5 2014-09-01 318. ## 6 2014-12-01 315. ## 7 2015-03-01 314. ## 8 2015-06-01 294. ## 9 2015-09-01 296. ## 10 2015-12-01 296. ## # … with 11 more rows ``` --- .blue[# 5a. ungroup] Often it's safest to remove the grouping after you've performed the required operation <img class="center" src="data:image/png;base64,#img/session06/group_by_ungroup.png" width="40%"/> ```r beds_data %>% group_by(date) %>% summarise(mean_beds = mean(beds_av, na.rm = NA)) %>% ungroup() ``` --- class: inverse, middle, center # Q5. Which 5 organisations have the highest mean % bed occupancy? </br> -- # (over the 5 year period) --- class: middle, center # .darkgrey[Q5. Which 5 organisations have the highest mean .blue[% bed occupancy]?] .blue[Create a new variable mutate()] --- class: middle, center # .darkgrey[Q5. Which 5 .blue[organisations] have the highest mean % bed occupancy?] .darkgrey[Create a new variable mutate()] </br> </br> .blue[Then, for each of the organisations group_by()] --- class: middle, center # .darkgrey[Q5. Which 5 organisations have the highest .blue[mean] % bed occupancy?] .darkgrey[Create a new variable mutate() </br> </br> Then, for each of the organisations group_by() </br> </br> .blue[Summary statistic using summarise()]] --- class: middle, center # .darkgrey[Q5. Which 5 organisations have the .blue[highest] mean % bed occupancy?] .darkgrey[Create a new variable mutate() </br> </br> Then, for each of the organisations group_by() </br> </br> Summary statistic using summarise() </br> </br> .blue[Order to find highest by using arrange()]] --- class: middle, center # .darkgrey[Q5. Which 5 organisations have the .blue[highest] mean % bed occupancy?] .darkgrey[Create a new variable mutate() </br> </br> Then, for each of the organisations group_by() </br> </br> Summary statistic using summarise() </br> </br> Order to find highest by using arrange()] </br> .pull-left[ .blue[Tip: Run the code after each new line to check it returns the output you'd expect.] ] --- class: inverse, middle, center # Over to you... --- class: middle, center # Q5. Which 5 organisations have the highest mean % bed occupancy? Create a new variable mutate() </br> </br> Then, for each of the organisations group_by() </br> </br> Summary statistic using summarise() </br> </br> Order to find highest by using arrange() </br> .pull-left[ .blue[Tip: Run the code after each new line to check it returns the output you'd expect.] ] .pull-right[ Hint: ```r beds_data %>% mutate() %>% group_by() %>% summarise() %>% arrange() ``` ] --- # Solution ```r beds_data %>% mutate(perc_occ = occ_av/beds_av) %>% group_by(org_name) %>% summarise(mean_pocc = mean(perc_occ, na.rm = TRUE)) %>% arrange(desc(mean_pocc)) ``` ``` ## # A tibble: 255 × 2 ## org_name mean_pocc ## <chr> <dbl> ## 1 Barnet, Enfield And Haringey Mental Health 0.982 ## 2 Bradford District Care Trust 0.972 ## 3 Camden And Islington 0.953 ## 4 Hertfordshire Partnership University 0.953 ## 5 North Essex Partnership University 0.945 ## 6 Devon Partnership 0.945 ## 7 Sussex Partnership 0.942 ## 8 Essex Partnership University 0.942 ## 9 Manchester Mental Health And Social Care Trust 0.940 ## 10 Birmingham And Solihull Mental Health 0.935 ## # … with 245 more rows ``` --- # Extension How many distinct organisation names are counted under the same organisation code? ```r beds_data %>% mutate(perc_occ = occ_av/beds_av) %>% group_by(org_name) %>% summarise(mean_pocc = mean(perc_occ, na.rm = TRUE), # Number of data points/rows by group * number = n()) %>% arrange(desc(mean_pocc)) ``` ``` ## # A tibble: 255 × 3 ## org_name mean_pocc number ## <chr> <dbl> <int> ## 1 Barnet, Enfield And Haringey Mental Health 0.982 21 ## 2 Bradford District Care Trust 0.972 3 ## 3 Camden And Islington 0.953 21 ## 4 Hertfordshire Partnership University 0.953 9 ## 5 North Essex Partnership University 0.945 15 ## 6 Devon Partnership 0.945 21 ## 7 Sussex Partnership 0.942 21 ## 8 Essex Partnership University 0.942 6 ## 9 Manchester Mental Health And Social Care Trust 0.940 14 ## 10 Birmingham And Solihull Mental Health 0.935 21 ## # … with 245 more rows ``` --- # Extension How many columns associated with each observation? Adds another summary column. ```r beds_data %>% group_by(org_code) %>% summarise(# distinct number of org_name * distinct_number = n_distinct(org_name)) %>% filter(distinct_number > 1) %>% arrange(desc(distinct_number)) ``` ``` ## # A tibble: 22 × 2 ## org_code distinct_number ## <chr> <int> ## 1 RDE 3 ## 2 RTG 3 ## 3 R1K 2 ## 4 RA9 2 ## 5 RBN 2 ## 6 RD1 2 ## 7 RD8 2 ## 8 RDU 2 ## 9 RGM 2 ## 10 RGN 2 ## # … with 12 more rows ``` --- .blue[# 6. select] select a subset of variables .darkgrey[from a existing data set] ```r beds_data %>% select(org_code, org_name) ``` --- .blue[# 6. select] select a subset of variables .darkgrey[from a existing data set] ```r beds_data %>% select(-org_code) ``` .green[To remove a column] --- .blue[# 6. select] select a subset of variables .darkgrey[from a existing data set] ```r beds_data %>% select(1:3) ``` .green[You can also refer to columns by number, here 1:3 saves having to type 1, 2, 3] --- .blue[# 6. select] select a subset of variables .darkgrey[from a existing data set] ```r beds_data %>% select(org_name, everything()) ``` .green[org_name appears first, followed by everything else - and isn't repeated] --- # Not dplyr... ... but works within dplyr [Vectors](https://datascienceplus.com/vectors-and-functions-in-r/) are the simplest type of data structure in R. Simply put, a vector is a sequence of data elements of the same basic type. -- You can create a vector with function: c() for concatenate/combine ```r c(100, 80, 200) c("beds", "staff", "patients") c("beds", 80, "patients") # Mixing strings (characters) and numeric values results in all being strings ``` --- # Very useful in these circumstances ```r c("Bradford District Care", "Bradford District Care Trust") ``` </br> Filter by org_name IN the lookup list. ```r beds_data %>% * filter(org_name %in% c("Bradford District Care", "Bradford District Care Trust")) ``` </br> Filter by org_name NOT IN the lookup list. ```r beds_data %>% * filter(!org_name %in% c("Bradford District Care", "Bradford District Care Trust")) ``` --- # For SQL users Code would look like this in SQL ```sql SELECT * FROM Table WHERE Colm IN ('Bradford District Care', 'Bradford District Care Trust') ``` And ```sql SELECT * FROM Table WHERE Colm NOT IN ('Bradford District Care', 'Bradford District Care Trust') ``` --- #### This work is licensed as </br> Creative Commons </br> Attribution </br> ShareAlike 4.0 </br> International </br> To view a copy of this license, visit </br> https://creativecommons.org/licenses/by/4.0/