2020: Day 4

tidyverse

Author

Ella Kaye

Published

December 4, 2020

Setup

The original challenge

My data

Part 1

library(dplyr)
library(tidyr)
library(stringr)

Using readr::read_tsv() straight away removes the blank lines, which makes it impossible to tell where one passport ends and the next begins. Reading in the data via readLines() and then converting with as_tibble() preserves them, and lets us use tidyverse functions for the remaining tidying. cumsum() on a logical vector takes advantage of FALSE having a numeric value of zero and TRUE having a numeric value of one, so the running total goes up by one at every blank line and gives each passport its own ID.

passports <- 
  readLines(here::here("2020", "day", "4", "input")) %>%
  as_tibble() %>%
  separate_rows(value, sep = " ") %>%
  mutate(new_passport = value == "") %>%
  mutate(ID = cumsum(new_passport) + 1) %>%
  filter(!new_passport) %>%
  select(-new_passport) %>%
  separate(value, c("key", "value"), sep = ":") %>%
  relocate(ID)
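
To see the cumsum() trick in isolation, here is a minimal base R sketch on a made-up four-line input (the values are illustrative, not taken from my data). Each blank line bumps the running total, so every line after it gets the next ID.

```r
# Toy input: two passports separated by a blank line (made-up values)
lines <- c("ecl:gry pid:860033327", "byr:1937", "", "iyr:2013 ecl:amb")

# TRUE marks the separator between passports
is_blank <- lines == ""

# FALSE counts as 0 and TRUE as 1, so the running sum steps up at each blank
id <- cumsum(is_blank) + 1

# Drop the separator lines and keep the ID alongside each remaining line
toy <- data.frame(ID = id[!is_blank], value = lines[!is_blank])
toy
```

The blank line itself is assigned to the second passport, which does no harm because it is filtered out straight afterwards, just as in the pipeline above.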

Our data is now in three columns: ID, key and value. Next we need to find the number of passports that have all seven fields once cid is excluded:

passports %>%
  filter(key != "cid") %>%
  count(ID) %>%
  filter(n == 7) %>%
  nrow()

[1] 210
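
As a small illustration of why cid has to be dropped before counting, here is a sketch on made-up data. Passport 2 has seven fields in total, but only because cid is standing in for the missing hgt. The toy tibble and the `required` vector are assumptions for this example, and the approach relies on no key being repeated within a passport.

```r
library(dplyr)

# Made-up long-format data: passport 1 has all seven required keys plus cid,
# passport 2 is missing hgt but does have cid
required <- c("byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid")
toy <- tibble(
  ID  = c(rep(1, 8), rep(2, 7)),
  key = c(required, "cid", setdiff(required, "hgt"), "cid")
)

# Same steps as above: drop cid, count rows per ID, keep IDs with seven rows
n_complete <- toy %>%
  filter(key != "cid") %>%
  count(ID) %>%
  filter(n == 7) %>%
  nrow()

n_complete
```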

Part 2: Valid passports

Now we need to add data validation checks:

byr (Birth Year) - four digits; at least 1920 and at most 2002.
iyr (Issue Year) - four digits; at least 2010 and at most 2020.
eyr (Expiration Year) - four digits; at least 2020 and at most 2030.
hgt (Height) - a number followed by either cm or in:
    If cm, the number must be at least 150 and at most 193.
    If in, the number must be at least 59 and at most 76.
hcl (Hair Color) - a # followed by exactly six characters 0-9 or a-f.
ecl (Eye Color) - exactly one of: amb blu brn gry grn hzl oth.
pid (Passport ID) - a nine-digit number, including leading zeroes.
cid (Country ID) - ignored, missing or not.
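
The year and height rules above can be sketched as small predicate functions. This is only an illustration of the rules as written, not the approach taken in the rest of this post: `valid_year()` and `valid_height()` are hypothetical helpers, and they are stricter than strictly necessary in that they insist on exactly four digits and on the unit coming straight after the number.

```r
library(stringr)

# Hypothetical helper: TRUE when `value` is exactly four digits and
# the year lies between `min` and `max` inclusive
valid_year <- function(value, min, max) {
  four_digits <- str_detect(value, "^[0-9]{4}$")
  year <- suppressWarnings(as.integer(value))
  four_digits & !is.na(year) & year >= min & year <= max
}

# Hypothetical helper: TRUE for a number followed by cm (150 to 193)
# or by in (59 to 76), and FALSE for anything else
valid_height <- function(value) {
  parts <- str_match(value, "^([0-9]+)(cm|in)$")
  number <- as.integer(parts[, 2])
  unit <- parts[, 3]
  ok <- (unit == "cm" & number >= 150 & number <= 193) |
    (unit == "in" & number >= 59 & number <= 76)
  # Values that did not match the pattern give NA, which we treat as invalid
  !is.na(ok) & ok
}

valid_year(c("2002", "2003"), 1920, 2002)
valid_height(c("60in", "190cm", "190in", "190"))
```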

Ignoring the cid field, we narrow down to passports that have at least the right number of fields, and extract the number from the hgt values into a new hgt_value column:

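complete_passports <- passports %>%
  filter(key != "cid") %>%
  # add_count() keeps every row and adds n, the number of rows for that ID
  add_count(ID) %>%
  filter(n == 7) %>%
  select(-n) %>%
  # numeric part of the height, NA for all other keys
  mutate(hgt_value = case_when(
    key == "hgt" ~ readr::parse_number(value),
    TRUE ~ NA_real_))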
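
For reference, a quick sketch of what readr::parse_number() does with height strings. It drops any non-numeric characters before and after the first number, so the unit is discarded and has to be checked separately. The input values here are made up.

```r
library(readr)

# Made-up height values in the same format as the hgt field
heights <- c("150cm", "59in", "193cm")

# Keeps only the numeric part of each string
hgt_value <- parse_number(heights)
hgt_value
```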

Then we create a check column, which is TRUE when the value for each key meets the required conditions. Passports with seven TRUEs are valid. Note that case_when() leaves the check column as NA, rather than FALSE, for any row that matches none of the conditions, which is why we need na.rm = TRUE in the call to sum(). We could get round that by adding a final line TRUE ~ FALSE to the case_when() call. The TRUE on the left is a catch-all for all remaining rows not covered by the conditions above, and the FALSE on the right is the value we set them to, but I find the line TRUE ~ FALSE unintuitive.

complete_passports %>%
  mutate(check = case_when(
    # value is character, so R compares these as strings,
    # which orders four-digit years correctly
    key == "byr" & value >= 1920 & value <= 2002 ~ TRUE,
    key == "iyr" & value >= 2010 & value <= 2020 ~ TRUE,
    key == "eyr" & value >= 2020 & value <= 2030 ~ TRUE,
    key == "hgt" & str_detect(value, "cm") & hgt_value >= 150 & hgt_value <= 193 ~ TRUE,
    key == "hgt" & str_detect(value, "in") & hgt_value >= 59 & hgt_value <= 76 ~ TRUE,
    key == "hcl" & str_detect(value, "^#[a-f0-9]{6}$") ~ TRUE,
    key == "ecl" & value %in% c("amb", "blu", "brn", "gry", "grn", "hzl", "oth") ~ TRUE,
    key == "pid" & str_detect(value, "^[0-9]{9}$") ~ TRUE
  )) %>%
  # check is NA for rows that matched no condition, hence na.rm = TRUE
  group_by(ID) %>%
  summarise(check_all = sum(check, na.rm = TRUE)) %>%
  filter(check_all == 7) %>%
  nrow()

[1] 131
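
As an aside on the NA behaviour of case_when(), here is a small sketch on made-up data comparing a call with no fallback against one with an explicit fallback. It uses the .default argument, which is available from dplyr 1.1.0 and reads a little more naturally than TRUE ~ FALSE. The toy tibble and the shortened list of eye colours are assumptions for the example.

```r
library(dplyr)

# Made-up rows: a valid ecl, an invalid ecl and a valid pid
toy <- tibble(
  key   = c("ecl", "ecl", "pid"),
  value = c("amb", "zzz", "000000001")
)

res <- toy %>%
  mutate(
    # No fallback: rows matching no condition become NA
    check_na = case_when(
      key == "ecl" & value %in% c("amb", "blu") ~ TRUE,
      key == "pid" & grepl("^[0-9]{9}$", value) ~ TRUE
    ),
    # Explicit fallback: rows matching no condition become FALSE
    check_default = case_when(
      key == "ecl" & value %in% c("amb", "blu") ~ TRUE,
      key == "pid" & grepl("^[0-9]{9}$", value) ~ TRUE,
      .default = FALSE
    )
  )

res
```

With the fallback in place, sum(check_default) works without na.rm = TRUE.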

Session info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Sonoma 14.0
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2023-11-06
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
 quarto   1.4.466 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)

 [1] /Users/ellakaye/Library/R/arm64/4.3/library
 [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────