To support date-based analyses, infection records will be aggregated by MMWR week and year. For both the Los Angeles County and California datasets, we will generate two new columns: mmwr_year and mmwr_week, then remove the original date-based fields. We will also add columns start_date and end_date to serve as reference points should we need them later.
In the California dataset, the field time_int encodes the year and MMWR week as a six-digit integer (YYYYWW). To create the new fields, we extract the first four digits as mmwr_year and the last two digits as mmwr_week, then drop the original time_int column.
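As a minimal sketch of the arithmetic involved, integer division and the modulo operator split a YYYYWW value into its two parts. The sample values below are made up for illustration and are not taken from the California dataset.

```r
# Illustrative values only (assumed, not from the real data).
time_int <- c(202001L, 202053L, 202152L)

# Integer division by 100 drops the last two digits, leaving the year.
mmwr_year <- time_int %/% 100L

# The remainder after dividing by 100 is the two-digit week.
mmwr_week <- time_int %% 100L

# If the result is stored as a factor, convert through character to get the
# numbers back; as.integer() on a factor returns level codes instead.
week_factor <- factor(mmwr_week)
week_number <- as.integer(as.character(week_factor))

print(data.frame(time_int, mmwr_year, mmwr_week))
```

The factor round trip matters for any later step that needs to do arithmetic on the week or year.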
In the Los Angeles County dataset, the codebook identifies the field dt_report as the last day of the MMWR week. However, this field contained only missing values, so it was removed. Instead, we convert the infection date field, dt_dx, to a proper date format and then use the MMWRweek package to derive mmwr_year and mmwr_week.
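To make the rule behind this derivation concrete, here is a base R sketch of how a date maps to an MMWR year and week. It is a stand-in for illustration, not the MMWRweek package itself: the function name mmwr_from_date() and the sample dt_dx strings are assumptions. It relies on the MMWR convention that weeks run Sunday through Saturday and belong to the year containing at least four of their days, which is the year of that week's Wednesday.

```r
# Hypothetical base R stand-in showing the MMWR rule; in practice the
# MMWRweek package would be used for this step.
mmwr_from_date <- function(d) {
  # POSIXlt wday: 0 = Sunday, ..., 6 = Saturday, so this is the week's Sunday.
  week_start <- d - as.POSIXlt(d)$wday
  # The Wednesday of the week decides which year the week belongs to.
  wed <- as.POSIXlt(week_start + 3)
  data.frame(
    mmwr_year = wed$year + 1900L,
    # yday is zero-based, so whole weeks elapsed + 1 gives the week number.
    mmwr_week = wed$yday %/% 7L + 1L
  )
}

# Made-up sample strings in the day-month-year layout ("%d%b%Y").
# Month abbreviations are locale-dependent; an English locale is assumed.
dt_dx <- c("04Jan2020", "31Dec2020", "03Jan2021")
date_fix <- as.Date(dt_dx, format = "%d%b%Y")

result <- mmwr_from_date(date_fix)
print(cbind(dt_dx, result))
```

Dates near the turn of the year are the cases worth checking, since the MMWR year can differ from the calendar year there.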
##-- California dataset:
step2_ca_df <- step1_ca_df %>%
  ##-- pull MMWR week and year from time_int field
  mutate(mmwr_year = factor(time_int %/% 100),
         mmwr_week = factor(time_int %% 100)) %>%
  add_start_end_dates() %>%
  select(-time_int) %>%
  relocate(mmwr_year, mmwr_week, start_date, end_date, .before = everything())

##-- LA county dataset:
step2_la_cnty_df <- step1_la_cnty_df %>%
  ##-- restructure to proper date format
  mutate(DATE_FIX = as.Date(parse_date_time(dt_dx, "%d%b%Y"), format = "%Y-%m-%d")) %>%
  ##-- use date to create new MMWR fields
  add_mmwr_week_columns(date_col = "DATE_FIX") %>%
  add_start_end_dates() %>%
  select(-c(DATE_FIX, dt_dx)) %>%
  relocate(mmwr_year, mmwr_week, start_date, end_date, .before = everything()) %>%
  relocate(county, .before = age_cat)
To streamline this process, we created two helper functions:
add_mmwr_week_columns(): takes a date column and adds two fields, mmwr_year and mmwr_week
add_start_end_dates(): uses those values to generate the corresponding MMWR week start and end dates
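The helper definitions are not shown here, so the following is only a sketch of how start and end dates could be derived from mmwr_year and mmwr_week in base R. The function mmwr_week_start() is hypothetical and is not the add_start_end_dates() helper. It assumes the MMWR convention that week 1 is the Sunday-to-Saturday week containing January 4.

```r
# Hypothetical helper: first day (Sunday) of a given MMWR week.
mmwr_week_start <- function(mmwr_year, mmwr_week) {
  # Convert through character so factor columns yield their labels,
  # not their internal level codes.
  yr <- as.integer(as.character(mmwr_year))
  wk <- as.integer(as.character(mmwr_week))
  jan4 <- as.Date(sprintf("%d-01-04", yr), format = "%Y-%m-%d")
  # POSIXlt wday: 0 = Sunday, so this steps back to the Sunday of week 1.
  week1_start <- jan4 - as.POSIXlt(jan4)$wday
  week1_start + 7 * (wk - 1)
}

# Made-up example rows, stored as factors like the columns above.
weeks <- data.frame(
  mmwr_year = factor(c(2020, 2021, 2023)),
  mmwr_week = factor(c(1, 1, 26))
)
weeks$start_date <- mmwr_week_start(weeks$mmwr_year, weeks$mmwr_week)
# An MMWR week ends on the Saturday six days after it starts.
weeks$end_date <- weeks$start_date + 6

print(weeks)
```

Because the pipeline stores mmwr_year and mmwr_week as factors, any real implementation would need a similar conversion step before doing date arithmetic.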
The dataframes now have a structure that looks like this: