Appendix B — Custom Functions

clean()

Description:

Removes unwanted whitespace and formatting inconsistencies from character data.
This utility function converts the input to a character vector, trims leading and trailing spaces, and collapses multiple internal spaces into a single space.

Usage:

df %>% dplyr::mutate(x = clean(x))

Details:

The function is designed to standardize text fields before joining, grouping, or comparison operations.

It combines as.character(), stringr::str_trim(), and stringr::str_squish() for efficient whitespace cleanup.

Returned Value:

A character vector with trimmed and normalized spacing.
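
A minimal sketch of how such a function might be written, assuming the three calls named in the Details are simply chained in order. The body below is an illustration, not the actual implementation:

```r
library(stringr)

# Hypothetical sketch of clean(): coerce to character, then normalize whitespace.
clean <- function(x) {
  x <- as.character(x)
  x <- str_trim(x)   # drop leading and trailing whitespace
  str_squish(x)      # collapse runs of internal whitespace to a single space
}

clean(c("  Los   Angeles ", "San  Diego"))
```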

add_mmwr_week_columns()

Description:

Adds two new columns (mmwr_year and mmwr_week) to a dataset based on a specified date column. The function uses the MMWRweek package to convert calendar dates into the standardized MMWR (Morbidity and Mortality Weekly Report) week and year format commonly used in public health surveillance. It is particularly useful for aligning time series data to standardized epidemiological reporting periods.

Usage:

add_mmwr_week_columns(data, date_col = "date")

Arguments:

data : A data frame containing a date field.
date_col : Name of the column containing the date (the column must be of class Date).

Details:

The function extracts the specified date column, converts each date to its corresponding MMWR week and year using MMWRweek::MMWRweek(), and appends these as two new factor columns: mmwr_year and mmwr_week.

Returned Columns:

mmwr_year: MMWR reporting year
mmwr_week: MMWR reporting week number
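
The steps described in the Details can be sketched as follows. This is a minimal illustration that assumes the date column is already of class Date; the function body is a guess at the approach rather than the actual code. MMWRweek::MMWRweek() returns a data frame with MMWRyear, MMWRweek, and MMWRday columns.

```r
library(MMWRweek)

# Hypothetical sketch: append MMWR year and week as factor columns.
add_mmwr_week_columns <- function(data, date_col = "date") {
  mmwr <- MMWRweek(data[[date_col]])        # one row per input date
  data$mmwr_year <- factor(mmwr$MMWRyear)
  data$mmwr_week <- factor(mmwr$MMWRweek)
  data
}

cases <- data.frame(date = as.Date(c("2024-01-01", "2024-07-04")), n = c(3, 8))
add_mmwr_week_columns(cases)
```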

add_start_end_dates()

Description:

Adds the start and end dates corresponding to each MMWR week and year.
This function uses the MMWRweek::MMWRweek2Date() function to convert MMWR week-year combinations into actual calendar dates, making it easier to interpret and visualize epidemiological data over time.

Usage:

add_start_end_dates(data, year_col = "mmwr_year", week_col = "mmwr_week")

Arguments:

data: A data frame containing MMWR week and year columns
year_col: The column name for MMWR year. Defaults to “mmwr_year”
week_col: The column name for MMWR week. Defaults to “mmwr_week”

Details:

The function extracts MMWR year and week values from the specified columns and converts them to corresponding start (Sunday) and end (Saturday) calendar dates using the MMWRweek2Date() function from the MMWRweek package. By CDC definition, an MMWR week runs from Sunday through Saturday.

This is useful for labeling plots, aligning data with other time series, or verifying that reporting periods match expected calendar intervals.

Returned Columns:

start_date: Calendar date of the first day of the MMWR week (Sunday)
end_date: Calendar date of the last day of the MMWR week (Saturday)
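
A minimal sketch of the conversion, assuming the end date is derived by adding six days to the start date (an assumption; the actual function may compute it differently). MMWRweek2Date() requires numeric inputs, so factor columns are converted through character first. With MMWRday left at its default, it returns the first day of the requested week.

```r
library(MMWRweek)

# Hypothetical sketch: convert MMWR year/week pairs to start and end dates.
add_start_end_dates <- function(data, year_col = "mmwr_year", week_col = "mmwr_week") {
  # Columns may be stored as factors, so go through character before numeric.
  yr <- as.numeric(as.character(data[[year_col]]))
  wk <- as.numeric(as.character(data[[week_col]]))
  data$start_date <- MMWRweek2Date(MMWRyear = yr, MMWRweek = wk)
  data$end_date <- data$start_date + 6   # last day of the same 7-day week
  data
}

weeks <- data.frame(mmwr_year = factor(c(2024, 2024)), mmwr_week = factor(c(1, 2)))
add_start_end_dates(weeks)
```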

calc_rates_rng_grp()

Description:

Simplifies and standardizes the process of calculating crude infection rates (per 100,000 population) across multiple demographic and geographic groupings (e.g., by county or health officer region and by sex, age category, race/ethnicity, etc.).

Usage:

calc_rates_rng_grp(
  df = combined_df,
  group_var = age_cat,
  pop_col = total_age_pop,
  week = 52,
  geo_level = "region"  # one of "region", "county", or "statewide"
)

Arguments:

df : The combined data frame.
group_var : A single categorical variable that defines the demographic grouping, such as sex, age_cat, or race_short.
pop_col : The column that contains the population denominator used to calculate rates for the chosen group_var.
week : MMWR week to use for calculations. Defaults to 52. The function first restricts df to rows where “mmwr_week == week”.
region_col : Name of the column that identifies the Health Officer Region. Defaults to “health_officer_region”.
county_col : Name of the column that identifies the county. Defaults to “county”.
infected_col : Name of the column containing cumulative infection counts. Defaults to “cumulative_infected”.
severe_col : Name of the column containing cumulative severe infection counts. Defaults to “cumulative_severe”.
geo_level : Geographic level for aggregation. One of:
- “region”: sums across counties within each Health Officer Region and group
- “county”: keeps results at the county and group level
- “statewide”: first collapses to a single row per county and group, then sums across counties to create statewide totals by group, and stamps “statewide” into the region and county columns for joining with other outputs.

Returned Output:

For each call, the function returns one row per geographic unit and group, with:

group_var

Name of the grouping variable (for example “sex”, “age_cat”, “race_short”).

group_var_cat

The category value within that grouping (for example “Female”, “18–44”, “Hispanic”).

group_pop

Population denominator for that geographic unit and group.

cumulative_infected, cumulative_severe

Numerators summed at the requested geographic level.

inf_rate_100k, sev_rate_100k

Crude rates per 100,000, computed as:
- (1e5 * count / group_pop) when group_pop > 0 and zero otherwise.

total_ca_pop, total_group_pop

Quality control checkpoint columns. total_ca_pop is the total California population value carried forward, and total_group_pop is the sum of group_pop across all rows in the returned dataset. The two values should match.
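
The rate rule described above (1e5 * count / group_pop, with zero returned when the denominator is not positive) can be illustrated in base R. The helper name crude_rate_100k() and the toy data are hypothetical and exist only for this example:

```r
# Hypothetical helper illustrating the rate rule; not the function's actual code.
crude_rate_100k <- function(count, group_pop) {
  # Guard against zero or negative denominators by returning 0.
  ifelse(group_pop > 0, 1e5 * count / group_pop, 0)
}

rates <- data.frame(
  county = c("A", "B", "C"),
  cumulative_infected = c(120, 45, 3),
  group_pop = c(240000, 90000, 0)
)
rates$inf_rate_100k <- crude_rate_100k(rates$cumulative_infected, rates$group_pop)
rates
```

Returning zero for an empty denominator keeps downstream tables free of NaN and Inf values, at the cost of hiding groups that have counts but no recorded population.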

make_rate_table()

Description:

Builds a styled summary table for crude rates with optional row filtering and a highlighted rate column. The function selects grouping columns and rate columns, drops unwanted columns, sorts by a chosen metric, optionally keeps the top n rows, and returns a formatted knitr_kable/kableExtra table.

Usage:

make_rate_table(
  df,
  ...,
  pop_col = total_pop,
  rate_pattern = "_rate($|_)",
  drop_cols = "unrec_rate_100k",
  sort_by = "sev_rate_100k",
  top_n = NULL,
  highlight_col = "sev_rate_100k",
  highlight_bg = "#5ce1e6"
)

Arguments:

df: Data frame that contains grouping variables, a population column, and one or more rate columns
...: One or more grouping columns to display (for example county, health_officer_region)
pop_col: Unquoted population column to include in the table. Uses tidy-eval (e.g., total_pop)
rate_pattern: Regular expression used by dplyr::matches() to select rate columns. The default matches column names that end in _rate or contain _rate_ (for example sev_rate_100k)
drop_cols: Character vector of column names to remove after selection
sort_by: Column name (character) used for descending sort. Must exist in df after selection
top_n: Integer. If supplied, keeps only the first n rows after sorting
highlight_col: Column name (character) to visually highlight with column_spec(). If not found, no highlight is applied
highlight_bg: Hex color for the highlighted column background

Returned Object:

A kableExtra table object for display in HTML documents.
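
To see which column names the default rate_pattern would select, the regular expression can be tried directly with base R's grepl(). The column names below are made up for illustration. Note that dplyr::matches() ignores case by default, whereas grepl() does not, so results could differ for mixed-case names.

```r
rate_pattern <- "_rate($|_)"

# Hypothetical column names, used only to demonstrate the pattern.
cols <- c("county", "total_pop", "inf_rate_100k", "sev_rate_100k",
          "crude_rate", "rate", "inf_rates")

# TRUE where the name contains "_rate" at the end or followed by "_".
selected <- cols[grepl(rate_pattern, cols)]
selected
```

Names such as rate (no leading underscore) and inf_rates (extra trailing character) are not selected by this pattern.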

create_EQ_lbl()

Description:

Uses the classIntervals() function to create logical breaks for visualizing cases and rates. Specifically, it applies the “Equal Interval” binning method, which is especially helpful for highlighting outliers, or the more extreme ends, of a distribution. A couple of small helper functions are required when the compact argument is set to TRUE. This option shortens the labels, which can help simplify maps and charts. For example, instead of returning “456,000 - 886,000”, the label is shortened to “456K - 886K”.

Usage:

create_EQ_lbl(
  data,
  var,
  n = 6,
  new_col = NULL,
  round_fn = floor,
  compact = TRUE
)

Arguments:

data: Data frame or tibble that contains the numeric variable to be binned.
var: Unquoted numeric column to classify into equal-interval bins. Uses tidy-eval (for example inf_rate_100k).
n: Integer. Number of equal-interval classes to create. Passed to classInt::classIntervals() as the n argument. Default is 6.
new_col: Optional character string for the name of the new factor column. If NULL, the function appends “_eq” to the name of var (for example inf_rate_100k_eq).
round_fn: Rounding function applied to break values before label creation. Common choices include floor, ceiling, or round. Default is floor.
compact: Logical. If TRUE, uses a helper to create compact labels (for example “0 - 5K”, “5K - 10K”, “20K - 25K+”) which are often easier to read on maps and charts. If FALSE, returns full numeric ranges such as “0 - 5,000”.
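
A rough sketch of the binning and compact labeling, assuming classInt::classIntervals() is called with style = "equal". The helper compact_num(), the toy data, and the labeling logic are hypothetical stand-ins for the real helper functions, which are not shown in this appendix:

```r
library(classInt)

# Hypothetical compact formatter: 456000 -> "456K", 2500000 -> "2.5M".
compact_num <- function(x) {
  ifelse(abs(x) >= 1e6, paste0(round(x / 1e6, 1), "M"),
    ifelse(abs(x) >= 1e3, paste0(round(x / 1e3, 1), "K"), as.character(x)))
}

rates <- data.frame(inf_rate_100k = c(0, 100000, 250000, 600000))

# Equal-interval breaks: n classes of identical width between min and max.
brks <- classIntervals(rates$inf_rate_100k, n = 3, style = "equal")$brks
brks <- floor(brks)  # plays the role of round_fn

# Build one "low - high" label per class from consecutive break values.
labels <- paste(compact_num(head(brks, -1)), "-", compact_num(tail(brks, -1)))
rates$inf_rate_100k_eq <- cut(rates$inf_rate_100k, breaks = brks,
                              labels = labels, include.lowest = TRUE)
rates
```

One caveat with this sketch: applying floor to the breaks can push the top break below the maximum when the data are not whole numbers, which would leave the largest value unclassified. A real implementation would need to handle that case.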