For each participant identify the date of first diagnosis from all available electronic medical records & self-reported data.
If `use_baseline_dates=TRUE` (the default) then will also produce a binary 0/1 variable, indicating the controls (people without a diagnosis) and setting the date first `_df` field to the date of censoring (currently 30 October 2022).
Usage
get_df(
diagnosis_list,
prefix = NULL,
group_by = NULL,
include_selfrep_illness = TRUE,
include_death_cause = TRUE,
include_gp_clinical = TRUE,
include_hesin_diag = TRUE,
include_hesin_oper = TRUE,
include_cancer_registry = TRUE,
use_baseline_dates = TRUE,
file_paths = NULL,
censoring_date = "30-10-2022",
verbose = FALSE
)
Arguments
- diagnosis_list
A list of data frames. The output of `get_diagnoses()` i.e., the raw diagnosis and self-reported illness data that matched the provided codes list.
- prefix
String. Prefix to add to variable names (e.g., if prefix="chd" the output variables would be "chd_gp_df", "chd_hes_df", "chd_df" etc.)
default=NULL
- group_by
String. If the codes list provided to `get_diagnoses()` (i.e., in diagnosis_list$codes_df) contained a grouping/condition variable, indicate the variable name here. "Date first" variables will be created for each prefix in the grouping variable. The `prefix` option is ignored, in favour of the names in the grouping variable.
default=NULL
- include_selfrep_illness
logical. Include self-reported diagnosesin the combined Date First output? If present in `diagnosis_list` will still provide a separate `_df` variable
default=TRUE
- include_death_cause
logical. Include the cause of death in the combined Date First output? If present in `diagnosis_list` will still provide a separate `_df` variable
default=TRUE
- include_gp_clinical
logical. Include the GP data in the combined Date First output? If present in `diagnosis_list` will still provide a separate `_df` variable
default=TRUE
- include_hesin_diag
logical. Include the HES diagnosis data in the combined Date First output? If present in `diagnosis_list` will still provide a separate `_df` variable
default=TRUE
- include_hesin_oper
logical. Include the HES OPCS (operations) data in the combined Date First output? If present in `diagnosis_list` will still provide a separate `_df` variable
default=TRUE
- include_cancer_registry
logical. Include the cancer registry data in the combined Date First output? If present in `diagnosis_list` will still provide a separate `_df` variable
default=TRUE
- use_baseline_dates
logical. If `baseline_dates` available in file paths, produce a binary 0/1 variable, indicating the controls (people without a diagnosis) and setting the date first `_df` field to the date of censoring (currently see `censoring_date` option).
default=TRUE
- file_paths
A data frame. Columns must be `object` and `path` containing paths to outputted files. If not provided will use those in `ukbrapr_paths`
default=NULL
- censoring_date
A string. If using baseline data to infer control participants, include a censoring date (set to NA if not desired). Use dd-mm-yyyy format. Default is the (current) HES date.
default="30-10-2022"
- verbose
Logical. Be verbose,
default=FALSE
Value
Returns a single, "wide" data frame: the participant data for the requested diagnosis codes with "date first" `_df` variables. One for each source of data, and a combined variable.
Examples
###############################################
# example 1. haemochromatosis
# get diagnosis data - returns list of data frames (one per source)
diagnosis_list <- get_diagnoses(ukbrapR:::codes_df_hh)
# for each participant, get Date First diagnosed with the condition
diagnosis_df <- get_df(diagnosis_list, prefix="hh")
###############################################
# example 2. get multiple diseases at once
# don't have to all have the same code types/data sources
codes = rbind(ukbrapR:::codes_df_hh, ukbrapR:::codes_df_ckd)
print(codes)
# get diagnosis data - returns list of data frames (one per source)
diagnosis_list <- get_diagnoses(codes)
# for each participant, get Date First diagnosed with the condition
diagnosis_df <- get_df(diagnosis_list, group_by="condition")