Loads, merges, reformats, corrects, and labels the REDCap file used for the MSK-IMPACT Prostate clinical database.
The returned list should next be processed by check_prostate_redcap().
load_prostate_redcap(
  labeled_csv,
  deidentify = TRUE,
  keep_also = list(baseline = NULL, sample = NULL, freeze = NULL)
)
labeled_csv: CSV file with labels, exported from REDCap. Must be the labeled version and must contain dates so that time intervals can be derived.
deidentify: De-identify the returned data set using deidentify_prostate_redcap()? Defaults to TRUE. Should only be disabled if additional data need to be merged by identifiers, followed by calling deidentify_prostate_redcap() separately.
keep_also: Optional. Additional patient-level variables to keep without editing. Where applicable, they need to be deidentified manually. Provide as a list with vectors of variable names for the baseline and freeze forms: list(baseline = c("var1", "var2"), freeze = "varX").
List of three labeled tibbles (data frames):
pts: Patient-level data
smp: Sample-level data
trt: Treatment data
Access variable labels in RStudio via View() or using attr(., "label").
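As an illustration of the attr(., "label") access mentioned above, the following sketch builds a small stand-in data frame and reads its labels. The data, the label texts, and the helper name get_labels() are assumptions made for this example; they are not part of the package.

```r
# Toy stand-in for the 'pts' tibble; the label texts are made up for illustration.
pts <- data.frame(ptid = 1:2, age_dx = c(43.4, 42.3))
attr(pts$ptid, "label") <- "Patient ID"
attr(pts$age_dx, "label") <- "Age at diagnosis (years)"

# Label of a single variable
attr(pts$age_dx, "label")

# Hypothetical helper: labels of all variables as a named character vector.
# Variables without a label are returned as NA.
get_labels <- function(data) {
  vapply(
    data,
    function(x) {
      lbl <- attr(x, "label", exact = TRUE)
      if (is.null(lbl)) NA_character_ else as.character(lbl)
    },
    character(1)
  )
}

all_labels <- get_labels(pts)
all_labels
```

With the list returned by load_prostate_redcap(), the same helper could presumably be applied to each of the three elements, for example get_labels(pts_smp$pts).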
The warning message, Duplicated column names deduplicated, is expected due to the design of the REDCap dataset. Another warning message that a factor does not contain all levels is also possible.
The following edits and assumptions are made:
Potentially incomplete date variables are converted to date format, using guessdate().
Various missingness indicators in strings and factors, c("Unknown / Not Reported", "N/A", "NA", "Unknown", "X", "x"), are converted to NA.
"Undetectable" PSA is set to 0, PSA ">x" is set to x + 1, and PSA "a-b" (e.g., 4.5-4.7) is set to the mean of the two values.
Clinical T and N stage variables are set to missing if M1.
Event dates and follow-up time for metastases (met_date), castration resistance (crpc_date), and death are set:
  Event date is the last clinic visit (lastvisit) if a CRPC/metastases event has not occurred.
  Event date is the last follow-up/contact (lastfu) if the last known survival status is alive.
  If stage is M1 and the recorded metastasis date differs from the diagnosis date by no more than 1 month, met_date is set to the diagnosis date (dxdate).
  If the sample has a variant histology (e.g., neuroendocrine), the castration resistance date (crpc_date) is the date of diagnosis and the event indicator for survival analyses (event_crpc) is NA.
Time intervals for these three survival outcomes are calculated from the time of sequencing. For late-entry survival models, time intervals from diagnosis to sequencing and from sample/biopsy to sequencing are also provided.
Disease extent at sampling, distinguishing CRPC from castration-sensitive disease, is based on the sample date and the date of castration resistance. If the sample was obtained before the CRPC date, or CRPC did not occur, the sample is from castration-sensitive disease by definition.
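Two of the rules listed above, the recoding of missingness indicators and the handling of non-numeric PSA entries, can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's internal code: the helper names recode_missing() and parse_psa() are hypothetical, and the exact spellings the package accepts for entries such as "Undetectable" may differ.

```r
# Missingness indicators as listed in the documentation above
missing_codes <- c("Unknown / Not Reported", "N/A", "NA", "Unknown", "X", "x")

# Hypothetical helper: replace missingness indicators in a character vector with NA
recode_missing <- function(x) {
  x[x %in% missing_codes] <- NA_character_
  x
}

# Hypothetical helper: convert free-text PSA entries to numeric values
parse_psa <- function(x) {
  x <- trimws(recode_missing(as.character(x)))
  out <- rep(NA_real_, length(x))

  # "Undetectable" is set to 0 (case-insensitive matching is an assumption)
  undetectable <- !is.na(x) & tolower(x) == "undetectable"
  out[undetectable] <- 0

  # ">x" is set to x + 1
  greater <- !is.na(x) & grepl("^>", x)
  out[greater] <- as.numeric(sub("^>\\s*", "", x[greater])) + 1

  # "a-b" is set to the mean of a and b
  is_range <- !is.na(x) & grepl("^[0-9.]+\\s*-\\s*[0-9.]+$", x)
  bounds <- strsplit(x[is_range], "\\s*-\\s*")
  out[is_range] <- vapply(bounds, function(b) mean(as.numeric(b)), numeric(1))

  # Everything else is read as a plain number; unparseable text becomes NA
  plain <- !is.na(x) & !undetectable & !greater & !is_range
  out[plain] <- suppressWarnings(as.numeric(x[plain]))
  out
}

# Undetectable -> 0, ">100" -> 101, "4.5-4.7" -> mean of 4.5 and 4.7,
# "6.2" stays as is, "Unknown" -> NA
psa_parsed <- parse_psa(c("Undetectable", ">100", "4.5-4.7", "6.2", "Unknown"))
psa_parsed
```

Handling each pattern through a separate logical mask keeps the rules independent of one another, so an entry that matches none of them falls through to NA rather than raising an error.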
Overview of analysis-ready data elements: https://stopsack.github.io/prostateredcap/articles/dataelements.html
# Get path to toy data provided by the package:
example_csv_file <- system.file("extdata",
  "SampleGUPIMPACTDatab_DATA_LABELS_2021-05-26.csv",
  package = "prostateredcap",
  mustWork = TRUE)
# Load data:
pts_smp <- load_prostate_redcap(labeled_csv = example_csv_file)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `stage_detailed = fct_relevel(...)`.
#> Caused by warning:
#> ! 5 unknown levels in `f`: T1/T2 NX M0, T3 N0 M0, T3 NX M0, T4 M0, and TX NX M0
#> Warning: Expected 4 pieces. Missing pieces filled with `NA` in 2 rows [3, 10].
#> Warning: There was 1 warning in `transmute()`.
#> ℹ In argument: `smp_tissue = fct_collapse(smp_tissue, Visceral = c("Liver",
#> "Lung"))`.
#> Caused by warning:
#> ! Unknown levels in `f`: Liver
#> Warning: There were 2 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `dzextent_seq = fct_relevel(...)`.
#> Caused by warning:
#> ! 2 unknown levels in `f`: Regional nodes and Metastatic, variant histology
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `rx_dzextent = fct_relevel(...)`.
#> Caused by warning:
#> ! 6 unknown levels in `f`: Localized, Regional nodes, Metastatic
#> hormone-sensitive, Non-metastatic castration-resistant, Metastatic
#> castration-resistant, and Metastatic, variant histology
# Access patient-level data:
pts_smp$pts
#> # A tibble: 8 × 59
#>    ptid complete_pts age_dx race           race4 race3 ethnicity smoking smoke01
#>   <int> <chr>         <dbl> <fct>          <fct> <fct> <fct>     <fct>     <dbl>
#> 1     1 Complete       43.4 White          White White NOT Hisp… Never         0
#> 2     2 Complete       42.3 White          White White NOT Hisp… Never         0
#> 3     3 Complete       60.7 Black or Afri… Blac… Black NOT Hisp… Never         0
#> 4     4 Complete       NA   NA             NA    NA    NOT Hisp… Never         0
#> 5     5 Complete       63.4 White          White White NOT Hisp… Current       1
#> 6     6 Complete       49.1 White          White White NOT Hisp… Never         0
#> 7     7 Complete       59.6 Asian          Asian Asian NOT Hisp… Never         0
#> 8     8 Complete       65.2 White          White White NOT Hisp… Never         0
#> # ℹ 50 more variables: bx_gl_sum <dbl>, bx_gl <fct>, bx_gl34 <fct>,
#> # bx_gl_maj <dbl>, bx_gl_min <dbl>, psa_dx <dbl>, psa_dxcat <fct>,
#> # lnpsa_dx <dbl>, clin_t <fct>, clin_n <fct>, clin_m <fct>,
#> # stage_detailed <fct>, stage <fct>, clin_tstage <fct>, clin_nstage <fct>,
#> # mstage <fct>, rxprim <fct>, rxprim_oth <chr>, rxprim_rp <lgl>,
#> # rxprim_adt <lgl>, rxprim_chemo <lgl>, rxprim_xrt <lgl>, rxprim_other <lgl>,
#> # rp_gl_sum <dbl>, rp_gl34 <fct>, rp_gl_maj <dbl>, rp_gl_min <dbl>, …
# Access sample-level data:
pts_smp$smp
#> # A tibble: 12 × 41
#>     ptid complete_smp dmpid    hist_smp hist_cmt dzextent_smp dzextent2 ext_pros
#>    <int> <chr>        <chr>    <fct>    <chr>    <fct>        <fct>     <fct>
#>  1     1 Complete     P-12345… Adenoca… NA       Metastatic … Metastat… FALSE
#>  2     2 Complete     P-23456… Adenoca… NA       Metastatic … Metastat… TRUE
#>  3     3 Complete     P-34567… Adenoca… NA       Localized    Localized FALSE
#>  4     3 Complete     P-34567… Adenoca… NA       Metastatic … Metastat… FALSE
#>  5     4 Complete     P-43219… Adenoca… NA       Metastatic … Metastat… FALSE
#>  6     4 Complete     P-43219… Adenoca… NA       Metastatic … Metastat… FALSE
#>  7     5 Complete     P-54321… Adenoca… NA       Metastatic … Metastat… TRUE
#>  8     5 Complete     P-54321… Adenoca… NA       Metastatic … Metastat… TRUE
#>  9     6 Complete     P-67895… Adenoca… NA       Metastatic … Metastat… FALSE
#> 10     7 Complete     P-77320… Adenoca… NA       Regional no… Regional… FALSE
#> 11     8 Complete     P-88321… Adenoca… NA       Metastatic … Metastat… TRUE
#> 12     8 Complete     P-88321… Adenoca… NA       Metastatic … Metastat… FALSE
#> # ℹ 33 more variables: ext_lndis <fct>, ext_bone <fct>, ext_vis <fct>,
#> # ext_liver <fct>, ext_lung <fct>, ext_other <fct>, bonevol <fct>,
#> # cntadt <fct>, tissue <fct>, smp_pros <fct>, smp_tissue <fct>,
#> # pur_rev <fct>, pur_remov <fct>, is_met_for_qc <fct>, dzextent_seq <fct>,
#> # primmet_smp <chr>, age_smp <dbl>, age_seq <dbl>, dx_smp_mos <dbl>,
#> # adt_smp_mos <dbl>, dx_seq_mos <dbl>, adt_seq_mos <dbl>, smp_met_mos <dbl>,
#> # smp_os_mos <dbl>, seq_met_mos <dbl>, seq_crpc_mos <dbl>, …
# Access treatment data:
pts_smp$trt
#> # A tibble: 0 × 19
#> # ℹ 19 variables: ptid <int>, rx_line <chr>, rx_name <fct>,
#> # rx_name_parpi <chr>, rx_censor <chr>, rx_stop_reason <fct>,
#> # rx_stop_reason_other <chr>, rx_dzextent <fct>, rx_ext_pros <fct>,
#> # rx_ext_lndis <fct>, rx_ext_bone <fct>, rx_ext_vis <fct>,
#> # rx_ext_liver <fct>, rx_ext_lung <fct>, rx_ext_other <fct>,
#> # rx_bonevol <fct>, dx_rx_start_mos <dbl>, dx_rx_end_mos <dbl>, rx_wks <dbl>
# Pass 'pts_smp' to check_prostate_redcap() next