This step is only required once. To install (or update) the prostateredcap package from GitHub, use the remotes package:
install.packages("remotes") # skip if 'remotes' package is already installed
remotes::install_github("stopsack/prostateredcap")
The prostateredcap R package contains an example dataset of the prostate cancer database in the same format as it would be exported from REDCap as a “labeled CSV.” All data in the example dataset are designed to mimic real clinical data but do not correspond to any real patients.
First, load the dplyr package for data handling, and take a look at the raw example dataset provided as part of the prostateredcap package.
library(dplyr)
raw_data <- system.file("extdata",
"SampleGUPIMPACTDatab_DATA_LABELS_2021-05-26.csv",
package = "prostateredcap")
readr::read_csv(file = raw_data) %>%
print(max_extra_cols = 0) # do not print all other columns
#> # A tibble: 28 × 72
#> `Record ID` `Repeat Instrument` `Repeat Instance` `Birth Date` Race
#> <dbl> <chr> <dbl> <chr> <chr>
#> 1 1 NA NA 02/04/1956 White
#> 2 1 Sample Data 1 NA NA
#> 3 1 Freeze Data 1 NA NA
#> 4 2 NA NA 02/09/1974 White
#> 5 2 Sample Data 1 NA NA
#> 6 2 Freeze Data 1 NA NA
#> 7 3 NA NA 01/08/1953 Black or Afri…
#> 8 3 Sample Data 1 NA NA
#> 9 3 Sample Data 2 NA NA
#> 10 3 Freeze Data 1 NA NA
#> # ℹ 18 more rows
The dataset, as a typical REDCap export, contains multiple rows per person, with each of the REDCap “forms” (baseline data, sample data, …) in a separate row and blank values for variables not part of that “form.”
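To make this layout concrete, the following minimal sketch separates baseline rows from repeated sample rows in a small made-up table. The column names mirror the printed export above, but the values are invented, and this is only an illustration of the row structure, not how load_prostate_redcap() is implemented.

```r
library(dplyr)

# Toy data imitating the export layout; all values are made up
toy_export <- tibble(
  `Record ID` = c(1, 1, 3, 3, 3),
  `Repeat Instrument` = c(NA, "Sample Data", NA, "Sample Data", "Sample Data"),
  `Repeat Instance` = c(NA, 1, NA, 1, 2),
  Race = c("White", NA, "Asian", NA, NA)
)

# Baseline rows: one per person, recognizable by an empty "Repeat Instrument"
baseline <- toy_export %>%
  filter(is.na(`Repeat Instrument`)) %>%
  select(`Record ID`, Race)

# Sample rows: possibly several per person, with baseline variables left blank
samples <- toy_export %>%
  filter(`Repeat Instrument` == "Sample Data") %>%
  select(`Record ID`, `Repeat Instance`)

baseline
samples
```

In this toy table, the person with Record ID 3 contributes one baseline row and two sample rows.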
We will load the prostateredcap package, read in the same dataset again, and display its contents.
library(prostateredcap)
pts_smp <- load_prostate_redcap(raw_data)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `stage_detailed = fct_relevel(...)`.
#> Caused by warning:
#> ! 5 unknown levels in `f`: T1/T2 NX M0, T3 N0 M0, T3 NX M0, T4 M0, and TX NX M0
#> Warning: Expected 4 pieces. Missing pieces filled with `NA` in 2 rows
#> [3, 10].
#> Warning: There was 1 warning in `transmute()`.
#> ℹ In argument: `smp_tissue = fct_collapse(smp_tissue, Visceral = c("Liver",
#> "Lung"))`.
#> Caused by warning:
#> ! Unknown levels in `f`: Liver
#> Warning: There were 2 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `dzextent_seq = fct_relevel(...)`.
#> Caused by warning:
#> ! 2 unknown levels in `f`: Regional nodes and Metastatic, variant histology
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `rx_dzextent = fct_relevel(...)`.
#> Caused by warning:
#> ! 6 unknown levels in `f`: Localized, Regional nodes, Metastatic
#> hormone-sensitive, Non-metastatic castration-resistant, Metastatic
#> castration-resistant, and Metastatic, variant histology
These warnings are expected: the example data, which cover only 8 patients, do not contain all tumor/stage combinations.
load_prostate_redcap() has returned a list with two separate data elements:
- pts, the data frame with patient-level data.
- smp, the sample-level data frame. smp can have multiple rows per patient that can be merged with the patient-level data pts using the ptid variable present in both datasets.
The data in pts_smp is preprocessed. For example, rather than containing data on date of birth and date of diagnosis, the pts dataset contains age at diagnosis (age_dx, in years).
pts_smp$pts
#> # A tibble: 8 × 59
#> ptid complete_pts age_dx race race4 race3 ethnicity smoking smoke01
#> <int> <chr> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
#> 1 1 Complete 43.4 White White White NOT Hisp… Never 0
#> 2 2 Complete 42.3 White White White NOT Hisp… Never 0
#> 3 3 Complete 60.7 Black or Afri… Blac… Black NOT Hisp… Never 0
#> 4 4 Complete NA NA NA NA NOT Hisp… Never 0
#> 5 5 Complete 63.4 White White White NOT Hisp… Current 1
#> 6 6 Complete 49.1 White White White NOT Hisp… Never 0
#> 7 7 Complete 59.6 Asian Asian Asian NOT Hisp… Never 0
#> 8 8 Complete 65.2 White White White NOT Hisp… Never 0
#> # ℹ 50 more variables: bx_gl_sum <dbl>, bx_gl <fct>, bx_gl34 <fct>,
#> # bx_gl_maj <dbl>, bx_gl_min <dbl>, psa_dx <dbl>, psa_dxcat <fct>,
#> # lnpsa_dx <dbl>, clin_t <fct>, clin_n <fct>, clin_m <fct>,
#> # stage_detailed <fct>, stage <fct>, clin_tstage <fct>, clin_nstage <fct>,
#> # mstage <fct>, rxprim <fct>, rxprim_oth <chr>, rxprim_rp <lgl>,
#> # rxprim_adt <lgl>, rxprim_chemo <lgl>, rxprim_xrt <lgl>, rxprim_other <lgl>,
#> # rp_gl_sum <dbl>, rp_gl34 <fct>, rp_gl_maj <dbl>, rp_gl_min <dbl>, …
pts_smp$smp
#> # A tibble: 12 × 41
#> ptid complete_smp dmpid hist_smp hist_cmt dzextent_smp dzextent2 ext_pros
#> <int> <chr> <chr> <fct> <chr> <fct> <fct> <fct>
#> 1 1 Complete P-12345… Adenoca… NA Metastatic … Metastat… FALSE
#> 2 2 Complete P-23456… Adenoca… NA Metastatic … Metastat… TRUE
#> 3 3 Complete P-34567… Adenoca… NA Localized Localized FALSE
#> 4 3 Complete P-34567… Adenoca… NA Metastatic … Metastat… FALSE
#> 5 4 Complete P-43219… Adenoca… NA Metastatic … Metastat… FALSE
#> 6 4 Complete P-43219… Adenoca… NA Metastatic … Metastat… FALSE
#> 7 5 Complete P-54321… Adenoca… NA Metastatic … Metastat… TRUE
#> 8 5 Complete P-54321… Adenoca… NA Metastatic … Metastat… TRUE
#> 9 6 Complete P-67895… Adenoca… NA Metastatic … Metastat… FALSE
#> 10 7 Complete P-77320… Adenoca… NA Regional no… Regional… FALSE
#> 11 8 Complete P-88321… Adenoca… NA Metastatic … Metastat… TRUE
#> 12 8 Complete P-88321… Adenoca… NA Metastatic … Metastat… FALSE
#> # ℹ 33 more variables: ext_lndis <fct>, ext_bone <fct>, ext_vis <fct>,
#> # ext_liver <fct>, ext_lung <fct>, ext_other <fct>, bonevol <fct>,
#> # cntadt <fct>, tissue <fct>, smp_pros <fct>, smp_tissue <fct>,
#> # pur_rev <fct>, pur_remov <fct>, is_met_for_qc <fct>, dzextent_seq <fct>,
#> # primmet_smp <chr>, age_smp <dbl>, age_seq <dbl>, dx_smp_mos <dbl>,
#> # adt_smp_mos <dbl>, dx_seq_mos <dbl>, adt_seq_mos <dbl>, smp_met_mos <dbl>,
#> # smp_os_mos <dbl>, seq_met_mos <dbl>, seq_crpc_mos <dbl>, …
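As an illustration of the kind of preprocessing that yields a variable like age_dx, the sketch below derives an age in years from two dates. The dates are invented, the month/day/year format is an assumption about the raw export, and dividing by 365.25 days is one common convention; the package may compute age at diagnosis differently.

```r
# Hypothetical dates, assumed to be in month/day/year format
birth_date <- as.Date("03/15/1950", format = "%m/%d/%Y")
dx_date <- as.Date("09/15/2010", format = "%m/%d/%Y")

# Age in years, assuming an average year length of 365.25 days
age_dx <- as.numeric(difftime(dx_date, birth_date, units = "days")) / 365.25
round(age_dx, 1)
```

Storing the derived age instead of the two dates keeps the analytic variable while dropping the identifying dates.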
By default, the argument deidentify = TRUE is set in load_prostate_redcap(). Thus, any identifiers except the sample IDs, which are needed to merge in molecular data and are shared on cBioPortal, have been removed from the returned datasets.
To help ensure data quality, the prostateredcap package contains the function check_prostate_redcap(), which further processes the output of load_prostate_redcap() (in our example, pts_smp):
- Quality control checks are run on the pts and on the smp dataset.
- Quality-controlled pts and smp datasets are returned, excluding samples that do not pass a given level of internal consistency checks. Exclusion of samples that do not pass checks can be disabled altogether.
- By default, only variables recommended for analyses are returned: check_prostate_redcap(recommended_only = TRUE).
We pass the data to check_prostate_redcap() with default parameters and review the number of records that do not pass checks:
pts_smp_qcd <- pts_smp %>%
check_prostate_redcap(recommended_only = TRUE)
pts_smp_qcd$qc_pts
#> # A tibble: 7 × 6
#> label index included n diff excluded
#> <chr> <int> <list> <int> <int> <list>
#> 1 All patients 1 <tibble> 8 NA <NULL>
#> 2 Incomplete record 2 <tibble> 8 0 <tibble>
#> 3 Missing date of birth or diagnosis 3 <tibble> 7 1 <tibble>
#> 4 Metastatic/CRPC but no associated date 4 <tibble> 7 0 <tibble>
#> 5 No lastvisit+met+CRPC date 5 <tibble> 7 0 <tibble>
#> 6 Metastases before diagnosis 6 <tibble> 7 0 <tibble>
#> 7 Missing stage 7 <tibble> 7 0 <tibble>
- pts_smp_qcd$qc_pts shows that 1 record failed on criterion 3, which filters for records with a missing date of birth or a missing date of diagnosis. This record is excluded from the final “quality-controlled” return dataset, pts_smp_qcd$pts. Instead of 8 records before quality control, this dataset only includes records on 7 patients.
- For the sample-level data smp, QC results are accessible as pts_smp_qcd$qc_smp and the final version via pts_smp_qcd$smp. The first step for sample-level data (with index == 2) is to check whether the corresponding patient-level record passed quality control.
- The criteria applied by check_prostate_redcap() are defined in qc_criteria_pts() and qc_criteria_smp() and can be modified as needed.
- With check_prostate_redcap(qc_level_pts = 1, qc_level_smp = 1), no exclusions will be performed. Provide different levels than 1 to define the last QC criterion to use for exclusions. qc_pts and qc_smp will still display what the effect of all steps on the data would be.
- qc_pts and qc_smp can be used in study flowcharts of patient inclusion/exclusion.
The data are now ready to be used for analyses. For example, the sample data and patient data can be merged into one data frame.
inner_join(pts_smp_qcd$pts,
pts_smp_qcd$smp,
by = "ptid") %>%
rmarkdown::paged_table() # print formatted version
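For the flowchart use mentioned earlier, a QC table can be turned into exclusion text along the following lines. This is a sketch on a toy stand-in for pts_smp_qcd$qc_pts that keeps only the label, n, and diff columns, with values copied from the first three rows of the printed QC output; the object names qc_toy and exclusions and the wording of the text are hypothetical.

```r
library(dplyr)

# Toy stand-in for the QC table (first three rows of the printed output)
qc_toy <- tibble(
  label = c("All patients",
            "Incomplete record",
            "Missing date of birth or diagnosis"),
  n = c(8L, 8L, 7L),
  diff = c(NA, 0L, 1L)
)

# Keep only steps that excluded at least one record and phrase them
# as text for a flowchart box
exclusions <- qc_toy %>%
  filter(!is.na(diff), diff > 0) %>%
  mutate(text = paste0("Excluded (", label, "): n = ", diff,
                       "; remaining: n = ", n))

exclusions$text
```

The same pattern should carry over to the sample-level QC table, provided it has the same columns.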
See the data dictionary of all derived variables recommended for analyses.