Getting started with an example dataset

Install the prostateredcap package

This step is only required once. To install (or update) the prostateredcap package from GitHub, use the remotes package:

install.packages("remotes")  # skip if 'remotes' package is already installed
remotes::install_github("stopsack/prostateredcap")

An example dataset

The prostateredcap R package contains an example dataset of the prostate cancer database in the same format as it would be exported from REDCap as a “labeled CSV.” All data in the example dataset are designed to mimic real clinical data but do not correspond to any real patients.

First, load the dplyr package for data handling, and take a look at the raw example dataset provided as part of the prostateredcap package.

library(dplyr)

raw_data <- system.file("extdata",
                        "SampleGUPIMPACTDatab_DATA_LABELS_2021-05-26.csv",
                        package = "prostateredcap")

readr::read_csv(file = raw_data) %>% 
  print(max_extra_cols = 0)  # do not print all other columns
#> # A tibble: 28 × 72
#>    `Record ID` `Repeat Instrument` `Repeat Instance` `Birth Date` Race          
#>          <dbl> <chr>                           <dbl> <chr>        <chr>         
#>  1           1 NA                                 NA 02/04/1956   White         
#>  2           1 Sample Data                         1 NA           NA            
#>  3           1 Freeze Data                         1 NA           NA            
#>  4           2 NA                                 NA 02/09/1974   White         
#>  5           2 Sample Data                         1 NA           NA            
#>  6           2 Freeze Data                         1 NA           NA            
#>  7           3 NA                                 NA 01/08/1953   Black or Afri…
#>  8           3 Sample Data                         1 NA           NA            
#>  9           3 Sample Data                         2 NA           NA            
#> 10           3 Freeze Data                         1 NA           NA            
#> # ℹ 18 more rows

The dataset, as a typical REDCap export, contains multiple rows per person, with each of the REDCap “forms” (baseline data, sample data, …) in a separate row and blank values for variables not part of that “form.”
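
To make this row layout concrete, the following sketch builds a small stand-in for such an export and separates the baseline rows from the repeated "Sample Data" rows. The toy values and the object names (toy_export, toy_baseline, toy_samples) are made up for illustration, and this is not necessarily how load_prostate_redcap() handles the split internally.

```r
library(dplyr)

# Toy stand-in for a REDCap "labeled CSV" export (values are made up)
toy_export <- tibble::tibble(
  `Record ID`         = c(1, 1, 1, 2, 2),
  `Repeat Instrument` = c(NA, "Sample Data", "Freeze Data", NA, "Sample Data"),
  `Repeat Instance`   = c(NA, 1, 1, NA, 1),
  `Birth Date`        = c("02/04/1956", NA, NA, "02/09/1974", NA)
)

# Baseline rows have no repeat instrument: one row per person
toy_baseline <- toy_export %>%
  filter(is.na(`Repeat Instrument`))

# Rows from the repeating "Sample Data" form: possibly several per person
toy_samples <- toy_export %>%
  filter(`Repeat Instrument` == "Sample Data")

# Number of rows contributed by each form
toy_export %>%
  count(`Repeat Instrument`)
```

Baseline variables such as the birth date are filled in only on the rows without a repeat instrument, which is why they appear as NA on all other rows.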

Loading the data

We will load the prostateredcap package, read in the same dataset again, and display its contents.

library(prostateredcap)

pts_smp <- load_prostate_redcap(raw_data)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `stage_detailed = fct_relevel(...)`.
#> Caused by warning:
#> ! 5 unknown levels in `f`: T1/T2 NX M0, T3 N0 M0, T3 NX M0, T4 M0, and TX NX M0
#> Warning: Expected 4 pieces. Missing pieces filled with `NA` in 2 rows
#> [3, 10].
#> Warning: There was 1 warning in `transmute()`.
#> ℹ In argument: `smp_tissue = fct_collapse(smp_tissue, Visceral = c("Liver",
#>   "Lung"))`.
#> Caused by warning:
#> ! Unknown levels in `f`: Liver
#> Warning: There were 2 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `dzextent_seq = fct_relevel(...)`.
#> Caused by warning:
#> ! 2 unknown levels in `f`: Regional nodes and Metastatic, variant histology
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `rx_dzextent = fct_relevel(...)`.
#> Caused by warning:
#> ! 6 unknown levels in `f`: Localized, Regional nodes, Metastatic
#> hormone-sensitive, Non-metastatic castration-resistant, Metastatic
#> castration-resistant, and Metastatic, variant histology

These warnings are expected: the example dataset covers only 8 patients and therefore does not contain all tumor/stage combinations.

load_prostate_redcap() has returned a list with two separate data elements:

pts, the data frame with patient-level data.
smp, the sample-level data frame. smp can have multiple rows per patient that can be merged with the patient-level data pts using the ptid variable present in both datasets.

The data in pts_smp is preprocessed. For example, rather than containing data on date of birth and date of diagnosis, the pts dataset contains age at diagnosis (age_dx, in years).
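
As a rough illustration of this kind of preprocessing, the sketch below derives an age in years from two dates and then drops the dates. The dates, the column names dob and dx, and the 365.25-day year are assumptions for this example; the exact calculation used by load_prostate_redcap() is not shown here.

```r
# Hypothetical dates for two patients (not taken from the example dataset)
toy_dates <- data.frame(
  ptid = c(1L, 2L),
  dob  = as.Date(c("1950-01-01", "1960-06-15")),
  dx   = as.Date(c("2010-01-01", "2020-06-15"))
)

# Subtracting Date objects gives a difference in days;
# divide by an average year length of 365.25 days to obtain years
toy_dates$age_dx <- as.numeric(toy_dates$dx - toy_dates$dob) / 365.25

# Keep only the identifier and the derived variable, not the dates
toy_derived <- toy_dates[, c("ptid", "age_dx")]
toy_derived
```

Working with derived intervals instead of calendar dates keeps the analysis variables while leaving out directly identifying information.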

pts_smp$pts
#> # A tibble: 8 × 59
#>    ptid complete_pts age_dx race           race4 race3 ethnicity smoking smoke01
#>   <int> <chr>         <dbl> <fct>          <fct> <fct> <fct>     <fct>     <dbl>
#> 1     1 Complete       43.4 White          White White NOT Hisp… Never         0
#> 2     2 Complete       42.3 White          White White NOT Hisp… Never         0
#> 3     3 Complete       60.7 Black or Afri… Blac… Black NOT Hisp… Never         0
#> 4     4 Complete       NA   NA             NA    NA    NOT Hisp… Never         0
#> 5     5 Complete       63.4 White          White White NOT Hisp… Current       1
#> 6     6 Complete       49.1 White          White White NOT Hisp… Never         0
#> 7     7 Complete       59.6 Asian          Asian Asian NOT Hisp… Never         0
#> 8     8 Complete       65.2 White          White White NOT Hisp… Never         0
#> # ℹ 50 more variables: bx_gl_sum <dbl>, bx_gl <fct>, bx_gl34 <fct>,
#> #   bx_gl_maj <dbl>, bx_gl_min <dbl>, psa_dx <dbl>, psa_dxcat <fct>,
#> #   lnpsa_dx <dbl>, clin_t <fct>, clin_n <fct>, clin_m <fct>,
#> #   stage_detailed <fct>, stage <fct>, clin_tstage <fct>, clin_nstage <fct>,
#> #   mstage <fct>, rxprim <fct>, rxprim_oth <chr>, rxprim_rp <lgl>,
#> #   rxprim_adt <lgl>, rxprim_chemo <lgl>, rxprim_xrt <lgl>, rxprim_other <lgl>,
#> #   rp_gl_sum <dbl>, rp_gl34 <fct>, rp_gl_maj <dbl>, rp_gl_min <dbl>, …

pts_smp$smp
#> # A tibble: 12 × 41
#>     ptid complete_smp dmpid    hist_smp hist_cmt dzextent_smp dzextent2 ext_pros
#>    <int> <chr>        <chr>    <fct>    <chr>    <fct>        <fct>     <fct>   
#>  1     1 Complete     P-12345… Adenoca… NA       Metastatic … Metastat… FALSE   
#>  2     2 Complete     P-23456… Adenoca… NA       Metastatic … Metastat… TRUE    
#>  3     3 Complete     P-34567… Adenoca… NA       Localized    Localized FALSE   
#>  4     3 Complete     P-34567… Adenoca… NA       Metastatic … Metastat… FALSE   
#>  5     4 Complete     P-43219… Adenoca… NA       Metastatic … Metastat… FALSE   
#>  6     4 Complete     P-43219… Adenoca… NA       Metastatic … Metastat… FALSE   
#>  7     5 Complete     P-54321… Adenoca… NA       Metastatic … Metastat… TRUE    
#>  8     5 Complete     P-54321… Adenoca… NA       Metastatic … Metastat… TRUE    
#>  9     6 Complete     P-67895… Adenoca… NA       Metastatic … Metastat… FALSE   
#> 10     7 Complete     P-77320… Adenoca… NA       Regional no… Regional… FALSE   
#> 11     8 Complete     P-88321… Adenoca… NA       Metastatic … Metastat… TRUE    
#> 12     8 Complete     P-88321… Adenoca… NA       Metastatic … Metastat… FALSE   
#> # ℹ 33 more variables: ext_lndis <fct>, ext_bone <fct>, ext_vis <fct>,
#> #   ext_liver <fct>, ext_lung <fct>, ext_other <fct>, bonevol <fct>,
#> #   cntadt <fct>, tissue <fct>, smp_pros <fct>, smp_tissue <fct>,
#> #   pur_rev <fct>, pur_remov <fct>, is_met_for_qc <fct>, dzextent_seq <fct>,
#> #   primmet_smp <chr>, age_smp <dbl>, age_seq <dbl>, dx_smp_mos <dbl>,
#> #   adt_smp_mos <dbl>, dx_seq_mos <dbl>, adt_seq_mos <dbl>, smp_met_mos <dbl>,
#> #   smp_os_mos <dbl>, seq_met_mos <dbl>, seq_crpc_mos <dbl>, …

By default, the argument deidentify = TRUE is set in load_prostate_redcap(). Thus, any identifiers except the sample IDs, which are needed to merge in molecular data and are shared on cBioPortal, have been removed from the returned datasets.

Performing quality control

To help ensure data quality, the prostateredcap package contains the function check_prostate_redcap(), which further processes the output of load_prostate_redcap() (in our example, pts_smp):

A set of internal consistency checks is run on the pts and on the smp dataset.
For each step, the number of records that pass and the number that are excluded by that criterion are recorded.
The pts and smp datasets are returned, excluding samples that do not pass a given level of internal consistency checks. Exclusion of samples that do not pass checks can be disabled altogether.
To obtain only those derived variables that are recommended for analyses, use check_prostate_redcap(recommended_only = TRUE).
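
The idea of sequential checks with recorded counts can be sketched on toy data as follows. The two criteria, the toy_pts data, and the qc_log object are hypothetical and only mirror the general approach; they are not the criteria or the implementation used by check_prostate_redcap().

```r
library(dplyr)

# Toy patient-level data (made up)
toy_pts <- tibble::tibble(
  ptid     = 1:4,
  complete = c(TRUE, TRUE, TRUE, FALSE),
  age_dx   = c(61.2, NA, 58.0, 70.1)
)

# Hypothetical criteria: each function returns TRUE for records that pass
criteria <- list(
  "Incomplete record"        = function(d) d$complete,
  "Missing age at diagnosis" = function(d) !is.na(d$age_dx)
)

# Apply the criteria one after another, logging counts at each step
remaining <- toy_pts
qc_log <- tibble::tibble(label = "All patients",
                         n     = nrow(remaining),
                         diff  = NA_integer_)
for (label in names(criteria)) {
  keep <- criteria[[label]](remaining)
  qc_log <- bind_rows(
    qc_log,
    tibble::tibble(label = label, n = sum(keep), diff = sum(!keep))
  )
  remaining <- remaining[keep, ]
}

qc_log
```

Because each criterion is applied only to the records that passed all earlier steps, the order of the criteria determines which step a failing record is attributed to.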

Pass the data to check_prostate_redcap() with the default QC levels, keeping only the recommended variables, and review the number of records that do not pass checks:

pts_smp_qcd <- pts_smp %>%
  check_prostate_redcap(recommended_only = TRUE)

pts_smp_qcd$qc_pts
#> # A tibble: 7 × 6
#>   label                                  index included     n  diff excluded
#>   <chr>                                  <int> <list>   <int> <int> <list>  
#> 1 All patients                               1 <tibble>     8    NA <NULL>  
#> 2 Incomplete record                          2 <tibble>     8     0 <tibble>
#> 3 Missing date of birth or diagnosis         3 <tibble>     7     1 <tibble>
#> 4 Metastatic/CRPC but no associated date     4 <tibble>     7     0 <tibble>
#> 5 No lastvisit+met+CRPC date                 5 <tibble>     7     0 <tibble>
#> 6 Metastases before diagnosis                6 <tibble>     7     0 <tibble>
#> 7 Missing stage                              7 <tibble>     7     0 <tibble>

pts_smp_qcd$qc_pts shows that 1 record failed criterion 3, which excludes records with a missing date of birth or a missing date of diagnosis. This record is excluded from the final “quality-controlled” return dataset, pts_smp_qcd$pts, which therefore includes 7 patients instead of the 8 present before quality control.
For the sample dataset smp, QC results are accessible as pts_smp_qcd$qc_smp and the final version as pts_smp_qcd$smp. The first step for sample-level data (with index == 2) is to check whether the corresponding patient-level record passed quality control.
The criteria used by check_prostate_redcap() are defined in qc_criteria_pts() and qc_criteria_smp() and can be modified as needed.
Running check_prostate_redcap(qc_level_pts = 1, qc_level_smp = 1) performs no exclusions. Provide levels other than 1 to define the last QC criterion to use for exclusions. qc_pts and qc_smp will still show what effect each step would have on the data.
The sequential exclusions and respective counts available in qc_pts and qc_smp can be used in study flowcharts of patient inclusion/exclusion.
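
As one possible way to turn these counts into flowchart text, the sketch below formats one line per QC step. The toy_qc table copies the label, index, n, and diff columns from the qc_pts output shown above and omits the list columns; the wording of the text lines and the names toy_qc and flow_lines are choices made for this example.

```r
library(dplyr)

# Counts copied from the qc_pts output shown above; the list columns
# (included, excluded) are left out of this stand-in
toy_qc <- tibble::tibble(
  label = c("All patients",
            "Incomplete record",
            "Missing date of birth or diagnosis",
            "Metastatic/CRPC but no associated date",
            "No lastvisit+met+CRPC date",
            "Metastases before diagnosis",
            "Missing stage"),
  index = 1:7,
  n     = c(8L, 8L, 7L, 7L, 7L, 7L, 7L),
  diff  = c(NA, 0L, 1L, 0L, 0L, 0L, 0L)
)

# One line of flowchart text per step: the starting count first,
# then the number excluded and the number remaining at each criterion
flow_lines <- toy_qc %>%
  mutate(text = if_else(
    is.na(diff),
    paste0(label, ": n = ", n),
    paste0("Excluded, ", label, ": ", diff, " (remaining: ", n, ")")
  )) %>%
  pull(text)

writeLines(flow_lines)
```

With the real objects, the same columns should be available by selecting label, index, n, and diff from pts_smp_qcd$qc_pts or pts_smp_qcd$qc_smp.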

Running analyses

The data are now ready to be used for analyses. For example, the sample data and patient data can be merged into one data frame.

inner_join(pts_smp_qcd$pts,
           pts_smp_qcd$smp,
           by = "ptid") %>%
  rmarkdown::paged_table()  # print formatted version
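
Because smp can hold several rows per patient, the merged data frame has one row per sample, not one per patient. The toy sketch below illustrates this behavior of inner_join(); the data and the object names toy_pts, toy_smp, and toy_merged are made up for this example.

```r
library(dplyr)

# Toy patient-level data: one row per patient
toy_pts <- tibble::tibble(
  ptid   = 1:3,
  age_dx = c(43.4, 42.3, 60.7)
)

# Toy sample-level data: patient 3 has two samples, patient 2 has none,
# and patient 4 is absent from the patient-level data
toy_smp <- tibble::tibble(
  ptid         = c(1L, 3L, 3L, 4L),
  dzextent_smp = c("Metastatic", "Localized", "Metastatic", "Metastatic")
)

# inner_join() keeps only ptid values present in both data frames and
# repeats the patient-level columns for every matching sample
toy_merged <- inner_join(toy_pts, toy_smp, by = "ptid")
toy_merged
```

Patient-level summaries computed on such a merged data frame count patients with several samples more than once, so it can help to summarize pts directly or to reduce the merged data to one row per ptid first.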

See the data dictionary of all derived variables recommended for analyses.