Data dictionary for REDCap: Definitions for key data elements in clinical-genomic prostate cancer research. The field definitions can be downloaded as a template for creating a REDCap database for data entry.
Assessment of reproducibility: Empirical assessment of the validity of key data fields for reproducibility in the specific clinical setting (as described in Keegan et al.).
prostateredcap R package: Load and reshape the raw data from REDCap, perform automated quality control checks, and deidentify the data, using an R package available via GitHub. The resulting shareable dataset is ready for statistical/bioinformatic analyses.
Design and implement the REDCap database – see Keegan et al.
Assessment of reproducibility – see Keegan et al.
Data processing for analyses using the prostateredcap package – described in detail in the Get Started vignette.
Export REDCap dataset in “Labeled CSV” format. (Because data collection involves protected health information, raw datasets cannot be shared. To test the process without having an actual dataset at hand yet, the Get Started vignette uses an example dataset.)
In R, install the prostateredcap package using remotes::install_github("stopsack/prostateredcap").
Import the Labeled CSV exported from REDCap using load_prostate_redcap(), which loads, merges, reformats, corrects, and labels the dataset (see Details). By default, this function will also pass the dataset through deidentify_prostate_redcap() to remove protected health information.
library(prostateredcap)
datasets <- load_prostate_redcap(labeled_csv = "file_from_redcap.csv")
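Before loading, it can help to confirm that the export exists and that its header row looks like a labeled export. The following is a minimal base-R sketch, not part of the prostateredcap package. It writes a tiny stand-in file so that it runs on its own; the column names and values in it are hypothetical and do not reflect the actual data dictionary.

```r
# Hypothetical stand-in for a REDCap "Labeled CSV" export.
# The column names and values are invented for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    '"Record ID","Event Name","Age at diagnosis"',
    '"1","Baseline","64"',
    '"2","Baseline","71"'
  ),
  csv_path
)

# Stop early if the file is missing.
stopifnot(file.exists(csv_path))

# Read only the first few rows; check.names = FALSE keeps the
# human-readable column labels exactly as they appear in the file.
preview <- read.csv(csv_path, nrows = 5, check.names = FALSE)
print(names(preview))
print(dim(preview))
```

With a real export, csv_path would instead point to the file downloaded from REDCap, and the writeLines() step would be dropped.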
Run automated quality control checks and exclude records that fail these checks using check_prostate_redcap(). With recommended_only = TRUE, only data elements recommended for analyses are returned.
datasets <- check_prostate_redcap(datasets, recommended_only = TRUE)
# View quality control results and exclusion criteria:
datasets$qc_pts
datasets$qc_smp
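For a quick overview of the analysis datasets in the returned list (pts, smp, and trt), one could count rows and list column names per element. The sketch below uses a mock list in place of real output from check_prostate_redcap(); every column name and value in it is hypothetical.

```r
# Mock stand-in for the list returned by check_prostate_redcap().
# All column names and values are hypothetical.
datasets <- list(
  pts = data.frame(id = 1:3, age = c(64, 71, 58)),
  smp = data.frame(id = c(1, 1, 2, 3), sample = c("a", "b", "c", "d")),
  trt = data.frame(id = c(1, 2), line = c(1, 1))
)

# Number of rows (records) in each analysis dataset.
n_rows <- vapply(datasets[c("pts", "smp", "trt")], nrow, integer(1))
print(n_rows)

# Column names per dataset, for comparison against the data dictionary.
col_names <- lapply(datasets[c("pts", "smp", "trt")], names)
str(col_names)
```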
datasets$pts is the set of patient-level data; datasets$smp is the set of sample-level data; datasets$trt is the set of treatment data per line of treatment. See the data dictionary of elements recommended for analyses.
Keegan NM, Vasselman SE, Barnett ES, Nweji B, Carbone EA, Blum A, Morris MJ, Rathkopf DE, Slovin SF, Danila DC, Autio KA, Scher HI, Kantoff PW, Abida W,* Stopsack KH.* Clinical annotations for prostate cancer research: Defining data elements, creating a reproducible analytical pipeline, and assessing data quality. The Prostate. 2022. doi: 10.1002/pros.24363.