adjust_batch generates batch effect-adjusted biomarker levels for the variable(s) markers in the dataset data, i.e., levels corrected for differential measurement error between levels of batch.

adjust_batch(
  data,
  markers,
  batch,
  method = c("simple", "standardize", "ipw", "quantreg", "quantnorm"),
  confounders = NULL,
  suffix = "_adjX",
  ipw_truncate = c(0.025, 0.975),
  quantreg_tau = c(0.25, 0.75),
  quantreg_method = "fn"
)

Arguments

data

Data set (e.g., a data frame) that contains the marker, batch, and confounder variables.

markers

Variable name(s) to batch-adjust. Select multiple variables with tidy evaluation, e.g., markers = starts_with("biomarker").

batch

Categorical variable indicating batch.

method

Method for batch effect correction:

  • simple Simple means per batch will be subtracted. No adjustment for confounders.

  • standardize Means per batch after standardization for confounders in linear models will be subtracted. If no confounders are supplied, method = simple is equivalent and will be used.

  • ipw Means per batch after inverse-probability weighting for assignment to a specific batch in multinomial models, conditional on confounders, will be subtracted. Stabilized weights are used, truncated at quantiles defined by the ipw_truncate parameters. If no confounders are supplied, method = simple is equivalent and will be used.

  • quantreg Lower quantiles (default: 25th percentile) and ranges between a lower and an upper quantile (default: 75th percentile) will be unified between batches, allowing for differences in both parameters due to confounders. Set the two quantiles using the quantreg_tau parameters.

  • quantnorm Quantile normalization between batches. No adjustment for confounders.

confounders

Optional: Confounders, i.e. determinants of biomarker levels that differ between batches. Only used if method = standardize, method = ipw, or method = quantreg, i.e. methods that attempt to retain some of these "true" between-batch differences. Select multiple confounders with tidy evaluation, e.g., confounders = c(age, age_squared, sex).

suffix

Optional: What string to append to variable names after batch adjustment. Defaults to "_adjX", with X depending on method:

  • _adj2 from method = simple

  • _adj3 from method = standardize

  • _adj4 from method = ipw

  • _adj5 from method = quantreg

  • _adj6 from method = quantnorm

ipw_truncate

Optional and used for method = ipw only: Lower and upper quantiles for truncation of stabilized weights. Defaults to c(0.025, 0.975).

quantreg_tau

Optional and used for method = quantreg only: Lower and upper quantiles to be unified between batches. Defaults to c(0.25, 0.75).

quantreg_method

Optional and used for method = quantreg only: Algorithmic method to fit quantile regression. Defaults to "fn". See parameter method of rq.

Value

The data dataset with batch effect-adjusted variable(s) added at the end. Model diagnostics, using the attribute .batchtma of this dataset, are available via the diagnose_models function.

Details

If no true differences between batches are expected, because samples have been randomized to batches, then a method that returns adjusted values with equal means (method = simple) or with equal rank values (method = quantnorm) for all batches is appropriate.
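As a rough illustration of what "equal means" and "equal rank values" imply, the following base R sketch mean-centers and quantile-normalizes a toy data set by hand. It is not the package's implementation: the batch effect is assumed to be the batch mean minus the mean of all batch means (which matches the symmetric shifts in the example output on this page), the helper quantile_normalize is hypothetical, and the quantile normalization shown only handles equally sized batches.

```r
# Toy data: two equally sized batches; batch 2 is shifted upward
set.seed(1)
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) + runif(n = 20, max = 5)
)

# Mean-centering (assumed reading of method = simple):
# batch effect = batch mean minus the mean of all batch means
batch_means <- tapply(df$biomarker, df$tma, mean)
batch_effect <- batch_means - mean(batch_means)
df$centered <- df$biomarker - unname(batch_effect[as.character(df$tma)])

# Quantile normalization for equally sized batches (hypothetical helper):
# the k-th smallest value of every batch is replaced by the average of
# the k-th smallest values across batches
quantile_normalize <- function(x, batch) {
  groups <- split(x, batch)
  stopifnot(length(unique(lengths(groups))) == 1)
  target <- rowMeans(sapply(groups, sort))
  out <- numeric(length(x))
  for (b in names(groups)) {
    idx <- which(as.character(batch) == b)
    out[idx] <- target[rank(x[idx], ties.method = "first")]
  }
  out
}
df$quantnormed <- quantile_normalize(df$biomarker, df$tma)

# After centering, batch means agree; after quantile normalization,
# the sorted values of both batches agree
print(tapply(df$centered, df$tma, mean))
print(sapply(split(df$quantnormed, df$tma), sort))
```

Mean-centering leaves the shape of each batch's distribution untouched, whereas quantile normalization forces all batches onto one common distribution.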

If the distribution of determinants of biomarker values (confounders) differs between batches, then a method that retains these "true" differences between batches while adjusting for batch effects may be appropriate: method = standardize and method = ipw address means; method = quantreg addresses lower values and dynamic range separately.
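The idea behind retaining "true" differences while adjusting means can be sketched with marginal standardization in a linear model. This is one plausible reading of method = standardize, not the package's code: the simulated data, all variable names, and the choice to center batch effects at the mean of the standardized batch means are assumptions made for illustration.

```r
# Simulated data: batch 2 has higher confounder values (a "true" difference)
# plus an artificial batch effect of 1.5 units
set.seed(42)
n <- 200
batch <- factor(rep(1:2, each = n / 2))
confounder <- rnorm(n, mean = ifelse(batch == "2", 1, 0))
biomarker <- 0.5 * confounder + ifelse(batch == "2", 1.5, 0) + rnorm(n)
dat <- data.frame(batch, confounder, biomarker)

# Model the confounder-biomarker association jointly with batch
fit <- lm(biomarker ~ batch + confounder, data = dat)

# Standardized batch means: predicted mean if every observation
# had been measured in batch b, keeping its own confounder value
std_means <- sapply(levels(dat$batch), function(b) {
  counterfactual <- dat
  counterfactual$batch <- factor(rep(b, nrow(dat)), levels = levels(dat$batch))
  mean(predict(fit, newdata = counterfactual))
})

# Subtract each batch's deviation from the mean of standardized batch means
batch_effect <- std_means - mean(std_means)
dat$biomarker_std <- dat$biomarker -
  unname(batch_effect[as.character(dat$batch)])

# Crude batch means still differ (confounder-driven difference is retained),
# but the confounder-adjusted batch coefficient is now essentially zero
print(tapply(dat$biomarker_std, dat$batch, mean))
refit <- lm(biomarker_std ~ batch + confounder, data = dat)
print(coef(refit)["batch2"])
```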

Which method to choose depends on the properties of the batch effects (do they affect means only, or also variance?) and on the presence and strength of confounding. For the two mean-only confounder-adjusted methods, the choice may depend on whether the confounder-batch association (method = ipw) or the confounder-biomarker association (method = standardize) can be modeled better. Generally, if batch effects are present, any adjustment method tends to perform better than no adjustment in reducing bias and increasing between-study reproducibility. See references.
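To make the confounder-batch modeling behind method = ipw more concrete, here is a hedged base R sketch of stabilized, truncated inverse-probability weights. It is not the package's implementation: because the toy data have only two batches, a logistic regression stands in for the multinomial model, and the simulated data and all object names are assumptions.

```r
# Simulated data: confounder distribution differs between two batches
set.seed(7)
n <- 200
batch <- factor(rep(1:2, each = n / 2))
confounder <- rnorm(n, mean = ifelse(batch == "2", 1, 0))
biomarker <- 0.5 * confounder + ifelse(batch == "2", 1.5, 0) + rnorm(n)
dat <- data.frame(batch, confounder, biomarker)

# Model the confounder-batch association (logistic model for 2 batches)
ps_fit <- glm(batch ~ confounder, family = binomial, data = dat)
p_batch2 <- predict(ps_fit, type = "response")

# Probability of the batch actually observed, given the confounder
p_own <- ifelse(dat$batch == "2", p_batch2, 1 - p_batch2)
# Marginal probability of the observed batch (numerator of stabilized weights)
p_marginal <- ifelse(dat$batch == "2",
                     mean(dat$batch == "2"), mean(dat$batch == "1"))
sw <- p_marginal / p_own

# Truncate weights at the 2.5th and 97.5th percentiles
bounds <- quantile(sw, probs = c(0.025, 0.975))
sw_trunc <- pmin(pmax(sw, bounds[1]), bounds[2])

# Weighted batch means, then subtract each batch's deviation
wmeans <- sapply(levels(dat$batch), function(b) {
  in_b <- dat$batch == b
  weighted.mean(dat$biomarker[in_b], w = sw_trunc[in_b])
})
batch_effect <- wmeans - mean(wmeans)
dat$biomarker_ipw <- dat$biomarker -
  unname(batch_effect[as.character(dat$batch)])

print(range(sw_trunc))
print(batch_effect)
```

Truncation limits the influence of observations whose confounder values make their own batch very unlikely, at the cost of some residual confounding.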

All adjustment approaches except method = quantnorm are based on linear models. It is recommended that variables for markers and confounders first be transformed as necessary (e.g., log transformations or splines). Scaling or mean centering are not necessary, and adjusted values are returned on the original scale. Parameters markers, batch, and confounders support tidy evaluation.
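A possible workflow for the recommended transformations is sketched below: the marker is log-transformed and a squared confounder term is added before calling adjust_batch. The simulated data and the variable names age, age_squared, and log_biomarker are made up for illustration, and the call is only run if the batchtma package is installed.

```r
# Simulated right-skewed biomarker with a batch shift on the log scale
set.seed(2021)
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = exp(rnorm(20, mean = rep(c(0, 0.5), times = 10))),
  age = runif(20, min = 50, max = 80)
)

# Transform before adjustment: log scale for the marker,
# quadratic term for the confounder
df$log_biomarker <- log(df$biomarker)
df$age_squared <- df$age^2

if (requireNamespace("batchtma", quietly = TRUE)) {
  adjusted <- batchtma::adjust_batch(
    data = df,
    markers = log_biomarker,
    batch = tma,
    method = standardize,
    confounders = c(age, age_squared)
  )
  # Adjusted values are returned on the scale that was passed in (here: log),
  # so back-transform if values on the original scale are needed
  adjusted$biomarker_adj3 <- exp(adjusted$log_biomarker_adj3)
  print(head(adjusted))
}
```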

Observations with missing values for the markers or confounders are ignored in the estimation of adjustment parameters, as are empty batches. Batch effect-adjusted values for observations with existing marker values but missing confounders are based on adjustment parameters derived from the other observations in a batch with non-missing confounders.

References

Stopsack KH, Tyekucheva S, Wang M, Gerke TA, Vaselkiv JB, Penney KL, Kantoff PW, Finn SP, Fiorentino M, Loda M, Lotan TL, Parmigiani G+, Mucci LA+ (+ equal contribution). Extent, impact, and mitigation of batch effects in tumor biomarker studies using tissue microarrays. eLife 2021;10:e71265. doi: https://doi.org/10.7554/elife.71265 (This R package, all methods descriptions, and further recommendations.)

Rosner B, Cook N, Portman R, Daniels S, Falkner B. Determination of blood pressure percentiles in normal-weight children: some methodological issues. Am J Epidemiol 2008;167(6):653-66. (Basis for method = standardize)

Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003;19:185–193. (method = quantnorm)

Author

Konrad H. Stopsack

Examples

# Data frame with two batches
# Batch 2 has higher values of biomarker and confounder
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) +
    runif(max = 5, n = 20),
  confounder = rep(0:1, times = 10) +
    runif(max = 10, n = 20)
)

# Adjust for batch effects
# Using simple means, ignoring the confounder:
adjust_batch(
  data = df,
  markers = biomarker,
  batch = tma,
  method = simple
)
#>    tma biomarker confounder biomarker_adj2
#> 1    1  1.403751  2.8989230       1.960613
#> 2    2  6.171665  7.7838043       5.614803
#> 3    1  4.003804  7.3531960       4.560667
#> 4    2  2.786042  2.9595673       2.229180
#> 5    1  1.036997  9.8053967       1.593860
#> 6    2  4.331967  8.4152153       3.775105
#> 7    1  3.488887  0.5144628       4.045749
#> 8    2  3.448836  6.3021246       2.891974
#> 9    1  4.664410  6.9582388       5.221272
#> 10   2  5.862608  7.8855600       5.305745
#> 11   1  5.373003  0.3123033       5.929866
#> 12   2  2.874703  3.2556253       2.317841
#> 13   1  1.171207  3.0083081       1.728069
#> 14   2  3.601929  7.3646561       3.045066
#> 15   1  3.011641  4.7902455       3.568504
#> 16   2  2.978349  5.3217126       2.421487
#> 17   1  3.017691  7.0643384       3.574553
#> 18   2  2.318307 10.4857658       1.761445
#> 19   1  2.943507  1.8033877       3.500369
#> 20   2  6.877739  3.1689988       6.320877
# Returns data set with new variable "biomarker_adj2"

# Use quantile regression, include the confounder,
# change suffix of returned variable:
adjust_batch(
  data = df,
  markers = biomarker,
  batch = tma,
  method = quantreg,
  confounders = confounder,
  suffix = "_batchadjusted"
)
#> Warning: Returning data frames from `filter()` expressions was deprecated in dplyr
#> 1.0.8.
#>  Please use `if_any()` or `if_all()` instead.
#>  The deprecated feature was likely used in the batchtma package.
#>   Please report the issue to the authors.
#>    tma biomarker confounder biomarker_batchadjusted
#> 1    1  1.403751  2.8989230                3.095246
#> 2    2  6.171665  7.7838043                4.313937
#> 3    1  4.003804  7.3531960                4.196229
#> 4    2  2.786042  2.9595673                3.043022
#> 5    1  1.036997  9.8053967                2.939945
#> 6    2  4.331967  8.4152153                3.623340
#> 7    1  3.488887  0.5144628                3.978189
#> 8    2  3.448836  6.3021246                3.291825
#> 9    1  4.664410  6.9582388                4.475960
#> 10   2  5.862608  7.8855600                4.197921
#> 11   1  5.373003  0.3123033                4.776011
#> 12   2  2.874703  3.2556253                3.076304
#> 13   1  1.171207  3.0083081                2.996776
#> 14   2  3.601929  7.3646561                3.349294
#> 15   1  3.011641  4.7902455                3.776101
#> 16   2  2.978349  5.3217126                3.115211
#> 17   1  3.017691  7.0643384                3.778663
#> 18   2  2.318307 10.4857658                2.867440
#> 19   1  2.943507  1.8033877                3.747250
#> 20   2  6.877739  3.1689988                4.578988
# Returns data set with new variable "biomarker_batchadjusted"
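
A further example, offered as a sketch rather than as part of the original examples: inverse-probability weighting with a non-default ipw_truncate, followed by a look at the stored model information. It assumes that diagnose_models takes the returned data set as its only argument; the seed and the truncation quantiles are arbitrary choices, and the package calls are skipped if batchtma is not installed.

```r
# Same structure as the data frame above, with a seed for reproducibility
set.seed(123)
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) + runif(max = 5, n = 20),
  confounder = rep(0:1, times = 10) + runif(max = 10, n = 20)
)

if (requireNamespace("batchtma", quietly = TRUE)) {
  # Inverse-probability weighting, truncating weights at the
  # 5th and 95th percentiles instead of the defaults
  adjusted <- batchtma::adjust_batch(
    data = df,
    markers = biomarker,
    batch = tma,
    method = ipw,
    confounders = confounder,
    ipw_truncate = c(0.05, 0.95)
  )
  # With the default suffix, the new variable should be "biomarker_adj4"
  print(names(adjusted))

  # Adjustment details are stored in the attribute .batchtma
  str(attr(adjusted, ".batchtma"), max.level = 1)

  # Assumed usage: pass the adjusted data set to diagnose_models
  print(batchtma::diagnose_models(adjusted))
}
```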