Detecting, correcting, and preventing the batch effects in multi-site data, with a focus on gene expression Microarrays

Full Text: thesis_Abbr_End.pdf

Gene expression microarrays are widely used to better understand the complex biological mechanisms inside cells. One of the main obstacles to applying statistical learning algorithms to microarray data is the large gap between the number of features (p) and the number of available instances (n), i.e., the "large p, small n" challenge. This thesis explores two ways to deal with this challenge. One approach is to increase n by combining similarly appropriate microarray data sets together. This is appealing as there are now many publicly available microarray studies. The main problem of this approach is the batch effect, i.e., the influence of non-biological factors on expression intensities that can confound the biological signal. As a result, combining gene expression studies without correcting for batch effects may lead to misleading findings. This thesis proposes a novel batch correction algorithm, called batch effect correction using canonical correlation analysis (BECCA), that assumes the batch effect is due to additive independent confounding factors and so utilizes canonical correlation analysis to separate technical bias from the measured biological signal. We compare BECCA to various existing batch correction algorithms using several real-world gene expression studies and find that BECCA has similar performance. The key advantage of utilizing BECCA, compared to other similarly performing algorithms, is its flexibility, as BECCA allows the user to adjust how much common signal to preserve across the batches and how much batch-related signal to remove from each one by changing the values of BECCA parameters. The second approach to batch correction considers the wisdom of reducing p by selecting a subset of genes. Our experiments suggest that some genes in microarray data sets contain very little biological signal, i.e., including only these genes in the calculations makes all specimens highly correlated, regardless of their tissue of origin or disease state.
It is, therefore, desirable to identify and remove these misleading genes before conducting downstream analysis or batch correction. For this purpose, we propose an efficient algorithm to extend the single-study variance-based gene selection method to a multi-study gene selection algorithm. Our empirical results show this feature selection algorithm outperforms other algorithms in reducing the destructive influence of batch effects.
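
As a rough illustration of variance-based gene selection, the sketch below ranks genes by variance within each study and keeps only those that rank highly in every study. It is not the algorithm proposed in the thesis, which this abstract does not specify. The intersection rule, the function names, and the toy expression values are all assumptions made for illustration.

```python
from statistics import pvariance


def top_variance_genes(study, k):
    """Single-study selection: return the k gene names with the highest variance.

    `study` maps a gene name to a list of expression values, one per specimen.
    """
    ranked = sorted(study, key=lambda g: pvariance(study[g]), reverse=True)
    return set(ranked[:k])


def multi_study_selection(studies, k):
    """Hypothetical multi-study rule (an assumption, not the thesis method):
    keep only genes that are among the top k by variance in every study."""
    selected = [top_variance_genes(s, k) for s in studies]
    return set.intersection(*selected)


# Toy data, made up for illustration: 4 genes, 3 specimens per study.
study_a = {
    "g1": [1.0, 5.0, 9.0],
    "g2": [2.0, 2.1, 1.9],
    "g3": [0.0, 4.0, 8.0],
    "g4": [3.0, 3.0, 3.0],
}
study_b = {
    "g1": [2.0, 6.0, 10.0],
    "g2": [1.0, 7.0, 13.0],
    "g3": [5.0, 5.1, 5.0],
    "g4": [4.0, 4.0, 4.0],
}

kept = multi_study_selection([study_a, study_b], k=2)
print(sorted(kept))  # ['g1']
```

In this toy example, g4 is constant in both studies and so carries no signal under a variance criterion, while only g1 varies strongly in both studies. A practical version would also need to handle studies measured on different scales, which this sketch ignores.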

Citation

S. Vaisipour. "Detecting, correcting, and preventing the batch effects in multi-site data, with a focus on gene expression Microarrays". PhD Thesis, January 2014.

Keywords: batch effects, microarray
Category: PhD Thesis

BibTeX

@phdthesis{Vaisipour:14,
  author = {Saman Vaisipour},
  title = {Detecting, correcting, and preventing the batch effects in
    multi-site data, with a focus on gene expression Microarrays},
  year = 2014,
}

Last Updated: August 16, 2014
Submitted by Nelson Loyola
