
Breast Cancer Prediction Using Genome Wide Single Nucleotide Polymorphism Data

Full Text: BMC Bioinformatics1471-2105-14-S13-S3.pdf [PDF]
Other Attachments: BIOT2012-4PageExtendedAbstract.pdf [PDF]

Abstract
This paper introduces and applies a Genome Wide Predictive Study (GWPS) to learn a model that predicts whether a new subject will develop breast cancer or not, based on her SNP profile. We applied a combination of a feature selection method (MeanDiff) and a learning method (K-Nearest Neighbours, KNN) to a dataset of 623 female subjects, including 302 cases of breast cancer and 321 apparently healthy controls from Alberta, Canada. The learning algorithm considered all the SNPs (506,836) from a whole genome scan with a 100% call rate and a minor allele frequency > 5%. Our learning system produced a classifier to predict whether a novel subject has breast cancer or not. The leave-one-out cross-validation (LOOCV) accuracy of this classifier is 59.55%. A random permutation test shows that this result is significantly better than the baseline accuracy of 51.52%. Sensitivity analysis shows that our model is robust to the number of selected SNPs. To better understand the challenge of this task, we then considered other learning systems, each formed by pairing some learner [decision trees, support vector machines (SVM), or KNN] with some feature selection technique [ranging from biologically naive approaches, such as information gain, minimum redundancy maximum relevance (mRMR), principal component analysis (PCA) and MeanDiff, to biologically informed approaches that use only the SNPs (i) reported to be associated with breast cancer in the literature; (ii) associated with genes of KEGG cancer pathways; or (iii) associated with breast cancer in the F-SNP database]. However, none of these combinations yielded a 10-fold CV score better than our MeanDiff + KNN combination; indeed, only a few of these accuracies were even better than the baseline. We then used the only relevant publicly available breast cancer dataset (the CGEMS breast cancer dataset, with 1145 breast cancer cases and 1142 controls) to further validate our approach.
Due to cross-platform differences, only 103 of the 500 Affy 6.0 SNPs selected by our algorithm were present on the CGEMS Illumina I5 array, so we could not test the model trained on our data on the CGEMS dataset. We could, however, use the CGEMS dataset to demonstrate the reproducibility of our combination of MeanDiff and KNN, which led to a LOOCV accuracy of 60.25%, significantly better than the CGEMS baseline of 50.06%. This study shows that applying machine learning techniques to GWAS data can produce a model that can effectively predict if a novel subject will develop breast cancer or not. We anticipate producing yet more accurate models by using datasets that include more subjects and that incorporate other types of information about these women, including environmental and lifestyle factors, as well as other genomic alterations in the form of point mutations and Copy Number Variations (CNVs).
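
The MeanDiff + KNN pipeline with LOOCV and a permutation test can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not spell out: genotypes are coded as minor-allele counts (0, 1, 2); MeanDiff scores each SNP by the absolute difference between its mean genotype in cases and in controls; KNN uses squared Euclidean distance with majority voting; and feature selection is redone inside every LOOCV fold so the held-out subject never influences which SNPs are chosen. All function names and parameters (`m` selected SNPs, `k` neighbours, `n_perm` permutations) are hypothetical.

```python
import random
from collections import Counter


def mean_diff_scores(X, y):
    """Score each SNP by |mean genotype in cases - mean genotype in controls|.

    Assumed reading of MeanDiff; requires at least one case (1) and one control (0).
    """
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        mean_case = sum(row[j] for row in cases) / len(cases)
        mean_ctrl = sum(row[j] for row in controls) / len(controls)
        scores.append(abs(mean_case - mean_ctrl))
    return scores


def select_top(scores, m):
    """Return the indices of the m highest-scoring SNPs."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:m]


def knn_predict(train_X, train_y, query, features, k):
    """Majority vote among the k nearest training subjects on the selected SNPs."""
    dists = sorted(
        (sum((row[j] - query[j]) ** 2 for j in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, m, k):
    """Leave-one-out accuracy, reselecting SNPs on each training fold."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        features = select_top(mean_diff_scores(train_X, train_y), m)
        if knn_predict(train_X, train_y, X[i], features, k) == y[i]:
            correct += 1
    return correct / len(X)


def permutation_p_value(X, y, m, k, n_perm, seed=0):
    """Fraction of label shufflings whose LOOCV accuracy reaches the observed one."""
    rng = random.Random(seed)
    observed = loocv_accuracy(X, y, m, k)
    hits = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if loocv_accuracy(X, shuffled, m, k) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `X` is a list of per-subject genotype lists and `y` a list of 0/1 labels. The reported baselines are consistent with always predicting the majority class (321/623 ≈ 51.52%). An odd `k` avoids tied votes in the two-class setting. On data of the size described above (623 subjects, 506,836 SNPs) a vectorised implementation would be needed; this pure-Python version is only meant to make the procedure concrete.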

Citation

M. Hajiloo, B. Damavandi, M. Hooshsadat, F. Sangi, J. Mackey, C. Cass, R. Greiner, S. Damaraju. "Breast Cancer Prediction Using Genome Wide Single Nucleotide Polymorphism Data". Biotechnology and Bioinformatics Symposium, pp n/a, October 2012.

Keywords: machine learning, predictive tool, breast cancer, genetic susceptibility, single nucleotide polymorphisms, genome wide association studies, complex disease, medical informatics
Category: In Conference

BibTeX

@incollection{Hajiloo+al:BIOT12,
  author = {Mohsen Hajiloo and Babak Damavandi and Metanat Hooshsadat and
    Farzad Sangi and John Mackey and Carol Cass and Russ Greiner and
    Sambasivarao Damaraju},
  title = {Breast Cancer Prediction Using Genome Wide Single Nucleotide
    Polymorphism Data},
  pages = {n/a},
  booktitle = {Biotechnology and Bioinformatics Symposium},
  year = 2012,
}

Last Updated: February 12, 2020
Submitted by Sabina P
