
Breast Cancer Prediction Using Genome Wide Single Nucleotide Polymorphism Data

Mohsen Hajiloo, Dept of Computing Science
Babak Damavandi
Metanat Hooshsadat
Farzad Sangi
John Mackey, Cross Cancer Institute
Carol Cass, Cross Cancer Institute
Russ Greiner, Dept of Computing Science; PI of AICML
Sambasivarao Damaraju, Cross Cancer Institute

Abstract
Background
This paper introduces and applies a genome wide predictive study to learn a model that predicts whether a new subject will develop breast cancer or not, based on her SNP profile.
Results
We first genotyped 696 female subjects (348 breast cancer cases and 348 apparently healthy controls), predominantly of Caucasian origin from Alberta, Canada, using Affymetrix Human SNP 6.0 arrays. We then applied the EIGENSTRAT population stratification correction method to remove 73 subjects not belonging to the Caucasian population. Next, we filtered out any SNP that had any missing calls, whose genotype frequencies deviated from Hardy-Weinberg equilibrium, or whose minor allele frequency was less than 5%. Finally, we applied a combination of the MeanDiff feature selection method and the k-nearest neighbors (KNN) learning method to this filtered dataset to produce a breast cancer prediction model. The leave-one-out cross-validation (LOOCV) accuracy of this classifier is 59.55%. Random permutation tests show that this result is significantly better than the baseline accuracy of 51.52%. Sensitivity analysis shows that the classifier is fairly robust to the number of MeanDiff-selected SNPs. External validation on the CGEMS breast cancer dataset, the only other publicly available breast cancer dataset, shows that this combination of MeanDiff and KNN leads to a LOOCV accuracy of 60.25%, which is significantly better than its baseline of 50.06%. We then considered a dozen different combinations of feature selection and learning methods, but found that none of these combinations produced a better predictive model than ours. We also considered various biological feature selection methods to produce predictive models, such as selecting SNPs reported in recent genome wide association studies to be associated with breast cancer, selecting SNPs in genes associated with KEGG cancer pathways, or selecting SNPs associated with breast cancer in the F-SNP database, but again found that none of these models achieved accuracy better than baseline.
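The filtering and modelling steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes genotypes are coded as 0/1/2 allele counts with NaN for missing calls, and that MeanDiff scores each SNP by the absolute difference between its mean value in cases and in controls (the abstract does not define it). The function names, the L1 distance, and the choice to redo feature selection inside every LOOCV fold are illustrative assumptions, and the Hardy-Weinberg filter is omitted for brevity.

```python
import numpy as np

def maf_filter(G, min_maf=0.05):
    """Boolean mask of SNP columns with no missing calls and MAF >= min_maf.

    G: (subjects, snps) float array of allele counts 0/1/2, NaN = missing.
    """
    no_missing = ~np.isnan(G).any(axis=0)
    freq = np.nanmean(G, axis=0) / 2.0       # frequency of the counted allele
    maf = np.minimum(freq, 1.0 - freq)       # minor allele frequency
    return no_missing & (maf >= min_maf)

def mean_diff_select(X, y, n_snps):
    """Assumed MeanDiff: indices of the n_snps columns with the largest
    absolute difference between case (y == 1) and control (y == 0) means."""
    scores = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(scores, kind="stable")[::-1][:n_snps]

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training subjects (L1 distance)."""
    dist = np.abs(X_train - x).sum(axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]
    return int(2 * y_train[nearest].sum() > k)

def loocv_accuracy(X, y, n_snps, k):
    """Leave-one-out accuracy, selecting SNPs inside each fold so the
    held-out subject never influences feature selection."""
    n = X.shape[0]
    correct = 0
    for i in range(n):
        train = np.ones(n, dtype=bool)
        train[i] = False
        cols = mean_diff_select(X[train], y[train], n_snps)
        pred = knn_predict(X[train][:, cols], y[train], X[i, cols], k)
        correct += int(pred == y[i])
    return correct / n
```

A typical call would be `loocv_accuracy(G[:, maf_filter(G)], labels, n_snps=500, k=5)`, where `G`, `labels`, and both parameter values are placeholders rather than settings reported in the paper.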
Conclusions
We anticipate producing more accurate breast cancer prediction models by recruiting more study subjects, providing more accurate labelling of phenotypes (to accommodate the heterogeneity of breast cancer), measuring other genomic alterations such as point mutations and copy number variations, and incorporating non-genetic information about subjects such as environmental and lifestyle factors.

Citation

M. Hajiloo, B. Damavandi, M. Hooshsadat, F. Sangi, J. Mackey, C. Cass, R. Greiner, S. Damaraju. "Breast Cancer Prediction Using Genome Wide Single Nucleotide Polymorphism Data". BMC Bioinformatics, 14(Suppl 13), pp S3, October 2013.

Keywords:	machine learning, predictive tool, SNPs, breast cancer, genetic susceptibility, single nucleotide polymorphisms, genome wide association studies, complex disease, medical informatics
Category:	In Journal

BibTeX

@article{Hajiloo+al:13,
  author = {Mohsen Hajiloo and Babak Damavandi and Metanat Hooshsadat and
    Farzad Sangi and John Mackey and Carol Cass and Russ Greiner and
    Sambasivarao Damaraju},
  title = {Breast Cancer Prediction Using Genome Wide Single Nucleotide
    Polymorphism Data},
  volume = {14},
  number = {Suppl 13},
  pages = {S3},
  journal = {BMC Bioinformatics},
  year = 2013,
}

Last Updated: February 10, 2020
Submitted by Sabina P
