
Survival Prediction using Gene Expression Data - A Topic Modeling Approach

Full Text: Kumar_Luke_N_201612_MSc.pdf (PDF)

Survival prediction is becoming a crucial part of treatment planning for most terminally ill patients. Many believe that genomic data will enable better estimates of these patients' survival, leading to better, more personalized treatment options and patient care. Because standard survival prediction models cannot cope with the high dimensionality of gene expression data, many projects apply dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from high-dimensional gene expression data. In topic modeling, a document is represented as a mixture over a relatively small number of topics, where each topic is a distribution over words; here, to accommodate the heterogeneity of a patient's cancer, we represent each patient (document) as a mixture over "(cancer) strains" (topics), where each strain is a mixture over gene expression values (words).
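To make the patient-as-document analogy concrete, the sketch below turns one patient's expression values into a bag of "words" that a topic model could consume. The abstract does not specify how dLDA discretizes expression values, so the scheme here (truncated z-scores, with separate "up" and "down" words per gene) is only an illustrative assumption, and the names `discretize_patient`, `gene_stats`, and `max_level` are hypothetical.

```python
from collections import Counter


def discretize_patient(expression, gene_stats, max_level=3):
    """Convert one patient's expression values into word counts.

    expression: dict mapping gene name -> raw expression value
    gene_stats: dict mapping gene name -> (cohort mean, cohort std)
    max_level:  cap on how many times one word may be repeated

    ASSUMPTION (not taken from the thesis): a gene whose z-score has
    magnitude >= 1 emits int(|z|) copies (capped at max_level) of the
    word "<gene>_up" or "<gene>_down", depending on the sign of z.
    """
    words = Counter()
    for gene, value in expression.items():
        mu, sigma = gene_stats[gene]
        if sigma == 0:
            continue  # a constant gene carries no information
        z = (value - mu) / sigma
        level = min(int(abs(z)), max_level)
        if level > 0:
            word = f"{gene}_up" if z > 0 else f"{gene}_down"
            words[word] = level
    return words


# Toy cohort statistics and one toy patient (made-up numbers).
stats = {"G1": (0.0, 1.0), "G2": (5.0, 2.0), "G3": (1.0, 4.0)}
patient = {"G1": 2.5, "G2": 0.0, "G3": 2.0}
doc = discretize_patient(patient, stats)
```

In this toy case `doc` contains two copies each of `G1_up` and `G2_down`, while `G3` stays within one standard deviation of its mean and contributes no words. Stacking such counts over all patients gives the document-by-word matrix that an LDA-style model takes as input.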
After using our novel discretized Latent Dirichlet Allocation (dLDA) procedure to learn these strains, we express each patient as a distribution over a small number of strains and use this distribution as input to a learning algorithm. Specifically, we run a recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We focus on the METABRIC dataset, which describes each of n=1,981 breast cancer patients using k=49,576 gene expression values. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than those of standard models, in terms of the standard Concordance Index as well as a relevant novel measure, D-calibration. We then validate this approach on the TCGA BRCA dataset, with n=1,082 patients and k=20,532 gene expression values.
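For readers unfamiliar with the Concordance Index mentioned above, here is a minimal sketch of the commonly used Harrell-style definition for right-censored data. It is not code from the thesis; the function name `concordance_index` and the toy numbers are illustrative assumptions, and pairs with tied event times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the model orders correctly.

    times:       observed time for each patient (event or censoring time)
    events:      1/True if the event was observed, 0/False if censored
    risk_scores: model output, where a higher score means shorter
                 predicted survival

    A pair (i, j) is comparable when patient i has an observed event and
    times[i] < times[j]. Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four patients, the third one censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
ci = concordance_index(times, events, risks)  # 4 of 5 comparable pairs -> 0.8
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ranking. The index measures only discrimination (ordering), which is why the abstract pairs it with a separate calibration measure.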

Citation

L. Kumar. "Survival Prediction using Gene Expression Data - A Topic Modeling Approach". MSc Thesis, Computing Science, December 2016.

Keywords: Survival prediction, Topic models, Gene expression, Machine learning, Calibration and Discrimination
Category: MSc Thesis

BibTeX

@mastersthesis{Kumar:16,
  author = {Luke Kumar},
  title = {Survival Prediction using Gene Expression Data - A Topic Modeling
    Approach},
  School = {Computing Science},
  Type = "Thesis",
  year = 2016,
}

Last Updated: January 24, 2017
Submitted by Luke Kumar
