Assessing the Feasibility of Learning Biomedical Phenotypes via Large Scale Omics Profiles

Other Attachments: NIPS MLCB 2013.pdf

This paper applies the computational learning theory framework to explain what distinguishes hard bioinformatics learning tasks from easy ones. While most published predictive studies report the empirical error of a model trained to learn a specific phenotype from a group of subjects profiled by a recent omics measurement technology, very few explain why learning is feasible in some cases and infeasible in others. Our recently published results show that some tasks (such as predicting the (sub)continental ancestral origins of individuals) are quite easy, while others (such as predicting susceptibility to breast cancer) are extremely difficult. Our analysis suggests that the ancestral origin prediction problem is a case of realizable learning in the presence of many irrelevant features, which suggests that a training dataset with (1/ε) × (ln|H| + ln(1/δ)) samples would suffice for PAC learning this target concept. On the other hand, our analysis suggests that the breast cancer prediction problem appears to be a case of unrealizable learning from incomplete examples with relevant hidden features and hidden subclasses, which suggests that a training dataset with at least max((L_H/(4ε)^2) × (d_1 - 1)/8, (L_H/(4ε)^2) × ln(1/4δ), d_2/(ε × (1 - 2L_H')^2)) samples is necessary for PAC learning this target concept in the worst case. The paper also discusses how the number of irrelevant features, relevant hidden features, and hidden subclasses affects the sample complexity of learning biomedical phenotypes, which is directly relevant to tasks involving high-throughput omics profiles. This paper can help omics researchers planning predictive studies to estimate the necessary and sufficient number of training examples.
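
As a rough illustration of the sufficient-sample expression quoted in the abstract for the realizable case, (1/ε) × (ln|H| + ln(1/δ)), the following sketch evaluates it numerically. The function name, the parameter names, and the example values are assumptions made for illustration and do not come from the paper. The function takes ln|H| directly, since |H| itself can be too large to represent for high-dimensional omics data.

```python
import math


def pac_sufficient_samples(ln_h: float, eps: float, delta: float) -> int:
    """Evaluate (1/eps) * (ln|H| + ln(1/delta)), rounded up.

    ln_h  : natural log of the size of a finite hypothesis class H
    eps   : target error bound, 0 < eps < 1
    delta : allowed failure probability, 0 < delta < 1
    """
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    if ln_h < 0.0:
        raise ValueError("ln_h must be non-negative")
    return math.ceil((ln_h + math.log(1.0 / delta)) / eps)


# Hypothetical example: a class of 1000 hypotheses, 10% error, 95% confidence.
small_class = pac_sufficient_samples(math.log(1000), eps=0.1, delta=0.05)

# Hypothetical example: a class with |H| = 2**n for n binary features
# (an assumed class, chosen only to show how n enters through ln|H|).
n_features = 100
large_class = pac_sufficient_samples(n_features * math.log(2), eps=0.1, delta=0.05)
```

Because the class size enters only through ln|H|, the value grows linearly in n for the assumed class with |H| = 2**n, even though |H| itself grows exponentially.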

Citation

M. Hajiloo, R. Greiner. "Assessing the Feasibility of Learning Biomedical Phenotypes via Large Scale Omics Profiles". Neural Information Processing Systems Workshop on Machine Learning in Computational Biology, pp n/a, December 2013.

Keywords: omics, SNP, breast cancer prediction, ancestral origin prediction, ensemble learning, computational learning theory, sample complexity, PAC learning, learning with irrelevant features, learning with hidden subclasses, probabilistic concept learning
Category: In Workshop

BibTeX

@misc{Hajiloo+Greiner:NIPSMLCB13,
  author = {Mohsen Hajiloo and Russ Greiner},
  title = {Assessing the Feasibility of Learning Biomedical Phenotypes via
    Large Scale Omics Profiles},
  pages = {n/a},
  booktitle = {Neural Information Processing Systems Workshop on Machine
    Learning in Computational Biology},
  year = 2013,
}

Last Updated: February 12, 2020
Submitted by Sabina P