Augmenting Naive Bayes Classifiers with Statistical Language Models
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier which allows for a local Markov dependence among observations; a model we refer to as the Chain Augmented Naive Bayes (CAN) classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes---allowing a local Markov chain dependence in the observed variables---while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language-independent and task-independent nature of these classifiers, we present experimental results on several text classification problems---authorship attribution, text genre classification, and topic detection---in several languages---Greek, English, Japanese, and Chinese. We then systematically study the key factors in the CAN model that can influence the classification performance, and analyze the strengths and weaknesses of the model.
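The core idea can be illustrated with a minimal sketch: each class is scored by a class-conditional bigram language model, and a document is assigned to the class maximizing the log prior plus the sequence log-likelihood. The class names, the simple interpolation smoothing, and all identifiers below are illustrative assumptions for exposition, not the paper's exact formulation or code.

import math
from collections import Counter

class CANClassifier:
    """Sketch of a Chain Augmented Naive Bayes model: one bigram LM per class."""

    def __init__(self, lam=0.8):
        self.lam = lam          # weight on the bigram estimate vs. the unigram back-off
        self.bigrams = {}       # class -> Counter over (prev, cur) token pairs
        self.contexts = {}      # class -> Counter over context tokens
        self.unigrams = {}      # class -> Counter over tokens
        self.log_priors = {}    # class -> log P(class)
        self.vocab = set()

    def fit(self, docs, labels):
        label_counts = Counter(labels)
        for c, n in label_counts.items():
            self.log_priors[c] = math.log(n / len(labels))
            self.bigrams[c], self.contexts[c], self.unigrams[c] = Counter(), Counter(), Counter()
        for tokens, c in zip(docs, labels):
            seq = ["<s>"] + tokens          # sentence-start symbol for the chain
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[c][(prev, cur)] += 1
                self.contexts[c][prev] += 1
                self.unigrams[c][cur] += 1
            self.vocab.update(tokens)

    def _log_prob(self, c, prev, cur):
        # Interpolate an MLE bigram with an add-one unigram; a simple stand-in
        # for the stronger LM smoothing methods studied in the paper.
        uni_total = sum(self.unigrams[c].values())
        p_uni = (self.unigrams[c][cur] + 1) / (uni_total + len(self.vocab) + 1)
        ctx = self.contexts[c][prev]
        p_bi = self.bigrams[c][(prev, cur)] / ctx if ctx else 0.0
        return math.log(self.lam * p_bi + (1 - self.lam) * p_uni)

    def predict(self, tokens):
        seq = ["<s>"] + tokens
        scores = {c: lp + sum(self._log_prob(c, p, w) for p, w in zip(seq, seq[1:]))
                  for c, lp in self.log_priors.items()}
        return max(scores, key=scores.get)

clf = CANClassifier()
clf.fit([["the", "cat", "sat"], ["stock", "prices", "rose"]], ["pets", "finance"])
print(clf.predict(["prices", "rose"]))   # -> finance

Note that with lam = 0 the chain term vanishes and the model reduces to a unigram (bag-of-words) naive Bayes classifier with add-one smoothing; the bigram term is what adds the local Markov dependence among observations.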
Citation
F. Peng, D. Schuurmans, S. Wang. "Augmenting Naive Bayes Classifiers with Statistical Language Models". Information Retrieval (IR), October 2003.
| Keywords: | n-gram language model, machine learning |
| Category: | In Journal |
BibTeX
@article{Peng+al:IR03,
  author = {Fuchun Peng and Dale Schuurmans and Shaojun Wang},
  title = {Augmenting Naive Bayes Classifiers with Statistical Language Models},
  journal = {Information Retrieval (IR)},
  year = 2003,
}

Last Updated: March 21, 2007
Submitted by Nelson Loyola