A Simple Closed-Class/Open-Class Factorization for Improved Language Modeling

Full Text: a-simple-closed-class.pdf

We describe a simple improvement to n-gram language models where we estimate the distribution over closed-class (function) words separately from the conditional distribution of open-class words given function words. In English, function words account for about 30% of written language, and also form a natural skeleton for most sentences. By factoring a language model into a function word model and a conditional model over open-class words given function words, we largely avoid the problem of sparse training data in the first phase, and localize the need for sophisticated smoothing techniques primarily to the second conditional model. We test our factored approach on the Brown and Wall Street Journal corpora and observe a 3.5% to 25.2% improvement in perplexity over standard methods, depending on the particular smoothing method and test set used. Compared to other proposals for improving n-gram language models, our factorization has the advantage of inherent simplicity and efficiency, and improves generalization between data sets.
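To make the factorization concrete, the Python sketch below scores a sentence as the product of two factors: a bigram model over the function-word skeleton, and a model of each open-class word given the most recent function word. This is our own illustration, not the authors' implementation: the toy closed-class word list, the bigram order, and the add-one smoothing are simplifying assumptions standing in for the fuller inventory and the smoothing methods evaluated in the paper.

import math
from collections import defaultdict

# Illustrative closed-class word list; the paper uses a fuller inventory.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is",
                  "it", "on", "that", "this", "with", "for", "was"}

class FactoredBigramLM:
    def __init__(self):
        self.skel_bi = defaultdict(int)   # function-word bigram counts
        self.skel_uni = defaultdict(int)  # function-word context counts
        self.open_bi = defaultdict(int)   # (last function word, open word) counts
        self.open_ctx = defaultdict(int)  # open-class context counts
        self.vocab = set()

    def train(self, corpus):
        for sent in corpus:
            prev = "<s>"
            for w in sent:
                self.vocab.add(w)
                if w in FUNCTION_WORDS:
                    # First factor: distribution over the closed-class skeleton.
                    self.skel_bi[(prev, w)] += 1
                    self.skel_uni[prev] += 1
                    prev = w
                else:
                    # Second factor: open-class word given the last function word.
                    self.open_bi[(prev, w)] += 1
                    self.open_ctx[prev] += 1

    def logprob(self, sent):
        """Sum of log P(skeleton) and log P(open words | skeleton)."""
        V = len(self.vocab) + 1
        lp, prev = 0.0, "<s>"
        for w in sent:
            if w in FUNCTION_WORDS:
                # Add-one smoothing stands in for the paper's smoothing methods.
                lp += math.log((self.skel_bi[(prev, w)] + 1) /
                               (self.skel_uni[prev] + V))
                prev = w
            else:
                lp += math.log((self.open_bi[(prev, w)] + 1) /
                               (self.open_ctx[prev] + V))
        return lp

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["a", "dog", "sat", "on", "a", "log"]]
lm = FactoredBigramLM()
lm.train(corpus)
print(lm.logprob(["the", "dog", "sat", "on", "the", "log"]))

Because function words are frequent, the first factor sees dense counts; the sparse-data burden, and hence the need for careful smoothing, is pushed into the second factor, which is the intuition the abstract describes.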

Citation

F. Peng, D. Schuurmans. "A Simple Closed-Class/Open-Class Factorization for Improved Language Modeling". Natural Language Processing Pacific Rim Symposium, December 2001.

Keywords: factorization, machine learning
Category: In Conference

BibTeX

@inproceedings{Peng+Schuurmans:NLPRS01,
  author = {Fuchun Peng and Dale Schuurmans},
  title = {A Simple Closed-Class/Open-Class Factorization for Improved Language
    Modeling},
  booktitle = {Natural Language Processing Pacific Rim Symposium},
  year = 2001,
}

Last Updated: June 01, 2007
Submitted by Stuart H. Johnson
