A Simple Closed-Class/Open-Class Factorization for Improved Language Modeling
- Fuchun Peng, Department of Computer Science, University of Massachusetts at Amherst
- Dale Schuurmans, AICML
We describe a simple improvement to n-gram language models where we estimate the distribution over closed-class (function) words separately from the conditional distribution of open-class words given function words. In English, function words account for about 30% of written language, and also form a natural skeleton for most sentences. By factoring a language model into a function word model and a conditional model over open-class words given function words, we largely avoid the problem of sparse training data in the first phase, and localize the need for sophisticated smoothing techniques primarily to the second conditional model. We test our factored approach on the Brown and Wall Street Journal corpora and observe a 3.5% to 25.2% improvement in perplexity over standard methods, depending on the particular smoothing method and test set used. Compared to other proposals for improving n-gram language models, our factorization has the advantage of inherent simplicity and efficiency, and improves generalization between data sets.
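The factorization in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tiny function-word list, the toy corpus, the choice of bigrams over the skeleton, and the add-one smoothing are all assumptions made for the example.

```python
# Sketch of a closed-class/open-class factored language model:
# P(sentence) = P(function-word skeleton) * P(open-class words | skeleton).
# All modeling choices below (word list, smoothing, corpus) are illustrative.
from collections import Counter
import math

# Tiny closed class of function words (assumption; real lists have ~300 entries).
FUNCTION_WORDS = {"the", "a", "of", "in", "to", "and", "is"}

def factor(sentence):
    """Split a token list into the function-word skeleton and the
    open-class words paired with their most recent function word."""
    skeleton, pairs = [], []
    prev_func = "<s>"
    for tok in sentence:
        if tok in FUNCTION_WORDS:
            skeleton.append(tok)
            prev_func = tok
        else:
            pairs.append((prev_func, tok))
    return skeleton, pairs

corpus = [
    "the cat sat in the hat".split(),
    "a dog ran to the park".split(),
]

# First model: bigrams over the dense function-word skeleton (little sparsity).
skel_bigrams, skel_unigrams = Counter(), Counter()
# Second model: open-class word conditioned on the preceding function word.
open_counts, func_context = Counter(), Counter()

for sent in corpus:
    skel, pairs = factor(sent)
    prev = "<s>"
    for w in skel + ["</s>"]:
        skel_bigrams[(prev, w)] += 1
        skel_unigrams[prev] += 1
        prev = w
    for f, o in pairs:
        open_counts[(f, o)] += 1
        func_context[f] += 1

V = 50  # assumed vocabulary size for add-one smoothing

def log_prob(sentence):
    """Factored log-probability of a sentence under the two models."""
    skel, pairs = factor(sentence)
    lp, prev = 0.0, "<s>"
    for w in skel + ["</s>"]:
        lp += math.log((skel_bigrams[(prev, w)] + 1) / (skel_unigrams[prev] + V))
        prev = w
    for f, o in pairs:
        lp += math.log((open_counts[(f, o)] + 1) / (func_context[f] + V))
    return lp

print(log_prob("the cat ran to the park".split()))
```

The point of the split is visible even at this scale: the skeleton model sees a vocabulary of only a handful of function words, so its bigram counts fill in quickly, while all the sparsity is pushed into the second, conditional model, where the heavier smoothing effort pays off.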
Citation
F. Peng, D. Schuurmans. "A Simple Closed-Class/Open-Class Factorization for Improved Language Modeling". Natural Language Processing Pacific Rim Symposium, December 2001.
Keywords: | factorization, machine learning |
Category: | In Conference |
BibTeX
@incollection{Peng+Schuurmans:NLPRS01,
  author    = {Fuchun Peng and Dale Schuurmans},
  title     = {A Simple Closed-Class/Open-Class Factorization for Improved Language Modeling},
  booktitle = {Natural Language Processing Pacific Rim Symposium},
  year      = 2001,
}
Last Updated: June 01, 2007
Submitted by Stuart H. Johnson