Integrating Trigram, PCFG and LDA for Language Modeling via Directed Markov Random Fields

The latent Dirichlet allocation (LDA) model is a promising generative probabilistic model for extracting semantic information from text data. This paper presents a directed Markov random field (MRF) model that combines n-gram models, probabilistic context-free grammars (PCFG) and LDA for statistical language modeling. We present efficient approximate inference techniques that decompose the original random field into two sub-models: one consists of a variational distribution, and the other is solvable by a tractable algorithm. We use an EM algorithm for empirical Bayes and the generalized inside-outside algorithm, respectively, to estimate the parameters of these models. Our experimental results on the Wall Street Journal corpus show that the composite trigram/PCFG/LDA model consistently achieves further perplexity reductions over the trigram/PCFG/PLSA model and is resistant to over-fitting.

Citation

S. Wang, R. Greiner, D. Schuurmans, L. Cheng, S. Wang. "Integrating Trigram, PCFG and LDA for Language Modeling via Directed Markov Random Fields". NIPS Workshop on Bayesian Methods for Natural Language Processing, December 2005.

Keywords: language modeling, random field, PLSA, machine learning
Category: In Workshop

BibTeX

@inproceedings{Wang+al:NIPS-BMforNL05,
  author = {Shaojun Wang and Russ Greiner and Dale Schuurmans and Li Cheng and
    Shaomin Wang},
  title = {Integrating Trigram, PCFG and LDA for Language Modeling via Directed
    Markov Random Fields},
  booktitle = {NIPS Workshop on Bayesian Methods for Natural Language
    Processing},
  year = 2005,
}

Last Updated: October 13, 2013
Submitted by Russ Greiner
