A Hierarchical EM Approach to Word Segmentation

Full Text: a-hierarchical-em-approach.pdf (PDF)

We propose a simple two-level hierarchical probability model for unsupervised word segmentation. By treating words as strings composed of morphemes/phonemes which are themselves composed of character/phone strings, we use EM to first identify the important morphemes/phonemes in a corpus, and then use a second level of EM to identify words given a lower level morpheme/phoneme segmentation. To further improve performance of the basic method we employ a mutual information criterion to eliminate long word agglomerations and reduce the size of the inferred lexicon while moving EM out of poor local maxima. Experiments on the Brown corpus show that our method accurately recovers hidden word boundaries using less training data than current MDL-based approaches, even though our method is only trained on raw unsupervised data.
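To make the idea of EM-based segmentation concrete, the following is a minimal sketch of a single level of EM over a unigram lexicon of substrings, followed by a Viterbi decode. It is not the authors' implementation: the function names `em_segment` and `viterbi`, the `max_len` cap on candidate length, the uniform initialization, and the toy corpus are all assumptions made for illustration. The two-level hierarchy and the mutual information pruning described in the abstract are not reproduced here.

```python
import math
from collections import defaultdict


def em_segment(corpus, max_len=4, iters=10):
    """Illustrative single-level EM over a unigram lexicon (an assumption,
    not the paper's two-level model). Returns {substring: probability}."""
    # Candidate lexicon: every substring of length <= max_len, uniform start.
    cands = {s[i:j] for s in corpus for i in range(len(s))
             for j in range(i + 1, min(i + max_len, len(s)) + 1)}
    prob = {w: 1.0 / len(cands) for w in cands}

    for _ in range(iters):
        counts = defaultdict(float)
        for s in corpus:
            n = len(s)
            # Forward pass: alpha[j] sums over all segmentations of s[:j].
            alpha = [0.0] * (n + 1)
            alpha[0] = 1.0
            for j in range(1, n + 1):
                for i in range(max(0, j - max_len), j):
                    alpha[j] += alpha[i] * prob.get(s[i:j], 0.0)
            # Backward pass: beta[i] sums over all segmentations of s[i:].
            beta = [0.0] * (n + 1)
            beta[n] = 1.0
            for i in range(n - 1, -1, -1):
                for j in range(i + 1, min(i + max_len, n) + 1):
                    beta[i] += prob.get(s[i:j], 0.0) * beta[j]
            z = alpha[n]
            if z == 0.0:
                continue  # string not coverable by the current lexicon
            # E-step: expected count of each candidate word occurrence.
            for i in range(n):
                for j in range(i + 1, min(i + max_len, n) + 1):
                    w = s[i:j]
                    counts[w] += alpha[i] * prob.get(w, 0.0) * beta[j] / z
        # M-step: renormalize expected counts into probabilities.
        total = sum(counts.values())
        prob = {w: c / total for w, c in counts.items() if c > 0.0}
    return prob


def viterbi(s, prob, max_len=4):
    """Most probable segmentation of s under the unigram lexicon."""
    n = len(s)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            p = prob.get(s[i:j], 0.0)
            if p > 0.0 and best[i] + math.log(p) > best[j]:
                best[j] = best[i] + math.log(p)
                back[j] = i
    words, j = [], n
    while j > 0:
        words.append(s[back[j]:j])
        j = back[j]
    return words[::-1]


if __name__ == "__main__":
    toy = ["thedog", "thecat", "adog", "acat"]  # hypothetical toy corpus
    lexicon = em_segment(toy, max_len=3, iters=5)
    for line in toy:
        print(line, viterbi(line, lexicon, max_len=3))
```

The sketch multiplies raw probabilities in the forward and backward passes, so it is only suitable for short strings; a practical version would work in log space or rescale per position. Plain EM of this kind can also settle on long agglomerated entries, which is the failure mode the mutual information criterion in the abstract is meant to address.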

Citation

F. Peng, D. Schuurmans. "A Hierarchical EM Approach to Word Segmentation". Natural Language Processing Pacific Rim Symposium, November 2001.

Keywords: hierarchical
Category: In Conference

BibTeX

@incollection{Peng+Schuurmans:NLPRS01,
  author = {Fuchun Peng and Dale Schuurmans},
  title = {A Hierarchical EM Approach to Word Segmentation},
  booktitle = {Natural Language Processing Pacific Rim Symposium},
  year = 2001,
}

Last Updated: June 01, 2007
Submitted by Staurt H. Johnson
