
Applying Machine Learning to Text Segmentation for Information Retrieval

Xiangji Huang, School of Computer Science, University of Waterloo
Fuchun Peng, Department of Computer Science, University of Massachusetts at Amherst
Dale Schuurmans, AICML
Nick Cercone, School of Computer Science, University of Waterloo
Stephen E. Robertson, Microsoft Research Ltd., UK and City University, London, UK

We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary-based, character-based and mutual-information-based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give comparable (in some cases even better) retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that different evaluation standards for word segmentation should be applied to different applications.

Citation

X. Huang, F. Peng, D. Schuurmans, N. Cercone, S. Robertson. "Applying Machine Learning to Text Segmentation for Information Retrieval". Information Retrieval (IR), 6(3), pp. 333-362, September 2003.

Keywords:	machine learning, word segmentation, EM algorithm
Category:	In Journal

BibTeX

@article{Huang+al:IR03,
  author = {Xiangji Huang and Fuchun Peng and Dale Schuurmans and Nick Cercone
    and Stephen E. Robertson},
  title = {Applying Machine Learning to Text Segmentation for Information
    Retrieval},
  volume = {6},
  number = {3},
  pages = {333--362},
  journal = {Information Retrieval (IR)},
  year = 2003,
}

Last Updated: March 14, 2007
Submitted by AICML Admin Assistant
