Investigating the Relationship Between Word Segmentation Performance and Retrieval Performance in Chinese IR

Full Text: p148-peng.pdf (PDF)

It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.

Citation

F. Peng, X. Huang, D. Schuurmans, N. Cercone. "Investigating the Relationship Between Word Segmentation Performance and Retrieval Performance in Chinese IR". Conference on Computational Linguistics (COLING), Taipei, August 2002.

Keywords: word segmentation
Category: In Conference

BibTeX

@incollection{Peng+al:COLING02,
  author = {Fuchun Peng and Xiangji Huang and Dale Schuurmans and Nick Cercone},
  title = {Investigating the Relationship Between Word Segmentation Performance
    and Retrieval Performance in Chinese IR},
  booktitle = {Conference on Computational Linguistics (COLING)},
  year = 2002,
}

Last Updated: June 01, 2007
Submitted by Staurt H. Johnson
