Applying Machine Learning to Text Segmentation for Information Retrieval
- Xiangji Huang, School of Computer Science, University of Waterloo
- Fuchun Peng, Department of Computer Science, University of Massachusetts at Amherst
- Dale Schuurmans, AICML
- Nick Cercone, School of Computer Science, University of Waterloo
- Stephen E. Robertson, Microsoft Research Ltd., UK and City University, London, UK
We propose a self-supervised word segmentation technique for text segmentation in Chinese
information retrieval. This method combines the advantages of traditional dictionary-based,
character-based and mutual-information-based approaches, while overcoming many of
their shortcomings. Experiments on TREC data show this method is promising. Our method
is completely language independent and unsupervised, which provides a promising avenue
for constructing accurate multi-lingual or cross-lingual information retrieval systems that are
flexible and adaptive. We find that although the segmentation accuracy of self-supervised
segmentation is not as high as some other segmentation methods, it is enough to give
comparable (in some cases even better) retrieval performance. It is commonly believed that word
segmentation accuracy is monotonically related to retrieval performance in Chinese information
retrieval. However, for Chinese, we find that the relationship between segmentation and
retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation
accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in
information retrieval performance. We demonstrate this effect by presenting an empirical
investigation of information retrieval on Chinese TREC data, using a wide variety of word
segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including
70% word segmentation accuracy from our self-supervised word-segmentation approach.
It appears that the main reason for the drop in retrieval performance is that correct compounds
and collocations are preserved by accurate segmenters, while they are broken up by
less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words
themselves might be too broad a notion to conveniently capture the general semantic meaning
of Chinese text. Our research suggests machine learning techniques can play an important
role in building adaptable information retrieval systems and different evaluation standards for
word segmentation should be given to different applications.
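The compound-splitting effect described in the abstract can be illustrated with a toy sketch. This is not the self-supervised method from the paper: it uses plain greedy forward maximum matching, one common form of dictionary-based segmentation, and the two lexicons, the document string and the query term are hypothetical values chosen only for illustration. It shows how a segmenter that keeps a compound whole can hide a component word from exact term matching, while a segmenter that breaks the compound up exposes it.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    lexicon entry, falling back to a single character if nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def matches(query_term, doc_tokens):
    """Exact term match, as in a simple inverted-index lookup."""
    return query_term in doc_tokens


# Hypothetical lexicons: the first knows the compound "机器学习"
# ("machine learning"), the second only knows its parts.
compound_lexicon = {"机器学习", "机器", "学习", "方法"}
parts_lexicon = {"机器", "学习", "方法"}

document = "机器学习方法"  # "machine learning methods"
query = "学习"             # "learning"

compound_tokens = max_match(document, compound_lexicon)
parts_tokens = max_match(document, parts_lexicon)

print(compound_tokens, matches(query, compound_tokens))
print(parts_tokens, matches(query, parts_tokens))
```

With these assumed lexicons, the compound-preserving segmentation indexes the document under the whole compound, so the query term does not match, whereas the finer segmentation does match. Real retrieval systems weight and combine terms in more elaborate ways, so this only hints at why a less accurate segmenter might sometimes retrieve better.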
Citation
X. Huang, F. Peng, D. Schuurmans, N. Cercone, S. Robertson. "Applying Machine Learning to Text Segmentation for Information Retrieval". Information Retrieval (IR), 6(3), pp 333-362, September 2003.
Keywords: machine learning, word segmentation, EM algorithm
Category: In Journal
BibTeX
@article{Huang+al:IR03,
author = {Xiangji Huang and Fuchun Peng and Dale Schuurmans and Nick Cercone
and Stephen E. Robertson},
title = {Applying Machine Learning to Text Segmentation for Information
Retrieval},
volume = {6},
number = {3},
pages = {333--362},
journal = {Information Retrieval (IR)},
year = 2003,
}
Last Updated: March 14, 2007