Classifying Websites into Non-topical Categories
- Chaman Thapa
- Osmar R. Zaiane, University of Alberta (Database)
- Davood Rafiei
- Arya Sharma
With the large presence of organizations from different sectors of the economy on the web, detecting the sector to which a given website belongs is both important and challenging. In this paper, we study the problem of classifying websites into four non-topical categories: public, private, non-profit and commercial franchise. Our work treats each website and all pages from the site as a single entity and classifies the entire website, as opposed to a single page or a set of pages. We analyze both textual features, including terms, part-of-speech bigrams and named entities, and structural features, including the link structure of the site and URL patterns. Our experiments on a large set of websites related to weight loss and obesity control, under a multi-label classification setting using the SVM classifier, reveal that with a careful selection and treatment of keyword-based features, one can achieve an F-measure of 70%, and that adding structural, part-of-speech and named-entity-based features further improves the F-measure to 74%. The improvement is more significant when textual features are not accurate or sufficient.
Citation
C. Thapa, O. Zaiane, D. Rafiei, A. Sharma. "Classifying Websites into Non-topical Categories". International Conference on Big Data Analytics and Knowledge Discovery (DAWAK), Vienna, Austria, (ed: Alfredo Cuzzocrea, Umeshwar Dayal), pp 364-377, September 2012.
Keywords: | non-topical classification, structural features, topical features, non-topical features, web genre |
Category: | In Conference |
Web Links: | Springer Link |
BibTeX
@incollection{Thapa+al:DAWAK12,
  author = {Chaman Thapa and Osmar R. Zaiane and Davood Rafiei and Arya Sharma},
  title = {Classifying Websites into Non-topical Categories},
  editor = {Alfredo Cuzzocrea and Umeshwar Dayal},
  pages = {364-377},
  booktitle = {International Conference on Big Data Analytics and Knowledge Discovery (DAWAK)},
  year = 2012,
}
Last Updated: January 13, 2020
Submitted by Sabina P