Classifying Websites into Non-topical Categories

Full Text: dawak12.pdf

With the large presence of organizations from different sectors of the economy on the web, the problem of detecting the sector to which a given website belongs is both important and challenging. In this paper, we study the problem of classifying websites into four non-topical categories: public, private, non-profit and commercial franchise. Our work treats each website and all pages from the site as a single entity and classifies the entire website, as opposed to a single page or a set of pages. We analyze both textual features, including terms, part-of-speech bigrams and named entities, and structural features, including the link structure of the site and URL patterns. Our experiments on a large set of websites related to weight loss and obesity control, under a multi-label classification setting using the SVM classifier, reveal that with a careful selection and treatment of features based on keywords, one can achieve an F-measure of 70%, and that adding structural, part-of-speech and named-entity-based features further improves the F-measure to 74%. The improvement is more significant when textual features are not accurate or sufficient.
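
As a rough illustration of what treating a website as a single entity could look like, the sketch below pools term counts over all pages of one site and derives a few simple structural features from its URLs and links. It is only a sketch under stated assumptions, not the features used in the paper: the function name `site_features`, the `(url, text, outlinks)` page format, and the three structural features are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse
import re


def site_features(pages):
    """Build site-level features from a list of (url, text, outlinks) tuples.

    Hypothetical helper: the input format and the chosen features are
    illustrative assumptions, not taken from the paper.
    """
    if not pages:
        raise ValueError("a site needs at least one page")

    # Assume the first page's host identifies the site.
    host = urlparse(pages[0][0]).netloc

    terms = Counter()   # term counts pooled over every page of the site
    depths = []         # number of path segments in each page URL
    internal = 0        # links that stay on the same host (or are relative)
    external = 0        # links that leave the site

    for url, text, outlinks in pages:
        terms.update(re.findall(r"[a-z]+", text.lower()))
        path = urlparse(url).path
        depths.append(len([seg for seg in path.split("/") if seg]))
        for link in outlinks:
            if urlparse(link).netloc in ("", host):
                internal += 1
            else:
                external += 1

    total_links = internal + external
    structural = {
        "num_pages": len(pages),
        "mean_url_depth": sum(depths) / len(depths),
        "internal_link_ratio": internal / total_links if total_links else 0.0,
    }
    return terms, structural
```

The pooled term counts and the structural dictionary could then be merged into one feature vector per site and passed to any multi-label classifier; that step is left out here because the abstract does not specify how the features were combined.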

Citation

C. Thapa, O. Zaiane, D. Rafiei, A. Sharma. "Classifying Websites into Non-topical Categories". International Conference on Big Data Analytics and Knowledge Discovery (DAWAK), Vienna, Austria, (ed: Alfredo Cuzzocrea, Umeshwar Dayal), pp 364-377, September 2012.

Keywords: non-topical classification, structural features, topical features, non-topical features, web genre
Category: In Conference
Web Links: Springer Link

BibTeX

@incollection{Thapa+al:DAWAK12,
  author = {Chaman Thapa and Osmar R. Zaiane and Davood Rafiei and Arya Sharma},
  title = {Classifying Websites into Non-topical Categories},
  Editor = {Alfredo Cuzzocrea and Umeshwar Dayal},
  Pages = {364--377},
  booktitle = {International Conference on Big Data Analytics and Knowledge
    Discovery (DAWAK)},
  year = 2012,
}

Last Updated: January 13, 2020
Submitted by Sabina P
