Department of Computer Science
Identifying a hierarchy of bipartite subgraphs for web site abstraction
The Web is transforming from a mere information-dissemination platform into a distributed, knowledge-based platform that supports complex problem solving. However, much of the knowledge on the existing Web is tagged only with layout-related markup, which makes it hard to discover and use. In this paper, we propose to model the semantically rich, self-contained knowledge units embedded in a web site as a mixture of bipartite subgraphs, and to extract those subgraphs as the web site abstraction via hyperlink structure and file hierarchy analysis. We derive a recursive algorithm, named ReHITS, that identifies bipartite subgraphs with a hierarchical organization. Each identified subgraph contains a set of associated authorities and hubs that serves as its summarized semantic description. The effectiveness of the algorithm has been evaluated on three real web sites (containing ∼10,000 web pages), with promising results. A detailed interpretation of the experimental results and a qualitative comparison with related work are also included.
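For readers unfamiliar with the hub and authority terminology, the following is a minimal sketch of the standard HITS iteration that the ReHITS name builds on. It is not the ReHITS algorithm itself, whose recursive and file-hierarchy steps are not specified in this abstract. The function names `hits` and `normalize`, the `toy_site` link map, and the fixed iteration count are all assumptions made for illustration only.

```python
import math


def normalize(scores):
    """Scale a score dict to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Plain HITS on a dict mapping each page to the list of pages it links to.

    Returns (hub, authority) score dicts. This is the textbook iteration,
    not the recursive ReHITS procedure described in the paper.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: sum the new authority scores of all pages linked to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical toy site: two index-like pages pointing at three content pages.
toy_site = {
    "index": ["a", "b"],
    "list": ["a", "b", "c"],
}

if __name__ == "__main__":
    hub, auth = hits(toy_site)
    for page in sorted(hub):
        print(f"{page}: hub={hub[page]:.3f} authority={auth[page]:.3f}")
```

In this toy graph the two linking pages and the three linked pages form a bipartite subgraph: the linking pages receive the hub scores and the linked pages receive the authority scores, which is the kind of structure the abstract describes as a knowledge unit.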
Web structure mining, web site abstraction, HITS algorithm, knowledge discovery
Source Publication Title
Web Intelligence and Agent Systems
Cheung, William K., and Yuxiang Sun. "Identifying a hierarchy of bipartite subgraphs for web site abstraction." Web Intelligence and Agent Systems 5.3 (2007): 343-355.