Document Type

Conference Paper

Department/Unit

Department of Computer Science

Title

Learning the kernel matrix for XML document clustering

Language

English

Abstract

The rapid growth of XML adoption has created a need for a proper representation of semi-structured documents, one that takes document structural information into account so as to support more precise document analysis. In this paper, an XML document representation named the "structured link vector model" is adopted, with a kernel matrix included for modeling the similarity between XML elements. Our formulation allows individual XML elements to have their own weighted contribution to the overall document similarity while also capturing the between-element similarity. An iterative algorithm is derived to learn the kernel matrix. For performance evaluation, the method was tested on the ACM SIGMOD Record dataset and the CEDE dataset. Our proposed method significantly outperforms the traditional vector space model and edit-distance-based methods. In addition, the kernel matrix obtained as a by-product provides knowledge about the conceptual relationships between the XML elements.
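
As a rough illustration of the idea in the abstract, the sketch below computes a document similarity in which each pair of XML element types contributes a term-vector dot product weighted by a kernel matrix entry. It is only a minimal sketch under stated assumptions, not the paper's implementation: the function names, the toy element types, the three-term vocabulary, and the kernel values are all hypothetical, and the iterative algorithm that learns the kernel matrix is not shown.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def kernel_similarity(doc_x: List[Vector],
                      doc_y: List[Vector],
                      kernel: List[List[float]]) -> float:
    """Kernel-weighted similarity between two documents (assumed form).

    doc_x[i] is the term-weight vector of element type i in document x,
    and kernel[i][j] weights the contribution of element type i in one
    document matched against element type j in the other.
    """
    total = 0.0
    for i, x_i in enumerate(doc_x):
        for j, y_j in enumerate(doc_y):
            total += kernel[i][j] * dot(x_i, y_j)
    return total


# Hypothetical data: two element types (say, "title" and "abstract")
# over a three-term vocabulary.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
doc_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

# Identity kernel: each element type is compared only with itself.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Kernel with off-diagonal weight: the two element types are treated
# as partially related, so cross-element term overlap also counts.
related = [[1.0, 0.5], [0.5, 1.0]]

sim_identity = kernel_similarity(doc_a, doc_b, identity)
sim_related = kernel_similarity(doc_a, doc_b, related)
print(sim_identity, sim_related)
```

With the identity kernel only same-element overlap counts, so terms shared across different element types are ignored; a non-zero off-diagonal entry lets that overlap contribute, which is the kind of between-element similarity the abstract refers to.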

Keywords

Kernel, XML, Computer science, Text analysis, Information analysis, Iterative algorithms, Testing, Fourier transforms, Training data

Publication Date

4-2005

Source Publication Title

Proceedings of the 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE-05)

Conference Location

Hong Kong, China

Publisher

IEEE

Peer Reviewed

Yes

Copyright

Copyright © 2005 by The Institute of Electrical and Electronics Engineers, Inc.

Funder

This work was partially supported by RGC Central Allocation Group Research Grant (HKBU 2/03/C).

DOI

10.1109/EEE.2005.87

Link to Publisher's Edition

http://dx.doi.org/10.1109/EEE.2005.87

ISBN (print)

0769522742
