Document Type

Journal Article

Department/Unit

Department of Computer Science

Title

Learning element similarity matrix for semi-structured document analysis

Language

English

Abstract

Capturing latent structural and semantic properties in semi-structured documents (e.g., XML documents) is crucial for improving the performance of related document analysis tasks. The Structured Link Vector Model (SLVM) is a representation recently proposed for modeling semi-structured documents. It uses an element similarity matrix to capture the latent relationships between XML elements, the constituent components of an XML document. In this paper, instead of applying heuristics to define the element similarity matrix, we propose to compute the matrix using a machine learning approach. In addition, we incorporate term semantics into SLVM using latent semantic indexing to enhance the model's accuracy while preserving the learnability of the element similarity matrix. For performance evaluation, we applied the similarity learning to k-nearest neighbors search and similarity-based clustering, and tested the performance on two different XML document collections. The SLVM obtained via learning was found to significantly outperform the conventional Vector Space Model and edit-distance-based methods. Also, the similarity matrix, obtained as a by-product, can provide higher-level knowledge about the semantic relationships between the XML elements.
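
The role of the element similarity matrix can be illustrated with a minimal sketch. It assumes each document is stored as one term-weight vector per XML element, and that document similarity is the sum of dot products between element vectors, weighted by the corresponding matrix entry. The function name, the toy documents, and the hand-set matrix values are illustrative assumptions and are not taken from the paper, which learns the matrix from data rather than setting it by hand.

```python
from typing import List

Matrix = List[List[float]]


def slvm_similarity(doc_x: Matrix, doc_y: Matrix, elem_sim: Matrix) -> float:
    """Element-similarity-weighted similarity between two documents.

    doc_x[i] and doc_y[j] are term-weight vectors for elements i and j
    (hypothetical layout); elem_sim[i][j] weights how much terms found in
    element i of one document should match terms in element j of the other.
    """
    total = 0.0
    for i, x_i in enumerate(doc_x):
        for j, y_j in enumerate(doc_y):
            # Plain dot product between the two element-level term vectors.
            dot = sum(a * b for a, b in zip(x_i, y_j))
            total += elem_sim[i][j] * dot
    return total


# Toy data: two elements (say, title and body) over a three-term vocabulary.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
doc_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

# Identity matrix: only same-element matches count.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Hand-set matrix (assumption): title and body are treated as half-related.
related = [[1.0, 0.5], [0.5, 1.0]]

print(slvm_similarity(doc_a, doc_b, identity))
print(slvm_similarity(doc_a, doc_b, related))
```

With the identity matrix only terms shared within the same element contribute, while the off-diagonal entries let a term in one document's title match the same term in the other document's body. Replacing the hand-set values with learned ones is the step the abstract describes.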

Keywords

Semi-structured document analysis, Learning similarity matrix, Similarity-based clustering, Extended Vector Space Model

Publication Date

4-2009

Source Publication Title

Knowledge and Information Systems

Volume

19

Issue

1

Start Page

53

End Page

78

Publisher

Springer Verlag

Peer Reviewed

Yes

Funder

The work reported in this paper was carried out in the Center for e-Transformation Research, Hong Kong Baptist University, and supported by the Hong Kong RGC Central Allocation Grant HKBU 2/03/C. J. Yang was also supported by the National Natural Science Foundation of China Grant 60642001.

DOI

10.1007/s10115-008-0138-2

Link to Publisher's Edition

http://dx.doi.org/10.1007/s10115-008-0138-2

ISSN (print)

0219-1377

ISSN (electronic)

0219-3116
