XML Clustering

Published: 10 January 2010
Wei-ning Qian, Hai-lei Qian, Li Wei, Yan Wang and Ao-ying Zhou
Computer Science Department, Fudan University, Shanghai 200433. E-mail: wnqian@fudan.edu.cn
Abstract: Building on query expansion techniques from information retrieval systems, this paper introduces structure-based query expansion for XML search engines, which is designed to make XML data easier to query while keeping the power and flexibility of XML query languages. To enable structure expansion, a structure thesaurus must be built first; this involves constructing a weighted graph from the XML documents and applying a linkage-based clustering method to cluster its nodes into several groups. When a query arrives, the structure thesaurus is consulted so that, for each tag in the original query, the tags in the same group are retrieved. Unrelated tags are filtered out, and heuristic rules are applied to replace tags in the original query with related tags and to expand the query structure. It is shown that, with structure-based query expansion, the system can return results with high precision and recall.

XML (Extensible Markup Language) is a W3C specification (Bray, 1998). It was developed to complement HTML for data exchange on the Web. In recent years, XML has been increasingly used in large information systems such as digital libraries and information centers, and in most of these systems the search engine is a major module. XML search engines have gained popularity over HTML search engines primarily because of two notable advantages: 1) they can query not only the content but also the structure; and 2) they usually offer more complex and powerful query languages, such as XML-QL (Deutsch) and XQL (Robie, 1999), which allow users to query elements satisfying certain conditions.

However, these two advantages also bring the following shortcomings: 1) it is difficult for users to pose accurate structure queries without knowing the schema of the XML data, and the schema is difficult to obtain from a large, distributed XML repository; and 2) mastering the complex query languages remains a tough task for ordinary users. In this paper, we apply query expansion techniques to structures in order to hide the complexity of the query languages from users. Query expansion is widely used in information retrieval systems (Xu, 1996; Mandala, 1999) to increase precision and recall. In recent years, it has been implemented in several well-known search engines, such as AltaVista (http://www.altavista.com) and Lycos (http://www.lycos.com). However, traditional query expansion methods are designed for keyword-based queries and cannot fulfill the task described above.

Structure-based query expansion for XML search engines is designed to ease the querying of XML data while keeping the power and flexibility of XML query languages. It helps to solve the following problems. First, users may not write complex queries (regular path expressions, etc.) and may pose a query from only one angle. However, documents on the Web conform to so many different DTDs that users cannot browse them all, so a simple query does not ensure that the search engine finds enough of the documents the user needs. Second, some tags in XML DTDs have different names but are semantically close to each other and appear in similar contexts. Traditionally, only the tags that match the user’s query are considered and their sub-tags searched, while similar tags that do not match the original query but may contain the needed information are neglected. By clustering similar tags into groups, our approach to structure-based expansion addresses both problems effectively and provides users with the information they require as completely as possible.

To enable the structure expansion, a structure thesaurus should be built first. Based on the...
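
As a rough illustration of clustering similar tags into groups and using those groups to expand a query, the following Python sketch may help. It is not the authors' algorithm. The function names `cluster_tags` and `expand_path`, the edge weights, the threshold, and the slash-separated path syntax are all assumptions made for this example. Single-linkage clustering cut at a similarity threshold stands in for the linkage-based clustering method mentioned in the text, and the filtering of unrelated tags and the heuristic rewriting rules are omitted.

```python
from itertools import product


def cluster_tags(edges, threshold):
    """Single-linkage grouping (an assumed stand-in for the paper's method).

    `edges` maps a pair of tag names to a similarity weight. Two tags end up
    in the same group if a chain of edges with weight >= threshold joins them.
    Returns a thesaurus: tag name -> set of all tags in its group.
    """
    parent = {}

    def find(tag):
        # Union-find lookup with path halving; registers unseen tags.
        parent.setdefault(tag, tag)
        while parent[tag] != tag:
            parent[tag] = parent[parent[tag]]
            tag = parent[tag]
        return tag

    for (a, b), weight in edges.items():
        root_a, root_b = find(a), find(b)
        if weight >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for tag in list(parent):
        groups.setdefault(find(tag), set()).add(tag)
    return {tag: members for members in groups.values() for tag in members}


def expand_path(path, thesaurus):
    """Replace each step of a slash-separated path with every tag in its group."""
    steps = path.split("/")
    # Tags missing from the thesaurus are kept unchanged.
    options = [sorted(thesaurus.get(step, {step})) for step in steps]
    return ["/".join(combo) for combo in product(*options)]


# Hypothetical similarity weights between tags drawn from different DTDs.
edges = {
    ("author", "writer"): 0.9,
    ("book", "monograph"): 0.8,
    ("author", "price"): 0.1,
}
thesaurus = cluster_tags(edges, threshold=0.5)
expanded = expand_path("book/author", thesaurus)
print(expanded)
# ['book/author', 'book/writer', 'monograph/author', 'monograph/writer']
```

In this toy setting, a user who only knows the tag names `book` and `author` would also reach documents whose DTDs use `monograph` or `writer`, while the weakly related `price` tag stays in a group of its own.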