Drexel University Home Pagewww.drexel.edu DREXEL UNIVERSITY LIBRARIES HOMEPAGE >>

iDEA: Drexel E-repository and Archives > Drexel Theses and Dissertations > Drexel Theses and Dissertations > Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods

Please use this identifier to cite or link to this item: http://hdl.handle.net/1860/3076

Title: Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods
Authors: Zhang, Xiaodan
Keywords: Information Science;Computer science;Markov random fields
Issue Date: 31-Jul-2009
Abstract: Finding the best way to utilize external/domain knowledge to enhance traditional text mining has been a challenging task. The difficulty centers on the lack of means in representing a document with external/domain knowledge integrated. Graphs are and versatile tools, useful in various subfields of science and engineering for their simple illustration of complicated problems. However, the graph-based approach on knowledge representation and discovery remains relatively unexplored. In this thesis, I propose a graph-based text mining system to incorporate semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlink and citation link, implicit linkage such as coauthor link and co-citation link, and pseudo linkage such as similarity link. I develop a novel semantic-based method to integrate Wikipedia concepts and categories external knowledge into traditional document clustering. In order to support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. The evaluations of news collection, web dataset, and biomedical literature prove the effectiveness of the proposed methods. In the experiment of document clustering, the proposed term-level graph-based method not only outperforms the baseline k-means algorithm in all configurations but is superior in terms of efficiency. The MRF-based algorithm significantly improves spherical k-means and model-based k-means clustering on the datasets containing explicit or implicit linkage; the Wikipedia knowledge-based clustering also improves the document-content-only–based clustering. On the task of topic analysis, the proposed presentation, sub graph detection, and graph ranking algorithm can effectively identify corpus-level topic terms and cluster-level topic terms.
URI: http://hdl.handle.net/1860/3076
Appears in Collections:Drexel Theses and Dissertations

Files in This Item:

File Description SizeFormat
Zhang_Xiaodan.pdf879.1 kBAdobe PDFView/Open
View Statistics

Items in iDEA are protected by copyright, with all rights reserved, unless otherwise indicated.


Valid XHTML 1.0! iDEA Software Copyright © 2002-2010  Duraspace - Feedback