Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods
Please use this identifier to cite or link to this item:
http://hdl.handle.net/1860/3076
|
| Title: | Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods |
| Authors: | Zhang, Xiaodan |
| Keywords: | Information science; Computer science; Markov random fields |
| Issue Date: | 31-Jul-2009 |
| Abstract: | Finding the best way to use external/domain knowledge to enhance traditional text mining has been a challenging task. The difficulty centers on the lack of a means of representing a document with external/domain knowledge integrated. Graphs are versatile tools, useful in various subfields of science and engineering for their simple illustration of complicated problems. However, the graph-based approach to knowledge representation and discovery remains relatively unexplored. In this thesis, I propose a graph-based text mining system that incorporates semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlinks and citation links, implicit linkage such as coauthor links and co-citation links, and pseudo linkage such as similarity links. I develop a novel semantic-based method to integrate external knowledge from Wikipedia concepts and categories into traditional document clustering. To support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. Evaluations on a news collection, a web dataset, and biomedical literature demonstrate the effectiveness of the proposed methods.
In the document clustering experiments, the proposed term-level graph-based method not only outperforms the baseline k-means algorithm in all configurations but is also more efficient. The MRF-based algorithm significantly improves spherical k-means and model-based k-means clustering on the datasets containing explicit or implicit linkage; the Wikipedia knowledge-based clustering also improves on clustering based on document content alone. On the task of topic analysis, the proposed presentation, subgraph detection, and graph ranking algorithms can effectively identify corpus-level topic terms and cluster-level topic terms. |
| URI: | http://hdl.handle.net/1860/3076 |
| Appears in Collections: | Drexel Theses and Dissertations
|
Items in iDEA are protected by copyright, with all rights reserved, unless otherwise indicated.
|