Performance of Classification Tools on Unstructured Text
Date of Award: 2005
Doctor of Philosophy (PhD)
Graduate School of Computer and Information Sciences
Marlyn Kemper Littman
Maxine S. Cohen
As digital storage of data continues to grow, it is increasingly difficult to find information on demand, particularly in unstructured text documents. Unstructured documents lack explicit record definitions or other metadata that can facilitate retrieval, yet digital information is increasingly stored in them. Manual or human-assisted indexing of unstructured documents is time-consuming and expensive. Automated retrieval techniques, such as those used by Internet search engines, have a variety of limitations, including depth and breadth of coverage and frequency of update. In addition, many retrieval methods become impractical on large document collections, where the need for improved performance is even greater. Most text indexing and retrieval systems include a component that classifies documents. This research focused on the automated classification of unstructured text. The goal was to investigate a commercial classification tool and evaluate its performance on Reuters-21578, a benchmark categorization collection of unstructured text. The performance of a commercial off-the-shelf (COTS) product, Oracle Text, on the Reuters-21578 collection was evaluated using a variety of measures documented in the classification literature.
Janet L. Kourik. 2005. Performance of Classification Tools on Unstructured Text. Doctoral dissertation. Nova Southeastern University. Retrieved from NSUWorks, Graduate School of Computer and Information Sciences. (645)