CEC Theses and Dissertations


Performance of Classification Tools on Unstructured Text

Date of Award


Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)


Department

Graduate School of Computer and Information Sciences


Advisor

Sumitra Mukherjee

Committee Member

Marlyn Kemper Littman

Committee Member

Maxine S. Cohen


As digital storage of data continues to grow, it becomes increasingly difficult to find information on demand, particularly in unstructured text documents. Unstructured documents lack explicit record definitions or other metadata that can facilitate retrieval, yet an increasing share of digital information is stored in them. Manual or human-assisted indexing of unstructured documents is time consuming and expensive. Automated retrieval techniques, such as those used by Internet search engines, have a variety of limitations, including depth and breadth of coverage and frequency of update. In addition, many retrieval methods become impractical on large document collections, where the need for improved performance is even greater. Most text indexing and retrieval systems include a component that classifies documents. This research focused on the automated classification of unstructured text. Its goal was to investigate a commercial classification tool and evaluate that tool's performance on Reuters-21578, a benchmark categorization collection of unstructured text. The performance of a commercial off-the-shelf (COTS) product, Oracle Text, on the Reuters-21578 collection was evaluated using a variety of measures documented in the classification literature.
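The dissertation evaluates Oracle Text, but the measures it refers to (precision, recall, and the F1 score are the standard ones in the classification literature) can be illustrated independently of any particular tool. The sketch below is a hypothetical illustration, not the study's Oracle Text setup: it assumes Python with scikit-learn and NLTK's copy of the Reuters-21578 (ApteMod) corpus, trains a simple linear classifier for one arbitrarily chosen category ("earn"), and prints the standard measures on the predefined test split.

    # Minimal sketch (assumed tooling: NLTK + scikit-learn), not Oracle Text:
    # evaluate a simple classifier on Reuters-21578 with precision/recall/F1.
    import nltk
    from nltk.corpus import reuters
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import precision_recall_fscore_support

    nltk.download("reuters", quiet=True)  # Reuters-21578, ApteMod version

    # The ModApte train/test split is encoded in the corpus file ids.
    train_ids = [f for f in reuters.fileids() if f.startswith("training/")]
    test_ids = [f for f in reuters.fileids() if f.startswith("test/")]

    # Binary task for illustration: does a story belong to the "earn" category?
    y_train = [1 if "earn" in reuters.categories(f) else 0 for f in train_ids]
    y_test = [1 if "earn" in reuters.categories(f) else 0 for f in test_ids]

    # Represent each document as a TF-IDF weighted bag of words.
    vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
    X_train = vectorizer.fit_transform(reuters.raw(f) for f in train_ids)
    X_test = vectorizer.transform(reuters.raw(f) for f in test_ids)

    # Train a linear classifier and score it with the standard measures.
    clf = LinearSVC().fit(X_train, y_train)
    pred = clf.predict(X_test)
    p, r, f1, _ = precision_recall_fscore_support(y_test, pred, average="binary")
    print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")

A full multi-label evaluation of the kind reported in the classification literature would repeat this per category and then micro- or macro-average the per-category measures.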
