CCE Theses and Dissertations

A Selective-Phrase-Based Preprocessor for Improved Spam Filtering

Date of Award


Document Type


Degree Name

Doctor of Philosophy (PhD)


Graduate School of Computer and Information Sciences


Sumitra Mukherjee

Committee Member

Michael J. Laszlo

Committee Member

Amon Seagull


A selective-phrase-based preprocessor for generating the feature set of an e-mail message prior to its classification as spam or non-spam is presented. The proposed preprocessor subsumes the phrase-based preprocessor used by Yerazunis and the word-based preprocessor employed by Graham. In Yerazunis' phrase-based approach, a sliding window of W contiguous words is used to generate phrases, and all sub-phrases of each W-word phrase are used as features. In Graham's word-based method, only the individual words in an e-mail message are used as features. The primary goal of this investigation was to determine whether the classification accuracy attained by Yerazunis' preprocessor can be achieved at a lower computational cost by selecting a much smaller but promising set of phrases from which to generate the sub-phrases that constitute the features. The proposed preprocessor first identifies a small number f of words from the input text (namely, those that are most likely to appear in spam messages). For each such word, it then selects B distinct phrases of W contiguous words that contain that word and uses all sub-phrases of each such phrase as features. A secondary goal of the research was to investigate the sensitivity of the classification accuracy to f, B, and W. The proposed preprocessor and the preprocessors devised by Yerazunis and Graham were tested on a benchmark corpus of e-mail messages, and the classification accuracy and other metrics were measured and reported. The results indicated that the classification accuracy of the proposed preprocessor was comparable to that of Yerazunis' preprocessor while using a much smaller set of features. In addition, the research indicated optimum values for f, B, and W: the best accuracies were achieved at (f, B, W) = (2, 1, 2) and (f, B, W) = (2, 2, 2).
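
The selection procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation, and several details are assumptions that the abstract does not specify: the names `sub_phrases`, `windows_containing`, `selective_features`, and `spam_prob` are hypothetical; a "sub-phrase" is taken to be any contiguous sub-sequence of a phrase; the spam likelihood of a word is read from a precomputed table; and the B phrases chosen for a word are simply the first B distinct windows in order of appearance.

```python
from typing import Dict, List, Set, Tuple

Phrase = Tuple[str, ...]


def sub_phrases(phrase: Phrase) -> Set[Phrase]:
    """All contiguous sub-sequences of a phrase (an assumed definition)."""
    n = len(phrase)
    return {phrase[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def windows_containing(words: List[str], target: str, W: int) -> List[Phrase]:
    """Distinct W-word windows containing `target`, in order of appearance."""
    seen: Set[Phrase] = set()
    result: List[Phrase] = []
    for start in range(len(words) - W + 1):
        window = tuple(words[start:start + W])
        if target in window and window not in seen:
            seen.add(window)
            result.append(window)
    return result


def selective_features(
    words: List[str],
    spam_prob: Dict[str, float],  # hypothetical per-word spam likelihoods
    f: int,
    B: int,
    W: int,
) -> Set[Phrase]:
    """Features from at most B windows around each of the f most spam-like words."""
    distinct = list(dict.fromkeys(words))  # de-duplicate, keep first-seen order
    # Assumption: words missing from the table are treated as least spam-like.
    ranked = sorted(distinct, key=lambda w: spam_prob.get(w, 0.0), reverse=True)
    features: Set[Phrase] = set()
    for word in ranked[:f]:
        for phrase in windows_containing(words, word, W)[:B]:
            features |= sub_phrases(phrase)
    return features


if __name__ == "__main__":
    message = "buy cheap pills now from us".split()
    table = {"pills": 0.95, "cheap": 0.90, "buy": 0.60}  # made-up values
    for feature in sorted(selective_features(message, table, f=2, B=1, W=2)):
        print(" ".join(feature))
```

Under these assumptions, setting W = 1 and f to the number of distinct words yields only the individual words as features, which corresponds to the word-based case; larger f and B move the feature set toward the one produced by a full sliding window.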

This document is currently not available here.
