CCE Theses and Dissertations
A Selective-Phrase-Based Preprocessor for Improved Spam Filtering
Date of Award
2008
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Graduate School of Computer and Information Sciences
Advisor
Sumitra Mukherjee
Committee Member
Michael J. Laszlo
Committee Member
Amon Seagull
Abstract
A selective-phrase-based preprocessor is presented for generating the feature set of an e-mail message prior to its classification as spam or non-spam. The proposed preprocessor subsumes the phrase-based preprocessor used by Yerazunis and the word-based preprocessor employed by Graham. In Yerazunis' phrase-based approach, a sliding window of W contiguous words is used to generate phrases, and all sub-phrases of each W-word phrase are used as features. In Graham's word-based method, only the individual words in an e-mail message are used as features. The primary goal of this investigation was to determine whether the classification accuracy attained by Yerazunis' preprocessor can be achieved at a lower computational cost by selecting a much smaller but promising set of phrases from which to generate the sub-phrases that constitute the features. The proposed preprocessor first identifies a small number f of words from the input text (namely, those that are most likely to appear in spam messages). For each such word, it then selects B distinct phrases of W contiguous words that contain that word and uses all sub-phrases of each such phrase as features. A secondary goal of the research was to investigate the sensitivity of the classification accuracy to f, B, and W. The proposed preprocessor and the preprocessors devised by Yerazunis and Graham were tested on a benchmark corpus of e-mail messages, and the classification accuracy and other metrics were measured and reported. The results indicated that the classification accuracy of the proposed preprocessor was comparable to that of Yerazunis' preprocessor while utilizing a much smaller set of features. In addition, the research indicated optimum values for f, B, and W: the best accuracies were achieved at (f, B, W) = (2, 1, 2) and (f, B, W) = (2, 2, 2).
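The feature-generation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation, and several details are assumptions because the abstract does not specify them: the function names, the `spam_prob` table of per-word spam probabilities (with 0.5 for unseen words), the rule of taking the first B distinct windows in order of occurrence, and the reading of "sub-phrase" as any non-empty, order-preserving subset of the words in a window.

```python
from itertools import combinations


def windows(tokens, W):
    """Return every phrase of W contiguous words (a sliding window)."""
    return [tuple(tokens[i:i + W]) for i in range(len(tokens) - W + 1)]


def sub_phrases(phrase):
    """Return the sub-phrases of one window.

    Assumption: a sub-phrase is any non-empty subset of the window's
    words, kept in their original order.
    """
    feats = set()
    for k in range(1, len(phrase) + 1):
        for idx in combinations(range(len(phrase)), k):
            feats.add(" ".join(phrase[i] for i in idx))
    return feats


def selective_features(tokens, spam_prob, f, B, W):
    """Sketch of the selective-phrase idea with parameters f, B, W.

    spam_prob is a hypothetical mapping from word to estimated spam
    probability; words missing from it default to 0.5 (an assumption).
    """
    # Step 1: pick the f distinct words most likely to indicate spam.
    distinct = list(dict.fromkeys(tokens))
    top = sorted(distinct, key=lambda w: spam_prob.get(w, 0.5), reverse=True)[:f]

    # Step 2: for each such word, take up to B distinct W-word windows
    # containing it (assumption: the first B in order of occurrence).
    all_windows = windows(tokens, W)
    feats = set()
    for word in top:
        chosen = []
        for win in all_windows:
            if word in win and win not in chosen:
                chosen.append(win)
                if len(chosen) == B:
                    break
        # Step 3: every sub-phrase of each chosen window becomes a feature.
        for win in chosen:
            feats |= sub_phrases(win)
    return feats


tokens = "buy cheap pills now friend".split()
spam_prob = {"buy": 0.6, "cheap": 0.9, "pills": 0.95, "now": 0.5, "friend": 0.1}
features = selective_features(tokens, spam_prob, f=2, B=1, W=2)
```

Under these assumptions, setting W = 1 with f equal to the number of distinct words yields exactly the individual words as features, and setting f to the number of distinct words with a large B yields the sub-phrases of every window, which is one way to read the claim that the proposed preprocessor subsumes the word-based and phrase-based approaches.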
NSUWorks Citation
Ajay Arora. 2008. A Selective-Phrase-Based Preprocessor for Improved Spam Filtering. Doctoral dissertation. Nova Southeastern University. Retrieved from NSUWorks, Graduate School of Computer and Information Sciences. (390)
https://nsuworks.nova.edu/gscis_etd/390