CCE Theses and Dissertations

Date of Award

2015

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Information Systems (DCIS)

Department

College of Engineering and Computing

Advisor

James Cannady

Committee Member

Paul Cerkez

Committee Member

Rita Barrios

Keywords

Computer science, Information science, Data streams, Feature extraction, Real-time learning, Sparse PCA

Abstract

Intruders attempt to penetrate commercial systems daily and cause considerable financial losses for individuals and organizations. Intrusion detection systems monitor network events to detect computer security threats. An extensive amount of network data is devoted to detecting malicious activities.

Storing, processing, and analyzing the massive volume of data is costly and indicate the need to find efficient methods to perform network data reduction that does not require the data to be first captured and stored. A better approach allows the extraction of useful variables from data streams in real time and in a single pass. The removal of irrelevant attributes reduces the data to be fed to the intrusion detection system (IDS) and shortens the analysis time while improving the classification accuracy. This dissertation introduces an online, real time, data processing method for knowledge extraction.

This incremental feature extraction is based on two approaches. First, Chunk Incremental Principal Component Analysis (CIPCA) detects intrusion in data streams. Then, two novel incremental feature extraction methods, Incremental Structured Sparse PCA (ISSPCA) and Incremental Generalized Power Method Sparse PCA (IGSPCA), find malicious elements. Metrics helped compare the performance of all methods.

The IGSPCA was found to perform as well as or better than CIPCA overall in term of dimensionality reduction, classification accuracy, and learning time. ISSPCA yielded better results for higher chunk values and greater accumulation ratio thresholds. CIPCA and IGSPCA reduced the IDS dataset to 10 principal components as opposed to 14 eigenvectors for ISSPCA. ISSPCA is more expensive in terms of learning time in comparison to the other techniques.

This dissertation presents new methods that perform feature extraction from continuous data streams to find the small number of features necessary to express the most data variance. Data subsets derived from a few important variables render their interpretation easier.

Another goal of this dissertation was to propose incremental sparse PCA algorithms capable to process data with concept drift and concept shift. Experiments using WaveForm and WaveFormNoise datasets confirmed this ability. Similar to CIPCA, the ISSPCA and IGSPCA updated eigen-axes as a function of the accumulation ratio value, forming informative eigenspace with few eigenvectors.

Share

COinS