CEC Theses and Dissertations

Campus Access Only

All rights reserved. This publication is intended for use solely by faculty, students, and staff of Nova Southeastern University. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, now known or later developed, including but not limited to photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the author or the publisher.

Date of Award


Document Type

Dissertation - NSU Access Only

Degree Name

Doctor of Philosophy in Information Systems (DISS)


Graduate School of Computer and Information Sciences


Sumitra Mukherjee

Committee Member

Michael Laszlo

Committee Member

Junping Sun


Essential genes constitute the minimal gene set of an organism that is indispensable for its survival under most favorable conditions. The problem of accurately identifying and predicting genes essential for survival of an organism has both theoretical and practical relevance in genome biology and medicine. From a theoretical perspective it provides insights in the understanding of the minimal requirements for cellular life and plays a key role in the emerging field of synthetic biology; from a practical perspective, it facilitates efficient identification of potential drug targets (e.g., antibiotics) in novel pathogens. However, characterizing essential genes of an organism requires sophisticated experimental studies that are expensive and time consuming. The goal of this research study was to investigate machine learning methods to accurately classify/predict "essential genes" in newly sequenced microbial genomes based solely on their genomic sequence data.

This study formulates the predication of essential genes problem as a binary classification problem and systematically investigates applicability of three different supervised classification methods for this task. In particular, Decision Tree (DT), Support Vector Machine (SVM), and Artificial Neural Network (ANN) based classifier models were constructed and trained on genomic features derived solely from gene sequence data of 14 experimentally validated microbial genomes whose essential genes are known. A set of 52 relevant genomic sequence derived features (including gene and protein sequence features, protein physio-chemical features and protein sub-cellular features) was used as input for the learners to learn the classifier models. The training and test datasets used in this study reflected between-class imbalance (i.e. skewed majority class vs. minority class) that is intrinsic to this data domain and essential genes prediction problem. Two imbalance reduction techniques (homology reduction and random under sampling of 50% of the majority class) were devised without artificially balancing the datasets and compromising classifier generalizability. The classifier models were trained and evaluated using 10-fold stratified cross validation strategy on both the full multi-genome datasets and its class imbalance reduced variants to assess their predictive ability of discriminating essential genes from non-essential genes. In addition, the classifiers were also evaluated using a novel blind testing strategy, called LOGO (Leave-One-Genome-Out) and LOTO (Leave-One-Taxon group-Out) tests on carefully constructed held-out datasets (both genome-wise (LOGO) and taxonomic group-wise (LOTO)) that were not used in training of the classifier models. Prediction performance metrics, accuracy, sensitivity, specificity, precision and area under the Receiver Operating Characteristics (AU-ROC) were assessed for DT, SVM and ANN derived models. Empirical results from 10 X 10-fold stratified cross validation, Leave-One-Genome-Out (LOGO) and Leave-One-Taxon group-Out (LOTO) blind testing experiments indicate SVM and ANN based models perform better than Decision Tree based models. On 10 X 10-fold cross validations, the SVM based models achieved an AU-ROC score of 0.80, while ANN and DT achieved 0.79 and 0.68 respectively. Both LOGO (genome-wise) and LOTO (taxonwise) blind tests revealed the generalization extent of these classifiers across different genomes and taxonomic orders.

This study empirically demonstrated the merits of applying machine learning methods to predict essential genes in microbial genomes by using only gene sequence and features derived from it. It also demonstrated that it is possible to predict essential genes based on features derived from gene sequence without using homology information. LOGO and LOTO Blind test results reveal that the trained classifiers do generalize across genomes and taxonomic boundaries and provide first critical estimate of predictive performance on microbial genomes. Overall, this study provides a systematic assessment of applying DT, ANN and SVM to this prediction problem.

An important potential application of this study will be to apply the resultant predictive model/approach and integrate it as a genome annotation pipeline method for comparative microbial genome and metagenome analysis resources such as the Integrated Microbial Genome Systems (IMG and IMG/M).

To access this thesis/dissertation you must have a valid nova.edu OR mynsu.nova.edu email address and create an account for NSUWorks.

  Contact Author

  Link to NovaCat