"Enhancing Sentiment Analysis in Niche Domains: Introducing Diverse Dat" by Kimon Andreou
 

CCE Theses and Dissertations

Date of Award

2024

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Information Systems (DISS)

Department

College of Computing and Engineering

Advisor

Sumitra Mukherjee

Committee Member

Michael Laszlo

Committee Member

Francisco Mitropoulos

Keywords

Artificial intelligence, information science, sentiment analysis

Abstract

The field of Natural Language Processing (NLP) has witnessed significant advancements in recent decades, with text classification emerging as a critical task, particularly in sentiment analysis applications. However, a persistent challenge within sentiment analysis research is the scarcity of diverse and specialized labeled datasets. The present dissertation addresses this gap by developing two novel, labeled textual datasets sourced from niche areas: BoardGameGeek.com's top 250 board game reviews and TrustPilot.com's car dealership reviews.

The main goal of this dissertation is to enrich sentiment analysis methodologies by providing unique datasets and insights into the performance of current models within specialized domains. Built from publicly available review data that was collected and preprocessed with Python libraries, the datasets enhance the diversity and applicability of sentiment analysis research.

The methodology involves a two-phase approach: data collection and data preparation. Data is extracted from the TrustPilot.com and BoardGameGeek.com websites and then prefiltered and processed to ensure relevance and quality. Each review undergoes language detection, removal of irrelevant elements, and labeling based on its associated rating, yielding datasets in which every review is labeled Positive, Neutral, or Negative.
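As an illustration of the preparation phase described above, the following minimal sketch cleans a review and derives a sentiment label from its rating. The abstract does not state the cut-offs or the exact cleaning rules, so the thresholds (top 20% of the scale counted as Positive, bottom 40% as Negative) and the helper names `clean_review` and `label_from_rating` are assumptions made for this example only. Language detection is omitted because it would require a third-party library.

```python
import re

# Hypothetical cleaning rules: the abstract only says "removal of
# irrelevant elements", so HTML tags and URLs are assumed here.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")


def clean_review(text: str) -> str:
    """Strip HTML tags and URLs, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


def label_from_rating(rating: float, scale_max: int = 5) -> str:
    """Map a numeric rating onto Positive / Neutral / Negative.

    Assumed cut-offs (not taken from the dissertation):
    rating >= 80% of the scale -> Positive,
    rating <= 40% of the scale -> Negative,
    anything in between        -> Neutral.
    """
    if not 0 < rating <= scale_max:
        raise ValueError(f"rating {rating} is outside (0, {scale_max}]")
    # Cross-multiplied comparisons avoid dividing by the scale.
    if rating * 5 >= 4 * scale_max:
        return "Positive"
    if rating * 5 <= 2 * scale_max:
        return "Negative"
    return "Neutral"
```

With these assumed cut-offs, a 5-star scale maps 4 and 5 to Positive, 3 to Neutral, and 1 and 2 to Negative; a 10-point scale can be handled by passing `scale_max=10`.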

To facilitate model testing and measurement, a Naïve Bayes (NB) baseline is established for illustrative purposes, followed by testing of deep learning models, including Convolutional Neural Networks (CNNs), Bidirectional Long Short-Term Memory (BiLSTM) networks, and hybrid CNN-BiLSTM models. Performance is evaluated using accuracy, precision, recall, F1 score, and Receiver Operating Characteristic (ROC) curves.
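The scalar metrics named above can be sketched for the three-class setting as follows. This is a standard-library illustration, not the dissertation's evaluation code: the function name `classification_metrics` and the choice of macro averaging for F1 are assumptions, and ROC curves are not covered because they require predicted scores rather than hard labels.

```python
from collections import Counter

LABELS = ("Positive", "Neutral", "Negative")


def classification_metrics(y_true, y_pred, labels=LABELS):
    """Return accuracy, macro-averaged F1, and per-class precision/recall/F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count each (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))

    per_class = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        # Undefined ratios (no predictions or no true items) fall back to 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true) if y_true else 0.0
    # Macro averaging (an assumption) weights every class equally.
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_class": per_class}
```

Macro averaging is used here because it treats the three classes equally regardless of how many reviews each contains; whether the dissertation reports macro, micro, or weighted averages is not stated in the abstract.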
