CCE Theses and Dissertations

Date of Award


Document Type


Degree Name

Doctor of Philosophy in Computer Science (CISD)


Graduate School of Computer and Information Sciences


Francisco Mitropoulos

Committee Member

Gregory Simco

Committee Member

Sumitra Mukherjee


Software is often large, complicated, and expensive to build and maintain. Redundant code can make these applications even more costly and difficult to maintain. Duplicated code is introduced into these systems for a variety of reasons, including developer churn, deficient developer comprehension of the application, and lack of adherence to proper development practices.

Code redundancy has several adverse effects on a software application, including an increased codebase size and inconsistent developer changes due to elevated program comprehension needs. A code clone is defined as multiple code fragments that produce similar results when given the same input. Four types of clones are generally recognized, ranging from the simple type-1 and type-2 clones to the more complicated type-3 and type-4 clones. Numerous clone detection mechanisms are able to identify the simpler types of code clone candidates, but far fewer claim the ability to find the more difficult type-3 clones. Before CCCD, MeCC and FCD were the only clone detection techniques capable of finding type-4 clones. A drawback of MeCC is the excessive time required to detect clones and the likely exploration of an unreasonably large number of possible paths. FCD requires extensive amounts of random data and a significant period of time in order to discover clones.
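
As an illustration of the four clone types, the clone detection literature commonly describes them as follows: type-1 clones are identical apart from whitespace, layout, and comments; type-2 clones additionally differ in identifiers or literals; type-3 clones have statements added, removed, or modified; and type-4 clones compute the same result through different syntax. The Python sketch below is not taken from the dissertation, and every function name in it is hypothetical.

```python
def sum_to_n(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


# Type-1: same code, only comments and layout differ.
def sum_to_n_type1(n):
    total = 0  # running total
    for i in range(1, n + 1):
        total += i
    return total


# Type-2: same structure, identifiers renamed.
def add_up(limit):
    acc = 0
    for k in range(1, limit + 1):
        acc += k
    return acc


# Type-3: a statement has been added (early return for non-positive input).
def sum_to_n_type3(n):
    if n <= 0:
        return 0
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


# Type-4: same result, syntactically unrelated (closed-form formula).
def sum_to_n_type4(n):
    return n * (n + 1) // 2 if n > 0 else 0
```

All five functions return the same value for any integer input, yet only the first three share enough surface text for a purely syntactic comparison to relate them easily.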

This dissertation presents a new process for discovering code clones known as Concolic Code Clone Discovery (CCCD). This technique discovers code clone candidates based on the functionality of the application, not its syntactic nature. As a result, naming conventions and comments in the source code have no effect on the proposed clone detection process. CCCD finds clones by first performing concolic analysis on the targeted source code. Concolic analysis combines concrete and symbolic execution in order to traverse all possible paths of the targeted program. These paths are represented by the generated concolic output. A diff tool is then used to determine whether the concolic output for one method is identical to the output produced for another method. Duplicated output is indicative of a code clone.
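
The comparison step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the per-method concolic output is assumed to be a list of path-condition strings, the format of those strings is invented for the example, and the identifier normalization step and all names (normalize, diff_lines, clone_candidates, example_outputs) are hypothetical. Python's standard difflib module stands in for the diff tool.

```python
import difflib
import re
from itertools import combinations


def normalize(path_conditions):
    """Rename identifiers to v0, v1, ... in order of first appearance.

    Assumption: the concolic output may still carry source-level names,
    so they are replaced with positional placeholders before diffing.
    """
    names = {}

    def rename(match):
        return names.setdefault(match.group(0), f"v{len(names)}")

    return [re.sub(r"\b[A-Za-z_]\w*", rename, line) for line in path_conditions]


def diff_lines(a, b):
    """Return the unified diff of two line lists (empty if identical)."""
    return list(difflib.unified_diff(a, b, lineterm=""))


def clone_candidates(outputs):
    """Report method pairs whose normalized concolic output is identical."""
    normalized = {name: normalize(lines) for name, lines in outputs.items()}
    return [
        (m1, m2)
        for m1, m2 in combinations(sorted(normalized), 2)
        if not diff_lines(normalized[m1], normalized[m2])
    ]


# Hypothetical concolic output: one path condition per explored path.
example_outputs = {
    "max_a": ["x > y", "!(x > y)"],
    "max_b": ["first > second", "!(first > second)"],
    "abs_val": ["n < 0", "!(n < 0)"],
}

if __name__ == "__main__":
    print(clone_candidates(example_outputs))
```

In this sketch, max_a and max_b are reported as a candidate pair because their path conditions match once names are abstracted away, while abs_val is not paired with either.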

CCCD was validated against several open-source applications along with clones of all four types as defined by previous research. The results demonstrate that CCCD was able to detect all types of clone candidates with a high level of accuracy.

In the future, CCCD will be used to examine how software developers work with type-3 and type-4 clones. CCCD will also be applied to various areas of security research, including intrusion detection mechanisms.
