8.2: Compare Code Mechanically


    Intent: Discover duplicated code by comparing all the source code files line by line.

    Problem

    How do you discover which parts of an application's code have been duplicated?

    This problem is difficult because:

    • You may suspect that code has been duplicated, but you do not have any a priori evidence of where the duplication occurs. For example, you know that two programmers cannot have developed 4 million lines of COBOL in one year without having duplicated some code.

    • Browsing the code is not an effective way of discovering duplication; you will only find duplicated code by accident.

    • Programmers may have not only copied and pasted code but also modified variables or changed the shape of the programs.

    Yet, solving this problem is feasible because:

    • Most duplicated code can be detected by mechanical procedures.

    Solution

    Textually compare each line of the software source code with all the other lines of code.

    Steps

    • Normalize the lines of code by removing comments, tabs and blanks.

    • Remove lines that contain uninteresting code elements (e.g., just else or })

    • Compare each line with all the other lines. Reduce search complexity by hashing (see the sketch after these steps):

      • Preprocessing: Compute the hash value for each line

      • Actual Comparison: Compare all lines in the same hash bucket
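
    The following is a minimal sketch of this hashing step, assuming that normalization amounts to stripping white space and that a Perl hash keyed on the normalized line text serves as the hash bucket; the complete script in the Example below adds comment removal, unwanted-line filtering and a sliding window.

    #! /usr/bin/env perl -w
    # Minimal sketch: bucket identical normalized lines by using the line
    # text itself as a hash key, then report every key seen more than once.
    use strict;

    my %buckets;             # normalized line => list of "file:lineno"
    while (<>) {
        chomp;
        s/\s+//g;            # normalization: drop all white space
        next unless $_;      # skip lines that became empty
        push @{ $buckets{$_} }, "$ARGV:$.";
        close ARGV if eof;   # reset $. for the next input file
    }

    # Every bucket holding more than one location is a candidate duplication.
    for my $line (keys %buckets) {
        my @locations = @{ $buckets{$line} };
        print "$line\n  @locations\n" if @locations > 1;
    }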

    Variants

    This approach may fail to identify some instances of duplicated code due to renaming of variables. By deleting all variable identifiers, or by mapping them to a common symbol, you can detect similar code patterns, while abstracting from the details of the specific identifiers. This variant, however, requires some syntactic processing of the code.
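
    One way to implement this variant, sketched below under the assumption that a crude lexical notion of "identifier" is good enough (the keyword list is illustrative, not a full C++ grammar), is to map every identifier-like token to a common placeholder before the lines are hashed:

    #! /usr/bin/env perl -w
    # Variant sketch: map identifier-like tokens to the common symbol 'ID'
    # so that copies which merely rename variables still land in the same
    # hash bucket. Apply this abstraction before removing white space.
    use strict;

    my %reserved = map { $_ => 1 }
        qw(if else for while return int char const new delete);

    sub abstract_identifiers {
        my ($line) = @_;
        $line =~ s/\b([A-Za-z_]\w*)\b/$reserved{$1} ? $1 : 'ID'/ge;
        return $line;
    }

    # Both lines map to "ID = new char[ID+1];" and would therefore end up
    # in the same equivalence class:
    print abstract_identifiers('fastid = new char[l+1];'), "\n";
    print abstract_identifiers('copy = new char[n+1];'), "\n";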

    Tradeoffs

    Pros

    • The approach is simple and gives good results while only requiring modest resources.

    • It is nearly language-independent in the sense that you only have to build a lexical analyzer and not a full parser. That's why a simple Perl script can be sufficient, depending on the level of sophistication that you want.

    • Simple statistics and percentage rates are easily computed and may help you gain credibility or leverage in discussions about resource allocation or hiring (see the sketch below).
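
    As an illustration, the counters that the script in the Example below reports ("Number of Code Lines" and "Number of marked elements") can be combined into a rough, quotable duplication figure; the numbers used here are placeholders:

    #! /usr/bin/env perl -w
    # Sketch: turn the raw counts reported by the detection script into a
    # percentage. The two values below are hypothetical; in practice they
    # come from the script's summary output.
    use strict;

    my $codeLines       = 1_500_000;   # "Number of Code Lines"
    my $duplicatedLines =   180_000;   # "Number of marked elements"

    printf "Duplication rate: %.1f%% (%d of %d code lines)\n",
        100 * $duplicatedLines / $codeLines, $duplicatedLines, $codeLines;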

    Cons

    • Code that has been heavily edited after copying may be hard to identify as duplicated code.

    • Systems containing a lot of duplicated code will generate a lot of data that can be difficult to analyze effectively.

    Example

    Consider the case of a system written in C++ where you suspect duplicated code. However, you did not write the code yourself, so you do not know where the actual duplication occurs. How can you detect where the duplicated code fragments are? Consistent with Keep It Simple, you do the simplest thing that could possibly work: you write a little script that first normalizes the lines to remove all white space from the code and afterwards compares each line of code against all the others.

    The normalization would change the following code

    ...
    // assign same fastid as container
    fastid = NULL;
    const char* fidptr = getFastid();
    if(fidptr != NULL) {
        int l = strlen(fidptr);
        fastid = new char[l+1];
        char *tmp = (char*) fastid;
        for (int i =0;i<l;i++)
            tmp[i] = fidptr[i];
        tmp[l] = '\0';
    }
    ...
    

    into

    ...
    fastid=NULL;
    constchar*fidptr=getFastid();
    if(fidptr!=NULL)
    intl=strlen(fidptr);
    fastid=newchar[l+1];
    char*tmp=(char*)fastid;
    for(inti=0;i<l;i++)
    tmp[i]=fidptr[i];
    tmp[l]='\0';
    ...
    

    Afterwards, the line-by-line comparison of the code against itself produces a report telling which sequences of lines are duplicated.

    Lines:fastid=NULL;;constchar*fidptr=getFastid();;if(fidptr!=NULL);
    intl=strlen(fidptr);;fastid=newchar[l+1];;
    Locations:
    </typesystem/Parser.C>6178/6179/6180/6181/6182
    </typesystem/Parser.C>6198/6199/6200/6201/6202
    

    Below is a sample Perl script that will do the trick.

    #! /usr/bin/env perl -w
    # duplocForCPP.pl -- detect duplicated lines of code (algorithm only)
    # Synopsis: duplocForCPP.pl filename ...
    # Takes code (or other) files and collects all line numbers of lines
    # equal to each other within these files. The algorithm is linear (in
    # space and time) to the number of lines in input.
    
    # Output: Lists of numbers of equal lines.
    # Author: Matthias Rieger
    
    $equivalenceClassMinimalSize = 1;
    $slidingWindowSize = 5;
    $removeKeywords = 0;
    
    @keywords = qw(if
            then
            else
            for
            {
            }
        );
    
    $keywordsRegExp = join '|', @keywords;
    
    @unwantedLines = qw( else
            return
            return;
            return result;
            }else{
            #else
            #endif
            {
            }
            ;
            };
        );
    push @unwantedLines, @keywords;
    
    @unwantedLines{@unwantedLines} = (1) x @unwantedLines;
    
    $totalLines = 0;
    $emptyLines = 0;
    $codeLines = 0;
    @currentLines = ();
    @currentLineNos = ();
    %eqLines = ();
    $inComment = 0;
    
    $start = (times)[0];
    
    while (<>) {
        chomp;
        $totalLines++;
    
        # remove comments of type /* */
        my $codeOnly = '';
        while(($inComment && m|\*/|) || (!$inComment && m|/\*|)) {
            unless($inComment) { $codeOnly .= $` }
            $inComment = !$inComment;
            $_ = $';
        }
        $codeOnly .= $_ unless $inComment;
        $_ = $codeOnly;
        
        s|//.*$||; # remove comments of type //
        s/\s+//g; #remove white space
        s/$keywordsRegExp//og if $removeKeywords; #remove keywords
        
        # remove empty and unwanted lines
        if((!$_ && $emptyLines++)
            || (defined $unwantedLines{$_} && $codeLines++)) { next }
    
        $codeLines++;
        push @currentLines, $_;
        push @currentLineNos, $.;
        if($slidingWindowSize < @currentLines) {
            shift @currentLines;
            shift @currentLineNos;
        }
        
        # print STDERR "Line $totalLines >$_<\n";
    
        my $lineToBeCompared = join '', @currentLines;
        my $lineNumbersCompared = "<$ARGV>"; # append the name of the file
        $lineNumbersCompared .= join '/', @currentLineNos;
        # print STDERR "$lineNumbersCompared\n";
        if($bucketRef = $eqLines{$lineToBeCompared}) {
            push @$bucketRef, $lineNumbersCompared;
        } else {
            $eqLines{$lineToBeCompared} = [ $lineNumbersCompared ];
        }
        
        if(eof) { close ARGV } # Reset line number count for next file
    }
    
    $end = (times)[0];
    $processingTime = $end - $start;
    
    # print the equivalence classes
    
    $numOfMarkedEquivClasses = 0; 
    $numOfMarkedElements = 0;
    foreach $line (sort { length $a <=> length $b } keys %eqLines) {
        if(scalar @{$eqLines{$line}} > $equivalenceClassMinimalSize) {
            $numOfMarkedEquivClasses++;
            $numOfMarkedElements += scalar @{$eqLines{$line}};
            print "Lines: $line\n";
            print "Locations: @{$eqLines{$line}}\n\n";
        }
    }
    
    print "\n\n\n";
    print "Number of Lines processed: $totalLines\n";
    print "Number of Empty Lines: $emptyLines\n";
    print "Number of Code Lines: $codeLines\n";
    print "Scanning time in seconds: $processingTime\n";
    print "Lines per second: @{[$totalLines/$processingTime]}\n";
    print "----------------------------------------------------------------------------\n";
    print "Total Number of equivalence classes: @{[scalar keys %eqLines]}\n";
    print "Size of Sliding window: $slidingWindowSize\n";
    print "Lower bound of eqiv--class Size: $equivalenceClassMinimalSize\n"; print "Number of marked equivalence classes:
        $numOfMarkedEquivClasses\n";
    print "Number of marked elements: $numOfMarkedElements\n";
    

    Known Uses

    In the context of software reengineering, the pattern has been applied to detect duplicated code in FAMOOS case studies containing up to one million lines of C++. It has also been applied to detect duplicated code in a COBOL system of 4 million lines of code. DATRIX has investigated multiple versions of a large telecommunications system, wading through 89 million lines of code in all [LPM+97].


    This page titled 8.2: Compare Code Mechanically is shared under a CC BY-SA license and was authored, remixed, and/or curated by Serge Demeyer, Stéphane Ducasse, Oscar Nierstrasz.
