The "Dollies" of Software

By Tom Hill

Software cloning took off in the 1960s, when programmers began using tools to copy and paste source code statements from one program to another. The practice remains pervasive because many organizations encourage code reuse without written policies, and some even evaluate developers by lines-of-code productivity metrics.

Cloning is an easy, fast way to duplicate business functionality. However, it's also been a key factor behind the uncontrolled growth of our industrial-strength systems.

Since 1965, total source code developed and maintained has risen from 3 billion lines to at least 500 billion lines. And it's estimated that clones today represent about 10% of code in many large systems.

What's more, cloning is creating unmanageable complexity, especially by spreading dangerous, hard-to-fix errors: a defect in a copied fragment is reproduced everywhere the fragment was pasted. Growing size and complexity, in turn, are making these legacy systems more difficult and expensive to support.

As a system's size (measured in total lines of code) increases, so does the number of professionals needed to maintain it. The average software professional already manages about 100,000 lines of code, and the burden of managing clones will only add to the workload.

This spells big potential costs and productivity issues over the long term. After all, software maintenance represents more than 80% of total cost of ownership (TCO) during the life of a system, and the supply of software professionals is limited.

For a worst-case cloning scenario, consider a massive transportation-industry system that crashed in 2004. After analyzing the system, I discovered that one reason for the failure was overwhelming software complexity that had developed over decades.

The system contained more than 4,000 programs and 1.24 million lines of code, with perhaps 100,000 of them cloned. This was the perfect setup for further outages and business disruptions.

The lesson? We must simplify such systems now. Detecting, analyzing and removing the clones are the essential first steps. Together, these steps set the stage for success with subsequent legacy modernization activities as we continue progressing through this "modernization decade."

Detecting Clones

Automatic clone detection techniques are especially valuable since they maximize cost savings and programmer productivity over the software maintenance life cycle.

Here are five automatic clone detection techniques commonly used today:

  • Text-based line matching identifies lines of code with similar sequences of symbols or values, often called text "strings."
  • Token-based comparison involves creating and comparing a list of tokens -- character sequences that are the program's building blocks.
  • Abstract syntax tree evaluation involves parsing (dividing) code into a tree structure so that matching subtrees can be flagged as clones.
  • Metrics-based detection involves use of code-counting tools to create metrics (such as total number of lines, blank lines and function parameters) that generate clone "fingerprints" for each function (named procedure).
  • Program dependence graph assessment attempts to solve the most difficult clone identification problem: code modifications that disturb code structure.

Using the text-based method on a 40,000-line COBOL system, researchers found 25% of the lines of code were cloned. The abstract syntax tree method revealed 12.7% cloned lines in a 400,000-line C-language system. The potential in that case -- roughly 50,000 fewer lines to maintain, along with the substantial long-term cost savings -- is intriguing, indeed.
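
As a rough illustration of text-based line matching, here is a minimal Python sketch. It is not the method used in the studies cited above: the function names, the three-line window and the whitespace-only normalization are all assumptions made for the example.

```python
from collections import defaultdict


def normalize(line):
    """Collapse runs of whitespace so layout differences do not hide a match."""
    return " ".join(line.split())


def repeated_windows(lines, window=3):
    """Group every run of `window` consecutive non-blank lines by its text.

    Returns the non-blank lines as (line_number, text) pairs, plus the
    groups of window start positions whose text occurs more than once.
    """
    kept = [(number, normalize(text))
            for number, text in enumerate(lines, start=1)
            if text.strip()]
    groups = defaultdict(list)
    for k in range(len(kept) - window + 1):
        key = tuple(text for _, text in kept[k:k + window])
        groups[key].append(k)
    return kept, [starts for starts in groups.values() if len(starts) > 1]


def cloned_line_numbers(lines, window=3):
    """Return the sorted line numbers covered by any repeated window."""
    kept, groups = repeated_windows(lines, window)
    cloned = set()
    for starts in groups:
        for k in starts:
            cloned.update(number for number, _ in kept[k:k + window])
    return sorted(cloned)


def cloned_ratio(lines, window=3):
    """Fraction of non-blank lines that belong to a repeated window."""
    non_blank = sum(1 for text in lines if text.strip())
    if non_blank == 0:
        return 0.0
    return len(cloned_line_numbers(lines, window)) / non_blank
```

Calling `cloned_ratio` on the lines of a source file gives a cloned-line percentage of the kind quoted above, though a sketch this simple will miss clones whose identifiers or literals were edited after pasting.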

Analyzing and Removing

But automatically detecting clones is only the crucial first step in legacy system modernization. We must then analyze the clones by visualizing where they occur across the system in question. Graphical tools such as ones that display "clone pairs" as diagonal lines on a grid are useful here.
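
To show why clone pairs appear as diagonal lines, here is a small Python sketch of such a grid. It is an illustration only, not a reproduction of any particular tool: the function names and the `min_length` parameter are assumptions, and the input is assumed to have blank lines already removed.

```python
def normalize(line):
    """Collapse runs of whitespace so layout differences do not hide a match."""
    return " ".join(line.split())


def dot_plot(lines):
    """Return a text grid with '*' wherever line r and line c have equal text."""
    norm = [normalize(text) for text in lines]
    return ["".join("*" if row_text == col_text else "." for col_text in norm)
            for row_text in norm]


def diagonal_runs(lines, min_length=2):
    """Find clone pairs as runs of matches parallel to the main diagonal.

    Returns (first_start, second_start, length) tuples using 0-based
    indexes. Only the upper triangle is scanned, because the grid is
    symmetric and every line trivially matches itself on the main diagonal.
    """
    norm = [normalize(text) for text in lines]
    n = len(norm)
    runs = []
    for offset in range(1, n):  # cells where column = row + offset
        row = 0
        while row + offset < n:
            length = 0
            while (row + length + offset < n
                   and norm[row + length] == norm[row + length + offset]):
                length += 1
            if length >= min_length:
                runs.append((row, row + offset, length))
            row += max(length, 1)
    return runs
```

Printing the rows returned by `dot_plot` gives a crude picture of the grid, and each tuple from `diagonal_runs` corresponds to one off-diagonal stroke in that picture.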

Removing clones is the ultimate goal, of course. The process of removing duplicated code without altering program functionality is called generalizing, refactoring or restructuring.

Today's agile development methodologies include two approaches for removing clones. The extract method approach takes a fragment of code that is repeated in multiple places and redefines it as a single new method or function, which each original site then calls. In the pull-up method approach, identical methods in child classes are moved up into their shared parent class.
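
The two refactorings can be sketched in a few lines of Python. Every name below (the pricing functions, `require_non_negative`, the account classes) is hypothetical and chosen only for illustration.

```python
# Extract method, before: the same validation fragment is cloned twice.
def price_with_tax_before(amount, rate):
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return round(amount * (1 + rate), 2)


def price_with_discount_before(amount, discount):
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return round(amount * (1 - discount), 2)


# Extract method, after: the fragment lives in one function.
def require_non_negative(amount):
    if amount < 0:
        raise ValueError("amount must be non-negative")


def price_with_tax(amount, rate):
    require_non_negative(amount)
    return round(amount * (1 + rate), 2)


def price_with_discount(amount, discount):
    require_non_negative(amount)
    return round(amount * (1 - discount), 2)


# Pull-up method, after: describe() used to be duplicated in both
# child classes and now exists once in the shared parent class.
class Account:
    def __init__(self, balance):
        self.balance = balance

    def describe(self):
        return f"{type(self).__name__} balance: {self.balance:.2f}"


class Savings(Account):
    pass


class Checking(Account):
    pass
```

In both cases the observable behavior is unchanged, which is what distinguishes refactoring from a functional change: a later fix to the validation rule or to `describe` now needs to be made in one place only.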

These approaches are primarily manual. But a tool called Cancer (part of the CCFinder toolset developed by Toshihiro Kamiya) shows promising results in automating them. Fully automated clone removal will prove to be a great time-saver for our software maintenance organizations.

It's time to take the first steps to successful legacy modernization. Clone detection, analysis and removal will significantly reduce the costs and complexity of our large, aging, business-critical systems.

Tom Hill, who became an EDS Fellow in 1991, is head of EDS' research and development (R&D) for the second time in his 30-year EDS career.