Going Beyond PHPCPD: Visualization of Duplicate Code: 20 PHP Projects Analyzed And Made Visual


If you want to understand how duplicate code affects your projects, Sider can help you visualize it. Other tools, such as PHPCPD (PHP Copy/Paste Detector), also look for duplicate code, but Sider offers more sophisticated detection with an advanced UI. It can identify duplicates that go well beyond identical lines of code, and it detects duplicate code not only in PHP but in many other languages. In this post, we analyze 20 well-known open-source PHP projects and present their duplicate code blocks visually using a wheel representation, going beyond what PHPCPD offers.

How do you use Sider Scan?

The first step in analyzing the state of duplicate code in your PHP project is to go to https://siderlabs.com/labs/. There you will find a drop screen where you can drag and drop the project you want to analyze. Sider Scan is available as a browser application and is completely free for the time being (as of March 2021).

The technology behind Sider Scan

Sider Scan is a standalone application that lets you test your projects free of charge in a secure environment. It is a WebAssembly application that runs entirely within your computer’s web browser, so the analysis is done locally rather than on an external server. Your source code never leaves your local device and is not exposed to the outside world.

To verify this, once the application has loaded in your browser (it is about 100 MB, but loading should take no more than 15-30 seconds depending on your internet connection speed), you can disconnect from the network, and the application will still work.

If you want to read more about how code duplication is measured, or how similarity scores are calculated, please read about its formulation here.

Step 1: drag and drop your code

The first step is to wait for the application to load. Once it loads, you can select an entire directory, or multiple files and folders, and drag and drop them into the browser.

Step 2: wait for analysis to complete

Once you drop the code into the browser, the analysis begins immediately. Our proprietary algorithm goes through the entire directory, finds pairs and groups of duplicate code (including similar code that is not an exact match), and records the exact directory, file, and line numbers where the duplicates are. Depending on the size of the project and the number of duplicate pairs, the analysis can take anywhere from a minute or two to several hours. And yes, all of the analysis happens locally on your device.
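To make the idea of "pairs of duplicate code with file and line numbers" concrete, here is a minimal Python sketch. It is not Sider Scan's proprietary algorithm: it only finds windows of consecutive lines that are identical after whitespace normalization, so it misses the similar-but-not-identical code that Sider Scan reports. The function names and the `window` parameter are assumptions made for this illustration.

```python
from collections import defaultdict


def normalize(line: str) -> str:
    """Collapse whitespace so formatting differences do not hide a match."""
    return " ".join(line.split())


def find_duplicate_windows(files: dict[str, str], window: int = 4):
    """Group (path, start_line) locations whose `window` consecutive
    non-blank lines are identical after normalization.

    `files` maps a file path to its source text. This naive approach only
    catches exact matches; it is an illustration, not Sider Scan's method.
    """
    seen = defaultdict(list)
    for path, text in files.items():
        # Keep 1-based line numbers and skip blank lines.
        lines = [(n, normalize(raw)) for n, raw in enumerate(text.splitlines(), 1)]
        lines = [(n, line) for n, line in lines if line]
        for i in range(len(lines) - window + 1):
            chunk = tuple(line for _, line in lines[i:i + window])
            seen[chunk].append((path, lines[i][0]))
    # A chunk seen at more than one location is a duplicate group.
    return [locations for locations in seen.values() if len(locations) > 1]


if __name__ == "__main__":
    demo = {
        "a.php": "<?php\n$total = 0;\nforeach ($items as $i) {\n    $total += $i;\n}\necho $total;\n",
        "b.php": "<?php\n\n$total = 0;\nforeach ($items as $i) {\n  $total += $i;\n}\nreturn $total;\n",
    }
    for group in find_duplicate_windows(demo):
        print(group)
```

Each group lists every location where the same window of code starts, which is the kind of record a report or visualization can then be built from.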

Step 3: view results overview

Once the analysis is complete, you will be shown an initial set of stats that tell you what the duplication rate is and how many duplicate blocks there are, as well as a prioritized list of duplicates you may want to fix. You can also see a side-by-side comparison of the duplicates and check what they look like to decide whether they need your attention.

Step 4: view visual representation of duplicates

The ‘Overview’ tab on the right will show the visualization of the duplicates. There are several views: top 10 most important, top 10 files with most duplicates, and all duplicates. You can also ‘enlarge’ the view for a closer look. Below is an example of all the duplicates.


How to interpret the visual wheel

The visual wheel makes it easy to digest a vast amount of data at a glance. The arcs represent the duplicated portions of code within a file, and the connecting strips (or bands) represent the code that two files have in common. When you click on a strip, it opens a side-by-side view of the duplicated code. Detailed explanations follow below.

The arc: represents a file

The arcs around the circle represent the portions of files that contain duplicates. When you hover over an arc, it shows the filename and the number of duplicated statements. In the image below, it tells us that the file ParallelMapOptional.java contains 1315 statements in total that are duplicates of, or similar to, code located elsewhere.


The connecting strip: represents duplicates

Aside from the arc, you can hover over each strip or band that connects two files, which represents a duplicate or similar block common to both files. When you hover over it, it gives you information on the number of statements that are duplicated. The thickness of the strip represents the relative size of the duplicates. In the image below, it tells us that the green-ish strip represents 14 duplicate or similar statements that are shared between the two files ObservableTakeWhile.java and ParallelMapTry.java.

The image below shows a band connecting the two files FlowableMergeWithMaybe.java and FlowableMergeWithSingle.java, and it tells us that the band represents 205 statements that are common between the two files.

The bubbles: duplicates within a file

There are also duplicates within a single file. These appear as ‘bubbles’ on the arc, as seen in the image below. In this example, the bubble represents 653 duplicated statements. The size of a bubble reflects the relative number of statements it contains.
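The relationship between arcs, bands, and bubbles can be sketched as a small data model in Python. This is only an illustration of how such a wheel could be fed, not how Sider Scan builds it. The record format and the `build_wheel` function are assumptions. The two cross-file counts come from the examples above, while `SomeFile.java` is a made-up name for the file containing the 653-statement bubble.

```python
from collections import defaultdict


def build_wheel(records):
    """Aggregate duplicate records into the three visual elements.

    Each record is (file_a, file_b, duplicated_statements). Summing the
    counts per file is a simplification: it assumes the duplicated regions
    inside one file do not overlap.
    """
    arcs = defaultdict(int)     # file -> total duplicated statements (arc size)
    bands = defaultdict(int)    # (file_a, file_b) -> shared statements (band width)
    bubbles = defaultdict(int)  # file -> statements duplicated within the file
    for file_a, file_b, statements in records:
        if file_a == file_b:
            # A duplicate inside a single file is drawn as a bubble on its arc.
            bubbles[file_a] += statements
            arcs[file_a] += statements
        else:
            # Sort the pair so (a, b) and (b, a) share one band.
            bands[tuple(sorted((file_a, file_b)))] += statements
            arcs[file_a] += statements
            arcs[file_b] += statements
    return dict(arcs), dict(bands), dict(bubbles)


if __name__ == "__main__":
    records = [
        ("ObservableTakeWhile.java", "ParallelMapTry.java", 14),
        ("FlowableMergeWithMaybe.java", "FlowableMergeWithSingle.java", 205),
        ("SomeFile.java", "SomeFile.java", 653),  # hypothetical file name
    ]
    arcs, bands, bubbles = build_wheel(records)
    print(arcs)
    print(bands)
    print(bubbles)
```

Keeping the three mappings separate mirrors the hover behavior described above: an arc reports a per-file total, a band reports what one pair of files shares, and a bubble reports duplication inside one file.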

Click to see comparison of code

When you click on any of the bands or strips, a side-by-side comparison of the code appears, allowing you to check the details of the detected duplicates. The image below shows the comparison of similar code found in two different directories. We can see that:

  • the code on the left and the code on the right each have a different directory path, filename, and line numbers
  • the number of similar statements shared is 15
  • the similarity score between the duplicates is 54% (read more about how the similarity score is calculated here)
Side by side comparison of duplicate or similar code
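For intuition about what a similarity score between two blocks can mean, here is a rough Python sketch using the standard library's `difflib.SequenceMatcher`. It is not Sider Scan's formula, which is described in the linked write-up. The `statements` and `similarity` helpers, and the naive splitting on semicolons, are assumptions made for this example.

```python
from difflib import SequenceMatcher


def statements(code: str) -> list[str]:
    """Naively split code into statements on ';' and normalize whitespace.

    A real tool would parse the language; this is only for illustration.
    """
    return [" ".join(part.split()) for part in code.split(";") if part.strip()]


def similarity(left: str, right: str) -> float:
    """Return a 0.0-1.0 score: 2 * matched statements / total statements."""
    a, b = statements(left), statements(right)
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    left = "$a = 1; $b = 2; $c = $a + $b; echo $c;"
    right = "$a = 1; $b = 3; $c = $a + $b; echo $c;"
    print(f"{similarity(left, right):.0%}")
```

Here three of the four statements on each side match, so the ratio is 2 × 3 / 8, or 75%. A score computed this way rewards blocks that share most of their statements even when a few differ, which is the general idea behind reporting similar code rather than only exact copies.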

20 PHP Projects – Duplicate Code Analysis Stats

For this post, the following 20 PHP projects were analyzed. They are open-source projects publicly available on GitHub with the links below. The analyses can be replicated by downloading the source code directories and then using Sider Scan as indicated above. We will present a summary of the state of duplication within each project, and we will also provide visualizations of some of the notable ones.

Below are the results of having analyzed 14,609 files and detecting 7,037 duplicate code pairs.

20 PHP Projects – Analysis Stats By Project

| Project Name | Directory Size (MB) | Languages | # of Files Analyzed | Duplicates Detected | Duplication Rate |
|---|---|---|---|---|---|
| Laravel-Laravel | 0.1 | PHP, JavaScript | 56 | 3 | 7.4% |
| jQuery | 1.3 | PHP, JavaScript | 23 | 6 | 5.5% |
| Faker | 9.5 | PHP | 455 | 105 | 16.2% |
| Composer | 2.3 | PHP | 254 | 61 | 5.5% |
| Symfony | 26 | PHP, JavaScript | 2849 | 426 | 9.9% |
| Laravel-Framework | 4 | PHP | 1006 | 161 | 8.1% |
| Guzzle | 0.3 | PHP | 40 | 1 | 0.8% |
| Design Patterns PHP | 3 | PHP | 170 | 3 | 3% |
| Monolog | 0.4 | PHP | 110 | 7 | 4.9% |
| CodeIgniter | 3 | PHP, JavaScript | 214 | 117 | 17.8% |
| PhpUnit | 1 | PHP | 365 | 112 | 20.1% |
| PhpMailer | 0.4 | PHP | 57 | 12 | 9.2% |
| Carbon | 2 | PHP | 896 | 229 | 44.1% |
| WordPress | 58 | PHP, JavaScript | 1731 | 2214 | 21.7% |
| Matomo | 35 | PHP, JavaScript, TypeScript | 2531 | 819 | 16.8% |
| PHP-Parser | 0.9 | PHP | 238 | 148 | 36.8% |
| YII2 | 36 | PHP, JavaScript, Ruby | 984 | 763 | 22.2% |
| Grav | 7 | PHP, JavaScript | 473 | 223 | 14.7% |
| Monica | 18 | PHP, JavaScript | 1716 | 1573 | 57.9% |
| Koel | 4 | PHP, JavaScript, TypeScript | 349 | 40 | 11.9% |
| Laravel-Debugbar | 0.4 | PHP, JavaScript | 46 | 13 | 10.5% |
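If you want to explore these figures yourself, a short Python sketch like the following ranks the projects by duplication rate. The numbers are copied from the table above. The `PROJECTS` list and the `top_by_rate` helper are names invented for this example and are not part of Sider Scan.

```python
# (project, files analyzed, duplicates detected, duplication rate in %)
# Values copied from the table above.
PROJECTS = [
    ("Laravel-Laravel", 56, 3, 7.4),
    ("jQuery", 23, 6, 5.5),
    ("Faker", 455, 105, 16.2),
    ("Composer", 254, 61, 5.5),
    ("Symfony", 2849, 426, 9.9),
    ("Laravel-Framework", 1006, 161, 8.1),
    ("Guzzle", 40, 1, 0.8),
    ("Design Patterns PHP", 170, 3, 3.0),
    ("Monolog", 110, 7, 4.9),
    ("CodeIgniter", 214, 117, 17.8),
    ("PhpUnit", 365, 112, 20.1),
    ("PhpMailer", 57, 12, 9.2),
    ("Carbon", 896, 229, 44.1),
    ("WordPress", 1731, 2214, 21.7),
    ("Matomo", 2531, 819, 16.8),
    ("PHP-Parser", 238, 148, 36.8),
    ("YII2", 984, 763, 22.2),
    ("Grav", 473, 223, 14.7),
    ("Monica", 1716, 1573, 57.9),
    ("Koel", 349, 40, 11.9),
    ("Laravel-Debugbar", 46, 13, 10.5),
]


def top_by_rate(rows, n=5):
    """Return the n projects with the highest duplication rate."""
    return sorted(rows, key=lambda row: row[3], reverse=True)[:n]


if __name__ == "__main__":
    for name, files, duplicates, rate in top_by_rate(PROJECTS):
        print(f"{name:<12} {rate:>5.1f}%  {duplicates} duplicates in {files} files")
```

Sorting this way shows that the ranking by rate differs from the ranking by raw count: Monica (57.9%), Carbon (44.1%), and PHP-Parser (36.8%) have the highest duplication rates, while WordPress has the most detected duplicates (2214) but only the fifth-highest rate.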


Collection of compelling visualizations

The following are visual representations of the duplicate code between different files in popular open-source PHP projects. As described above, the arcs represent the duplicated portions of code within a file, and the connecting strips (or bands) represent the code that two files have in common; clicking on a strip opens a side-by-side view of the code that is considered duplicated, copy-pasted, or some variant of either. What the visualizations mean will depend on the individual using the tool; they are a means of finding a duplicate line or block of code that may need your attention.

Move past PHPCPD and create a visualization of duplicate code while it’s free

If you are already using PHPCPD and want to take it miles further, use Sider Scan, an advanced duplicate code detection tool. While deciding what action to take still involves manual steps, a smart visualization makes it significantly easier to find the duplicates that may be problematic and makes the process much more user-friendly. It accelerates the refactoring of code that needs to be fixed or monitored as projects grow and evolve.

Wondering what the status of duplicate code is in your project? Go beyond PHPCPD and go to https://siderlabs.com/scan/ and get a visual representation while it is free.

Create a visual report of your PHP code

Sider is committed to providing software that supports the software development process. Please also see our core product, Sider, a code review and code quality tool that conducts static analysis on your GitHub and GitLab repositories and runs analyzers such as PHP_CodeSniffer to find code violations. It acts as a code beautifier that helps with the readability and maintainability of your projects.

Aki Asahara

CEO of Sider. Aki joined Fixstars in 2008 and served major clients such as the US Air Force, MIT, USC, Toyota, and Hitachi High-Technologies. After his successful tenure, he was appointed CEO of US operations in 2012. He was appointed CEO of Sider in 2019. He holds a Ph.D. in Astrophysics from Kyoto University and is a Certified Scrum Master.
