Duplicate Code Detection in Java: Analyzing 20 well-known Java projects

Sider Scan is a revolutionary duplicate code detection tool that finds the most problematic code duplicates in your projects. It is currently available as a standalone feature that you can test for free at https://siderlabs.com/scan/. It can analyze more combinations of duplicate pairs, and do so more quickly, than existing tools. It also ranks the duplicates so that you can find the most relevant ones among the many that exist, which helps when refactoring and cleaning up code.

In this post, we present the results of duplicate code detection in Java, having analyzed 20 of the most popular open-source Java projects found on GitHub. We summarize our findings with statistics such as the number of duplicates and the duplication rate, provide a visual comparison of the most interesting finds, and tell you how you can do the same!

How do you measure code duplication in a project?

So what counts as code duplication anyway? There are many parameters to consider: whether the code must be exactly identical or only partially identical, and whether a block must contain a minimum number of lines or statements to qualify. Many other criteria can affect this as well. Here are some of the building blocks for quantifying duplication so that its detection can be systematized.

Similarity score

In the case of Sider Scan, we start by defining a similarity score between blocks of code that are considered ‘similar’. For example, if there is an identical match between two blocks of code, then we assign a similarity score of 100%. If two blocks of code are not exactly identical, then the similarity score is loosely defined as the proportion of the number of identical statements contained in the block out of the total number of statements in the block. So as a simplified example, if there are 10 statements each in a pair of code blocks being compared, and 9 of those statements are identical, then the pair would receive a 90% similarity score.
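
The simplified example above can be sketched in Java. This is only an illustration of the "identical statements out of total statements" idea, assuming two blocks of equal size compared position by position; the class and method names are hypothetical, and it is not Sider Scan's actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not Sider Scan's actual algorithm. */
final class SimilarityScoreSketch {

    /**
     * Returns the fraction of statements that match at the same position in
     * two equally sized blocks. Statements are compared after trimming whitespace.
     */
    static double similarity(List<String> blockA, List<String> blockB) {
        if (blockA.isEmpty() || blockA.size() != blockB.size()) {
            throw new IllegalArgumentException("blocks must be non-empty and of equal size");
        }
        int identical = 0;
        for (int i = 0; i < blockA.size(); i++) {
            if (blockA.get(i).trim().equals(blockB.get(i).trim())) {
                identical++;
            }
        }
        return (double) identical / blockA.size();
    }

    public static void main(String[] args) {
        List<String> a = List.of(
                "int total = 0;",
                "int count = items.size();",
                "for (Item item : items) {",
                "total += item.price();",
                "}",
                "double mean = (double) total / count;",
                "log.info(\"mean computed\");",
                "cache.put(key, mean);",
                "metrics.increment();",
                "return mean;");

        // Copy the block and change exactly one of its 10 statements.
        List<String> b = new ArrayList<>(a);
        b.set(7, "cache.putIfAbsent(key, mean);");

        // 9 of 10 statements are identical, so this prints "Similarity: 90%".
        System.out.printf("Similarity: %.0f%%%n", similarity(a, b) * 100);
    }
}
```

A real detector would also need to handle blocks of different lengths and statements that have been inserted or reordered, which this position-wise comparison does not attempt.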

We define a similarity score between two blocks of code, but this gets more complicated when grouping them with other code blocks that may have the exact same logic (duplicate logic) that doesn’t necessarily have a high similarity score. In Sider Scan, we have created an algorithm that not only detects and surfaces relevant code duplication but does the hard work of filtering out those that are less consequential so that you can focus on the ones that matter. 

Duplication rate

Once an analysis of a project is complete, Sider Scan provides a ‘code duplication rate’ that gives a sense of the number of relevant duplicates that exist in that project. This duplication rate (also known as the ‘clone rate’) is a somewhat complex calculation because of the way code statements and duplication are measured and counted. Roughly, however, it can be understood as the total number of duplicate code statements as a proportion of the total number of code statements. As a simplified example, if a project consists of 100 code statements, and 15 of the statements are duplicates (each has a counterpart somewhere among the 100 statements), then the duplication rate is 15%.
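
As a rough sketch of that simplified calculation, the Java snippet below counts every statement that belongs to at least one duplicate block and divides by the total. The representation of duplicates as index ranges, and all names used, are assumptions made for illustration; Sider Scan's real calculation is more involved.

```java
import java.util.BitSet;

/** Illustrative sketch only; not Sider Scan's actual calculation. */
final class DuplicationRateSketch {

    /**
     * @param totalStatements number of statements in the project
     * @param duplicateRanges {startInclusive, endExclusive} statement index ranges,
     *                        one per block that takes part in some duplicate set
     * @return duplicated statements as a fraction of all statements
     */
    static double duplicationRate(int totalStatements, int[][] duplicateRanges) {
        if (totalStatements <= 0) {
            throw new IllegalArgumentException("totalStatements must be positive");
        }
        // A BitSet ensures a statement covered by several ranges is counted once.
        BitSet flagged = new BitSet(totalStatements);
        for (int[] range : duplicateRanges) {
            flagged.set(range[0], range[1]);
        }
        return (double) flagged.cardinality() / totalStatements;
    }

    public static void main(String[] args) {
        int[][] ranges = {
                {10, 15}, {60, 65},  // first duplicate pair: 10 statements
                {12, 16}, {80, 84},  // second pair: overlaps the first, adds 5 new statements
        };
        // 15 distinct statements out of 100, so this prints "Duplication rate: 15%".
        System.out.printf("Duplication rate: %.0f%%%n", duplicationRate(100, ranges) * 100);
    }
}
```

Counting each statement once, even when it appears in several duplicate sets, is one reason the real figure is harder to compute than a simple sum of block sizes.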

It is important to note here that duplicate code statements need not be exactly identical. Even code statements with a low similarity score can be considered duplicates. If the algorithm detects some form of copying and pasting, or identical logic with different variable or function names, the statements are grouped into a set of duplicates.
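
One common way to spot identical logic under different names is to replace identifiers with a placeholder before comparing statements. The Java sketch below illustrates that general technique; it is an assumption for illustration and not a description of how Sider Scan works internally. The keyword list is a deliberately small, hypothetical subset.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative sketch only; shows identifier normalization, not Sider Scan's method. */
final class IdentifierNormalizerSketch {

    // Small illustrative subset of Java keywords that must be left untouched.
    private static final Set<String> KEYWORDS = Set.of(
            "int", "long", "double", "boolean", "return", "if", "else", "for", "while", "new");

    // Matches whole words that start with a letter or underscore.
    private static final Pattern WORD = Pattern.compile("\\b[A-Za-z_]\\w*\\b");

    /** Replaces every non-keyword identifier in the statement with the placeholder "ID". */
    static String normalize(String statement) {
        return WORD.matcher(statement)
                .replaceAll(m -> KEYWORDS.contains(m.group()) ? m.group() : "ID");
    }

    public static void main(String[] args) {
        String original = "int total = price * quantity;";
        String renamed = "int sum = cost * count;";

        // Both statements normalize to "int ID = ID * ID;", so this prints "true".
        System.out.println(normalize(original).equals(normalize(renamed)));
    }
}
```

The two statements share no variable names, yet they compare as equal after normalization, which is the kind of copy-and-rename duplicate described above.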

While the duplication rate provides a singular quantity that represents a particular project, the similarity score, among other quantities, helps us to sort/filter/score parts of our code so that we can find what we are looking for. 

What we analyzed

In this study, we ran an analysis on 20 of the most popular open-source Java projects. This is a process that anyone can replicate using Sider Scan’s duplicate code detection and analysis tool which, as of 6/2/21, is available free for trial use at https://siderlabs.com/scan/.  

Java projects and links

The following are the projects analyzed, and the URL of the source code. Some of these are well-known open-source projects. The projects were sorted in descending order of the number of stars they received, and the top 20 were analyzed.

  1. Elasticsearch
  2. RxJava
  3. Spring Framework
  4. Retrofit
  5. Dubbo
  6. Java
  7. Proxyee-Down
  8. Spring Boot Examples
  9. Ghidra
  10. Apollo
  11. Druid
  12. Fastjson
  13. Recycler View Adapter
  14. Hystrix
  15. Jeecg-boot
  16. Kafka
  17. PhotoView
  18. Spring Cloud Alibaba
  19. Hutool
  20. Jenkins

The source code directory was dragged and dropped into Sider Scan for analysis, and a total of 33,641 files totaling 537 MB in size were analyzed. Below are the results for each project.

Most projects clustered around an 18% duplication rate. Some duplication rates were as low as 1.9%, a result that merits a closer look at a later time.

Summary of results

The analysis of most projects took several minutes; projects with a greater number of duplicates took much longer to complete. Overall, more than 20,000 duplicates were discovered, and the mean duplication rate was 17.6% (the median was 17.5%). This roughly translates to a little over a sixth of all code statements being some form of duplicate of another within the same project. However, the duplication rate means very little unless we look at what these duplications are and assess the impact they have on the goals of that project.

How to take action – 3 Steps

So what now? While these stats provide some indication of the duplication landscape, there is still a need to look into the details of the duplicates to determine what action to take. If a project has hundreds or thousands of duplicate code blocks, there is no time to look through each and every pair. So how does Sider Scan help? It systematizes the search process by scoring duplicates, allowing you to find the ones that may need the most attention.

There are 3 steps involved in taking action:

  1. Sort through the list of duplicates detected
  2. Open the side-by-side comparison view of the duplicate pair and identify whether action is needed
  3. Make fixes as needed

While these individual steps will be covered in more detail in another post, the following are the filters and visualizations unique to Sider Scan.

How to detect problematic duplicate code in Java

Sider has developed an algorithm that can detect and determine which duplicate code blocks need the most attention. The analysis results interface allows you to sort the duplicate code blocks using the following criteria:

  • Importance (needs most attention)
  • Number of duplicates (also known as number of clones)
  • Similarity score (or degree of similarity)

With this, Sider Scan selects and prioritizes for you which duplicate blocks are likely to be problematic.
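
To make the three sort criteria concrete, here is a minimal Java sketch that orders a list of duplicate groups by each of them. The `DuplicateGroup` type, its fields, and the numeric importance score are hypothetical stand-ins; how Sider Scan actually computes importance is not described in this post.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; the data model here is an assumption. */
final class DuplicateSortingSketch {

    /** Hypothetical summary of one group of duplicates. */
    static final class DuplicateGroup {
        final String id;
        final double importance;  // hypothetical score: higher means needs more attention
        final int cloneCount;     // number of duplicates (clones) in the group
        final double similarity;  // similarity score from 0.0 to 1.0

        DuplicateGroup(String id, double importance, int cloneCount, double similarity) {
            this.id = id;
            this.importance = importance;
            this.cloneCount = cloneCount;
            this.similarity = similarity;
        }
    }

    // Each comparator puts the highest value first.
    static final Comparator<DuplicateGroup> BY_IMPORTANCE =
            Comparator.comparingDouble((DuplicateGroup g) -> g.importance).reversed();
    static final Comparator<DuplicateGroup> BY_CLONE_COUNT =
            Comparator.comparingInt((DuplicateGroup g) -> g.cloneCount).reversed();
    static final Comparator<DuplicateGroup> BY_SIMILARITY =
            Comparator.comparingDouble((DuplicateGroup g) -> g.similarity).reversed();

    /** Returns the group ids in the given order without modifying the input list. */
    static List<String> sortedIds(List<DuplicateGroup> groups, Comparator<DuplicateGroup> order) {
        List<DuplicateGroup> copy = new ArrayList<>(groups);
        copy.sort(order);
        List<String> ids = new ArrayList<>();
        for (DuplicateGroup g : copy) {
            ids.add(g.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<DuplicateGroup> groups = List.of(
                new DuplicateGroup("A", 0.2, 8, 0.95),
                new DuplicateGroup("B", 0.9, 2, 0.70),
                new DuplicateGroup("C", 0.5, 5, 1.00));

        System.out.println(sortedIds(groups, BY_IMPORTANCE));   // [B, C, A]
        System.out.println(sortedIds(groups, BY_CLONE_COUNT));  // [A, C, B]
        System.out.println(sortedIds(groups, BY_SIMILARITY));   // [C, A, B]
    }
}
```

Note that the three orderings disagree: the group with the most clones (A) is not the most important one (B), which is why a single sort key is rarely enough when triaging duplicates.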

It also provides a visual overview of how the duplicates are interconnected between different files. The example below shows which files share duplicate code, where each of the arcs represents a particular file, and the strips connecting the files represent shared duplicates. 

A visual representation of duplicate code blocks shared between various files.


What kind of duplicate code do you have? Find out easily.

As of Feb 2021, the advanced duplicate code detection tool, Sider Scan, is available free as a standalone tool for testing purposes and can be found at https://siderlabs.com/scan/. It detects duplicate code not only in Java but also in PHP, Ruby, Swift, JavaScript, TypeScript, C, C++, and CUDA. Python and C# will be supported soon. It is scheduled to be part of the software build for Sider’s code review system so that important duplicates are not only found but also tracked over time. It’s fast, secure (your code does not leave your device), created based on user feedback, and one-of-a-kind.


Aki Asahara

CEO of Sider. Aki joined Fixstars in 2008 and served major clients such as the US Air Force, MIT, USC, Toyota, and Hitachi High-Technologies. After his successful tenure, he was appointed CEO of US operations in 2012. He was appointed CEO of Sider in 2019. He holds a Ph.D. in Astrophysics from Kyoto University and is a Certified Scrum Master.
