Sider Labs is a revolutionary tool that finds the most problematic code duplicates in your projects. It is currently available as a standalone feature that you can test for free at https://siderlabs.com/labs/. Not only can it analyze more combinations of duplicate pairs, and do so more quickly, than existing tools, but it can also order and prioritize them so that you can find the most relevant duplicates among the many that exist, which helps with efforts to refactor and clean up code.
In this post, we present the results of duplicate code detection in Java, having analyzed 20 of the most popular open-source Java projects found on GitHub. We summarize our findings with statistics such as the number of duplicates and the duplication rate, provide a visual comparison of the most interesting finds, and tell you how you can do the same!
How do you measure code duplication in a project?
So what counts as code duplication anyway? Many parameters determine whether blocks of code count as duplicates: whether they must be exactly identical or only partially identical, and whether they must span a minimum number of lines or statements to qualify. Many other criteria can affect this as well. Here are some of the building blocks for quantifying duplication so that its detection can be systematized.
In the case of Sider Labs, we start by defining a similarity score between blocks of code that are considered ‘similar’. If two blocks of code match exactly, we assign a similarity score of 100%. If they are not exactly identical, the similarity score is loosely defined as the number of identical statements in the block as a proportion of the total number of statements in the block. As a simplified example, if each of the two code blocks being compared contains 10 statements, and 9 of those statements are identical, then the pair receives a 90% similarity score.
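To make the arithmetic concrete, here is a minimal Java sketch of the simplified score from the example. It is not Sider Labs' actual algorithm: the class name `SimilarityScore`, the method `percent`, and the position-by-position comparison of trimmed statement strings are all assumptions made purely for illustration.

```java
import java.util.List;

// Hypothetical helper illustrating the simplified similarity score.
// This is NOT Sider Labs' real implementation.
final class SimilarityScore {

    // Returns the percentage of statement positions at which both blocks
    // contain the same statement text (ignoring surrounding whitespace).
    // The longer block sets the denominator, so extra statements lower the score.
    static double percent(List<String> blockA, List<String> blockB) {
        int total = Math.max(blockA.size(), blockB.size());
        if (total == 0) {
            return 100.0; // two empty blocks are treated as identical
        }
        int shared = Math.min(blockA.size(), blockB.size());
        int identical = 0;
        for (int i = 0; i < shared; i++) {
            if (blockA.get(i).trim().equals(blockB.get(i).trim())) {
                identical++;
            }
        }
        return 100.0 * identical / total;
    }

    public static void main(String[] args) {
        List<String> a = List.of("s1", "s2", "s3", "s4", "s5",
                                 "s6", "s7", "s8", "s9", "s10");
        List<String> b = List.of("s1", "s2", "s3", "s4", "s5",
                                 "s6", "s7", "s8", "s9", "changed");
        System.out.println(percent(a, b)); // prints 90.0
    }
}
```

A real detector would have to align statements that have shifted position and normalize identifiers before comparing; the sketch skips both to keep the 9-out-of-10 example easy to follow.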
A similarity score is defined between two blocks of code, but things get more complicated when grouping them with other code blocks that have the exact same logic (duplicate logic) yet do not necessarily have a high similarity score. In Sider Labs, we have created an algorithm that not only detects and surfaces relevant code duplication but also does the hard work of filtering out duplicates that are less consequential, so that you can focus on the ones that matter.
Once an analysis of a project is complete, Sider Labs will provide a ‘code duplication rate’ that gives a sense of the number of relevant duplicates that exist in that project. This duplication rate (also known as ‘clone rate’) is a somewhat complex calculation because of the way code statements and duplication are measured and counted. Roughly, however, it can be understood as the total number of duplicate code statements as a proportion of the total number of code statements. As a simplified example, if a project consists of 100 code statements, and 15 of them are duplicates (each has a duplicate counterpart somewhere among the 100 statements), then the duplication rate is 15%.
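The simplified duplication rate can be written in a few lines of Java. This is only a sketch of the rough definition given here, assuming the two statement counts are already known; the class `DuplicationRate` and its method `percent` are hypothetical names, and the actual calculation in Sider Labs is more involved.

```java
// Hypothetical helper for the simplified duplication (clone) rate.
// Assumes the statement counts have already been measured elsewhere.
final class DuplicationRate {

    // Duplicate statements as a percentage of all statements in the project.
    static double percent(int duplicateStatements, int totalStatements) {
        if (totalStatements <= 0) {
            throw new IllegalArgumentException("totalStatements must be positive");
        }
        if (duplicateStatements < 0 || duplicateStatements > totalStatements) {
            throw new IllegalArgumentException("duplicateStatements out of range");
        }
        return 100.0 * duplicateStatements / totalStatements;
    }

    public static void main(String[] args) {
        // 15 duplicate statements out of 100 in total
        System.out.println(percent(15, 100)); // prints 15.0
    }
}
```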
It is important to note here that duplicate code statements are not necessarily exactly identical. Code statements can be considered duplicates even with a low similarity score. If the algorithm detects some form of copying and pasting, or identical logic with differing variable/function names, the statements are grouped into a set of duplicates.
While the duplication rate provides a single figure that represents a particular project, the similarity score, among other quantities, helps us sort, filter, and score parts of our code so that we can find what we are looking for.
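As a hypothetical illustration (not taken from any of the analyzed projects), the two Java methods below differ in their method and variable names, so no line matches textually, yet they implement exactly the same logic. This is the kind of pair that "identical logic with different names" refers to; whether Sider Labs would group this exact pair is an assumption.

```java
// Hypothetical example of duplicate logic hidden behind different names.
final class RenamedClone {

    // Sums the prices, skipping negative entries.
    static int sumPrices(int[] prices) {
        int total = 0;
        for (int price : prices) {
            if (price >= 0) {
                total += price;
            }
        }
        return total;
    }

    // Same logic as sumPrices; only the identifiers were changed.
    static int totalWeight(int[] weights) {
        int sum = 0;
        for (int weight : weights) {
            if (weight >= 0) {
                sum += weight;
            }
        }
        return sum;
    }
}
```

A natural fix for such a pair is to extract one shared method and call it from both places.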
What we analyzed
In this study, we ran an analysis on 20 of the most popular open-source Java projects. This is a process that anyone can replicate using Sider Labs’ duplicate code detection and analysis tool which, as of 2/12/21, is available free for unlimited use at https://siderlabs.com/labs/.
Java projects and links
The following are the projects analyzed, and the URL of the source code. Some of these are well-known open-source projects. The projects were ranked in descending order of the number of stars they received, and the top 20 were analyzed.
- Elastic Search
- Rx Java
- Spring Framework
- Spring Boot Examples
- Fast Json
- Recycler View Adapter
- Spring Cloud Alibaba
Each project's source code directory was dragged and dropped into Sider Labs for analysis; in total, 33,641 files amounting to 537 MB were analyzed. Below are the results for each project.
We can see that most projects clustered around an 18% duplication rate. Some duplication rates were as low as 1.9%; these outliers are worth reviewing further at a later time.
Summary of results
The analysis of most projects took several minutes, but projects with a greater number of duplicates took much longer to complete. Overall, more than 20,000 duplicates were discovered, and the mean duplication rate was 17.6% (the median was 17.5%). This roughly translates to a little over a sixth of all code statements being some form of duplicate of another statement within the same project. However, the duplication rate means very little unless we look at what these duplications are and assess the impact they have on the goals of that project.
How to take action – 3 Steps
So what now? While these stats provide some indication of the duplication landscape, we still need to look into the details of the duplicates to determine what action to take. If a project has hundreds or thousands of duplicate code blocks, it is impractical to look through each and every pair. So how does Sider Labs help? It systematizes the search process by scoring duplicates, which allows you to find the ones that may need the most attention.
There are 3 steps involved in taking action:
- Sort through the list of duplicates detected
- Open the side-by-side comparison view of the duplicate pair and identify whether action is needed
- Make fixes as needed
While these individual steps will be covered in more detail in another post, the following are the filters and visualizations unique to Sider Labs.
How to detect problematic duplicate code in Java
Sider Labs has developed an algorithm that can detect and determine which duplicate code blocks need the most attention. The analysis results interface allows you to sort the duplicate code blocks using the following criteria:
- Importance (needs most attention)
- Number of duplicates (also known as number of clones)
- Similarity score (or degree of similarity)
With this, Sider Labs selects and prioritizes for you which duplicate blocks are likely to be problematic.
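If you export or model such results yourself, sorting by the three criteria could look like the Java sketch below. Everything in it is hypothetical: the `DuplicateGroup` class, its fields, and the comparators are illustrative names, and the importance score is treated as an already computed number because the post does not describe how Sider Labs derives it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of one group of duplicate code blocks.
final class DuplicateGroup {
    final String label;
    final double importance;  // assumed precomputed; higher needs more attention
    final int cloneCount;     // number of duplicates (clones) in the group
    final double similarity;  // similarity score in percent

    DuplicateGroup(String label, double importance, int cloneCount, double similarity) {
        this.label = label;
        this.importance = importance;
        this.cloneCount = cloneCount;
        this.similarity = similarity;
    }
}

// Hypothetical sorter offering one descending order per criterion.
final class DuplicateSorter {
    static final Comparator<DuplicateGroup> BY_IMPORTANCE =
            Comparator.comparingDouble((DuplicateGroup g) -> g.importance).reversed();
    static final Comparator<DuplicateGroup> BY_CLONE_COUNT =
            Comparator.comparingInt((DuplicateGroup g) -> g.cloneCount).reversed();
    static final Comparator<DuplicateGroup> BY_SIMILARITY =
            Comparator.comparingDouble((DuplicateGroup g) -> g.similarity).reversed();

    // Returns a sorted copy and leaves the input list untouched.
    static List<DuplicateGroup> sorted(List<DuplicateGroup> groups,
                                       Comparator<DuplicateGroup> order) {
        List<DuplicateGroup> copy = new ArrayList<>(groups);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<DuplicateGroup> groups = List.of(
                new DuplicateGroup("A", 0.2, 5, 100.0),
                new DuplicateGroup("B", 0.9, 2, 80.0),
                new DuplicateGroup("C", 0.5, 9, 95.0));
        for (DuplicateGroup g : sorted(groups, BY_IMPORTANCE)) {
            System.out.println(g.label + " importance=" + g.importance);
        }
    }
}
```

Keeping each criterion as a separate comparator makes it easy to switch the sort order, or to chain them with `thenComparing` to break ties.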
It also provides a visual overview of how the duplicates are interconnected between different files. The example below shows which files share duplicate code, where each of the arcs represents a particular file, and the strips connecting the files represent shared duplicates.
What kind of duplicate code do you have? Find out easily.