In theory, automatically finding recurring implicit anti-patterns would help improve the efficiency of code review. When we ask developers about creating rules for implicit anti-patterns, a frequent response is that they can't come up with any. What makes implicit patterns so difficult to find? Research using data mining identifies improvement patterns that reappear more often than we expect.
In a study by researchers at Nara Institute of Science and Technology and Wakayama University, 228,099 patches in OpenStack were analyzed for code improvement patterns. The study compared the initially submitted version and the final accepted (or rejected) version of each patch; the changes between the two versions are the improvements made through code review. The researchers proposed an algorithm to extract patterns from these changes and found 1,476 individual code improvement patterns in the project.
The table shows the most frequently appearing patterns, grouped into four categories. The Support values (the number of rewrites during code review) may surprise you: two of the patterns occurred more than 5,000 times!
Another point the study notes is that there is no official documentation in the project about these patterns. Although OpenStack and Python have good coding style guidelines, none of these patterns are referenced in them. Do developers know about them, though? Yes, people notice these patterns: discussions of them can be found on Stack Overflow and OpenStack forums. However, they are still not officially documented for new patch authors.
Rapid rise and fall of patterns
When we look closer at the patterns, the Project-specific and Language-specific categories show more than 17,000 support instances over the 17-month span of the study. These patterns relate to dependency changes as well as changes in Python itself, and they occur at different periods of the project. assert-equals2equal, for instance, occurred only in the second period of the study, while disk2disk_api occurred only during the fourth and fifth periods. The short life cycle of these improvements may explain their absence from any official guideline: detections rise steeply over a span of a few months, then fade out over time. The short life of these patterns makes it easy to dismiss their impact on a project, but the high volume of patches shows a large cost in repeatedly correcting the same issues.
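The paper defines each pattern precisely; assuming assert-equals2equal refers to the common Python rewrite of the deprecated unittest alias assertEquals into assertEqual, a minimal before/after sketch looks like this (the test case itself is hypothetical):

```python
import unittest


class TestVolumeSize(unittest.TestCase):
    """Hypothetical test case illustrating the rewrite."""

    def test_size(self):
        size = 10
        # Before review: self.assertEquals(size, 10)  # deprecated alias
        # After review, the canonical spelling:
        self.assertEqual(size, 10)


# The rewritten assertion behaves exactly as before and the test still passes.
result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(TestVolumeSize).run(result)
print(result.wasSuccessful())  # → True
```

Since the rewrite changes only the spelling of the assertion, it is exactly the kind of behavior-preserving fix a reviewer has to point out again and again unless a tool or guideline catches it first.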
Recurring readability improvement patterns
Looking at the Readability-improvement and Other categories, patterns such as directly-dictionary-access in table 1 have no impact on the behavior of the code, so in essence they could be ignored, yet reviewers frequently flag them. The reason is that directly-dictionary-access improves the readability of the output, while remove-redundant-in, though not stated explicitly, can improve the readability of the code. The study's interpretation is that these patterns are project-specific, which is why they are not included in any general best-practice guideline.
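The paper gives the precise definition of directly-dictionary-access; as one plausible reading of the name, subscripting a dictionary directly instead of going through a throwaway variable changes nothing about behavior but reads more cleanly (the config dict and messages here are hypothetical):

```python
# Hypothetical configuration data for illustration
config = {"host": "controller", "port": 8776}

# Before: indirect access through an intermediate variable
values = config
host = values["host"]
message_before = "connecting to %s" % host

# After: direct dictionary access, identical behavior
message_after = "connecting to %s" % config["host"]

print(message_before == message_after)  # → True
```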
If we look at reverse-assert-arguments, there is no strong standard for the order of assertEqual arguments beyond a preference within the project. Documenting that preference in a coding guideline would spare authors a 50% guess at matching this ambiguous order.
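unittest itself attaches no meaning to the two positions of assertEqual, which its documentation simply calls first and second, so both orders pass and the review pattern only enforces a project preference. A small sketch (the test case is hypothetical):

```python
import unittest


class TestArgumentOrder(unittest.TestCase):
    """Both argument orders pass; only the failure message wording differs."""

    def test_expected_first(self):
        expected, actual = 4, 2 + 2
        # Hypothetical project preference: expected value first
        self.assertEqual(expected, actual)

    def test_actual_first(self):
        expected, actual = 4, 2 + 2
        # The reversed order is just as valid to unittest itself
        self.assertEqual(actual, expected)


result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(TestArgumentOrder).run(result)
print(result.testsRun, result.wasSuccessful())  # → 2 True
```

Because the library enforces nothing, only a written guideline (or an automated check) can save authors from guessing which order a reviewer will ask for.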
remove-redundant-in is also a good example of an implicit pattern followed in the source code. The study found 930 supported instances of this pattern, yet there is still no official documentation explaining the reason for these changes.
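Again, the paper pins down the exact shape of remove-redundant-in; one reading consistent with the name is removing an in membership check that a dict.get call makes redundant, which keeps the behavior identical (the data and function names are hypothetical):

```python
# Hypothetical service counts for illustration
inventory = {"cinder": 3, "nova": 5}


def get_count_before(counts, key):
    # Before: the "in" check duplicates work the lookup does anyway
    if key in counts:
        return counts[key]
    return 0


def get_count_after(counts, key):
    # After: dict.get makes the redundant "in" check unnecessary
    return counts.get(key, 0)


print(get_count_before(inventory, "nova") == get_count_after(inventory, "nova"))    # → True
print(get_count_before(inventory, "swift") == get_count_after(inventory, "swift"))  # → True
```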
The study shows that data mining can help detect and track the transition of implicit code patterns, leading to further improvements in the code review process. The fast cycle of project-specific patterns and the importance of readability have a larger impact than developers realize. Documenting project-specific patterns, or using tools to address these issues automatically, would help reduce the cost of the review process.
For more information on the research, check out the paper at