Overview and Implementation of Performance/RegexpMatch Cop

  • Post last modified:2020-11-25

(Editor’s note: This post was originally published in Japanese on December 6, 2016, by Pocke, our Engineering Advisor, on his blog, and was translated and edited by the Sider Team.)

Hello, this is Pocke. 
A Cop called Performance/RegexpMatch was added to RuboCop in December 2016.

Reference:
Add new Performance/RegexpMatch cop by pocke · Pull Request #3824 · bbatsov/rubocop

This Cop targets the match? method, which was added in Ruby 2.4.

In this article, I will first describe the Ruby 2.4 feature that the Performance/RegexpMatch Cop deals with, and then give an overview of the Cop and walk through its implementation.

About match? method

So what exactly is the match? method?

This method was added to the following classes: Regexp, String, and Symbol. Each of these classes already had a match method without the question mark, such as Regexp#match.

The new match? method was added for performance reasons.

Methods like Regexp#match generate a MatchData object whenever the match succeeds. That work is wasted if we never use the object.

if match = /re(gexp)/.match(foo)
  do_something(match[1])
end

For example, the code above stores the result of Regexp#match in the match variable.

The value of the match variable is a MatchData object (or nil) that holds the result of the match.

In this example, a sub-match (capture group) of the regular expression is passed to the do_something method.

However, the following example does not use MatchData.

if /re(gexp)/.match(foo)
  do_something
end

In this case, a boolean value could suffice as the result of match.
And that is what match? does.

In Ruby 2.4, we can rewrite the above example by using match? as follows.

if /re(gexp)/.match?(foo)
  do_something
end

Reference: What’s new in Ruby 2.4?
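As a small illustration of the difference, here is a minimal sketch. The two method names are made up for this example. Each call runs inside its own method so that $~ starts out as nil, and the method returns whatever $~ holds afterwards.

```ruby
# Each method starts with a fresh $~ (nil), so whatever it returns
# was set by the single regexp call inside it.
def last_match_after_match(text)
  /re(gexp)/.match(text)   # builds a MatchData and stores it in $~
  $~
end

def last_match_after_match_p(text)
  /re(gexp)/.match?(text)  # returns true or false and leaves $~ untouched
  $~
end

p(/re(gexp)/.match?("regexp"))          # => true
p last_match_after_match("regexp")[1]   # => "gexp"
p last_match_after_match_p("regexp")    # => nil
```

Because match? never fills in $~, it is only a safe replacement when nothing reads the match result afterwards.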

About RuboCop

RuboCop is a Linter for Ruby. 
For more information, please go to bbatsov/rubocop: A Ruby static code analyzer, based on the community Ruby style guide.

About Performance/RegexpMatch

As described above, Performance/RegexpMatch is a Cop that targets the match? method. In RuboCop terminology, a Cop is a rule. This Cop detects and corrects code that calls the match method unnecessarily, as in the example above.

Let’s see an example. If you analyze the code above with RuboCop, the following warning is issued. Since this Cop only works with Ruby 2.4 or higher, you need to specify the Ruby version in .rubocop.yml.

# test.rb
if /re(gexp)/.match(foo)
  do_something
end

# .rubocop.yml
AllCops:
  TargetRubyVersion: 2.4

$ rubocop --only Performance/RegexpMatch
Inspecting 1 file
C

Offenses:

test.rb:2:4: C: Use match? instead of match when MatchData is not used.
if /re(gexp)/.match(foo)
   ^^^^^^^^^^^^^^^^^^^^^

1 file inspected, 1 offense detected

As shown, you are warned to use the match? method. If the code refers to the MatchData, no warning is issued.

In addition, since this Cop supports Auto-Correct, it is possible to automatically correct the code by executing RuboCop with the -a option.

$ rubocop --only Performance/RegexpMatch -a
Inspecting 1 file
C

Offenses:

test.rb:2:4: C: [Corrected] Use match? instead of match when MatchData is not used.
if /re(gexp)/.match(foo)
   ^^^^^^^^^^^^^^^^^^^^^

1 file inspected, 1 offense detected, 1 offense corrected

$ git diff
diff --git a/test.rb b/test.rb
index 6d73a78..af34863 100644
--- a/test.rb
+++ b/test.rb
@@ -1,4 +1,4 @@
 # test.rb
-if /re(gexp)/.match(foo)
+if /re(gexp)/.match?(foo)
   do_something
 end

Implementing Performance/RegexpMatch

Now, I would like to look at the implementation of this Cop. A naive implementation would only need to check in on_send whether the method name is match, but this Cop has several features to prevent false positives.

If you are unfamiliar with how RuboCop is implemented, reading RuboCop’s implementation first will help you follow the sections below, so please take a look.

I will describe the source code as it was when I wrote this article. Please note that some of the code may have been rewritten by later changes.

● The commit: Add new Performance/RegexpMatch cop · bbatsov/rubocop@864531f

● Main reference source: rubocop/regexp_match.rb at 864531f61634354570a6b4458cb599c4373659b7 · bbatsov/rubocop

Entry point

First of all, let’s look at the entry point where this Cop is executed. In this Cop, there are two entry points: on_if and on_case.

https://github.com/bbatsov/rubocop/blob/864531f61634354570a6b4458cb599c4373659b7/lib/rubocop/cop/performance/regexp_match.rb#L90-L106

def on_if(node)
  return if target_ruby_version < 2.4

  cond, = *node
  check_condition(cond)
end

def on_case(node)
  return if target_ruby_version < 2.4

  case_cond, = *node
  return if case_cond

  when_clauses(node).each do |when_node|
    cond, = *when_node
    check_condition(cond)
  end
end

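As you can see, it passes the condition part of if and case expressions to RegexpMatch#check_condition. For case, only the form without a subject (case followed directly by when clauses) is examined, because on_case returns early when case_cond is present.

That means code such as x = /re/.match("foo") is never a target in the first place.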
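To make the entry points concrete, here is a short sketch of code shapes that the two hooks would and would not look at, based only on the on_if and on_case code above. The method names are invented for illustration.

```ruby
# Looked at via on_if: the match call is the condition of an if expression.
def checked_in_if(foo)
  if /re/.match(foo)
    :matched
  end
end

# Looked at via on_case: a case without a subject, so each when clause
# carries a condition of its own.
def checked_in_case(foo)
  case
  when /re/.match(foo) then :matched
  end
end

# Not looked at: case with a subject (on_case returns early).
def not_checked_case_with_subject(foo)
  case foo
  when /re/ then :matched
  end
end

# Not looked at: a plain assignment is neither an if nor a case condition.
def not_checked_assignment(foo)
  x = /re/.match(foo)
  x && x[0]
end
```

Restricting the Cop to conditions keeps it conservative: in a condition only the truthiness of the result matters, whereas an assigned MatchData is very likely to be used later.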

check_condition

Next, let us see the implementation of the check_condition method.

https://github.com/bbatsov/rubocop/blob/864531f61634354570a6b4458cb599c4373659b7/lib/rubocop/cop/performance/regexp_match.rb#L124-L130

def check_condition(cond)
  match_node?(cond) do
    return if last_match_used?(cond)

    add_offense(cond, :expression,
                format(MSG, cond.loc.selector.source))
  end
end

As you can see, add_offense is called when cond satisfies both of the following conditions.

● The result of match_node? is true

● The result of last_match_used?(cond) is false

Note that add_offense is the method that registers an offense on the target node (in RuboCop terms, an offense is a warning about the code). In other words, calling this method is the ultimate goal of a Cop.

Let’s look at these conditions one by one.

match_node?

https://github.com/bbatsov/rubocop/blob/864531f61634354570a6b4458cb599c4373659b7/lib/rubocop/cop/performance/regexp_match.rb#L51-L74

def_node_matcher :match_method?, <<-PATTERN
  {
    (send _recv :match _)
    (send _recv :match _ (:int ...))
  }
PATTERN

def_node_matcher :match_operator?, <<-PATTERN
  (send !nil :=~ !nil)
PATTERN

def_node_matcher :match_with_lvasgn?, <<-PATTERN
  (match_with_lvasgn !nil !nil)
PATTERN

MATCH_NODE_PATTERN = <<-PATTERN.freeze
  {
    #match_method?
    #match_operator?
    #match_with_lvasgn?
  }
PATTERN

def_node_matcher :match_node?, MATCH_NODE_PATTERN

It’s a bit long, but the definition of the match_node? method is at the bottom of this code.

The match_node? method is defined using def_node_matcher.

The content of the MATCH_NODE_PATTERN constant that is passed to def_node_matcher is as follows.

{
  #match_method?
  #match_operator?
  #match_with_lvasgn?
}

For those who are not familiar with NodePattern, the {} pattern represents “or”: if the node matches any one of the patterns inside it, the entire pattern matches.

Also, a pattern that starts with # is a method call. The node is passed to the named method, and if the result is true, the node matches the pattern.

(Those who would like to know more about NodePattern should read rubocop/node_pattern.rb at master · bbatsov/rubocop.)

With that in mind, you can see from this matcher definition that match_node? returns true when any one of the following returns true:

  • match_method?
  • match_operator?
  • match_with_lvasgn?

In addition, though I am not going to explain them in detail in this post, these three matchers return true in the following cases.

  1. match_method?
  • foo.match(/re/)
  • foo.match(/re/, 1)

  2. match_operator?
  • foo =~ /re/
  • re =~ "foo"

  3. match_with_lvasgn?
  • /re/ =~ foo

In other words, true is returned for a call to the match method or to the =~ operator.
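To show how the three patterns combine, here is a toy emulation in plain Ruby. It is only a sketch of the idea and not how NodePattern is implemented: the AST is faked with nested arrays of the form [type, *children], and the hand-written predicates stand in for the matchers that def_node_matcher would generate.

```ruby
# (send _recv :match _) or (send _recv :match _ (:int ...))
def match_method?(node)
  type, _recv, name, *args = node
  return false unless type == :send && name == :match

  args.size == 1 ||
    (args.size == 2 && args[1].is_a?(Array) && args[1][0] == :int)
end

# (send !nil :=~ !nil)
def match_operator?(node)
  type, recv, name, arg, *rest = node
  type == :send && name == :=~ && !recv.nil? && !arg.nil? && rest.empty?
end

# (match_with_lvasgn !nil !nil)
def match_with_lvasgn?(node)
  type, lhs, rhs, *rest = node
  type == :match_with_lvasgn && !lhs.nil? && !rhs.nil? && rest.empty?
end

# { #match_method? #match_operator? #match_with_lvasgn? }
def match_node?(node)
  match_method?(node) || match_operator?(node) || match_with_lvasgn?(node)
end

foo    = [:send, nil, :foo]
regexp = [:regexp, [:str, "re"], [:regopt]]

p match_node?([:send, foo, :match, regexp])             # foo.match(/re/)    => true
p match_node?([:send, foo, :match, regexp, [:int, 1]])  # foo.match(/re/, 1) => true
p match_node?([:send, foo, :=~, regexp])                # foo =~ /re/        => true
p match_node?([:match_with_lvasgn, regexp, foo])        # /re/ =~ foo        => true
p match_node?([:send, nil, :puts, [:str, "x"]])         # puts("x")          => false
```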

With only the code we have seen so far, the Cop can already issue a warning for the earlier example.

# - within an if condition => `on_if` is applied
# - there is a call to the `match` method => `match_method?` is applied
if /re(gexp)/.match(foo)
  do_something
end

But this Cop has one more condition, last_match_used?. What does it do?
In the next section, we will look at how this method works.

last_match_used?

All of the following snippets behave the same way.

# part1
if match = /re(gexp)/.match(foo)
  do_something(match[1])
end

# part2
if /re(gexp)/.match(foo)
  do_something(Regexp.last_match[1])
end

# part3
if /re(gexp)/.match(foo)
  do_something($~[1])
end

# part4
if /re(gexp)/.match(foo)
  do_something($1)
end

Believe it or not, you can use the result of the match method without explicitly assigning it to a variable!

The last_match_used? method checks whether MatchData is referenced through these special global variables (or Regexp.last_match) even when it is not explicitly assigned to a variable, as in the examples above.
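The following runnable sketch wraps the four forms in methods (the method names are invented here) so that their results can be compared. The last method shows what would go wrong if match were blindly replaced with match? while $1 is still being read.

```ruby
def part1(foo)
  if match = /re(gexp)/.match(foo)
    match[1]
  end
end

def part2(foo)
  Regexp.last_match[1] if /re(gexp)/.match(foo)
end

def part3(foo)
  $~[1] if /re(gexp)/.match(foo)
end

def part4(foo)
  $1 if /re(gexp)/.match(foo)
end

# A careless rewrite: match? does not set $~, so $1 stays nil.
def part4_with_match_p(foo)
  $1 if /re(gexp)/.match?(foo)
end

p [part1("regexp"), part2("regexp"), part3("regexp"), part4("regexp")]
# => ["gexp", "gexp", "gexp", "gexp"]
p part4_with_match_p("regexp")  # => nil
```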

Let’s take a look at the code in detail.

def_node_search :search_match_nodes, MATCH_NODE_PATTERN

def_node_search :last_matches, <<-PATTERN
  {
    (send (const nil :Regexp) :last_match)
    (send (const nil :Regexp) :last_match _)
    ({back_ref nth_ref} _)
    (gvar #dollar_tilde)
  }
PATTERN

# The middle part is omitted.

def last_match_used?(match_node)
  scope_root = scope_root(match_node)
  body = scope_root ? scope_body(scope_root) : match_node.ancestors.last
  match_node_pos = match_node.loc.expression.begin_pos

  next_match_pos = next_match_pos(body, match_node_pos, scope_root)
  range = match_node_pos..next_match_pos

  find_last_match(body, range, scope_root)
end

def next_match_pos(body, match_node_pos, scope_root)
  node = search_match_nodes(body).find do |match|
    match.loc.expression.begin_pos > match_node_pos &&
      scope_root(match) == scope_root
  end
  node ? node.loc.expression.begin_pos : Float::INFINITY
end

def find_last_match(body, range, scope_root)
  last_matches(body).find do |ref|
    ref_pos = ref.loc.expression.begin_pos
    range.cover?(ref_pos) &&
      scope_root(ref) == scope_root
  end
end

def scope_body(node)
  node.children[2]
end

def scope_root(node)
  node.each_ancestor.find do |ancestor|
    ancestor.def_type? ||
      ancestor.class_type? ||
      ancestor.module_type?
  end
end

def dollar_tilde(sym)
  sym == :$~
end

It looks long. First, let us take a look at the last_match_used? method. Roughly speaking, this method does the following three things.

● Obtain the scope in which the global variable is valid.

● Obtain the position where the match method will be called next within the target scope.

● Check whether the global variable is used within the specified range.

Next, I will explain these three things in detail.

Obtain the scope for which global variables are valid

The scope of the global variables that store MatchData is somewhat unusual. Although their scope is local, they remain accessible after a block ends, unlike regular local variables assigned inside the block. The following description of Local Scope is quoted and translated from the Japanese Ruby 2.5.0 Reference Manual.

It has the same scope as a normal local variable. In other words, assignments made in the body of the class expression or in the method body do not affect its outside. It is the same as a normal local variable, except that it can be accessed without an assignment in any place in the program.

Let us look at an example.

def foo
  tap do
    /x/ =~ 'x'
    p $~ # => #<MatchData "x">
  end
  p $~ # => #<MatchData "x">
end
foo
p $~ # => nil

Looking at the output of running foo, you can see that $~ is not reset at the end of the block, but only at the end of the method.
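A complementary sketch shows the other side of this rule: a match performed inside a called method does not overwrite the caller’s $~. The method names helper and caller_method are made up for this illustration.

```ruby
def helper
  /y/ =~ "y"
  $~[0]                    # helper sees its own match
end

def caller_method
  /x/ =~ "x"
  helper_result = helper   # helper sets $~ only within its own method body
  [helper_result, $~[0]]   # our $~ still refers to the match against "x"
end

p caller_method  # => ["y", "x"]
```

This is why the Cop only needs to look for references inside the same method body when deciding whether MatchData is used.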

To handle this scoping, the Cop walks up the AST until it finds one of the following:

● Method definition

● Class definition

● Module definition

The scope_root method then treats the first such definition it finds as the scope of the global variable.

def scope_root(node)
  node.each_ancestor.find do |ancestor|
    ancestor.def_type? ||
      ancestor.class_type? ||
      ancestor.module_type?
  end
end
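The ancestor walk can be tried out without RuboCop using a stand-in node class. ToyNode, SCOPE_TYPES, and toy_scope_root below are hypothetical names for this sketch; real RuboCop nodes provide each_ancestor and the *_type? predicates themselves.

```ruby
# A stand-in for an AST node: it only knows its type and its parent.
class ToyNode
  attr_reader :type, :parent

  def initialize(type, parent = nil)
    @type = type
    @parent = parent
  end

  # Yields the parent, then the grandparent, and so on up to the root.
  def each_ancestor
    return enum_for(:each_ancestor) unless block_given?

    node = parent
    while node
      yield node
      node = node.parent
    end
  end
end

SCOPE_TYPES = %i[def class module].freeze

# Nearest enclosing def, class, or module; nil at the top level.
def toy_scope_root(node)
  node.each_ancestor.find { |ancestor| SCOPE_TYPES.include?(ancestor.type) }
end

class_node = ToyNode.new(:class)
def_node   = ToyNode.new(:def, class_node)
block_node = ToyNode.new(:block, def_node)
if_node    = ToyNode.new(:if, block_node)
match_call = ToyNode.new(:send, if_node)

p toy_scope_root(match_call).type        # => :def (the :if and :block are skipped)
p toy_scope_root(ToyNode.new(:send))     # => nil
```

The nil case corresponds to top-level code, for which last_match_used? falls back to match_node.ancestors.last as the body to search.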

Obtain the position where the match method will be called next within the target scope

Let us suppose we have the following code.

def foo
  if x.match(/re/)
    do_something
  end
  if x.match(/rerere/)
    do_something2($~)
  end
end

In this code, the first match call can be replaced with match?. The second match call, on the other hand, is followed by a reference to $~, so it cannot be replaced with match?.

For the Cop to behave correctly in such cases, we need to find the position of the next match call after the one being inspected, and check whether a global variable is referenced between the inspected match and that next match. Global variables that appear after the second or any later match call must be ignored. The next_match_pos method does this for us.

def next_match_pos(body, match_node_pos, scope_root)
  node = search_match_nodes(body).find do |match|
    match.loc.expression.begin_pos > match_node_pos &&
      scope_root(match) == scope_root
  end
  node ? node.loc.expression.begin_pos : Float::INFINITY
end

This method returns the position of the match method that is called after the inspected match method if there is any. Otherwise, it returns infinity for the sake of convenience.

For this implementation, a method called search_match_nodes is used, which is defined by a method called def_node_search.

MATCH_NODE_PATTERN = <<-PATTERN.freeze
  {
    #match_method?
    #match_operator?
    #match_with_lvasgn?
  }
PATTERN

def_node_search :search_match_nodes, MATCH_NODE_PATTERN

def_node_search is similar to def_node_matcher. A method defined with def_node_matcher checks whether the passed node itself matches the pattern. A method defined with def_node_search, on the other hand, returns the nodes contained in the passed node that match the pattern.

By using this method, next_match_pos is implemented.
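The contrast between the two kinds of generated methods can be sketched with a toy tree of nested arrays. This is only an analogy, and both function names are invented: toy_match_call? plays the role of a matcher, and toy_search_match_calls plays the role of a search.

```ruby
# Matcher-style: looks only at the node it is given.
def toy_match_call?(node)
  node.is_a?(Array) && node[0] == :send && node[2] == :match
end

# Search-style: visits the node and everything nested inside it,
# collecting every node the matcher accepts.
def toy_search_match_calls(node, found = [])
  return found unless node.is_a?(Array)

  found << node if toy_match_call?(node)
  node.each { |child| toy_search_match_calls(child, found) }
  found
end

# Rough shape of: if x.match(/re/) then do_something end
match_call = [:send, [:send, nil, :x], :match, [:regexp, [:str, "re"], [:regopt]]]
if_node    = [:if, match_call, [:send, nil, :do_something], nil]

p toy_match_call?(if_node)               # => false (the if node itself is not a match call)
p toy_search_match_calls(if_node).size   # => 1     (but one is nested inside it)
```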

Checking whether a global variable is used within a specified range

This is the last step. Using the scope and the range in which the global variable is valid, both obtained by the code explained so far, we check whether one of the corresponding global variables is used within that range.

def find_last_match(body, range, scope_root)
  last_matches(body).find do |ref|
    ref_pos = ref.loc.expression.begin_pos
    range.cover?(ref_pos) &&
      scope_root(ref) == scope_root
  end
end

The last_matches method used in find_last_match is defined with def_node_search, which was described earlier.

def_node_search :last_matches, <<-PATTERN
  {
    (send (const nil :Regexp) :last_match)
    (send (const nil :Regexp) :last_match _)
    ({back_ref nth_ref} _)
    (gvar #dollar_tilde)
  }
PATTERN

If you have read this post this far, I think you have grasped the meaning of this pattern somehow.
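To tie the steps together, here is a simplified sketch of the range check using plain integers instead of AST nodes. The function name and the offsets are made up, and the scope_root comparison is left out: the sketch assumes every position belongs to the same method body.

```ruby
# match_pos:           offset of the match call being inspected
# match_positions:     offsets of all match calls in the scope
# reference_positions: offsets of $~, $1, Regexp.last_match, etc.
def toy_last_match_used?(match_pos, match_positions, reference_positions)
  # Position of the next match call, or infinity if there is none.
  next_pos = match_positions.sort.find { |pos| pos > match_pos } || Float::INFINITY
  range = match_pos..next_pos

  # Only references between this match and the next one count.
  reference_positions.any? { |ref_pos| range.cover?(ref_pos) }
end

# Made-up offsets for the earlier foo example:
# first match at 13, second match at 50, the $~ reference at 85.
matches    = [13, 50]
references = [85]

p toy_last_match_used?(13, matches, references)  # => false (can become match?)
p toy_last_match_used?(50, matches, references)  # => true  (must stay match)
```

Using Float::INFINITY as the upper bound lets the final match call in a scope be handled by the same range check as every other one.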

That is all for the explanation of this Cop’s code. Some code, such as the implementation of autocorrect, is not covered in this article, so please look into the source if you are interested.


For more information about Sider, please go to our website.
