
(Editor’s note: This post was originally published in Japanese on December 6, 2016, by Pocke, our Engineering Advisor, on his blog, and was translated and edited by the Sider Team.)
Hello, this is Pocke.
A Cop called Performance/RegexpMatch was added to RuboCop in December 2016.
Reference:
Add new Performance/RegexpMatch cop by pocke · Pull Request #3824 · bbatsov/rubocop
This Cop deals with the match? method, which was added in Ruby 2.4.
In this article, I will first describe the Ruby 2.4 feature that the Performance/RegexpMatch Cop deals with, and then give an overview of the Cop and walk through its implementation.
About the match? method
So what exactly is the match? method?
This method was added to the following classes: Regexp, String, and Symbol. Each of these classes already had a method without the ? mark, such as Regexp#match.
The new match? was added for performance.
Methods like Regexp#match generate MatchData objects. However, creating such an object is wasted work if we never use it.
if match = /re(gexp)/.match(foo)
do_something(match[1])
end
For example, the code above stores the result of Regexp#match in the match variable.
The value of the match variable will be a MatchData object (or nil) that holds the match result.
In this example, a sub-match of the regular expression is passed to the do_something method.
However, the following example does not use MatchData.
if /re(gexp)/.match(foo)
do_something
end
In this case, a boolean value would suffice as the result of match.
And that is what match? provides.
In Ruby 2.4, we can rewrite the above example using match? as follows.
if /re(gexp)/.match?(foo)
do_something
end
Reference: What’s new in Ruby 2.4?
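As a small illustration of the difference, the following sketch compares what the two methods return. The helper names first_group and matches? are made up for this example and do not come from Ruby or RuboCop.

```ruby
# Hypothetical helpers, named only for this illustration.

# Uses Regexp#match because the MatchData is actually needed.
def first_group(text)
  match = /re(gexp)/.match(text)
  match && match[1]
end

# Uses Regexp#match? (Ruby 2.4+) because only a yes/no answer is needed.
def matches?(text)
  /re(gexp)/.match?(text)
end

p first_group("regexp") # the captured group, or nil when there is no match
p matches?("regexp")    # true or false, never a MatchData
```

The second helper never builds a MatchData object, which is the saving that match? is meant to provide.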
About RuboCop
RuboCop is a Linter for Ruby.
For more information, please go to bbatsov/rubocop: A Ruby static code analyzer, based on the community Ruby style guide.
About Performance/RegexpMatch
As previously described, Performance/RegexpMatch is a Cop that deals with the match? method. In RuboCop terminology, a Cop is a rule. This Cop detects and corrects code that uses the match method unnecessarily, as described above.
Let’s see an example. If you analyze the code above with RuboCop, the following warning will be issued. Note that since this Cop only works with Ruby 2.4 or higher, you need to specify the Ruby version in .rubocop.yml.
# test.rb
if /re(gexp)/.match(foo)
do_something
end
# .rubocop.yml
AllCops:
  TargetRubyVersion: 2.4
$ rubocop --only Performance/RegexpMatch
Inspecting 1 file
C
Offenses:
test.rb:2:4: C: Use match? instead of match when MatchData is not used.
if /re(gexp)/.match(foo)
^^^^^^^^^^^^^^^^^^^^^
1 file inspected, 1 offense detected
As shown, you will be warned to use the match? method. If the code refers to MatchData, no warning is issued.
In addition, since this Cop supports auto-correction, you can automatically correct the code by running RuboCop with the -a option.
$ rubocop --only Performance/RegexpMatch -a
Inspecting 1 file
C
Offenses:
test.rb:2:4: C: [Corrected] Use match? instead of match when MatchData is not used.
if /re(gexp)/.match(foo)
^^^^^^^^^^^^^^^^^^^^^
1 file inspected, 1 offense detected, 1 offense corrected
$ git diff
diff --git a/test.rb b/test.rb
index 6d73a78..af34863 100644
--- a/test.rb
+++ b/test.rb
@@ -1,4 +1,4 @@
# test.rb
-if /re(gexp)/.match(foo)
+if /re(gexp)/.match?(foo)
do_something
end
Implementing Performance/RegexpMatch
Now, I would like to look at the implementation of this Cop. A naive implementation would only need to check in on_send whether the method name is match, but this Cop has some extra features to prevent false positives.
If you are unfamiliar with how RuboCop is implemented, reading up on RuboCop’s implementation first will make the following sections easier to follow, so please take a look.
I will describe the source code as it was when I wrote the article. Please note that some of it may have been rewritten by later changes.
● The commit: Add new Performance/RegexpMatch cop · bbatsov/rubocop@864531f
● Main reference source: rubocop/regexp_match.rb at 864531f61634354570a6b4458cb599c4373659b7 · bbatsov/rubocop
Entry point
First of all, let’s look at the entry point where this Cop is executed. In this Cop, there are two entry points: on_if and on_case.
def on_if(node)
return if target_ruby_version < 2.4
cond, = *node
check_condition(cond)
end
def on_case(node)
return if target_ruby_version < 2.4
case_cond, = *node
return if case_cond
when_clauses(node).each do |when_node|
cond, = *when_node
check_condition(cond)
end
end
You can see that it passes the condition part of if and case expressions to RegexpMatch#check_condition.
That means code such as x = /re/.match("foo") is never a target in the first place.
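To make the entry points concrete, here is a sketch of which forms reach check_condition, judging only from the two handlers shown above. The method names are hypothetical and exist only for this illustration.

```ruby
# Hypothetical methods, named only for this illustration.

# Reached via on_if: the match call is the condition of an `if`.
def via_if(text)
  if /re/.match(text)
    :matched
  end
end

# Reached via on_case: a `case` without a subject,
# with the match call as the condition of a `when` clause.
def via_case(text)
  case
  when /re/.match(text) then :matched
  end
end

# Not reached: the match call is the right-hand side of an assignment,
# not a condition, so neither handler passes it to check_condition.
def via_assignment(text)
  x = /re/.match(text)
  x
end
```

Note that on_case returns early when the case expression has a subject (case_cond is present), so only the subject-less form is inspected.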
check_condition
Next, let us see the implementation of the check_condition method.
def check_condition(cond)
match_node?(cond) do
return if last_match_used?(cond)
add_offense(cond, :expression,
format(MSG, cond.loc.selector.source))
end
end
If cond satisfies the following two conditions, the add_offense method is called.
● The result of match_node? is true
● The result of last_match_used?(cond) is false
Note that add_offense is a method that registers an offense on the target node (in RuboCop terms, an offense is a warning about the code). In other words, calling this method is the whole goal of a Cop.
Let’s look at these conditions one by one.
match_node?
def_node_matcher :match_method?, <<-PATTERN
{
(send _recv :match _)
(send _recv :match _ (:int ...))
}
PATTERN
def_node_matcher :match_operator?, <<-PATTERN
(send !nil :=~ !nil)
PATTERN
def_node_matcher :match_with_lvasgn?, <<-PATTERN
(match_with_lvasgn !nil !nil)
PATTERN
MATCH_NODE_PATTERN = <<-PATTERN.freeze
{
#match_method?
#match_operator?
#match_with_lvasgn?
}
PATTERN
def_node_matcher :match_node?, MATCH_NODE_PATTERN
It’s a bit long, but the definition of the match_node? method is at the bottom of this code.
The match_node? method is defined using def_node_matcher.
The content of the MATCH_NODE_PATTERN constant passed to def_node_matcher is as follows.
{
#match_method?
#match_operator?
#match_with_lvasgn?
}
For those who are not familiar with NodePattern, the {} pattern represents “or”: if a node matches any one of the patterns inside it, the entire pattern matches.
Also, the pattern #… is a method call. If the corresponding method returns true when the node is passed to it, the node matches the pattern.
(Those who would like to know more about NodePattern should read rubocop/node_pattern.rb at master · bbatsov/rubocop.)
Now, looking at this matcher definition with the above in mind, you will see that match_node? returns true when any one of the following returns true:
match_method?
match_operator?
match_with_lvasgn?
In addition, though I am not going to explain the details in this post, these three matchers return true in the following cases.
1. match_method?
foo.match(/re/)
foo.match(/re/, 1)
2. match_operator?
foo =~ /re/
re =~ "foo"
3. match_with_lvasgn?
/re/ =~ foo
In other words, true is returned when the match method or the =~ operator is called.
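The “or” behavior of {} combined with #method calls can be imitated in plain Ruby. The following is only a toy sketch: ToyNode and the toy_* methods are invented for this illustration, are much simpler than the real patterns, and are not part of RuboCop.

```ruby
# Toy stand-in for an AST node: a type plus a list of children.
# This is not RuboCop's Node class.
ToyNode = Struct.new(:type, :children)

# Roughly (send _recv :match _): a call to `match` with one argument.
def toy_match_method?(node)
  node.type == :send && node.children[1] == :match && node.children.size == 3
end

# Roughly (send !nil :=~ !nil): a `=~` call with a receiver and an argument.
def toy_match_operator?(node)
  node.type == :send && node.children[1] == :=~ &&
    !node.children[0].nil? && !node.children[2].nil?
end

# Roughly (match_with_lvasgn !nil !nil).
def toy_match_with_lvasgn?(node)
  node.type == :match_with_lvasgn && node.children.none?(&:nil?)
end

TOY_PREDICATES = %i[
  toy_match_method? toy_match_operator? toy_match_with_lvasgn?
].freeze

# The {} union: the node matches if any one of the predicates accepts it.
def toy_match_node?(node)
  TOY_PREDICATES.any? { |name| send(name, node) }
end

call  = ToyNode.new(:send, [:foo, :match, :regexp])
other = ToyNode.new(:send, [:foo, :upcase])
p toy_match_node?(call)  # true: accepted by toy_match_method?
p toy_match_node?(other) # false: no predicate accepts it
```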
With only the code we have seen so far, the Cop can already issue warnings for the earlier example.
# - Inside an if condition => `on_if` applies
# - There is a call to the `match` method => `match_method?` matches
if /re(gexp)/.match(foo)
do_something
end
But this Cop has one more condition, last_match_used?. What does it do?
In the next section, we will look at how this method works.
last_match_used?
All of the following snippets behave the same way.
# part1
if match = /re(gexp)/.match(foo)
do_something(match[1])
end
# part2
if /re(gexp)/.match(foo)
do_something(Regexp.last_match[1])
end
# part3
if /re(gexp)/.match(foo)
do_something($~[1])
end
# part4
if /re(gexp)/.match(foo)
do_something($1)
end
Believe it or not, you can use the result of the match method without explicitly assigning it to a variable!
The last_match_used? method checks whether MatchData is referenced via these global variables even when it is not explicitly assigned to a variable, as shown above.
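This check matters because match sets these variables as a side effect, while match? does not. A minimal sketch of the difference follows; the two method names are hypothetical.

```ruby
# Hypothetical methods, named only for this illustration.

# Regexp#match sets the last-match variables as a side effect.
def groups_after_match_call(text)
  /re(gexp)/.match(text)
  [Regexp.last_match(1), $~ && $~[1], $1]
end

# Regexp#match? does not update them. Since $~ starts out as nil
# in a fresh method call, all three values stay nil here.
def groups_after_predicate_call(text)
  /re(gexp)/.match?(text)
  [Regexp.last_match(1), $~ && $~[1], $1]
end

p groups_after_match_call("regexp")     # ["gexp", "gexp", "gexp"]
p groups_after_predicate_call("regexp") # [nil, nil, nil]
```

So rewriting match to match? in code that later reads $~, $1, or Regexp.last_match would silently change its behavior.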
Let’s take a look at the code in detail.
- https://github.com/bbatsov/rubocop/blob/864531f61634354570a6b4458cb599c4373659b7/lib/rubocop/cop/performance/regexp_match.rb#L81-L88
- https://github.com/bbatsov/rubocop/blob/864531f61634354570a6b4458cb599c4373659b7/lib/rubocop/cop/performance/regexp_match.rb#L132-L169
def_node_search :search_match_nodes, MATCH_NODE_PATTERN
def_node_search :last_matches, <<-PATTERN
{
(send (const nil :Regexp) :last_match)
(send (const nil :Regexp) :last_match _)
({back_ref nth_ref} _)
(gvar #dollar_tilde)
}
PATTERN
# The middle part is omitted.
def last_match_used?(match_node)
scope_root = scope_root(match_node)
body = scope_root ? scope_body(scope_root) : match_node.ancestors.last
match_node_pos = match_node.loc.expression.begin_pos
next_match_pos = next_match_pos(body, match_node_pos, scope_root)
range = match_node_pos..next_match_pos
find_last_match(body, range, scope_root)
end
def next_match_pos(body, match_node_pos, scope_root)
node = search_match_nodes(body).find do |match|
match.loc.expression.begin_pos > match_node_pos &&
scope_root(match) == scope_root
end
node ? node.loc.expression.begin_pos : Float::INFINITY
end
def find_last_match(body, range, scope_root)
last_matches(body).find do |ref|
ref_pos = ref.loc.expression.begin_pos
range.cover?(ref_pos) &&
scope_root(ref) == scope_root
end
end
def scope_body(node)
node.children[2]
end
def scope_root(node)
node.each_ancestor.find do |ancestor|
ancestor.def_type? ||
ancestor.class_type? ||
ancestor.module_type?
end
end
def dollar_tilde(sym)
sym == :$~
end
It looks long, so let us start with the last_match_used? method. Roughly speaking, this method does the following three things.
● Obtain the scope in which the global variable is valid.
● Obtain the position where the match method is next called within that scope.
● Check whether the global variable is used within that range.
Next, I will explain these three things in detail.
Obtain the scope for which global variables are valid
The scope of the global variables that store MatchData behaves in a somewhat unique way. Although these variables are locally scoped, a value set inside a block remains accessible after the block ends, unlike a regular local variable defined inside the block. The following description of local scope is quoted and translated from the Ruby 2.5.0 Reference Manual, which is written in Japanese.
It has the same scope as a normal local variable. In other words, assignments made in the body of the class expression or in the method body do not affect its outside. It is the same as a normal local variable, except that it can be accessed without an assignment in any place in the program.
Let us look at an example.
def foo
tap do
/x/ =~ 'x'
p $~ # => #<MatchData "x">
end
p $~ # => #<MatchData "x">
end
foo
p $~ # => nil
Looking at the result of running the foo method, you can see that $~ is not reset at the end of the block, but only at the end of the method.
To support this scoping, the Cop walks up the AST until it finds one of the following:
● Method definition
● Class definition
● Module definition
The scope_root method then treats the first such definition it finds as the scope of the global variable.
def scope_root(node)
node.each_ancestor.find do |ancestor|
ancestor.def_type? ||
ancestor.class_type? ||
ancestor.module_type?
end
end
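The upward walk can be imitated with a toy structure. In the sketch below, ToyScopeNode, TOY_SCOPE_TYPES, and toy_scope_root are invented names; the each_ancestor method is a simplified stand-in written for this example, not RuboCop’s own implementation.

```ruby
# Toy node that only knows its type and its parent.
# This is not RuboCop's Node class.
ToyScopeNode = Struct.new(:type, :parent) do
  # Yields the parent, then the grandparent, and so on up to the root.
  def each_ancestor
    return enum_for(:each_ancestor) unless block_given?

    node = parent
    while node
      yield node
      node = node.parent
    end
  end
end

TOY_SCOPE_TYPES = %i[def class module].freeze

# Returns the nearest enclosing def/class/module, or nil at the top level.
def toy_scope_root(node)
  node.each_ancestor.find { |ancestor| TOY_SCOPE_TYPES.include?(ancestor.type) }
end

klass  = ToyScopeNode.new(:class, nil)
meth   = ToyScopeNode.new(:def, klass)
block  = ToyScopeNode.new(:block, meth)
inside = ToyScopeNode.new(:send, block)

p toy_scope_root(inside).type # :def, because the block is skipped
```

The block is skipped on purpose: as the foo example shows, a block does not start a new scope for $~, while a method definition does.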
Obtain the position where the match method will be called next within the target scope
Let us suppose we have the following code.
def foo
if x.match(/re/)
do_something
end
if x.match(/rerere/)
do_something2($~)
end
end
In this code, the first match method call can be replaced with the match? method. On the other hand, the second match call is followed by a reference to $~, so it cannot be replaced with match?.
For the Cop to behave correctly in such cases, we need to find the position of the next match call after the match call being inspected, and check whether a global variable is referenced between the inspected match and the next one. Global variable references that appear after the next match call must be ignored. The next_match_pos method does this for us.
def next_match_pos(body, match_node_pos, scope_root)
node = search_match_nodes(body).find do |match|
match.loc.expression.begin_pos > match_node_pos &&
scope_root(match) == scope_root
end
node ? node.loc.expression.begin_pos : Float::INFINITY
end
This method returns the position of the next match call after the inspected match call, if there is one. Otherwise, it returns infinity for the sake of convenience.
This implementation uses a method called search_match_nodes, which is defined with def_node_search.
MATCH_NODE_PATTERN = <<-PATTERN.freeze
{
#match_method?
#match_operator?
#match_with_lvasgn?
}
PATTERN
def_node_search :search_match_nodes, MATCH_NODE_PATTERN
def_node_search is similar to def_node_matcher. A method defined using def_node_matcher checks whether the passed node itself matches the pattern. On the other hand, a method defined using def_node_search returns a list of the nodes within the passed node that match the pattern.
next_match_pos is implemented using this method.
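The matcher/search distinction can be sketched with a toy tree. ToySearchNode, toy_match?, and toy_search are invented for this illustration; in particular, having the search visit the passed node as well as its descendants is an assumption of this sketch.

```ruby
# Toy node: a type plus child nodes. This is not RuboCop's Node class.
ToySearchNode = Struct.new(:type, :children)

# Matcher style: answers a yes/no question about the given node only.
def toy_match?(node)
  node.type == :match
end

# Search style: walks the given node and everything below it,
# collecting each node for which the matcher answers yes.
def toy_search(node, found = [])
  found << node if toy_match?(node)
  node.children.each { |child| toy_search(child, found) }
  found
end

leaf_a = ToySearchNode.new(:match, [])
leaf_b = ToySearchNode.new(:match, [])
other  = ToySearchNode.new(:other, [leaf_b])
root   = ToySearchNode.new(:body, [leaf_a, other])

p toy_match?(root)      # false: the root itself is not a match node
p toy_search(root).size # 2: both match nodes below the root are found
```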
Checking whether a global variable is used within a specified range
This is the last step. Using the scope and the valid range of the global variable obtained by the code explained so far, we check whether the corresponding global variable is used within that range.
def find_last_match(body, range, scope_root)
last_matches(body).find do |ref|
ref_pos = ref.loc.expression.begin_pos
range.cover?(ref_pos) &&
scope_root(ref) == scope_root
end
end
In addition, the last_matches method used in find_last_match is defined with def_node_search, described earlier.
def_node_search :last_matches, <<-PATTERN
{
(send (const nil :Regexp) :last_match)
(send (const nil :Regexp) :last_match _)
({back_ref nth_ref} _)
(gvar #dollar_tilde)
}
PATTERN
If you have read this far, I think you will be able to grasp the meaning of this pattern.
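To tie the steps together, here is a toy version of the range check that works on plain integer positions instead of AST nodes. toy_last_match_used? and the positions are made up for this sketch, and it leaves out the scope_root comparison that the real code performs.

```ruby
# Toy version of the range check, using plain integer source positions.
# Assumes all_match_positions is sorted in ascending order.
def toy_last_match_used?(match_pos, all_match_positions, reference_positions)
  # Position of the next match call, or infinity if there is none.
  next_pos = all_match_positions.find { |pos| pos > match_pos } || Float::INFINITY
  range = match_pos..next_pos
  # Is any reference to the last match located inside that range?
  reference_positions.any? { |pos| range.cover?(pos) }
end

# Made-up positions: two match calls, and one reference to $~ after the second.
match_positions = [10, 50]
reference_positions = [80]

p toy_last_match_used?(10, match_positions, reference_positions) # false
p toy_last_match_used?(50, match_positions, reference_positions) # true
```

This mirrors the earlier two-if example: the first match call has no reference before the next match call, so it can be replaced, while the second one cannot.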
That is all for the explanation of this Cop’s code. There is some code, such as the implementation of auto-correction, that I didn’t cover in this article, so please look into the source if you are interested.
For more information about Sider, please go to our website.