Why Using the Greedy .* in Regular Expressions Is Almost Never What You Actually Want

Yesterday, I stumbled upon the StackOverflow question How to Extract Data Between Square Brackets Using Perl in which the asker wants to use regular expressions to parse out tuples of values wrapped in square brackets and separated by a comma:

This is the range of values (a1,b1) and [c1,d1].

In the above example, the expected match would be [c1,d1] with two capturing groups holding the values c1 and d1, respectively. One user answering the question suggested the use of .* in their pattern, which is pretty much never what you want. Here's why.

tl;dr:

  • Don't use .* unless you know what you're doing.
  • Use .*? instead or avoid the dot altogether.

The Dot: Matching (Almost) Arbitrary Characters

Let me say a few words about wildcard matching and greediness before we dive into the specific issue at hand. It's important that we're on the same page here to understand why .* can be really bad.

Outside of a character class in a regular expression, the dot (.) will match any character except a newline; within a character class, the dot is interpreted as a literal and matches the dot character. Most regular expression implementations let you specify a flag instructing the engine to also match newline characters with the dot. Oftentimes, the flag is abbreviated as s, and in .NET its name is RegexOptions.Singleline.

Greedy Matching: Gimme, Gimme, Gimme!

To specify the number of times a token should be matched by the regex engine, you can choose one of the following quantifiers:

  • ? — match the token zero times (not at all) or exactly once
  • * — match the token zero or more times
  • + — match the token one or more times
  • {m,n} — match the token between m and n (both including) times, where m and n are natural numbers and n ≥ m.

In general, the regex engine will try to match as many input characters as possible once it encounters a quantified token like \d+ or, in our case, .*. That behavior is called greedy matching because the engine will eagerly attempt to match anything it can.

The opposite of greedy matching is lazy matching, which will instruct the engine to match as few input characters as possible and then proceed to the next token in the regular expression pattern. Lazy quantifiers are denoted by appending a ? to the quantifier symbol, yielding the following lazy quantifiers:

  • ??
  • *?
  • +?
  • {m,n}?

Take the input abc123, for example. The pattern [a-z]+\d+ (using greedy quantifiers +) will match the entire string abc123, while the pattern [a-z]+?\d+? (using lazy quantifiers +?) will only match abc1. Although [a-z]+? tries to only match one letter, it'll reluctantly try to match more letters if required for the pattern to successfully match the input as a whole.

Backtracking and Input Matching

As you've seen, a greedy quantifier will try to match as much as it possibly can and only give back matched characters as needed. Every time the engine greedily consumes one more character (or repeated token in general), it has to remember that it made that choice. It will therefore persist its current state and store it so it can come back to it later in a process we call backtracking. When the regular expression engine backtracks, it performs another match attempt at a different position in the pattern.

Storing this backtracking position doesn't come for free, and neither does the actual backtracking process. Because of that it's desirable to minimize the amount of backtracking we're forcing the engine to do. While this isn't too much of a problem for successful matches in small inputs, this kind of optimization is even more relevant for large input strings.

Let's assume the singleline flag is set (so that the dot will match any character) and consider the following pattern proposed in the StackOverflow thread:

\[(.*),(.*)\]

Note that the opening and closing brackets needed to be escaped because they're special characters in a regular expression. With a preceding backslash, the regex engine treats them as literals rather than character class boundaries.

Here's how the pattern is matched against some input:

  • First, it tries to match an opening bracket: \[
  • After that, it tries to match (and save) "any amount of anything": (.*)
  • Now it tries to match the separator, a literal comma: ,
  • Again, it tries to match (and save) "any amount of anything": (.*)
  • Finally, it tries to match a closing bracket: \]

So far, so good — but where's the problem?

Bad Performance and Incorrect Matches

Once the regex engine encounters the first .*, it'll match every character until the end of the input because the star quantifier is greedy. However, the token following the "anything" is a comma, which means that the regex engine has to backtrack until its current position is in front of a comma. The same goes for the second .* and the closing bracket.

The .* pattern does one thing extremely well, and that's creating a huge amount of backtracking positions that need to be saved by the regex engine. That's why this kind of greedy matching behavior can lead to extremely poor performance when executed. Even worse, eagerly consuming that much input can result in undesired matches, as the following input shows:

Points: [x1,y1] and [x2,y2]

The values matched by the capturing groups of the above pattern are x1,y1] and [x2 and y2, which is most likely not what you wanted to match. Because there was no restriction, .* kept consuming input characters until the end and after that only gave up as many characters as needed to get a successful input match.

If you want to play around with this pattern a bit, feel free to use this regex fiddle.

Lazy Quantifiers to the Rescue

The problems caused by greedy matching can easily be solved by making all quantifiers lazy, which looks like the following:

\[(.*?),(.*?)\]

"Any amount of anything" (.*?) will then try to match as few characters as possible, attempting to match a comma (or a closing bracket, respectively) after each time.

Another solution — and the one proposed by me in the StackOverflow question — is to not use the dot at all, which minimizes the amount of required backtracking:

\[([^,\]]+),([^,\]]+)\]

After the opening bracket, this pattern tries to match as many characters that aren't , or ] as possible. It then tries to match the comma, does the same thing for the second parameter, and attempts to match a closing bracket. While this pattern is slightly harder to read, it's correct and more performant than its competitor.

If you want to increase performance even further, consider employing atomic grouping, which reduces the amount of backtracking information stored by the regex engine. Be careful, though, as atomic groups likely change the set of input strings your expression will match.

The next time you're about to use .*, please think about it carefully — chances are it won't match what you'd actually want it to.

Futher reading:

Learn ES6

24 Comments

digitalsushi

Well Marius, you taught me something about regex today. I didn't know a single thing about lazy match. You have bested me. I am humbled. Fantastic article. Thanks for taking some time to write this up, you're a good guy.

Joe

I think the regex classes should only ignore what they expect next. So the first one stops at a comma, the second one stops at a bracket, like this.

\[([^,]+),([^\]]+)\]
Joe

I'm on mobile and your comment submission has no feedback if this went through or not.

antimeme

'However, the token following the "anything" is a comma, which means that the regex engine has to backtrack until its current position is in front of a comma.'

Not quite! Read this and consider especially the Thompson NFA: Regular Expression Matching Can Be Simple And Fast

Backtracking isn't actually required to implement * features. Rather it's necessary for actually non-regular things like back references (such as "([a-z]*)\1"). Most so called "regular expression" libraries are actually more powerful than regular expressions and as a result can have terrible performance in some cases. But the Thompson NFA is predictable and delightfully simple. There's no reason a good library couldn't fall back to a strategy like this when possible.

Of course, your point about the "Points: [x1,y1] and [x2,y2]" case is a good one and I agree that .* is rarely the right tool in practice.

andrew

[(.?),(.?)]

When I saw that all I could think about are boobs.

All jokes aside, good job on this post!

Dave

I guess if you're extremely lazy and can guarantee a single occurrence and don't care about performance issues e.g. for a quick and dirty one liner, then the boob matching is suitable.

I agree that you can usually replace the dot with the character class you're actually trying to match, and use an alternative match quantity if you have prior knowledge of the match length. This should be common sense though - its regex 101.

noko

How about this?

\[([^,]+),([^\]]+)\]

Without checking for ] on the first group and without checking for , in the second one. Is it better to check for both always?

Marius Schulz

If you don't check for the closing bracket ] within the character class of the first capturing group, you'll match the first value even if it includes a closing bracket, e.g. as in [a1],b2] — that might or might not be what you want.

Chad

Death to Dot Star! was my introduction to the perils of greedy matching. (14 year old programming advice still holds up!) It's always better to try to create a negated character class, especially since it can make the intent of the regex much clearer.

Chris

Great post! Thank you for taking the time to write so clearly and succinctly.

bmm6o

One place that you run into this (and is IMHO a more natural example) is trying to parse attributes from html. If you run "[a-z]+='.*'" against the html element "<div id='theId' class='main'>" you might think the element has id "theId' class='main".

Alex Goretoy

Good read, thanks for sharing.

_kT_

Very informative. Thanks for sharing.

Tom

Thanks. I learnt something.

sshaw

I'd recommend installing the Regexp::Debugger. It comes bundled with a program called rxrx.

See it in action here: http://vimeo.com/67927155

James Rolan

I recommend using http://liveregex.com. An online regular expression tester that helps you to test and understand regex.

David

Thanks for the article. I tried James tool regex tester. It's good but sometimes lag. Maybe because of server processing. But the diagram explainer makes my life easier. Thank you,

G.Arun

This is really good article. Thanks fro writing this one :)

Prakhar Singh

Well, I learned something new. Thanks!

Derek

Regex engines can use non-deterministic finite state automata and other optimizations to improve performance. To say that greedy matching always results in backtracking is disingenuous, but there are performance costs. Specifically

However, there are more complicated patterns that break non-deterministic finite state models and give the regex engines a run for their money. The big culprits are weird stuff like optional lookaheads and lookbehinds.

I don't have more insight in this comment other than the performance of regexes is normally quite good except for some irregular patterns that can perform poorly on some inputs.

Here is one possible google query that could lead to more info about regex performance constraints: http://google.com/search?q=worst+performing+regex

Manu

This was a really helpful post
Thanks for sharing the knowledge
keep up the good work

cheers
– Manu

bbr

http://www.regexpal.com/ works like a charm in trying to figure out what works and what not. :)

Ron

[(.?),(.?)] I think this would do something really weird when the input didn't contain a closing square bracket. First it would look at the last comma and see there's no closing bracket after it. Oh.. but maybe there's a bracket after the second to last comma? So it would go backwards through the list of commas trying to find a bracket after each comma to the end of the string. Even though it would make more sense to search for the closing bracket first and then search for the comma in the substring.

[([^,]]+),([^,]]+)] Would do even funnier stuff. Let's assume the input didn't contain a comma nor a closing bracket. So the first capture group goes like "Hey! No commas at all, great! Let's just match the whole thing." And then the regexp says something like "Great job group 1! Now let's see if there's a comma after the end of the input. Oh, unfortunately not.. But hey, no problem at all. Group 1, could you please backtrack until we find a comma?"

In this simple case it doesn't even matter probably. But I had really bad problems with long input strings (also invalid input) and more complex patterns I could only solve with possessive quantifiers. There are many pitfalls when trying to find a way to find matches in long and probably invalid inputs sidestepping stack overflows and non-linear time complexity.

volosincu bogdan

Great article, your examples are very good.