The greedyness of non-greedy regular expressions

24.7.2009
I repost some of my blog posts made @ liip. Please see here for the original post and comments: blog.liip.ch/archive/2009/07/24/the-greedyness-of-non-greedy-regular-expressions.html

We all love regular expressions, don't we? Well I usually do, but recently I lost quite a lot of time to find out this bit of particular behavior, so i thought i might share this.

Greedyness is basically the question whether an expression matches as much as it can, or as little as necessary. You find an excellent explanation and examples here. The default is to match as much as possible. The syntax *? , +? and ?? makes the quantifiers non-greedy. However, even the non-greedy expressions may match more than you expect.


Consider this String:


<a href="/link">Label</a> <a href="/en">[en]</a> <a href="/fr">[fr]</a>

And the regular expression


<a href=".*?">\[[a-z]*\]</a>

The ".*?" after the href=" tells the regexp engine to match non-greedy. If we omit the non-greedy order, the expression would generate one single match of the whole string. The non-greedy version generates two matches. But the first match is not the absolute minimum match on the input string:


<a href="/link">Label</a> <a href="/en">[en]</a>

The non-greedyness seems to result in the expression generating the maximum number of matches, but not necessarily the smallest possible match strings. A fixed regular expression to only match the language links would be this expression, which explicitely excludes the closing quotation mark from the match.


<a href="[^"]*">\[[a-z]*\]</a>


I played around with this nice web tool to test regular expressions while writing this post.



For comments, please go to the original post in the liip blog: blog.liip.ch/archive/2009/07/24/the-greedyness-of-non-greedy-regular-expressions.html

regexp php