Practical Use Cases for the RegexOptions Flags

April 10, 2014

Working with regular expressions in .NET is centered around the Regex class. Its most important methods are:

IsMatch
Match
Matches
Replace
Split

These methods are defined as both instance and static methods on the Regex class, allowing you to use them in two ways:

// Instance method
new Regex(@"\d+").IsMatch("12345") // True

// Static method
Regex.IsMatch("12345", @"\d+") // True

Note the order of parameters in the static method: First comes the input, then the pattern. This has bitten me more than once.

The static overloads of all the methods listed above accept a RegexOptions value that tells the regex engine how to interpret the pattern and perform the matching. If you work with a Regex instance instead, you pass the options to its Regex(String, RegexOptions) constructor.
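Here's a minimal sketch of the two places where options can go. The class and method names (OptionsPlacementDemo, MatchesViaInstance, MatchesViaStatic) are made up for illustration; only the Regex members are part of .NET.

```csharp
using System.Text.RegularExpressions;

public static class OptionsPlacementDemo
{
    // Instance usage: the options are fixed once, when the Regex is constructed.
    private static readonly Regex Greeting =
        new Regex("^hello", RegexOptions.IgnoreCase);

    public static bool MatchesViaInstance(string input)
    {
        return Greeting.IsMatch(input);
    }

    // Static usage: the options are passed along with every call.
    public static bool MatchesViaStatic(string input)
    {
        return Regex.IsMatch(input, "^hello", RegexOptions.IgnoreCase);
    }
}
```

Both methods should give the same answer for the same input, for instance true for "Hello, world" and false for "Say hello".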

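The following options are defined in the RegexOptions enumeration:

Compiled
CultureInvariant
ECMAScript
ExplicitCapture
IgnoreCase
IgnorePatternWhitespace
Multiline
None
RightToLeft
Singleline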

Because the enumeration is decorated with [Flags], you can combine any of the above options using the | operator:

var options = RegexOptions.IgnoreCase
    | RegexOptions.CultureInvariant
    | RegexOptions.ExplicitCapture;
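Since the flags are plain bit values, the usual bitwise operators also let you test for a flag or remove one again. The following is a small sketch; FlagCombinationDemo and its methods are hypothetical helpers, not part of the framework.

```csharp
using System.Text.RegularExpressions;

public static class FlagCombinationDemo
{
    // Adds a flag to an existing combination using bitwise OR.
    public static RegexOptions Add(RegexOptions options, RegexOptions flag)
    {
        return options | flag;
    }

    // Removes a flag by AND-ing with its bitwise complement.
    public static RegexOptions Remove(RegexOptions options, RegexOptions flag)
    {
        return options & ~flag;
    }

    // Checks whether every bit of the given flag is set in the combination.
    public static bool Contains(RegexOptions options, RegexOptions flag)
    {
        return (options & flag) == flag;
    }
}
```

A constructed Regex also exposes the options it was created with through its Options property, so a helper like Contains can be applied to regex.Options as well.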

In this post, I want to highlight a use case for each of the RegexOptions values. For a concise summary of all options, please refer to the Regular Expression Options article in the Microsoft docs.

#RegexOptions.Compiled

By default, the regex engine of .NET interprets regular expressions. It can also compile a regular expression to MSIL for increased matching performance, which is what the RegexOptions.Compiled flag specifies:

private static readonly Regex _digitsOnly =
    new Regex(@"^\d+$", RegexOptions.Compiled);

While a compiled regular expression executes slightly faster, it takes significantly more time to build. We're talking about orders of magnitude here! Compiling a regular expression will therefore only be advantageous if it's used repeatedly, e.g. in a loop or over the application's lifespan.

A good example of when it makes sense to compile a regular expression is its use in a component that's called repeatedly, such as Jeff Atwood's MarkdownSharp: It makes heavy use of regular expressions which are initialized once and stored in a static field to be reused over and over again.
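If you want to see the trade-off on your own machine, a rough timing sketch along these lines can help. It is not a proper benchmark: CompiledTimingSketch and its methods are hypothetical names, the results depend heavily on the runtime version, and the first call in a process also pays JIT costs.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class CompiledTimingSketch
{
    // Times the construction of a Regex plus its first match, which is
    // where any one-time compilation cost would show up.
    public static TimeSpan TimeConstruction(string pattern, RegexOptions options)
    {
        var stopwatch = Stopwatch.StartNew();
        var regex = new Regex(pattern, options);
        regex.IsMatch("12345");
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Times repeated matching with an already constructed Regex.
    public static TimeSpan TimeMatching(Regex regex, string input, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.IsMatch(input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Calling TimeConstruction once with RegexOptions.None and once with RegexOptions.Compiled, and then TimeMatching with each resulting kind of Regex over a large iteration count, gives a feel for when the compilation cost pays off.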

#RegexOptions.CultureInvariant

When you specify RegexOptions.IgnoreCase, the regex engine has to somehow compare uppercase and lowercase characters. By default, it uses the current culture (Thread.CurrentThread.CurrentCulture) when doing string comparisons. You'll see in a second why this can lead to unexpected results. Take this short code snippet, for example:

// Requires the System.Globalization and System.Threading namespaces
Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

string inputFilePath = "FILE://C:/sample_file.txt";
string filePathPattern = "^file://";

We're using the Turkish culture and defining a file path and our regular expression pattern. If we now try to match the inputFilePath variable against the pattern, the result will be false:

// False
Regex.IsMatch(inputFilePath, filePathPattern, RegexOptions.IgnoreCase)

This is because in the Turkish language, 'i' is not the lowercase equivalent of 'I', which is why the comparison fails despite the case-insensitive comparison specified by RegexOptions.IgnoreCase. Using RegexOptions.CultureInvariant will yield a match:

// True
Regex.IsMatch(inputFilePath, filePathPattern,
    RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)

Conclusion: If you're matching written text against a pattern that contains written text itself and you have no control over the culture your code is run in, consider the RegexOptions.CultureInvariant option.
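To make the casing rule behind this more tangible, here's a small sketch. TurkishCasingDemo, LowerWith, and IsFileUri are hypothetical names, and the casing result assumes that the runtime has Turkish culture data available (it won't in invariant-globalization mode).

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class TurkishCasingDemo
{
    // Lowercases text using the casing rules of the named culture.
    // With "tr-TR", an uppercase "I" is expected to become the dotless
    // "ı" (U+0131) rather than "i", assuming culture data is available.
    public static string LowerWith(string text, string cultureName)
    {
        return text.ToLower(new CultureInfo(cultureName));
    }

    // Case-insensitive scheme check that doesn't depend on the
    // current thread's culture.
    public static bool IsFileUri(string input)
    {
        return Regex.IsMatch(input, "^file://",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

IsFileUri should return true for "FILE://C:/sample_file.txt" no matter which culture the calling thread uses.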

#RegexOptions.ECMAScript

The .NET regex engine uses its own flavor and provides additions that aren't supported in other engines, such as the ECMAScript regex engine. By using the RegexOptions.ECMAScript flag, you can configure the .NET regex engine to be ECMAScript-compliant and match accordingly. This is especially useful if you're sharing the same regular expression between JavaScript and ASP.NET, e.g. for validation purposes. It lets you make sure the pattern is interpreted the same way on the server and the client.

Some RegexOptions flags can't be combined with RegexOptions.ECMAScript because they aren't defined in ECMAScript's regex engine. Combining any of the following flags with it throws an ArgumentOutOfRangeException:

RegexOptions.ExplicitCapture
RegexOptions.IgnorePatternWhitespace
RegexOptions.RightToLeft
RegexOptions.Singleline
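One difference that's easy to observe is the \d character class: by default it matches any Unicode decimal digit, whereas in ECMAScript mode it is equivalent to [0-9]. The sketch below shows that, along with a check for invalid flag combinations. EcmaScriptDemo and its methods are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaScriptDemo
{
    // Returns whether the input is exactly one digit under the given options.
    public static bool IsSingleDigit(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, @"^\d$", options);
    }

    // Returns false if the Regex constructor rejects the option combination.
    public static bool IsValidCombination(RegexOptions options)
    {
        try
        {
            new Regex("a", options);
            return true;
        }
        catch (ArgumentOutOfRangeException)
        {
            return false;
        }
    }
}
```

With the Arabic-Indic digit "\u0663" as input, IsSingleDigit returns true for RegexOptions.None and false for RegexOptions.ECMAScript. IsValidCombination returns false for RegexOptions.ECMAScript | RegexOptions.Singleline, but true for RegexOptions.ECMAScript | RegexOptions.IgnoreCase.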

#RegexOptions.ExplicitCapture

Grouping parts of a regular expression using parentheses — ( and ) — tells the regex engine to store the value of the matched subexpression so that it can be accessed later. If you don't ever do anything with the matched value, though, saving it is unnecessary overhead. This is why there's the concept of non-capturing groups which group a subexpression of a regex, but don't store its value for later reference.

Non-capturing groups start with (?: and end with ):

var matches = Regex.Matches(
    "Possible colors include darkblue and lightgreen.",
    "(?:dark|light)(?:blue|red|green)"
);

When your pattern contains lots of non-capturing groups, maybe even nested ones, its readability likely gets worse: The pattern gets longer and if you're not paying attention, you might mistake the ? for the optional quantifier. RegexOptions.ExplicitCapture turns all capturing groups that aren't explicitly named (see Named Matched Subexpressions) into non-capturing groups and thus allows for a cleaner syntax with less noise:

var matches = Regex.Matches(
    "Possible colors include darkblue and lightgreen.",
    "(dark|light)(blue|red|green)",
    RegexOptions.ExplicitCapture
);
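You can verify the effect by counting the groups of a match. In the following sketch, ExplicitCaptureDemo, CountGroups, and the group name shade are made up for illustration. Keep in mind that Groups always contains the entire match as group 0.

```csharp
using System.Text.RegularExpressions;

public static class ExplicitCaptureDemo
{
    // Matches "darkblue" against the pattern and returns the number of
    // groups, including group 0 (the entire match).
    public static int CountGroups(string pattern, RegexOptions options)
    {
        Match match = Regex.Match("darkblue", pattern, options);
        return match.Groups.Count;
    }

    // Named groups still capture when ExplicitCapture is set.
    public static string CaptureShade()
    {
        Match match = Regex.Match(
            "darkblue",
            "(?<shade>dark|light)(blue|red|green)",
            RegexOptions.ExplicitCapture);
        return match.Groups["shade"].Value;
    }
}
```

For the pattern "(dark|light)(blue|red|green)", CountGroups returns 3 with RegexOptions.None and 1 with RegexOptions.ExplicitCapture, while CaptureShade returns "dark".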

#RegexOptions.IgnoreCase

By default, regular expressions are matched against strings case-sensitively:

Regex.IsMatch("abc", "abc") // True
Regex.IsMatch("ABC", "abc") // False

If you specify RegexOptions.IgnoreCase, both input strings (abc and ABC) will be matched by the pattern abc:

Regex.IsMatch("abc", "abc", RegexOptions.IgnoreCase) // True
Regex.IsMatch("ABC", "abc", RegexOptions.IgnoreCase) // True

It's especially handy to use the RegexOptions.IgnoreCase flag when using character classes: [a-zA-Z] can then be shortened to [a-z]. If you need to do a case-insensitive match, specifying this flag helps you write clearer, shorter, and more readable patterns.

Be careful, though, with the behavior of different cultures. If you don't know ahead of time which culture your code will be run under, consider using the IgnoreCase flag in combination with CultureInvariant.
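As a quick sketch of the shortened character class, here's a hypothetical helper (IgnoreCaseDemo.IsLettersOnly) that takes the options as a parameter so both behaviors can be compared:

```csharp
using System.Text.RegularExpressions;

public static class IgnoreCaseDemo
{
    // Checks whether the input consists only of the letters a to z.
    // Whether uppercase letters are accepted depends on the options.
    public static bool IsLettersOnly(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, "^[a-z]+$", options);
    }
}
```

IsLettersOnly("Hello", RegexOptions.None) returns false because of the uppercase H, while passing RegexOptions.IgnoreCase | RegexOptions.CultureInvariant returns true.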

#RegexOptions.IgnorePatternWhitespace

Whitespace characters in a regular expression pattern are treated as whitespace literals by default: If there's a space in the pattern, the engine will attempt to match a space in the input string. You have significant whitespace, if you will.

The RegexOptions.IgnorePatternWhitespace option allows you to structure your pattern using insignificant whitespace as you like. You can even write your pattern across separate lines, which works perfectly together with C#'s verbatim strings:

const string identifierPattern = @"
    ^                # Identifiers start ...
    [a-zA-Z_]        # ... with a letter or an underscore.
    [a-zA-Z_0-9]*    # Possibly some alphanumeric characters ...
    $                # ... and nothing after those.
";

var identifierRegex = new Regex(identifierPattern,
    RegexOptions.IgnorePatternWhitespace);

bool validIdentifier = identifierRegex.IsMatch("_emailAddress"); // True

As the above example shows, you can also include comments: Everything after the # symbol until the end of the line will be treated as a comment. When it comes to improving a pattern's readability, RegexOptions.IgnorePatternWhitespace will likely make the most notable difference. For a real-world example, take a look at a couple of regex patterns in MarkdownSharp that benefit from RegexOptions.IgnorePatternWhitespace.
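One thing to watch out for: once whitespace is insignificant, a space you actually want to match has to be written differently. Whitespace inside a character class is always taken literally, so [ ] is one way to do it. The following sketch uses hypothetical names (LiteralSpaceDemo, IsItemCount):

```csharp
using System.Text.RegularExpressions;

public static class LiteralSpaceDemo
{
    private const string Pattern = @"
        ^
        \d+      # One or more digits ...
        [ ]      # ... followed by a literal space ...
        items    # ... and the word items.
        $
    ";

    private static readonly Regex ItemCount =
        new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

    // Returns true for inputs such as "3 items".
    public static bool IsItemCount(string input)
    {
        return ItemCount.IsMatch(input);
    }
}
```

IsItemCount("3 items") returns true, whereas IsItemCount("3items") returns false because the space in the character class is still required. Similarly, a literal # has to be escaped as \# so that it doesn't start a comment.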

#RegexOptions.Multiline

The RegexOptions.Multiline flag changes the meaning of the special characters ^ and $. Usually, they match at the beginning (^) and the end ($) of the entire string. With RegexOptions.Multiline applied, they match at the beginning or end of any line of the input string.

Here's how you could use RegexOptions.Multiline to check whether some multi-line string (e.g. from a text file) contains a line that only consists of digits:

Regex.IsMatch("abc\n123", @"^\d+$") // False
Regex.IsMatch("abc\n123", @"^\d+$", RegexOptions.Multiline) // True
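Going one step further, you can combine the flag with Regex.Matches to extract every such line. This is only a sketch; MultilineDemo and DigitOnlyLines are hypothetical names, and the input is assumed to use \n line endings.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Returns all lines of the text that consist only of digits.
    // Assumes "\n" line endings: $ matches before "\n" but not before "\r",
    // so for "\r\n" input the pattern would need something like \r?$ instead.
    public static List<string> DigitOnlyLines(string text)
    {
        var lines = new List<string>();
        foreach (Match match in Regex.Matches(text, @"^\d+$", RegexOptions.Multiline))
        {
            lines.Add(match.Value);
        }
        return lines;
    }
}
```

For the input "abc\n123\nxyz\n456", DigitOnlyLines returns the two entries "123" and "456".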

#RegexOptions.None

RegexOptions.None is the simplest option: It instructs the regular expression engine to use its default behavior without any modifications applied.
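Where I find RegexOptions.None most useful is as a starting value when the options are assembled conditionally. A minimal sketch, with OptionsBuilderDemo, BuildOptions, and its parameters being hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class OptionsBuilderDemo
{
    // Starts from None and adds flags depending on the caller's settings.
    public static RegexOptions BuildOptions(bool ignoreCase, bool multiline)
    {
        RegexOptions options = RegexOptions.None;

        if (ignoreCase)
        {
            options |= RegexOptions.IgnoreCase;
        }

        if (multiline)
        {
            options |= RegexOptions.Multiline;
        }

        return options;
    }
}
```

BuildOptions(false, false) returns RegexOptions.None, which can be passed to any overload that expects a RegexOptions value.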

#RegexOptions.RightToLeft

The regular expression engine searches the input string from left to right, or from first to last, if you will. Specifying RegexOptions.RightToLeft changes that behavior so that strings are searched from right to left, or from last to first.

Note that the RegexOptions.RightToLeft flag does not change the way the pattern is interpreted: It will still be read from left to right (first to last). The option only changes the direction of the engine walking over the input string. Therefore, all regex constructs – including lookaheads, lookbehinds, and anchors – function identically.

Using RegexOptions.RightToLeft might result in increased performance if you're looking for a single match that you expect to find at the very end of the string, in which case you'll probably find it faster this way.
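Here's a short sketch of that use case: finding the last number in a string. RightToLeftDemo and its methods are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    // Returns the last run of digits in the input, or an empty string
    // if the input contains no digits.
    public static string LastNumber(string input)
    {
        return Regex.Match(input, @"\d+", RegexOptions.RightToLeft).Value;
    }

    // Returns all runs of digits in the order the engine finds them,
    // which is from the end of the input towards the beginning.
    public static List<string> NumbersFromTheEnd(string input)
    {
        var numbers = new List<string>();
        foreach (Match match in Regex.Matches(input, @"\d+", RegexOptions.RightToLeft))
        {
            numbers.Add(match.Value);
        }
        return numbers;
    }
}
```

For the input "a1b22c333", LastNumber returns "333" and NumbersFromTheEnd returns "333", "22", and "1", in that order.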

#RegexOptions.Singleline

Finally, RegexOptions.Singleline changes the meaning of the dot (.), which by default matches every character except \n. With the RegexOptions.Singleline flag set, the dot will match every character, including \n.

Sometimes, you'll see people use a pattern like [\d\D] to mean "any character". Such a pattern is a tautology, that is, it's universally true — every character will either be or not be a digit. It has the same behavior as the dot with RegexOptions.Singleline specified.
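The following sketch compares the three variants on an input that contains a line break. SinglelineDemo and SpansLineBreak are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Checks whether the pattern matches "first\nsecond", i.e. whether
    // the construct between "first" and "second" accepts a line break.
    public static bool SpansLineBreak(string pattern, RegexOptions options)
    {
        return Regex.IsMatch("first\nsecond", pattern, options);
    }
}
```

SpansLineBreak("first.second", RegexOptions.None) returns false, while both SpansLineBreak("first.second", RegexOptions.Singleline) and SpansLineBreak(@"first[\d\D]second", RegexOptions.None) return true.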

#Conclusion

In practice, I often find myself using the combination of the following options:

var options = RegexOptions.Compiled
    | RegexOptions.CultureInvariant
    | RegexOptions.ExplicitCapture
    | RegexOptions.IgnoreCase
    | RegexOptions.IgnorePatternWhitespace;

Since most of my work is web-related, compiled regular expressions in static fields generally make sense. The last three flags help me keep my patterns simple and readable.