Regular Expressions

What is a Regular Expression?

A Regular Expression is a sequence of characters that define a search pattern, which is usually used by string searching algorithms for “find” or “find and replace” operations on strings, as well as input validation. - Wikipedia


Literal Matches

A literal match is when a given character or string is matched exactly. In this article a pattern is surrounded by slashes to mark it as a Regular Expression: /limecake/

RegEx: /limecake/
String: "This is limecake's blog, LimeCake BRAIN"

Multiple matches

By default, most regex engines return only the first occurrence of a match, but they usually offer a way to obtain a list of all matches.

For example, in JavaScript, matching with the optional g (global) flag returns an array of all the matches.
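As a minimal sketch of the same idea outside JavaScript, the snippet below assumes Python's standard re module, where the choice between "first match" and "all matches" is made by calling a different function rather than by setting a flag. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text with two occurrences of the literal.
text = "limecake bakes, limecake blogs"

first = re.search(r"limecake", text)         # stops at the first occurrence
all_matches = re.findall(r"limecake", text)  # every non-overlapping occurrence

print(first.start())   # 0
print(all_matches)     # ['limecake', 'limecake']
```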

Case sensitivity

By default, regular expressions are case sensitive. So, /limecake/ will not match “LimeCake”.

However, you can force a regex to be case insensitive. For example, in JavaScript, this can be done using the i flag.
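The effect of a case-insensitive flag can be sketched as follows. This assumes Python's re module, where re.IGNORECASE plays the role that the i flag plays in JavaScript.

```python
import re

text = "This is limecake's blog, LimeCake BRAIN"

sensitive = re.findall(r"limecake", text)
insensitive = re.findall(r"limecake", text, flags=re.IGNORECASE)

print(sensitive)    # ['limecake']
print(insensitive)  # ['limecake', 'LimeCake']
```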


The Dot

The dot (.) is one of the most common metacharacters. It matches any single character (except line breaks). Let’s say we want to find all the words that start with “lime” but may end differently.

The regex /lime./ matches every occurrence of “lime” followed by any one character:

RegEx: /lime./
String: "this is limecake's personal blog. not limecheese, or limebread. All lime are not same"
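To see exactly what the dot consumes, here is a small sketch assuming Python's re module and a shortened, made-up sample string. It also illustrates the line-break exception; re.DOTALL is Python's way of lifting it.

```python
import re

text = "limecake, limecheese, limebread, lime pie"

# The dot consumes exactly one character after "lime", including a space.
matches = re.findall(r"lime.", text)
print(matches)  # ['limec', 'limec', 'limeb', 'lime ']

# By default the dot does not match a line break; re.DOTALL lifts that limit.
no_newline = re.search(r"lime.", "lime\n")
with_dotall = re.search(r"lime.", "lime\n", flags=re.DOTALL)
print(no_newline is None, with_dotall is not None)  # True True
```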

Matching Special Characters

There may be cases when you need to match characters that are themselves metacharacters, for example a dollar sign ($) or a bracket. The solution is to escape the metacharacter by preceding it with a backslash.

RegEx: /1 \+ 3 \* \(2 \+ 2\)/
String: "1 + 3 * (2 + 2)"
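The following sketch, assuming Python's re module, compares the hand-escaped pattern with an unescaped one. It also uses re.escape, a Python helper that adds the backslashes for you; other languages may not offer an equivalent.

```python
import re

text = "1 + 3 * (2 + 2)"

# Hand-escaped pattern, as in the example above.
manual = re.search(r"1 \+ 3 \* \(2 \+ 2\)", text)

# Unescaped, "+", "*", "(" and ")" keep their special meaning and nothing matches.
unescaped = re.search(r"1 + 3 * (2 + 2)", text)

# re.escape builds an escaped pattern from a literal string.
automatic = re.fullmatch(re.escape(text), text)

print(manual is not None, unescaped is None, automatic is not None)  # True True True
```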

Character Classes

Character Classes (also called Character Sets) allow matching a range of characters. Character classes are defined using the metacharacters [ and ] (square brackets). Everything between them is part of the set, which means any one of the set members must match (but not all).

For example, [XYZ] will match any one of the three characters.

RegEx: /[XYZ]code\.js/
String: "I have few JS files named Xcode.js, Ycode.js, and Zcode.js. but I don't have the file named Wcode.js. And a file mXcode.js."

One of the most popular use cases of sets is matching both uppercase and lowercase letters.

RegEx: /[Cc]ode\.js/
String: "There are two files, Code.js and code.js."

Another use case is matching both the American and the British spelling of a word, as in the regex /gr[ae]y/.
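The set examples above can be tried out with a short sketch, here assuming Python's re module. Note that the file mXcode.js also produces a match, because it contains Xcode.js.

```python
import re

files = "Xcode.js, Ycode.js, Zcode.js, Wcode.js, mXcode.js"

# Exactly one of X, Y or Z must precede "code.js".
print(re.findall(r"[XYZ]code\.js", files))
# ['Xcode.js', 'Ycode.js', 'Zcode.js', 'Xcode.js']

# One set member covers both spellings.
spellings = re.findall(r"gr[ae]y", "gray grey griy")
print(spellings)  # ['gray', 'grey']
```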


Range of Characters

RegEx supports character ranges, which can be defined using a dash (-) between two values. For example, [a-z] is for all Latin lower case letters. [A-Z] is for all capital letters. Same applies to numerical values: [0-9].

RegEx: /[A-Za-z0-9]\.js/
String: "There are few files, named a.js, B.js, and 5.js."

[A-Za-z0-9] is the same as [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789].

Reverse ranges, such as [Z-A] or [5-1], do not work, and they will often prevent the entire regex pattern from working.
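A quick sketch of both points, assuming Python's re module: the range set matches the three file names, and a reversed range is rejected when the pattern is compiled. Other engines may report the problem differently.

```python
import re

text = "There are few files, named a.js, B.js, and 5.js."
found = re.findall(r"[A-Za-z0-9]\.js", text)
print(found)  # ['a.js', 'B.js', '5.js']

# In Python a reversed range makes the whole pattern invalid.
try:
    re.compile(r"[Z-A]")
    reversed_range_compiles = True
except re.error:
    reversed_range_compiles = False

print(reversed_range_compiles)  # False
```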

Negated Character Classes

The caret (^) metacharacter can be used to negate a character set. This is achieved by placing it immediately after the opening square bracket of the set. The following regex matches characters that are not digits:

RegEx: /[^0-9]/
String: "Today is 2019, and Sep 29th."

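Multiple character ranges can be negated as well. A single caret at the start negates the whole set; a second caret inside the set would be treated as a literal ^. For example, the following regex matches characters that are neither digits nor capital letters:

RegEx: /[^0-9A-Z]/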
String: "Today is 2019, and Sep 29th."
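The sketch below, assuming Python's re module, applies negated sets to the sample sentence. A + is appended so that runs of matching characters are grouped, which keeps the output readable. The last two lines show that a caret placed later in the set is an ordinary character.

```python
import re

text = "Today is 2019, and Sep 29th."

non_digits = re.findall(r"[^0-9]+", text)
print(non_digits)  # ['Today is ', ', and Sep ', 'th.']

neither = re.findall(r"[^0-9A-Z]+", text)
print(neither)     # ['oday is ', ', and ', 'ep ', 'th.']

# Only a leading caret negates; a later caret is a literal member of the set.
print(re.findall(r"[^0-9A-Z]", "a^b"))   # ['a', '^', 'b']
print(re.findall(r"[^0-9^A-Z]", "a^b"))  # ['a', 'b']
```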

Digits (and non-Digits)

We can match numbers, as well as negate them using character classes: [0-9] and [^0-9]. Regular Expressions provide a series of shorthand character classes. For matching digits:

\d - shorthand for [0-9]
\D - shorthand for [^0-9], the same as [^\d].

For example, /201[0-9]/ can be written more briefly as /201\d/.
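Here is a small sketch of both shorthands, assuming Python's re module. In Python, \d also matches non-ASCII digits unless re.ASCII is set, so the flag is passed to keep \d equivalent to [0-9]. The sample strings are invented.

```python
import re

years = "2009 2010 2015 2019 2020"
decade = re.findall(r"201\d", years, flags=re.ASCII)
print(decade)  # ['2010', '2015', '2019']

# \D is handy for stripping everything that is not a digit.
digits_only = re.sub(r"\D", "", "+1 (555) 010-9999", flags=re.ASCII)
print(digits_only)  # 15550109999
```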


Alphanumeric Characters

Alphanumeric (a combination of alphabetic and numeric characters) describes the collection of Latin letters and Arabic digits. Together with the underscore, that makes 63 characters (A-Z + a-z + 0-9 + _).

\w matches any alphanumeric character in upper- or lowercase and the underscore (same as /[a-zA-Z0-9_]/).
\W matches any character that is neither alphanumeric nor an underscore (same as /[^a-zA-Z0-9_]/).

Let’s see an example to understand the difference between \w and \d. The following regex matches a six-character word whose first and last characters are digits and whose middle four are any word characters.

/\d\w\w\w\w\d/

It matches 1abcd2, 0____0, 123456, 8_Xx_8.

It does not match abcdef or A_B_C.
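The listed inputs can be checked with a short sketch. It assumes Python's re module, uses fullmatch so the whole word has to fit the pattern, and sets re.ASCII so that \w and \d mean exactly the ASCII sets described above.

```python
import re

pattern = re.compile(r"\d\w\w\w\w\d", re.ASCII)
candidates = ["1abcd2", "0____0", "123456", "8_Xx_8", "abcdef", "A_B_C"]

# Map each candidate to True (whole string matches) or False.
results = {c: pattern.fullmatch(c) is not None for c in candidates}

for word, ok in results.items():
    print(word, ok)
```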


Matching Whitespace

Whitespace characters such as spaces, tabs, newlines, and others can be tricky to find with a regex, because different systems use different characters to represent some of them. To represent whitespace characters, there are special metacharacters.

[\b] - backspace
\f - form feed
\n - line feed
\r - carriage return
\t - tab
\v - vertical tab

For example, a blank line can be matched using the following regex: /\n\n/.

To match a new line in Windows, your regex should look like \r\n, while in Linux you only need \n.

The most frequently used whitespace metacharacters are \n, \t, and \r. The characters n, v, t etc. are literal characters and only become metacharacters when used with a preceding backslash.

Shorthand for Whitespace

There is a simple shorthand notation for whitespace.

\s - any whitespace character (for ASCII text, the same as [ \f\n\r\t\v]; note the leading space).
\S - any non-whitespace character (for ASCII text, the same as [^ \f\n\r\t\v]).

[\b] is the backspace metacharacter. It is not a whitespace character, so \s does not match it and \S does.
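A few of the whitespace rules can be sketched as below, assuming Python's re module. The pattern \r?\n is one common way to accept both Windows and Linux line endings. The sample strings are invented.

```python
import re

# \s+ splits on any run of spaces, tabs or newlines.
words = re.split(r"\s+", "a b\tc\nd")
print(words)  # ['a', 'b', 'c', 'd']

# Two consecutive line feeds mark a blank line.
blank_line = re.search(r"\n\n", "first\n\nsecond")

# An optional \r accepts both "\r\n" and "\n".
line_endings = re.findall(r"\r?\n", "win\r\nunix\n")

# In a Python string literal "\b" is the backspace character.
backspace = "\b"
is_space = re.search(r"\s", backspace) is not None      # False
is_non_space = re.search(r"\S", backspace) is not None  # True
in_class = re.search(r"[\b]", backspace) is not None    # True
print(is_space, is_non_space, in_class)
```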


Anchors : Start of String

In regular expressions, anchors specify an exact position in the string or text where an occurrence of a match is necessary. It looks for a match in that specified position only.

For example, the caret ^ is the start anchor. It specifies that a match must occur at the beginning of the string, or at the beginning of each line when the engine runs in multiline mode.

The following regex matches www only when it occurs at the beginning of the line : /^www/.


Anchors : End of String

The $ (dollar) is the end anchor, the metacharacter that indicates the end of a line.

The following regex matches a line ending with “world”: /world$/

Using both start of line and end of line anchors, you have strict control of the line contents. The following regex matches a line containing exactly one letter: /^[A-Za-z]$/.

And if you need to match an empty line, use the following regex: /^$/.
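The anchors can be sketched as follows, assuming Python's re module. In Python, ^ and $ refer to the whole string unless re.MULTILINE is set, which is why the blank-line search passes that flag. The sample strings are hypothetical.

```python
import re

starts = re.search(r"^www", "www.example.com") is not None       # True
not_at_start = re.search(r"^www", "see www.example.com") is None  # True
ends = re.search(r"world$", "hello world") is not None           # True

# Both anchors together pin down the whole content.
one_letter = re.search(r"^[A-Za-z]$", "a") is not None           # True
two_letters = re.search(r"^[A-Za-z]$", "ab") is not None         # False

# With re.MULTILINE, ^ and $ apply to every line, so ^$ finds the empty line.
empty_lines = re.findall(r"^$", "a\n\nb", flags=re.MULTILINE)
print(len(empty_lines))  # 1
```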


Word Boundaries

Imagine searching for the word number using a literal match /number/ in the following text “I declared a number variable named my_number_var”. If we need to match only the word “number” we should use word boundaries.

Word boundaries allow matching whole words.

\b - word boundary
\B - non-word boundary

To solve the problem above with this new metacharacter, the regex to match the word will be: \bnumber\b.

The \B does the opposite: it matches at positions that are not word boundaries. So \Bnumber\B matches “number” only where it has word characters on both sides, which here is the one inside my_number_var.
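To make the difference visible, this sketch (assuming Python's re module) records the start position of each match in the sample sentence for the plain literal, for \b and for \B.

```python
import re

text = "I declared a number variable named my_number_var"

plain = [m.start() for m in re.finditer(r"number", text)]
whole = [m.start() for m in re.finditer(r"\bnumber\b", text)]
inner = [m.start() for m in re.finditer(r"\Bnumber\B", text)]

print(plain)  # [13, 38]  both occurrences
print(whole)  # [13]      only the standalone word
print(inner)  # [38]      only the one inside my_number_var
```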


Optional Character

The quantifier ?, commonly referred to as the question mark, represents optionality in regular expressions. It matches 0 or 1 instance of the preceding character, which makes that character optional.

For example, the following regex matches both break and beak: /br?eak/.

To make multiple tokens optional, group the tokens together using parentheses and place the question mark right after the closing parenthesis.

For example, the following regex matches both Jan and January: /Jan(uary)?/.

You may write a regex that matches many alternatives by including more than one question mark. /Jan(uary)? 5(th)?/ matches January 5th, January 5, Jan 5th, and Jan 5.
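The optional-token examples can be checked with the sketch below, which assumes Python's re module. The date pattern is written here with a space between the month and the day so that it fits the listed inputs.

```python
import re

# A single optional character.
optional_r = [w for w in ["break", "beak", "brreak"] if re.fullmatch(r"br?eak", w)]
print(optional_r)  # ['break', 'beak']

# Two optional groups give four accepted spellings.
date = re.compile(r"Jan(uary)? 5(th)?")
inputs = ["January 5th", "January 5", "Jan 5th", "Jan 5", "Janu 5"]
accepted = [s for s in inputs if date.fullmatch(s)]
print(accepted)  # ['January 5th', 'January 5', 'Jan 5th', 'Jan 5']
```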


Repetition with Plus and Star

The + quantifier matches one or more instances of the preceding character (or set).

All we need is to simply append a + character at the end of our character or the set.

For example, the following regex matches one or more digits in the text: /[0-9]+/.

123abc456
aa1234bb
0
+1001234

Inside a set, + is a literal character. So in “+2291+3ab-cde291+”, [0-9+]+ matches “+2291+3” and “291+”.

The asterisk or star (*) is a quantifier that matches zero or more instances of the preceding character (or set).

For example, the following regex matches any text containing “abc”, optionally followed by a number: /abc[0-9]*/.

abc
abc0
abc123

As the example above demonstrates, the * makes the set of digits optional.
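The sample inputs for + and * can be run through a short sketch, assuming Python's re module.

```python
import re

# + needs at least one digit.
print(re.findall(r"[0-9]+", "123abc456"))  # ['123', '456']
print(re.findall(r"[0-9]+", "+1001234"))   # ['1001234']

# Inside the set, + is a literal, so it becomes part of the matched runs.
runs = re.findall(r"[0-9+]+", "+2291+3ab-cde291+")
print(runs)  # ['+2291+3', '291+']

# * accepts zero digits as well.
star = re.findall(r"abc[0-9]*", "abc abc0 abc123")
print(star)  # ['abc', 'abc0', 'abc123']
```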


Limiting Repetitions

So far we either matched the exact amount of characters or we didn’t care about the number of characters at all. What if we want our regex to match from 5 to 8 characters?

There is a quantifier for interval matching. Intervals are specified between the { and } metacharacters. They take either one argument for an exact count, {X}, or two arguments for a range, {min,max} (written without a space after the comma). If the comma is present but max is omitted, as in {min,}, there is no upper limit and at least min repetitions are required.

The metacharacter ? is equivalent to {0,1}.

For example, to match a HEX RGB color code we need the same set six times. The {6} repeats the set exactly six times: /#[0-9A-Fa-f]{6}/.

The metacharacter + is equivalent to {1,}. The metacharacter * is equivalent to {0,}.

For another example, let’s match all the prices that have at least 3 digits before the decimal point: /\$\d{3,}\.?\d{0,2}/.

$100
$450.0
$78.50 (no match)
$760.45
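The interval examples can be verified with the following sketch, assuming Python's re module. fullmatch is used so that each input has to fit the pattern as a whole.

```python
import re

hex_color = re.compile(r"#[0-9A-Fa-f]{6}")
print(hex_color.fullmatch("#1A2b3C") is not None)  # True
print(hex_color.fullmatch("#12345") is not None)   # False (only five digits)

price = re.compile(r"\$\d{3,}\.?\d{0,2}")
prices = ["$100", "$450.0", "$78.50", "$760.45"]
matched = [p for p in prices if price.fullmatch(p)]
print(matched)  # ['$100', '$450.0', '$760.45']

# {1,} behaves like +.
same = re.findall(r"a{1,}", "caaat") == re.findall(r"a+", "caaat")
print(same)  # True
```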


Overmatching

There are cases where the regexes will overmatch.

For example, we want our regex to match all the <b> tags of HTML.

RegEx: /<[Bb]>.*<\/[Bb]>/
String: "<b>First</b> and <b>Second</b> words bold."

Matched: “<b>First</b> and <b>Second</b>”

Instead of two separate matches, one per tag, the regex produced a single match running from the first <b> to the last </b>.

This is because metacharacters such as + and * are greedy: they try to match as much as they can. The opposite of greedy matching is lazy matching, which matches as little as possible.

Lazy quantifiers are defined by appending ? to the quantifier being used. The lazy equivalents of the greedy quantifiers are:

* (greedy) : *? (lazy)
+ (greedy) : +? (lazy)
? (greedy) : ?? (lazy)
{x,} (greedy) : {x,}? (lazy)

The lazy version of the regex above becomes:

RegEx: /<[Bb]>.*?<\/[Bb]>/
String: "<b>First</b> and <b>Second</b> words bold."

Matched: “<b>First</b>” and “<b>Second</b>”
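The greedy and the lazy pattern can be compared side by side with a sketch like this one, assuming Python's re module. Python patterns are plain strings, so the slash needs no escaping.

```python
import re

html = "<b>First</b> and <b>Second</b> words bold."

greedy = re.findall(r"<[Bb]>.*</[Bb]>", html)
lazy = re.findall(r"<[Bb]>.*?</[Bb]>", html)

print(greedy)  # ['<b>First</b> and <b>Second</b>']
print(lazy)    # ['<b>First</b>', '<b>Second</b>']
```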


Grouping

Grouping is an important part of regexes. Subexpressions are parts of a bigger expression, and they are grouped together to be treated as a single entity. For that, we need new metacharacters, ( and ).

Let’s match an IP address.

RegEx: /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
String: "This is a valid IP address: 127.0.0.1"

Subexpressions are commonly used to repeat a part of the regex as a unit. The example above can be written more compactly.

RegEx: /(\d{1,3}\.){3}\d{1,3}/
String: "This is a valid IP address: 127.0.0.1"
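A brief sketch of the grouped pattern, assuming Python's re module. It also shows two things worth knowing: a repeated group remembers only its last repetition, and the pattern checks the shape of an address, not whether each number is in the 0-255 range.

```python
import re

text = "This is valid IP address: 127.0.0.1"
ip_shape = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

m = ip_shape.search(text)
print(m.group(0))  # 127.0.0.1
print(m.group(1))  # 0.   (the last repetition of the group)

# Shape only: out-of-range numbers still match.
out_of_range = ip_shape.fullmatch("999.999.999.999") is not None
print(out_of_range)  # True
```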

Practice

Special Symbols in Username

Let’s create a regex to validate a proper username, which should be a string that does not contain any of the following special characters: !@#$%^&*

We will search the given username for any occurrence of these symbols, and if a match is found, deny the username. The regex will look like /[!@#$%^&*]/.

Proper Filenames

Similarly, we can build a regex to validate the filename of a text file with a .txt extension. Let’s deny filenames whose name part ends with any of the following symbols: !?@$%#

Our regex matches only names whose last character before the extension is not one of these symbols: /[^!?@$%#]\.txt$/
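Both practice checks might be wired up as in the sketch below. It assumes Python's re module; the function names is_valid_username and is_valid_txt_name are hypothetical, and the filename pattern carries a trailing $ anchor as an assumption, so that the name really has to end in .txt.

```python
import re

FORBIDDEN = re.compile(r"[!@#$%^&*]")
TXT_NAME = re.compile(r"[^!?@$%#]\.txt$")

def is_valid_username(name: str) -> bool:
    """Hypothetical helper: valid when no forbidden symbol is found."""
    return FORBIDDEN.search(name) is None

def is_valid_txt_name(filename: str) -> bool:
    """Hypothetical helper: valid when the pattern is found at the end."""
    return TXT_NAME.search(filename) is not None

print(is_valid_username("lime_cake"))      # True
print(is_valid_username("lime$cake"))      # False
print(is_valid_txt_name("notes.txt"))      # True
print(is_valid_txt_name("notes!.txt"))     # False
print(is_valid_txt_name("notes.txt.exe"))  # False
```

The first check is a deny rule (a match means rejection), while the second is an allow rule (a match means acceptance), which is why the two helpers compare against None in opposite ways.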