A Very Long Regular Expression Tutorial

As written by someone who had never heard of a regular expression until three days ago.

In this week's bootcamp assignment, I was asked to create a tutorial that explains how a specific regular expression, or regex, functions by breaking down each part of the expression and describing what it does.

However, I had never heard of or seen a regex...ever. I was being given the task of writing a tutorial on a technical topic I knew nothing about. After doing some research, I found myself confused and horrified. I mean, look at this example of a supah long regex:

var pattern = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/

Scary, right? Well, after some deep breathing and enough YouTube tutorials, I became confident enough to break things down step-by-step. What I've written below is my attempt at sharing what I've learned!

Pretty thorough, if I do say so myself.

Summary

In today's tutorial, we'll be walking step-by-step through the creation of a regular expression, or regex. A regex is a sequence of characters that defines a search pattern.

In this lesson, we will be defining a search pattern for an email:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

This may just look like a jumble of characters right now, but regexes are powerful tools that can help validate user input. That is, a regex checks whether a user-generated string fulfills given requirements.
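
Here's a minimal sketch of what that kind of check might look like in JavaScript. The isValidEmail name is my own made-up helper for illustration, not something built in:

```javascript
// The email pattern from this tutorial, written as a regex literal.
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// isValidEmail is a hypothetical helper name chosen for this sketch.
function isValidEmail(input) {
  // test() returns true when the pattern matches the string, false otherwise.
  return emailPattern.test(input);
}

console.log(isValidEmail("claudiacdavis@gmail.com")); // true
console.log(isValidEmail("not an email"));            // false
```

The second string fails because it has no @ symbol and contains spaces, which none of the bracket expressions allow.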

Without further ado, let's jump right in and begin breaking down the components of a regex!

Regex Components

Anchors

Before we dive into anchors, it is important to note that in JavaScript a regex is usually written as a literal, so the pattern is wrapped in slash characters (/), as seen in our regex example.
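
As a small sketch of what "literal" means here, JavaScript also lets you build the same kind of pattern from a string with the RegExp constructor. The cat pattern is just a made-up example:

```javascript
// A regex literal: the pattern sits between two slashes.
const literal = /cat/;

// The same pattern built from a string with the RegExp constructor.
const constructed = new RegExp("cat");

// Both hold the same pattern text and behave the same way.
console.log(literal.source === constructed.source); // true
console.log(literal.test("Orange cat"));            // true
console.log(constructed.test("Orange cat"));        // true
```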

Anchors do not match any characters in the string. Instead, they match a position before, after, or between characters, which lets us "anchor" the regex at certain positions.

There are two main anchors we need to be concerned about in our regex: the caret (^) and the dollar sign ($).

The caret ^ matches the start of the string, so the string must begin with whatever pattern follows it. That pattern can come in two formats:

  • An exact string match, such as ^Orange, where the strings "Orange" or "Orange cat" match, but "orange" and "orange cat" do not. This is because a regex is case-sensitive by default.
  • A range of possible matches, written using bracket expressions, as seen in our regex below. Make sure to read the "Bracket Expressions" and "Quantifiers" sections for more on this!

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

Similarly, the dollar sign $ matches the position right after the last character in the string, so nothing may follow the pattern, as seen above.
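
To see both anchors in action, here's a quick sketch using the ^Orange example from above. The cat$ and ^Orange$ patterns are extra examples I made up for illustration:

```javascript
// ^ anchors the pattern to the start of the string.
const startsWithOrange = /^Orange/;
console.log(startsWithOrange.test("Orange"));     // true
console.log(startsWithOrange.test("Orange cat")); // true
console.log(startsWithOrange.test("orange cat")); // false (lowercase "o")
console.log(startsWithOrange.test("An Orange"));  // false (not at the start)

// $ anchors the pattern to the end of the string.
const endsWithCat = /cat$/;
console.log(endsWithCat.test("Orange cat")); // true
console.log(endsWithCat.test("cat nap"));    // false (something follows "cat")

// With both anchors, the whole string has to be the pattern.
const exactlyOrange = /^Orange$/;
console.log(exactlyOrange.test("Orange"));     // true
console.log(exactlyOrange.test("Orange cat")); // false
```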

Bracket Expressions

Anything inside a set of square brackets ([]) represents a set of characters that we want to match. Let's take a look at our regex to break this down further.

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

We have three bracket expressions, and each one matches any single character listed between its brackets.

Given this information, let's see if we can break down the bracket expressions in our regex.

First bracket expression: [a-z0-9_\.-]

  • a-z: the string can contain any lowercase letter from a to z
  • 0-9: the string can contain any digit from 0 to 9
  • _: the string can contain an underscore
  • \.: the string can contain a period
  • -: the string can contain a hyphen

An example of a matching string would look like:

_koh_8a8phcpo96hmcu.epem8uhygmrhosdf2yy_48bxp39cmp.0ddea3doisbxmr4dmk7b2zqx8ssx

Second bracket expression: [\da-z\.-]

  • \d: the string can contain any digit (more on this in the "Character Classes" section)
  • a-z: the string can contain any lowercase letter from a to z
  • \.: the string can contain a period
  • -: the string can contain a hyphen

An example of a matching string would look like:

3p.28r-lj5f9i.7xnz7gfh7xcmz7jxgcmexntbe7zrmorq8x3wuhrsb2zqgum2yrj2llo0khe0ex2si4b9bonb7cx7.q.zhthc-di

Third bracket expression: [a-z\.]

  • a-z: the string can contain any lowercase letter from a to z
  • \.: the string can contain a period

An example of a matching string would look like:

hbe.ct
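
Here's a rough sketch that checks single characters against each of the three bracket expressions. I've wrapped each one in ^ and $ (my own addition) so that exactly one character is tested at a time:

```javascript
// Each bracket expression from our regex, anchored to test one character.
const firstSet = /^[a-z0-9_\.-]$/;
const secondSet = /^[\da-z\.-]$/;
const thirdSet = /^[a-z\.]$/;

console.log(firstSet.test("k"));  // true  (lowercase letter)
console.log(firstSet.test("_"));  // true  (underscore)
console.log(firstSet.test("K"));  // false (uppercase is not in the set)
console.log(firstSet.test("@"));  // false

console.log(secondSet.test("7")); // true  (digit)
console.log(secondSet.test("-")); // true  (hyphen)
console.log(secondSet.test("_")); // false (no underscore in this set)

console.log(thirdSet.test("."));  // true  (period)
console.log(thirdSet.test("7"));  // false (no digits in this set)
```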

We'll get into the additional characters that are seen within the brackets soon, but first let's look at the use of quantifiers!

Quantifiers

Now that we know what we're looking for inside of our square brackets, you may be wondering what the curly brackets {2,6} and plus signs (+) mean for our regex. These are quantifiers. A quantifier sets how many times the preceding part of the pattern may repeat, often as a minimum and maximum number of repetitions.

Quantifiers are greedy by default, which we'll dive into a little later in this tutorial. For now, all you need to know is that being greedy means the quantifier will match as many occurrences of its pattern as possible. The quantifiers include:

  • *: Matches the pattern zero or more times
  • +: Matches the pattern one or more times
  • ?: Matches the pattern zero or one time
  • {}: Curly brackets provide a way to set limits on a match, where {x,y} defines the minimum (x) and maximum (y) number of times the preceding token may repeat.

If we look back at our regex, this tells us that the first and second bracket expressions each match one or more times, and the final bracket expression matches between 2 and 6 characters, each of which is a lowercase letter or a period.

An example of a matching string for the third bracket expression would look like:

lakft.
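
The quantifiers above can be sketched out like this. The a* and colou?r patterns are made-up examples of mine, while the last pattern is the third bracket expression from our regex with anchors added:

```javascript
// * allows zero or more repetitions; + needs at least one.
console.log(/^a*$/.test(""));    // true  (zero "a"s is fine)
console.log(/^a+$/.test(""));    // false (needs at least one "a")
console.log(/^a+$/.test("aaa")); // true

// ? makes the preceding token optional (zero or one time).
console.log(/^colou?r$/.test("color"));  // true
console.log(/^colou?r$/.test("colour")); // true

// {2,6} allows between 2 and 6 repetitions of the bracket expression.
const thirdGroup = /^[a-z\.]{2,6}$/;
console.log(thirdGroup.test("c"));       // false (only 1 character)
console.log(thirdGroup.test("com"));     // true  (3 characters)
console.log(thirdGroup.test("lakft."));  // true  (6 characters)
console.log(thirdGroup.test("toolong")); // false (7 characters)
```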

Grouping and Capturing

While many regular expressions are straightforward, others have multiple bracket expressions that need to be checked to determine different requirements. Looking at our regex, you can see that our bracket expressions have been separated into three groups, and divided by the "@", "\", and "." symbols.

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

However, the backslash (\) is actually a character escape. In other words, it tells the regex to treat the next character literally when that character would otherwise have a special meaning. For example, the open curly brace ({) is used to begin a quantifier, but adding a backslash before it (\{) means that the regex should look for an actual open curly brace character instead of beginning a quantifier. The string must then contain an open curly brace at that position, or it will not be accepted as a match. The same applies to the period: an unescaped period matches any character, while \. matches only a literal period.

Using this logic, we now know one more key piece of information about our regex:

  • there must be a literal period (.) between the second and third groups, because \. matches only a period

But what about the at (@) symbol? Knowing that we are using a regex to match an email value, we can infer that the @ symbol and the period are being used to break up an email search into three groups, like so:

/^([user email]+)@([email service or site]+)\.([domain name (.ca, .com, .org, .gov, etc.)]{2,6})$/

Ex: claudiacdavis + @gmail + .com
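
Here's a small sketch of how those three groups can be pulled out of a matching string with JavaScript's match() method. The variable names are my own:

```javascript
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// match() returns an array: index 0 is the full match,
// and indexes 1 to 3 hold what each capturing group matched.
const result = "claudiacdavis@gmail.com".match(emailPattern);

console.log(result[0]); // claudiacdavis@gmail.com
console.log(result[1]); // claudiacdavis
console.log(result[2]); // gmail
console.log(result[3]); // com

// When the string does not match, match() returns null.
const noMatch = "claudiacdavis".match(emailPattern);
console.log(noMatch); // null
```

Notice that the @ and the period sit outside the parentheses, so they are required for a match but are not part of any captured group.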

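Note: inside a bracket expression, most special characters, such as the period, lose their special meaning, so the backslash before the period there is optional. In JavaScript, the backslash itself still works as an escape inside brackets, which is why \d can appear in our second bracket expression.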
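
A quick sketch of how escaping a period behaves outside and inside brackets. The a.c patterns are made-up examples for illustration:

```javascript
// Outside brackets, an unescaped period matches any character
// (except a line break), so "abc" is accepted.
console.log(/^a.c$/.test("abc"));  // true
console.log(/^a.c$/.test("a.c"));  // true

// With the backslash, only a real period is accepted.
console.log(/^a\.c$/.test("abc")); // false
console.log(/^a\.c$/.test("a.c")); // true

// Inside brackets, the period is literal with or without the backslash.
console.log(/^[.]$/.test("."));    // true
console.log(/^[.]$/.test("x"));    // false
console.log(/^[\.]$/.test("."));   // true
console.log(/^[\.]$/.test("x"));   // false
```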

Character Classes

Character classes allow us to write more compact regular expressions. Two common character classes are:

  • \d: matches any digit between 0-9
  • \w: matches any letter, digit, or underscore character

Hey! It looks like we have a "\d" in our regex! In our second bracket expression, we can see that in addition to any lowercase letter between a-z, a period, and a hyphen we also have any digit between 0-9.

An example of a matching string would look like:

5qkil-1d7tndsc2.

In summary, \d is a more concise way to write 0-9, saving space in our bracket expression.
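
To double-check that \d and 0-9 really behave the same, here's a sketch that runs a few sample strings through both spellings of the second bracket expression. The anchors, the variable names, and the sample strings (other than the example above) are my own additions:

```javascript
// The second bracket expression, written with \d and with 0-9.
const withShorthand = /^[\da-z\.-]+$/;
const withRange = /^[0-9a-z\.-]+$/;

const samples = ["5qkil-1d7tndsc2.", "gmail", "Gmail", "under_score"];

// Collect each sample together with the result from both patterns.
const results = samples.map((s) => [s, withShorthand.test(s), withRange.test(s)]);

for (const [sample, shorthand, range] of results) {
  console.log(sample, shorthand, range);
}
// "5qkil-1d7tndsc2." and "gmail" pass both patterns;
// "Gmail" (uppercase) and "under_score" (underscore) fail both.

// \w covers letters, digits, and the underscore, but not the hyphen.
console.log(/^\w+$/.test("under_score")); // true
console.log(/^\w+$/.test("a-b"));         // false
```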

Greedy and Lazy Match

As mentioned in our quantifiers section, in greedy mode a quantified character is repeated as many times as possible. Quantifiers are greedy by default, so this applies to the plus signs seen in the first and second groups.

([a-z0-9_\.-]+)@([\da-z\.-]+)

Lazy mode is the opposite of greedy mode. It means: "repeat a minimal number of times". We can enable it by placing a question mark right after a quantifier, as in the +? below.

/".+?"/
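
Here's a sketch comparing the greedy and lazy versions of that quoted-text pattern. The sample sentence is made up for illustration:

```javascript
const text = 'say "hi" and "bye"';

// Greedy: .+ grabs as much as it can, so the match runs
// from the first quote all the way to the last quote.
const greedy = text.match(/".+"/)[0];
console.log(greedy); // "hi" and "bye"

// Lazy: .+? grabs as little as it can, so the match stops
// at the first closing quote it finds.
const lazy = text.match(/".+?"/)[0];
console.log(lazy); // "hi"
```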

Final Explanation

Given our regex /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/ , let's break things down character by character for good measure.

^ asserts position at start of the string

1st Capturing Group ([a-z0-9_\.-]+)

Match a single character present in the list below [a-z0-9_\.-]

  • a-z matches a single character in the range between a and z (case sensitive)
  • 0-9 matches a single character in the range between 0 and 9
  • _ matches the character _ literally
  • \. matches the character . literally
  • - matches the character - literally
  • plus sign (+) matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)

@ matches the character @ literally, between the first and second groups

2nd Capturing Group ([\da-z\.-]+)

Match a single character present in the list below [\da-z\.-]

  • \d matches a digit (equivalent to [0-9])
  • a-z matches a single character in the range between a and z (case sensitive)
  • \. matches the character . literally
  • - matches the character - literally
  • plus sign (+) matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)

\. matches the character . literally, between the second and third groups

3rd Capturing Group ([a-z\.]{2,6})

Match a single character present in the list below [a-z\.]

  • a-z matches a single character in the range between a and z (case sensitive)
  • \. matches the character . literally
  • {2,6} matches the previous token between 2 and 6 times, as many times as possible, giving back as needed (greedy)

$ asserts position at the end of the string (in JavaScript, without the m flag, this is only the very end of the string)
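
To tie the whole breakdown together, here's a sketch that runs the full regex against a handful of strings. The test strings and expected results are my own examples, picked to poke at each rule described above:

```javascript
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Each entry pairs an input with the result the breakdown predicts.
const cases = [
  ["claudiacdavis@gmail.com", true],
  ["first.last@mail.example.org", true],
  ["Claudia@gmail.com", false],          // uppercase is not in [a-z0-9_\.-]
  ["claudia@gmail", false],              // no period and third group
  ["claudia@gmail.c", false],            // third group needs at least 2 characters
  ["claudia@gmail.photography", false],  // third group allows at most 6 characters
  ["claudia gmail.com", false],          // no @ symbol
];

// Run every case and record whether the regex agreed with the prediction.
const outcomes = cases.map(([input, expected]) => {
  const actual = emailPattern.test(input);
  return { input, expected, actual };
});

for (const { input, expected, actual } of outcomes) {
  console.log(input, actual, actual === expected ? "as expected" : "surprise!");
}
```

The uppercase and long-ending cases show that this pattern is stricter than real-world email addresses, so it is best treated as a learning example.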

Final Thoughts

Regex is complicated, but breaking an expression down into smaller bits and pieces makes it manageable!

I hope this tutorial was helpful. If you're reading this and have any tips or thoughts, please drop me a line. In the meantime, happy coding!