
Regular Expression

A regular expression (shortened as regex or regexp; sometimes referred to as a rational expression) is a sequence of characters that specifies a match pattern in text (Wikipedia).

I've sourced most of this information from the following websites:

Terminology

  • pattern: regular expression pattern.
  • string: test string used to match the pattern.
  • digit: numbers from 0 to 9.
  • letter: alphabetic characters a-z and A-Z.
  • symbol: !$%^&*()_+|~-=`{}[]:";'<>?,./.
  • space: a single space or tab.
  • character: refers to a letter, digit or symbol.

Introduction

We've all used a find function in either a text editor or a webpage. Regular expressions enhance this ability by allowing you to match patterns rather than just literal strings.

Example:

Task: Find all students whose student ID starts with 121.
Example: 12143 12184 121444

Regular Expression: /\b121\d+/

The above expression will match all the student IDs that start with 121. Here \b specifies a "word boundary", 121 matches the literal string 121, and \d+ matches one or more digits.
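As a rough sketch, the same search could be run in Python with the standard re module. The sample text and variable names below are made up for illustration, and the surrounding / delimiters are dropped because Python takes the pattern as a plain string.

```python
import re

# Hypothetical sample text: three IDs start with 121, one only contains it.
text = "12143 12184 121444 99121"

# \b anchors the match at a word boundary, so "99121" is skipped.
student_ids = re.findall(r"\b121\d+", text)
print(student_ids)  # ['12143', '12184', '121444']
```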

This may seem complex at first, but it's just a sequential set of rules that, once you understand them, makes things very straightforward.

Character Sets

Consider this example: you want to match all of the following words with a single pattern: cat, bat, fat. Here the first letter must be one of a given SET of characters. For such cases we can use "character sets".

Regex: /[bcf]at/

Matches: bat, cat, fat

The characters between [ ] act like "options": any one of them may be used at that position.
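A minimal sketch of this character set in Python, assuming a made-up word list. It uses re.fullmatch so that each word has to match the pattern in its entirety.

```python
import re

# Hypothetical word list; only b, c, or f may come before "at".
words = ["bat", "cat", "fat", "rat", "at"]

pattern = re.compile(r"[bcf]at")
matches = [word for word in words if pattern.fullmatch(word)]
print(matches)  # ['bat', 'cat', 'fat']
```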

Ranges

Let's assume we want to match all the three-letter words that end with at. We could supply the full alphabet inside the character set, but that would be tedious. Here we can use ranges to simplify the pattern.

Regex: /[a-z]at/

Matches: aat, bat, rat, ...

You don't have to use the entire alphabet; [g-p] is also a valid range for that "group".

We can even combine ranges: [a-fS-Z] is a character set covering a to f and S to Z. Note that capital letters form their own range.

Ranges are based on ASCII codes, so any range works as long as its first character comes before its last character in ASCII order.
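The ranges above can be tried out with a short Python sketch. The word list is hypothetical and only there to show which words each range accepts.

```python
import re

# Hypothetical words, including one that starts with a capital letter.
words = ["gat", "pat", "rat", "Sat", "eat"]

# [g-p] accepts only the lowercase letters g through p.
g_to_p = [word for word in words if re.fullmatch(r"[g-p]at", word)]
print(g_to_p)  # ['gat', 'pat']

# [a-fS-Z] combines the lowercase range a-f with the uppercase range S-Z.
combined = [word for word in words if re.fullmatch(r"[a-fS-Z]at", word)]
print(combined)  # ['Sat', 'eat']
```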

Repetition

If you'd like to repeat a particular "group", rather than writing something such as:

[a-z][a-z][a-z]

for all 3-letter words, we can use the {} curly braces to specify the number of times that a group should repeat.

Here are some examples:

  • a{5} will match "aaaaa".
  • [a-z]{4} will match any four-letter word such as "door", "room", or "book".
  • [a-z]{6,} will match any word with six or more letters.
  • [a-z]{8,11} will match any word with between eight and eleven letters.

Note the syntax here: {n} will match exactly n times. {a,b} will match anywhere between a times to b times (inclusive). {n,} will match n or more times.
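Here is a small Python sketch of the three forms of {}. The words are invented for the example, and re.fullmatch is used so that the count applies to the whole word.

```python
import re

# Hypothetical words with 4, 6, 9, 5, and 2 letters.
words = ["door", "bridge", "elephants", "aaaaa", "it"]

exactly_four = [w for w in words if re.fullmatch(r"[a-z]{4}", w)]
six_or_more = [w for w in words if re.fullmatch(r"[a-z]{6,}", w)]
eight_to_eleven = [w for w in words if re.fullmatch(r"[a-z]{8,11}", w)]
five_a = [w for w in words if re.fullmatch(r"a{5}", w)]

print(exactly_four)     # ['door']
print(six_or_more)      # ['bridge', 'elephants']
print(eight_to_eleven)  # ['elephants']
print(five_a)           # ['aaaaa']
```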

Meta characters

Meta characters allow you to write regular expressions that are more compact.

  • \d matches any digit, same as [0-9].
  • \w matches any letter, digit, or underscore.
  • \s matches any whitespace character, such as a space, tab, or newline.
  • \t matches a tab character.

Example:

  • \w{5} would match any five-letter word or five-digit number (or anything in-between).
  • \d{11} would match an 11-digit number.
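A quick Python sketch of these meta characters, assuming a made-up input string. It splits the text on whitespace with \s and then checks each token against \w{5} and \d{11}.

```python
import re

# Hypothetical input containing a tab and spaces between tokens.
text = "id_42\tcall 01234567890 now"

# \s+ splits on any run of whitespace (here a tab and single spaces).
tokens = re.split(r"\s+", text)
print(tokens)  # ['id_42', 'call', '01234567890', 'now']

# \w{5}: exactly five letters, digits, or underscores.
five_chars = [t for t in tokens if re.fullmatch(r"\w{5}", t)]
print(five_chars)  # ['id_42']

# \d{11}: exactly eleven digits.
eleven_digits = [t for t in tokens if re.fullmatch(r"\d{11}", t)]
print(eleven_digits)  # ['01234567890']
```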

Special characters

Special characters allow you to talk about the count of particular groups or characters.

  • + One or more quantifier. For example c+at would match "cat", "ccat", "ccccat".
  • ? Zero or one quantifier. For example c?at would match "cat" or "at".
  • * Zero or more quantifier. For example c*at will match "at", "cat", "ccccat".
  • \ This "escape character" is used when we want to use a special character literally. For example c\*at will match "c*at"; the "*" is taken to be a literal.
  • [^] This "negate" notation is used to indicate characters that should not be matched by a character set. For example b[^a-c]t will NOT match "bat" or "bct" but will match "bet".
  • . This "dot" will match any single character (digit, letter, symbol, or space) except a newline. For example .{8} will match an eight-character password consisting of letters, numbers and symbols.
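The special characters listed above can be compared side by side in Python. This is only a sketch: the helper matches() and the candidate strings are hypothetical and exist purely for the demonstration.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# Hypothetical candidate strings.
candidates = ["at", "cat", "ccat", "c*at", "bet", "bat"]

plus = matches(r"c+at", candidates)         # one or more "c"
optional = matches(r"c?at", candidates)     # zero or one "c"
star = matches(r"c*at", candidates)         # zero or more "c"
escaped = matches(r"c\*at", candidates)     # a literal "*"
negated = matches(r"b[^a-c]t", candidates)  # middle character not a, b, or c

print(plus)      # ['cat', 'ccat']
print(optional)  # ['at', 'cat']
print(star)      # ['at', 'cat', 'ccat']
print(escaped)   # ['c*at']
print(negated)   # ['bet']

# The dot matches any single character except a newline.
dot_ok = re.fullmatch(r".{8}", "Pa$$w0rd") is not None
print(dot_ok)    # True
```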

Groups

Groups allow us to talk about properties of more than just single characters. Going back to our previous example, if we wanted to match the words "cat", "bat", "fat" AND "flat", here's what we can do:

Regex: ([bcf]|fl)at

Match: cat, bat, fat, flat

Groups are indicated using parentheses (), and we can use the pipe symbol | to specify "or" conditions in the group.

Now, the special characters mentioned above will apply to the whole group rather than to a single character. Example: book(\.com)? will match the strings "book" and "book.com" but nothing in between. The dot is escaped here so that it matches a literal "." rather than any character.
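A short Python sketch of both group examples, using invented test strings. The dot in the second pattern is escaped so that it only matches a literal ".".

```python
import re

# Hypothetical words; "rat" and "flcat" should be rejected.
words = ["cat", "bat", "fat", "flat", "rat", "flcat"]
group_matches = [w for w in words if re.fullmatch(r"([bcf]|fl)at", w)]
print(group_matches)  # ['cat', 'bat', 'fat', 'flat']

# The ? applies to the whole group, so ".com" is all-or-nothing.
site = re.compile(r"book(\.com)?")
site_matches = [s for s in ["book", "book.com", "book.c"] if site.fullmatch(s)]
print(site_matches)  # ['book', 'book.com']
```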

Back Referencing: You can refer to groups that were matched by the regex using one of the following notations: \1, $1, {1}. Which one applies depends on the particular implementation, so the easiest approach is to check your tool's documentation or just try each option.

  • Nvim: Use \(\) for groups and \1 for back references.
  • VSCode: Use () for groups and $1 for back references.
  • Python: Use () for groups and \1 for back references.

Back references can be used in cases such as search and replace, or when you're looking for a repeated word. For example, to rename files that end with ".html" so that they end with ".php", we can employ something like this:

# \(.*\)\.html$ matches a file name that ends in .html
# \1 references the group marked by \(\), which
# is the file name without the extension.
sed 's/\(.*\)\.html$/\1.php/'
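For comparison, here is how a similar rename might look in Python, which uses () for groups and \1 for back references. This is a sketch with hypothetical file names; it anchors the pattern with $ so that only names ending in .html are changed, and it also shows a back reference inside the pattern itself to find a repeated word.

```python
import re

# Hypothetical file names; only the .html ones should be renamed.
files = ["index.html", "about.html", "notes.txt"]

# (.*) captures the name without the extension; \1 reuses it in the replacement.
renamed = [re.sub(r"(.*)\.html$", r"\1.php", name) for name in files]
print(renamed)  # ['index.php', 'about.php', 'notes.txt']

# Inside a pattern, \1 must match the same text the first group captured.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")
print(repeated.group(1))  # is
```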

Position Specifiers

Regular expressions match anywhere within a string unless told otherwise. For example, book(\.com)? would also match within the string "notebook.com"; note that it matches only the "book.com" part of "notebook.com". However, sometimes we want to match exactly "book.com" or "book" and not occurrences inside other words. We can use additional special characters to specify the position of a pattern.

  • ^ placed at the start, this character matches a pattern at the start of a line.
  • $ placed at the end, this character matches a pattern at the end of a line.
  • \b a word boundary: matches the position where a word starts or ends.
  • \B a non-word boundary: matches a position that is not at the start or end of a word.

Examples:

  • /^book$/ matches all lines with JUST the word "book".
  • /^book/ matches every "book" that is at the start of a line.
  • /\bbook\b/ matches "book" only as a whole word; it does not match inside longer words such as "notebook" or "bookworm".
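The position specifiers can be compared with a small Python sketch. The sample lines are made up, and re.search is used because it looks for the pattern anywhere in the line, which makes the effect of the anchors visible.

```python
import re

# Hypothetical lines of text.
lines = ["book", "bookworm", "my notebook", "a book here"]

only_book = [line for line in lines if re.search(r"^book$", line)]
starts_with = [line for line in lines if re.search(r"^book", line)]
whole_word = [line for line in lines if re.search(r"\bbook\b", line)]
inside_word = [line for line in lines if re.search(r"\Bbook", line)]

print(only_book)    # ['book']
print(starts_with)  # ['book', 'bookworm']
print(whole_word)   # ['book', 'a book here']
print(inside_word)  # ['my notebook']
```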

Resources

  • Mozilla Developer Reference: a great guide if you're looking for something complete.
  • Regex 101: a great place to test regular expressions; you can even change the particular "flavor".
  • Regexr: great for seeing what each part of your expression means.
