Regular expressions

What do all the following lines have in common? And what differs?

metrics,1632976282813843891,,,23.3,03f5ca58,emse/fayol/e4/S431H,sensors
metrics,1632976283071666118,61.3,,,03f5ca58,emse/fayol/e4/S431H,sensors
metrics,1632976283256456669,61.3,,,03f5ca58,emse/fayol/e4/S431H,sensors
metrics,1632976283338625565,,11,,f9538ac8,emse/fayol/e4/S431F,sensors
metrics,1632976283363173281,,11,,f9538ac8,emse/fayol/e4/S431F,sensors
metrics,1632976283390128670,,,24.7,f9538ac8,emse/fayol/e4/S431F,sensors
metrics,1632976283438988637,53.8,,,f9538ac8,emse/fayol/e4/S431F,sensors
metrics,1632976283463404640,,,24.7,f9538ac8,emse/fayol/e4/S431F,sensors
metrics,1632976283489232778,,3,,6bd134b6,emse/fayol/e4/S405,sensors

How can we extract only the fifth field?

How can we automatically remove the first and last fields?

Regular Expressions

Hello

[Hh]ello

^[Hh]el+o [A-Z][a-z]*$

[-a-zA-Z0-9]+\([-.[:alnum:]]*\)*@...
...[[:lower:][:digit:]]...
...[-[:lower:][:digit:]]+\.[a-z][a-z]+

<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>(.*?)</\1>

What do those lines mean?

Introduction

A regular expression is a pattern, written as a sequence of characters, that describes a set of character strings.

Regular expressions are built the same way arithmetic expressions are; they make use of different operators to make up large expressions from smaller ones.

These slides cover extended regular expressions, the dialect understood, for example, by the egrep utility (equivalent to grep -E).

First example

Hello

The text above is a regular expression that matches the word Hello.

However, someone may write hello without a capital letter. It would be nice to match that too.

First example

[Hh]ello

We achieve that with a character class, [Hh], meaning H or h. This regex reads as "H or h, followed by e, then l, then l, and then o".
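The character class can be tried out with a minimal sketch, assuming Python's re module as the engine (an assumption of this example; the slides target egrep, which reads this particular pattern the same way). The sample words are made up for illustration.

```python
import re

# [Hh] accepts an uppercase or a lowercase first letter, nothing else.
pattern = re.compile(r"[Hh]ello")

# Hypothetical sample words; search() looks for the pattern anywhere in the string.
words = ["Hello", "hello", "HELLO", "jello"]
results = {word: pattern.search(word) is not None for word in words}
print(results)
```

Only the first two words are accepted: the class covers a single position and lists exactly two characters.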

Often, though, Hello is followed by an exclamation mark that we don’t want to miss!

First example

[Hh]ello!?

The question mark tells the regular expression engine that the character (or character class) right before it is optional: it may appear once or not at all.
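A short sketch of the optional character, assuming Python's re module rather than egrep. The sample strings are hypothetical. In the third one, only one exclamation mark is consumed, since ? allows at most one occurrence.

```python
import re

pattern = re.compile(r"[Hh]ello!?")

# group() returns the text that was actually matched.
matched = [pattern.search(s).group() for s in ["Hello", "hello!", "Hello!!"]]
print(matched)
```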

Well, how do we match a literal question mark, then?

First example

[Hh]ello\?

In order to match a special character, it has to be escaped with a backslash \.

The regular expression above matches both Hello? and hello?.
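To see what the backslash changes, the sketch below compares the escaped pattern with an unescaped variant. It assumes Python's re module; the unescaped pattern and the sample strings are illustrations, not part of the slides.

```python
import re

escaped = re.compile(r"[Hh]ello\?")   # \? is a literal question mark
unescaped = re.compile(r"[Hh]ello?")  # ? makes the preceding o optional

# fullmatch() succeeds only if the whole string matches the pattern.
escaped_ok = [s for s in ["Hello?", "hello?", "Hello"] if escaped.fullmatch(s)]
unescaped_ok = [s for s in ["Hello?", "Hello", "Hell"] if unescaped.fullmatch(s)]
print(escaped_ok, unescaped_ok)
```

Without the backslash, the pattern silently changes meaning: it accepts Hell and rejects Hello?.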

Right, but what if one wants to match Hello or hello followed by an exclamation mark, a question mark, or nothing at all, or followed by the word world (or World), itself optionally followed by an exclamation mark?

First example

[Hh]ello(⎵[Ww]orld!?|[!?])?

⎵ denotes a space character.

The | character introduces an alternative between two patterns.

Inside a character class, most special characters lose their special meaning and revert to being regular characters. That is why [!?] matches a literal ! or ?.

Parentheses form a group. A ? placed right after the group makes the whole group optional.

Therefore, this expression reads as …
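One way to check a reading of this expression is to test it against a few candidate strings. The sketch below assumes Python's re module, and the candidates are hypothetical.

```python
import re

# A plain space is used where the slide shows ⎵.
pattern = re.compile(r"[Hh]ello( [Ww]orld!?|[!?])?")

candidates = ["Hello", "hello!", "Hello?", "hello world", "Hello World!",
              "Hello world?", "Hello!?"]

# fullmatch() requires the pattern to cover the entire candidate.
accepted = [s for s in candidates if pattern.fullmatch(s)]
print(accepted)
```

The last two candidates are rejected: world may only be followed by !, and the class [!?] stands for a single character.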

Well, but what if I want this expression to be alone on its line? I don’t want to match all the sentences that contain this pattern.

First example

^[Hh]ello(⎵[Ww]orld!?|[!?])?$

To achieve that, one may use the ^ and $ characters. They match the empty string at the beginning and at the end of a line, respectively.

A logical line may be displayed across several physical lines in a text editor or a terminal emulator.

A logical line ends with \n.
$ matches the empty string just before the \n character.
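grep reads its input one line at a time, so ^ and $ naturally apply to each line. The sketch below assumes Python's re module instead. There the whole text is a single string, and the re.MULTILINE flag is needed for the anchors to apply to every logical line. The sample text is hypothetical.

```python
import re

# Three logical lines, each terminated by \n.
text = "Hello\nwell, hello world\nhello World!\n"

# With re.MULTILINE, ^ and $ match at the start and end of each line.
pattern = re.compile(r"^[Hh]ello( [Ww]orld!?|[!?])?$", re.MULTILINE)

whole_lines = [m.group() for m in pattern.finditer(text)]
print(whole_lines)
```

The second line contains the pattern but is not reported, because the pattern does not start at the beginning of that line.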

Summing things up

What we’ve learned up to now

A character matches itself.
Some characters take on a special meaning depending on the context.
[…] creates a character class: any one of its characters may match at that position in the string.
…|… creates an alternative: any one of the patterns separated by | may match at that position in the string.
(…) creates a group.
^ and $ are anchors that match the beginning and the end of a line (or of a character string), respectively.

Second example

<[[:alpha:]]+>

[:alpha:] is a predefined character class that matches any alphabetic character, uppercase or lowercase. It is only valid inside a bracket expression, hence the double brackets in [[:alpha:]].

This regular expression matches a word made only of letters and written between a less-than sign and a greater-than sign. This is how HTML opening tags are written.

The + sign means that the character or class right before it must appear at least once and may be repeated (one or more times).

However, some HTML element names contain digits, for example the headings (h1, h2, …). This regular expression does not match them.
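A minimal sketch of this limitation, assuming Python's re module. Python does not support POSIX classes such as [[:alpha:]], so the ASCII range [A-Za-z] stands in for it here. The HTML fragment is hypothetical.

```python
import re

# [A-Za-z] stands in for [[:alpha:]], which Python's re module does not support.
pattern = re.compile(r"<[A-Za-z]+>")

html = "<html><body><h1>Title</h1><p>Text</p></body></html>"

# findall() returns every non-overlapping match, from left to right.
tags = pattern.findall(html)
print(tags)
```

The h1 tag is missing from the result, and so are the closing tags, since / is not a letter.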

Second example

<[[:alnum:]]+>

There are several predefined character classes, [:alnum:] being one of them. [[:alnum:]] is equivalent to [[:alpha:][:digit:]].

This regular expression will match everything that looks like a simple HTML tag, and more!

HTML elements may also contain attributes, as in
<meta charset="utf-8" /> or
<link rel="stylesheet" href="custom.css" />.

Our regex becomes useless…
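Both remarks can be checked with a short sketch, assuming Python's re module, with [A-Za-z0-9] as an ASCII stand-in for [[:alnum:]]. The sample <123> is a made-up non-tag.

```python
import re

# [A-Za-z0-9] stands in for [[:alnum:]] (ASCII letters and digits only).
pattern = re.compile(r"<[A-Za-z0-9]+>")

samples = ["<h1>", "<123>", '<meta charset="utf-8" />']
verdicts = {s: pattern.fullmatch(s) is not None for s in samples}
print(verdicts)
```

The pattern now accepts h1, but also <123>, which is not HTML, and it still rejects a tag that carries attributes.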

Second example

The following regex may solve the issue.

<[^>]+>

If the first character of a character class is ^, the class matches the complement of the listed set. Therefore, [^>] matches any single character except >.

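This brings up one of the most confusing aspects of regexes: greediness. By default, a quantifier such as + or * matches as many characters as possible.

<p class="wide">Regular expressions are <i>greedy</i>.</p>

On the line above, a naive pattern such as <.+> matches the whole line, from the first < to the last >: the p element, the i element, the closing tags and everything in between. With [^>], the match cannot run past a >, so <[^>]+> stops at the end of the opening tag.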
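To see greediness at work, the sketch below compares two patterns on that line, assuming Python's re module. The pattern <.+> is a hypothetical naive variant, not taken from the slides.

```python
import re

line = '<p class="wide">Regular expressions are <i>greedy</i>.</p>'

# . also accepts '>', so the greedy + runs up to the last '>' of the line.
greedy = re.search(r"<.+>", line).group()

# [^>] cannot cross a '>', so the match ends with the opening tag.
bounded = re.search(r"<[^>]+>", line).group()

print(greedy)
print(bounded)
```

The quantifier is equally greedy in both patterns. What differs is the set of characters it is allowed to consume.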

Second example

<[^>]*?>

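Some regex engines support non-greedy (lazy) quantifiers, written by appending ? to the quantifier, as in *?. This is a Perl-style extension; POSIX extended regular expressions do not define it, so a tool such as grep -E may not honor it.

This way, the match ends at the first occurrence of what follows the quantifier (> in our case).

Two issues remain: closing tags (they begin with /) and empty tags (<>) will match too.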
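A sketch of the lazy quantifier and of the two remaining issues, assuming Python's re module, which supports *?. The pattern <.*?> is a hypothetical variant used to make laziness visible, and the sample line is made up.

```python
import re

# Hypothetical sample line, ending with an empty pair of angle brackets.
line = '<p class="wide">Regular expressions are <i>greedy</i>.</p> <>'

lazy_dot = re.findall(r"<.*?>", line)       # lazy: stops at the first '>'
from_slide = re.findall(r"<[^>]*?>", line)  # pattern shown on the slide
print(lazy_dot)
print(from_slide == lazy_dot)
```

Both patterns return the same list, which includes the closing tags and the empty <>.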

Second example

<[[:alpha:]]+[[:digit:]]*\b[^>]*?>

Here we provide a solution to the issues mentioned above.

We want to match words between < and > that begin with one or more letters, uppercase or lowercase, possibly followed by digits. This word may be followed by any characters other than >.

The \b sequence matches the empty string at the beginning or the end of a word. Here it forces the tag name to end exactly where its letters and digits end, so the name is a whole word, possibly followed by attributes.
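The effect of each part can be checked on a few candidates. This is a sketch assuming Python's re module, with [A-Za-z] and [0-9] as ASCII stand-ins for [[:alpha:]] and [[:digit:]]. The candidate strings are hypothetical.

```python
import re

# ASCII ranges stand in for the POSIX classes used on the slide.
pattern = re.compile(r"<[A-Za-z]+[0-9]*\b[^>]*?>")

candidates = ["<h1>", '<p class="wide">', "<br />", "</p>", "<>", "<h1x>"]
accepted = [c for c in candidates if pattern.fullmatch(c)]
print(accepted)
```

The closing tag and the empty <> are rejected because a letter must follow <. The made-up <h1x> is rejected by \b, since there is no word boundary between 1 and x.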

Second example

The last touch to our regular expression will be to make it match everything inside an element: the opening tag with attributes, the content of the element and the closing tag.

<([[:alpha:]]+[[:digit:]]*)\b[^>]*?>.*?</\1>

For that purpose, we use back-references. A back-reference matches the same text as the one matched earlier by a group in the same regex.

The first set of parentheses defines a group that can then be referred to by its number (\1 in our case).

A regex may contain several groups. Each group is numbered according to the position of its opening parenthesis, counting from the left.
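A minimal sketch of a back-reference, assuming Python's re module, with ASCII ranges standing in for the POSIX classes. The mismatched fragment <b>bold</i> is a made-up counterexample.

```python
import re

# [A-Za-z] and [0-9] stand in for [[:alpha:]] and [[:digit:]].
pattern = re.compile(r"<([A-Za-z]+[0-9]*)\b[^>]*?>.*?</\1>")

html = '<p class="wide">Regular expressions are <i>greedy</i>.</p>'
match = pattern.search(html)
element = match.group()    # the whole matched element
tag_name = match.group(1)  # text captured by the first group

# The closing tag must repeat the captured name, so nothing is found here.
mismatch = pattern.search("<b>bold</i>")
print(tag_name, mismatch)
```

The lazy .*? skips over </i> because \1 stands for p at that point. The match only ends at </p>.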

Summing up the second example

From this second example we learned that:

Regular expressions are greedy
Several predefined character classes exist
It is possible to match characters not in a list
Several anchors exist to match the beginning and the end of words, lines, …
Some regex engines can use back-references
Regular expressions are never perfect!
