@elazar@phpc.social #tek2023

Regular Expressions Made Eas(y|ier)

Matthew Turland

Hello!

My name is Matt
It's nice to see you
Thank you for coming

Forrest Gump waving

TL;DR

Verifying that a string matches a pattern
Extracting pattern matches from a string
Replacing pattern matches in a string
Splitting strings by pattern matches

There Will Be Slides

matthewturland.com/presentations

joind.in/talk/26619

Michael Scott in The Office repeatedly saying 'PowerPoint.'

Go Ahead, Hashtag It

Feel free to live-tweet/toot!
Hashtag: #tek2023
Feel free to @ me

Deadpool instructing Negasonic Teenage Warhead to finish typing a tweet into her phone

Regular Expressions?

"A regular expression... is a sequence of characters that specifies a match pattern in text." Wikipedia

In PHP?

The Regular Expressions (Perl-Compatible) extension (AKA PCRE, preg) is generally the one you want. In particular, check out its Pattern Syntax manual section.
The POSIX regular expression extension (AKA ereg) was deprecated in PHP 5.3 and removed in PHP 7.

Match Verification

preg_match() finds the first match.
preg_match_all() finds all matches.
Both take a $pattern, $subject, and a $matches array passed by reference to store found matches.
Both return the number of matches found, i.e. 0 or 1 for preg_match(), 0+ for preg_match_all(), or false on error.
For arrays, check out preg_grep().
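
The return values described above can be sketched as follows. The $subject string and the variable names are made up for illustration. Because both functions return false on error, a strict comparison such as === 1 is a reasonable habit.

```php
<?php
// Hypothetical input chosen for illustration.
$subject = 'cat bat rat';

// preg_match() stops after the first match.
$first = preg_match('/at/', $subject, $match);

// preg_match_all() keeps scanning and collects every match.
$all = preg_match_all('/at/', $subject, $matches);

// preg_grep() returns the array elements that match, keeping their keys.
$hits = preg_grep('/at/', ['cat', 'dog', 'rat']);

var_dump($first, $all);
print_r($hits);
```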

Substring Match


$present = (strpos($string, 'foo') !== false);

$present = (preg_match('/foo/', $string) === 1);

This is an intentionally contrived example.
Use string functions (e.g. strpos) for static substrings, ctype functions for common simple patterns.
In /foo/, / is the pattern delimiter. More on that later.

Starting Anchor


$present = (strpos($string, 'foo') === 0);
// or in PHP 8
$present = str_starts_with($string, 'foo');

$present = (preg_match('/^foo/', $string) === 1);

A caret (^) before a pattern denotes the start of $string.
Note: any non-literal character in a pattern is called a metacharacter.

Ending Anchor


$present = (substr($string, -1 * strlen('foo')) === 'foo');
// or in PHP 8
$present = str_ends_with($string, 'foo');

$present = (preg_match('/foo$/', $string) === 1);

A dollar sign ($) after a pattern denotes the end of $string (or the position just before a trailing newline).
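
One subtlety worth a quick sketch: by default, $ also matches just before a newline at the very end of the string. The D modifier (modifiers are covered later in this deck) and the \z escape are two ways to insist on the true end. The variable names below are illustrative only.

```php
<?php
// By default, $ also matches just before a trailing "\n".
$default = preg_match('/foo$/', "foo\n");

// The D modifier restricts $ to the very end of the string.
$dollarEndOnly = preg_match('/foo$/D', "foo\n");

// \z always means the very end of the string.
$absoluteEnd = preg_match('/foo\z/', "foo\n");

var_dump($default, $dollarEndOnly, $absoluteEnd); // int(1) int(0) int(0)
```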

Both Anchors


$present = ($string === 'foo');

$present = (preg_match('/^foo$/', $string) === 1);

Using ^ and $ together means the pattern must match the entirety of $string.

Alternation


$present = (strpos($string, 'foo') !== false
    || strpos($string, 'bar') !== false
    || strpos($string, 'baz') !== false);

$present = (preg_match('/foo|bar|baz/', $string) === 1);

A pipe (|) can be used to delimit multiple possible patterns to match.

Alternation + Anchors


$result = preg_match('/^foo|bar/', 'abar');  // 1

$result = preg_match('/^foo|^bar/', 'abar'); // 0

Anchors must be applied to each alternative: /^foo|bar/ means "foo at the start, or bar anywhere".
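
An alternative to repeating the anchor is to group the alternatives so a single ^ covers all of them. This sketch uses a non-captured subpattern, which is introduced later in the deck; the test strings are made up.

```php
<?php
// (?: ... ) groups the alternatives without capturing them,
// so the single ^ applies to both foo and bar.
$grouped = '/^(?:foo|bar)/';

$a = preg_match($grouped, 'abar'); // bar is not at the start
$b = preg_match($grouped, 'barn'); // bar is at the start
$c = preg_match($grouped, 'food'); // foo is at the start

var_dump($a, $b, $c); // int(0) int(1) int(1)
```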

Quantifiers: 0-1


$present = (preg_match('/a{0,1}/', $string) === 1);

$present = (preg_match('/a?/', $string) === 1);

Quantifiers: 0+


$present = (preg_match('/a{0,}/', $string) === 1);

$present = (preg_match('/a*/', $string) === 1);

Quantifiers: 1+


$present = (preg_match('/a{1,}/', $string) === 1);

$present = (preg_match('/a+/', $string) === 1);

Quantifiers: n


$present = (preg_match('/a{2}/', $string) === 1);
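
The brace syntax also accepts a minimum and a maximum. A small sketch, with anchors added so that the quantifier cannot match just part of a longer run; the candidate strings are illustrative.

```php
<?php
// {2,3} means "at least 2, at most 3".
$pattern = '/^a{2,3}$/';

$results = [];
foreach (['a', 'aa', 'aaa', 'aaaa'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
}

print_r($results); // a => 0, aa => 1, aaa => 1, aaaa => 0
```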

Subpatterns: Counterexample


// a followed by 1+ instances of b
$present = (preg_match('/ab+/', $string) === 1);

Subpatterns: Quantifiers 1+


// 1+ instances of ab (ab, abab, ababab, etc.)
$present = (preg_match('/(ab)+/', $string) === 1);

Subpatterns: Quantifiers 0-1


// foo or foobar
$present = (preg_match('/foo(bar)?/', $string) === 1);

Subpatterns: Alternation


// ab or ac
$present = (preg_match('/a(b|c)/', $string) === 1);

Subpatterns: Alternation + Quantifiers


// ab, ac, abb, abc, acb, acc, etc.
$present = (preg_match('/a(b|c)+/', $string) === 1);

Subpatterns: Named


$present = preg_match(
    '/^(?P<area>[0-9]{3})'
        . '-(?P<prefix>[0-9]{3})'
        . '-(?P<line>[0-9]{4})$/',
    '123-456-7890',
    $match
);
print_r($match);

Subpatterns: Named


Array
(
    [0] => 123-456-7890
    [area] => 123
    [1] => 123
    [prefix] => 456
    [2] => 456
    [line] => 7890
    [3] => 7890
)
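
Since every named subpattern appears twice in the output (once by name, once by position), it can be handy to read values by name or to keep only the named entries. A minimal sketch using the same pattern; the $area and $named variables are illustrative.

```php
<?php
preg_match(
    '/^(?P<area>[0-9]{3})-(?P<prefix>[0-9]{3})-(?P<line>[0-9]{4})$/',
    '123-456-7890',
    $match
);

// Named groups can be read by name rather than by position.
$area = $match['area'];

// Keep only the named (string-keyed) entries.
$named = array_filter($match, 'is_string', ARRAY_FILTER_USE_KEY);

print_r($named);
```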

Captured Matches


if (preg_match('/foo(bar)?(baz)?/',
    'foo', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => foo
)

Captured Matches


if (preg_match('/foo(bar)?(baz)?/',
    'foobar', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => foobar
    [1] => bar
)

Captured Matches


if (preg_match('/foo(bar)?(baz)?/',
    'foobarbaz', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => foobarbaz
    [1] => bar
    [2] => baz
)

Captured Matches


if (preg_match('/foo(bar)?(baz)?/',
    'foobaz', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => foobaz
    [1] =>
    [2] => baz
)
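
An empty string for a skipped group can be hard to tell apart from a group that matched nothing. As a sketch, the PREG_UNMATCHED_AS_NULL flag (available since PHP 7.2) reports skipped groups as null instead; the variable names are illustrative.

```php
<?php
$pattern = '/foo(bar)?(baz)?/';

// Default: the skipped (bar) group shows up as an empty string.
preg_match($pattern, 'foobaz', $default);

// With the flag, the skipped group is reported as null instead.
preg_match($pattern, 'foobaz', $flagged, PREG_UNMATCHED_AS_NULL);

var_dump($default[1]); // string(0) ""
var_dump($flagged[1]); // NULL
```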

Nested Subpatterns


if (preg_match('/foo(ba(r|z))?/',
    'foobar', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => foobar
    [1] => bar
    [2] => r
)

Non-Captured Subpatterns


if (preg_match('/foo(?:bar)?(baz)?/',
    'foobarbaz', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => foobarbaz
    [1] => baz
)

Non-Captured Subpatterns

Older PCRE documentation limited captured subpatterns to 99 and total subpatterns to 200.
Current PCRE versions allow up to 65,535 captured subpatterns.
Use (?: to denote the start of non-captured subpatterns.

Matching Ranges

Three ways to match a single character from a range of possible characters:

Period (.) metacharacter
Escape sequences
Character ranges

Period Metacharacter


if (preg_match('/.+/', 'foobarbaz', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => foobarbaz
)

. matches any character except a line feed ("\n").

Escape Sequences

Sequence	Description	Inverse
`\d`	Digit, 0 through 9.	`\D`
`\h`	Horizontal whitespace, e.g. `" "`, `"\t"`	`\H`
`\v`	Vertical whitespace, e.g. `"\r"`, `"\n"`	`\V`
`\s`	Any whitespace, i.e. any from `\h` or `\v`	`\S`
`\w`	"Word character", i.e. any letter, digit, or underscore	`\W`
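
A short sketch of a few rows of this table in use, including one inverse sequence. The sample strings are made up.

```php
<?php
// \d, \s and \w combined: three digits, one whitespace character, a word.
$matched = preg_match('/^\d{3}\s\w+$/', '123 abc_9');

// The inverse \D matches anything that is not a digit,
// so replacing it with '' leaves only the digits.
$digits = preg_replace('/\D/', '', '(123) 456-7890');

var_dump($matched); // int(1)
var_dump($digits);  // string(10) "1234567890"
```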

Escape Sequences


if (preg_match('/\d+/', '0123456789', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => 0123456789
)

Character Ranges

Demarcated by square brackets (i.e. [ and ])
Can include individual characters, ranges of characters, or escape sequences
Ranges (e.g. a-z) follow ASCII code order
Case-sensitive by default
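
The points above can be sketched in one pattern that mixes an escape sequence, individual characters, and a literal hyphen. The sample inputs are hypothetical.

```php
<?php
// One class mixing an escape sequence (\w), literal characters (. and +),
// and a literal hyphen placed last so it is not read as a range.
$pattern = '/^[\w.+-]+$/';

$ok  = preg_match($pattern, 'first.last+tag');
$bad = preg_match($pattern, 'first last'); // space is not in the class

// Case sensitivity: [a-z] does not cover uppercase letters.
$upper = preg_match('/^[a-z]+$/', 'ABC');

var_dump($ok, $bad, $upper); // int(1) int(0) int(0)
```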

Character Ranges


// [0-9] is equivalent to \d
if (preg_match('/[0-9]+/', '0123456789', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => 0123456789
)

Character Ranges


// [a-zA-Z0-9_] is equivalent to \w
if (preg_match('/[a-zA-Z0-9_]+/', 'FOObar_123', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => FOObar_123
)

Character Ranges


// Matches hex strings
if (preg_match('/[0-9a-fA-F]+/',
    '7c0319169c4aba498d441ca91c6c4f1d', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => 7c0319169c4aba498d441ca91c6c4f1d
)

Character Ranges


// Out-of-order ASCII range
if (preg_match('/[F-A]+/', 'ABCDEF', $match) === 1) {
    print_r($match);
}


Warning: preg_match(): Compilation failed: range out of order in character class at offset 3

Negating Character Ranges


if (preg_match('/[^a]+/', 'abc', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => bc
)

^ inside a character range negates it.

Character Ranges: Edge Cases


// Matching ]
preg_match('/[\\]]/', ']', $match);

// Matching ^
preg_match('/[\\^]/', '^', $match); // or
preg_match('/[a^]/', '^', $match);
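
The hyphen is a third edge case in the same spirit. As a sketch: it is literal when escaped or when placed at the start or end of the class. Variable names are illustrative.

```php
<?php
// Escaped hyphen: the class contains only "a", "-" and "z".
$escaped  = preg_match('/^[a\-z]+$/', 'a-z');
$notRange = preg_match('/^[a\-z]+$/', 'b'); // "b" is not in the class

// Hyphen at the end: the range a-z plus a literal hyphen.
$atEnd = preg_match('/^[a-z-]+$/', 'foo-bar');

var_dump($escaped, $notRange, $atEnd); // int(1) int(0) int(1)
```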

Modifiers

Remember how we said earlier that / was the pattern delimiter?
It separates the pattern from modifiers that change how some aspects of patterns work.
The delimiter doesn't have to be / — it can be any character that isn't alphanumeric, a backslash, or whitespace.
Use a different delimiter to make escaping characters easier, e.g. in patterns containing /.
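
To illustrate the last point, here is the same URL check written with two delimiters. The URL is a placeholder and the variable names are hypothetical.

```php
<?php
$url = 'https://example.com/path';

// With / as the delimiter, every / in the pattern needs a backslash.
$slashes = preg_match('/^https?:\/\/[^\/]+\//', $url);

// With # as the delimiter, the same pattern needs no escaping.
$hashes = preg_match('#^https?://[^/]+/#', $url);

var_dump($slashes, $hashes); // int(1) int(1)
```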

Modifiers: Case-Insensitivity


if (preg_match('/[a-z]+/i', 'ABCDEF', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => ABCDEF
)

/i makes any letters match both upper/lowercase.

Modifiers: Anchors


if (preg_match('/^bar/m', "foo\nbar", $match) === 1) {
    print_r($match);
}


Array
(
    [0] => bar
)

/m makes ^ and $ match line (versus string) starts/ends.
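
The difference is easiest to see with preg_match_all() on a multi-line string. A minimal sketch with made-up input:

```php
<?php
$text = "foo\nbar\nbaz";

// Without /m, ^ and $ anchor to the whole string, which contains newlines.
$withoutM = preg_match_all('/^\w+$/', $text, $none);

// With /m, every line is anchored separately.
$withM = preg_match_all('/^\w+$/m', $text, $lines);

var_dump($withoutM, $withM); // int(0) int(3)
print_r($lines[0]);          // foo, bar, baz
```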

Modifiers: Dot-All


if (preg_match('/.+/s', "foo\nbar", $match) === 1) {
    print_r($match);
}


Array
(
    [0] => foo
bar
)

/s makes . match \n.

Modifiers: Analyze


for ($i = 0; $i < 10_000; $i++) {
    if (preg_match('/[0-9a-f]+/S', md5($i), $match) === 1) {
        print_r($match);
    }
}

/S analyzes a pattern for better performance.
No effect in PHP 7.3+ due to PCRE2 migration.

Modifiers: Ungreedy


if (preg_match('/p.*/', 'php', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => php
)

Modifiers: Ungreedy


if (preg_match('/p.*/U', 'php', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => p
)

Modifiers: Ungreedy


if (preg_match('/p.*?/', 'php', $match) === 1) {
    print_r($match);
}


Array
(
    [0] => p
)

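A ? after a quantifier makes only that quantifier ungreedy, with no /U needed.
In patterns that use /U, the meaning flips: ? makes the quantifier greedy.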
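
Greediness matters most when a delimiter can appear more than once. A sketch with a made-up HTML fragment, using # as the delimiter to avoid escaping the slashes:

```php
<?php
$html = '<b>one</b> and <b>two</b>';

// Greedy: .* runs to the last </b> in the string.
preg_match('#<b>.*</b>#', $html, $greedy);

// Lazy: .*? stops at the first </b> it can.
preg_match('#<b>.*?</b>#', $html, $lazy);

// Under /U the roles swap, so .*? is greedy again.
preg_match('#<b>.*?</b>#U', $html, $swapped);

echo $greedy[0], "\n";  // <b>one</b> and <b>two</b>
echo $lazy[0], "\n";    // <b>one</b>
echo $swapped[0], "\n"; // <b>one</b> and <b>two</b>
```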

Modifiers: Extended

"If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include commentary inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern."

More trouble than it's worth IMO. YMMV.
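
For those who want to judge for themselves, here is a sketch of the earlier phone-number pattern written with /x. The layout and comments are illustrative. Note that a comment must not contain the delimiter character.

```php
<?php
// Under /x the layout whitespace and the # comments are ignored.
$pattern = '/
    ^ (?P<area>   [0-9]{3} )    # area code
    - (?P<prefix> [0-9]{3} )    # prefix
    - (?P<line>   [0-9]{4} ) $  # line number
/x';

$result = preg_match($pattern, '123-456-7890', $match);

var_dump($result);      // int(1)
echo $match['line'], "\n"; // 7890
```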

Backreferences


$result = preg_replace(
    '/([0-9]{3})-([0-9]{3})-([0-9]{4})/',
    '($1) $2-$3',
    '123-456-7890'
);
echo $result;


(123) 456-7890

See preg_replace() documentation.
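
Two related forms are worth a sketch: a backreference inside the pattern itself (\1), and the ${1} replacement syntax for cases where a digit follows the reference. The sample strings are made up.

```php
<?php
// Inside a pattern, \1 refers back to whatever group 1 captured.
// Here it finds a word that is immediately repeated.
preg_match('/\b(\w+) \1\b/', 'this is is a test', $match);
echo $match[0], "\n"; // is is

// In a replacement, '$10' would be read as group 10;
// '${1}0' is group 1 followed by a literal 0.
$result = preg_replace('/([0-9]+)/', '${1}0', 'x5');
echo $result, "\n"; // x50
```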

Splitting Strings


$result = preg_split('/\s*,\s*/', '3,4   , 5 , 6');
print_r($result);


Array
(
    [0] => 3
    [1] => 4
    [2] => 5
    [3] => 6
)
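
Real input often has doubled or trailing separators, which produce empty pieces. A sketch of the PREG_SPLIT_NO_EMPTY flag with a made-up input string:

```php
<?php
// A doubled comma and a trailing comma produce empty pieces by default.
$raw = preg_split('/\s*,\s*/', 'a,,b, ');

// PREG_SPLIT_NO_EMPTY drops them; -1 means "no limit".
$clean = preg_split('/\s*,\s*/', 'a,,b, ', -1, PREG_SPLIT_NO_EMPTY);

print_r($raw);   // a, (empty), b, (empty)
print_r($clean); // a, b
```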

Filtering Strings


$result = preg_grep('/^[0-9]+$/', ['1', 'a', '1a', '2', 'b3']);
print_r($result);


Array
(
    [0] => 1
    [3] => 2
)
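
Note that preg_grep() keeps the original keys, which is why the output above jumps from 0 to 3. A sketch of two common follow-ups, renumbering the result and inverting the filter:

```php
<?php
$input = ['1', 'a', '1a', '2', 'b3'];

// array_values() renumbers the surviving elements from 0.
$numeric = array_values(preg_grep('/^[0-9]+$/', $input));

// PREG_GREP_INVERT returns the elements that do NOT match.
$rest = preg_grep('/^[0-9]+$/', $input, PREG_GREP_INVERT);

print_r($numeric); // 1, 2
print_r($rest);    // keys 1, 2, 4 => a, 1a, b3
```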

Best Practices

Split large patterns into smaller ones where feasible.
Use named subpatterns to make patterns more self-documenting.
Use tools like regex101.com for editing and ad-hoc testing of patterns.
Use automated tests to verify that patterns match only what they should.
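
A rough sketch of how these practices might combine: a pattern assembled from smaller pieces with named subpatterns, plus a tiny table of inputs and expected results. All names and test cases are hypothetical, and a real project would more likely use a test framework.

```php
<?php
// Small, reusable pieces (names are hypothetical).
$threeDigits = '[0-9]{3}';
$fourDigits  = '[0-9]{4}';

// Compose the full pattern from the pieces, naming each part.
// D: $ matches only at the very end, not before a trailing newline.
$phone = '/^(?P<area>' . $threeDigits . ')'
    . '-(?P<prefix>' . $threeDigits . ')'
    . '-(?P<line>' . $fourDigits . ')$/D';

// Inputs mapped to the result each one should produce.
$cases = [
    '123-456-7890'   => 1,
    '123-456-789'    => 0,
    'x123-456-7890'  => 0,
    "123-456-7890\n" => 0,
];

$failures = [];
foreach ($cases as $input => $expected) {
    if (preg_match($phone, $input) !== $expected) {
        $failures[] = $input;
    }
}

echo count($failures) === 0 ? "all cases pass\n" : "some cases fail\n";
```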

Other Resources

PHP Manual PCRE section
"Mastering Regular Expressions, 3rd ed.", a book by Jeffrey E. F. Friedl
/\bRegular\s+expressions\!+/i, a presentation by Eric Wastl
"Web Scraping with PHP, 2nd ed.", a book by Matthew Turland (Chapter 15: PCRE Extension)

That's All, Folks

joind.in/talk/26619 - Please leave feedback!
matthewturland.com
me@matthewturland.com

Outro for Looney Tunes with Porky Pig saying, 'That's all, folks.'
