RegEx for Google Analytics

What are Regular Expressions?

Regular Expressions, or RegEx for short, are the Swiss Army knife of the text world.

They provide a rich syntax for validating, extracting and generally manipulating text.

At first glance they can appear obscure, and indeed they can get quite complex, as demonstrated by the example below, which is used to validate IP addresses like ‘192.168.1.100’:

\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}\b
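As a rough illustration, the expression above can be tried out in Python. Python's re module is only a convenient stand-in here (an assumption; Google Analytics applies its own regex engine), and the helper name and sample addresses are made up for the sketch.

```python
import re

# The IP address pattern quoted above, written as a raw string.
IP_PATTERN = re.compile(r"\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}\b")


def looks_like_ip(text: str) -> bool:
    """Return True if the pattern finds an IP-like value in text."""
    return IP_PATTERN.search(text) is not None


print(looks_like_ip("192.168.1.100"))  # True
print(looks_like_ip("256.1.1.1"))      # False: 256 is out of range
```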

However, the good news is that in most cases you can achieve everything you need using quite simple expressions.

In this article we will focus on the most useful regex features available to us in Google Analytics.

Why do we need RegEx?

Regular Expressions enable you to create flexible and efficient filters in several places in Google Analytics including Custom Reports, Tables, Goals, Custom segments and Channel grouping. They can be used in place of plain text to greatly simplify complex logic. 

The fact is that Regular Expressions were created to solve these kinds of problems and constructing filters without them would be extremely difficult.

Nuts and Bolts

In this section we will cover the core regex features along with examples of how they can be used effectively in GA.

Important: Several characters take on a special meaning when used in a Regular Expression. If you want to match one of them literally, you need to add a leading \ (backslash) to cancel, or escape, its special meaning.

Anchors – ^ and $

Placing a ^ (caret) at the beginning of a regex anchors the search to the beginning of the data being searched. Similarly, placing a $ (dollar) at the end of the regex anchors the search to the end of the searched data. Without any anchors the regex would match anywhere in the data. 

Regex        Description
middle       matches middle anywhere in the data
^Start       matches Start at the beginning of the data
end$         matches end at the end of the data
^Top Dog$    matches data containing exactly Top Dog
\$ rate      matches $ rate (adding a leading \ means the $ loses its special meaning as an anchor)

For example, if I wanted a filter to include only traffic to polkadotdata.com and any of its subdomains, I could use a custom include filter with the following regex Filter pattern (the dot is escaped so that it matches only a literal dot):

polkadotdata\.com$
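The anchor behaviour described in this section can be checked with a short Python sketch. It assumes Python's re module as a stand-in for GA's matching, uses a hypothetical helper called matches, and the hostnames other than polkadotdata.com are invented for illustration. The hostname pattern is written with the dot escaped so that it matches only a literal dot.

```python
import re


def matches(pattern: str, data: str) -> bool:
    """True if pattern is found in data; anchors restrict where it may match."""
    return re.search(pattern, data) is not None


print(matches(r"middle", "in the middle of it"))  # True
print(matches(r"^Start", "Start here"))           # True
print(matches(r"^Start", "Do not Start here"))    # False
print(matches(r"end$", "the very end"))           # True
print(matches(r"^Top Dog$", "Top Dog wins"))      # False
print(matches(r"\$ rate", "the $ rate"))          # True

# Hostname filter: anchored to the end of the data only.
HOST = r"polkadotdata\.com$"
for host in ["polkadotdata.com", "blog.polkadotdata.com",
             "polkadotdata.com.example.org"]:
    print(host, matches(HOST, host))
```

The first two hostnames pass the filter and the third does not, because its data does not end in polkadotdata.com.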

The OR operator – |

Often you will want to search for combinations of words within data. The | (vertical bar) character is used to separate a list of alternatives to match. Use round brackets to constrain the list within its surrounding text, as illustrated in the last example.

Regex          Description
cat|dog        matches either cat or dog anywhere in the data
abc|de|fg      matches either abc, de or fg
a(bc|de|fg)    matches a followed by either bc, de or fg (the round brackets constrain the list of alternatives)

For example, in GA, we could filter product detail views for just dresses and jackets using the following filter:

/products/(dresses|jackets)/detail/$
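A minimal sketch of how this filter behaves, assuming Python's re module as a stand-in for GA and a made-up list of page paths:

```python
import re

PRODUCT_DETAIL = re.compile(r"/products/(dresses|jackets)/detail/$")

# Hypothetical page paths, for illustration only.
pages = [
    "/products/dresses/detail/",
    "/products/jackets/detail/",
    "/products/shoes/detail/",
    "/products/dresses/detail/reviews",
]
kept = [p for p in pages if PRODUCT_DETAIL.search(p)]
print(kept)  # ['/products/dresses/detail/', '/products/jackets/detail/']

# Without brackets the alternatives stand alone; with them, a must come first.
print(re.search(r"abc|de|fg", "de") is not None)    # True
print(re.search(r"a(bc|de|fg)", "de") is not None)  # False
```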


The Wildcard – .

A common source of confusion when first encountering Regular Expressions is the wildcard character.

We are all familiar with the use of the * character as a wildcard to represent any number of characters, but in a regex the * has a very different meaning as we will see next.

Instead, we use a . (dot) as a wildcard to represent any single character, including white space. If we want to match more than one character, we have to follow the dot with a quantifier.

Quantifiers – * + ? and {}

Quantifiers are used to specify if, and how many times, we want something to match.

The characters * + ? and {} are used as quantifiers (which means you must remember to escape them with a \ if they appear in your data).

Quantifiers only apply to the character, or group, that immediately precedes them. Their usage is most easily explained by example.

Regex         Description
abc*          matches ab followed by zero or more c’s
abc+          matches ab followed by one or more c’s
abc?          matches ab followed by zero or one c
abc{2}        matches ab followed by exactly 2 c’s
abc{2,5}      matches ab followed by 2 to 5 c’s
a(bc)*        matches a followed by zero or more occurrences of the sequence bc
a(bc){2,5}    matches a followed by 2 to 5 occurrences of the sequence bc
.*            matches zero or more occurrences of any character (including white space); this effectively matches anything
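The quantifier examples can be verified with a small Python sketch. It assumes Python's re module as a stand-in for GA, and the helper name whole is hypothetical. re.fullmatch is used so that each pattern has to account for the entire sample string.

```python
import re


def whole(pattern: str, data: str) -> bool:
    """True if pattern matches the whole of data."""
    return re.fullmatch(pattern, data) is not None


print(whole(r"abc*", "ab"))            # True: zero c's is allowed
print(whole(r"abc+", "ab"))            # False: needs at least one c
print(whole(r"abc?", "abcc"))          # False: at most one c
print(whole(r"abc{2}", "abcc"))        # True: exactly two c's
print(whole(r"abc{2,5}", "abcccccc"))  # False: six c's is too many
print(whole(r"a(bc)*", "abcbc"))       # True: bc repeated twice
print(whole(r"a(bc){2,5}", "abc"))     # False: only one bc
print(whole(r"a.c", "a c"))            # True: the dot also matches a space
print(whole(r".*", "anything at all")) # True
```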

For example, to write a filter to exclude traffic from the IP address range 72.129.20.0 through 72.129.20.99 we could use the regex below.

We have escaped the dots with a backslash to disable their special meaning as a wildcard, and [0-9]{1,2} matches one or two digits.

^72\.129\.20\.[0-9]{1,2}$
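As a hedged sketch, assume the range of interest is 72.129.20.0 through 72.129.20.99, so that only the last octet varies and it has one or two digits. Python's re module stands in for GA here, and the sample addresses are invented. The second pattern shows what goes wrong when the dots are left unescaped.

```python
import re

# Assumed range: last octet 0 to 99 within 72.129.20.x
IP_RANGE = re.compile(r"^72\.129\.20\.[0-9]{1,2}$")

for ip in ["72.129.20.0", "72.129.20.99", "72.129.20.100", "72.129.21.5"]:
    print(ip, bool(IP_RANGE.search(ip)))  # True, True, False, False

# Unescaped dots act as wildcards and accept unintended data.
UNESCAPED = re.compile(r"^72.129.20.[0-9]{1,2}$")
print(bool(UNESCAPED.search("72x129x20x5")))  # True
print(bool(IP_RANGE.search("72x129x20x5")))   # False
```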

Bracket Expressions – [ ]

Bracket expressions are used to match on one of a set of alternative characters. Inserting a ^ inside the opening square bracket inverts the logic from match to does not match.

As a shortcut when entering a range of characters you can specify the first and last characters separated by a dash. Note that special characters (except ^, -, ] and \) lose their special meaning inside bracket expressions and so don’t need escaping.

Regex          Description
[aef]          matches a single character, either a, e or f
[^aef]         matches a single character EXCEPT a, e or f (the leading ^ negates the whole expression)
[a-f]          matches any single character in the range a through f
[0-9]%         matches any single digit between 0 and 9 followed by %
[a-fA-F0-9]    matches any single upper or lower case hexadecimal digit
[A-Z\-]        matches A through Z or a dash (the \ escapes the special meaning of the dash)
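The bracket expressions above can be explored with re.findall, which lists every match in a sample string. This is only a sketch, assuming Python's re module as a stand-in for GA and using made-up sample text.

```python
import re

print(re.findall(r"[aef]", "coffee"))            # ['f', 'f', 'e', 'e']
print(re.findall(r"[^aef]", "coffee"))           # ['c', 'o']
print(re.findall(r"[a-f]", "badge"))             # ['b', 'a', 'd', 'e']
print(re.findall(r"[0-9]%", "up 5%, down 12%"))  # ['5%', '2%']
print(re.findall(r"[A-Z\-]", "A-b-C"))           # ['A', '-', '-', 'C']

# A bracket expression matches one character; add + to match a run of them.
is_hex = re.fullmatch(r"[a-fA-F0-9]+", "1aF9") is not None
print(is_hex)  # True
```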

Character classes

Regular Expressions include a set of shorthand character classes to allow more compact expressions. The most common ones are illustrated below.

Regex    Description
\d       matches a digit (0-9)
\D       matches a non-digit (inverse of \d)
\w       matches a word character (a-z, A-Z, 0-9, and underscore)
\W       matches a non-word character (inverse of \w)
\s       matches a whitespace character (includes tabs and line breaks)
\S       matches a non-whitespace character (inverse of \s)
.        matches any character

Remember that to match any of the characters ^ . [ ] $ ( ) | * + ? { } \ and - literally, you must escape them with a backslash \ to disable their special meaning (the dash only needs escaping inside a bracket expression).
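A short sketch of the shorthand classes and of escaping, assuming Python's re module as a stand-in for GA. The sample strings are invented. re.escape is a Python convenience that adds the backslashes for you, and it is shown here only as one possible way to avoid escaping by hand.

```python
import re

sample = "Order 42 shipped_to: Leeds\t(UK)"
digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)
spaces = re.findall(r"\s", sample)
print(digits)       # ['4', '2']
print(words)        # ['Order', '42', 'shipped_to', 'Leeds', 'UK']
print(len(spaces))  # 4 (three spaces and one tab)

text = "costs $9.99"
unescaped = re.search(r"$9.99", text)        # $ acts as an end anchor here
escaped = re.search(r"\$9\.99", text)        # escaped by hand
auto = re.search(re.escape("$9.99"), text)   # escaped by re.escape
print(unescaped is None, escaped is not None, auto is not None)  # True True True
```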

Grouping and capturing – ()

As well as their use for grouping alternate patterns, round brackets capture their matched content as separate elements.

The match returns a Match object from which you can extract the group contents by index. Group 0 contains the whole matched string and groups 1 onward contain the matched content of each group.

If you nest brackets, the groups are numbered in the order of their opening brackets, so an outer group comes before the groups nested inside it. This feature is exposed when creating custom advanced filters in GA.

In the screenshot below we are applying a regex to the Page Title field to capture the text following the final backslash as a group and using it to replace the existing Page Title.

[Screenshot: regex-polka-dot-data]
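Group indexing can be sketched in Python, whose re module returns a match object of the kind described above (an assumption about the environment; GA's advanced filters expose captured groups through their own fields). The page path and page title below are hypothetical, and the second pattern is only one possible way to capture the text after a final backslash, not necessarily the one in the screenshot.

```python
import re

# Nested groups: outer group 1 wraps inner groups 2 and 3.
m = re.search(r"^/(([a-z]+)/([a-z]+))/detail/$", "/products/dresses/detail/")
if m:
    print(m.group(0))  # /products/dresses/detail/
    print(m.group(1))  # products/dresses
    print(m.group(2))  # products
    print(m.group(3))  # dresses

# Capture everything after the final backslash of a made-up page title.
title = r"Home\Blog\RegEx for Google Analytics"
m2 = re.search(r"\\([^\\]*)$", title)
new_title = m2.group(1) if m2 else title
print(new_title)  # RegEx for Google Analytics
```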

Boundaries

\b is an anchor (similar to $ and ^) which matches a word boundary, for example at the beginning or end of a string or word.

\babc\b    matches abc only if it appears as a whole word
\Babc\B    \B is the inverse of \b: matches abc only if it is contained within a word

Comparing the use of \b with \s:

\sflowers\s   matches “the flowers are red” but not “red flowers”
\bflowers\b   matches both “the flowers are red” and “red flowers”
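The comparison can be reproduced with a short sketch, again assuming Python's re module as a stand-in for GA. The key difference is that \s needs an actual whitespace character on each side, while \b matches a position and is also satisfied at the start or end of the data.

```python
import re

for text in ["the flowers are red", "red flowers"]:
    with_s = bool(re.search(r"\sflowers\s", text))
    with_b = bool(re.search(r"\bflowers\b", text))
    print(f"{text!r}: \\s -> {with_s}, \\b -> {with_b}")

# Whole word versus inside a word.
print(bool(re.search(r"\babc\b", "abc def")))  # True
print(bool(re.search(r"\babc\b", "xabcx")))    # False
print(bool(re.search(r"\Babc\B", "xabcx")))    # True
print(bool(re.search(r"\Babc\B", "abc def")))  # False
```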

Greedy and Lazy matches

On their own the quantifiers * + ? and {} are greedy in that they match as much as they can. This can sometimes give surprising results as shown below. To make them match as little as possible (lazy) we need to add a ? after the quantifier.

For example, given the text:

<div>apples</div><div>pears</div>

<div>.+<\/div>  will match everything to the last </div>:

<div>apples</div><div>pears</div>

Adding ? after the + quantifier makes the match lazy:

<div>.+?<\/div>  will match only up to the first </div>:

<div>apples</div>
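The same greedy and lazy behaviour can be observed in a small Python sketch (Python's re module is assumed here as a stand-in for GA). The last line uses a capturing group with the lazy quantifier to pull out each item separately.

```python
import re

html = "<div>apples</div><div>pears</div>"

greedy = re.search(r"<div>.+<\/div>", html).group(0)
lazy = re.search(r"<div>.+?<\/div>", html).group(0)
items = re.findall(r"<div>(.+?)<\/div>", html)

print(greedy)  # <div>apples</div><div>pears</div>
print(lazy)    # <div>apples</div>
print(items)   # ['apples', 'pears']
```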

RegEx 101 for Testing and Debugging

RegEx 101 is a very useful online tool for testing and debugging your Regular Expressions in real time with sample data.

Just make sure to select the correct flavor of RegEx on the left side menu. For Google Analytics that would be JavaScript.

Summary

Regular Expressions have a powerful impact on how we can filter and use data in Google Analytics. They are especially useful for pinpointing matches to a query in large sets of data where basic filtering is not exact enough.

We use them almost daily and regard them as an essential tool to any data analyst working in Google Analytics.

To learn more about the impact of using RegEx in Analytics, see our Google Analytics Training courses.