Regular expression (regex) not filtering as expected when using [:alnum:] and [:digit:] in the SQL query

Use \p{Alnum} or \p{Digit} instead.

Written by Vidhi Khaitan

Last published at: March 18th, 2025

Problem

When you use [:alnum:] or [:digit:] in your regular expressions (regex) in an SQL query, you notice the regex does not filter the data correctly. 

 

For example, in the following image, the query SELECT * from <table-name> where <id> REGEXP '[:alnum:]' returns two rows, not the expected four. 

 

Cause

This discrepancy lies in the difference between Photon and non-Photon compute. Photon uses the RE2 regex library, which interprets [:alnum:] as a character class meaning "all alphanumeric characters."

 

In contrast, Apache Spark relies on the Java regex library, which simply treats [:alnum:] as a pattern matching the literal characters ":", "a", "l", and so on.

 

Solution

Use \p{Alnum} or \p{Digit} instead of POSIX-exclusive syntax.

 

The previous example query, rewritten as SELECT * from <table-name> where <id> REGEXP ‘\\p{Alnum}’ , returns the expected four rows.