Problem
When you use [:alnum:]
or [:digit:]
in your regular expressions (regex) in an SQL query, you notice the regex does not filter the data correctly.
For example, in the following image, the query SELECT * from <table-name> where <id> REGEXP '[:alnum:]'
returns two rows, not the expected four.
Cause
This discrepancy lies in the difference between Photon and non-Photon compute. Photon uses the RE2 regex library, which interprets [:alnum:]
as a character class meaning "all alphanumeric characters."
In contrast, Apache Spark relies on the Java regex library, which simply treats [:alnum:]
as a pattern matching the literal characters ":", "a", "l", and so on.
Solution
Use \p{Alnum}
or \p{Digit}
instead of POSIX-exclusive syntax.
The previous example query, rewritten as SELECT * from <table-name> where <id> REGEXP ‘\\p{Alnum}’
, returns the expected four rows.