Select files using a pattern match

When selecting files, a common requirement is to only read specific files from a folder.

For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression.

This article uses example patterns to show you how to read specific files from a sample list.

Sample files

Assume that the following files are located in the root folder.

//root/1999.txt
//root/2000.txt
//root/2001.txt
//root/2002.txt
//root/2003.txt
//root/2004.txt
//root/2005.txt
//root/2020/04.txt
//root/2020/05.txt
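
If you want to try the patterns yourself, the following is a minimal sketch that recreates the sample layout on a local disk. It assumes a temporary directory stands in for //root. The helper name create_sample_files and the file contents are illustrative and are not part of any Spark or Databricks API.

```python
import tempfile
from pathlib import Path

# Relative paths that mirror the sample list; "root" is a local stand-in for //root.
SAMPLE_PATHS = [
    "root/1999.txt",
    "root/2000.txt",
    "root/2001.txt",
    "root/2002.txt",
    "root/2003.txt",
    "root/2004.txt",
    "root/2005.txt",
    "root/2020/04.txt",
    "root/2020/05.txt",
]


def create_sample_files(base_dir):
    """Create one small text file per sample path under base_dir and return the paths."""
    created = []
    for rel in SAMPLE_PATHS:
        path = Path(base_dir) / rel
        # Create the parent folders (root and root/2020) on demand.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"contents of {rel}\n")
        created.append(path)
    return created


if __name__ == "__main__":
    base = tempfile.mkdtemp()  # scratch location, assumed to be writable
    for p in create_sample_files(base):
        print(p)
```

A local temporary directory is only visible to the machine that created it, so this layout is suited to a local session rather than a shared cluster.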

Glob patterns

Asterisk

* - The asterisk matches zero or more characters. It is a wild card for any number of characters.

This example matches all files with a .txt extension in the root folder. It does not search the contents of the 2020 folder.

display(spark.read.format("text").load("//root/*.txt"))

Question mark

? - The question mark matches a single character. It is a wild card that is limited to replacing a single character.

This example matches all files from the root folder, except 1999.txt. It does not search the contents of the 2020 folder.

display(spark.read.format("text").load("//root/200?.txt"))
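
The difference between the asterisk and the question mark is easiest to see on names of different lengths. The sketch below uses Python's standard fnmatch module as a local approximation, on the assumption that it treats these two wild cards the same way when applied to bare file names. The names 200.txt and 20011.txt are hypothetical and are not in the sample list.

```python
from fnmatch import fnmatchcase

# Hypothetical names on either side of the boundary; only 2001.txt is a sample file.
names = ["200.txt", "2001.txt", "20011.txt"]

# "*" accepts any number of characters after "200", including none.
star_matches = [n for n in names if fnmatchcase(n, "200*.txt")]

# "?" accepts exactly one character after "200".
question_matches = [n for n in names if fnmatchcase(n, "200?.txt")]

print(star_matches)      # ['200.txt', '2001.txt', '20011.txt']
print(question_matches)  # ['2001.txt']
```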

Character class

[ab] - The character class matches a single character from the set. It is represented by the characters you want to match inside a set of brackets.

This example matches all files with a 2 or 3 in place of the matched character. It returns 2002.txt and 2003.txt from the sample files.

display(spark.read.format("text").load("//root/200[23].txt"))

Negated character class

[^ab] - The negated character class matches a single character that is not in the set. It is represented by a caret (^) followed by the characters you want to exclude, inside a set of brackets.

This example matches all files except those with a 2 or 3 in place of the matched character. It returns 2000.txt, 2001.txt, 2004.txt, and 2005.txt from the sample files.

display(spark.read.format("text").load("//root/200[^23].txt"))

Character range

[a-b] - The character range matches a single character in the range of values. It is represented by the first and last characters of the range you want to match, separated by a hyphen, inside a set of brackets.

This example matches all files with a character within the search range in place of the matched character. It returns 2002.txt, 2003.txt, 2004.txt, and 2005.txt from the sample files.

display(spark.read.format("text").load("//root/200[2-5].txt"))

Negated character range

[^a-b] - The negated character range matches a single character that is not in the range of values. It is represented by a caret (^) followed by the range of characters you want to exclude, inside a set of brackets.

This example matches all files with a character outside the search range in place of the matched character. It returns 2000.txt and 2001.txt from the sample files.

display(spark.read.format("text").load("//root/200[^2-5].txt"))

Alternation

{a,b} - Alternation matches either expression. It is represented by the expressions you want to match inside a set of curly brackets.

This example matches all files that have one of the two listed expressions at that position in the name. It returns 2004.txt and 2005.txt from the sample files.

display(spark.read.format("text").load("//root/20{04,05}.txt"))
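
To check a pattern against the sample list without a cluster, you can translate it into a regular expression. The following is a minimal sketch that covers only the constructs in this article. It is not the matcher that Spark uses, and the names glob_to_regex and match_files are illustrative. It assumes that wild cards never cross a folder separator and that every character inside curly brackets, including a space, is literal. Under that assumption the alternatives are written without whitespace.

```python
import re

SAMPLE_FILES = [
    "//root/1999.txt",
    "//root/2000.txt",
    "//root/2001.txt",
    "//root/2002.txt",
    "//root/2003.txt",
    "//root/2004.txt",
    "//root/2005.txt",
    "//root/2020/04.txt",
    "//root/2020/05.txt",
]


def glob_to_regex(pattern):
    """Translate a glob that uses *, ?, [..], [^..], and {a,b} into a regex string."""
    parts = []
    brace_depth = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            parts.append("[^/]*")  # zero or more characters within one folder level
        elif ch == "?":
            parts.append("[^/]")   # exactly one character within one folder level
        elif ch == "[":
            end = pattern.index("]", i + 1)
            body = pattern[i + 1:end]
            if body.startswith("^"):
                # Negated class or range; also exclude the separator.
                parts.append("[^/" + body[1:] + "]")
            else:
                parts.append("[" + body + "]")
            i = end  # skip past the closing bracket
        elif ch == "{":
            parts.append("(?:")
            brace_depth += 1
        elif ch == "}" and brace_depth > 0:
            parts.append(")")
            brace_depth -= 1
        elif ch == "," and brace_depth > 0:
            parts.append("|")  # separates alternatives; spaces stay literal
        else:
            parts.append(re.escape(ch))
        i += 1
    return "".join(parts)


def match_files(pattern, files):
    """Return the files whose full path matches the glob pattern."""
    regex = re.compile(glob_to_regex(pattern))
    return [f for f in files if regex.fullmatch(f)]


if __name__ == "__main__":
    for pattern in [
        "//root/*.txt",
        "//root/200?.txt",
        "//root/200[23].txt",
        "//root/200[^23].txt",
        "//root/200[2-5].txt",
        "//root/200[^2-5].txt",
        "//root/20{04,05}.txt",
    ]:
        print(pattern, "->", match_files(pattern, SAMPLE_FILES))
```

With this sketch, "//root/20{04,05}.txt" selects 2004.txt and 2005.txt, while the same pattern with a space after the comma selects only 2004.txt, because the second alternative then begins with a space.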