When selecting files, a common requirement is to only read specific files from a folder.
For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression.
This article uses example patterns to show you how to read specific files from a sample list.
Sample files
Assume that the following files are located in the root folder.
//root/1999.txt
//root/2000.txt
//root/2001.txt
//root/2002.txt
//root/2003.txt
//root/2004.txt
//root/2005.txt
//root/2020/04.txt
//root/2020/05.txt
Glob patterns
Asterisk
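* - The asterisk matches zero or more characters. It is a wild card for multiple characters.
This example matches all files with a .txt extension in the root folder. It does not search the contents of the 2020 folder, because the asterisk does not match the / path separator.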
%scala
display(spark.read.format("text").load("//root/*.txt"))
Question mark
? - The question mark matches a single character. It is a wild card that is limited to replacing a single character.
This example matches all files from the root folder, except 1999.txt. It does not search the contents of the 2020 folder.
%scala
display(spark.read.format("text").load("//root/200?.txt"))
Character class
[ab] - The character class matches a single character from the set. It is represented by the characters you want to match inside a set of brackets.
This example matches all files with a 2 or 3 in place of the matched character. It returns 2002.txt and 2003.txt from the sample files.
%scala
display(spark.read.format("text").load("//root/200[23].txt"))
Negated character class
[^ab] - The negated character class matches a single character that is not in the set. It is represented by a caret (^) followed by the characters you want to exclude, inside a set of brackets.
This example matches all files except those with a 2 or 3 in place of the matched character. It returns 2000.txt, 2001.txt, 2004.txt, and 2005.txt from the sample files.
%scala
display(spark.read.format("text").load("//root/200[^23].txt"))
Character range
[a-b] - The character range matches a single character in the range of values. It is represented by the first and last characters of the range you want to match, separated by a hyphen, inside a set of brackets.
This example matches all files with a character within the search range in place of the matched character. It returns 2002.txt, 2003.txt, 2004.txt, and 2005.txt from the sample files.
%scala
display(spark.read.format("text").load("//root/200[2-5].txt"))
Negated character range
[^a-b] - The negated character range matches a single character that is not in the range of values. It is represented by a caret (^) followed by the range of characters you want to exclude, inside a set of brackets.
This example matches all files with a character outside the search range in place of the matched character. It returns 2000.txt and 2001.txt from the sample files.
%scala
display(spark.read.format("text").load("//root/200[^2-5].txt"))
Alternation
{a,b} - Alternation matches either expression. It is represented by the expressions you want to match inside a set of curly brackets.
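This example matches all files that have either of the two listed expressions, 04 or 05, in place of the braces. It returns 2004.txt and 2005.txt from the sample files. Separate the expressions with a comma only; a space inside the braces is treated as part of the expression.
%scala
display(spark.read.format("text").load("//root/20{04,05}.txt"))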
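Checking a pattern
To see which of the sample files a pattern selects before running a read, you can translate the pattern to a regular expression and filter the sample names locally. The following is a minimal sketch in plain Scala, not the matcher that Spark uses. The names GlobPreview, globToRegex, and matching are hypothetical, the file names are given relative to //root/, and the translation covers only the patterns described in this article (no escape characters or other edge cases).

```scala
import java.util.regex.Pattern

// Hypothetical helper for illustration; a simplified stand-in for real glob matching.
object GlobPreview {
  // Sample file names from this article, relative to //root/.
  val sampleFiles: List[String] = List(
    "1999.txt", "2000.txt", "2001.txt", "2002.txt", "2003.txt",
    "2004.txt", "2005.txt", "2020/04.txt", "2020/05.txt")

  // Translate a glob pattern into an equivalent regular expression.
  def globToRegex(glob: String): String = {
    val sb = new StringBuilder
    var inClass = false // true while inside [...]
    var braceDepth = 0  // nesting level of {...}
    var i = 0
    while (i < glob.length) {
      val c = glob.charAt(i)
      if (inClass) {
        c match {
          case ']'        => inClass = false; sb.append(']')
          case '\\' | '[' => sb.append('\\').append(c) // keep these literal
          case _          => sb.append(c) // ^ and - keep their class meaning
        }
      } else c match {
        case '*'                   => sb.append("[^/]*") // zero or more, within one folder
        case '?'                   => sb.append("[^/]")  // exactly one, within one folder
        case '['                   => inClass = true; sb.append('[')
        case '{'                   => braceDepth += 1; sb.append("(?:")
        case '}' if braceDepth > 0 => braceDepth -= 1; sb.append(')')
        case ',' if braceDepth > 0 => sb.append('|')
        case _                     => sb.append(Pattern.quote(c.toString))
      }
      i += 1
    }
    sb.toString
  }

  // Return the sample files whose whole relative path matches the pattern.
  def matching(glob: String): List[String] = {
    val pattern = Pattern.compile(globToRegex(glob))
    sampleFiles.filter(name => pattern.matcher(name).matches())
  }
}
```

For example, GlobPreview.matching("200[^2-5].txt") returns 2000.txt and 2001.txt, and GlobPreview.matching("2020/*.txt") returns 2020/04.txt and 2020/05.txt. Because the sketch translates * and ? so that they never match the / separator, GlobPreview.matching("*.txt") returns only the seven files directly in the root folder.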