How to organize Batches in a file-based Data Asset
In this guide we will demonstrate the ways in which Batches can be organized in a file-based Data Asset. We will discuss how to use a regular expression to indicate which files should be returned as Batches. We will also show how to add Batch Sorters to a Data Asset in order to specify the order in which Batches are returned.
Prerequisites
- A working installation of Great Expectations
- A Datasource that connects to a location with source data files
If you still need to set up and install GX...
Please reference the appropriate one of these guides:
If you still need to connect a Datasource to the location of your source data files...
Please reference the appropriate one of these guides:
Local Filesystems
Google Cloud Storage
Azure Blob Storage
- How to connect to data in Azure Blob Storage using Pandas
- How to connect to data in Azure Blob Storage using Spark
Amazon Web Services S3
If you are using a Datasource that was created with the advanced block-config method please follow the appropriate guide from:
Steps
1. Import GX and instantiate a Data Context
The code to import Great Expectations and instantiate a Data Context is:
import great_expectations as gx
context = gx.get_context()
2. Retrieve a file-based Datasource
For this guide, we will use a previously defined Datasource named "my_datasource". For the purposes of our demonstration, this Datasource is a Pandas Filesystem Datasource that uses a folder named "data" as its base_directory.
To retrieve this Datasource, we will supply the get_datasource(...) method of our Data Context with the name of the Datasource we wish to retrieve:
my_datasource = context.get_datasource("my_datasource")
3. Create a batching_regex
In a file-based Data Asset, any file that matches a provided regular expression (the batching_regex parameter) will be included as a Batch in the Data Asset. Therefore, to organize multiple files into Batches in a single Data Asset, we must define a regular expression that will match one or more of our source data files.
For this example, our Datasource points to a folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
To create a batching_regex that matches multiple files, we will include named groups in our regular expression:
my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
In the above example, the named group "year" will match any four numeric characters in a file name, and the named group "month" will match any two numeric characters. As a result, each of our source data files matches the regular expression.
By naming a group in your batching_regex, you make it something you can reference in the future. When requesting data from this Data Asset, you can use the names of your regular expression groups to limit the Batches that are returned.
For more information, please see: How to request data from a Data Asset
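To make the role of the named groups concrete, here is a minimal sketch that uses only Python's standard `re` module, outside of Great Expectations. It shows what each named group captures for the three example files and how those captured values could be used to narrow down a selection. The helper `select_files` is hypothetical and is not part of GX, and the use of `fullmatch` is an assumption about matching semantics made for this illustration.

```python
import re

# The batching_regex defined in this step.
my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

# The example files in the Datasource's folder.
file_names = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
]

pattern = re.compile(my_batching_regex)

# Collect the values captured by the named groups, keyed by file name.
captured = {}
for file_name in file_names:
    match = pattern.fullmatch(file_name)
    if match is not None:
        captured[file_name] = match.groupdict()


def select_files(captured_groups, **wanted):
    """Hypothetical helper: keep files whose group values equal `wanted`."""
    return [
        name
        for name, groups in captured_groups.items()
        if all(groups.get(key) == value for key, value in wanted.items())
    ]


for file_name, groups in captured.items():
    print(file_name, groups)

# Only the file whose "year" group captured "2020" is selected.
files_2020 = select_files(captured, year="2020")
print(files_2020)
```

Note that the captured values are strings, so the sketch compares against "2020" rather than the integer 2020.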
You can access files that are nested in folders under your Datasource's base_directory!
If your source data files are split into multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Datasource, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
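As an illustration of including a folder path in the regular expression, the sketch below assumes a hypothetical layout with one sub-folder per year under the base_directory, and assumes paths are written with forward slashes. It uses only Python's `re` module to check that the pattern matches the relative paths; it does not show how GX itself resolves paths.

```python
import re

# Hypothetical layout: one sub-folder per year under the base_directory.
relative_paths = [
    "2019/yellow_tripdata_sample_2019-03.csv",
    "2020/yellow_tripdata_sample_2020-07.csv",
    "2021/yellow_tripdata_sample_2021-02.csv",
]

# The folder name is part of the pattern, followed by the file name pattern.
nested_batching_regex = (
    r"\d{4}/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

pattern = re.compile(nested_batching_regex)

# Record the named-group values for every relative path that matches.
nested_matches = {}
for path in relative_paths:
    match = pattern.fullmatch(path)
    if match is not None:
        nested_matches[path] = match.groupdict()

for path, groups in nested_matches.items():
    print(path, groups)
```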
For more information on how to format regular expressions, we recommend referencing Python's official how-to guide for working with regular expressions.
4. Add a Data Asset using the batching_regex
Now that we have put together a regular expression that will match one or more of the files in our Datasource's base_directory, we can use it to create our Data Asset. Since the files in this particular Datasource's base_directory are csv files, we will use the add_csv_asset(...) method of our Datasource to create the new Data Asset:
my_asset = my_datasource.add_csv_asset(
    name="my_taxi_data_asset", batching_regex=my_batching_regex
)
If you choose to omit the batching_regex parameter, your Data Asset will automatically use the regular expression ".*" to match all files.
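To see what the default pattern implies, the following sketch applies ".*" to a hypothetical folder listing with Python's `re` module. The extra "README.txt" entry is invented for illustration. The sketch only demonstrates the regex behavior: every name matches, and the pattern defines no named groups to reference later.

```python
import re

# The default pattern used when batching_regex is omitted.
default_batching_regex = r".*"

# Hypothetical folder contents, including a non-CSV file.
file_names = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
    "README.txt",  # hypothetical extra file
]

# Every file name matches ".*", regardless of its extension.
matched = [
    name for name in file_names if re.fullmatch(default_batching_regex, name)
]

# The default pattern contains no named groups.
named_groups = dict(re.compile(default_batching_regex).groupindex)

print(matched)
print(named_groups)
```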
5. (Optional) Add Batch Sorters to the Data Asset
We will now add a Batch Sorter to our Data Asset. This will allow us to explicitly state the order in which our Batches are returned when we request data from the Data Asset. To do this, we will pass a list of sorters to the add_sorters(...) method of our Data Asset.
The items in our list of sorters will correspond to the names of the groups in our batching_regex that we want to sort our Batches on. Each name is prefixed with a + or a -, depending on whether we want to sort our Batches in ascending or descending order based on the given group.
When there are multiple named groups, we can include multiple items in our sorter list and our Batches will be returned in the order specified by the list: sorted first according to the first item, then the second, and so forth.
In this example we have two named groups, "year" and "month", so our list of sorters can have up to two elements. We will add an ascending sorter based on the contents of the regex group "year" and a descending sorter based on the contents of the regex group "month":
my_asset = my_asset.add_sorters(["+year", "-month"])
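The ordering that this sorter list describes can be sketched in plain Python. This is an emulation for illustration only, not GX's implementation. The last two file names are hypothetical and were added so that the descending "month" tie-breaker is visible, and converting the captured values to integers is an assumption of the sketch. Because Python's sort is stable, applying the sorters in reverse order makes the first sorter the primary key.

```python
import re

my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

file_names = [
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2021-02.csv",
    "yellow_tripdata_sample_2019-11.csv",  # hypothetical extra file
    "yellow_tripdata_sample_2020-01.csv",  # hypothetical extra file
]

pattern = re.compile(my_batching_regex)

# One record per file: the captured group values plus the file name.
records = [
    {**pattern.fullmatch(name).groupdict(), "file": name} for name in file_names
]

sorters = ["+year", "-month"]

# Apply the last sorter first; the stable sort keeps earlier orderings
# intact for records that tie on the current key.
for sorter in reversed(sorters):
    group_name = sorter[1:]
    descending = sorter.startswith("-")
    records.sort(key=lambda record: int(record[group_name]), reverse=descending)

ordered_files = [record["file"] for record in records]
for name in ordered_files:
    print(name)
```

In the result, files are grouped by ascending year, and within a single year the later month comes first.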
6. Use a Batch Request to verify the Data Asset works as desired
To verify that our Data Asset will return the desired files as Batches, we will define a quick Batch Request that includes all the Batches available in the Data Asset. Then we will use that Batch Request to get a list of the returned Batches.
my_batch_request = my_asset.build_batch_request()
batches = my_asset.get_batch_list_from_batch_request(my_batch_request)
Because a Batch List contains a lot of metadata, it will be easiest to verify which files were included in the returned Batches if we only look at the batch_spec of each returned Batch:
for batch in batches:
    print(batch.batch_spec)
Next steps
Now that you have further configured a file-based Data Asset, you may want to look into: