Version: 0.16.14

How to connect to in-memory data using Pandas

In this guide we will demonstrate how to connect to an in-memory Pandas DataFrame. Pandas can read many types of data into its DataFrame class; in this example we will use data that originates in a Parquet file.

Prerequisites

Steps

1. Import the Great Expectations module and instantiate a Data Context

The code to import Great Expectations and instantiate a Data Context is:

import great_expectations as gx

context = gx.get_context()

2. Create a Datasource

To access our in-memory data, we will create a Pandas Datasource:

Python code
datasource = context.sources.add_pandas(name="my_pandas_datasource")

3. Read your source data into a Pandas DataFrame

For this example, we will read a parquet file into a Pandas DataFrame, which we will then use in the rest of this guide.

The code to create the Pandas DataFrame used in this guide is:

Python code
import pandas as pd

dataframe = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-11.parquet")
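Because the Datasource only needs a DataFrame, the data does not have to come from a Parquet file or a remote URL. The following is a minimal sketch, not part of the original guide, that builds an equivalent in-memory DataFrame from CSV text with pd.read_csv. The sample rows and column names are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only and
# are not taken from the taxi dataset used elsewhere in this guide.
csv_text = """vendor_id,passenger_count,trip_distance
1,2,3.5
2,1,0.9
1,4,12.1
"""

# read_csv accepts any file-like object, so wrapping the text in StringIO
# avoids touching the filesystem or the network.
sample_dataframe = pd.read_csv(io.StringIO(csv_text))

print(sample_dataframe.shape)
print(list(sample_dataframe.columns))
```

A DataFrame built this way can stand in for the Parquet-backed one when experimenting offline.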

4. Add a Data Asset to the Datasource

A Pandas DataFrame Data Asset can be defined with two elements:

  • name: The name by which the Data Asset will be referenced in the future
  • dataframe: A Pandas DataFrame containing the data

We will pass the DataFrame from the previous step as the dataframe parameter. For the name parameter, we will define a name in advance by storing it in a Python variable:

Python code
name = "taxi_dataframe"

Now that we have the name and dataframe for our Data Asset, we can create the Data Asset with the code:

Python code
data_asset = datasource.add_dataframe_asset(name=name, dataframe=dataframe)
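Since a Data Asset is defined by just a name and a DataFrame, it can be useful to check both values before passing them to add_dataframe_asset. The sketch below shows one way to do that with plain Pandas. The check_asset_inputs helper is hypothetical and is not part of Great Expectations; the example DataFrame is stand-in data.

```python
import pandas as pd


def check_asset_inputs(name: str, dataframe: pd.DataFrame) -> None:
    """Raise early if the inputs are unlikely to make a useful Data Asset.

    Hypothetical helper for illustration; not part of Great Expectations.
    """
    if not isinstance(name, str) or not name.strip():
        raise ValueError("name must be a non-empty string")
    if not isinstance(dataframe, pd.DataFrame):
        raise TypeError("dataframe must be a pandas DataFrame")
    if dataframe.empty:
        raise ValueError("dataframe has no rows or no columns")


# Stand-in data; in this guide it would be the Parquet-backed dataframe.
example = pd.DataFrame({"trip_distance": [1.2, 3.4]})
check_asset_inputs("taxi_dataframe", example)
```

Running a check like this first surfaces a misspelled variable or an empty read result before it reaches the Datasource.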

Next steps

Now that you have connected to your data, you may want to look into:

Additional information

External APIs

For more information on Pandas read methods, please reference the official Pandas Input/Output documentation.