Database Catalogs¶
Database Catalogs are configuration objects that grant BodoSQL access to load tables from a database.
For example, a user who wants to load data from Snowflake will create a SnowflakeCatalog
that grants BodoSQL access to their Snowflake account and the tables of interest.
A database catalog can be registered during the construction of the BodoSQLContext
by passing it in as a parameter, or can be manually set using the
BodoSQLContext.add_or_replace_catalog
API. Currently, a BodoSQLContext
can support at most one database catalog.
When using a catalog in a BodoSQLContext, we strongly recommend creating the BodoSQLContext
once in regular Python and then passing it as an argument to JIT functions. There is no benefit to
creating the BodoSQLContext inside JIT code, and doing so can increase compilation time.
catalog = bodosql.SnowflakeCatalog(
    username,
    password,
    account_name,
    "DEMO_WH",  # warehouse name
    "SNOWFLAKE_SAMPLE_DATA",  # database name
)
bc = bodosql.BodoSQLContext({"LOCAL_TABLE1": df1}, catalog=catalog)

@bodo.jit
def run_query(bc):
    return bc.sql(
        "SELECT r_name, local_id FROM TPCH_SF1.REGION, local_table1 "
        "WHERE R_REGIONKEY = local_table1.region_key ORDER BY r_name"
    )

run_query(bc)
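Alternatively, a catalog can be attached to an existing context with the
BodoSQLContext.add_or_replace_catalog
API. A minimal sketch, reusing the catalog object from above; note that the method
returns a new BodoSQLContext rather than modifying the original, so the result must be captured:

bc = bodosql.BodoSQLContext({"LOCAL_TABLE1": df1})

# Returns a new context with the catalog attached; bc itself is not mutated.
bc = bc.add_or_replace_catalog(catalog)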
Database catalogs can be used alongside local, in-memory DataFrame
or TablePath
tables. If a table is
specified without a schema then BodoSQL resolves the table in the following order:
- Default Catalog Schema
- Local (in-memory) DataFrames / TablePath names
An error is raised if the table cannot be resolved after searching through both of these data sources.
This ordering means that in the event of a name conflict between a table in the database catalog and a local table, the table in the database catalog is used.
If a user wants to use the local table instead, they can explicitly qualify the table with the local schema __BODOLOCAL__.
For example, a query against the local table from the first snippet could be written as follows (a sketch reusing that context):
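@bodo.jit
def run_local_query(bc):
    # __BODOLOCAL__ forces resolution to the registered local DataFrame,
    # even if the catalog also contains a table named LOCAL_TABLE1.
    return bc.sql("SELECT local_id FROM __BODOLOCAL__.LOCAL_TABLE1")

run_local_query(bc)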
Currently, BodoSQL supports catalogs for Snowflake, a user's file system, Tabular, and AWS Glue, as described in the sections below. Support for other data storage systems will be added in future releases.
SnowflakeCatalog¶
With a Snowflake Catalog, users only have to specify their Snowflake connection once.
They can then access any tables of interest in their Snowflake account.
Currently, a Snowflake Catalog requires a default database (equivalent to Snowflake's USE DATABASE), as shown below.
catalog = bodosql.SnowflakeCatalog(
    username,
    password,
    account_name,
    "DEMO_WH",  # warehouse name
    "SNOWFLAKE_SAMPLE_DATA",  # default database name
)
bc = bodosql.BodoSQLContext(catalog=catalog)

@bodo.jit
def run_query(bc):
    return bc.sql("SELECT r_name FROM TPCH_SF1.REGION ORDER BY r_name")

run_query(bc)
BodoSQL does not currently support Snowflake syntax for specifying session defaults
(e.g. USE SCHEMA <NAME>). Instead, users can pass session parameters through the optional
connection_params argument, which accepts a Dict[str, str] mapping each parameter name to its value.
For example, users can provide a default schema to simplify the previous example.
catalog = bodosql.SnowflakeCatalog(
    username,
    password,
    account,
    "DEMO_WH",  # warehouse name
    "SNOWFLAKE_SAMPLE_DATA",  # database name
    connection_params={"schema": "TPCH_SF1"},
)
bc = bodosql.BodoSQLContext(catalog=catalog)

@bodo.jit
def run_query(bc):
    return bc.sql("SELECT r_name FROM REGION ORDER BY r_name")

run_query(bc)
Internally, Bodo uses the following connections to Snowflake:
- A JDBC connection to lazily fetch metadata.
- The Snowflake Python Connector's distributed fetch API to load batches of Arrow data.
API Reference¶
- bodosql.SnowflakeCatalog(username: str, password: str, account: str, warehouse: str, database: str, connection_params: Optional[Dict[str, str]] = None, iceberg_volume: Optional[str] = None)

  Constructor for SnowflakeCatalog. This allows users to execute queries on tables stored in Snowflake when the SnowflakeCatalog object is registered with a BodoSQLContext.

  Arguments

  - username: Snowflake account username.
  - password: Snowflake account password.
  - account: Snowflake account name.
  - warehouse: Snowflake warehouse to use when loading data.
  - database: Name of the Snowflake database to load data from. The Snowflake Catalog is currently restricted to using a single Snowflake database.
  - connection_params: A dictionary of Snowflake session parameters.
  - iceberg_volume: The name of a storage volume to use for writing Iceberg tables. When provided, any tables created by BodoSQL will be written as Iceberg tables.
Supported Query Types¶
The SnowflakeCatalog
currently supports the following types of SQL queries:
- SELECT
- INSERT INTO
- DELETE
- CREATE TABLE AS
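For example, a write query might look like the following (a sketch; DEST_TABLE is a hypothetical table assumed to exist in the catalog's default database):

@bodo.jit
def write_query(bc):
    # Hypothetical destination table: assumes TPCH_SF1.DEST_TABLE already
    # exists in the default database.
    bc.sql("INSERT INTO TPCH_SF1.DEST_TABLE SELECT r_name FROM TPCH_SF1.REGION")

write_query(bc)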
FileSystemCatalog¶
The FileSystemCatalog
allows users to read and write tables using their local file system or S3 storage
without needing access to a proper database. To use this catalog, you select a root directory,
and the catalog treats each subdirectory as a schema that you can reference in queries. We recommend always
using at least one schema to avoid potential issues with table resolution. For example, the following code shows
how a user could read a table called MY_TABLE
that is located at s3://my_bucket/MY_SCHEMA/MY_TABLE
.
catalog = bodosql.FileSystemCatalog(
    "s3://my_bucket",  # root directory
)
bc = bodosql.BodoSQLContext(catalog=catalog)

@bodo.jit
def run_query(bc):
    return bc.sql("SELECT * FROM MY_SCHEMA.MY_TABLE")

run_query(bc)
When working with tables in the FileSystemCatalog, BodoSQL uses the full name of any
directory or file as the object's name, and these names are case sensitive. When constructing a query
you must follow the BodoSQL rules for identifier case sensitivity.
To simplify your queries you can also provide a default schema resolution path to the FileSystemCatalog
constructor.
For example, this code provides a default schema of MY_SCHEMA.other_schema
for loading OTHER_TABLE
from
s3://my_bucket/MY_SCHEMA/other_schema/OTHER_TABLE
.
catalog = bodosql.FileSystemCatalog(
    "s3://my_bucket",
    default_schema='MY_SCHEMA."other_schema"',
)
bc = bodosql.BodoSQLContext(catalog=catalog)

@bodo.jit
def run_query(bc):
    return bc.sql("SELECT * FROM OTHER_TABLE")

run_query(bc)
API Reference¶
- bodosql.FileSystemCatalog(root: str, default_write_format: str = "iceberg", default_schema: str = ".")

  Constructor for FileSystemCatalog. This allows users to treat a file system as a database for querying or writing tables with a BodoSQLContext.

  Arguments

  - root: File system path that provides the root directory for the database. This can be either a local file system path or an S3 path.
  - default_write_format: The default format to use when writing tables with CREATE TABLE AS. This can be either iceberg or parquet.
  - default_schema: The default schema to use when resolving tables. This should be a "." separated string that represents the path to the default schema. Each value separated by a "." should be treated as its own SQL identifier. If no default schema is provided, the root directory is used.
Supported Query Types¶
The FileSystemCatalog
currently supports the following types of SQL queries:
- SELECT
- CREATE TABLE AS
Supported Table Types¶
The FileSystemCatalog
currently only supports reading Iceberg tables. It can write tables as either Iceberg or Parquet,
depending on the default_write_format
parameter. When writing tables, any specified schema must already exist as directories
in the file system. Future releases will provide additional table support.
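For example, the following sketch writes a new table as Parquet (the bucket and table names are hypothetical, and the MY_SCHEMA directory must already exist under the root):

catalog = bodosql.FileSystemCatalog(
    "s3://my_bucket",
    default_write_format="parquet",  # new tables are written as Parquet
)
bc = bodosql.BodoSQLContext(catalog=catalog)

@bodo.jit
def write_table(bc):
    bc.sql("CREATE TABLE MY_SCHEMA.NEW_TABLE AS SELECT * FROM MY_SCHEMA.MY_TABLE")

write_table(bc)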
S3 Support¶
The FileSystemCatalog
supports reading and writing tables from S3. When using S3, the root
parameter should be an S3 URI.
BodoSQL uses the following environment variables to connect to S3:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION
If you encounter any issues connecting to S3 or accessing a table, please ensure that these environment variables are set. For more information, please refer to the AWS documentation.
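For example, the credentials could be supplied from Python before any queries run (a sketch; the values are placeholders):

import os

# Placeholder credentials: replace with real values or configure them in the
# environment before launching the process.
os.environ["AWS_ACCESS_KEY_ID"] = "<access key id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret access key>"
os.environ["AWS_REGION"] = "us-east-2"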
TabularCatalog¶
The TabularCatalog
allows users to read and write tables using the REST Iceberg Catalog provided by Tabular.
To use this catalog, you will have to select a warehouse and provide a method of authentication.
catalog = bodosql.TabularCatalog(
    warehouse="warehouse_name",
    credential="clientid:clientsecret",
)
bc = bodosql.BodoSQLContext(catalog=catalog)

@bodo.jit
def run_query(bc):
    return bc.sql("SELECT * FROM MY_SCHEMA.MY_TABLE")

run_query(bc)
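Alternatively, a temporary OAuth2 token can be supplied instead of a client credential (a sketch; the token value is a placeholder):

catalog = bodosql.TabularCatalog(
    warehouse="warehouse_name",
    token="<oauth2 token>",  # placeholder: must be valid for this warehouse
)
bc = bodosql.BodoSQLContext(catalog=catalog)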
When constructing a query you must follow the BodoSQL rules for identifier case sensitivity.
API Reference¶
- bodosql.TabularCatalog(warehouse: str, rest_uri: str = "https://api.tabular.io/ws", credential: Optional[str] = None, token: Optional[str] = None)

  Constructor for TabularCatalog. This allows users to use a Tabular Iceberg Warehouse as a database for querying or writing tables with a BodoSQLContext. Either credential or token must be provided.

  Arguments

  - warehouse: Name of the Tabular Iceberg Warehouse to query.
  - rest_uri: The REST URI for Tabular's REST Iceberg catalog. Defaults to https://api.tabular.io/ws.
  - credential: The Tabular credential to use for authentication. This should be in the form of clientid:clientsecret.
  - token: A temporary OAuth2 token to use for authentication. This token should be a valid token for the Tabular Warehouse.
Supported Query Types¶
The TabularCatalog
currently supports the following types of SQL queries:
- SELECT
- CREATE TABLE AS
S3 Support¶
The TabularCatalog
supports reading and writing tables stored on S3.
S3 access is provided by the Tabular Warehouse and does not require any additional configuration.
GlueCatalog¶
The GlueCatalog
allows users to read and write tables using AWS Glue.
To use this catalog, you select a warehouse, which is the Glue S3 bucket. The bucket must already exist.
The s3:// prefix is optional, i.e., both s3://bucket_name and bucket_name are valid.
catalog = bodosql.GlueCatalog(
    warehouse="warehouse_name",
)
bc = bodosql.BodoSQLContext(catalog=catalog)

@bodo.jit
def run_query(bc):
    return bc.sql("SELECT * FROM MY_SCHEMA.MY_TABLE")

run_query(bc)
When constructing a query you must follow the BodoSQL rules for identifier case sensitivity.
Authentication / Authorization¶
Before creating a catalog for the Glue S3 bucket, you must first ensure that the cloud config role for the workspace has access to the Glue S3 bucket with the following permissions:
{
    "Sid": "BodoPlatformCatalog",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket"
    ],
    "Resource": "arn:aws:s3:::<bucket>"
}
The following are the steps to find the cloud config role:
* Find the Cloud Config uuid in the workspace details.
* Find the Cloud Config with the same uuid on the Cloud Configurations page.
* Find the Role ARN in the details of the Cloud Config. It may look like BodoPlatformUser-XXXXXXXX if it was created using CloudFormation or an Access Key.
The Bodo clusters that run queries against the Glue Catalog need to be created with an instance role. This instance role needs read/write access to the Glue S3 bucket as well as appropriate Glue permissions. The exact set of Glue permissions depends on the queries run against the Glue database. The following is an example that should work for most use cases.
{
    "Sid": "BodoPlatformAccessGlueCatalog",
    "Effect": "Allow",
    "Action": [
        "glue:CreateDatabase",
        "glue:CreateSchema",
        "glue:CreateTable",
        "glue:DeleteDatabase",
        "glue:DeleteSchema",
        "glue:DeleteTable",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetSchema",
        "glue:GetSchemaByDefinition",
        "glue:GetSchemaVersion",
        "glue:GetSchemaVersionsDiff",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions",
        "glue:ListSchemas",
        "glue:ListSchemaVersions",
        "glue:PutSchemaVersionMetadata",
        "glue:QuerySchemaVersionMetadata",
        "glue:RegisterSchemaVersion",
        "glue:RemoveSchemaVersionMetadata",
        "glue:UpdateDatabase",
        "glue:UpdateSchema",
        "glue:UpdateTable"
    ],
    "Resource": [
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/*<bodo-db>*",
        "arn:aws:glue:*:*:table/*<bodo-db>*/<table>",
        "arn:aws:glue:*:*:connection/*<bodo-connection>*",
        "arn:aws:glue:*:*:session/*<bodo-session>*"
    ]
},
{
    "Sid": "BodoPlatformAccessGlueBucket",
    "Effect": "Allow",
    "Action": [
        "s3:*"
    ],
    "Resource": [
        "arn:aws:s3:::<bucket>",
        "arn:aws:s3:::<bucket>/*"
    ]
}
For workspaces that have PrivateLink enabled or have no internet access, users need to create an AWS Glue endpoint in the VPC to allow access to Glue.
API Reference¶
- bodosql.GlueCatalog(warehouse: str)

  Constructor for GlueCatalog. This allows users to use an AWS Glue Iceberg Warehouse as a database for querying or writing tables with a BodoSQLContext.

  Arguments

  - warehouse: Name of the Glue S3 bucket, with or without the s3:// prefix, i.e., both s3://bucket_name and bucket_name are valid.
Supported Query Types¶
The GlueCatalog
currently supports the following types of SQL queries:
- SELECT
- CREATE TABLE AS
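For example, a new table could be created in the Glue warehouse with CREATE TABLE AS (a sketch; the schema and table names are hypothetical, reusing the context from the example above):

@bodo.jit
def create_table(bc):
    # Hypothetical names: writes a new table into the Glue warehouse.
    bc.sql("CREATE TABLE MY_SCHEMA.NEW_TABLE AS SELECT * FROM MY_SCHEMA.MY_TABLE")

create_table(bc)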