Reading and Writing Iceberg in Bodo¶
SQL¶
BodoSQL can be used to read, create, or insert into an Iceberg table. Iceberg Tables are automatically detected by existing catalogs and are used during read:
- Snowflake Iceberg Tables are automatically detected when using the
SnowflakeCatalog. - Tables within the specified warehouse are automatically detected when using the
TabularCatalog. - Hadoop Iceberg Catalogs and Tables are detected when using the
FileSystemCatalog. - Other Catalogs supported in the Python APIs can be accessed via the
TablePathAPI using the same connection string syntax.
To query an Iceberg table, use the standard SELECT syntax. To learn more about supported SELECT syntax, see the SELECT API reference.
Write Support¶
The CREATE TABLE syntax can be used to create Iceberg tables:
Inserting into existing Iceberg tables is supported via the INSERT INTO syntax:
Snowflake Iceberg Write Support¶
To create Iceberg tables in Snowflake, a Snowflake External Volume is required. The volume to use must be specified via the exvol argument to the SnowflakeCatalog:
catalog = bodosql.SnowflakeCatalog(
...
exvol='<... Snowflake Volume ...>'
)
bc = bodosql.BodoSQLContext(catalog=catalog)
Warning
- Inserting into Snowflake Managed Iceberg Tables is not supported.
- When the
exvolparameter is specified, all tables constructed viaCREATE TABLEwill be Snowflake Iceberg tables.
Python¶
Bodo supports reading and writing to Iceberg tables from multiple catalogs and object stores (local, S3, and HDFS).
- Iceberg Reads are supported through the
pandas.read_sql_tableAPI. - Iceberg Writes are supported through the
pandas.DataFrame.to_sqlAPI.
Connection String Syntax¶
To specify the Iceberg catalog in the Pandas APIs, the conn parameter must contain a connection string in one of the following formats.
Iceberg connection strings vary by catalog, but in general are of the form iceberg<+conn>://<path><?params> where
- <conn>://<path> is the location of the catalog or Iceberg warehouse
- params is a list of properties to pass to the catalog. Each parameter must be of the form <key>=<value> and separated with &, similar to HTTP URLs.
The following parameters are supported:
- type: Type of catalog. The supported values are listed below. When the connection string is ambiguous, this parameter is used to determine the type of catalog implementation.
- warehouse: Location of the warehouse. Required when creating a new table using a Glue or Hive catalog.
The following catalogs are supported:
-
Hadoop Catalog on Local Filesystem:
- Used when
type=hadoopis specified or when<conn>isfileor empty <path>is the absolute path to the warehouse (directory containing the database schema)- Parameter
warehousewill be ignored if specified - E.g.
iceberg://<ABSOLUTE PATH TO ICEBERG WAREHOUSE>oriceberg+file://<ABSOLUTE PATH TO ICEBERG WAREHOUSE>
- Used when
-
Hadoop Catalog on S3
- Used when
type=hadoop-s3is specified or when<conn>iss3. <conn>://<path>is the S3 path to the warehouse (directory or bucket containing the database schema).- Parameter
warehousewill be ignored if specified. - E.g.
iceberg+s3://<S3 PATH TO ICEBERG WAREHOUSE>
- Used when
-
AWS Glue Catalog
- Connection string must be of the form
iceberg+glue?<params>. - Parameter
typewill be ignored if specified. - Parameter
warehouseis required to create a table. - E.g.
iceberg+glueoriceberg+glue?warehouse=s3://<ICEBERG-BUCKET>
- Connection string must be of the form
-
Hive / Thrift Catalog
- Used when
type=hiveis specified or when<conn>isthrift. <conn>://<path>is the URL to the Thrift catalog, i.e.thrift://localhost:9083.- Parameter
warehouseis required to create the table. - E.g.
iceberg+thrift://<THRIFT URL>
- Used when
-
REST Catalog
- Connection string must be of the form
iceberg+rest://<rest-uri>?<params>. - Parameter
typewill be ignored if specified. - Parameter
warehouseis required. - Parameter
tokenorcredentialis required for authentication and should be retrieved from the REST catalog provider. - E.g.
iceberg+restoriceberg+rest://<rest-uri>?warehouse=<warehouse>&token=<token>
- Connection string must be of the form
Pandas APIs¶
Example code for reading:
@bodo.jit
def example_read_iceberg() -> pd.DataFrame:
return pd.read_sql_table(
table_name="<... Name of the Iceberg Table ...>",
con="<... Connection String. See previous section ...>",
schema="<... Namespace Path to Iceberg Table ...>"
)
Note
-
The
schemaargument is required for reading Iceberg tables. -
The Iceberg table to read should be located at
<warehouse-location>/<schema>/<table_name>, whereschemaandtable_nameare the arguments topd.read_sql_table, andwarehouse-locationis inferred from the connection string based on the description provided above.
An example for writing to Iceberg via pandas.DataFrame.to_sql:
@bodo.jit(distributed=["df"])
def write_iceberg_table(df: pandas.DataFrame):
df.to_sql(
name="<... Name of the Iceberg Table ...>",
con="<... Connection String. See previous section ...>",
schema="<... Namespace Path to Iceberg Table ..>",
if_exists="replace"
)
Note
schemaargument is required for writing Iceberg tables.- Writing a Pandas Dataframe index to an Iceberg table is not supported. If
indexandindex_labelare provided, they will be ignored. chunksize,dtypeandmethodarguments are not supported and will be ignored if provided.