Reading and Writing Iceberg in Bodo¶
SQL¶
BodoSQL can be used to read, create, or insert into an Iceberg table. Iceberg Tables are automatically detected by existing catalogs and are used during read:
- Snowflake Iceberg Tables are automatically detected when using the
SnowflakeCatalog
. - Tables within the specified warehouse are automatically detected when using the
TabularCatalog
. - Hadoop Iceberg Catalogs and Tables are detected when using the
FileSystemCatalog
. - Other Catalogs supported in the Python APIs can be accessed via the
TablePath
API using the same connection string syntax.
To query an Iceberg table, use the standard SELECT
syntax. To learn more about supported SELECT syntax, see the SELECT API reference.
Write Support¶
The CREATE TABLE
syntax can be used to create Iceberg tables:
Inserting into existing Iceberg tables is supported via the INSERT INTO
syntax:
Snowflake Iceberg Write Support¶
To create Iceberg tables in Snowflake, a Snowflake External Volume is required. The volume to use must be specified via the exvol
argument to the SnowflakeCatalog
:
catalog = bodosql.SnowflakeCatalog(
...
exvol='<... Snowflake Volume ...>'
)
bc = bodosql.BodoSQLContext(catalog=catalog)
Warning
- Inserting into Snowflake Managed Iceberg Tables is not supported.
- When the
exvol
parameter is specified, all tables constructed viaCREATE TABLE
will be Snowflake Iceberg tables.
Python¶
Bodo supports reading and writing to Iceberg tables from multiple catalogs and object stores (local, S3, and HDFS).
- Iceberg Reads are supported through the
pandas.read_sql_table
API. - Iceberg Writes are supported through the
pandas.DataFrame.to_sql
API.
Connection String Syntax¶
To specify the Iceberg catalog in the Pandas APIs, the conn
parameter must contain a connection string in one of the following formats.
Iceberg connection strings vary by catalog, but in general are of the form iceberg<+conn>://<path><?params>
where
- <conn>://<path>
is the location of the catalog or Iceberg warehouse
- params
is a list of properties to pass to the catalog. Each parameter must be of the form <key>=<value>
and separated with &
, similar to HTTP URLs.
The following parameters are supported:
- type
: Type of catalog. The supported values are listed below. When the connection string is ambiguous, this parameter is used to determine the type of catalog implementation.
- warehouse
: Location of the warehouse. Required when creating a new table using a Glue or Hive catalog.
The following catalogs are supported:
-
Hadoop Catalog on Local Filesystem:
- Used when
type=hadoop
is specified or when<conn>
isfile
or empty <path>
is the absolute path to the warehouse (directory containing the database schema)- Parameter
warehouse
will be ignored if specified - E.g.
iceberg://<ABSOLUTE PATH TO ICEBERG WAREHOUSE>
oriceberg+file://<ABSOLUTE PATH TO ICEBERG WAREHOUSE>
- Used when
-
Hadoop Catalog on S3
- Used when
type=hadoop-s3
is specified or when<conn>
iss3
. <conn>://<path>
is the S3 path to the warehouse (directory or bucket containing the database schema).- Parameter
warehouse
will be ignored if specified. - E.g.
iceberg+s3://<S3 PATH TO ICEBERG WAREHOUSE>
- Used when
-
AWS Glue Catalog
- Connection string must be of the form
iceberg+glue?<params>
. - Parameter
type
will be ignored if specified. - Parameter
warehouse
is required to create a table. - E.g.
iceberg+glue
oriceberg+glue?warehouse=s3://<ICEBERG-BUCKET>
- Connection string must be of the form
-
Hive / Thrift Catalog
- Used when
type=hive
is specified or when<conn>
isthrift
. <conn>://<path>
is the URL to the Thrift catalog, i.e.thrift://localhost:9083
.- Parameter
warehouse
is required to create the table. - E.g.
iceberg+thrift://<THRIFT URL>
- Used when
-
REST Catalog
- Connection string must be of the form
iceberg+rest://<rest-uri>?<params>
. - Parameter
type
will be ignored if specified. - Parameter
warehouse
is required. - Parameter
token
orcredential
is required for authentication and should be retrieved from the REST catalog provider. - E.g.
iceberg+rest
oriceberg+rest://<rest-uri>?warehouse=<warehouse>&token=<token>
- Connection string must be of the form
Pandas APIs¶
Example code for reading:
@bodo.jit
def example_read_iceberg() -> pd.DataFrame:
return pd.read_sql_table(
table_name="<... Name of the Iceberg Table ...>",
con="<... Connection String. See previous section ...>",
schema="<... Namespace Path to Iceberg Table ...>"
)
Note
-
The
schema
argument is required for reading Iceberg tables. -
The Iceberg table to read should be located at
<warehouse-location>/<schema>/<table_name>
, whereschema
andtable_name
are the arguments topd.read_sql_table
, andwarehouse-location
is inferred from the connection string based on the description provided above.
An example for writing to Iceberg via pandas.DataFrame.to_sql
:
@bodo.jit(distributed=["df"])
def write_iceberg_table(df: pandas.DataFrame):
df.to_sql(
name="<... Name of the Iceberg Table ...>",
con="<... Connection String. See previous section ...>",
schema="<... Namespace Path to Iceberg Table ..>",
if_exists="replace"
)
Note
schema
argument is required for writing Iceberg tables.- Writing a Pandas Dataframe index to an Iceberg table is not supported. If
index
andindex_label
are provided, they will be ignored. chunksize
,dtype
andmethod
arguments are not supported and will be ignored if provided.