# bodo.pandas.BodoDataFrame.to_iceberg
```python
BodoDataFrame.to_iceberg(
    table_identifier,
    catalog_name=None,
    *,
    catalog_properties=None,
    location=None,
    append=False,
    partition_spec=None,
    sort_order=None,
    properties=None,
    snapshot_properties=None,
)
```
Refer to `pandas.DataFrame.to_iceberg` for more details.
!!! warning
    This function is experimental in Pandas and may change in future releases.
!!! note
    This function assumes that the Iceberg namespace is already created in the catalog.
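If the namespace does not exist yet, it can be created ahead of time through PyIceberg. The sketch below uses placeholder catalog and namespace names; the calls are shown commented out because they require a reachable catalog:

```python
# Hedged sketch (placeholder names): create the Iceberg namespace with
# PyIceberg before calling to_iceberg. Requires a live catalog, so the
# calls are commented out.
#
# from pyiceberg.catalog import load_catalog
#
# catalog = load_catalog("my_catalog", uri="http://localhost:8181")
# catalog.create_namespace("my_namespace")
namespace = "my_namespace"  # placeholder namespace used in the calls above
```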
## Parameters

- `table_identifier` (*str*): Identifier of the table to write.
- `catalog_name` (*str, optional*): Name of the catalog to use. If not provided, the default catalog is used. See PyIceberg's documentation for more details.
- `catalog_properties` (*dict[str, Any], optional*): Properties for the catalog connection.
- `location` (*str, optional*): Location of the table (if supported by the catalog). If a path is passed and both `catalog_name` and `catalog_properties` are None, a filesystem catalog rooted at that location is used. If the location is an S3 Tables ARN, the S3TablesCatalog is used.
- `append` (*bool*): If True, append to the table if it already exists; otherwise overwrite it.
- `partition_spec` (*PartitionSpec, optional*): PyIceberg partition spec for the table (only used when creating a new table). See PyIceberg's documentation for more details.
- `sort_order` (*SortOrder, optional*): PyIceberg sort order for the table (only used when creating a new table). See PyIceberg's documentation for more details.
- `properties` (*dict[str, Any], optional*): Properties to add to the new table.
- `snapshot_properties` (*dict[str, Any], optional*): Properties to add to the new table snapshot.
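As an illustration of the `append` and `snapshot_properties` parameters, a follow-up write that appends instead of overwriting might look like this sketch. The table identifier and snapshot metadata are placeholder values, and the call itself is commented out because it needs an existing table:

```python
# Placeholder snapshot metadata recorded on the commit created by this write.
snapshot_properties = {"ingest-job": "nightly-load"}

# bdf.to_iceberg(
#     "my_namespace.daily_events",      # placeholder table identifier
#     append=True,                      # append rather than overwrite
#     snapshot_properties=snapshot_properties,
# )
```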
## Example
Simple write of a table on the filesystem without a catalog:
```python
import bodo.pandas as bd
from pyiceberg.transforms import IdentityTransform
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.table.sorting import SortField, SortOrder

bdf = bd.DataFrame(
    {
        "one": [-1.0, 1.3, 2.5, 3.0, 4.0, 6.0, 10.0],
        "two": ["foo", "bar", "baz", "foo", "bar", "baz", "foo"],
        "three": [True, False, True, True, True, False, False],
        "four": [-1.0, 5.1, 2.5, 3.0, 4.0, 6.0, 11.0],
        "five": ["foo", "bar", "baz", None, "bar", "baz", "foo"],
    }
)
part_spec = PartitionSpec(PartitionField(2, 1001, IdentityTransform(), "id_part"))
sort_order = SortOrder(SortField(source_id=4, transform=IdentityTransform()))
bdf.to_iceberg(
    "test_table",
    location="./iceberg_warehouse",
    partition_spec=part_spec,
    sort_order=sort_order,
)

out_df = bd.read_iceberg("test_table", location="./iceberg_warehouse")
# Only reads Parquet files of the "foo" partition from storage
print(out_df[out_df["two"] == "foo"])
```
Write a DataFrame to an Iceberg table in S3 Tables using the location parameter:
```python
df.to_iceberg(
    table_identifier="my_table",
    location="arn:aws:s3tables:<region>:<account_number>:my-bucket/my-table",
)
```
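When writing through a named catalog instead of a filesystem location, the connection settings go in `catalog_properties`. A sketch, assuming a REST catalog: the `"uri"` and `"warehouse"` keys follow PyIceberg's catalog configuration conventions, and the values below are placeholders, so the write call is commented out:

```python
# Placeholder PyIceberg catalog settings; "uri" and "warehouse" are standard
# PyIceberg catalog configuration keys, the values are examples only.
catalog_properties = {
    "uri": "http://localhost:8181",
    "warehouse": "s3://my-warehouse/",
}

# df.to_iceberg(
#     "my_namespace.my_table",          # placeholder table identifier
#     catalog_properties=catalog_properties,
# )
```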