
Input/Output

bodo.pandas.read_parquet

bodo.pandas.read_parquet(
    path,
    engine="auto",
    columns=None,
    storage_options=None,
    use_nullable_dtypes=lib.no_default,
    dtype_backend=lib.no_default,
    filesystem=None,
    filters=None,
    **kwargs,
) -> BodoDataFrame

Creates a BodoDataFrame object for reading from parquet file(s) lazily.

Parameters

path: str or list[str]: Location of the Parquet file(s) to read. Refer to pandas.read_parquet for more details. Unlike Pandas, this argument also accepts a list of paths.

Providing a non-default value for any other parameter triggers a fallback to pandas.read_parquet.

Returns

BodoDataFrame

Example

import bodo.pandas as bd

original_df = bd.DataFrame(
    {"foo": range(15), "bar": range(15, 30)}
)

original_df.to_parquet("example.pq")

restored_df = bd.read_parquet("example.pq")
print(type(restored_df))
print(restored_df.head())

Output:

<class 'bodo.pandas.frame.BodoDataFrame'>
   foo  bar
0    0   15
1    1   16
2    2   17
3    3   18
4    4   19

bodo.pandas.read_iceberg

bodo.pandas.read_iceberg(
    table_identifier: str,
    catalog_name: str | None = None,
    catalog_properties: dict[str, Any] | None = None,
    row_filter: str | None = None,
    selected_fields: tuple[str] | None = None,
    case_sensitive: bool = True,
    snapshot_id: int | None = None,
    limit: int | None = None,
    scan_properties: dict[str, Any] | None = None,
    location: str | None = None,
) -> BodoDataFrame

Creates a BodoDataFrame object for reading from an Iceberg table lazily.

Refer to pandas.read_iceberg for more details.

Warning

This function is experimental in Pandas and may change in future releases.

Parameters

table_identifier: str: Identifier of the Iceberg table to read, in the format schema.table.

catalog_name: str, optional: Name of the catalog to use. If not provided, the default catalog will be used. See PyIceberg's documentation for more details.

catalog_properties: dict[str, Any], optional: Properties for the catalog connection.

row_filter: str, optional: Expression used to filter rows.

selected_fields: tuple[str], optional: Fields to select from the table. If not provided, all fields are selected.

snapshot_id: int, optional: ID of the snapshot to read from. If not provided, the latest snapshot will be used.

limit: int, optional: Maximum number of rows to read. If not provided, all rows will be read.

location: str, optional: Location of the table (if supported by the catalog). If a path is passed and both catalog_name and catalog_properties are None, a filesystem catalog rooted at the provided location is used. If the location is an S3 Tables ARN, the S3TablesCatalog is used.

Non-default values for case_sensitive and scan_properties will trigger a fallback to pandas.read_iceberg.

Returns

BodoDataFrame

Examples

Simple read of a table stored without a catalog on the filesystem:

import bodo.pandas as bd

df = bd.read_iceberg("my_table", location="s3://path/to/iceberg/warehouse")

Read a table using a predefined PyIceberg catalog:

import bodo.pandas as bd

df = bd.read_iceberg(
    table_identifier="my_schema.my_table",
    catalog_name="my_catalog",
    row_filter="col1 > 10",
    selected_fields=("col1", "col2"),
    snapshot_id=123456789,
    limit=1000
)

Read a table using a new PyIceberg catalog with custom properties:

import bodo.pandas as bd
import pyiceberg.catalog

# Placeholder: directory containing the Iceberg warehouse.
path_to_warehouse_dir = "/path/to/warehouse"

df = bd.read_iceberg(
    table_identifier="my_schema.my_table",
    catalog_properties={
        pyiceberg.catalog.PY_CATALOG_IMPL: "bodo.io.iceberg.catalog.dir.DirCatalog",
        pyiceberg.catalog.WAREHOUSE_LOCATION: path_to_warehouse_dir,
    },
)

Read a table from an S3 Tables bucket using the location parameter:

import bodo.pandas as bd

df = bd.read_iceberg(
    table_identifier="my_table",
    location="arn:aws:s3tables:<region>:<account_number>:my-bucket/my-table"
)