
Input/Output

bodo.pandas.read_parquet

bodo.pandas.read_parquet(
    path,
    engine="auto",
    columns=None,
    storage_options=None,
    use_nullable_dtypes=lib.no_default,
    dtype_backend=lib.no_default,
    filesystem=None,
    filters=None,
    **kwargs,
) -> BodoDataFrame

Creates a BodoDataFrame object for lazily reading Parquet file(s).

Parameters

path: str or list[str]: Location of the Parquet file(s) to read. Refer to pandas.read_parquet for more details. Note that the accepted types for this argument differ from pandas.read_parquet.

All other parameters will trigger a fallback to pandas.read_parquet if a non-default value is provided.
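For instance, passing a non-default value for columns is still accepted, but the call is then executed through pandas.read_parquet rather than Bodo's lazy reader (a minimal sketch; "example.pq" is a placeholder path):

import bodo.pandas as bodo_pd

# Non-default `columns` triggers the fallback to pandas.read_parquet.
df = bodo_pd.read_parquet("example.pq", columns=["foo"])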

Returns

BodoDataFrame

Example

import bodo
import bodo.pandas as bodo_pd
import pandas as pd

original_df = pd.DataFrame(
    {"foo": range(15), "bar": range(15, 30)}
)

@bodo.jit
def write_parquet(df):
    df.to_parquet("example.pq")

write_parquet(original_df)

restored_df = bodo_pd.read_parquet("example.pq")
print(type(restored_df))
print(restored_df.head())

Output:

<class 'bodo.pandas.frame.BodoDataFrame'>
   foo  bar
0    0   15
1    1   16
2    2   17
3    3   18
4    4   19
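
Because the read is lazy, pandas-style operations can be chained on the returned BodoDataFrame before results are materialized. A minimal sketch continuing the example above (assuming the usual lazy-evaluation behavior of BodoDataFrame; the filter expression is illustrative):

# Build on the lazily-read frame; execution is deferred until output is needed.
filtered = restored_df[restored_df["bar"] > 20]
print(filtered.head())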

bodo.pandas.read_iceberg

bodo.pandas.read_iceberg(
    table_identifier: str,
    catalog_name: str | None = None,
    catalog_properties: dict[str, Any] | None = None,
    row_filter: str | None = None,
    selected_fields: tuple[str] | None = None,
    case_sensitive: bool = True,
    snapshot_id: int | None = None,
    limit: int | None = None,
    scan_properties: dict[str, Any] | None = None,
) -> BodoDataFrame

Creates a BodoDataFrame object for lazily reading an Iceberg table.

Refer to pandas.read_iceberg for more details.

Warning

This function is experimental in Pandas and may change in future releases.

Parameters

table_identifier: str: Identifier of the Iceberg table to read. This should be in the format schema.table.

catalog_name: str, optional: Name of the catalog to use. If not provided, the default catalog will be used. See PyIceberg's documentation for more details.

catalog_properties: dict[str, Any], optional: Properties for the catalog connection.

row_filter: str, optional: Expression used to filter rows.

selected_fields: tuple[str], optional: Fields to select from the table. If not provided, all fields will be selected.

snapshot_id: int, optional: ID of the snapshot to read from. If not provided, the latest snapshot will be used.

limit: int, optional: Maximum number of rows to read. If not provided, all rows will be read.

Non-default values for case_sensitive or scan_properties will trigger a fallback to pandas.read_iceberg.
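
For instance (a minimal sketch; the table and catalog names are placeholders), passing a non-default case_sensitive routes the read through pandas.read_iceberg:

import bodo.pandas as bodo_pd

# Non-default `case_sensitive` triggers the fallback to pandas.read_iceberg.
df = bodo_pd.read_iceberg(
    table_identifier="my_schema.my_table",
    catalog_name="my_catalog",
    case_sensitive=False,
)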

Returns

BodoDataFrame

Example

Read a table using a predefined PyIceberg catalog.

import bodo
import bodo.pandas as bodo_pd
df = bodo_pd.read_iceberg(
    table_identifier="my_schema.my_table",
    catalog_name="my_catalog",
    row_filter="col1 > 10",
    selected_fields=("col1", "col2"),
    snapshot_id=123456789,
    limit=1000
)

Read a table using a new PyIceberg catalog with custom properties.

import bodo
import bodo.pandas as bodo_pd
import pyiceberg.catalog
# path_to_warehouse_dir is a placeholder for the path to the Iceberg warehouse directory.
df = bodo_pd.read_iceberg(
    table_identifier="my_schema.my_table",
    catalog_properties={
        pyiceberg.catalog.PY_CATALOG_IMPL: "bodo.io.iceberg.catalog.dir.DirCatalog",
        pyiceberg.catalog.WAREHOUSE_LOCATION: path_to_warehouse_dir,
    },
)