Bodo Iceberg Quick Start¶
This quickstart guide will walk you through the process of creating and reading from an Iceberg table using Bodo on your local machine.
Installation¶
Install Bodo to get started, e.g. `pip install -U bodo` or `conda install bodo -c bodo.ai -c conda-forge`.
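To check that the installation worked, print the installed version (a quick sanity check; the exact version string will vary):

```python
import bodo

# Prints the installed Bodo version.
print(bodo.__version__)
```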
Create a Local Iceberg Table with Bodo DataFrame Library¶
This example demonstrates a simple write of a table to the local filesystem without a catalog:
```python
import bodo.pandas as pd
import numpy as np

# Build a 20-million-row DataFrame: "A" cycles through 30 group keys, "B" is a running index.
n = 20_000_000
df = pd.DataFrame({"A": np.arange(n) % 30, "B": np.arange(n)})

# Write it as an Iceberg table named "test_table" under ./iceberg_warehouse.
df.to_iceberg("test_table", location="./iceberg_warehouse")
```
Now let's read the Iceberg table:
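A minimal sketch, assuming `pd.read_iceberg` accepts the same `location` argument used for the write above:

```python
# Read the table back from the same warehouse directory.
df = pd.read_iceberg("test_table", location="./iceberg_warehouse")
print(df)
```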
See the DataFrame Library API reference for more information. Note that this quickstart uses a local Iceberg table, but Bodo also works with Iceberg tables on S3, ADLS, and GCS.
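For example, writing to object storage is just a change of `location` (a sketch; the bucket name is a placeholder and AWS credentials must be configured in your environment):

```python
# Same API as the local write, but with an S3 warehouse location (bucket name is a placeholder).
df.to_iceberg("test_table", location="s3://my-bucket/iceberg_warehouse")
```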
Amazon S3 Tables¶
Amazon S3 Tables simplify Iceberg use and table maintenance by providing built-in Apache Iceberg support. Bodo supports S3 Tables seamlessly in both Python and SQL. Here is a step-by-step example of using S3 Tables in Bodo.
Make sure you have your environment ready:
- Create a table bucket on S3 (not a regular bucket). You can simply use the console with this link (replace the region if desired).
- Make sure you have AWS credentials in your environment (e.g. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`).
- Make sure the user associated with your credentials has the `AmazonS3TablesFullAccess` policy attached. You can use IAM in the AWS console (e.g. this link).
- Set the default region to the bucket region in your environment (see the export command after this list).
- Make sure you have the latest AWS CLI (see here), since S3 Tables is a new feature, and create a namespace in the table bucket (see the CLI example after this list; replace the region, account number, and bucket name).
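A sketch of those two setup commands, using the placeholder region, account number, and names from the example below:

```shell
# Point the AWS SDK at your table bucket's region.
export AWS_DEFAULT_REGION=us-east-2

# Create a namespace in the table bucket (replace the ARN components with your own).
aws s3tables create-namespace \
    --table-bucket-arn arn:aws:s3tables:us-east-2:111122223333:bucket/my-test-bucket \
    --namespace my_namespace
```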
Now you are ready to use Bodo to read and write S3 Tables. Run this example code (replace bucket name, account ID, region, namespace):
```python
import pandas as pd
import numpy as np
import bodo

# Replace these with your own bucket name, account ID, region, and namespace.
BUCKET_NAME = "my-test-bucket"
ACCOUNT_ID = "111122223333"
REGION = "us-east-2"
NAMESPACE = "my_namespace"
CONN_STR = f"iceberg+arn:aws:s3tables:{REGION}:{ACCOUNT_ID}:bucket/{BUCKET_NAME}"

NUM_GROUPS = 30
NUM_ROWS = 20_000_000


@bodo.jit
def example_write_iceberg_table():
    # "A" cycles through NUM_GROUPS group keys; "B" is a running index.
    df = pd.DataFrame({
        "A": np.arange(NUM_ROWS) % NUM_GROUPS,
        "B": np.arange(NUM_ROWS),
    })
    # Write the DataFrame to the S3 table bucket as an Iceberg table.
    df.to_sql(
        name="my_table_1",
        con=CONN_STR,
        schema=NAMESPACE,
        if_exists="replace",
    )

example_write_iceberg_table()


@bodo.jit
def example_read_iceberg():
    # Read the Iceberg table back from the S3 table bucket.
    df = pd.read_sql_table(
        table_name="my_table_1",
        con=CONN_STR,
        schema=NAMESPACE,
    )
    print(df)
    return df

df_read = example_read_iceberg()
print(df_read)
```
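If you want to add rows to the existing table instead of replacing it, `to_sql` also accepts `if_exists="append"`. A minimal sketch, reusing the connection string and namespace from above:

```python
@bodo.jit
def example_append_iceberg_table():
    # Append a few more rows to the existing Iceberg table.
    df = pd.DataFrame({"A": np.arange(10) % NUM_GROUPS, "B": np.arange(10)})
    df.to_sql(
        name="my_table_1",
        con=CONN_STR,
        schema=NAMESPACE,
        if_exists="append",
    )

example_append_iceberg_table()
```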
You can use BodoSQL to work with S3 Tables as well. Here is a simple example:
```python
import pandas as pd
import bodosql

# Replace these with your own bucket name, account ID, region, and namespace.
BUCKET_NAME = "my-test-bucket"
ACCOUNT_ID = "111122223333"
REGION = "us-east-2"
NAMESPACE = "my_namespace"
ARN_STR = f"arn:aws:s3tables:{REGION}:{ACCOUNT_ID}:bucket/{BUCKET_NAME}"

# Point BodoSQL at the S3 table bucket.
catalog = bodosql.S3TablesCatalog(ARN_STR)
bc = bodosql.BodoSQLContext(catalog=catalog)

# Register a small local DataFrame as a view, then materialize it as an Iceberg table.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})
bc = bc.add_or_replace_view("TABLE1", df)
query = f"""
CREATE OR REPLACE TABLE "{NAMESPACE}"."my_table" AS SELECT * FROM __bodolocal__.table1
"""
bc.sql(query)

# Read the table back with SQL.
df_read = bc.sql(f'SELECT * FROM "{NAMESPACE}"."my_table"')
print(df_read)
```