Bodo SQL Quick Start¶
This quickstart guide will walk you through the process of running a simple SQL query using Bodo on your local machine.
Prerequisites¶
Install Bodo to get started (e.g., pip install bodo
or conda install bodo -c bodo.ai -c conda-forge
).
Additionally, install bodosql with pip or conda:
Generate Sample Data¶
Let's start by creating a Parquet file with some sample data. The following Python code creates a Parquet file with two columns A
and B
and 20 million rows. The column A
contains values from 0 to 29, and the column B
contains values from 0 to 19,999,999.
import pandas as pd
import numpy as np
import bodo
import bodosql
NUM_GROUPS = 30
NUM_ROWS = 20_000_000
df = pd.DataFrame({
"A": np.arange(NUM_ROWS) % NUM_GROUPS,
"B": np.arange(NUM_ROWS)
})
df.to_parquet("my_data.pq")
Create a local in-memory SQL Table¶
Now let's create a local in-memory SQL table from the Parquet file. We can use the TablePATH
API to register the table into our BodoSQLContext
.
Write a SQL Query¶
Now we can write a SQL query to compute the sum of column A
for all rows where B
is greater than 4.
Running your code¶
Bringing it all together, the complete code looks like this:
import pandas as pd
import numpy as np
import bodosql
NUM_GROUPS = 30
NUM_ROWS = 20_000_000
df = pd.DataFrame({
"A": np.arange(NUM_ROWS) % NUM_GROUPS,
"B": np.arange(NUM_ROWS)
})
df.to_parquet("my_data.pq")
bc = bodosql.BodoSQLContext(
{
"TABLE1": bodosql.TablePath("my_data.pq", "parquet")
}
)
df1 = bc.sql("SELECT SUM(A) as SUM_OF_COLUMN_A FROM TABLE1 WHERE B > 4")
print(df1)
To run the code, save it to a file, e.g. test_bodo_sql.py
, and run the following command in your terminal:
By default Bodo will use all available cores. To set a limit on the number of processes spawned, set the environment variable BODO_NUM_WORKERS
.
Check the SQL API Reference for the full list of supported SQL operations.