Bodo Platform SDK Guide

Bodo Platform SDK is a Python library that provides a simple way to interact with the Bodo Platform API. It allows you to create, manage, and monitor resources such as clusters, jobs, and workspaces.

Getting Started

Installation

pip install bodosdk

Create a workspace client

First, log in to your workspace at https://platform.bodo.ai/ and create an API token in the Bodo Platform. This token is used to authenticate your client with the Bodo Platform API.

Navigate to API Tokens in the Admin Console to generate a token. Copy and save the token's Client ID and Secret Key and use them to define a client (BodoWorkspaceClient) that can interact with the Bodo Platform.

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient(
    client_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    secret_key="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
)

Alternatively, set BODO_CLIENT_ID and BODO_SECRET_KEY environment variables to avoid storing keys in your code.

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()

You can access the workspace_data attribute of the client to get more information about your workspace:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
print(my_workspace.workspace_data)

Additional Configuration Options for BodoWorkspaceClient

  • print_logs: defaults to False. All API requests and responses are printed to the console if set to True.
from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient(print_logs=True)

Create a cluster

This example creates a simple one-node cluster in your workspace with the latest available Bodo version. It returns a cluster object.

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.create(
    name='My first cluster',
    instance_type='c5.large',
    workers_quantity=1
)

Wait for status

You can use the wait_for_status method to wait until the cluster is ready to use.

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.create(
    name='My first cluster',
    instance_type='c5.large',
    workers_quantity=1
)
my_cluster.wait_for_status(['RUNNING'])

This method waits until the cluster reaches any of the provided statuses, or fails (enters the FAILED status) for any reason.
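
Since wait_for_status accepts a list, you can also wait for whichever of several acceptable states is reached first and then branch on the result. A minimal sketch (assuming the cluster object's status attribute reflects the final state):

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.get("cluster_id")
# wait until the cluster is either RUNNING or PAUSED, whichever happens first
my_cluster.wait_for_status(['RUNNING', 'PAUSED'])
if my_cluster.status == 'PAUSED':
    print("Cluster is paused; resume it before running jobs.")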

Update your cluster

Now let's update our cluster. While the cluster is RUNNING, you can update only the name, description, auto_pause, auto_stop, and workers_quantity (changing workers_quantity triggers scaling) fields:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.create(
    name='My first cluster',
    instance_type='c5.large',
    workers_quantity=1
)
my_cluster.wait_for_status(['RUNNING'])
my_cluster.update(
    description='My description',
    name="My updated cluster",
    auto_pause=15,
    auto_stop=30
)

All other modifications, such as instance_type, bodo_version, etc., require the cluster to be STOPPED first.

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.get("cluster_id")
if my_cluster.status != 'STOPPED':
    my_cluster.stop(wait=True)
my_cluster.update(instance_type='c5.2xlarge', workers_quantity=2)

Create a Job

Let's now create a job on our RUNNING cluster. First, go to https://platform.bodo.ai and open the Jupyter notebook in your workspace. Then create the following test.py file in your main directory:

import pandas as pd
import numpy as np
import bodo
import time

NUM_GROUPS = 30
NUM_ROWS = 20_000_000

df = pd.DataFrame({
    "A": np.arange(NUM_ROWS) % NUM_GROUPS,
    "B": np.arange(NUM_ROWS)
})
df.to_parquet("my_data.pq")
time.sleep(1)  # wait until the file is available on all nodes


@bodo.jit(cache=True)
def computation():
    t1 = time.time()
    df = pd.read_parquet("my_data.pq")
    df1 = df[df.B > 4].A.sum()
    print("Execution time:", time.time() - t1)
    return df1


result = computation()
print(result)

Now you can define a job on the cluster through the SDK, wait until it has SUCCEEDED, and check its logs as follows:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.get("cluster_id")
my_job = my_cluster.run_job(
    code_type='PYTHON',
    source={'type': 'WORKSPACE', 'path': '/'},
    exec_file='test.py'
)
print(my_job.wait_for_status(['SUCCEEDED']).get_stdout())

You can use almost the same configuration to run a SQL file. All you need is a test.sql file and a Catalog defined on https://platform.bodo.ai:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.get("cluster_id")
my_job = my_cluster.run_job(
    code_type='SQL',
    source={'type': 'WORKSPACE', 'path': '/'},
    exec_file='test.sql',
    catalog="MyCatalog"
)
print(my_job.wait_for_status(['SUCCEEDED']).get_stdout())

Cluster List and executing jobs on multiple clusters

Now let's run the same job on different clusters:

from bodosdk import BodoWorkspaceClient
import random

my_workspace = BodoWorkspaceClient()

random_val = random.random()  # random suffix to avoid name conflicts
clusters_conf = [('c5.large', 8), ('c5.xlarge', 4), ('c5.2xlarge', 2)]
for i, conf in enumerate(clusters_conf):
    my_workspace.ClusterClient.create(
        name=f'Test {i}',
        instance_type=conf[0],
        workers_quantity=conf[1],
        custom_tags={'test_tag': f'perf_test{random_val}'}  # add a tag so we can easily filter our clusters
    )
# get list by tag
clusters = my_workspace.ClusterClient.list(filters={
    'tags': {'test_tag': f'perf_test{random_val}'}
})
# run the same job three times, once per cluster
jobs = clusters.run_job(
    code_type='PYTHON',
    source={'type': 'WORKSPACE', 'path': '/'},
    exec_file='test.py'
)
# wait for jobs to finish and print results
for job in jobs.wait_for_status(['SUCCEEDED']):
    print(job.name, job.cluster.name)
    print(job.get_stdout())
# remove our clusters
jobs.clusters.delete()  # or clusters.delete()

Execute SQL query

You can also execute SQL queries by passing just the query text, as follows:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_sql_job = my_workspace.JobClient.run_sql_query(sql_query="SELECT 1", catalog="MyCatalog", cluster={
    "name": 'Temporary cluster',
    "instance_type": 'c5.large',
    "workers_quantity": 1
})
print(my_sql_job.wait_for_status(['SUCCEEDED']).get_stdout())

In this case, when you provide a cluster configuration rather than an existing cluster, the created cluster is terminated as soon as the SQL job finishes.

If you want to execute a SQL job on an existing cluster, you can use run_sql_query on the cluster:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.create(
    name='My cluster',
    instance_type='c5.large',
    workers_quantity=1
)
my_sql_job = my_cluster.run_sql_query(sql_query="SELECT 1", catalog="MyCatalog")
print(my_sql_job.wait_for_status(['SUCCEEDED']).get_stdout())

Connector

You can also execute SQL queries using a cluster Connector via a Cursor:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.create(
    name='My cluster',
    instance_type='c5.large',
    workers_quantity=1
)
connection = my_cluster.connect('MyCatalog')
# or: connection = my_workspace.ClusterClient.connect('MyCatalog', 'cluster_id')
print(connection.cursor().execute("SELECT 1").fetchone())
my_cluster.delete()

Job Templates

Rather than defining jobs from scratch every time, you can create a template for your jobs.

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
tpl = my_workspace.JobTemplateClient.create(
    name='My template',
    cluster={
        'instance_type': 'c5.xlarge',
        'workers_quantity': 1
    },
    code_type="SQL",
    catalog="MyCatalog",
    exec_text="SELECT 1"
)
job1 = tpl.run()  # you can simply run it
job2 = tpl.run(exec_text="SELECT 2")  # or run it, overriding template values
job3 = tpl.run(cluster={'instance_type': 'c5.large'})  # you can even override part of the cluster configuration

jobs = my_workspace.JobClient.list(filters={'template_ids': [tpl.id]})  # you can filter jobs by their template_id
for job in jobs.wait_for_status(['SUCCEEDED']):
    print(job.name, job.cluster.instance_type, job.get_stdout())

You can also run your template on specific clusters:

from bodosdk import BodoWorkspaceClient
from bodosdk.models import JobTemplateFilter

my_workspace = BodoWorkspaceClient()
tpls = my_workspace.JobTemplateClient.list(filters=JobTemplateFilter(names=['My template']))
my_cluster = my_workspace.ClusterClient.create(
    name='My cluster',
    instance_type='c5.large',
    workers_quantity=1
)
print(my_cluster.run_job(template_id=tpls[0].id).wait_for_status(['SUCCEEDED']).get_stdout())
my_cluster.delete()

Statuses

Each resource (Cluster, Job, or Workspace) has its own set of statuses, which are as follows (a short usage sketch follows the lists):

Cluster

  • NEW
  • INPROGRESS
  • PAUSING
  • PAUSED
  • STOPPING
  • STOPPED
  • INITIALIZING
  • RUNNING
  • FAILED
  • TERMINATED

Job

  • PENDING
  • RUNNING
  • SUCCEEDED
  • FAILED
  • CANCELLED
  • CANCELLING
  • TIMEOUT

Workspace

  • NEW
  • INPROGRESS
  • READY
  • FAILED
  • TERMINATING
  • TERMINATED
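
These are the values returned by each resource's status attribute and accepted by wait_for_status. A minimal sketch that branches on a cluster's current state, using only methods shown elsewhere in this guide:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.get("cluster_id")

if my_cluster.status in ('NEW', 'INPROGRESS', 'INITIALIZING'):
    # cluster is still being provisioned; wait until it is usable
    my_cluster.wait_for_status(['RUNNING'])
elif my_cluster.status == 'RUNNING':
    # suspend a running cluster to save costs
    my_cluster.stop(wait=True)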

Organization client and workspaces

To manage workspaces, you need different keys (generated for the organization) and a different SDK client. Let's list all our workspaces:

from bodosdk import BodoOrganizationClient

my_org = BodoOrganizationClient(
    client_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    secret_key="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
)  # or BodoOrganizationClient() if `BODO_ORG_CLIENT_ID` and `BODO_ORG_SECRET_KEY` are exported
for w in my_org.list_workspaces():
    print(w.name)

You can filter the list of workspaces:

from bodosdk import BodoOrganizationClient
from bodosdk.models import WorkspaceFilter

my_org = BodoOrganizationClient()

for w in my_org.list_workspaces(filters=WorkspaceFilter(statuses=['READY'])):
    print(w.name)

You can provide filters either as a WorkspaceFilter imported from bodosdk.models or as a dictionary:

from bodosdk import BodoOrganizationClient

my_org = BodoOrganizationClient()

for w in my_org.list_workspaces(filters={"statuses": ['READY']}):
    print(w.name)

Create new Workspace

from bodosdk import BodoOrganizationClient

my_org = BodoOrganizationClient()
my_workspace = my_org.create_workspace(
    name="SDK test",
    region='us-east-2',
    cloud_config_id="a0d1242c-3091-42de-94d9-548e2ae33b73",
    storage_endpoint_enabled=True
).wait_for_status(['READY'])
assert my_workspace.id == my_org.list_workspaces(filters={"names": ['SDK test'], "statuses": ['READY']})[0].id
my_workspace.delete()  # remove workspace at the end

Upgrade workspace infra

If a workspace was created a while ago, you may want to re-run Terraform to apply fresh changes to the workspace infrastructure. You can do it as follows:

from bodosdk import BodoOrganizationClient

my_org = BodoOrganizationClient()
my_org.list_workspaces(filters={'ids': ['workspace_to_update1_id', 'workspace_to_update2_id']}).update_infra()

Advanced

This section presents more examples of bodosdk usage.

Workspace created in an existing VPC

It's possible to create a workspace on existing infrastructure. The only requirement is that the VPC has Internet access, through either a NAT gateway or an Internet gateway (IGW). The Bodo platform needs this to allow clusters to be authorized via an external authorization service.

from bodosdk import BodoOrganizationClient

my_org = BodoOrganizationClient()
my_workspace = my_org.create_workspace(
    cloud_config_id="cloudConfigId",
    name="My workspace",
    region="us-east-1",
    storage_endpoint_enabled=True,
    vpc_id="existing-vpc-id",
    private_subnets_ids=['subnet1', 'subnet2'],
    public_subnets_ids=['subnet3']
)
my_workspace.wait_for_status(['READY'])

Spot instances and Auto-AZ

You can create a cluster using spot instances to reduce cost. A limitation of spot instances is that this type of cluster cannot be paused. Furthermore, the cluster may occasionally become unavailable (when a spot instance is reclaimed).

Auto-AZ is a mechanism that retries cluster creation in another availability zone when the current availability zone does not have enough instances of the desired type.

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.create(
    name='Spot cluster',
    instance_type='c5.large',
    workers_quantity=1,
    use_spot_instance=True,
    auto_az=True,
)

Accelerated networking

Accelerated networking is enabled by default for instances that support it.

You can get a list of all supported instance types using the ClusterClient.get_instances function. It returns a list of InstanceType objects; the accelerated_networking field tells you whether accelerated networking is enabled for that instance type.

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()

accelerated_networking_instances = [x for x in my_workspace.ClusterClient.get_instances() if x.accelerated_networking]

my_cluster = my_workspace.ClusterClient.create(
    name='Accelerated networking cluster',
    instance_type=accelerated_networking_instances[0].name,
    workers_quantity=1,
)

Preparing clusters for future use

A cluster may be suspended in two states: PAUSED and STOPPED. Spot clusters cannot be PAUSED. These states differ in three ways: cost, start-up time, and error rate.

Costs

PAUSED > STOPPED - The PAUSED state incurs disk costs, while the STOPPED state doesn't.

Start up time

STOPPED > PAUSED - Resuming machines in the PAUSED state is much faster than restarting STOPPED clusters, as those machines are already defined in the cloud provider.

Error rate

PAUSED > STOPPED - The error rate is the likelihood that the number of available instances of the desired type is lower than the number of requested workers.

In the PAUSED state, instance entities are already defined. Upon resuming a PAUSED cluster, since all cluster resources are requested at once, it's more likely to run into capacity issues than in the STOPPED state. In the STOPPED state, instance creation is managed by an ASG (auto scaling group), which tries to create instances in a staggered manner.
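
In practice, the trade-off looks like this (a sketch using the pause and stop methods shown elsewhere in this guide):

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.get("cluster_id")

# fast to resume, but the disks keep incurring cost
my_cluster.pause()
my_cluster.wait_for_status(['PAUSED'])

# or: cheapest to keep, but slower to restart (instances recreated by the ASG)
# my_cluster.stop(wait=True)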

In the following example, we prepare a cluster for future use by pausing it once it's created:

from bodosdk import BodoWorkspaceClient
from bodosdk.models import ClusterFilter

my_workspace = BodoWorkspaceClient()

clusters_conf = {
    'Team A': {
        'instance_type': 'c5.2xlarge',
        'workers': 4,
    },
    'Team B': {
        'instance_type': 'c5.xlarge',
        'workers': 2,
    },
    'Team C': {
        'instance_type': 'c5.16xlarge',
        'workers': 2,
    }
}
for owner, conf in clusters_conf.items():
    my_workspace.ClusterClient.create(
        name=f"{owner} Cluster",
        instance_type=conf['instance_type'],
        workers_quantity=conf['workers'],
        custom_tags={'owner': owner, 'purpose': 'test'}
    )

my_workspace.ClusterClient.list(
    filters=ClusterFilter(tags={'purpose': 'test'})
).wait_for_status(
    ['RUNNING', 'INITIALIZING']
).pause().wait_for_status(['PAUSED'])

Run a job using a cluster as template

Let's imagine that you have a cluster (in any state) and you want to run a job with the same specification, but you don't want to reuse the previously defined cluster. You can use the existing cluster as a template for the cluster that runs the job:

from bodosdk import BodoWorkspaceClient

my_workspace = BodoWorkspaceClient()
my_cluster = my_workspace.ClusterClient.get('existing_cluster')
cluster_conf = my_cluster.dict()
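# drop the unique id so a new cluster is created from this configuration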
del cluster_conf['uuid']
my_sql_job = my_workspace.JobClient.run_sql_query(sql_query="SELECT 1", catalog="MyCatalog", cluster=cluster_conf)

A new cluster will be created with the same configuration as the existing cluster, and the job will be run on it. The new cluster will be terminated as soon as the job finishes.