2025)

🎉 Highlights

We're excited to add several APIs that make it easier to integrate Bodo DataFrames with your AI workflows. These include support for embedding text with LLMs, generating text with LLMs, and storing embeddings in and retrieving them from S3 Vectors.

✨ New Features

BodoSeries.ai.llm_generate has been added to pass each element of a series to an OpenAI-compatible generation endpoint.
BodoSeries.ai.embed has been added to pass each element of a series to an OpenAI-compatible embedding endpoint.
BodoSeries.ai.tokenize has been added to tokenize a series of strings into a series of token lists.
spawn_process_on_workers has been added to create a process on each worker node, so you can manage external processes that your program needs to interact with, such as a local LLM inference server.
Support for BodoSeries.quantile via a streaming approximate quantile computation
Full support for BodoSeries.describe
Added Series.map_with_state, which takes an initialization routine that runs once per worker; the state it returns is passed to the mapping function along with each individual row.
Support for BodoDataFrame/BodoSeries.reset_index
Support for BodoSeries binary operations with DateOffsets, strings, etc.
By default, Series.map and DataFrame.apply now attempt to JIT compile user-provided functions to improve performance. If JIT compilation is not possible, the mapping function runs as a normal Python function.
Improved setting dataframe columns to handle more cases
Improved filter expression handling
Support left/right/outer/cross joins
Support Series.isin(Series) when both Series are parallel, and filtering a dataframe with its output
Support writing to S3 Vectors
Support querying S3 Vectors
Support DataFrame.drop_duplicates
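To illustrate the stateful-map behavior described for Series.map_with_state, here is a minimal pure-Python sketch of a single worker. It is not Bodo's implementation: the helper map_with_state_sketch, its argument order, and the toy vocabulary are all hypothetical and chosen only to show that the initialization routine runs once while the mapping function runs once per row.

```python
from typing import Any, Callable, Iterable, List


def map_with_state_sketch(
    values: Iterable[Any],
    init_state: Callable[[], Any],
    row_fn: Callable[[Any, Any], Any],
) -> List[Any]:
    """Hypothetical single-worker model of a stateful map (not Bodo's code)."""
    state = init_state()  # runs once for the worker, not once per row
    return [row_fn(state, value) for value in values]


init_calls = 0


def load_vocabulary() -> dict:
    # Stand-in for expensive setup such as loading a tokenizer or opening a client.
    global init_calls
    init_calls += 1
    return {"hello": 0, "world": 1}


def to_token_id(vocab: dict, word: str) -> int:
    # Words missing from the toy vocabulary map to -1.
    return vocab.get(word, -1)


token_ids = map_with_state_sketch(
    ["hello", "world", "bodo"], load_vocabulary, to_token_id
)
```

The point of the pattern is that costly setup is paid once per worker rather than once per element, which matters when the state is a model, tokenizer, or network client.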
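The compile-then-fall-back behavior described for Series.map and DataFrame.apply can be pictured with the generic sketch below. Everything in it is an assumption made for illustration: try_compile, picky_compiler, and CompileError are hypothetical names, and the toy compiler's rule (rejecting lambdas) is arbitrary and says nothing about which functions Bodo can actually compile.

```python
from typing import Any, Callable, Tuple


class CompileError(Exception):
    """Raised by the hypothetical compiler when it cannot handle a function."""


def try_compile(
    func: Callable[[Any], Any],
    compiler: Callable[[Callable[[Any], Any]], Callable[[Any], Any]],
) -> Tuple[Callable[[Any], Any], bool]:
    """Return (callable, used_jit), falling back to the plain Python function."""
    try:
        return compiler(func), True
    except CompileError:
        return func, False


def picky_compiler(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
    # Toy stand-in for a JIT compiler; the "no lambdas" rule is arbitrary.
    if func.__name__ == "<lambda>":
        raise CompileError("unsupported function")
    return func  # a real compiler would return an optimized callable


def double(x: int) -> int:
    return x * 2


fast_fn, double_used_jit = try_compile(double, picky_compiler)
slow_fn, lambda_used_jit = try_compile(lambda x: x + 1, picky_compiler)

# Both callables produce results; only the execution path differs.
doubled = [fast_fn(v) for v in [1, 2, 3]]
incremented = [slow_fn(v) for v in [1, 2, 3]]
```

Either way the caller gets a working function, so the fallback changes only how fast the mapping runs, not whether it succeeds.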

🐛 Bug Fixes

Suppressed excessive subsequent fallback warnings
Handle pd.NA in unboxing object arrays

🏎️ Performance Improvements

Improved performance of read_parquet on string columns
Increased default streaming batch size for better performance
Improved performance of operations that use Python functions under the hood
Reduced memory pressure in the execution pipeline and improved performance
Processed a subset of datetime properties in BodoSeries via Arrow Compute in C++ to improve performance