Bodo 2025.8.1 Release (Date: 08/07/2025)

🎉 Highlights

We're excited to add several APIs that make it easier to integrate Bodo DataFrames with your AI workflows. These include support for embedding text with LLMs, generating text with LLMs, and storing and retrieving embeddings in S3 Vectors.

✨ New Features

  • BodoSeries.ai.llm_generate has been added to pass each element of a series to an OpenAI-compatible generation endpoint.
  • BodoSeries.ai.embed has been added to pass each element of a series to an OpenAI-compatible embedding endpoint.
  • BodoSeries.ai.tokenize has been added to tokenize a series of strings into a series of token lists.
  • spawn_process_on_workers has been added to create a process on each worker node, so you can manage external processes that your program needs to interact with, such as a local LLM inference server.
  • Support for BodoSeries.quantile via a streaming approximate quantile computation
  • Full support for BodoSeries.describe
  • Added Series.map_with_state, which takes an initialization routine that runs once per worker. The state it returns is passed to the mapping function along with each individual row.
  • Support for BodoDataFrame/BodoSeries.reset_index
  • Support for BodoSeries binary operations with DateOffsets, strings, etc.
  • Series.map and DataFrame.apply now attempt by default to JIT compile user-provided functions to improve performance. If JIT compilation is not possible, the mapping function runs as a normal Python function.
  • Improved setting dataframe columns to handle more cases
  • Improved filter expression handling
  • Support left/right/outer/cross joins
  • Support the Series.isin(Series) use case (both Series parallel) and filtering a dataframe with its output
  • Support writing to S3 Vectors
  • Support querying S3 Vectors
  • Support DataFrame.drop_duplicates

🐛 Bug Fixes

  • Suppressed excessive subsequent fallback warnings
  • Handle pd.NA in unboxing object arrays

🏎️ Performance Improvements

  • Improved performance of read_parquet on string columns
  • Increased default streaming batch size for better performance
  • Improved performance of operations that use Python functions under the hood
  • Reduced memory pressure in the execution pipeline and improved performance
  • A subset of datetime properties in BodoSeries is now processed via Arrow Compute in C++ to improve performance