
User-Defined Functions (UDFs)

While Pandas and other APIs can be extremely expressive, many data science and data engineering use cases require functionality beyond what they directly offer. In these situations, many programmers create User-Defined Functions, or UDFs, which are Python functions designed to compute on each row or on each group of rows, depending on the context.

Using UDFs with Bodo

Bodo users can construct UDFs either by defining a separate JIT function or by creating a function within a JIT function (either via a lambda or closure). For example, here are two ways to construct a UDF that advances each element of a Timestamp Series to the last day of the current month.

import pandas as pd
import bodo

@bodo.jit
def jit_udf(x):
    return x + pd.tseries.offsets.MonthEnd(n=0, normalize=True)

@bodo.jit
def jit_example(S):
    return S.map(jit_udf)

@bodo.jit
def lambda_example(S):
    return S.map(lambda x: x + pd.tseries.offsets.MonthEnd(n=0, normalize=True))

S = pd.Series(pd.date_range(start='1/1/2021', periods=100))
pd.testing.assert_series_equal(jit_example(S), lambda_example(S))

UDFs can be used to compute one value per row or group (map functions) or compute an aggregation (agg functions). Bodo provides APIs for both, which are summarized below. Please refer to supported Pandas API for more information.

Map Functions

  • Series.map
  • Series.apply
  • Series.pipe
  • DataFrame.map
  • DataFrame.apply
  • DataFrame.pipe
  • GroupBy.apply
  • GroupBy.pipe
  • GroupBy.transform

Agg Functions

  • GroupBy.agg
  • GroupBy.aggregate
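
To make the distinction between map functions and agg functions concrete, here is a minimal sketch in plain Pandas (no `bodo.jit`). The DataFrame and the variable names are illustrative assumptions, and whether each call compiles under Bodo should be checked against the supported Pandas API.

```python
import pandas as pd

# Illustrative data, not taken from the Bodo documentation.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 2, 3, 5]})

# Map function on rows: one output value per input element.
doubled = df["val"].map(lambda x: 2 * x)

# Map function on groups: transform returns a result aligned with the input rows.
centered = df.groupby("key")["val"].transform(lambda s: s - s.mean())

# Agg function: the UDF reduces each group to a single value.
spread = df.groupby("key")["val"].agg(lambda s: s.max() - s.min())
```

The map results keep the four-row shape of the input, while the agg result has one entry per distinct key.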

UDF Performance

Bodo supports UDFs without the significant runtime penalty they generally incur in Pandas. An example of this is shown in the quick start guide.

Bodo achieves a drastic performance advantage on UDFs because it can optimize them like any other JIT code. In contrast, library-based solutions are typically limited in their ability to optimize UDFs.

Additional Arguments

We recommend passing additional variables to UDFs explicitly, instead of directly using variables local to the function that defines the UDF. The latter is called the "captured" variables case, which is error-prone and may result in compilation errors.

For example, consider a UDF that appends a variable suffix to each string in a Series of strings. The proper way to write this function is to use the args argument to Series.apply().

import pandas as pd
import bodo

@bodo.jit
def add_suffix(S, suffix):
    return S.apply(lambda x, suf: x + suf, args=(suffix,))

S = pd.Series(["abc", "edf", "32", "Vew3", "er3r2"] * 10)
suffix = "_"
add_suffix(S, suffix)

Alternatively, arguments can be passed by keyword.

@bodo.jit
def add_suffix(S, suffix):
    return S.apply(lambda x, suf: x + suf, suf=suffix)
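
For contrast, the sketch below shows what the discouraged "captured" variable form looks like next to the explicit form. It is written in plain Pandas without `bodo.jit`, and the function names `add_suffix_captured` and `add_suffix_explicit` are hypothetical names chosen for this illustration. In plain Pandas both forms give the same result; the recommendation above concerns how reliably each form compiles under Bodo.

```python
import pandas as pd

def add_suffix_captured(S, suffix):
    # Captured-variable form: the lambda reads `suffix` from the enclosing scope.
    return S.apply(lambda x: x + suffix)

def add_suffix_explicit(S, suffix):
    # Explicit form: `suffix` is handed to the lambda through `args`.
    return S.apply(lambda x, suf: x + suf, args=(suffix,))

S = pd.Series(["abc", "edf", "32"])
captured = add_suffix_captured(S, "_")
explicit = add_suffix_explicit(S, "_")
```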

Note

Not all APIs support additional arguments. Please refer to supported Pandas API for more information on intended API usage.

Apply with Pandas Methods and Numpy ufuncs

In addition to UDFs, the apply API can also be used to call Pandas methods and Numpy ufuncs. To execute a Pandas method, you can provide the method name as a string.

import numpy as np
import pandas as pd
import bodo

@bodo.jit
def ex(S):
    return S.apply("nunique")

S = pd.Series(list(np.arange(100)) + list(np.arange(100)))
ex(S)

Numpy ufuncs can either be provided with a string matching the name or with the function itself.

import numpy as np
import pandas as pd
import bodo

@bodo.jit
def ex_str(S):
    return S.apply("sin")

@bodo.jit
def ex_func(S):
    return S.apply(np.sin)

S = pd.Series(list(np.arange(100)) + list(np.arange(100)))
pd.testing.assert_series_equal(ex_str(S), ex_func(S))

Note

Numpy ufuncs are not currently supported with DataFrames.

Type Stability Restrictions

Bodo's type stability requirements impose some limitations when using DataFrame.apply on columns of different types or when returning a DataFrame.

Differently Typed Columns

DataFrame.apply maps user provided UDFs to each row of the DataFrame. In the situation where a DataFrame has columns of different types, the Series passed to the UDF will contain values with different types. Bodo internally represents these as a Heterogeneous Series. This representation has limitations in the Series operations it supports. Please refer to the supported operations for heterogeneous series for more information.
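
As a rough illustration of why such rows are heterogeneous, the plain Pandas sketch below (no `bodo.jit`) inspects a row of a DataFrame with mixed column types. The column names and data are assumptions made for this example. Which Series operations Bodo accepts on a Heterogeneous Series is not covered here and should be checked against the supported operations.

```python
import pandas as pd

# Illustrative DataFrame with an integer, a string, and a float column.
df = pd.DataFrame({"id": [1, 2], "name": ["x", "y"], "score": [0.5, 1.5]})

# A single row mixes all three value types, so Pandas falls back to object dtype.
row = df.iloc[0]
row_dtype = row.dtype

# A row-wise UDF that only reads individual fields by column name.
labels = df.apply(lambda r: r["name"] + "_" + str(r["id"]), axis=1)
```

Because no single value type describes the row, a compiler that needs one type per Series has to treat it specially, which is the situation described above.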

Returning a DataFrame

In Pandas, there are multiple ways for Series.apply or DataFrame.apply to return a DataFrame instead of a Series. However, for type stability reasons, Bodo can only infer a DataFrame when the UDF returns, for each row, a Series whose size can be inferred at compile time.

Note

If you provide an Index, then all Index values must be compile time constants.

Here is an example using Series.apply to return a DataFrame.

import numpy as np
import pandas as pd
import bodo

@bodo.jit
def series_ex(S):
    return S.apply(lambda x: pd.Series((1, x)))

S = pd.Series(list(np.arange(100)) + list(np.arange(100)))
series_ex(S)

If a UDF returns a DataFrame in Pandas through any other means, Bodo will not match that behavior and may raise a compilation error. Where possible, convert your code to one of the supported methods.
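
The same pattern can be sketched for DataFrame.apply with an explicit Index. The example below is plain Pandas (no `bodo.jit`), and the function name `row_stats` and the column labels are hypothetical. It returns a Series of fixed length with literal index labels for every row, which is the shape of UDF the restriction above describes; it has not been verified against a specific Bodo version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def row_stats(r):
    # Every row yields a Series of length 2 whose index labels are literal strings.
    return pd.Series((r["a"] + r["b"], r["a"] * r["b"]), index=["total", "product"])

# Each returned Series becomes one row of the resulting DataFrame.
out = df.apply(row_stats, axis=1)
```

The index labels become the column names of the result, which is why they need to be known ahead of time for the output type to be fixed.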