User-Defined Functions (UDFs)¶
While Pandas and other APIs can be extremely expressive, many data science and data engineering use cases require additional functionality beyond what is directly offered. In these situations, many programmers create User Defined Functions, or UDFs, which are Python functions designed to compute on each row or groups of rows depending on the context.
Using UDFs with Bodo¶
Bodo users can construct UDFs either by defining a separate JIT function or by creating a function within a JIT function (either via a lambda or closure). For example, here are two ways to construct a UDF that advances each element of a Timestamp Series to the last day of the current month.
import pandas as pd import bodo @bodo.jit def jit_udf(x): return x + pd.tseries.offsets.MonthEnd(n=0, normalize=True) @bodo.jit def jit_example(S): return S.map(jit_udf) @bodo.jit def lambda_example(S): return S.map(lambda x: x + pd.tseries.offsets.MonthEnd(n=0, normalize=True)) S = pd.Series(pd.date_range(start='1/1/2021', periods=100)) pd.testing.assert_series_equal(jit_example(S), lambda_example(S))
UDFs can be used to compute one value per row or group (map functions) or compute an aggregation (agg functions). Bodo provides APIs for both, which are summarized below. Please refer to supported Pandas API for more information.
Bodo offers support for UDFs without the significant runtime penalty generally incurred in Pandas. An example of this is shown in the quick started guide.
Bodo achieves a drastic performance advantage on UDFs because UDFs can be optimized by similar to any other JIT code. In contrast, library based solutions are typically limited in their ability to optimize UDFs.
We recommend passing additional variables to UDFs explicitly, instead of directly using variables local to the function defining the UDF. The latter is called the \"captured\" variables case, which is often error-prone and may result in compilation errors.
For example, consider a UDF that appends a variable suffix to each
string in a Series of strings. The proper way to write this function is
to use the
args argument to
import pandas as pd import bodo @bodo.jit def add_suffix(S, suffix): return S.apply(lambda x, suf: x + suf, args=(suffix,)) S = pd.Series(["abc", "edf", "32", "Vew3", "er3r2"] * 10) suffix = "_" add_suffix(S, suffix)
Alternatively, arguments can be passed by keyword.
Not all APIs support additional arguments. Please refer to supported Pandas API for more information on intended API usage.
Apply with Pandas Methods and Numpy ufuncs¶
In addition to UDFs, the
apply API can also be used to call Pandas
methods and Numpy ufuncs. To execute a Pandas method, you can provide
the method name as a string.
import pandas as pd import bodo @bodo.jit def ex(S): return S.apply("nunique") S = pd.Series(list(np.arange(100) + list(np.arange(100)))) ex(S)
Numpy ufuncs can either be provided with a string matching the name or with the function itself.
import numpy as np import pandas as pd import bodo @bodo.jit def ex_str(S): return S.apply("sin") def ex_func(S): return S.apply(np.sin) S = pd.Series(list(np.arange(100) + list(np.arange(100)))) pd.testing.assert_series_equal(ex_str(S), ex_func(S))
Numpy ufuncs are not currently supported with DataFrames.
Type Stability Restrictions¶
Bodo's type stability requirements can encounter some limitations when
DataFrame.apply with different column types or when
returning a DataFrame.
Differently Typed Columns¶
DataFrame.apply maps user provided UDFs to each row of the DataFrame.
In the situation where a DataFrame has columns of different types, the
Series passed to the UDF will contain values with different types. Bodo
internally represents these as a Heterogeneous Series. This
representation has limitations in the Series operations it supports.
Please refer to the supported operations for heterogeneous series for
Returning a DataFrame¶
DataFrame.apply there are multiple ways
to return a DataFrame instead of a Series. However, for type stability
reasons, Bodo can only infer a DataFrame when returning a Series whose
size can be inferred at compile time for each row.
If you provide an Index, then all Index values must be compile time constants.
Here is an example using
Series.apply to return a DataFrame.
import pandas as pd import bodo @bodo.jit def series_ex(S): return S.apply(lambda x: pd.Series((1, x))) S = pd.Series(list(np.arange(100) + list(np.arange(100)))) series_ex(S)
If using a UDF that returns a DataFrame in Pandas through another means, this behavior will not match in Bodo and may result in a compilation error. Please convert your solution to one of the supported methods if possible.