In many situations, Bodo can save the binary resulting from the compilation of a function to disk, to be reused in future runs. This avoids the need to recompile functions the next time that you run your application.
Recompiling a function is only necessary when it is called with new input types, and the same applies to caching. In other words, an application can be run multiple times and process different data without having to recompile any code if the data types remain the same (which is the most common situation).
Caching works in most (but not all) situations, and is disabled by default. See caching limitations below for more information.
To cache a function, we only need to add the option
the JIT decorator:
import time @bodo.jit(cache=True) def mean_power_speed(): df = pd.read_parquet("data/cycling_dataset.pq") return df[["power", "speed"]].mean() t0 = time.time() result = mean_power_speed() if bodo.get_rank() == 0: print(result) print("Total execution time:", round(time.time() - t0, 3), "secs")
[stdout:0] power 102.078421 speed 5.656851 dtype: float64 Total execution time: 3.606 secs
The first time that the above code runs, Bodo compiles the function and
caches it to disk. In subsequent runs, it will recover the function from
cache and the execution time will be much faster as a result. You can
try this out by running the above code multiple times, and changing
data/cycling_dataset.pq is located in the Bodo tutorial repo.
Cache Location and Portability¶
In most cases, the cache is saved in the
inside the directory where the source files are located. The variable
NUMBA_DEBUG_CACHE can be set to
1 in order to see where exactly
the cache is and whether it is being written to or read from.
On Jupyter notebooks, the cache directory is called
is located in
for more information on these and other alternate cache locations. For
example, when running in a notebook:
import os import IPython cache_dir = IPython.paths.get_ipython_cache_dir() + "/numba_cache" print("Cache files:") os.listdir(cache_dir)
Cache files: ['ipython-input-bce41f829e09.mean_power_speed-4444615264.py38.nbi', 'ipython-input-bce41f829e09.mean_power_speed-4444615264.py38.1.nbc']
Cached objects work across systems with the same CPU model and CPU features. Therefore, it is safe to share and reuse the contents in the cache directory on a different machine. See here for more information.
The cache is invalidated automatically when the corresponding source
code is modified. One way to observe this behavior is to modify the
above example after it has been cached a first time, by changing the
name of the variable
df. The next time that we run the code, Bodo
will determine that the source code has been modified, invalidate the
cache and recompile the function.
It is sometimes necessary to clear the cache manually (see caching limitations below). To clear the cache, the cache files can simply be removed.
Tips for Reusing the Cache¶
As explained above, caching is invalidated for a function any time any of the source code in the file changes. If we define a function and call it in the same file, and modify the arguments passed to the function, the cache will be invalidated.
Caching File IO¶
For example: a typical use case is calling an IO function with a different file name.
@bodo.jit(cache=True) def io_call(file_name): ... io_call("mydata.parquet")
The above function would need to be recompiled if the argument to
io_call changes from
mydata.parquet. By separating into separate
files the function call from the function definition, the function
definition does not need to be recompiled for each function call with
new arguments. The cached IO function will work for a change in file
name so long as the file schema is the same. For example, the below code
import IO_function from IO_functions IO_function(file_name)
would not need to recompile
IO_function each time
IO_function is isolated from that code change.
Caching Notebook Cells¶
For IPython notebooks the function to be cached should be in a separate cell from the function call.
@bodo.jit(cache=True) def io_call(file_name): ...
io_call(file_name) io_call(another_file_name) ...
If a cell with a cached function is modified, then its cache is invalidated and the function must be compiled again.
Current Caching Limitations¶
Caching does not recognize changes in Bodo versions, and cached files from different versions may not work, thus requiring manual clearing of the cache.
Changes in compiled functions are not seen across files. For example, if we have a cached Bodo function that calls a cached Bodo function in a different file, and modify the latter, Bodo will not update its cache (and therefore run with the old version of the function).
Global variables are treated as compile-time constants. When a function is compiled, the value of any globals that the function uses are embedded in the binary at compilation time and remain constant. If the value of the global changes in the source code after compilation, the compiled object (and cache) will not rebind to the new value.