This section provides tips on measuring performance of Bodo programs. It is important to keep the following in mind when measuring program run time:
- Every program has some overhead, so large data sets may be necessary for useful measurements.
- Performance can vary from one run to another. Several measurements are always needed.
- It is important to use a sequence of tests with increasing input size, which helps understand the impact of problem size on program performance.
- Testing with different data (in terms statistical distribution and skew) can be useful to see the impact of data skew on performance and scaling.
- Simple programs are useful to study performance factors. Complex programs are impacted by multiple factors and their performance is harder to understand.
- Longer computations typically provide more reliable run time information.
Measuring execution time of Bodo functions¶
Since Bodo-decorated functions are JIT-compiled, the compilation time is non-negligible but it only happens the first time a function is compiled. Compiled functions stay in memory and don't need to be re-compiled, and they can also be cached to disk (see caching) to be reused across different executions.
To avoid measuring compilation time, place timers inside the functions. For example:
""" calc_pi.py: computes the value of Pi using Monte-Carlo Integration """ import numpy as np import bodo import time n = 2 * 10**8 def calc_pi(n): t1 = time.time() x = 2 * np.random.ranf(n) - 1 y = 2 * np.random.ranf(n) - 1 pi = 4 * np.sum(x**2 + y**2 < 1) / n print("Execution time:", time.time()-t1, "\nresult:", pi) return pi bodo_calc_pi = bodo.jit(calc_pi) print("python:") calc_pi(n) print("\nbodo:") bodo_calc_pi(n)
The output of this code is as follows:
Bodo's parallel speedup can be measured similarly:
""" calc_pi.py: computes the value of Pi using Monte-Carlo Integration """ import numpy as np import bodo import time @bodo.jit def calc_pi(n): t1 = time.time() x = 2 * np.random.ranf(n) - 1 y = 2 * np.random.ranf(n) - 1 pi = 4 * np.sum(x**2 + y**2 < 1) / n print("Execution time:", time.time()-t1, "\nresult:", pi) return pi calc_pi(2 * 10**8)
Launched on eight parallel cores:
And the time it takes can be compared with Python performance. Here, we
5.06/0.57 ~= 9x speedup (from parallelism and sequential
Measuring sections inside Bodo functions¶
We can add multiple timers inside a function to see how much time each section takes:
""" calc_pi.py: computes the value of Pi using Monte-Carlo Integration """ import numpy as np import bodo import time n = 2 * 10**8 def calc_pi(n): t1 = time.time() x = 2 * np.random.ranf(n) - 1 y = 2 * np.random.ranf(n) - 1 t2 = time.time() print("Initializing x,y takes: ", t2-t1) pi = 4 * np.sum(x**2 + y**2 < 1) / n print("calculation takes:", time.time()-t2, "\nresult:", pi) return pi bodo_calc_pi = bodo.jit(calc_pi) print("python: ------------------") calc_pi(n) print("\nbodo: ------------------") bodo_calc_pi(n)
The output is as follows:
Note that Bodo execution took longer in the last example than previous ones, since the presence of timers in the middle of computation can inhibit some code optimizations (e.g. code reordering and fusion). Therefore, one should be cautious about adding timers in the middle of computation.
Disabling JIT Compilation¶
Sometimes it is convenient to disable JIT compilation without removing
jit decorators in the code, to enable easy performance comparison
with regular Python or perform debugging. This can be done by setting
the environment variable
1, which makes the jit
decorators act as if they perform no operation. In this case, the
invocation of decorated functions calls the original Python functions
instead of compiled versions.
Bodo distributes and processes equal amounts of data across cores as much as possible. There are certain cases, however, where depending on the statistical properties of the data and the operation being performed on it, some cores will need to process much more data than others at certain points in the application, which limits the scaling that can be achieved. How much this impacts performance depends on the degree of imbalance and the impact the affected operation has on overall execution time.
For example, consider the following operation:
df has one billion rows,
A only has 3 unique values, and we
are running this on a cluster with 1000 cores. Although the work can be
distributed to a certain extent, the final result for each group of
has to be computed on a single core. Because there are only 3 groups,
during computation of the final result there will only be at most three
Scaling can be measured as the speedup achieved with n cores compared to running on a single core, that is, the ratio of execution time with 1 core vs n cores.
For a fixed input size, the speed up achieved by Bodo with increasing number of cores (also known as strong scaling) depends on a combination of various factors: size of the input data (problem size), properties of the data, compute operations used, and the hardware platform's attributes (such as effective network throughput).
For example, the program above can scale almost linearly (e.g. 100x
speed up on 100 cores) for large enough problem sizes, since the only
communication overhead is parallel summation of the partial sums
np.sum on each processor.
On the other hand, some operations such as join and groupby may require
communicating significant amounts of data across the network, depending
on the characteristics of the data and the exact operation (e.g.
groupy.apply, inner vs outer
requiring fast cluster interconnection networks to scale to large number
Load imbalance, as described above, can also significantly impair scaling in certain situations.