Performance Measurement¶
This page provides tips on measuring the performance of Bodo programs. Keep the following in mind when measuring program run time:
- Every program has some overhead, so large data sets may be necessary for useful measurements.
- Performance can vary from one run to another. Several measurements are always needed.
- Longer computations typically provide more reliable run time information.
- It is important to use a sequence of tests with increasing input size, which helps in understanding the impact of problem size on program performance (see the sketch after this list).
- Testing with different data (in terms of statistical distribution and skew) can be useful to see the impact of data skew on performance and scaling.
- Simple programs are useful for studying performance factors; complex programs are impacted by multiple factors, and their performance is harder to understand.
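As a minimal sketch of such a size sweep (the `work` function below is a hypothetical stand-in for your own workload):

```python
import numpy as np
import bodo
import time


@bodo.jit
def work(n):
    t0 = time.time()  # timer inside the function excludes compilation time
    x = np.random.ranf(n)
    s = np.sum(x)
    print(n, "elements:", time.time() - t0, "seconds")
    return s


work(1_000)  # small warm-up call triggers JIT compilation
for n in (10**6, 10**7, 10**8):
    work(n)  # measure execution over increasing input sizes
```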
Measuring execution time of Bodo functions¶
Since Bodo-decorated functions are JIT-compiled, compilation time is non-negligible, but compilation only happens on the first call. Compiled functions stay in memory and don't need to be recompiled, and they can also be cached on disk (see caching) to be reused across different executions.
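Caching can be enabled through the `jit` decorator (a minimal sketch; see the caching documentation for details):

```python
import bodo


@bodo.jit(cache=True)  # cache the compiled function on disk for reuse
def f(n):
    return n + 1


f(10)  # first run compiles and caches; later runs reuse the cache
```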
To avoid measuring compilation time, place timers inside the functions. For example:
"""
calc_pi.py: computes the value of Pi using Monte-Carlo Integration
"""
import numpy as np
import bodo
import time
n = 2 * 10**8
def calc_pi(n):
t1 = time.time()
x = 2 * np.random.ranf(n) - 1
y = 2 * np.random.ranf(n) - 1
pi = 4 * np.sum(x**2 + y**2 < 1) / n
print("Execution time:", time.time()-t1, "\nresult:", pi)
return pi
bodo_calc_pi = bodo.jit(calc_pi)
print("python:")
calc_pi(n)
print("\nbodo:")
bodo_calc_pi(n)
The output of this code on a single core is as follows:
```
python:
Execution time: 5.060443162918091
result: 3.14165914

bodo:
Execution time: 2.165610068012029
result: 3.14154512
```
Bodo's parallel speedup can be measured similarly:
"""
calc_pi.py: computes the value of Pi using Monte-Carlo Integration
"""
import numpy as np
import bodo
import time
@bodo.jit
def calc_pi(n):
t1 = time.time()
x = 2 * np.random.ranf(n) - 1
y = 2 * np.random.ranf(n) - 1
pi = 4 * np.sum(x**2 + y**2 < 1) / n
print("Execution time:", time.time()-t1, "\nresult:", pi)
return pi
calc_pi(2 * 10**8)
Launched on eight parallel cores, the same program runs in about 0.57 seconds. Compared with the Python time above, this is a 5.06/0.57 ≈ 9x speedup (from parallelism and sequential optimizations).
In addition, SPMD launch mode is recommended for performance measurements since it has lower overheads.
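For example, assuming MPI is installed, an SPMD run on eight cores can be launched along these lines (the exact launch command may differ in your environment):

```shell
mpiexec -n 8 python calc_pi.py
```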
Measuring sections inside Bodo functions¶
We can add multiple timers inside a function to see how much time each section takes:
"""
calc_pi.py: computes the value of Pi using Monte-Carlo Integration
"""
import numpy as np
import bodo
import time
n = 2 * 10**8
def calc_pi(n):
t1 = time.time()
x = 2 * np.random.ranf(n) - 1
y = 2 * np.random.ranf(n) - 1
t2 = time.time()
print("Initializing x,y takes: ", t2-t1)
pi = 4 * np.sum(x**2 + y**2 < 1) / n
print("calculation takes:", time.time()-t2, "\nresult:", pi)
return pi
bodo_calc_pi = bodo.jit(calc_pi)
print("python: ------------------")
calc_pi(n)
print("\nbodo: ------------------")
bodo_calc_pi(n)
The output is as follows:
```
python: ------------------
Initializing x,y takes:  3.9832258224487305
calculation takes: 1.1460411548614502
result: 3.14156454

bodo: ------------------
Initializing x,y takes:  3.0611653940286487
calculation takes: 0.35728363902308047
result: 3.14155538
```
Note
Bodo execution took longer in this example than in the previous ones because timers in the middle of a computation can inhibit some code optimizations (e.g. code reordering and fusion). Therefore, be cautious about adding timers in the middle of a computation.
Disabling JIT Compilation¶
Sometimes it is convenient to disable JIT compilation without removing the `jit` decorators in the code, to enable easy performance comparison with regular Python or to perform debugging. This can be done by setting the environment variable `NUMBA_DISABLE_JIT` to `1`, which makes the `jit` decorators act as if they perform no operation. In this case, invoking a decorated function calls the original Python function instead of the compiled version.
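For example, to run the `calc_pi.py` script above once with JIT compilation disabled:

```shell
NUMBA_DISABLE_JIT=1 python calc_pi.py
```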
Load Imbalance¶
Bodo distributes and processes equal amounts of data across cores as much as possible. There are certain cases, however, where depending on the statistical properties of the data and the operation being performed on it, some cores will need to process much more data than others at certain points in the application, which limits the scaling that can be achieved. How much this impacts performance depends on the degree of imbalance and the impact the affected operation has on overall execution time.
For example, consider an operation along the following lines (a hypothetical sketch; the exact operation is illustrative):
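```python
# hypothetical sketch: a groupby on column "A" whose final per-group
# result must be assembled on a single core
result = df.groupby("A")["B"].nunique()
```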
where `df` has one billion rows, `A` only has 3 unique values, and we are running this on a cluster with 1000 cores. Although the work can be distributed to a certain extent, the final result for each group of `A` has to be computed on a single core. Because there are only 3 groups, at most three cores are active during the computation of the final result.
Expected Scaling¶
Scaling can be measured as the speedup achieved with n cores compared to running on a single core, that is, the ratio of execution time with 1 core vs n cores.
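For instance, with the single-core and eight-core Bodo timings measured earlier on this page:

```
speedup(n) = T(1 core) / T(n cores)
speedup(8) ≈ 2.17 s / 0.57 s ≈ 3.8x
```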
For a fixed input size, the speedup achieved by Bodo with an increasing number of cores (also known as strong scaling) depends on a combination of factors: the size of the input data (problem size), properties of the data, the compute operations used, and the hardware platform's attributes (such as effective network throughput).
For example, the program above can scale almost linearly (e.g. 100x speedup on 100 cores) for large enough problem sizes, since the only communication overhead is the parallel summation of the partial sums obtained by `np.sum` on each processor.
On the other hand, some operations such as join and groupby may require communicating significant amounts of data across the network, depending on the characteristics of the data and the exact operation (e.g. `groupby.sum`, `groupby.nunique`, `groupby.apply`, inner vs. outer join, etc.), requiring fast cluster interconnection networks to scale to a large number of cores.
Load imbalance, as described above, can also significantly impair scaling in certain situations.