Benchmarks¶
In addition to the integrated tests, which track code correctness, we have a suite of benchmarks that track performance.
Running the benchmarks¶
Because performance is system specific, we currently only support running the benchmarks on the LLNL machines Quartz and Lassen. If you are on either of these machines, the script benchmarks/runBenchmarks.py can be used to run the benchmarks.
> python ../benchmarks/runBenchmarks.py --help
usage: runBenchmarks.py [-h] [-t TIMELIMIT] [-o TIMINGCOLLECTIONDIR]
                        [-e ERRORCOLLECTIONDIR]
                        geosxPath outputDirectory

positional arguments:
  geosxPath             The path to the GEOSX executable to benchmark.
  outputDirectory       The parent directory to run the benchmarks in.

optional arguments:
  -h, --help            show this help message and exit
  -t TIMELIMIT, --timeLimit TIMELIMIT
                        Time limit for the entire script in minutes, the
                        default is 60.
  -o TIMINGCOLLECTIONDIR, --timingCollectionDir TIMINGCOLLECTIONDIR
                        Directory to copy the timing files to.
  -e ERRORCOLLECTIONDIR, --errorCollectionDir ERRORCOLLECTIONDIR
                        Directory to copy the output from any failed runs to.
At a minimum you need to pass the script the path to the GEOSX executable and a directory to run the benchmarks in. This directory will be created if it doesn't exist. The script collects the list of benchmarks to run and submits a job to the system's scheduler for each benchmark, which means you don't need to be in an allocation to run the benchmarks. Note that this is different from the integrated tests, where you need to already be in an allocation and an internal scheduler is used to run the individual tests. Since a benchmark is a measure of performance, getting consistent results requires that each benchmark has access to the same resources every time it is run; using the system scheduler guarantees this.
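For example, to benchmark a Quartz build and collect the results under a new directory, an invocation might look like the following (the executable and output paths here are hypothetical):

> python benchmarks/runBenchmarks.py /path/to/build-quartz-release/bin/geosx ./benchmarkRuns --timeLimit 120

Here --timeLimit caps the total run time of the script at 120 minutes, as described in the help output above.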
In addition to whatever outputs the input would normally produce (plot files, restart files, …), each benchmark will produce an output file output.txt containing the standard output and standard error of the run, and a .cali file containing the Caliper timing data in a format that Spot can read.
Note
A future version of the script will be able to run only a subset of the benchmarks.
Specifying a benchmark¶
A group of benchmarks is specified with a standard GEOSX input XML file with an extra Benchmarks block added at the top level. This block is ignored by GEOSX itself and is only used by the runBenchmarks.py script.
<Benchmarks>
  <quartz>
    <Run
      name="OMP"
      nodes="1"
      tasksPerNode="1"
      autoPartition="On"
      timeLimit="10"/>
    <Run
      name="MPI_OMP"
      nodes="1"
      tasksPerNode="2"
      autoPartition="On"
      timeLimit="10"
      strongScaling="{ 1, 2, 4, 8 }"/>
    <Run
      name="MPI"
      nodes="1"
      tasksPerNode="36"
      autoPartition="On"
      timeLimit="10"
      strongScaling="{ 1, 2, 4, 8 }"/>
  </quartz>
  <lassen>
    <Run
      name="OMP_CUDA"
      nodes="1"
      tasksPerNode="1"
      autoPartition="On"
      timeLimit="10"/>
    <Run
      name="MPI_OMP_CUDA"
      nodes="1"
      tasksPerNode="4"
      autoPartition="On"
      timeLimit="10"
      strongScaling="{ 1, 2, 4, 8 }"/>
  </lassen>
</Benchmarks>
[Source: benchmarks/SSLE-small.xml]
The Benchmarks block consists of a block for each machine the benchmarks are to run on. Currently the only options are quartz and lassen.
The Run block¶
Each machine block contains a number of Run blocks, each of which specifies a family of benchmarks to run. Each Run block must have the following required attributes:

name: The name of the family of benchmarks; must be unique among all the other Run blocks on that system.
nodes: An integer which specifies the base number of nodes to run the benchmark with.
tasksPerNode: An integer that specifies the number of tasks to launch per node.
Each Run block may contain the following optional attributes:

threadsPerTask: An integer specifying the number of threads to allocate each task.
timeLimit: An integer specifying the time limit in minutes to pass to the system scheduler when submitting the benchmark.
args: Any extra command line arguments to pass to GEOSX.
autoPartition: Either On or Off; not specifying autoPartition is equivalent to autoPartition="Off". When auto partitioning is enabled the script will compute the number of x, y and z partitions such that the resulting partition is as close to a perfect cube as possible, i.e. with 27 tasks x = 3, y = 3, z = 3 and with 36 tasks x = 4, y = 3, z = 3 (see the sketch after this list). This is optimal when the domain itself is a cube, but will be suboptimal otherwise.
strongScaling: A list of unique integers specifying the factors to scale the number of nodes by. If N numbers are provided then N benchmarks are run, and benchmark i uses nodes * strongScaling[ i ] nodes. Not specifying strongScaling is equivalent to strongScaling="{ 1 }".
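The following is a minimal Python sketch of the auto-partitioning idea described above; it is not the actual runBenchmarks.py implementation, and the function name auto_partition is made up for illustration. It searches the divisor triples of the task count for the one closest to a perfect cube.

def auto_partition(tasks):
    """Return (x, y, z) with x * y * z == tasks, as close to a cube as possible."""
    best = None
    for x in range(1, tasks + 1):
        if tasks % x != 0:
            continue
        for y in range(1, tasks // x + 1):
            if (tasks // x) % y != 0:
                continue
            z = tasks // (x * y)
            # Smaller spread between the largest and smallest factor means
            # the decomposition is closer to a perfect cube.
            spread = max(x, y, z) - min(x, y, z)
            if best is None or spread < best[0]:
                best = (spread, (x, y, z))
    # Report the largest dimension first, e.g. (4, 3, 3) rather than (3, 3, 4).
    return tuple(sorted(best[1], reverse=True))

print(auto_partition(27))  # (3, 3, 3)
print(auto_partition(36))  # (4, 3, 3)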
Looking at the example Benchmarks block above, on Lassen one benchmark from the OMP_CUDA family will be run with one node and one task, and four benchmarks from the MPI_OMP_CUDA family will be run with one, two, four and eight nodes and four tasks per node.
Note that specifying a time limit for each benchmark family can greatly decrease the time spent waiting in the scheduler’s queue. A good rule of thumb is that the time limit should be twice as long as it takes to run the longest benchmark in the family.
Adding a benchmark problem¶
To add a new group of benchmarks you need to create an XML input file describing the problem to be run. Then you need to add the Benchmarks block described above, which specifies the specific benchmarks. Finally, add a symbolic link to the input file in the benchmarks directory and run the benchmarks to make sure everything works as expected.
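For example, linking a new input file into the benchmarks directory might look like the following (the file name and path here are hypothetical):

> ln -s ../path/to/MyNewBenchmark.xml benchmarks/MyNewBenchmark.xml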
Viewing the results¶
Each night the NightlyTests repository runs the benchmarks on both Quartz and Lassen; the timingFiles directory contains all of the resulting Caliper output files. If you're on LC, these files are duplicated at /usr/gapps/GEOSX/timingFiles/, and if you have LC access you can view them in Spot. You can also open these files in Python and analyse them (see Opening Spot caliper files in Python).
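As a minimal sketch, assuming the caliper-reader Python package is installed (pip install caliper-reader) and using a hypothetical file name, reading one of these files might look like the following; see Opening Spot caliper files in Python for the documented workflow.

import caliperreader

reader = caliperreader.CaliperReader()
reader.read("OMP_1.cali")      # hypothetical file name from a benchmark run

print(reader.globals)          # run-level metadata recorded with the benchmark
for record in reader.records[:5]:
    print(record)              # individual timing records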
If you want to run the benchmarks on your local branch and compare the results with develop, you can use the benchmarks/compareBenchmarks.py Python script. This requires that you run the benchmarks on your branch and on develop. It will print out a table with the initialization time speedup and run time speedup, so a run speedup of 2x means your branch runs twice as fast as develop, whereas an initialization speedup of 0.5x means the setup takes twice as long.
Note
A future version of the script will be able to pull timing results straight from the .cali files, so that if you have access to the NightlyTests timing files you won't need to run the benchmarks on develop. Furthermore, it will be able to provide more detailed information than just initialization and run times.