OpenMP

From Crypto++ Wiki
Jump to navigation Jump to search

OpenMP (Open Multi-Processing) is an application programming interface that supports shared memory multi-threaded programming on most platforms, including AIX, Solaris, Linux, OS X, and Windows. This wiki article will explain how to build the library with OpenMP support and provide an example program.

OpenMP pragmas surface in various source files, including nbtheory.cpp and scrypt.cpp. Crypto++ supports OpenMP when it is profitable to do so, like modern Key Derivation Functions with parallelism. Other classes, like Rabin-Williams signatures, has OpenMP but it is not profitable because it does not speed up calculations.

If you are using OpenMP on Linux then see GCC Issue 58378, Protect libgomp against child process hanging after a Unix fork().

Compiler Options

Here is a list of compiler options for the compilers the project regularly tests. The option should be added to CXXFLAGS. Crypto++ uses the compiler to drive link, so the compiler will add the correct OpenMP libraries, too.

SunCC requires -O3 or above. SunCC will change the optimization level automatically if it is below -O3.

Compiler Option
MSVC /openmp
GCC -fopenmp
Clang -fopenmp=libomp
ICC -fopenmp
ICX -qopenmp
SunCC -xopenmp=parallel
XLC -qsmp=omp
PGI -mp
Tru64 -omp

Building the Library

You must use -fopenmp on Linux to build the library with OpenMP support using GCC and Clang. Windows should use /openmp option. Below is an example of building with OpenMP on Fedora. The makefile detects -fopenmp on Linux platforms and adds -lgomp if it is missing.

$ CXXFLAGS="-DNDEBUG -g2 -O3 -fopenmp" make -j 4
g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c integer.cpp
...
g++ -o cryptest.exe -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe adhoc.o test.o
bench1.o bench2.o validat0.o validat1.o validat2.o validat3.o validat4.o datatest.o
regtest1.o regtest2.o regtest3.o dlltest.o fipsalgt.o ./libcryptopp.a -lgomp

After building the library you can test it with cryptest.exe v. Notice the message "OpenMP version 201511, 4 threads" when OpenMP is in effect. The version number comes directly from the _OPENMP macro.

$ ./cryptest.exe v
Using seed: 1548150602
OpenMP version 201511, 4 threads

Testing Settings...

passed:  Your machine is little endian.
passed:  sizeof(byte) == 1
passed:  sizeof(word16) == 2
passed:  sizeof(word32) == 4
passed:  sizeof(word64) == 8
passed:  sizeof(word128) == 16
passed:  sizeof(hword) == 4, sizeof(word) == 8, sizeof(dword) == 16
passed:  cacheLineSize == 64
hasSSE2 == 1, hasSSSE3 == 1, hasSSE4.1 == 1, hasSSE4.2 == 1, hasAESNI == 1,
hasCLMUL == 1, hasRDRAND == 1, hasRDSEED == 1, hasSHA == 0, isP4 == 0
...

The "OpenMP version..." message is due to the following code in test.cpp:

void PrintSeedAndThreads(const std::string& seed)
{
    std::cout << "Using seed: " << seed << std::endl;

#ifdef _OPENMP
    int tc = 0;
    #pragma omp parallel
    {
        tc = omp_get_num_threads();
    }

    std::cout << "OpenMP version " << (int)_OPENMP << ", ";
    std::cout << tc << (tc == 1 ? " thread" : " threads") << std::endl;
#endif
}

Sample Program

The following is a sample program from the Scrypt wiki page. Scrypt benefits using OpenMP in its Smix or ROmix function. In the example below, the security parameters cost=(1<<20), blockSize=12, paralellization=4 were selected to demonstrate OpenMP profitability. 1<<20 is a cost factor of approximately 1 million.

$ cat test.cxx
#include "cryptlib.h"
#include "secblock.h"
#include "scrypt.h"
#include "osrng.h"
#include "files.h"
#include "hex.h"

#include <iostream>
#include <omp.h>

int main()
{
    int threads = 1;
    #pragma omp parallel
    {
        threads = omp_get_num_threads();
    }

    using namespace CryptoPP;
    AutoSeededRandomPool prng;

    SecByteBlock key(64), salt(16*1024);
    prng.GenerateBlock(key, key.size());
    prng.GenerateBlock(salt, salt.size());

    Scrypt scrypt;
    scrypt.DeriveKey(key, key.size(), key, key.size(), salt, salt.size(), 1<<20, 12, 4);

    std::cout << "Threads: " << threads << std::endl;

    std::cout << "Key: ";
    StringSource(key, 16, true, new HexEncoder(new FileSink(std::cout)));
    std::cout << "..." << std::endl;

    return 0;
}

The program should be compiled with the same options used to build the library (or, the library should be built with the same options that will be used for the program). Below the two important ones, -O3 -fopenmp, are used for both the library and the program.

$ g++ -I. -O3 -fopenmp test.cxx ./libcryptopp.a -o test.exe

Below are the results from a 6-core Ice Lake machine with a Core i5-1035G1 CPU @ 3.5 GHz. Recall that the test program uses paralellization=4. The multithreaded OpenMP version of the program is about 3x faster than the single threaded version:

$ time OMP_NUM_THREADS=1 ./test.exe
Threads: 1
Key: 4A433D4B73F5EA27400D6EC001CA3C9E...

real    0m13.516s
user    0m13.061s
sys     0m0.438s

$ time OMP_NUM_THREADS=2 ./test.exe
Threads: 2
Key: 9EF3B82A54E57ADA188E0CD83B80C24A...

real    0m7.889s
user    0m14.684s
sys     0m0.966s

$ time OMP_NUM_THREADS=4 ./test.exe
Threads: 4
Key: 512B6C8C44CCCA1A1E67C990178EB2CC...

real    0m4.869s
user    0m16.618s
sys     0m2.136s

$ time OMP_NUM_THREADS=6 ./test.exe
Threads: 6
Key: 0A23D30D90C4E2212A7DF2FCA8A67412...

real    0m4.831s
user    0m16.634s
sys     0m2.022s

$ time OMP_NUM_THREADS=8 ./test.exe
Threads: 8
Key: 033BE7CE45D127234261905E36761BB9...

real    0m4.894s
user    0m16.705s
sys     0m2.206s

$ time OMP_NUM_THREADS=12 ./test.exe
Threads: 12
Key: A8BC687006F55B67D2F283FBA242736A...

real    0m4.848s
user    0m16.652s
sys     0m2.059s

Performance

The Scrypt example showed that OpenMP is profitable for programs that are naturally parallelizable. However, there are costs associated with OpenMP that non-omp algorithms will pay.

The table below shows GMAC and VMAC performance in both configurations on a Skylake Core-i5 6400 @ 2.7 GHz. Below, bigger GiB/second is better; and smaller Cycle/byte is better.

Skylake with OpenMP
Algorithm GiB/second Cycle/byte
GMAC(AES) 8.009 GiB/s 0.32 cpb
VMAC(AES)-64 9.676 GiB/s 0.27 cpb
VMAC(AES)-128 5.110 GiB/s 0.50 cpb
Skylake without OpenMP
Algorithm GiB/second Cycle/byte
GMAC(AES) 8.445 GiB/s 0.30 cpb
VMAC(AES)-64 10.964 GiB/s 0.23 cpb
VMAC(AES)-128 6.237 GiB/s 0.41 cpb

Geometric Average

Geometric Average is the way Crypto++ aggregates benchmark results for different algorithms. In the tables blow, bigger Throughput is better. A variance of 3 to 5 is typical, and we consider it noise. But a change of 100 or 180 is not, and it indicates a non-trivial drop in performance.

Here are the results of running the full Benchmark suite on a Skylake Core i5-6400 CPU @ 2.70GHz

Skylake Geometric Average
Configuration Throughput
With OpenMP 1100.504107
Without OpenMP 1286.652288

Here are the results of running the full Benchmark suite on a Coffee Lake Core i7-8700 @ 3.2 GHz.

Coffee Lake Geometric Average
Configuration Throughput
With OpenMP 1414.931019
Without OpenMP 1583.779185

Here are the results of running the full Benchmark suite on a Ice Lake Core i5-1035G1 @ 3.5 GHz.

Ice Lake Geometric Average
Configuration Throughput
With OpenMP 1311.111786
Without OpenMP 1451.539969

Here are the results of running the full Benchmark suite on an ARM ODROID-XU4 @ 2.0 GHz.

ODROID Geometric Average
Configuration Throughput
With OpenMP 172.617632
Without OpenMP 188.200157