Link Time Optimization

From Crypto++ Wiki

Link Time Optimization, or LTO, is a GCC-compatible feature that allows the compiler to retain its internal representation of a program or module and use it later with different compilation units to perform optimizations during linking. Also see Link Time Optimization on the GCC wiki.

Generally speaking, you should not use Link Time Optimization for Crypto++. There are three reasons for the recommendation. First, we don't want the linker changing object files or the executables produced during link. The linker's job is to combine object files, not attempt to peephole optimize them.

Second, the tooling does not handle extended inline ASM properly when using Link Time Optimization. It appears the tooling does not track register usage properly. This is unexpected since GCC inline assembly requires a program to declare input operands, output operands and clobbers in the ASM template.

Third, Link Time Optimization causes the library to slow down. Based on our Benchmark results, and with all other things being equal, library performance gets worse with LTO enabled. Even so, be sure to benchmark your own program to determine whether LTO is profitable for it.

If you choose to use LTO then you must add -DCRYPTOPP_DISABLE_ASM to CXXFLAGS.
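As a minimal sketch, the flag can be appended to the same CXXFLAGS used in the GCC Options section of this page. The variable name LTO_CXXFLAGS is only illustrative and is not something the Crypto++ makefile looks for; the make invocation is left as a comment so the snippet can be pasted without starting a build.

```shell
# Illustrative variable: the LTO flags from the GCC example on this page,
# plus the required -DCRYPTOPP_DISABLE_ASM.
LTO_CXXFLAGS="-DNDEBUG -O2 -flto=6 -g -fPIC -pthread -DCRYPTOPP_DISABLE_ASM"
echo "CXXFLAGS=$LTO_CXXFLAGS"

# Then build from the Crypto++ source directory with:
#   AR=gcc-ar RANLIB=gcc-ranlib CXXFLAGS="$LTO_CXXFLAGS" make -j 4
```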

Also see the following:

A related topic is Bitcode, where the library is compiled to an intermediate representation that will eventually change.

Performance

We are not aware of any GCC documentation on performance benefits with benchmark numbers to substantiate the claims. GCC's Link Time Optimization page does not provide them. We suspect LTO does not benefit most programs in a measurable way.

With respect to Crypto++, running the full Benchmark suite on a Skylake machine at 2.7 GHz results in an overall drop in performance. The numbers are provided in the table below, where a bigger Geometric Average Throughput is better.

A run-to-run variation of 0 to 3 is typical for Geometric Average Throughput. We consider 3 or less simply noise due to interrupts and task switching. However, a difference of about 25, as in the table below, indicates a problem, and it is usually something we investigate to determine the cause of the drastic drop in performance. In this case, there's nothing to investigate since we know LTO is causing the performance loss.

Configuration    Geometric Average Throughput
With LTO         1261.012811
Without LTO      1286.652288
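The gap between the two rows can be checked against the noise threshold of 3 with a little arithmetic. The following is a sketch only, assuming a POSIX shell with awk available. The variable names are illustrative and the two values are copied from the table.

```shell
# Values copied from the table; variable names are illustrative.
with_lto=1261.012811
without_lto=1286.652288

# Absolute drop in Geometric Average Throughput when LTO is enabled.
drop=$(LC_ALL=C awk -v a="$without_lto" -v b="$with_lto" \
    'BEGIN { printf "%.2f", a - b }')

# The same drop as a percentage of the non-LTO throughput.
pct=$(LC_ALL=C awk -v a="$without_lto" -v b="$with_lto" \
    'BEGIN { printf "%.2f", (a - b) / a * 100 }')

echo "Throughput drop with LTO: $drop ($pct%)"
```

The absolute drop comes out near 25.64, or roughly 2 percent of the non-LTO throughput, which is well outside the 0 to 3 range treated as noise.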

GCC Options

You can build the library with LTO using the following GCC options. In addition to the GCC options, you must change AR to gcc-ar and RANLIB to gcc-ranlib. Crypto++ drives link through the compiler so you don't need to do anything with LD. The same CXXFLAGS are used for compile and link.

$ AR=gcc-ar RANLIB=gcc-ranlib \
  CXXFLAGS="-DNDEBUG -O2 -flto=6 -g -fPIC -pthread" make -j 4
Using testing flags: -DNDEBUG -O2 -flto=6 -g -fPIC -pthread
g++ -DNDEBUG -O2 -flto=6 -g -fPIC -pthread -pipe -c cryptlib.cpp
g++ -DNDEBUG -O2 -flto=6 -g -fPIC -pthread -pipe -c cpu.cpp
g++ -DNDEBUG -O2 -flto=6 -g -fPIC -pthread -pipe -c integer.cpp
...
gcc-ar r libcryptopp.a cryptlib.o cpu.o integer.o ...
gcc-ranlib libcryptopp.a
...
g++ -o cryptest.exe -DNDEBUG -O2 -flto=6 -g -fPIC -pthread -pipe adhoc.o test.o
  bench1.o bench2.o bench3.o datatest.o dlltest.o fipsalgt.o validat0.o
  validat1.o validat2.o validat3.o validat4.o validat5.o validat6.o validat7.o
  validat8.o validat9.o validat10.o regtest1.o regtest2.o regtest3.o regtest4.o
  ./libcryptopp.a

Clang Options

You can build the library with LTO using the following Clang options. In addition to the Clang options, you must change AR to llvm-ar and RANLIB to llvm-ranlib. Crypto++ drives link through the compiler so you don't need to do anything with LD. The same CXXFLAGS are used for compile and link.

$ AR=llvm-ar RANLIB=llvm-ranlib \
  CXX=clang++ CXXFLAGS="-DNDEBUG -O2 -flto -g -fPIC -pthread" make -j 4
Using testing flags: -DNDEBUG -O2 -flto -g -fPIC -pthread
clang++ -pipe -DNDEBUG -O2 -flto -g -fPIC -pthread -c cryptlib.cpp
clang++ -pipe -DNDEBUG -O2 -flto -g -fPIC -pthread -c cpu.cpp
clang++ -pipe -DNDEBUG -O2 -flto -g -fPIC -pthread -c integer.cpp
...
llvm-ar r libcryptopp.a cryptlib.o cpu.o integer.o 3way.o adler32.o algebra.o algparam.o
allocate.o arc4.o aria.o aria_simd.o ariatab.o asn.o authenc.o base32.o base64.o ...
make: llvm-ar: Command not found
make: *** [GNUmakefile:1438: libcryptopp.a] Error 127
make: *** Waiting for unfinished jobs....
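The transcript above stops at the archive step with "llvm-ar: Command not found", which means make could not find llvm-ar on PATH. A small pre-flight check can catch that before any compilation starts. This is a sketch under the assumption of a POSIX shell; check_tools is a hypothetical helper and not part of the Crypto++ build system.

```shell
# check_tools is a hypothetical helper, not part of the Crypto++ makefile.
# It returns 0 if every named tool is found on PATH, and 1 otherwise.
check_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing tools:$missing" >&2
        return 1
    fi
    return 0
}

# Usage: only start the LTO build when the LLVM tools are available.
#   check_tools clang++ llvm-ar llvm-ranlib && \
#     AR=llvm-ar RANLIB=llvm-ranlib CXX=clang++ \
#     CXXFLAGS="-DNDEBUG -O2 -flto -g -fPIC -pthread" make -j 4
```

On some distributions the LLVM tools are installed only under versioned names, in which case the names passed to AR and RANLIB would need to be adjusted accordingly.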

GCC ARM Platform

Here is what the GCC LTO error looks like on ARM platforms.

g++ -o cryptest.exe -DNDEBUG -O2 -Wall -D_FORTIFY_SOURCE=2
  -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables
  -flto=6 -g -fpic -fPIC -pthread -fopenmp adhoc.o test.o bench1.o bench2.o
  bench3.o datatest.o dlltest.o fipsalgt.o validat0.o validat1.o validat2.o
  validat3.o validat4.o validat5.o validat6.o validat7.o validat8.o validat9.o
  validat10.o regtest1.o regtest2.o regtest3.o regtest4.o ./libcryptopp.a  -lgomp
pubkey.h:640:26: warning: type ‘struct TF_ObjectImpl’ violates the C++ One Definition Rule [-Wodr]
 class CRYPTOPP_NO_VTABLE TF_ObjectImpl : public TF_ObjectImplBase<BASE, SCHEME_OPTIONS, KEY_CLASS>
                          ^
pubkey.h:640:26: note: a different type is defined in another translation unit
 class CRYPTOPP_NO_VTABLE TF_ObjectImpl : public TF_ObjectImplBase<BASE, SCHEME_OPTIONS, KEY_CLASS>
                          ^
pubkey.h:651:11: note: a different type is defined in another translation unit
...

make[1]: *** [/tmp/cc1QfZK2.ltrans17.ltrans.o] Error 1
/usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h: In function ‘BLAKE2_Compress32_NEON’:
/usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h:10401:47: fatal error: You must enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use these intrinsics.
   return (uint8x16_t)__builtin_neon_vld1v16qi ((const __builtin_neon_qi *) __a);
...
                                               ^
compilation terminated.