BASE+SIMD

BASE+SIMD is a compilation model where algorithms with special architectural needs are placed in a separate source file that receives extra compilation options to enable instruction set architecture (ISA) features. The separate source file avoids cross-pollinating CPU features into a straight C++ implementation intended to run on a minimally featured machine. The BASE file uses standard C++ flags, while the SIMD file provides hardware acceleration like Altivec, SSE, NEON, CRC, AES, CLMUL and SHA.

The library switched to BASE+SIMD in Crypto++ 6.0 to better support distros. Also see Issue 380, PR 461 and Commit 5272744410d0. The change was necessary for two reasons. First, the library stopped using -march=native by default. Second, GCC, Clang and ARM toolchains behave slightly differently. With GCC on i686 and x86_64, ISA features were always available, even without options like -msse4, -maes and -msha. ARM toolchains and Clang made the ISA features available only if the options to enable them were present on the command line.

Crypto++ honors a user's CXXFLAGS, but it always adds the required arch flags when compiling a SIMD file because the file will not compile without them. The Autotools and CMake project files add the required architectural options as well. Also see the GNU Coding Standards, Section 7.2.3 Variables for Specifying Commands.

In addition to supporting Clang and GCC on Linux, the library must also support unusual compilers like SunCC and different platforms like AIX, Solaris and Windows from the same source files. The mash-up of requirements makes the support tricky because things must work with multiple platforms and compilers.

Some related pages are GNUmakefile and Linux (Command Line).

BASE+SIMD

In the BASE+SIMD model there are two files for an algorithm. There is a BASE file like crc.cpp that provides a standard C++ implementation. It uses the default CXXFLAGS and nothing more. There is also a SIMD file like crc-simd.cpp which provides architecture dependent acceleration, like SSE4.2 or CRC instructions. The SIMD file requires additional compiler options for the platform.

As an example, below is the compilation of the CRC source files on an x86_64 platform. Notice crc-simd.cpp requires -msse4.2 on i686 and x86_64 platforms.

$ make
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c integer.cpp
...
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -msse4.2 -c crc-simd.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c crc.cpp
...

Other platforms may require different options. For example, AArch64 provides CRC-32 and CRC-32C acceleration, and the architectural flag of interest is -march=armv8-a+crc as shown below.

$ make
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c integer.cpp
...
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -march=armv8-a+crc -c crc-simd.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c crc.cpp
...
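
The split can be pictured with a small C++ sketch. This is not the library's code: Crc32cBase, Crc32cSimd, HasSse42 and Crc32c are hypothetical names, and both halves appear in one listing only for brevity. In the BASE+SIMD layout the block guarded by __SSE4_2__ would sit in the SIMD file, and only that file would be built with -msse4.2. The sketch also assumes a GCC-compatible compiler for the __builtin_cpu_supports runtime check.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE4_2__)
# include <nmmintrin.h>   // _mm_crc32_u8, only usable when SSE4.2 is enabled
#endif

// BASE half: portable bitwise CRC-32C (reflected polynomial 0x82F63B78).
// Needs no architectural flags.
inline std::uint32_t Crc32cBase(const std::uint8_t* data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#if defined(__SSE4_2__)
// SIMD half: would live in its own source file compiled with -msse4.2.
inline std::uint32_t Crc32cSimd(const std::uint8_t* data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i)
        crc = _mm_crc32_u8(crc, data[i]);
    return ~crc;
}
#endif

// Hypothetical runtime probe (assumption: GCC-compatible builtin).
inline bool HasSse42()
{
#if defined(__SSE4_2__) && defined(__GNUC__)
    return __builtin_cpu_supports("sse4.2") != 0;
#else
    return false;
#endif
}

// Dispatcher: use the accelerated routine only if it was compiled in
// and the running CPU reports the feature.
inline std::uint32_t Crc32c(const std::uint8_t* data, std::size_t len)
{
#if defined(__SSE4_2__)
    if (HasSse42())
        return Crc32cSimd(data, len);
#endif
    return Crc32cBase(data, len);
}
```

Both paths compute the same checksum, so callers see one function and never need to know which file supplied the implementation.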

SIMD Files

The following is a list of SIMD files as of Crypto++ 8.2. Each file listed below requires an architectural option.

$ ls -1 *_simd.cpp
aria_simd.cpp
blake2b_simd.cpp
blake2s_simd.cpp
chacha_simd.cpp
cham_simd.cpp
crc_simd.cpp
gcm_simd.cpp
gf2n_simd.cpp
keccak_simd.cpp
lea_simd.cpp
neon_simd.cpp
ppc_simd.cpp
rijndael_simd.cpp
shacal2_simd.cpp
sha_simd.cpp
simeck_simd.cpp
simon128_simd.cpp
simon64_simd.cpp
sm4_simd.cpp
speck128_simd.cpp
speck64_simd.cpp
sse_simd.cpp

GNUmakefile

As stated earlier, the GNUmakefile always supplies the required architecture option for a SIMD file. Additionally, the Autotools and CMake project files also provide the architectural options when compiling a SIMD file. The makefile uses a feature test to provide the architectural options.

The feature test is a little unusual, but it simply looks for diagnostic messages from the compiler. It works by converting all spaces in the compiler's output to newlines and then counting the newlines. If the line count is greater than 0, then the feature test fails.

The reason for the pattern is, compiler return codes are not standard. Some compilers issue a warning but return success when a feature is not available. Instead of relying on the return code, we simply look for compiler diagnostic messages.

The feature test was crafted after "dark and silent cockpits", meaning no messages are good, and any message is bad. Airplane cockpits are similar: no warning lights and no buzzers are good; warning lights and buzzers are bad.
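
The counting rule can be restated in a few lines of C++. This is only an illustration of the logic, not part of any build file; CountDiagnosticLines and FeatureAvailable are hypothetical names, and the captured compiler output is assumed to be already available as a string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Mirrors `tr ' ' '\n' | wc -l`: turn every space into a newline,
// then count the newline characters.
inline std::size_t CountDiagnosticLines(std::string output)
{
    std::replace(output.begin(), output.end(), ' ', '\n');
    return static_cast<std::size_t>(
        std::count(output.begin(), output.end(), '\n'));
}

// Dark and silent cockpit: the feature is usable only if the compiler
// said nothing at all.
inline bool FeatureAvailable(const std::string& compilerOutput)
{
    return CountDiagnosticLines(compilerOutput) == 0;
}
```

An empty string passes, while any diagnostic that contains a space or ends in a newline fails, regardless of the compiler's exit code.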

The excerpt below is from the makefile and shows its handling of the CRC flag for Intel-based machines. Other architectures, like AArch64, have similar code.

SSE42_FLAG = -msse4.2

...
TPROG = TestPrograms/test_x86_sse42.cxx
TOPT = $(SSE42_FLAG)
HAVE_OPT = $(shell $(CXX) $(TCXXFLAGS) $(ZOPT) $(TOPT) $(TPROG) -o $(TOUT) 2>&1 | tr ' ' '\n' | wc -l)

ifeq ($(strip $(HAVE_OPT)),0)
    CRC_FLAG = $(SSE42_FLAG)
else
    SSE42_FLAG =
endif

...
ifeq ($(SSE42_FLAG),)
    CRYPTOPP_CXXFLAGS += -DCRYPTOPP_DISABLE_SSE4
endif

...
crc-simd.o : crc-simd.cpp
    $(CXX) $(strip $(CRYPTOPP_CXXFLAGS) $(CXXFLAGS) $(CRC_FLAG) -c) $<

...
%.o : %.cpp
    $(CXX) $(strip $(CRYPTOPP_CXXFLAGS) $(CXXFLAGS) -c) $<

Both Autotools and CMake fall victim to "... some compilers issue a warning but return success when a feature is not available." The Crypto++ Autotools and CMake projects try to work around the build system failures by supplying their own function to try a compile.

Arch Options

The following table lists the architectural flags required for the SIMD files. The flags are GCC's style of options, but LLVM's Clang and Intel's ICC will consume the flags.

The list of files and options is current as of Crypto++ 8.2. It may become stale over time as additional files are added. If a source file is missing from the list then just run make and see what the GNUmakefile uses for the file.

Table 1: GCC style architecture flags

SIMD File         | i686 & x86_64  | ARM NEON                  | AArch64               | PowerPC
------------------+----------------+---------------------------+-----------------------+-----------------------
aria_simd.cpp     | -mssse3        | -march=armv7-a -mfpu=neon |                       | -mcpu=power4 -maltivec
blake2_simd.cpp   | -msse4.1       | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power4 -maltivec
chacha_simd.cpp   | -msse2         | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power4 -maltivec
chacha_avx.cpp    | -mavx2         |                           |                       |
cham_simd.cpp     | -mssse3        | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power4 -maltivec
crc_simd.cpp      | -msse4.2       |                           | -march=armv8-a+crc    |
gf2n_simd.cpp     | -mpclmul       | -march=armv7-a -mfpu=neon | -march=armv8-a+crypto | -mcpu=power8 -maltivec
keccak_simd.cpp   | -mssse3        | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power8 -maltivec
lea_simd.cpp      | -mssse3        | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power4 -maltivec
neon_simd.cpp     |                | -march=armv7-a -mfpu=neon | -march=armv8-a        |
ppc_simd.cpp      |                |                           |                       | -mcpu=power4 -maltivec
rijndael_simd.cpp | -msse4.1 -maes | -march=armv7-a -mfpu=neon | -march=armv8-a+crypto | -mcpu=power8 -maltivec
sha_simd.cpp      | -msse4.2 -msha |                           | -march=armv8-a+crypto | -mcpu=power8 -maltivec
shacal2_simd.cpp  | -msse4.2 -msha |                           | -march=armv8-a+crypto | -mcpu=power8 -maltivec
simon64_simd.cpp  | -msse4.1       | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power7 -maltivec
simon128_simd.cpp | -mssse3        | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power8 -maltivec
speck64_simd.cpp  | -msse4.1       | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power7 -maltivec
speck128_simd.cpp | -mssse3        | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power8 -maltivec
sm4_simd.cpp      | -msse4.1 -maes | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power8 -maltivec
sse_simd.cpp      | -msse2*        |                           |                       |

* i686 requires the -msse2 option. x86_64 does not need the flag because SSE2 is part of amd64's core instruction set.

ARM NEON also requires a floating point ABI like -mfloat-abi=hard or -mfloat-abi=softfp.

If compiling with IBM XL C/C++ use -qarch=pwr4 -qaltivec and -qarch=pwr8 -qaltivec.

Multiversioning

GCC has a feature called Function Multiversioning, which allows software to provide different versions of a function based on an Instruction Set Architecture (ISA). Function multiversioning first appeared in GCC for x86_64 in GCC 5, while AArch64 multiversioning appeared in GCC 6. The GCC online manual includes an example.
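
The GCC manual's example is along the following lines. The listing is a trimmed sketch rather than a verbatim copy, and the preprocessor guard with its single-version fallback is an addition made here on the assumption that multiversioning is only usable with a GCC-compatible compiler targeting x86_64 Linux.

```cpp
// Assumption: the multiversioned form needs GCC (or a recent Clang)
// targeting x86_64 Linux. Other targets get a single plain function.
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__)

__attribute__ ((target ("default")))
int foo ()
{
  // The default version of foo.
  return 0;
}

__attribute__ ((target ("sse4.2")))
int foo ()
{
  // foo version for SSE4.2
  return 1;
}

#else

int foo ()
{
  // Fallback when multiversioning is not assumed to be available.
  return 0;
}

#endif
```

A call to foo() is resolved at load time to the best version the running CPU supports, so the result depends on the machine.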

Multiversioning does not work well for the Crypto++ library for several reasons. First, multiversioning is too new and does not provide enough coverage in the field. GCC 4 is still common in the wild, especially on ARM and MIPS boards. In fact many of the machines at the GCC compile farm provide GCC 4.8.5 or 4.9.2 as the default compiler.

Second, GCC lacks function multiversioning completely for some platforms, like PowerPC, MIPS and SPARC. PowerPC, MIPS and SPARC would require BASE+SIMD.

Third, some versions of Clang do not support function multiversioning well, and other compilers don't support it at all. Support for Clang and some other compilers like SunCC would require BASE+SIMD.

Fourth, multiversioning is too incomplete, even with the latest compilers. For example, the example below will fail to compile with GCC 10 and Clang 12:

__attribute__ ((target ("default")))
template <class T>
int foo ()
{
  // The default version of foo.
  return 0;
}

__attribute__ ((target ("sse4.2")))
template <class T>
int foo ()
{
  // foo version for SSE4.2
  return 1;
}

Finally, multiversioning is too buggy. We tried the experiment and a round of testing. It failed miserably.

Crypto++ 5.6.2

Crypto++ versions prior to 6.0 used a different method to make instructions available. The method is detailed below, and it shows why we had to switch to BASE+SIMD. Effectively Crypto++ source code used the following pattern to provide hardware accelerated algorithms (the real code is a little messier):

#if defined(__AES__) || (_MSC_FULL_VER >= 150030729)
// Include standard Intel SSE headers for declarations
# define CRYPTOPP_AESNI_AVAILABLE 1
# include <wmmintrin.h>
#elif (GCC_VERSION >= 40400) || (__INTEL_COMPILER >= 1110)
// Mirror the Intel functions but use inline ASM for missing pieces
# define CRYPTOPP_AESNI_AVAILABLE 1
...
__inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_aesimc_si128 (__m128i a)
{
    __m128i r;
    asm ("aesimc %1, %0" : "=x"(r) : "xm"(a));
    return r;
}
...
#else
// AESNI not available. Straight C++ will be used
#endif

The pattern above worked great on i686 or x86_64 for GNU GCC, Intel ICC and Microsoft's MSVC++. These three compilers always made the instructions available with as little as -march=native. They were the original compilers the library supported from the 1990s and early 2000s.

The pattern failed miserably on other i686 or x86_64 compilers, including Clang and SunCC. Clang required -maes to be explicitly passed on the command line. Clang would not accept -march=native and enable AESNI even on a machine with AESNI. The result was many compilers, including Clang and SunCC, used the C++ implementation. Specialized support for compilers like Clang and SunCC did not arrive until about 2016.

Other architectures, like ARM NEON, ARMv8 and POWER8, do not make the instructions available unless the appropriate architecture switch is present, even with -march=native. So the additional architectures were broken out of the box using the original x86 pattern. Specialized support for architectures like ARM NEON, ARMv8 and POWER8 did not arrive until about 2016.

All of this trouble would have been avoided if the compilers simply made the instructions available out of the box for user code. It is one thing for GCC to use a particular instruction set when generating its own code from C++, but it's an entirely different story when a programmer asks for a specific instruction to be generated, like aesimc.

You can see an example of the old pattern in Crypto++ 5.6.2 cpu.h.

Clang and OS X

We are aware of one problem when using BASE+SIMD. On an old OS X machine with an updated Clang, Clang allows higher ISAs to cross-pollinate into BASE code by way of global constructors. The old OS X machine is an early Core2 Duo MacBook Pro from 2011 or so. The updated Clang comes from MacPorts and includes versions 5.0 and 6.0.

For example, instead of compiling only the function ChaCha_OperateKeystream_AVX2 with AVX2, Clang compiles the entire source file chacha_avx.cpp using AVX2. The translation unit includes global constructors, so AVX and AVX2 can be used outside of our guarded functions. If the binary is run on a down-level machine then the program will crash with a SIGILL.
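
The mechanism can be illustrated with a short C++ sketch. It shows one possible way to keep load-time code out of a file built with -mavx2 and is not necessarily what the library does; kName and AlgorithmName are hypothetical names, and the hazardous line is left commented out.

```cpp
#include <string>

// Hazard (commented out): in a file built with -mavx2, the dynamic
// initializer of a namespace-scope object runs before main() and the
// compiler is free to emit AVX/AVX2 instructions for it.
// static const std::string g_name("ChaCha-AVX2");

// Constant data needs no constructor, so nothing runs at load time.
static const char kName[] = "ChaCha-AVX2";

// A function-local static is constructed on first call. The caller is
// assumed to make that call only after a runtime CPU check has passed.
inline const std::string& AlgorithmName()
{
    static const std::string name(kName);
    return name;
}
```

The design point is that nothing in the SIMD translation unit executes until guarded code explicitly calls into it.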

Testing with GCC 6 does not reveal the problem. GCC was tested on OS X and Debian. Both machines were early Core2 Duo machines.

Also see Issue 751, SIGILL on older OS X with new Clang compiler due to global ctor ISA and Restrict global constructors to base ISA on the LLVM-dev mailing list.