BASE+SIMD

From Crypto++ Wiki

BASE+SIMD is a compilation model where algorithms with special architectural needs are placed in a separate source file that receives extra compilation options to enable instruction set architecture (ISA) features. The separate source file avoids cross-pollinating CPU features into a straight C++ implementation intended to run on a minimally featured machine. The BASE file uses standard C++ flags, while the SIMD file provides hardware acceleration like Altivec, SSE, NEON, CRC, AES, CLMUL and SHA.

The library switched to BASE+SIMD in Crypto++ 6.0 to better support distros. Also see Issue 380, PR 461 and Commit 5272744410d0. The change was necessary for two reasons. First, the library stopped using -march=native by default. Second, GCC, Clang and ARM toolchains behave slightly differently. With GCC on i686 and x86_64, ISA features were always available, even without options like -msse4, -maes and -msha. ARM toolchains and Clang made the ISA features available only if the options to enable them were present on the command line.

Crypto++ honors a user's CXXFLAGS, but it always adds the required arch flags when compiling a SIMD file because the file cannot compile without them. The Autotools and CMake project files also add the required architectural options. Also see the GNU Coding Standards, Section 7.2.3 Variables for Specifying Commands.

In addition to supporting Clang and GCC on Linux, the library must also support unusual compilers like SunCC and different platforms like AIX, Solaris and Windows from the same source files. The mash-up of requirements makes the support tricky because things must work with multiple platforms and compilers.

BASE+SIMD

In the BASE+SIMD model there are two files for an algorithm. There is a BASE file like crc.cpp that provides a standard C++ implementation. It uses the default CXXFLAGS and nothing more. There is also a SIMD file like crc-simd.cpp which provides architecture dependent acceleration, like SSE4.2 or CRC instructions. The SIMD file requires additional compiler options for the platform.

As an example, below is the compilation of the CRC source files on an x86_64 platform. Notice crc-simd.cpp requires -msse4.2 on i686 and x86_64 platforms.

$ make
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c integer.cpp
...
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -msse4.2 -c crc-simd.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c crc.cpp
...

Other platforms may require different options. For example, Aarch64 provides CRC32 and CRC-32C acceleration, and the architectural flag of interest is -march=armv8-a+crc as shown below.

$ make
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c integer.cpp
...
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -march=armv8-a+crc -c crc-simd.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c crc.cpp
...
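
The split can be sketched in a single listing. The sketch below is illustrative only and is not the library's code: the names CRC32C_Base, CRC32C_SIMD, HasSSE42, CRC32C and SKETCH_HAVE_SSE42 are hypothetical, and the comments mark which file each piece would live in. Because everything sits in one listing here, the SIMD path is compiled only when the compiler defines __SSE4_2__ (for example, when -msse4.2 is on the command line). In the real model only the SIMD file would see that option.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE4_2__)
# include <nmmintrin.h>   // _mm_crc32_u8
# define SKETCH_HAVE_SSE42 1
#else
# define SKETCH_HAVE_SSE42 0
#endif

// ---- would live in the BASE file (crc.cpp), built with plain CXXFLAGS ----

// Portable bit-at-a-time CRC-32C (reflected polynomial 0x82F63B78).
std::uint32_t CRC32C_Base(const unsigned char* data, std::size_t len)
{
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i)
    {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// ---- would live in the SIMD file (crc-simd.cpp), built with -msse4.2 ----

#if SKETCH_HAVE_SSE42
std::uint32_t CRC32C_SIMD(const unsigned char* data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i)
        crc = _mm_crc32_u8(crc, data[i]);
    return ~crc;
}
#endif

// ---- back in the BASE file: runtime check, then dispatch ----

// True only if the SIMD path was compiled in and the CPU reports SSE4.2.
bool HasSSE42()
{
#if SKETCH_HAVE_SSE42 && defined(__GNUC__)
    return __builtin_cpu_supports("sse4.2") != 0;
#else
    return false;
#endif
}

std::uint32_t CRC32C(const unsigned char* data, std::size_t len)
{
#if SKETCH_HAVE_SSE42
    if (HasSSE42())
        return CRC32C_SIMD(data, len);
#endif
    return CRC32C_Base(data, len);
}
```

Both paths compute the same checksum, so a caller only sees CRC32C. For the nine-byte string "123456789" either path should return the standard CRC-32C check value 0xE3069283.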

SIMD Files

The following is a list of SIMD files as of Crypto++ 7.1. Each file listed below requires an architectural option.

$ ls -1 *-simd.cpp
aria-simd.cpp
blake2-simd.cpp
cham-simd.cpp
crc-simd.cpp
gcm-simd.cpp
lea-simd.cpp
neon-simd.cpp
ppc-simd.cpp
rijndael-simd.cpp
shacal2-simd.cpp
sha-simd.cpp
simeck-simd.cpp
simon128-simd.cpp
simon64-simd.cpp
sm4-simd.cpp
speck128-simd.cpp
speck64-simd.cpp
sse-simd.cpp

GNUmakefile

As stated earlier, the GNUmakefile always supplies the required architecture option for a SIMD file. Additionally, the Autotools and CMake project files also provide the architectural options when compiling a SIMD file.

Below is an excerpt from the makefile showing its handling of the CRC flag for Intel-based machines. Other architectures, like Aarch64, have similar code.

HAVE_CRC = $(shell echo | $(CXX) $(CXXFLAGS) -msse4.2 -dM -E adhoc.cpp | $(GREP) -i -c __SSE4_2__)
ifeq ($(HAVE_CRC),1)
    CRC_FLAG = -msse4.2
endif

...
crc-simd.o : crc-simd.cpp
    $(CXX) $(strip $(CXXFLAGS) $(CRC_FLAG) -c) $<

...
%.o : %.cpp
    $(CXX) $(strip $(CXXFLAGS) -c) $<
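
The probe works because the compiler predefines a feature macro, such as __SSE4_2__, once it accepts the option. The same macros are visible to C++ code, so a source file can report what its own translation unit was compiled with. The following is a minimal sketch under that assumption. The helper name EnabledIsaMacros is hypothetical and is not part of the library, and the macro list is a small sample rather than a complete inventory.

```cpp
#include <string>
#include <vector>

// Returns the ISA feature macros the compiler defined for this
// translation unit. The result depends on the options used to compile it.
std::vector<std::string> EnabledIsaMacros()
{
    std::vector<std::string> names;
#if defined(__SSE2__)
    names.push_back("__SSE2__");
#endif
#if defined(__SSE4_2__)
    names.push_back("__SSE4_2__");
#endif
#if defined(__AES__)
    names.push_back("__AES__");
#endif
#if defined(__ARM_NEON)
    names.push_back("__ARM_NEON");
#endif
#if defined(__ARM_FEATURE_CRC32)
    names.push_back("__ARM_FEATURE_CRC32");
#endif
#if defined(__ALTIVEC__)
    names.push_back("__ALTIVEC__");
#endif
    return names;
}
```

Compiling the same listing with and without an option like -msse4.2 should change the returned list, which is the effect the makefile's grep is counting.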

Arch Options

The following table lists the architectural flags required for the SIMD files. The flags are GCC's style of options, but LLVM's Clang and Intel's ICC will consume the flags.

The list of files and options is current as of Crypto++ 7.1. It may become stale over time as additional files are added. If a source file is missing from the list then just run make and see what the GNUmakefile uses for the file.

Table 1: GCC style architecture flags

SIMD File         | i686 & x86_64    | ARM NEON                  | AArch64               | PowerPC
aria-simd.cpp     | -mssse3          | -march=armv7-a -mfpu=neon |                       | -mcpu=power4 -maltivec
blake2-simd.cpp   | -msse4.1         | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power4 -maltivec
cham-simd.cpp     | -mssse3          | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power4 -maltivec
crc-simd.cpp      | -msse4.2         |                           | -march=armv8-a+crc    |
gcm-simd.cpp      | -mssse3 -mpclmul | -march=armv7-a -mfpu=neon | -march=armv8-a+crypto | -mcpu=power8 -maltivec
lea-simd.cpp      | -mssse3          | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power4 -maltivec
neon-simd.cpp     |                  | -march=armv7-a -mfpu=neon | -march=armv8-a        |
ppc-simd.cpp      |                  |                           |                       | -mcpu=power4 -maltivec
rijndael-simd.cpp | -msse4.1 -maes   | -march=armv7-a -mfpu=neon | -march=armv8-a+crypto | -mcpu=power8 -maltivec
shacal2-simd.cpp  | -msse4.2 -msha   |                           | -march=armv8-a+crypto | -mcpu=power8 -maltivec
sha-simd.cpp      | -msse4.2 -msha   |                           | -march=armv8-a+crypto | -mcpu=power8 -maltivec
simon64-simd.cpp  | -msse4.1         | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power7 -maltivec
simon128-simd.cpp | -mssse3          | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power8 -maltivec
speck64-simd.cpp  | -msse4.1         | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power7 -maltivec
speck128-simd.cpp | -mssse3          | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power8 -maltivec
sm4-simd.cpp      | -msse4.1 -maes   | -march=armv7-a -mfpu=neon | -march=armv8-a        | -mcpu=power8 -maltivec
sse-simd.cpp      | -msse2*          |                           |                       |

* i686 requires the -msse2 option. x86_64 does not need the flag because SSE2 is part of amd64's core instruction set.

ARM NEON also requires a floating point ABI like -mfloat-abi=hard or -mfloat-abi=softfp.

If compiling with IBM XL C/C++ use -qarch=pwr4 -qaltivec and -qarch=pwr8 -qaltivec.

Multiversioning

GCC has a feature called Function Multiversioning, which allows software to provide different versions of a function based on an Instruction Set Architecture (ISA). Function multiversioning first appeared in GCC for x86_64 in GCC 5, while Aarch64 multiversioning appeared in GCC 6. An example from the GCC online manual is shown below.

Multiversioning does not work well for the Crypto++ library for several reasons. First, multiversioning is too new and does not provide enough coverage in the field. GCC 4 is still common in the wild, especially on ARM and MIPS boards. In fact many of the machines at the GCC compile farm provide GCC 4.8.5 or 4.9.2 as the default compiler.

Second, GCC lacks function multiversioning completely for some platforms, like PowerPC, MIPS and SPARC. PowerPC, MIPS and SPARC would require BASE+SIMD. Third, some versions of Clang do not support function multiversioning well, and other compilers don't support it at all. Support for Clang and some other compilers like SunCC would require BASE+SIMD.

__attribute__ ((target ("default")))
int foo ()
{
  // The default version of foo.
  return 0;
}

__attribute__ ((target ("sse4.2")))
int foo ()
{
  // foo version for SSE4.2
  return 1;
}

__attribute__ ((target ("arch=atom")))
int foo ()
{
  // foo version for the Intel ATOM processor
  return 2;
}

...
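
Where multiversioning is unavailable, a similar effect can be approximated by hand: give each version its own name and pick one at runtime. The sketch below mirrors the manual's foo example under that assumption. The names foo_default, foo_sse42, FooFunc and SelectFoo are hypothetical. In the BASE+SIMD model foo_sse42 would live in a SIMD file compiled with -msse4.2, while in this single listing it is simply guarded by __SSE4_2__.

```cpp
// Version for any machine; would live in a BASE file.
int foo_default() { return 0; }

// Version for SSE4.2 machines; would live in a SIMD file.
// It is compiled only when the compiler defines __SSE4_2__.
#if defined(__SSE4_2__)
int foo_sse42() { return 1; }
#endif

typedef int (*FooFunc)();

// Picks the best version available on the machine running the program.
FooFunc SelectFoo()
{
#if defined(__SSE4_2__) && defined(__GNUC__)
    if (__builtin_cpu_supports("sse4.2"))
        return &foo_sse42;
#endif
    return &foo_default;
}

// Callers use foo(); the selection runs once, on the first call.
int foo()
{
    static const FooFunc impl = SelectFoo();
    return impl();
}
```

The function pointer is resolved once and reused, so later calls pay only for an indirect call. Unlike multiversioning, nothing here depends on compiler support for the target attribute.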

Crypto++ 5.6.2

Crypto++ versions prior to 6.0 used a different method to make instructions available. The method is detailed below, and it shows why we had to switch to BASE+SIMD. Effectively Crypto++ source code used the following pattern to provide hardware accelerated algorithms (the real code is a little messier):

#if defined(__AES__) || (_MSC_FULL_VER >= 150030729)
// Include standard Intel SSE headers for declarations
# define CRYPTOPP_AESNI_AVAILABLE 1
# include <wmmintrin.h>
#elif (GCC_VERSION >= 40400) || (__INTEL_COMPILER >= 1110)
// Mirror the Intel functions but use inline ASM for missing pieces
# define CRYPTOPP_AESNI_AVAILABLE 1
...
__inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_aesimc_si128 (__m128i a)
{
    __m128i r;
    asm ("aesimc %1, %0" : "=x"(r) : "xm"(a));
    return r;
}
...
#else
// AESNI not available. Straight C++ will be used
#endif

The pattern above worked great on i686 or x86_64 for GNU GCC, Intel ICC and Microsoft's MSVC++. GCC, ICC and MSVC++ always made the instructions available with as little as -march=native. GCC, ICC and MSVC++ were the original compilers the library supported from the 1990s and early 2000s.

The pattern failed miserably on other i686 or x86_64 compilers, including Clang and SunCC. Clang required -maes to be explicitly passed on the command line. Clang would not accept -march=native and enable AESNI even on a machine with AESNI. The result was many compilers, including Clang and SunCC, used the C++ implementation. Specialized support for compilers like Clang and SunCC did not arrive until about 2016.

Other architectures, like ARM NEON, ARMv8, and POWER8, do not make the instructions available unless the appropriate architecture switch is present, even with -march=native. So the additional architectures were broken out of the box using the original x86 pattern. Specialized support for architectures like ARM NEON, ARMv8, and POWER8 did not arrive until about 2016.

All of this trouble would have been avoided if the compilers simply made the instructions available out of the box for user code. It is one thing for GCC to use a particular instruction set when generating its own code from C++, but it's an entirely different story when a programmer asks for a specific instruction to be generated, like aesimc.

You can see an example of the old pattern in Crypto++ 5.6.2 cpu.h.

Clang and OS X

We are aware of one problem when using BASE+SIMD. On an old OS X machine with an updated Clang, Clang allows higher ISAs to cross-pollinate into BASE code by way of global constructors. The old OS X machine is an early Core2 Duo a.k.a. MacBook Pro from 2011 or so. The updated Clang comes from MacPorts and includes versions 5.0 and 6.0.

For example, instead of compiling only the function ChaCha_OperateKeystream_AVX2 with AVX2, Clang compiles the entire source file chacha_avx.cpp using AVX2. The translation unit includes global constructors, so AVX and AVX2 will be used outside of our guarded functions. If the binary is run on a down-level machine then the program will crash with a SIGILL.
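
To make the term concrete, the sketch below contrasts a namespace-scope object that typically needs a global constructor with one that does not. It is an illustration under assumptions, not code from the library and not the fix applied for this issue. The names g_name, kSigma, FirstSigmaWord and NameLength are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Typically needs a global constructor: the std::string is set up (and its
// destructor registered) at program startup by code emitted in this
// translation unit, using this file's compile options.
static const std::string g_name("ChaCha");

// Needs no global constructor: a constexpr array of integers is
// constant-initialized and stored directly in the binary's data.
constexpr std::uint32_t kSigma[4] = {
    0x61707865u, 0x3320646eu, 0x79622d32u, 0x6b206574u
};

static_assert(kSigma[0] == 0x61707865u, "kSigma is usable at compile time");

std::uint32_t FirstSigmaWord() { return kSigma[0]; }
std::size_t NameLength() { return g_name.size(); }
```

If a file like this were compiled with -mavx2, the startup code for g_name would be generated under AVX2 as well, and it runs before any CPU feature check in the program. Keeping a SIMD file's globals in the second form is one way to limit that exposure.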

Testing with GCC 6 does not reveal the problem. GCC was tested on OS X and Debian. Both machines were early Core2 Duo machines.

Also see Issue 751, SIGILL on older OS X with new Clang compiler due to global ctor ISA and Restrict global constructors to base ISA on the LLVM-dev mailing list.