Among Bytes

Benchmarking ML-DSA Signature Generation: Understanding Rejection Sampling Performance

Kris Kwiatkowski — Mon, 23 Feb 2026 00:00:00 GMT

Introduction

As post-quantum cryptography (PQC) adoption accelerates, developers face new challenges in deploying signature schemes like ML-DSA on constrained devices, from embedded IoT devices to edge servers with limited computational resources. Unlike traditional signature algorithms such as RSA or ECDSA, which typically exhibit relatively predictable signing latency on a fixed platform, ML-DSA introduces an architectural feature, rejection sampling, that makes signing time inherently variable. This probabilistic mechanism is essential to the scheme’s security, but the resulting variation in signing latency can significantly impact system performance.

In this post, we’ll explore what rejection sampling means for ML-DSA benchmarking, share concrete performance metrics, and discuss best practices for measuring signing speed on constrained cryptographic modules.

This post extends section 4.1 of IETF draft Adapting Constrained Devices for Post-Quantum Cryptography.

Why ML-DSA Signing is Different

ML-DSA implements the Fiat-Shamir with Aborts construction, which uses rejection sampling as a core mechanism - a design choice rooted in lattice-based cryptography’s unique mathematical properties. Here’s what that means practically:

The Fiat-Shamir with Aborts Construction

Traditional Fiat-Shamir signatures derive a public challenge by hashing a commitment together with the message. In lattice-based schemes like ML-DSA, however, the response to that challenge can leak information about the secret key unless its distribution is controlled. The “with Aborts” variant solves this by:

Computing preliminary signature components (called a “y” value in lattice terms)
Deriving a challenge from these components and the message
Computing candidate signature components using the challenge and secret key
Checking norm bounds: Verifying that these components don’t exceed predefined thresholds
Either accepting the signature or aborting and restarting with fresh randomness
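
The five steps above can be sketched with a one-dimensional toy scheme. This is only an illustration of the abort/retry structure, not ML-DSA and not secure: the names toy_sign, GAMMA, C_MAX, S_MAX and BETA are assumptions made for this example, and real ML-DSA works on polynomial vectors and hashes a rounded commitment rather than the mask itself.

```python
import hashlib
import secrets

# All parameters are toy values chosen for illustration, not ML-DSA parameters.
GAMMA = 1000          # mask bound: y is uniform in [-GAMMA, GAMMA]
C_MAX = 8             # challenge bound: c is in [0, C_MAX]
S_MAX = 10            # secret bound: |secret| <= S_MAX
BETA = C_MAX * S_MAX  # worst-case magnitude of c * secret


def toy_sign(secret: int, message: bytes) -> tuple[int, int, int]:
    """Return (challenge, response, attempts) for a one-dimensional toy scheme."""
    if abs(secret) > S_MAX:
        raise ValueError("secret out of range")
    attempts = 0
    while True:
        attempts += 1
        # Step 1: fresh masking value y.
        y = secrets.randbelow(2 * GAMMA + 1) - GAMMA
        # Step 2: challenge derived from the mask and the message.
        digest = hashlib.sha3_256(y.to_bytes(4, "big", signed=True) + message).digest()
        c = digest[0] % (C_MAX + 1)
        # Step 3: candidate response mixes the mask with the secret.
        z = y + c * secret
        # Steps 4 and 5: accept only if z stays inside the safe window,
        # otherwise discard the attempt and retry with a fresh mask.
        if abs(z) <= GAMMA - BETA:
            return c, z, attempts


c, z, attempts = toy_sign(secret=7, message=b"hello")
print(f"challenge={c} response={z} attempts={attempts}")
```

In this toy model an accepted z always lies in [-(GAMMA - BETA), GAMMA - BETA], and for any fixed shift c * secret each value in that window is produced by exactly one y. That is the intuition behind the norm check: the accepted responses look the same whatever the secret is.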

What the Norm Bounds Actually Do

The norm bound checks ensure that signature components stay within specific vector magnitude ranges. In lattice cryptography, if signature values are allowed to vary based on the secret key properties, an attacker could observe:

Patterns in signature magnitudes that leak information about the private key
Subtle correlations between multiple signatures that compromise security
Bias in the distribution of valid signatures

By enforcing strict bounds, rejection sampling eliminates this leakage. The tradeoff: some attempts must be discarded, creating the variable latency we’ll see throughout this post.

Why This Creates Variable Performance

After computing candidate signature components, the algorithm checks whether these bounds are satisfied. If they aren’t met - which happens probabilistically - the entire signing attempt is discarded and restarted with fresh randomness.

This approach serves two critical purposes:

Security: Prevents information leakage about the secret key through out-of-range values
Correctness: Ensures signature distributions match security proof assumptions

Unlike traditional algorithms that process a message once and produce a signature, ML-DSA may need to retry the signing process multiple times. This makes predicting signing latency fundamentally different from RSA or ECDSA.

The Numbers: Rejection Probability and Expected Attempts

Here’s where benchmarking gets interesting. The acceptance probability - the chance that any single signing attempt passes the norm checks - varies by ML-DSA parameter set:

Acceptance probability - per-attempt probability of successful signing for the given ML-DSA variant.
ML-DSA Variant	Acceptance Probability	Expected Attempts
ML-DSA-44	23.50%	4.255
ML-DSA-65	19.63%	5.094
ML-DSA-87	25.96%	3.852

What this tells us:

ML-DSA-44 (compact variant) succeeds on roughly 1 in 4 attempts
ML-DSA-65 (NIST Category 3, most deployed) expects roughly 5 attempts on average
ML-DSA-87 (highest security) actually has the best acceptance probability, requiring ~3.9 attempts

These aren’t guesses - they’re mathematically derived from the algorithm’s structure and parameters defined in FIPS-204 using Equation 5 from Li32, assuming a random bit generator (RBG) as specified in Section 3.6.1.

Factors Affecting Rejection Probability

The probability that any given signing attempt succeeds isn’t fixed - it depends on several factors:

The message being signed: Different messages produce different challenge values
The secret key material: Specific key properties affect norm bound satisfaction probability
Random seed (hedged signing): When FIPS-204 Section 3.4 hedged signing is used, additional randomness affects outcomes
Context string: The optional context parameter (FIPS-204 Section 5.2) influences the challenge derivation

In practice, this means some message-key combinations may require significantly more rejection iterations than others. With deterministic signing, a particular message signed with a particular key always takes the same number of attempts, which can be well above or below the average; with hedged signing, the fresh randomness makes the count vary from call to call.

Understanding the Distribution

The expected number of attempts is only part of the story. Due to the geometric distribution of the rejection-sampling loop, we need to understand the “tail” of the distribution: what happens in worst-case scenarios.

The Mathematical Model

For benchmarking and capacity planning, the rejection-sampling loop is well modeled as a geometric distribution with acceptance probability p:

Each attempt either succeeds (with probability p = acceptance probability) or fails (with probability 1 - p)
The number of attempts follows a geometric distribution
The expected total attempts = 1/p (the reciprocal of acceptance probability)
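
As a quick check of this model, the short sketch below recomputes the expected number of attempts from the acceptance probabilities quoted earlier. The function names are chosen for this example only.

```python
# Acceptance probabilities as quoted in the table earlier in this post.
ACCEPTANCE = {"ML-DSA-44": 0.2350, "ML-DSA-65": 0.1963, "ML-DSA-87": 0.2596}


def expected_attempts(p: float) -> float:
    """Mean of a geometric distribution with success probability p."""
    return 1.0 / p


def prob_exactly(p: float, n: int) -> float:
    """Probability that signing finishes on exactly the n-th attempt."""
    return (1.0 - p) ** (n - 1) * p


for name, p in ACCEPTANCE.items():
    print(f"{name}: expected attempts = {expected_attempts(p):.3f}, "
          f"P(exactly 2 attempts) = {prob_exactly(p, 2):.4f}")
```

The printed expectations should match the "Expected Attempts" column (4.255, 5.094 and 3.852).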

Using this model, we can calculate the cumulative distribution function (CDF): the probability of completing signing within at most N iterations, which is 1 - (1 - p)^N.

ML-DSA Cumulative Distribution

The CDF expresses the probability that the signing process completes within at most a given number of iterations.

The data shows significant variation across ML-DSA variants:

First attempt success: Only 19.6% to 26% of signing operations succeed on the first try
Within 5 iterations: About two-thirds to three-quarters (66-78%) of operations complete by the 5th attempt
Within 10 iterations: Most operations (89-95%) complete within 10 attempts
Tail behavior: Even after 11 iterations, a small fraction (4-9%) of operations still need more attempts

Cumulative probability of completing signing within the given number of iterations, per ML-DSA variant.
Iterations	ML-DSA-44	ML-DSA-65	ML-DSA-87
1	23.50%	19.63%	25.96%
2	41.48%	35.41%	45.18%
3	55.23%	48.09%	59.41%
4	65.75%	58.28%	69.95%
5	73.80%	66.47%	77.75%
6	79.96%	73.05%	83.53%
7	84.67%	78.34%	87.80%
8	88.27%	82.59%	90.97%
9	91.03%	86.01%	93.31%
10	93.14%	88.76%	95.05%
11	94.75%	90.96%	96.34%
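
The table can be regenerated from the geometric model with a few lines of Python. This is a sketch that assumes only the acceptance probabilities quoted earlier; cdf is a name chosen for this example.

```python
# Acceptance probabilities as quoted earlier in this post.
ACCEPTANCE = {"ML-DSA-44": 0.2350, "ML-DSA-65": 0.1963, "ML-DSA-87": 0.2596}


def cdf(p: float, n: int) -> float:
    """Probability that signing completes within at most n iterations."""
    return 1.0 - (1.0 - p) ** n


print("Iterations\t" + "\t".join(ACCEPTANCE))
for n in range(1, 12):
    row = "\t".join(f"{100 * cdf(p, n):.2f}%" for p in ACCEPTANCE.values())
    print(f"{n}\t{row}")
```

Extending the range past 11 shows how slowly the tail closes, which is useful when sizing a retry budget.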

This demonstrates the importance of dimensioning systems with an adequate retry budget. While ML-DSA-44 and ML-DSA-87 show faster convergence than ML-DSA-65, all variants exhibit the same geometric tail behavior: rare but real outliers that extend beyond the typical case.

Practical Implications for Constrained Devices

For battery-powered IoT devices and embedded systems, this variability matters significantly:

Latency Unpredictability

Consider a concrete example: suppose a single rejection-sampling iteration takes 100 microseconds on your embedded device and the parameter set is ML-DSA-65.

Best case (1 iteration): 100 microseconds
Expected case (5 iterations): 500 microseconds
Roughly 91st percentile (11 iterations): 1,100 microseconds
Roughly 99th percentile (21 iterations): 2,100 microseconds

For time-critical applications like IoT gateways expecting 1 millisecond response times, this becomes problematic. A budget sized for the average case (500 μs, 5 iterations) is exceeded by roughly one signing operation in three, and even a 1 ms deadline (10 iterations) is missed about 11% of the time for ML-DSA-65. Only a budget near the 99th percentile (2,100 μs) brings the miss rate down to about 1%, and in production systems handling thousands of signatures per day, even that 1% isn’t negligible.
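
To size a latency budget, it helps to invert the CDF: given a target coverage, find the smallest iteration count that reaches it. The sketch below does this for ML-DSA-65 under the geometric model and the 100 microsecond per-iteration assumption used in this example; iterations_for is a name chosen here for illustration.

```python
def iterations_for(p: float, coverage: float) -> int:
    """Smallest n such that 1 - (1 - p)**n >= coverage."""
    n = 1
    while 1.0 - (1.0 - p) ** n < coverage:
        n += 1
    return n


P_65 = 0.1963    # ML-DSA-65 acceptance probability quoted earlier
ITER_US = 100    # assumed per-iteration time in microseconds

for coverage in (0.50, 0.90, 0.95, 0.99):
    n = iterations_for(P_65, coverage)
    print(f"{coverage:.0%} of signatures finish within {n} iterations ({n * ITER_US} us)")
```

Under this model, 11 iterations cover about 91% of ML-DSA-65 signatures and 21 iterations about 98.98%, so a strict 99% budget needs 22 iterations.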

Energy Consumption Variability

Energy consumption scales roughly linearly with iteration count. On battery-powered devices:

A “fast” signature (1 iteration) might consume 50 mJ
The same signature might consume 250 mJ at expected case (5 iterations)
Rare outliers (21 iterations) might consume 1,050 mJ

For devices relying on energy harvesting or with tight power budgets, this 21x variation between the best case and the 99th percentile creates significant uncertainty. Devices must either:

Over-provision battery capacity for worst-case scenarios
Implement aggressive power limiting that reduces throughput
Accept occasional failed signing operations when power budgets are exceeded

Impact on TLS Handshakes

In TLS 1.3 with ML-DSA, the server performs a signature during the handshake. On a mobile IoT device over cellular:

Expected signing: ~500 μs (manageable within handshake timing)
Occasional outliers: ~2,100 μs (visible latency increase; user-perceptible in some scenarios)
Compounded with network latency and cryptographic verification, outlier cases can extend handshakes by 10-20+ milliseconds

For LTE IoT connections, this can push handshakes from 200ms to 220ms - noticeable but usually acceptable. However, on slower networks or with multiple signature operations, the impact multiplies.

System Design Considerations

Real-time systems must allocate resources for 99th percentile (21 iterations), not average-case (5 iterations), unless they can tolerate occasional missed deadlines
Energy-harvesting devices need to either buffer energy or implement adaptive signing strategies
Communication protocols should not assume signing is faster than network operations
Signature verification (for example during firmware updates) and key generation, neither of which uses the abort/retry loop, can be significantly faster than signing, creating performance asymmetry

Best Practices for Benchmarking ML-DSA Signing

If you’re benchmarking ML-DSA implementations on constrained devices, don’t fall into the trap of reporting a single timing number. Here’s what to measure:

1. Single-Iteration Signing Time

Measure the time for signature operations that complete in a single rejection-sampling iteration. This captures the best-case performance and shows the efficiency of the core algorithm without retry overhead. It isolates the fundamental speed of your cryptographic implementation and makes it comparable across different hardware platforms.

2. Average Signing Time

Report the average across a large number of signing operations using independent messages and randomness. Alternatively, report the time corresponding to the expected number of iterations (shown in the table above). This reflects real-world performance that users will actually experience, accounting for the natural variation in rejection attempts.

3. Iteration Reporting

The most important step: make the signing function report the actual number of rejection iterations used. This enables:

Accurate averaging of multiple signing operations
Correlation of timing/energy measurements with iteration count
Identification of anomalies or implementation issues
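
A benchmark harness built around these three measurements could look like the sketch below. It assumes a hypothetical signer interface that returns the signature together with the iteration count; no particular library is known to expose exactly this, so the example uses a simulated signer (make_simulated_signer) in its place. Replace it with a wrapper around your own implementation.

```python
import random
import statistics
import time
from collections import defaultdict
from typing import Callable, Optional

# Hypothetical interface: a signer that returns (signature, iterations).
Signer = Callable[[bytes], tuple[bytes, int]]


def make_simulated_signer(p: float, seed: int) -> Signer:
    """Stand-in for a real ML-DSA signer: draws the iteration count from a
    geometric distribution with acceptance probability p."""
    rng = random.Random(seed)

    def sign(message: bytes) -> tuple[bytes, int]:
        iterations = 1
        while rng.random() >= p:
            iterations += 1
        return b"", iterations  # no real signature is produced

    return sign


def benchmark(sign: Signer, runs: int) -> dict:
    """Time `runs` signing calls and group the timings by iteration count."""
    by_iterations: dict[int, list[float]] = defaultdict(list)
    for i in range(runs):
        message = i.to_bytes(8, "big")  # a fresh message per run
        start = time.perf_counter()
        _signature, iterations = sign(message)
        by_iterations[iterations].append(time.perf_counter() - start)

    all_times = [t for times in by_iterations.values() for t in times]
    single: Optional[float] = (
        statistics.fmean(by_iterations[1]) if 1 in by_iterations else None
    )
    return {
        "single_iteration_time_s": single,
        "average_time_s": statistics.fmean(all_times),
        "average_iterations": sum(k * len(v) for k, v in by_iterations.items()) / runs,
        "histogram": {k: len(v) for k, v in sorted(by_iterations.items())},
    }


report = benchmark(make_simulated_signer(p=0.1963, seed=1), runs=2000)
print(report["average_iterations"], report["histogram"])
```

Grouping timings by iteration count gives the single-iteration time and the overall average from the same run, and a histogram that drifts away from the geometric model is a hint that something in the implementation or the randomness source is off.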

Comparing to Traditional Signatures

To illustrate why rejection sampling benchmarking is different, consider RSA or ECDSA:

Signing time is deterministic: You can measure a single 2048-bit RSA signature and get the same runtime within microseconds every time
Energy consumption is predictable: An ECDSA-P256 signature consumes nearly identical energy regardless of message or key
Performance metrics are straightforward: Report a single timing number; it accurately represents all signing operations

With ML-DSA, the choice of metric dramatically affects system design. Budget for the average case and roughly a third of your operations will overrun it. Budget for the 99th percentile and you’re over-provisioning by about 4x relative to the average.

Conclusion

The rejection sampling in ML-DSA’s signing operations is a carefully engineered security feature, not a limitation. It’s fundamental to how lattice-based signatures provide provable security against known attacks. But it does require a different approach to performance evaluation than traditional signature algorithms do. It is worth noting that:

Performance is probabilistic, not deterministic. A single timing measurement is meaningless. Instead, you need to understand the distribution of signing times.
The expected overhead is manageable. Averaging roughly 4 to 5 iterations (about 5 for ML-DSA-65) is reasonable. The core signing operation (one iteration) executes in acceptable time on modern embedded hardware.
You can predict and measure it precisely. Using the geometric distribution model and FIPS-204 parameters, you now have the mathematical framework to estimate signing time distributions without extensive benchmarking.
System design must account for variability. Real-time systems, battery-powered devices, and time-sensitive protocols need to budget for 99th percentile cases, not average-case performance.
Signing only. The abort/retry mechanism applies only to the signing operation; key generation and verification are unaffected.

Making SmartFusion2 Productive in Brownfield Systems

Kris Kwiatkowski — Mon, 22 Dec 2025 00:00:00 GMT

Introduction

SmartFusion2 is an interesting platform: an FPGA tightly coupled with a Cortex‑M3 microcontroller, security features baked into silicon, and a toolchain that reflects its long industrial heritage. It is powerful—but it can also feel heavy if your primary goal is simply to get code running, talk over UART, and start experimenting.

This post describes a software-first workflow for working with the Microcontroller Subsystem (MSS) on Microchip SmartFusion2, based on hands‑on work with the M2S090TS evaluation board. The emphasis is deliberately on getting productive quickly, especially for software and security engineers who do not want to live inside FPGA tools.

Rather than documenting every register or Libero click-path, this article focuses on the decisions, trade-offs, and minimal setup that make the platform usable and predictable in practice.

Context: An Older Platform and Brownfield Devices

SmartFusion2 is not a new platform. It has been deployed in real products for years, often in long-lived industrial, infrastructure, and security-sensitive systems. This matters, because a large part of its relevance today comes from brownfield deployments, not greenfield designs.

In engineering terms, brownfield devices are systems that:

Are already deployed or close to deployment
Have fixed hardware constraints
Cannot be redesigned freely without high cost or risk
Must be extended, maintained, or upgraded in place

This is in contrast to greenfield designs, where hardware, software, and tooling choices can be made from scratch. For brownfield devices, the problem is rarely “design the perfect system.” Instead, it is:

How to add new functionality without changing hardware
How to modernise software workflows on top of legacy platforms
How to introduce new security mechanisms without destabilising a proven system

SmartFusion2 fits squarely into this category. Many teams encounter it not because they would choose it today, but because it is already part of an existing product or certification boundary.

The approach described in this article is shaped by that reality: it assumes fixed hardware, aging tooling, and long product lifetimes, and focuses on making such systems workable and productive rather than ideal.

Philosophy: Software First, Hardware Fixed

The guiding idea behind this setup is simple:

Fix the hardware early, keep it minimal, and let software move fast.

SmartFusion2 allows deep hardware customisation, but recompiling FPGA designs is slow, license-gated, and unnecessary for early development. By freezing a small, known-good MSS configuration and distributing it as a ready-to-flash image, software developers can iterate without touching Libero at all.

A Minimal and Repeatable Development Platform

The evaluation setup is intentionally designed to remove friction during early development. All interaction with the board is handled through a single USB connection. The kit exposes a built-in FlashPro programmer, so no external probes or adapters are required to establish a usable development environment.

In practice, the setup reduces to three fixed elements:

One USB cable used for both programming and UART console access
A jumper configuration that allows flashing of both the FPGA fabric and firmware
A known, static DIP-switch configuration

Figure 1

Once configured, the board can remain in this state for the entire development cycle.

On the hardware side, the FPGA design is deliberately minimal. Only the components required to make the MSS usable are enabled: a Cortex-M3 clocked at 166 MHz, APB buses running at full core speed, a single UART for console output, GPIO-mapped LEDs for visible execution state, and one GPIO routed as an external trigger for measurement and debugging. There is no custom logic, no accelerators, and no unused peripherals. The result is a predictable execution environment that behaves identically on every boot.

The FPGA design is developed using Libero SoC, Microchip’s integrated FPGA design environment for SmartFusion2. Libero is used to configure the FPGA fabric, MSS peripherals, clocks, and pin assignments, and to generate the final FPGA bitstream. It is a comprehensive but heavyweight toolchain, typically operated by hardware teams, and it requires licenses, long build times, and detailed device-level knowledge. Libero produces both the FPGA bitstream and the associated firmware artifacts, which can then be used to build a BSP. Firmware engineers can subsequently continue software development in SoftConsole IDE or, for more low-level workflows, directly edit the code (e.g., in vim) and build it using a GCC-based ARM toolchain.

To keep the workflow software-centric, the FPGA can be programmed using FlashPro Express rather than a full Libero project. The hardware is delivered as a pre-built programming job, which avoids licensing requirements and lengthy synthesis or place-and-route steps. Every developer works against an identical hardware configuration, and flashing the FPGA becomes a one-time operation that takes seconds (well, maybe longer…).

The firmware follows the same philosophy. It does only what is necessary to confirm that the platform is alive: initialise clocks and GPIOs, bring up a UART console, and provide a visible heartbeat via an LED. If UART output is visible and the LED toggles, the system is ready. Anything beyond that belongs in application code, not in bring-up firmware.

Two firmware build modes are supported: a debug configuration that runs from SRAM for fast iteration, and a release configuration that runs from on-chip non-volatile memory for deployment-like testing. This split keeps development efficient without sacrificing realism.

Enabling UART

Using SmartFusion2 from Linux works well in practice, but one detail regularly trips people up: UART access via the on-board FTDI device. Microchip’s documentation is very complete for Windows, but Linux workflows are less well covered (even though Microchip support is excellent). The issue is not the hardware, but how Linux binds drivers to the FTDI interfaces by default.

The evaluation board exposes a multi-interface FT4232H USB device. From a hardware perspective, several virtual serial channels are available. From the Linux kernel’s perspective, however, only one of those interfaces is automatically bound to the ftdi_sio driver, while the remaining three are not.

The FT4232H device connected to the SmartFusion2 micro-USB port sets up four virtual ports. Under Linux, the root device is listed as:

Bus 003 Device 005: ID 1514:2008 Actel Embedded FlashPro5

and each individual interface as

/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 480M
    |__ Port 4: Dev 5, If 0, Class=Vendor Specific Class, Driver=, 480M
    |__ Port 4: Dev 5, If 1, Class=Vendor Specific Class, Driver=, 480M
    |__ Port 4: Dev 5, If 2, Class=Vendor Specific Class, Driver=ftdi_sio, 480M
    |__ Port 4: Dev 5, If 3, Class=Vendor Specific Class, Driver=, 480M

By default, Linux binds ftdi_sio only to interface If 2. In practice, the remaining interfaces can also be bound to ftdi_sio. You can force the driver to match them as follows:

echo 1514 2008 | sudo tee /sys/bus/usb-serial/drivers/ftdi_sio/new_id

This command causes the kernel to rebind the remaining FTDI interfaces, which can be verified in dmesg. After that, UART access is available via the /dev/ttyUSBx device corresponding to interface If 3 (the last FTDI interface).

(Credit goes to a colleague who helped identify this behaviour.)

Flashing Firmware from the Linux Command Line

Flashing firmware from the command line is not officially supported by the standard toolchain; Microchip recommends using SoftConsole. The GUI relies on OpenOCD (bundled with SoftConsole) and GDB, both of which connect to the on-chip debugger via USB.

A command-line workflow is essential for automation, CI, and reproducible Linux-based development. While it is possible to reverse-engineer the OpenOCD invocations used by the GUI, there is a cleaner alternative.

SmartFusion2 is supported by the pyOCD toolset, which provides a practical and well-maintained way to flash and debug the device through an external probe.

SmartFusion2 evaluation boards expose a standard RVI debug header, which allows the use of common ARM debug probes such as:

Keil ULINK (CMSIS‑DAP)
Other CMSIS‑DAP compliant probes
SEGGER J‑LINK

A key drawback of using an external programmer is that jumper J8 must be moved to position 2–3 in order to program the FPGA bitstream using FlashPro 4 or 5.

pyOCD supports these probes out of the box and provides a clean, scriptable interface suitable for automation.

In practice, the workflow looks like this:

Connect the external probe to the RVI header
Install pyOCD on the host system
Install the CMSIS device pack for the SmartFusion2 target
Flash binaries directly from the command line

Once set up, flashing a firmware image becomes a single command, which integrates naturally into Makefiles, CMake builds, or CI pipelines. This avoids reliance on vendor GUIs while remaining robust and repeatable.

Concretely, with a Keil ULINK2 probe:

HW configuration: The user connects ULINK2 into the RVI port, connects pins 1-2 of jumper J8 and J31 (as below).
SW configuration: Once pyOCD is installed, the user needs to download a CMSIS package for M2S090 board. No special udev rules are required.

> pyocd pack install m2s090
Downloading packs (press Control-C to cancel):
    Microsemi.M2Sxxx.1.0.65
Downloading descriptors (001/001)

> pyocd list -p
  #   Probe/Board                           Unique ID   Target
----------------------------------------------------------------
  0   Keil Software Keil ULINK2 CMSIS-DAP   V0010M9E    n/a

With this setup, flashing becomes straightforward:

pyocd flash --target m2s090 app/hello.bin
pyocd reset --target m2s090 -m hw

The reset command resets the board, and the -m hw option ensures that the device does not enter debug mode.
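
For CI use, the two commands above can be wrapped in a small script. The sketch below only shells out to the pyocd invocations shown in this post; the function names and the dry_run switch are assumptions made for this example.

```python
import subprocess


def flash_commands(image: str, target: str = "m2s090") -> list[list[str]]:
    """Build the pyocd command lines used in this post for one image."""
    return [
        ["pyocd", "flash", "--target", target, image],
        ["pyocd", "reset", "--target", target, "-m", "hw"],
    ]


def flash(image: str, target: str = "m2s090", dry_run: bool = False) -> None:
    """Flash the image, then reset the board; stop at the first failure."""
    for cmd in flash_commands(image, target):
        print("+", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Dry run: print the commands without touching any hardware.
flash("app/hello.bin", dry_run=True)
```

Using check=True makes a failed flash fail the CI job immediately instead of continuing to the reset step.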

For measurement and analysis, UART is often not the most convenient interface.

As an alternative, SEGGER J-LINK or CMSIS-DAP programmers can be used. The RTT interface provided by J-LINK enables fast data transfer between the device and the host, which can be useful for collecting data for constant-time analysis, serving as an alternative to UART.

In addition, the board includes a Trace Port Interface Unit (TPIU) supporting ITM and ETM (Instrumentation Trace Macrocell / Embedded Trace Macrocell), which can also be used as an alternative to RTT.

Using usbip for Remote Access

In shared lab environments, it is often useful to access SmartFusion2 boards remotely—for example, from CI servers or developer machines without physical USB access. It also helps avoid accidents, such as spilling coffee on expensive hardware.

In such cases, usbip can be used to export a USB-attached debug probe (and, if needed, the FlashPro interface) from a remote host and reattach it on the client machine over the network.

This enables:

Remote firmware flashing using pyOCD
Centralised lab hardware shared across multiple users
Integration of physical boards into CI systems

Configuration is very simple.

Server side (machine with the USB device physically plugged in):

sudo modprobe -a usbip_core usbip_host
sudo usbipd -D
sudo usbip list -l
sudo usbip bind -b BUSID

The usbip list -l command lists available BUSID values, which can be exported over IP using usbip bind -b BUSID. Note the -a flag: without it, modprobe treats the second name as a module parameter rather than a second module.

Client side:

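sudo modprobe -a usbip_core vhci_hcd
sudo usbip list -r SERVER_IP
sudo usbip attach -r SERVER_IP -b BUSID

Similarly, usbip list -r SERVER_IP lists the USB devices exported by the remote server, where SERVER_IP is the address of the machine running usbipd.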

As a practical note, usbip uses TCP port 3240, so make sure this port is allowed through the firewall.

When combined with pyOCD and stable UART device naming (via udev), usbip allows SmartFusion2 boards to be treated much like network‑attached test equipment. This setup is particularly useful for regression testing, automated measurements, and long‑running experiments.

As with any USB‑over‑IP solution, latency and reliability depend on the network, but for flashing and debug control the approach works well in practice.

Conclusion

SmartFusion2 can feel intimidating at first, especially if approached from an FPGA-centric mindset. Treated instead as a microcontroller platform with a fixed hardware personality, it becomes much easier to work with.

Despite being an older platform, SmartFusion2 remains highly capable. The MSS provides 80 KB of SRAM (or 64 KB when error correction is enabled), which may sound restrictive by modern standards—but in practice it is sufficient for serious cryptography. In our work, we were able to fit both ML-DSA and ML-KEM comfortably, using significantly less. The resulting implementations run reliably and with respectable performance for a Cortex-M3-class device.

This does, however, require care. On Cortex-M3, certain long multiplication instructions are not constant-time, which means developers must be deliberate when implementing cryptographic arithmetic, especially in security-sensitive contexts. Understanding where the microarchitecture leaks and how to work around it is essential.

That topic deserves a deeper discussion of its own and is a story for another blog.

By keeping hardware minimal, firmware simple, and tooling predictable, the platform becomes stable enough to fade into the background. Once the basics are solid, complexity can be added where it actually matters. Starting simple is what makes that possible.


Evaluating Intel QAT for Hash-Based Post-Quantum Signature Schemes

Kris Kwiatkowski — Sat, 29 Nov 2025 00:00:00 GMT

Introduction

Intel QuickAssist Technology (QAT) accelerates cryptographic workloads by offloading selected operations to dedicated hardware. This reduces CPU load and improves throughput for applications that rely on TLS, VPN, storage encryption, key exchange, or large-scale hashing.

This post outlines the QAT software stack and evaluates its usefulness for accelerating modern cryptographic implementations.

In this post, I show that QAT hashing is not a good fit for post-quantum signature schemes such as LMS, XMSS, and SLH-DSA: the accelerator only wins for large, contiguous inputs, while these schemes issue many small, sequential hash calls. Instead, QAT remains best suited for bulk TLS/IPsec/storage workloads and asynchronous offload.

QAT software stack

The diagram below shows how applications reach QAT hardware through OpenSSL, the QAT Engine, and Intel’s software libraries:

Application: Consumes cryptographic services (e.g. TLS termination, IPsec, disk encryption, PKI, secure messaging).
OpenSSL: Exposes standard crypto APIs; can dispatch operations to software or QAT hardware transparently.
QAT Engine: OpenSSL engine plugin that offloads supported primitives (RSA, ECDSA, DH, AES-GCM, ChaCha20-Poly1305, SHA variants) to QAT hardware when available.
Intel IPP: Highly optimized CPU software primitives (SIMD, microarchitecture-tuned) for symmetric and asymmetric cryptography when hardware offload is not used.
Multi-Buffer Crypto (IPP-crypto): Batches multiple independent crypto jobs (e.g. parallel RSA, AES-GCM streams) to improve core utilization—useful for high-concurrency servers.
Intel QAT Driver: Kernel + user-space interface to QAT devices. Two branches exist:
- In-tree (QATlib): Aligned with kernel development model; standardized feature management.
- Out-of-tree: Broader feature set for some legacy or extended hardware. Driver utilities (e.g. adf_ctl) configure and monitor accelerator instances. Driver families 1.x and 2.x support different hardware generations.
QAT Hardware: Integrated (selected Xeon SKUs) or discrete PCIe accelerators providing queues for crypto and compression services.
QATlib: User-space library exposing a stable API to submit crypto jobs to QAT hardware or to fall back on software paths.

Key Points

Applications typically call OpenSSL, which can use the QAT Engine to offload to hardware or fall back to software (IPP, IPP-crypto).
QAT Engine is designed specifically for hardware acceleration.
Intel IPP and Multi-buffer Crypto provide CPU-based optimizations when hardware acceleration is unavailable or unnecessary.
Multi-buffer Crypto boosts performance by parallelizing cryptographic operations across multiple data buffers, which is ideal for high-concurrency servers.

Cryptographic support

The list of algorithms supported by QAT is documented in the official QAT documentation. Support varies by hardware generation; the driver detects availability. If an algorithm is unsupported, its context initialization returns CPA_STATUS_UNSUPPORTED. Numeric algorithm IDs are defined in the driver sources here.

For this study, the relevant hashing algorithms provided by the installed hardware are:

SHA2-256
SHA2-512
SHA3-256

(SHAKE is not supported on this device.)

Quantitative analysis

We assess whether QAT hash implementations are useful as building blocks inside post-quantum signature schemes - LMS and XMSS.

Measurements compare a conventional optimized C implementation (running on the host CPU) with one-off QAT hash requests submitted via QATlib. Single-shot calls reflect PQ use cases: many small, latency-sensitive invocations rather than bulk streaming. Inputs up to 4 MB stay within QAT request size limits; larger sizes are not relevant for these schemes.

SHA2-256

In LMS (and similarly XMSS) SHA2-256 dominates runtime in key and signature generation. The inner loop for computing K (RFC 8554) iteratively applies the hash to ≈55‑byte inputs; the sequential dependency prevents parallelization:

     4. Compute the string K as follows:
        for ( i = 0; i < p; i = i + 1 ) {
          tmp = x[i]
          for ( j = 0; j < 2^w - 1; j = j + 1 ) {
            tmp = H(I || u32str(q) || u16str(i) || u8str(j) || tmp)
          }
          y[i] = tmp
        }
        K = H(I || u32str(q) || u16str(D_PBLC) || y[0] || ... || y[p-1])

Characteristics:

Strictly sequential inner chain (no batching benefit). The intermediate value tmp must be computed sequentially and cannot be parallelized.
Very small message size per hash call (55 bytes per invocation).
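
To make the access pattern concrete, here is a sketch of the same loop in Python using hashlib. It assumes the LMOTS_SHA256_N32_W8 parameters from RFC 8554 (n = 32, w = 8, p = 34) and computes only the string K from step 4; the function name and the all-zero test inputs are chosen for this example.

```python
import hashlib

# Assumed parameter set: LMOTS_SHA256_N32_W8 from RFC 8554.
N = 32           # hash output length in bytes
W = 8            # Winternitz parameter
P = 34           # number of hash chains
D_PBLC = 0x8080  # domain separator for the final hash


def lmots_k(I: bytes, q: int, x: list[bytes]) -> bytes:
    """Compute K from the private chain starts x[0..P-1]."""
    prefix = I + q.to_bytes(4, "big")
    y = []
    for i in range(P):
        tmp = x[i]
        for j in range(2**W - 1):
            data = prefix + i.to_bytes(2, "big") + j.to_bytes(1, "big") + tmp
            assert len(data) == 55  # 16 + 4 + 2 + 1 + 32 bytes per call
            # Each call needs the previous digest, so the chain is sequential.
            tmp = hashlib.sha256(data).digest()
        y.append(tmp)
    return hashlib.sha256(prefix + D_PBLC.to_bytes(2, "big") + b"".join(y)).digest()


# All-zero identifier and chain starts, purely to exercise the loop.
K = lmots_k(bytes(16), 0, [bytes(N)] * P)
print(K.hex(), "computed with", P * (2**W - 1), "chained 55-byte hashes")
```

With these parameters a single public key costs 34 * 255 = 8,670 dependent hash calls of 55 bytes each, which is the workload shape the benchmark is probing.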

Benchmark results (time per input, log scale) show QAT incurs large fixed overhead for small buffers; advantage appears only beyond ≈512 KB. This makes QAT hashing unsuitable inside LMS/XMSS constructions: cumulative latency increases sharply when each tiny hash call pays the fixed offload cost.

Consequently, using QAT-based hash functions as components within LMS, XMSS, or SLH-DSA implementations is not advisable, as it would result in a substantial performance penalty.
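
The effect of a fixed per-request cost can be illustrated with a simple linear model. The sketch below is not derived from the measurements in this post: the overhead and throughput values are placeholders picked only to show the shape of the trade-off, and the function names are hypothetical.

```python
from typing import Optional


def crossover_bytes(overhead_s: float, cpu_bps: float, accel_bps: float) -> Optional[float]:
    """Input size at which the accelerator starts to win, or None if never.

    Model: t_cpu(n) = n / cpu_bps and t_accel(n) = overhead_s + n / accel_bps.
    """
    if accel_bps <= cpu_bps:
        return None
    return overhead_s / (1.0 / cpu_bps - 1.0 / accel_bps)


def chain_time(calls: int, size: int, overhead_s: float, bps: float) -> float:
    """Total time for `calls` dependent hashes of `size` bytes each."""
    return calls * (overhead_s + size / bps)


# Placeholder numbers for illustration only; they are not measurements.
OVERHEAD_S = 10e-6   # fixed cost per offloaded request
CPU_BPS = 1e9        # software hashing throughput, bytes per second
ACCEL_BPS = 2e9      # accelerator streaming throughput, bytes per second
CALLS = 34 * 255     # chained hashes for LM-OTS with n = 32, w = 8

print("crossover size (bytes):", crossover_bytes(OVERHEAD_S, CPU_BPS, ACCEL_BPS))
print("chain on CPU (s):", chain_time(CALLS, 55, 0.0, CPU_BPS))
print("chain offloaded (s):", chain_time(CALLS, 55, OVERHEAD_S, ACCEL_BPS))
```

Because the hashes are dependent, the fixed cost is paid once per call and cannot be hidden by batching, so in this model the offloaded chain is slower for any positive overhead at 55-byte inputs. The None branch corresponds to the case where the accelerator's streaming rate does not exceed the CPU's.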

Conclusion: explore specialized hardware-assisted chaining (e.g. PQPerform-style offload of the iterative compression loop) rather than generic QAT hashing. Such approaches, however, require hardware features not exposed by QAT.

SHA3-256

SHA3-256 offload shows similar scaling; hardware improves absolute times versus SHA2 but still requires large inputs to amortize setup. For PQ schemes with many sub-256‑byte or kilobyte-scale hashes, CPU software remains superior in both latency and energy.

The results are consistent with those observed for SHA2, with SHA3 showing noticeably better hardware performance.

Conclusion: The hardware supports SHA3-256 but not SHAKE. Performance characteristics are similar: QAT only overtakes the CPU at large buffer sizes, making it unsuitable for PQ signature schemes where hashes are small and sequential.

SHA2-512

Used in SLH-DSA. Host software (64‑bit words) outperforms QAT across all tested sizes up to 4 MB. The 64‑bit variant also beats SHA2-256 on the same platform, as expected due to native word operations.

Conclusion: For SHA2-512, QAT never outperforms the host CPU at any tested size. Software remains the optimal choice for SLH-DSA workloads.

Overall findings

QAT hashing favors large, contiguous payloads (bulk TLS record digestion, storage integrity, deduplication).
PQ signature schemes issue numerous small, sequential, data-dependent hash calls—poor match for queue-based accelerator semantics.
Offload overhead dominates for sub‑MB inputs; no throughput crossover in practical PQ parameter ranges.
Optimized host implementations (vectorized SHA2/SHA3) yield lower latency and better scaling for LMS, XMSS, SLH-DSA.

Conclusion: For post-quantum signature workloads on 64‑bit systems, retain optimized CPU hash functions. QAT hashing is not an effective accelerator for their internal iterative constructions.

Technological Fit for QAT

To realize these benefits, applications must integrate QAT asynchronously. Offloading compute-heavy primitives frees CPU cycles and improves throughput, particularly in high-concurrency environments.

A key enabler is asynchronous execution. In the past, Intel invested in OpenSSL by implementing the ASYNC_JOB infrastructure. This functionality is based on the reactor pattern, which we describe below.

Asynchronous use of the `QAT_engine`. The reactor pattern.

The reactor software design pattern is an event handling strategy that can respond to many potential service requests concurrently. Its key function is to demultiplex incoming requests and dispatch them to the correct request handler. By relying on event-based mechanisms rather than blocking I/O or multi-threading, it’s designed to handle numerous concurrent I/O bound requests with minimal delay. Request handlers (here QAT engines) are registered as callbacks with the event handler for flexibility and separation of concerns.

Reactor

The Reactor pattern excels at managing asynchronous I/O. QAT is accessed asynchronously; requests are submitted, and the application is notified later via polling or a callback. This direct match means a Reactor-based application architecture can effectively handle QAT operations without blocking its main event loop. This is typically used for TLS hardware acceleration: each engine is implemented as a request handler within the reactor. Such a design minimizes connection-establishment latency in environments that handle multiple requests at the same time.
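The submit-then-get-notified flow can be sketched with Python's `asyncio` event loop. This is not the QAT or OpenSSL API: the accelerator is simulated with a thread-pool call to `hashlib`, and the names `offload_digest`, `handle_connections` and `run_demo` are hypothetical.

```python
import asyncio
import hashlib


async def offload_digest(data: bytes) -> bytes:
    """Submit one request and suspend until its completion is signalled."""
    loop = asyncio.get_running_loop()
    # Stand-in for an accelerator queue: the work runs elsewhere while the
    # event loop stays free to serve other handlers.
    return await loop.run_in_executor(None, lambda: hashlib.sha256(data).digest())


async def handle_connections(payloads):
    """Dispatch all requests concurrently and collect results in order."""
    return await asyncio.gather(*(offload_digest(p) for p in payloads))


def run_demo(payloads):
    return asyncio.run(handle_connections(payloads))
```

The event loop plays the role of the reactor: each `await` is a point where control returns to the loop until the corresponding completion arrives.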

TLS offload

With asynchronous execution available in OpenSSL, Intel performed a measurement study showing performance improvement when offloading TLS. The study goes into great detail on how the measurements were done.

The conclusion that is particularly interesting is that QAT throughput for TLS saturates beyond 16 cores for RSA2K/ECDHE-X25519.

NGINX Webserver Handshake Performance

It would be interesting to see a similar measurement study focused on post-quantum (PQ) schemes, namely the following combinations:

Key Exchange: X25519-MLKEM768, Digital Signature: RSA-2048. This reflects the most commonly used hybrid setup in today’s web traffic.
Key Exchange: MLKEM-768, Digital Signature: ML-DSA-65. A compelling mid-term candidate for fully post-quantum TLS session establishment.
Key Exchange: MLKEM-768, Digital Signature: FN-DSA-512. Arguably the most performant post-quantum option for future web communication.

A comparison of QAT, software implementations, and an implementation on a GPU (e.g. cuPQC) could provide a realistic view of performance trade-offs across hybrid and fully post-quantum scenarios.

QAT ecosystem: IPP and Multi-buffer Crypto

This part describes in more detail the software components used in the QAT ecosystem.

Integrated Performance Primitives (IPP)

Intel IPP is a set of software libraries optimized for Intel processors that provides a variety of cryptographic primitives. They are optimized for latency and throughput by using Intel’s ISA crypto extensions (traditional software acceleration, leveraging SIMD, NI, etc.). IPP implements operations like AES, RSA, ECC, hashing (SHA), and compression algorithms.

Multi-buffer Crypto

A specialized library developed by Intel and often packaged alongside IPP, designed specifically for parallelizing cryptographic operations across multiple independent data buffers simultaneously. It optimizes performance in multi-threaded or asynchronous environments by batching multiple independent cryptographic operations and processing them concurrently. Particularly useful for ciphers like AES-GCM, where latency can be hidden by parallelism.

It’s important to understand that the library is designed explicitly for parallel workloads. It achieves significantly better throughput by processing multiple buffers concurrently, even within a single thread.

Ideal for high-throughput networking (server side) scenarios, VPNs, and SSL/TLS termination points, with multiple client connections.

Relationship between IPP and Multi-buffer Crypto

IPP provides baseline cryptographic primitives optimized for single-buffer, high-performance CPU execution. Multi-buffer Crypto takes IPP primitives a step further, optimizing for parallel operations on multiple independent data streams.

Multi-buffer Crypto delivers much higher throughput in scenarios where latency can be tolerated and multiple independent tasks can run in parallel.

Reuse within a cryptographic software stack

When integrating QAT (or any accelerator) into a cryptographic software stack, three patterns are common:

Integration into the core crypto stack
Separate companion component alongside the core stack
Plugin-based integration for open-source ecosystems

Integration into the core crypto stack

This approach embeds HW‑accelerated implementations directly into the core library, exposing a unified API over both software and HW assisted paths. While convenient at the interface level, it significantly increases design complexity and long‑term maintenance burden. A cleaner separation is to keep the core limited to portable software (including CPU ISA extensions), and avoid coupling to device‑specific accelerators.

The same conclusion applies to accelerator SDKs targeting other devices (e.g., GPUs): keeping them out of the core library preserves clarity and portability.

Separate companion component

This mirrors a split design: a pure‑software core and a distinct HW‑assisted component, each with its own repository, release cadence, and maintenance workflow. The separation is practical because accelerator support targets a narrower footprint, while the core aims for broad platform coverage. A companion component can provide a user‑space dispatch layer that selects between accelerator backends and CPU‑optimized code, and can be extended to support additional devices over time.
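A user-space dispatch layer of this kind might look roughly like the sketch below. Everything here is hypothetical (the `Backend` and `Dispatcher` names, the `min_size` threshold field); the accelerator backend is simulated with the same CPU routine, and the 512 KB threshold simply reuses the crossover point observed earlier in this post.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Backend:
    name: str
    digest: Callable[[bytes], bytes]
    available: bool
    min_size: int  # smallest payload (bytes) for which this backend is worth using


def cpu_sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Dispatcher:
    def __init__(self, backends: List[Backend]):
        # Try the backend with the highest threshold first; the CPU path
        # (threshold 0) acts as the fallback.
        self.backends = sorted(backends, key=lambda b: b.min_size, reverse=True)

    def select(self, size: int) -> Backend:
        for backend in self.backends:
            if backend.available and size >= backend.min_size:
                return backend
        raise RuntimeError("no backend available")

    def sha256(self, data: bytes) -> bytes:
        return self.select(len(data)).digest(data)


dispatcher = Dispatcher([
    Backend("cpu", cpu_sha256, available=True, min_size=0),
    Backend("accel", cpu_sha256, available=True, min_size=512 * 1024),  # simulated
])
```

With this policy, a 55-byte LMS hash input stays on the CPU while a multi-megabyte buffer is routed to the accelerator; adding another device means registering one more `Backend`.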

Plugin-based integration for open-source ecosystems

Here, accelerator support is delivered as plugins for widely used open‑source cryptographic libraries and frameworks. We recommend building and maintaining such plugins following the separate‑component strategy above: plugins integrate with an existing cryptographic implementation rather than re‑implementing primitives. These plugins must handle low‑latency, incremental input processing and asynchronous completion.

This approach delivers immediate value by integrating with real applications—for example, web servers for TLS offload or VPN stacks for key establishment—without requiring application rewrites.

In practice, I favor a portable software core plus a separate accelerator companion and plugin-style integrations for ecosystems like OpenSSL, rather than baking accelerator logic into the core library.

Conclusions

The quantitative study presented in this post was conducted using QAT hardware connected via PCIe. While the host machine used is relatively powerful, the PCIe communication introduces latency during data transfers.

Intel’s 4th Generation Xeon Scalable processors, released in 2023, represent a significant architectural advancement. These processors feature integrated QAT acceleration engines and support for the CXL 1.1 (Compute Express Link) standard. QAT performance on these CPUs may differ significantly from our current results. Preliminary analysis suggests that CXL offers a more efficient communication model between CPUs and cryptographic accelerators, making it a better fit for such workloads. Even with CXL, unless latency drops substantially, the underlying pattern is likely unchanged: QAT is effective for bulk workloads but not for the many small, sequential hash invocations typical in PQ signature schemes.

Migration to Post-Quantum Cryptography

Kris Kwiatkowski — Tue, 16 Sep 2025 00:00:00 GMT

The global internet security ecosystem is preparing for one of its biggest shifts in decades: the migration from traditional cryptographic algorithms to post-quantum cryptography. Quantum computing may still be years away from breaking widely deployed algorithms, but the “harvest-now, decrypt-later” (HNDL) threat makes planning and transitioning urgent. Sensitive data encrypted today could be collected and decrypted in the future once cryptographically relevant quantum computers (CRQCs) arrive.

Here are some thoughts on the migration.

Why Start the Migration?

You don’t want to be caught off guard when quantum computers become capable of breaking current cryptographic standards. The migration process is complex and time-consuming, often taking several years to complete. Organizations must first evaluate when and how to begin, considering factors such as:

Data lifetime – how long the information you protect needs to remain confidential.
Migration complexity – how much effort is required across systems, hardware, and vendors.
Quantum threat timeline – how soon a CRQC could become practical.

Even without immediate existential risk, planning and testing today ensures you aren’t forced into a rushed, high-cost migration tomorrow. More complicated systems, such as old, small embedded devices, may take years to update, replace, or ideally redesign with PQC in mind.

Starting this transition early is a good idea, because migration can be complex and time-consuming, but there is no need to panic or feel pressured to start immediately. Be deliberate when selecting products and suppliers - this technology is still maturing and not yet fully commoditized, which means higher costs and potential risks. Open-source solutions can help mitigate some of these challenges, though they come with their own uncertainties. Proprietary options may provide stronger support and stability but can also create dependency on specific vendors and ecosystems. In short, try hard to avoid the “sell-now, forget-later” mindset - this remains a developing field with trade-offs, uncertainties, and costs that must be carefully balanced.

Key Exchange in TLS: The Most Urgent Step

TLS key exchange is the top priority for PQC migration. If session keys are negotiated using quantum-vulnerable algorithms, future quantum computers could decrypt recorded traffic. To mitigate this, PQ/T hybrid key exchange is recommended during the transition. These approaches combine at least one post-quantum and one classical algorithm, so security is maintained as long as one remains unbroken. Hybrid KEMs are also easier to roll out than post-quantum signatures, since they are ephemeral and not linked to long-term identity.
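The "secure as long as one remains unbroken" property comes from deriving the session secret from both shared secrets together. A minimal sketch, assuming fixed-length secrets, a simple concatenation, and an HKDF-Extract-style step with an all-zero salt (the function name, the ordering and the salt are illustrative assumptions; a real TLS stack feeds the concatenation into its own key schedule):

```python
import hashlib
import hmac


def combine_hybrid_secrets(pq_secret: bytes, classical_secret: bytes,
                           salt: bytes = bytes(32)) -> bytes:
    """Derive one key from a post-quantum and a classical shared secret."""
    # Both inputs have fixed lengths, so plain concatenation is unambiguous.
    ikm = pq_secret + classical_secret
    # HKDF-Extract (RFC 5869) is HMAC(salt, input keying material).
    return hmac.new(salt, ikm, hashlib.sha256).digest()
```

An attacker who recovers only one of the two inputs still faces an unknown HMAC input, which is the intuition behind the hybrid construction.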

As PQC matures, hybrid KEMs will become less important, and eventually only the post-quantum algorithm will be needed.

The IETF began work on standardizing post-quantum key exchange for TLS in 2019, after a key workshop at Mozilla’s Mountain View offices and early Google-led experiments. This resulted in a framework for new key exchange methods, and the TLS Post-Quantum Experiment showed hybrid KEMs in action.

The first widely adopted Internet draft for hybrid key exchange in TLS is now close to completion. This draft and its extension have become the de facto standard for modern TLS, with implementations already available in OpenSSL, NGINX, and AWS’s AWS-LC library, and are already widely deployed. Jan Schuman from Akamai has a great post on sites already using PQC.

TLS is not the only protocol that needs quantum-safe key exchange. For this reason, there is also an effort to standardize a more generic approach to hybrid KEM constructions for broader use.

Digital Signatures: Important, But Less Urgent

Digital signatures remain a vital component of cryptographic protocols, ensuring authenticity, integrity, and non-repudiation. As such, post-quantum digital signature schemes are necessary for a future-proof internet infrastructure. However, the urgency to deploy them is relatively lower compared to key encapsulation mechanisms (KEMs).

Unlike key exchange, authentication cannot be broken retrospectively, meaning quantum-safe signatures are only needed once cryptanalytically relevant quantum computers become available. As a result, the migration to post-quantum digital signatures is less time-sensitive than for KEMs, allowing for a more deliberate and carefully planned transition. Since post-quantum signature schemes often involve larger keys and signatures, greater computational overhead, and increased implementation complexity, their deployment may incur higher costs, reinforcing the importance of keeping the migration as simple and efficient as possible.

Determining whether and when to adopt PQC certificates or PQ/T hybrid schemes may depend on several factors, such as:

Frequency and duration of system upgrades
Operational flexibility to enable or disable algorithms

Deployments with limited flexibility (e.g., embedded systems) benefit significantly from PQ/T hybrid signatures. This approach mitigates the risks associated with delays in transitioning to PQC and provides an immediate safeguard against zero-day vulnerabilities.

While hybrid constructs may seem plausible for long-term security, they also introduce complexity, potential performance overhead, and long-term implications:

The number of possible hybrid combinations leads to interoperability challenges and increased implementation burden.
If one scheme is compromised, forgery is only a concern while the corresponding public key remains trusted.
Long-term protection through hybrids may be limited in practice due to standard key management practices.

There is another risk related to the potential misuse of PQ/T hybrid signatures. Consider this: a deployment may use hybrid signatures to facilitate migration, resulting in a mix of devices - some aware of PQ schemes and some not. Devices unaware of PQ schemes may continue to validate only the traditional signature, while those aware of PQ schemes may validate both signatures. A deployment might continue this approach even after the traditional algorithm has been broken. While this may simplify operations by avoiding re-provisioning of trust anchors, it introduces a significant risk. A CRQC could forge the broken traditional signature component over a message, then combine it with the valid post-quantum component to produce a new composite signature that verifies successfully. This underscores the critical need to retire hybrid certificates containing broken algorithms once CRQCs become available (and always validate both components of a hybrid signature).
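The rule "always validate both components" can be expressed as a small sketch. The verifier callables are placeholders for real signature verification routines, and the name `verify_composite` is hypothetical; this is not the encoding or procedure of any particular composite-signature specification.

```python
from typing import Callable

Verifier = Callable[[bytes, bytes], bool]  # (message, signature) -> valid?


def verify_composite(message: bytes, pq_sig: bytes, trad_sig: bytes,
                     verify_pq: Verifier, verify_trad: Verifier) -> bool:
    """Accept only if BOTH component signatures verify over the same message."""
    ok_pq = verify_pq(message, pq_sig)
    ok_trad = verify_trad(message, trad_sig)
    # Evaluate both before combining, so neither check is skipped.
    return ok_pq and ok_trad
```

A verifier that checks only the traditional component is effectively running `verify_trad` alone, which is exactly the legacy-device behaviour described above.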

The IETF has many experts working on this topic. For example, draft-ietf-lamps-pq-composite-sigs describes how to create and verify composite signatures that combine a post-quantum signature with a classical signature.

Nevertheless, hybrid signatures remain complicated and may not be suitable for all scenarios. Fortunately, they are only needed once quantum computers are capable of breaking current signature algorithms. So, we still have some time to make the authentication migration as smooth as possible.

Infrastructure Costs: What to Expect

This topic is both important and frequently overlooked. Migrating to post-quantum cryptography (PQC) often requires significant updates to existing infrastructure. Careful planning and budgeting are essential, as costs can arise in multiple areas.

The first step is discovery - building a comprehensive inventory of all cryptographic assets and where they are used. This process may require specialized tools and can itself be resource-intensive.

Another major factor is whether updates can be delivered through software patches or require new hardware. For some systems, particularly constrained or niche devices, supporting PQC may require custom development or even physical replacement. Engaging with vendors early is critical to understand available options and associated costs.

Key infrastructure components that will need attention include:

Network protocols – TLS, SSH, and QUIC must be adapted to handle larger PQC artifacts. While PQC KEMs such as ML-KEM often perform competitively in handshakes, their larger message sizes can increase bandwidth use, add round trips, or introduce latency.
Message processing – PQC signature algorithms typically process entire messages rather than digests. This can hurt performance in systems like HSMs that rely on streaming data, unless applications adopt pre-hashing or streaming-friendly designs.
PKI systems – Certificate Authorities (CAs), certificate formats, and trust anchors must all evolve to support PQC. Hybrid certificate formats can ease transition but also add complexity and operational overhead.
Constrained devices – Long-lived systems (e.g., satellites, industrial controllers, smart meters) are especially difficult to update. Limited memory and compute resources may force costly redesigns or replacements, and in-field updates can be logistically challenging.
Hybrid approaches, while helpful for resilience during the transition, can add two layers of cost: first to support dual algorithms (certificates, key management, validation), and later to migrate again once hybrids are no longer needed. In some environments, this two-step process is more expensive than planning a direct migration to PQC at the right moment.
Training and awareness – Staff need to be knowledgeable about PQC concepts, such as the KPIs of PQC implementations and their impact on performance, and familiar with the trade-offs of PQ/T hybrid schemes and their implications for the migration process. This may involve training programs, workshops, or hiring specialized personnel, all of which contribute to the overall cost.

Long-term savings come from embracing cryptographic agility: designing systems that can switch algorithms without major architectural changes. This reduces the cost of future transitions - but achieving true agility requires upfront investment in both design and standardization.

Final Thoughts

Migrating to post-quantum cryptography is not a single upgrade — it is a long-term process. As mentioned at the beginning of this article, starting this transition early is definitely a good idea.

It is worth recalling that the first IETF draft proposing a PQ/T hybrid key exchange for TLS was published back in 2019 (admittedly, it wasn’t a serious proposal at the time). We are now in 2025, and only recently have PQ/T hybrid standards been finalized by the IETF and started to see adoption by major browsers. That is six years for a single - and relatively simple - use case: key exchange in TLS.

When it comes to digital signatures, the migration will be significantly more complex and time-consuming. Based on current timelines and projections for quantum computing, cryptographically relevant quantum computers are expected to emerge by the end of the next decade. This leaves less than ten years to finalize standards and initiate large-scale migration, not only in TLS but everywhere. Given that such migration efforts typically span several years, it is clear that we are already behind schedule.

The situation today, however, is quite different from 2019: the first PQC algorithms have been standardized, and there is far greater motivation and momentum to move forward. Nevertheless, much remains to be done. The key is to plan for agility and actively align with emerging standards. Going back to the beginning of this article: don’t wait until the last minute, but also don’t rush into unproven solutions. Balance risk, cost, and operational complexity carefully.

The transition will be uneven: some systems will adopt hybrids, some will wait for pure PQC. Constrained devices may require tailored strategies and highly optimized implementations to match the performance and resource utilization of traditional algorithms. But the direction is clear: a future-proof Internet must stay safe!

Ongoing work in the IETF focuses on general guidance for migration to PQC as well as guidance for constrained devices. Feel free to join the discussions in the PQUIP Working Group.

Note on speed of verification in SLH-DSA

Kris Kwiatkowski — Tue, 03 Sep 2024 00:00:00 GMT

Here I’ll compare the verification performance of LMS and SLH-DSA. XMSS is not covered, but since LMS and XMSS are quite similar in this respect, we can probably expect similar results (XMSS is slightly slower than LMS).

When comparing stateful and stateless hash-based signature schemes, the main benefit of the former is significantly shorter signature sizes and much faster verification. The difference in signature size is significant. I summarised differences in a table below, but in brief, an LMS signature is around 4KB, while a signature with SLH-DSA at a similar security level is closer to <50KB, hence ~10x bigger.

SLH-DSA param	PubKey (bytes)	Signature (bytes)	Security (bits)	LMS param	SK (bytes)	PK (bytes)	Sig (bytes)
SLH-DSA-SHA2-128s	32	7856	128	-	-	-	-
SLH-DSA-SHA2-128f	32	17088	128	-	-	-	-
SLH-DSA-SHA2-192s	48	16224	192	LMS-SHA2-M24-H25-W8	44	48	1260
SLH-DSA-SHA2-192f	48	35664	192	LMS-SHA2-M24-H25-W1	44	48	5436
SLH-DSA-SHA2-256s	64	29792	256	LMS-SHA2-M32-H25-W8	52	56	1932
SLH-DSA-SHA2-256f	64	49856	256	LMS-SHA2-M32-H25-W1	52	56	9324

For SHA2-based LMS, 40 different parameterizations are possible; the table above uses the extreme values. The W1 postfix indicates large signatures but fast verification, while the W8 postfix indicates small signatures but slow verification.

When it comes to SLH-DSA, the s postfix indicates the small parameter sets and f the fast ones. But, contrary to LMS, verification with the s parameterization is faster than with f. That is related to the design of the verification algorithm: a shorter signature implies fewer evaluations of the hash function.

To give some numbers: the runtime of function F() dominates the runtime of SLH-DSA. We measured the average number of calls to that function as well as the percentage of time the verification algorithm spends in it. The results are presented in the table below. Note that the s variants make far fewer calls to F().

	128f	192f	256f	128s	192s	256s
No of invocations of function `F()`	5908	8620	8633	1886	2751	4067
% time spent in `F()`	94.8%	95.7%	95.2%	88.1%	89.3%	90.9%
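The gap between the f and s variants is easier to see as a ratio. The short calculation below uses only the invocation counts from the table; the dictionary and function names are illustrative.

```python
# Average number of F() invocations during verification, taken from the table.
f_calls = {"128": 5908, "192": 8620, "256": 8633}
s_calls = {"128": 1886, "192": 2751, "256": 4067}


def call_ratios(fast, small):
    """Return how many times more F() calls the f variant makes than the s variant."""
    return {level: fast[level] / small[level] for level in fast}


if __name__ == "__main__":
    for level, ratio in call_ratios(f_calls, s_calls).items():
        print(f"{level}-bit: f makes {ratio:.2f}x as many F() calls as s")
```

At every security level the f variant needs more than twice as many F() calls as the corresponding s variant, which matches the faster verification of the s parameter sets.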

Key agreement methods in FIPS

Kris Kwiatkowski — Thu, 01 Jun 2023 00:00:00 GMT

FIPS has multiple ways of claiming CAVP-tested compliance of key agreement schemes. Each of them corresponds to a different use case; for example, the key agreement may or may not include key derivation. Additionally, FIPS also supports key confirmation (see SP800-56A r3, Section 5.9), which can be applied to some key agreements. It is easy to get lost when reading the FIPS IG, so below is a short summary of the differences:

KAS-SSC: Compliance with the agreement on shared secret Z (only). The key agreement scheme is one of those described in SP800-56A r3, Section 6. No key derivation is done after Z is agreed upon.
KAS: Compliance with NIST-approved key agreement AND derivation. Testing is done End-to-End, meaning both operations are done by a single security service and the calling sequence is within the module boundary.
KDA: It relates only to the key derivation part, so testing is NOT done End-to-End. This certificate is given when derivation uses one of the KDFs described in SP800-56C rev1 or rev2.
CVL: It relates only to the key derivation part, so testing is NOT done End-to-End. This certificate is given when derivation uses one of the KDFs described by IG 2.4.B.

Note that SP800-56C rev2 is also mentioned by IG 2.4.B. My understanding is that, for example, in the case of TLS v1.3, we do need SP800-56C rev2, but not necessarily a KDA certificate. For KDA compliance, software needs to be tested separately.

Example PQ-TLS v1.3: Two goals. 1) to implement the TLS key schedule as per 7.1 of RFC 8446, 2) to allow hybrid, quantum-safe key agreement.

We need a scheme that will be used for generating shared secret Z, so we need KAS-SSC. KAS is not useful, as the TLS key schedule is a single-extract-multi-expand derivation (800-56C r2, section 5.3). TLS uses key derivation with HKDF (two-step), so we also need KDA or CVL. Only IG 2.4.B mentions TLS, so we need CVL. Hybrid-PQ TLS is not standardized, so CVL won’t apply here (I think); on the other hand, SP800-56C rev2 allows using an auxiliary KAS in addition to the approved one, hence we also need KDA. Therefore, in this case, we need KAS-SSC, KDA and CVL certificates.
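The "single-extract-multi-expand" shape mentioned above can be illustrated with plain HKDF from RFC 5869. This is a sketch only: TLS 1.3 wraps the expand step in its own HKDF-Expand-Label encoding, and the labels used in the usage note are hypothetical.

```python
import hashlib
import hmac


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """Step 1: condense the shared secret Z into a pseudorandom key."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """Step 2: stretch the pseudorandom key into output keying material."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        # T(n) = HMAC(PRK, T(n-1) || info || n), as defined in RFC 5869
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]
```

A caller would run `hkdf_extract` once on Z and then call `hkdf_expand` several times with different `info` values (for example `b"client key"` and `b"server key"`) to obtain independent keys from the same extracted secret.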

Abbreviation	Meaning
SSC	Shared Secret Computation
KDA	Key Derivation Algorithm
CVL	Component Validation List
KAS	Key Agreement Scheme

Gentle introduction to NTRU cryptosystem (part 1)

Sun, 17 Oct 2021 00:00:00 GMT

The NTRU cryptosystem is the grandfather of lattice-based encryption schemes. The initial idea was due to Ajtai. His work evolved into a whole area of research with the goal of creating more practical, lattice-based cryptosystems, like the first NTRU-based encryption system and signature scheme due to Hoffstein, Pipher, Silverman, Howgrave-Graham and Whyte.

The cryptosystem is based on polynomial rings. More precisely, the base is a problem of recovering a sparse polynomial that is a factor of a polynomial modulo $X^n - 1$ in the polynomial ring of some finite field $F_q$.

The article below tries to explain, in easy to understand terms, the basics of NTRU, starting from a brief explanation of what the lattice is. Future articles will introduce a more detailed view of a modern approach to building NTRU-based cryptosystems.

Rings

NTRU operates in a ring of polynomials of degree less than $N$. The degree of a polynomial is the highest exponent of its variable. For example, $x^7+6x^3+11x^2$ has degree 7. One can add polynomials in the ring in the usual way, by simply adding their coefficients modulo some integer. In NTRU this integer is called $q$. Polynomials can also be multiplied, and the result of a multiplication is always a polynomial of degree less than $N$. This means that exponents in the resulting polynomial are added modulo $N$. For example:
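As a small illustration (the value $N = 3$ and the two polynomials are chosen here for convenience), multiplying in a ring where $x^3$ wraps around to $1$ gives:

$$ (x^2 + 1)(x^2 + x) = x^4 + x^3 + x^2 + x = x + 1 + x^2 + x = x^2 + 2x + 1 $$

The terms $x^4$ and $x^3$ were reduced to $x$ and $1$ respectively, because exponents are taken modulo $N = 3$. Coefficients would additionally be reduced modulo $q$, which changes nothing here as long as $q > 2$.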

In other words, polynomial ring arithmetic is very similar to modular arithmetic, but instead of working in a “set of numbers” less than $N$, one works in a set of polynomials with a degree less than $N$.
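The ring multiplication can also be written as a few lines of Python. This is a schoolbook sketch with polynomials stored as coefficient lists, lowest degree first; the function name `ring_mul` is an assumption, and real implementations use much faster algorithms.

```python
def ring_mul(a, b, N, q):
    """Multiply two polynomials modulo x^N - 1, with coefficients modulo q."""
    result = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            # Exponents add modulo N: the term x^(i+j) wraps around to x^((i+j) mod N).
            k = (i + j) % N
            result[k] = (result[k] + ai * bj) % q
    return result
```

For instance, `ring_mul([1, 0, 1], [0, 1, 1], 3, 2048)` multiplies $x^2 + 1$ by $x^2 + x$ with $N = 3$ and returns `[1, 2, 1]`, that is $x^2 + 2x + 1$.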

NTRU scheme: basic idea

To instantiate the NTRU cryptosystem, the following domain parameters must be chosen:

$N$ - degree of the polynomial ring, in NTRU the principal objects are polynomials of degree $N−1$.
$p$ - small modulus, used during key generation and decryption for reducing message coefficients.
$q$ - large modulus, used during algorithm execution for reducing coefficients of the polynomials.

First, we generate a pair of public and private keys. To do that, two polynomials $f$ and $g$ are chosen from the ring in a way that their randomly generated coefficients are much smaller than $q$. Then key generation computes two inverses of the polynomial:
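Assuming the classic NTRU formulation, and using the names $f_q$ and $f_p$ that appear in this section, the two inverses can be written as:

$$ f_q = f^{-1} \bmod q, \qquad f_p = f^{-1} \bmod p $$

In other words, $f \cdot f_q = 1$ in the ring with coefficients reduced modulo $q$, and $f \cdot f_p = 1$ in the ring with coefficients reduced modulo $p$.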

The values $f$ and $f_p$ make up the private key. The public key $pk$ is computed, as follows: The $f_q$ is not part of any key, however it must remain secret.
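Assuming the classic formulation, in which the small modulus $p$ is folded into the public key, the public key can be written as:

$$ pk = p \cdot f_q \cdot g \bmod q $$

Other NTRU variants place the factor $p$ elsewhere (for example in the encryption step), so this is one possible convention rather than the only one.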

It might be the case that after choosing $f$, the inverses modulo $p$ and $q$ do not exist. In this case, the algorithm has to start from the beginning and generate another $f$. That’s unfortunate because calculating the inverse of a polynomial is a costly operation. Recent instantiations of some NTRU schemes (like NTRU-HRSS) are designed in a way that ensures those inverses always exist, which makes key generation faster and more reliable.

The encryption of a message $m$ proceeds as follows. First, the message $m$ is converted to a ring element $pt$ (there exists an algorithm for performing this conversion in both directions). During encryption, NTRU randomly chooses one polynomial $b$ called $blinder$. The goal of the blinder is to generate different ciphertexts per encryption. Thus, the ciphertext $ct$ is obtained as:
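Assuming the classic formulation, in which the factor $p$ is already part of $pk$ (that is, $pk = p \cdot f_q \cdot g \bmod q$), the ciphertext can be written as:

$$ ct = b \cdot pk + pt \bmod q $$

Because $b$ is fresh for every encryption, the same $pt$ maps to a different $ct$ each time.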

Decryption looks a bit more complicated, but it can also be easily understood. It uses both secret values $f$ and $f_p$. The plaintext is recovered as:
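In the classic formulation (an assumption here, with $ct = b \cdot pk + pt \bmod q$ and $pk = p \cdot f_q \cdot g \bmod q$), decryption takes two steps:

$$ a = f \cdot ct \bmod q, \qquad pt = f_p \cdot a \bmod p $$

The coefficients of $a$ are taken in the centered range around zero (roughly $-q/2$ to $q/2$) before the reduction modulo $p$.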

Taking all that was described above, evaluation done during decryption is something like:
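Under the same assumptions as the classic formulation ($pk = p \cdot f_q \cdot g \bmod q$ and $ct = b \cdot pk + pt \bmod q$), the evaluation expands as:

$$ f \cdot ct = f \cdot (b \cdot p \cdot f_q \cdot g + pt) = p \cdot b \cdot g + f \cdot pt \pmod q $$

$$ f_p \cdot (p \cdot b \cdot g + f \cdot pt) = f_p \cdot f \cdot pt = pt \pmod p $$

The first line uses $f \cdot f_q = 1 \bmod q$. The second line relies on the coefficients of $p \cdot b \cdot g + f \cdot pt$ being small enough that reduction modulo $q$ does not alter them, so the multiple of $p$ vanishes modulo $p$ and $f_p \cdot f = 1 \bmod p$ leaves $pt$.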

After obtaining $pt$, the message $m$ is recovered by inverting the conversion function.

The underlying hardness assumption is that, given two polynomials $f$ and $g$ whose coefficients are small compared to the modulus $q$, it is difficult to distinguish $pk = g \cdot f^{-1} \bmod q$ (up to the public constant $p$) from a random element in the ring. It means that it’s hard to find $f$ and $g$ given only the public key $pk$.

Concrete schemes

The original scheme has a long history (over 20 years). Since then it has been changed multiple times in response to cryptanalytic advances. Several variants of concrete KEM and signature schemes based on NTRU were proposed during NIST PQC standardization. In the case of KEMs, two candidates, NTRUEncrypt and NTRU-HRSS-KEM, were merged and ended up as a scheme called … well, “NTRU”. The scheme is fairly easy to implement in constant time and performs well, allowing it to be used in production environments. The NTRU-based signature scheme Falcon also reached the last round of the standardization. It’s characterized by very fast execution and relatively small key sizes. Nevertheless, a performance-efficient, constant-time implementation can be quite complicated. It also seems that the patent situation for NTRU schemes is much clearer than for other candidates.

Constant-time code verification with Memory Sanitizer

Fri, 09 Jul 2021 00:00:00 GMT

In the context of cryptography, side-channel attacks exploit the implementation of a computer system to gain information about the secret key. The first such attacks were introduced by Paul Kocher in his paper “Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems”. Since then, attacks have been much improved to include different sources of side-channel information like power consumption, and electromagnetic and acoustic emissions. They have also been shown multiple times to be practical.

Timing attacks are a subset of side-channels in which information about the execution time of a cryptographic primitive is exploited to break the system. One of the most appealing advantages over other side-channel attacks is that those attacks can be applied remotely, through the network. For an attack to be useful, the execution of the cryptographic primitive must depend on secret material. In brief, a constant-time implementation is crafted in a way that execution doesn’t depend on secret data. An article by Thomas Pornin, author of BearSSL, explains in detail what this means.
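The difference between secret-dependent and secret-independent control flow can be sketched with a byte-string comparison. Python itself gives no constant-time guarantees, so this only illustrates the structure of the code; both function names are hypothetical, and in real Python code `hmac.compare_digest` is the standard-library function intended for this purpose.

```python
def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: running time reveals the position of the first mismatch."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # the branch taken depends on secret data
    return True


def ct_style_equal(a: bytes, b: bytes) -> bool:
    """Accumulating comparison: every byte is processed regardless of the data."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y  # no data-dependent branch inside the loop
    return diff == 0
```

Both functions return the same results; they differ only in whether the number of loop iterations depends on the contents of the inputs.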

In many cases, checking if the implementation is constant-time is not trivial and tools are needed. Adam Langley developed a tool called ctgrind. The idea is to extend and use a tool that detects Use of Uninitialized Memory (UUM). That tool is called Memcheck, and it runs in the Valgrind framework.

Memory Sanitizer (MSan) is another UUM detector. It is a part of the LLVM project, which integrates with the clang compiler. This blog post briefly describes how it works and how to use it for checking the constant-time implementation in “a toy” code. Finally, after introducing a useful helper, it shows a result of integrating it with existing cryptographic implementations and compares it to Memcheck.

How does the UUM detector work?

Techniques used for detecting Use of Uninitialized Memory, implemented in tools like Valgrind’s Memcheck or LLVM’s Memory Sanitizer, have been developed over the years and are currently quite advanced. At a high level, both Memcheck and MSan are similar.

The main difference between Memcheck and Memory Sanitizer is that in the case of Valgrind, binary instrumentation is done at startup. Memcheck uses the Valgrind framework for instrumenting an already compiled binary. In contrast, Memory Sanitizer instruments code at the compilation stage and leverages mechanisms implemented in LLVM, like its intermediate representation. Memcheck combines the detection of UUM and memory addressability bugs in a single tool. In the case of LLVM, these are implemented in two different tools - Memory Sanitizer (MSan) for detecting UUMs and a sibling tool called AddressSanitizer (ASan) for addressability bugs. There are downsides and upsides to both approaches. The approach taken by MSan allows execution to be several times faster, with almost no startup penalty.

Except for those differences, at a high level, the internals of both Memcheck and MSan are similar. A UUM detector tracks the state of every bit of memory or register used by the program. It uses a concept called shadow memory (Valgrind calls the shadow bits V bits), which stores information on whether each bit of memory was properly initialized.

The detector allows using uninitialized memory if it is “safe” to do so. For example, copying it from one place to another is not a problem. It reports a problem only when the execution of a program depends on an uninitialized state, for example when branching, dereferencing a pointer or using uninitialized memory for array indexing. That’s exactly what we need to test constant-time functions. Uninitialized memory can be propagated to other variables in the code (e.g. by copying). To track the propagation, UUM detectors implement propagation of the shadow memory - when an uninitialized value is used as an operand of a “safe” operation, its state is propagated to the result of that operation. The new shadow value is computed based on the values of the operands and their shadow values.

Let’s summarize those concepts by analysing a concrete example. The function on the left side adds 1 to an argument n and returns the result. On the right side, we see the state of the shadow memory and the current values of variables. Uninitialized memory is marked red and initialized memory is blue. In this example, the function is called with argument n=6. The 12 least significant bits of the argument n are uninitialized.

UUM detection mechanism

The function creates a temporary variable b on the stack and assigns 1 to it. The variable is properly initialized, hence its shadow memory is marked blue. The second variable r is initially uninitialized. On addition, the UUM detector looks at the shadow bits of both operands and calculates a new shadow value for variable r. Adding uninitialized to initialized results in uninitialized. So, the shadow memory for r is the same as for the argument n (the 4 most significant bits are initialized, the rest are not). Since the operation doesn’t change program flow, it doesn’t trigger UUM.

It should also be noted that the resulting shadow memory depends on the operation being done as well as on the state of the shadow memory of both operands.
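The example above can be mimicked with a tiny toy model. This is only a sketch under simplifying assumptions: the type and function names are invented here, values are 16 bits wide, and the shadow of a sum is taken to be the bitwise OR of the operands’ shadows. Real detectors use more elaborate rules.

```c
#include <stdint.h>

/* Toy model: a 16-bit value together with its shadow bits.
 * A shadow bit set to 1 means "this bit of val is uninitialized". */
typedef struct {
    uint16_t val;
    uint16_t shadow;
} tracked16_t;

/* Addition with a simple propagation rule: a result bit is considered
 * uninitialized if the same bit is uninitialized in either operand. */
tracked16_t tracked_add(tracked16_t a, tracked16_t b) {
    tracked16_t r;
    r.val = (uint16_t)(a.val + b.val);
    r.shadow = (uint16_t)(a.shadow | b.shadow);
    return r;
}
```

With n = {6, 0x0FFF} (12 least significant bits uninitialized) and b = {1, 0x0000}, the result carries the value 7 and the shadow 0x0FFF, i.e. the same shadow as n.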

Origin tracking helps to understand potential problems. It assigns an ID to each variable, which identifies what created the uninitialized bits in shadow memory. Once UUM is triggered, the detector uses those IDs to backtrack the origin of uninitialized values and print them in the final report. MSan implements more advanced origin tracking: its report shows lines of code where uninitialized memory was created as well as places at which it was propagated before UUM was triggered (see the result of running the “toy example” below). In the case of MemorySanitizer, it is enabled by providing the -fsanitize-memory-track-origins flag to the compiler; in the case of Valgrind, it is the --track-origins=yes option.

Toy example

A constant-time table lookup is an important tool in secure cryptographic implementations. For instance, it is used for the implementation of elliptic curve-based schemes, like ECDH - widely deployed on the Internet and used by HTTPS connection key exchange schemes. In that system, the (rational) point on the elliptic curve is multiplied by a scalar. A scalar is used as a secret key and hence it must be protected against leaks. The optimized version of such multiplication may use the so-called window technique. In this case, to speed up computation, an algorithm starts with pre-computing small multiples of a point P (like 2P, 3P, ..., (2^w - 1)P, for a fixed window size w) and stores them in some table. Then, scalar-by-point multiplication consists of slicing the binary representation of a scalar into equal w-bit long pieces, iterating over those pieces and using them for point multiplication (see here for a detailed description of the technique). Most importantly, at each iteration, the algorithm gets the small multiple of the point for the current piece from the table. That is a variable-time operation - as the value may be loaded from different locations (CPU register, cache or RAM). An attacker can then try to guess the secret scalar by exploiting those time differences. Hence, lookups done by secure implementations need to be implemented in a constant-time manner.
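The window technique can be sketched in a few lines if the elliptic curve is replaced by a toy group. The sketch below assumes integers modulo a small prime with addition as the group operation, so “multiplying a point by a scalar” is just modular multiplication. All names and constants are invented for this illustration. The table access is deliberately left as a plain, secret-indexed lookup, which is exactly the part that needs protection.

```c
#include <stdint.h>

#define WINDOW 4                      /* window size w, in bits */
#define TABLE_SIZE (1u << WINDOW)     /* 2^w precomputed multiples */
#define MODULUS 1000003u              /* toy group: integers modulo a prime */

/* "Point addition" in the toy group. */
static uint32_t toy_add(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a + b) % MODULUS);
}

/* Fixed-window scalar multiplication: returns scalar * point in the toy group. */
uint32_t toy_scalar_mul(uint32_t scalar, uint32_t point) {
    uint32_t table[TABLE_SIZE];
    uint32_t acc = 0;

    /* Precompute 0*P, 1*P, ..., (2^w - 1)*P */
    point %= MODULUS;
    table[0] = 0;
    for (unsigned i = 1; i < TABLE_SIZE; i++) {
        table[i] = toy_add(table[i - 1], point);
    }

    /* Process the scalar in w-bit pieces, most significant piece first */
    for (int shift = 32 - WINDOW; shift >= 0; shift -= WINDOW) {
        for (unsigned d = 0; d < WINDOW; d++) {
            acc = toy_add(acc, acc);            /* w doublings */
        }
        unsigned piece = (scalar >> shift) & (TABLE_SIZE - 1);
        acc = toy_add(acc, table[piece]);       /* secret-dependent lookup */
    }
    return acc;
}
```

For example, toy_scalar_mul(47, 5) evaluates to 235. In a real implementation the line marked as a secret-dependent lookup is where a constant-time table lookup would be used.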

The toy example shows how to use LLVM’s Memory Sanitizer to detect whether a table lookup is constant-time. The program starts with initializing the table pow2 with powers of two in the range [2^0, 2^63]. Then it reads a number from the command line into the variable secret, uses it to get 2^secret from the table and prints the result.

In the first step, we want to ensure that MSan triggers UUM whenever program execution depends on the value of the secret variable. Fortunately, LLVM’s MemorySanitizer API offers such a possibility: the __msan_allocated_memory function marks an address range as containing undefined data, which is exactly what we need. All the MemorySanitizer API functions are declared in the sanitizer/msan_interface.h header, which needs to be included in the code. Let’s look at the example below:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#include <sanitizer/msan_interface.h>

#define POW2_NUM 64
static uint64_t pow2[POW2_NUM];

static inline uint64_t select_n(uint64_t n) {
    return pow2[n];
}

int main(int argc, const char* argv[]) {
    uint64_t ret, secret;
    // Initialize a table with powers of 2
    for (size_t i=0; i<64; i++) {
        pow2[i] = 1ULL << i;
    }

    secret = atoi(argv[1]);

    // Denote "secret" variable as uninitialized
    __msan_allocated_memory(&secret,sizeof(secret));
    // Time dependent operation possible load from cache or memory
    ret = select_n(secret);

    // Denote memory as defined to eliminate false positive, due
    // to non constant-time implementation of printf
    __msan_unpoison(&secret, sizeof(secret));
    // Denote also 'ret' in case shadow bits were propagated
    __msan_unpoison(&ret, sizeof(ret));
    printf("2^%lu = %lu\n", secret, ret);
}

In that code, the select_n function performs a memory lookup in the table pow2. Just before the function is called, the 64 bits (8 bytes) of memory storing secret are denoted as uninitialized. That’s done by the __msan_allocated_memory function. Then the program calls the select_n function and at this point, UUM should be triggered. To avoid reporting a false positive error (caused by printf), secret and ret are marked as defined just after select_n returns. The result of the run is as expected:

> clang -g -fsanitize=memory -fsanitize-memory-track-origins=2 -fno-omit-frame-pointer test.c
> ./a.out 47
==1307703==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x55f795fe2a1c in select_n /home/kris/test.c:11:12
    #1 0x55f795fe269f in main /home/kris/test.c:26:11
    #2 0x7fc798ac8b24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
    #3 0x55f795f6114d in _start (/home/kris/a.out+0x2014d)

  Uninitialized value was stored to memory at
    #0 0x55f795fe299e in select_n /home/kris/test.c:10
    #1 0x55f795fe269f in main /home/kris/test.c:26:11
    #2 0x7fc798ac8b24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

  Memory was marked as uninitialized
    #0 0x55f795fbd1bb in __msan_allocated_memory (/home/kris/a.out+0x7c1bb)
    #1 0x55f795fe2647 in main /home/kris/test.c:24:5
    #2 0x7fc798ac8b24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

SUMMARY: MemorySanitizer: use-of-uninitialized-value /home/kris/test.c:11:12 in select_n
Exiting

Execution correctly triggers UUM at line 11, which is precisely where the table lookup is done. The sanitizer also reports some information about the origin of the problem. Generation of that information is enabled by the -fsanitize-memory-track-origins=2 flag and proves to be quite useful when designing functions with constant-time execution.

Detection works, that’s excellent. Let’s now try it on code that is constant-time and check that UUM is not triggered. The following implementation is functionally equivalent to select_n, but now the table lookup is done in constant time. Namely, it always goes through all the elements of the table. The function calculates a mask variable, which has all bits set only when element n is processed. Then, thanks to the bitwise &, the value is copied to the variable ret.

static inline uint64_t const_select_n(uint64_t n) {
    uint64_t mask, sign, i, ret = 0;
    sign = 1ULL << (63 - n);
    // Always iterate over all elements
    for (i=0; i<POW2_NUM; i++, sign<<=1) {
        // Arithmetical shift right propagates MSB if
        // set. Thanks to 'sign' set above, this is
        // done only once during whole iteration.
        mask = ((int64_t)sign) >> 63;
        // With correctly set mask only one value
        // is assigned to 'ret' variable
        ret |= pow2[i] & mask;
    }
    return ret;
}

We can now swap the call to select_n with a call to const_select_n. Such an implementation doesn’t trigger UUM anymore, as execution doesn’t depend on uninitialized data - the program always reads the whole table. On the flip side, the implementation of const_select_n is much more complicated to analyse, so tools are needed.

> clang -g -fsanitize=memory -fsanitize-memory-track-origins=2 -fno-omit-frame-pointer test.c
> ./a.out 47
2^47 = 140737488355328
>
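For comparison, another common way to write such a lookup derives the mask from an equality test on the index instead of a shifted sign bit. This is a sketch of that alternative, not the code used in this post, and the function names are invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns all-ones when a == b and zero otherwise, without branching. */
static uint64_t ct_eq_mask(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;
    /* (x | -x) has its top bit set exactly when x != 0 */
    return ((x | (0 - x)) >> 63) - 1;
}

/* Reads every element of the table and keeps only the one at 'index'. */
uint64_t ct_table_lookup(const uint64_t *table, size_t len, uint64_t index) {
    uint64_t ret = 0;
    for (size_t i = 0; i < len; i++) {
        ret |= table[i] & ct_eq_mask((uint64_t)i, index);
    }
    return ret;
}
```

An out-of-range index simply yields 0, since no mask is ever set. As with const_select_n, a compiler is free to transform such code, so it is still worth checking the result with a tool.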

Utility called `ct_check.h`

Both Memcheck and Memory Sanitizer provide a programmatic API that can be used to design constant-time code. The ct_check header provides a unified API for using both of those tools. A flag is used at compile time to control which tool to use. At the development stage, I use both tools - MSan is faster and gives more information in the final report, while checks by Memcheck are more granular. Such a wrapper allows writing code only once and hence it is quite useful.

The ct_check.h exposes following functions:

API	Description
`ct_poison`	Marks bytes as uninitialized. Switches on constant-time checks for certain memory regions. It is a wrapper around `__msan_allocated_memory` and `VALGRIND_MAKE_MEM_UNDEFINED`
`ct_purify`	Marks bytes as initialized. Switches off constant-time checks (operation opposite to `ct_poison`)
`ct_print_shadow`	Prints the state of shadow bits for an uninitialized memory region.
`ct_expect_uum`	Instructs the compiler that it expects UUM after a call to this function. It works only with LLVM, useful for testing.
`ct_require_uum`	Ensures that UUM was triggered before reaching this function. It works only with LLVM, useful for testing. Usually used in blocks `ct_expect_uum(); do_non_ct_stuff(); ct_require_uum();`
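To give an idea of how such a wrapper can be put together, here is a minimal sketch of the two basic functions. It is not the actual content of ct_check.h. It assumes the PQC_USE_CTSANITIZER and PQC_USE_CTGRIND macros (the same names as the compile flags used later in this post) select the backend, and it falls back to no-ops when neither is defined.

```c
#include <stddef.h>

#if defined(PQC_USE_CTSANITIZER)
#include <sanitizer/msan_interface.h>
/* MSan backend: mark bytes as uninitialized / initialized */
static inline void ct_poison(const void *p, size_t n) {
    __msan_allocated_memory(p, n);
}
static inline void ct_purify(const void *p, size_t n) {
    __msan_unpoison(p, n);
}
#elif defined(PQC_USE_CTGRIND)
#include <valgrind/memcheck.h>
/* Memcheck backend: client requests, no-ops when not run under Valgrind */
static inline void ct_poison(const void *p, size_t n) {
    (void)VALGRIND_MAKE_MEM_UNDEFINED(p, n);
}
static inline void ct_purify(const void *p, size_t n) {
    (void)VALGRIND_MAKE_MEM_DEFINED(p, n);
}
#else
/* No tool selected: the checks compile away */
static inline void ct_poison(const void *p, size_t n) { (void)p; (void)n; }
static inline void ct_purify(const void *p, size_t n) { (void)p; (void)n; }
#endif
```

Neither function modifies the data itself, only the shadow state kept by the tool, so production builds can keep the calls in place at no cost.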

With that set of functions, I’ve used the tests implemented by A. Langley to check that MSan and ctgrind produce the same results. The implementation of those tests with ct_check.h is here, and indeed, the results are the same.

Applying `ct_check.h` to the existing implementation

Instead of toy code, let’s now take an existing, modern cryptographic implementation which was vulnerable to timing attacks and see if Memory Sanitizer can detect the problem in the vulnerable code. Quantum-safe cryptographic implementations are currently my main focus, so I’ll apply it to one of the Key Encapsulation Mechanisms (KEMs) submitted to NIST for post-quantum standardization. All the code presented below comes from the PQC library available on GitHub (branch called blog/frodo_constant_time_issue).

A KEM is defined by 3 algorithms: a key generation algorithm returning a pair of public and private keys; an encapsulation algorithm, which uses the public key to return a shared secret in plain form and in encrypted form as a ciphertext; and finally a decapsulation algorithm, which takes the ciphertext and the secret key as input and returns the shared secret, which can then be used for symmetric encryption (e.g. in TLS). To avoid leaking the secret key, the decapsulation function must ensure that the operations done on the private key are constant-time. A problem of this kind has been reported in FrodoKEM and exploited in a recent paper. In that work, the authors propose (section 3) a generic side-channel technique that can be applied to recover the secret key of an (LWE-based) KEM. Then (in section 4) they describe how to use that technique to recover the FrodoKEM key. I highly recommend the paper (or video) to anybody interested in secure cryptographic implementations.

The following variable-time implementation allowed the attack to succeed.

// https://github.com/kriskwiatkowski/pqc/blob/e57a8915834e08998f1a93f3d111cfaf3fcd94a7/src/kem/frodo/frodokem640shake/clean/kem.c#L229
int PQCLEAN_FRODOKEM640SHAKE_CLEAN_crypto_kem_dec(uint8_t *ss, const uint8_t *ct, const uint8_t *sk) {
    ...
    if (memcmp(Bp, BBp, 2*PARAMS_N*PARAMS_NBAR) == 0 &&
        memcmp(C, CC, 2*PARAMS_NBAR*PARAMS_NBAR) == 0) {
        // Load k' to do ss = F(ct || k')
        memcpy(Fin_k, kprime, CRYPTO_BYTES);
    } else {
        // Load s to do ss = F(ct || s)
        // This branch is executed when a malicious ciphertext is decapsulated
        // and is necessary for security. Note that the known answer tests
        // will not exercise this line of code but it should not be removed.
        memcpy(Fin_k, sk_s, CRYPTO_BYTES);
    }

The ciphertext is a concatenation of two parts, ciphertext = Bp || C. The decapsulation function implemented by FrodoKEM uses the secret key to decrypt the ciphertext, encrypts the result again and compares it with the ciphertext received (see Fujisaki-Okamoto transform). That is what is being done in the code above. The values BBp and CC represent the ciphertext that was recomputed during decapsulation. Those values are compared with the received ciphertext. If the comparison succeeds, the shared secret (derived from kprime) is returned; otherwise the function returns a pseudorandom value (derived from sk_s).

There are two problems related to variable time execution:

the comparison uses memcmp: this function is not constant-time - it returns as soon as it detects the first difference
memcmp is used in a short-circuit evaluation: in case the first memcmp returns a value different than 0, the second memcmp is not called. Hence that’s also not constant-time behaviour.
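A common way to avoid both problems is to compare without an early exit, combine the results with a bitwise OR, and select the output with a mask instead of an if/else. The sketch below illustrates that pattern only. It is not the fix applied in FrodoKEM, and all function and parameter names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0x00 if the buffers are equal and 0xFF otherwise.
 * Always reads all n bytes, so there is no early exit. */
static uint8_t ct_differ_mask(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc |= (uint32_t)(a[i] ^ b[i]);
    }
    /* (acc | -acc) has its top bit set exactly when acc != 0 */
    return (uint8_t)(0u - ((acc | (0u - acc)) >> 31));
}

/* Hypothetical helper: writes 'on_match' to 'out' when both ciphertext parts
 * equal their recomputed values, and 'on_mismatch' otherwise. */
void ct_select_key(uint8_t *out,
                   const uint8_t *on_match, const uint8_t *on_mismatch, size_t key_len,
                   const uint8_t *bp, const uint8_t *bbp, size_t bp_len,
                   const uint8_t *c, const uint8_t *cc, size_t c_len) {
    /* Bitwise OR instead of &&: both comparisons always run */
    uint8_t differ = (uint8_t)(ct_differ_mask(bp, bbp, bp_len) |
                               ct_differ_mask(c, cc, c_len));
    for (size_t i = 0; i < key_len; i++) {
        out[i] = (uint8_t)((on_match[i] & (uint8_t)~differ) |
                           (on_mismatch[i] & differ));
    }
}
```

Both inputs are always read and the same instructions run regardless of the comparison result. A compiler may still introduce branches when optimizing, which is one more reason to verify the final binary with the tools described here.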

The first issue is already enough to recover the private key. Let’s see if Memory Sanitizer will help to design a constant-time implementation. I’m using the PQC library, which integrates both variable-time and constant-time decapsulation in FrodoKEM/640. Let’s start with a unit test:

// Uses GTEST and C++
TEST(Frodo, CtDecaps) {

    // Get descriptor of an algorithm
    const pqc_ctx_t *p = pqc_kem_alg_by_id(PQC_ALG_KEM_FRODOKEM640SHAKE);

    // Initialize buffers for KEM output
    std::vector<uint8_t> sk(pqc_private_key_bsz(p));
    std::vector<uint8_t> pk(pqc_public_key_bsz(p));
    std::vector<uint8_t> ct(pqc_ciphertext_bsz(p));
    std::vector<uint8_t> ss(pqc_shared_secret_bsz(p));
    bool res;

    // Generate key pair and perform encapsulation
    ASSERT_TRUE(pqc_keygen(p, pk.data(), sk.data()));
    ASSERT_TRUE(pqc_kem_encapsulate(p, ct.data(), ss.data(), pk.data()));

    // Mark secret material as uninitialized, so that variable time implementation causes UUM.
    // First 16 bytes is a shared secret, then next 9616 is just a public key, and then next
    // 10240 is another part of secret material (a secret matrix S used by FrodoKEM). Both
    // shared secret and matrix S must not leak, but it is OK to do variable-time operations
    // on public key.
    ct_poison(sk.data(), 16);
    ct_poison((unsigned char*)sk.data()+16+9616, 2*640*8);

    // Decapsulate
    res = pqc_kem_decapsulate(p, ss.data(), ct.data(), sk.data());

    // Purify res to allow non-ct check by ASSERT_TRUE
    ct_purify(&res, 1);
    ASSERT_TRUE(res);
}

The test is compiled with flags enabling Memory Sanitizer and origin tracking. When run, it correctly triggers UUM as expected.


./ut --gtest_filter="Frodo.CtDecaps"
Running main() from /home/kris/repos/pqc/3rd/gtest/googletest/src/gtest_main.cc
Note: Google Test filter = Frodo.CtDecaps
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Frodo
[ RUN      ] Frodo.CtDecaps
Uninitialized bytes in MemcmpInterceptorCommon at offset 0 inside [0x7ffc94382040, 10240)
==3099896==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x559140afe8cd in memcmp (build.msan.debug/ut+0xd08cd)
    #1 0x5591410d5020 in PQCLEAN_FRODOKEM640SHAKE_CLEAN_crypto_kem_dec kem.c:233:9
    #2 0x559140e75c5b in pqc_kem_decapsulate pqapi.c:112:13
    #3 0x559140b30983 in Frodo_CtDecaps_Test::TestBody() test/ut.cpp:148:11
    #4 0x559140de55a4 in void testing::internal::HandleSehExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) (build.msan.debug/ut+0x3b75a4)
    ...
  Uninitialized value was stored to memory at
    #0 0x5591410dec98 in PQCLEAN_FRODOKEM640SHAKE_CLEAN_key_decode /home/kris/repos/pqc/src/kem/frodo/frodokem640shake/clean/util.c:123:18
    #1 0x5591410d44e0 in PQCLEAN_FRODOKEM640SHAKE_CLEAN_crypto_kem_dec /home/kris/repos/pqc/src/kem/frodo/frodokem640shake/clean/kem.c:184:5
    ...

Runs as expected. In this case, Memory Sanitizer proves itself to be useful for the detection of code that’s not constant-time. It would have found the bug in FrodoKEM, had it been used.

In this case, ct_check.h has detected that uninitialized memory is used at line 229 (call to memcmp). It also gives a lot of additional output, due to origin tracking being enabled (I have removed most of it). Now, to make the code constant-time, we must replace memcmp with an implementation that compares bytes in constant time. An implementation of such a function looks like this:

// Compares in constant time two byte arrays of size 'n'.
// Returns 0 if the arrays are equal and 1 otherwise.
uint8_t ct_memcmp(const void *a, const void *b, size_t n) {
    const uint8_t *pa = (const uint8_t *) a, *pb = (const uint8_t *) b;
    uint8_t r = 0;
    // XOR bytes in 'a' with corresponding bytes in 'b'. If
    // all bytes are equal, 'r' will be == 0.
    while (n--) { r |= *pa++ ^ *pb++; }
    // Set most significant bit to 1 only if r!=0, otherwise
    // r stays == 0
    r   = (r >> 1) - r;
    r >>= 7;
    // return the most significant bit - 0 means a==b
    return r;
}

After swapping memcmp with ct_memcmp and running with Memory Sanitizer, UUM is not triggered anymore. And in this case, that’s BAD. The first problem is fixed, but the second problem is not - the code is still not constant-time. We can verify that by running the same code in Valgrind (thanks to ct_check.h).

> valgrind --tool=memcheck ./ut --gtest_filter="Frodo.CtDecaps"
==3096880== Memcheck, a memory error detector
==3096880== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3096880== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==3096880== Command: ./ut --gtest_filter=*CtDecaps*
==3096880==
Running main() from /home/kris/repos/pqc/3rd/gtest/googletest/src/gtest_main.cc
Note: Google Test filter = *CtDecaps*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Frodo
[ RUN      ] Frodo.CtDecaps
==3096880== Conditional jump or move depends on uninitialised value(s)
==3096880==    at 0x1B721E: PQCLEAN_FRODOKEM640SHAKE_CLEAN_crypto_kem_dec (kem.c:244)
==3096880==    by 0x179C7D: pqc_kem_decapsulate (pqapi.c:112)
==3096880==    by 0x11A4AD: Frodo_CtDecaps_Test::TestBody() (ut.cpp:148)
...

Ok, Memcheck correctly reports an error. Interesting, but why?

Shadow memory propagation: MSan vs Valgrind

It took me some time to understand why it doesn’t work. It turns out that rules for shadow memory propagation are different in Memcheck and LLVM’s Memory Sanitizer. After analysing FrodoKEM, I’ve found out that the root cause boils down to the following example:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include "common/ct_check.h"

int main(int argc, const char* argv[]) {
    // Store 1 bit of first argument provided at command line
    uint16_t sign = ((uint16_t)atoi(argv[1])) & 1;
    uint16_t s;
    // Use ct_poison and bitwise AND to mark least significant bit
    // of 2-byte value as uninitialized.
    ct_poison(&sign, 2);
    sign = sign & 1;
    // Print shadow memory - should produce output 01 00.
    // It means that only least significant bit is uninitialized,
    // the rest is properly initialized.
    printf("Shadow memory: \n");
    ct_print_shadow(&sign, 2);
    // On Intel, negation is two's complement operation. So depending
    // on value of 'sign' (0 or 1), shadow propagation may be needed.
    s = (-sign); // Same as 's = ~sign + 1'
    ct_print_shadow(&s, 2);
    // Take a branch depending on uninitialized value. Should trigger UUM.
    s >>= 15;
    ct_print_shadow(&s, 2);
    if (s==0) {
        printf("Branch A taken\n");
    } else {
        printf("Branch B taken\n");
    }
}

The program reads input from the command line and stores it in the 16-bit value sign. It marks the least significant bit of sign as uninitialized and then negates the value of sign. Negation on the Intel platform is done as a two’s complement operation, hence it is equal to s = ~sign + 1. So, if sign == 0, then ~sign == 0xFFFF and adding 1 will cause all the bits to flip, hence the uninitialized state should be propagated to all the bits (a similar operation is done by the frodo_sample_n function in FrodoKEM).

Finally, a branch is taken depending on the value of the most significant bit of s. An important point to notice here is that the execution path of the program depends on uninitialized data, so UUM should be triggered. Let’s run Memory Sanitizer first.

> clang -g -O0 -DPQC_USE_CTSANITIZER -fsanitize=memory -fno-omit-frame-pointer -fno-optimize-sibling-calls test.c
> ./a.out 0
Shadow memory:
01 00
01 00
00 00
Branch A taken
> ./a.out 1
Branch B taken

MSan is happy, nothing is reported. The ct_print_shadow function shows the state of shadow memory - the second line shows that only the least significant bit is marked uninitialized, so after the (logical) right shift, all bits are considered properly initialized. Now Memcheck:

> clang -g -O0 -DPQC_USE_CTGRIND -fno-omit-frame-pointer -fno-optimize-sibling-calls test.c
> valgrind --tool=memcheck ./a.out 0
==3156952== Memcheck, a memory error detector
==3156952== Command: ./a.out 0
Shadow memory:
01 00
FF FF
01 00
==3156952== Conditional jump or move depends on uninitialised value(s)
==3156952==    at 0x109239: main (test.c:26)
Branch A taken

As expected, Memcheck is not happy. It is clear that the shadow propagation rules are different - Memcheck propagates shadow memory in a way that follows the two’s complement operation, while Memory Sanitizer uses less strict rules.

Initially, I thought that’s a bug (reported in GH#1430, GH#1424, GH#1427), but MSan maintainers from Google made me realize (thanks!) that it is a design decision which allows faster execution. Indeed, looking at the shadow propagation rules described in “MemorySanitizer: fast detector of uninitialized memory use in C++” (see chapter 3.3.1), it seems that shadow memory propagation following carry propagation is not implemented for efficiency reasons.
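The difference can be reproduced with a small model that works on shadow bits only. This is a simplified sketch that is consistent with the outputs shown above, not the actual implementation of either tool. The function names are invented, and a set bit means “uninitialized”.

```c
#include <stdint.h>

/* Approximate rule: a result bit is uninitialized only if the same
 * bit of the operand is uninitialized (carries are ignored). */
uint16_t shadow_neg_approx(uint16_t shadow) {
    return shadow;
}

/* Carry-aware rule: every bit at or above the lowest uninitialized
 * bit may be affected by a carry, so all of them become uninitialized. */
uint16_t shadow_neg_carry(uint16_t shadow) {
    return (uint16_t)(shadow | (0u - shadow));
}

/* A logical right shift by a known amount shifts the shadow as well. */
uint16_t shadow_shr(uint16_t shadow, unsigned k) {
    return (uint16_t)(shadow >> k);
}
```

Starting from the shadow 0x0001, the approximate rule gives 0x0001 after negation and 0x0000 after the shift by 15, so the branch looks clean. The carry-aware rule gives 0xFFFF and then 0x0001, so the branch depends on an uninitialized bit. These are the same byte patterns as printed by ct_print_shadow in the two runs (01 00, 00 00 versus FF FF, 01 00).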

Speed

The graph below shows the difference in execution time when running FrodoKEM decapsulation without any instrumentation, with the compile-time instrumentation of Memory Sanitizer and then with the runtime instrumentation done by Valgrind’s Memcheck. Origin tracking is pretty expensive, so separate results are shown for runs with it enabled and disabled.

The control run shows that decapsulation takes around 3000 ms. With origin tracking disabled, Memory Sanitizer is about 5 times slower than the control run. Memcheck is 25 times slower. Then, with origin tracking enabled, Memory Sanitizer incurs a 9x slowdown, while in the case of Valgrind it is 59x - that’s a big difference.

Execution of code instrumented at compile time is much faster. As described in the earlier section, Memcheck does more granular checks, so slower execution is expected. Nevertheless, the difference is significant. It should be noted that the results do not include setup time. It means they are slightly biased, as the time of the runtime instrumentation done by Valgrind at startup is not included in those results.

Conclusion, limitations and future direction

Memory Sanitizer, in some rare cases, is not going to discover uses of uninitialized data. This negatively impacts checking constant-time implementations. But still, it is pretty good at its job. My CI runs a build with Memory Sanitizer anyway, so adding extra checks costs nothing. Nevertheless, at the development stage, I use both Memcheck and MSan, as the additional assurance provided by Memcheck is needed. I think it would be useful to have a possibility of controlling the rules for shadow memory propagation in Memory Sanitizer, e.g. a compilation flag for choosing between more performant or more granular checks.

But what I would like to see in the future is a type system and better integration with a build system. Imagine that all the variables in the code that are used to store sensitive data could use some kind of “secure” type (or annotation). Then, by introducing a special build configuration, we could tell the build system to instrument the code in a way that data using the “secure” type is automatically marked uninitialized. In case of non constant-time access and with proper unit tests, the UUM detector would automatically report errors. I think it is an interesting feature for a modern programming language like Rust, which seems to occupy the space of secure implementations.

Coming back to the current state of the art, it is also worth mentioning that Memory Sanitizer has some additional limitations:

it is supported on Linux/x86_64, NetBSD and FreeBSD only
it requires instrumenting all memory accesses in the program. This includes the standard C++ library (i.e. used by gtest). Instructions here describe how to instrument and integrate libc++ into a project.

Finally, side-channel attacks are much more complicated and there is no single tool which will be able to detect them all. But on the other hand, problems like the one in FrodoKEM, described above, are pretty basic. Automatic detection of such bugs is possible and should be done by tools, so that cryptography engineers can spend time on more interesting things.

Experimenting with NGINX

Sat, 12 Jun 2021 00:00:00 GMT

This page describes how to enable support for some features on NGINX, e.g. post-quantum schemes or the QUIC protocol. The page is updated on an as-needed basis, and some parts of it may be specific to Debian Linux.

Sources

I’ll get the NGINX sources, change the build configuration, re-compile the server and rebuild the deb package. To get the sources, we have to add the NGINX repositories to /etc/apt/sources.list.

The following two lines go to the end of a file:

deb https://nginx.org/packages/mainline/debian buster nginx
deb-src https://nginx.org/packages/mainline/debian buster nginx

Then add NGINX public key for verification and download the sources:

sudo wget https://nginx.org/keys/nginx_signing.key
sudo apt-key add nginx_signing.key
sudo apt-get update
sudo apt-get upgrade
sudo apt-get build-dep nginx
sudo apt-get source nginx

Sources should now be downloaded to the nginx directory.

Link NGINX with the BoringSSL

My current setup of the HTTP server uses BoringSSL (bssl). The choice comes from the fact that it’s simpler, its documentation is clearer and it contains more modern features (or those I’m interested in).

To link NGINX with BoringSSL, one needs to copy the sources to nginx/debian/modules and compile it.

cd nginx/debian/modules/ &&
git clone https://github.com/google/boringssl
mkdir -p boringssl/build && cd boringssl/build
cmake .. && make -j 8

The following step instructs NGINX to use BoringSSL instead of OpenSSL (used by default). To do that, one needs to modify the rules file nginx/debian/rules.

config.status.nginx: config.env.nginx
    cd $(BUILDDIR_nginx) && \
    CFLAGS="" ./configure {...} --with-stream_ssl_preread_module \
    --with-cc-opt="-I$(CURDIR)/debian/modules/boringssl/include $(CFLAGS) -Wno-ignored-qualifiers" \
    --with-ld-opt="-L$(CURDIR)/debian/modules/boringssl/build/ssl \
    -L$(CURDIR)/debian/modules/boringssl/build/crypto"

The example above shows how to modify a “release” target, but if needed, the debug target can be modified in exactly the same way. Notice that -Wno-ignored-qualifiers has been added. That’s because BoringSSL throws compilation warnings, which become errors in the NGINX build.

Adding QUIC support

The QUIC protocol specified by RFC9000 opens new possibilities. Something I would definitely like to try. Clone the newest version and overwrite whatever is currently provided by NGINX.

hg clone -b quic https://hg.nginx.org/nginx-quic
rsync -r nginx-quic/ nginx

Again, the rules file needs to be modified to enable the support. Add --with-http_v3_module --with-http_quic_module --with-stream_quic_module to config.env.nginx and config.env.nginx_debug targets (somewhere after --with-stream_ssl_preread_module).

Package creation

The following commands re-create the Debian package with NGINX, which can then be installed with dpkg. This two-step procedure first requires modifying the nginx/debian/changelog file and adding information about the changes done to the package. Add something like:

nginx (1.21.0-2~buster) buster; urgency=low

  * 1.21.0-1 adds quic

 -- Kris Kwiatkowski <kris@amongbytes.com>  Sat, 12 Jun 2021 16:01:22 +0300

The next step is to build the package. The build process will use a GPG key to sign the package. To specify the key I use -kkris@amongbytes.com, which identifies the secret key used for signing.

To start the build, use the following command:

sudo dpkg-buildpackage -b -kkris@amongbytes.com

NGINX configuration

The following instructions enable post-quantum key exchange in TLS and add support for the QUIC protocol (unfortunately, PQ in QUIC is not supported).

Post-Quantum support: BoringSSL supports post-quantum key exchange. It can be enabled only in TLS v1.3 and uses a variant of NTRU-HRSS mixed with X25519, called CECPQ2 (detailed description here). To enable that support, the following lines need to be added to nginx.conf:

    ssl_protocols TLSv1.2 TLSv1.3;      # Enable both TLS 1.3 and 1.2
    ssl_ecdh_curve CECPQ2:X25519:P-256; # Enable PQ key exchange

It is important to put CECPQ2 first on that list, as well as to add some classical key exchange algorithm for backward compatibility. This server supports post-quantum key exchange.

QUIC support: for HTTP/3 over QUIC add following changes to the virtual server config:

server{
    listen 443 http3 quic reuseport;
    listen 443 ssl http2;

    quic_retry on;
    ssl_early_data on;

    http3_max_field_size 5000;
    http3_max_table_capacity 50;
    http3_max_blocked_streams 30;
    http3_max_concurrent_pushes 30;
    http3_push 10;
    http3_push_preload on;

    add_header alt-svc '$quic=":443"; ma=3600';
    # ... remaining server configuration ...
}

The add_header alt-svc directive is needed so that the web browser knows that your server supports HTTP/3. The other settings also need to be set up, so that NGINX QUIC doesn’t return 404 errors for your file assets.

OpenVPN authentication hardened with ARM TrustZone

Kris Kwiatkowski — Tue, 12 Jan 2021 00:00:00 GMT

The goal is to connect an embedded device to a VPN network. The VPN uses authentication with X.509 certificates, which means that the device needs to securely store a private key. The question is how to protect the key from being copied. Many ideas have been explored already; in this particular case, I’ll describe a solution which uses a secure enclave. The project itself is quite easy to implement and can serve as a hands-on intro to ARM TrustZone-based TEEs.

Earlier last year, I needed an implementation of a TLS server which stored private keys in a secure enclave, namely OP-TEE running in the Trusted Execution Environment (TEE), protected by ARM TrustZone. A similar idea will be used here, with the software stack integration being the main difference. The previous project integrated the solution with BoringSSL, which requires changing the internals of the library. The preferred solution would not touch the internals of the TLS library, but rather work as a form of plugin to the existing framework. OpenSSL implements the ENGINE API, which can be (and actually is) used as a way to implement cryptographic backends.

Finally, this is what I want to end up with:

Flow for OpenVPN with private client key in TEE

The private key will be stored in a secure enclave. OpenVPN calls OpenSSL for cryptographic operations and operations related to TLS. At the init phase, OpenSSL will load my implementation of the ENGINE API, which I call OpTEE ENGINE. It implements a callback, called by the TLS stack, for message signing. Finally, the engine implementation forwards the signing to OP-TEE, which is the place where the private key operation happens.

Security

But first, why do I think secure storage provides enough security?

The TEE that I’m using claims compliance with the GlobalPlatform API. Looking at the GP requirements in this specification (see 2.2.2), the basic requirements regarding secure storage are to:

obviously, encrypt the data (provide confidentiality as well as integrity)
be bound to a device, this one is important. It means that sensitive data can be accessed only by those applications which are running on a particular device and in the particular TEE (there may be multiple TEEs on the same device).
have an ability to hide sensitive keys from the application running in the TEE
allow access to the data only by the TEE application which has created it (btw: TA=Trusted Application, an application running in the TEE).

In my context, it means that the VPN private key is stored encrypted and can be used only by a single device. The secure storage can be copied to a different device, but as it is bound to a particular one, it can’t be decrypted there. The key can’t be accessed by a malicious TA installed on the same device, thanks to access separation. Finally, the TA that owns the key doesn’t have access to the sensitive data, so in case of a bug in the TA, the key doesn’t leak. It may leak in case of a bug in the TEE, but in this case, the whole system is probably already compromised.

The spec gives hope for a decent level of security. Looking at implementation details, the Key Manager is a component implemented in OP-TEE which ensures confidentiality and integrity of the data (see implementation details of secure storage). To provide device binding it uses a Hardware Unique Key (HUK), which is defined as a globally unique symmetric secret key stored in a piece of hardware (often in the SoC itself) of the device. OP-TEE uses it to derive a key called SSK, which is then used to provide device binding. The SSK is created at boot time and stored in secure memory (never stored on disk):

SSK = HMAC-SHA256(HUK, Chip ID || “some data as salt”)

The SSK is then used to derive a TSK, a key which is unique per TA installed in the TEE. This makes it possible to allow access to the data only for the TA which owns it. Finally, there is the FEK, a randomly generated key used for file encryption.

An important part of this whole story, though just an implementation detail: OP-TEE, as published on GitHub, doesn’t actually try to use a HUK. Retrieval of the HUK is specific to the SoC and needs to be implemented during integration with the concrete device/platform. Namely, there is a function called tee_otp_get_hw_unique_key, which must be filled with proper code for HUK retrieval. Similarly, to provide secure storage, the “chip ID” also needs to be retrieved. This is done by tee_otp_get_die_id, which also needs to be filled with proper code. Currently, OP-TEE uses a stream of 0 bytes as the HUK.

Finally, the secure storage is kept in the normal world OS filesystem (/data/tee by default on Linux). This subsystem uses AES/128. My ultimate goal is to have a quantum-resistant TEE, and AES/128 is too small to resist quantum attacks (because of Grover’s algorithm), hence migration to a 256-bit symmetric key is needed.

TLS client authentication

The X.509 certificates are used to authenticate a client to a VPN server. In this authentication method, a client sends a certificate and a proof of possession of the private key that corresponds to that certificate. In this case, the private key never leaves the TEE, hence the primary functionality of an application running in the TEE is to create a proof when requested.

Looking at the TLS level (TLSv1.3), client authentication starts with the server requesting it by sending a CertificateRequest message (section 4.3.2 of RFC 8446). In response, the client produces a proof by creating the following signature:

proof = sign(0x20 byte repeated 64 times || “TLS 1.3, client CertificateVerify” || 0 || transcript hash)

The client creates the signature with the private key that corresponds to its X.509 certificate. The signature is computed over a concatenation of strings defined in the RFC (section 4.4.3) and a TLS transcript hash (section 4.4.1). Both the X.509 certificate and the proof are sent back to the server for verification.
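
As a rough illustration of the input to that signature, the sketch below assembles the byte string in C. RFC 8446 (section 4.4.3) specifies 64 repetitions of the 0x20 octet, followed by the context string, a single zero byte and the transcript hash. The function name build_cert_verify_input and its buffer-based interface are assumptions made for this example, not code from the project.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Context string for a client signature (RFC 8446, section 4.4.3). */
const char CLIENT_CV_CONTEXT[] = "TLS 1.3, client CertificateVerify";

/*
 * Hypothetical helper: writes the content covered by the client's
 * CertificateVerify signature into out. Layout: 64 bytes of 0x20,
 * the context string, one 0x00 separator, the transcript hash.
 * Returns the number of bytes written, or 0 if the arguments are
 * invalid or out_cap is too small.
 */
size_t build_cert_verify_input(uint8_t *out, size_t out_cap,
                               const uint8_t *transcript_hash,
                               size_t hash_len) {
    /* sizeof counts the trailing NUL, which is not part of the input */
    const size_t ctx_len = sizeof(CLIENT_CV_CONTEXT) - 1;
    const size_t total = 64 + ctx_len + 1 + hash_len;

    if (out == NULL || transcript_hash == NULL || out_cap < total) {
        return 0;
    }
    memset(out, 0x20, 64);                         /* padding */
    memcpy(out + 64, CLIENT_CV_CONTEXT, ctx_len);  /* context string */
    out[64 + ctx_len] = 0x00;                      /* separator */
    memcpy(out + 64 + ctx_len + 1, transcript_hash, hash_len);
    return total;
}
```

With a 32-byte SHA-256 transcript hash the result is 64 + 33 + 1 + 32 = 130 bytes long.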

Secure world

The Trusted Application is mostly copied from the previous project. In the current state, it is assumed that the key is loaded into the TEE at some initial point, and then used when the Normal World requests signing. An alternative implementation could create a private key during the first boot and use it to create a CSR, which is then signed by the CA and returned to the device. It’s a more complicated process, but this way one can ensure that the client’s private key never existed anywhere but on the device.

The demo TA comes with a simple key management app which can be used to install or remove keys from the device. It is also a good place to see how communication from the Normal World to the Secure World is implemented. Assuming the TEE is running on the device and tee-supplicant with the Linux driver is loaded in the Normal World (see here for setup), an application can use the GlobalPlatform API to send requests to and receive responses from the TEE. The code would look something like this:

    // TEE context
    TEEC_Context ctx;
    // Session with the TA
    TEEC_Session sess;
    // Operation context
    TEEC_Operation op;
    // ID of an app in the TEE
    TEEC_UUID uuid = TA_UUID;

    // Initialize a context connecting us to the TEE
    TEEC_InitializeContext(NULL, &ctx);
    // Open a session to the TA identified by uuid
    TEEC_OpenSession(&ctx, &sess, &uuid,
        TEEC_LOGIN_PUBLIC, NULL, NULL, &err_origin);

    // Initialize operation context 'op' (see github)
    // ...

    // Send command to the TA running in TEE
    TEEC_InvokeCommand(&sess, TA_INSTALL_KEYS, &op, &err_origin);

After opening a session with the TA (the TEEC_OpenSession call), the application sets up the op context by providing input arguments and setting buffers for the output. Then the call to TEEC_InvokeCommand triggers communication with the TEE. During this process, the TA signature is verified and the TA is started. The entry point to the TA is a function called TA_InvokeCommandEntryPoint.

TEE_Result TA_InvokeCommandEntryPoint(void __maybe_unused *sess_ctx,
            uint32_t cmd_id,
            uint32_t param_types, TEE_Param params[4]) {
    (void)&sess_ctx; /* Unused parameter */
    switch (cmd_id) {
    case TA_INSTALL_KEYS:
        return install_key(param_types, params);
    case TA_SIGN_ECC:
        return sign_ecdsa(param_types, params);
    case TA_GET_PUB_KEY:
        return get_public_key(param_types, params);
        ...
    }
}

The TA is instructed, by the provided cmd_id, to run specific logic, like key installation, signing or returning the public key (the reason for which is described in the next section). When installing the key, the TA copies the private and public key attributes to a temporary transient object (transient_obj) and then creates a file on persistent storage containing those attributes. The key is identified by the key_id received from the Normal World.

// Puts the key to the storage
static TEE_Result install_key(uint32_t param_types, TEE_Param params[4]) {
    //...
    TEE_ObjectHandle transient_obj = TEE_HANDLE_NULL;
    // ...
    TEE_AllocateTransientObject(TEE_TYPE_ECDSA_KEYPAIR,
            ecc->x.sz * 8, &transient_obj);
    ATTR_REF(cnt, TEE_ATTR_ECC_PRIVATE_VALUE, ecc->scalar);
    ATTR_REF(cnt, TEE_ATTR_ECC_PUBLIC_VALUE_X, ecc->x);
    ATTR_REF(cnt, TEE_ATTR_ECC_PUBLIC_VALUE_Y, ecc->y);
    TEE_InitValueAttribute(&attrs[cnt++], TEE_ATTR_ECC_CURVE,ecc->curve_id, 0);
    TEE_PopulateTransientObject(transient_obj, attrs, cnt);

    ret = TEE_CreatePersistentObject(
        TEE_STORAGE_PRIVATE,
        key_id, 32,
        TEE_DATA_FLAG_ACCESS_WRITE,
        transient_obj,
        NULL, 0, &persistant_obj);
    // ...
}

When signing, the TA initializes key_handle, the handle to the key, by calling TEE_OpenPersistentObject with the key_id. Then, key_handle is used when setting up the operation identified by op (the TEE_SetOperationKey call), which is finally used for signing (the TEE_AsymmetricSignDigest call). One should notice that the private key material stays in the TEE; it is never revealed to the TA.

// Performs ECDSA signing with a key from secure storage
static TEE_Result sign_ecdsa(uint32_t param_types, TEE_Param params[4]) {
    TEE_OperationHandle op = TEE_HANDLE_NULL;
    TEE_ObjectHandle key_handle;

    TEE_OpenPersistentObject(
        TEE_STORAGE_PRIVATE,
        key_id, 32,
        TEE_DATA_FLAG_ACCESS_READ, &key_handle);

    // perform ECDSA signing
    TEE_AllocateOperation(&op, TEE_ALG_ECDSA_P256, TEE_MODE_SIGN, 256);
    TEE_SetOperationKey(op, key_handle);
    ret = TEE_AsymmetricSignDigest(op, NULL, 0,
        params[1].memref.buffer, params[1].memref.size,
        params[2].memref.buffer, &params[2].memref.size);
    LOG_RET(ret);

}

The demo code (here) supports ECDSA/p256 only but can be easily extended to provide support for all the schemes used by TLS v1.3.

OpenSSL engine for OP-TEE

One of the goals for this project was to ease the integration with the TLS layer. It should be possible to provide the whole functionality as a plugin loaded into any modern version of OpenSSL, without code modifications. OpenSSL provides the possibility to extend its functionality by implementing the so-called ENGINE API. A dynamically loadable library may implement some cryptographic operations (like signing, verification, key generation) and register them by calling the ENGINE API. When processing a cryptographic operation, OpenSSL uses the custom implementation if one is provided. The general architecture and a guide to building OpenSSL engines can be found in an excellent paper called Start your ENGINEs: dynamically loadable contemporary crypto.

In the case of the engine for OP-TEE, the code structure looks roughly like this:

static int OPTEE_ENG_bind(ENGINE *e, const char *id) {
    // ... some initialization code ...

    // Set name and ID of an engine
    ENGINE_set_id(e, OPTEE_ENG_ENGINE_ID);
    ENGINE_set_name(e, OPTEE_ENG_ENGINE_NAME);
    // Call OPTEE_ENG_load_private_key to load the private key
    ENGINE_set_load_privkey_function(e, OPTEE_ENG_load_private_key);
    // Register callback for signing; returns 1 on success
    return ENGINE_set_pkey_meths(e, OPTEE_ENG_pkey_meths);
}
static int OPTEE_ENG_pkey_meths(ENGINE *e, EVP_PKEY_METHOD **pmeth,
    const int **nids, int nid) {
    // Use EVP_PKEY_meth_copy to copy all the callbacks to new_meth
    EVP_PKEY_METHOD *new_meth = EVP_PKEY_meth_new(EVP_PKEY_EC, 0);
    EVP_PKEY_meth_copy(new_meth, EVP_PKEY_meth_find(EVP_PKEY_EC));
    // Set new callback for signing
    EVP_PKEY_meth_set_sign(new_meth, 0, OPTEE_ENG_evp_cb_sign);
    // Return new EVP_PKEY_METHOD struture
    *pmeth = new_meth;
    return 1;
}

// Tell the OpenSSL to call OPTEE_ENG_bind when plugin is loaded
IMPLEMENT_DYNAMIC_BIND_FN(OPTEE_ENG_bind)
IMPLEMENT_DYNAMIC_CHECK_FN()

The OP-TEE engine extends OpenSSL with the 2 following custom implementations. OPTEE_ENG_load_private_key extends the functionality of the ENGINE_load_private_key function. The latter is an ENGINE API function used by OpenVPN to load private keys. The custom implementation, provided by optee_eng, checks if a key with the given ID exists in the TEE. It returns an initialized EVP_PKEY object, used by OpenSSL for message signing during TLS session establishment. Contrary to the standard implementation, the EVP_PKEY object returned by optee_eng doesn’t store the private key material; instead, it keeps an ID corresponding to the private key.

The second functionality is implemented by OPTEE_ENG_evp_cb_sign. This function gets invoked when signing is requested for a key returned by OPTEE_ENG_load_private_key. The EVP_PKEY contains a list of function pointers implementing signing, verification, key generation, etc. This callback is assigned to the pointer for message signing. The implementation of this function calls the TA in the TEE with the ID of a key and a message to sign. Then control is transferred to the sign_ecdsa function implemented by the TA, which initializes a handle to the key and calls the TEE OS to perform ECDSA/p256 signing.

The IMPLEMENT_DYNAMIC_BIND_FN macro binds everything together. It defines the entry point of an engine, the first function that gets executed when the library is loaded into OpenSSL (OPTEE_ENG_bind in this case). The function sets an identifier and name of the engine and uses the ENGINE API to assign the callbacks (the ENGINE_set_load_privkey_function and EVP_PKEY_meth_set_sign calls in the code listing above).

Side note: In the case of the private key, OpenSSL v1.1.1 requires that the EVP_PKEY structure contains the public part of the key, otherwise loading of the certificate fails and the TLS client won’t be able to initialize the connection. In this program, the public part is also stored in the TEE.

OK, so the dynamic engine provides the implementation, but OpenSSL needs to somehow know how to load such a library. The following configuration can be added to OpenSSL’s config file (/etc/ssl/openssl.cnf on my Linux), so that the framework knows where to find the dynamic library when an engine load is requested by ID (the engine_id value below).

# Additional content of openssl.cnf

[default_conf]
engines = engine_section

[engine_section]
optee = optee_section

[optee_section]
engine_id = optee
dynamic_path = "/opt/liboptee_eng.so"
init = 1

Let’s see if it works. On QEMU emulating an ARMv8 machine I now get:

qemu> openssl
OpenSSL> engine -c -v optee
(optee) OpTEE OpenSSL ENGINE.
 [id-ecPublicKey]

It seems the engine can be loaded correctly. Now, when OpenSSL tries to sign a message, it needs to call the TEE (which involves an SMC call to switch the CPU into the secure world), get the key from secure storage and return the signature. Also, the crypto operation is now done not by OpenSSL, but by the crypto library provided by the OP-TEE OS (in this case a fork of LibTomCrypt). All in all, there is a cost to all that dance. Measuring this difference gives some idea of the cost and is also a good way to check whether the whole flow works correctly. That is done by speed.cc, located in the project’s repository. The benchmark runs 2 functions: SignREE performs signing fully in the Normal World by calling the pure OpenSSL implementation, and SignTEE uses optee_eng for signing. I’ve got the following results when running it on HiKey960 (ARM Cortex-A73).

The operation works correctly: the benchmarking code loads the optee_eng ENGINE into vanilla OpenSSL and uses only the EVP API. Nevertheless, the slowdown is significant. At this point I need to say that I haven’t done any more investigation, hence I’m not sure where exactly the slowdown comes from. I’ve run the release version of the software and used similar settings for the board as described here. I’m pretty sure the optimization level in OpenSSL is much better than in LibTomCrypt, hence there is probably lots of room for improvement.

Side note: the benchmark expects to find in TEE a key with a name bench_key. It must be inserted by using key management app optee_keymgnt put bench_key .

Plugging to the OpenVPN

At this point integration with OpenVPN is very easy. The only requirement is version 2.5 of the software (which includes this change). That change adds the possibility to use an OpenSSL ENGINE to load the private key, which is what’s needed here.

There is a trick that needs to be used here to configure OpenVPN correctly. The configuration file has a key parameter, which specifies the name of the file with the private key corresponding to the certificate provided by the cert parameter. In the case of optee_eng, this is the name of the key stored in the TEE (this name is provided to ENGINE_load_private_key as the key_id argument). Additionally, a file with the same name must exist in the OpenVPN configuration directory. OpenVPN will try to use the engine to load a key only if loading from the file fails, so the file needs to be empty to make sure the load fails. The configuration also needs to specify the engine parameter, to instruct OpenSSL to use optee_eng. The whole configuration file, as used on the client, can be found here.
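
A minimal sketch of the relevant part of a client configuration could look as follows. The key name vpn.testlab.com matches the key inserted into the TEE later in this post, while the certificate path is an assumption; the configuration in the repository may differ.

```
# Fragment of client.conf (illustrative values)
cert certs/client.crt
# Name of the key in the TEE; an empty file with this name
# must exist in the OpenVPN configuration directory
key vpn.testlab.com
# Load the OpenSSL engine by its ID
engine optee
```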

Setting-up OP-TEE image, building & running

The code of the solution is available on GitHub. It was tested against OP-TEE 3.11, running in QEMU and on the HiKey960 development board. To build and play with the solution, one first needs to build OP-TEE itself (instructions here).

To compile the solution:

git clone https://github.com/henrydcase/optee_eng
cd optee_eng
git submodule init && git submodule update
mkdir build && cd build
cmake -DOPTEE_BUILD_DIR=<OPTEE location> -DPLATFORM=qemu ..
make
make install

The <OPTEE location> is the root directory of OP-TEE. The -DPLATFORM option specifies the platform for which the solution should be built. I’ll use QEMU in this example. The make install command copies all needed files to OP-TEE’s build directory.

OP-TEE uses buildroot to create the Normal World OS, where examples can be run. By default, OpenVPN is not enabled. It can be enabled by applying 2 patches from the optee_eng repo:

cd optee_eng
patch -p1 -d <OPTEE location>/buildroot < optee-patches/0001-openvpn-2.4.9-to-2.5.0.patch
patch -p1 -d <OPTEE location>/build < 0002_build_enable_openvpn.patch
cd <OPTEE location>/build
make run

To connect to the VPN, we need a server. The repository contains configuration for the server and the client, as well as a set of X.509 certificates (to regenerate the certificates, create_cert.sh can be used). The command below configures and starts the OpenVPN server on the host machine.

> cd optee_eng
> sudo openvpn --cd cfg --config openvpn_srv.conf
2021-02-06 23:39:53 OpenVPN 2.5.0 [git:makepkg/a73072d8f780e888+] x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on Nov  6 2020
2021-02-06 23:39:53 library versions: OpenSSL 1.1.1h  22 Sep 2020, LZO 2.10
2021-02-06 23:39:53 TUN/TAP device tun0 opened
2021-02-06 23:39:53 net_iface_mtu_set: mtu 1500 for tun0
2021-02-06 23:39:53 net_iface_up: set tun0 up
2021-02-06 23:39:53 net_addr_v4_add: 172.16.0.1/16 dev tun0
2021-02-06 23:39:53 UDPv4 link local (bound): [AF_INET][undef]:1194
2021-02-06 23:39:53 UDPv4 link remote: [AF_UNSPEC]
2021-02-06 23:39:53 Initialization Sequence Completed

Once QEMU is started and the user is logged in as root in the Normal World (NWd) terminal, tee-supplicant -d needs to be started. The supplicant makes it possible to communicate from the Normal World to the Secure World. The next thing to do is to insert the client key into the TEE and start the VPN.

> optee_keymgnt put vpn.testlab.com /etc/openvpn/certs/client.key
> rm /etc/openvpn/certs/client.key
> openvpn --cd /etc/openvpn/ --config client.conf
2021-02-09 00:27:27 Initializing OpenSSL support for engine 'optee'
2021-02-09 00:27:27 OpenSSL: error:0909006C:PEM routines:get_name:no start line
2021-02-09 00:27:27 PEM_read_bio failed, now trying engine method to load private key
2021-02-09 00:27:27 TCP/UDP: Preserving recently used remote address: [AF_INET]172.16.0.1:1194
...
2021-02-09 00:27:28 [vpn.testlab.com] Peer Connection Initiated with [AF_INET]172.16.0.1:1194
...

The second terminal displays logs from TEE OS running in parallel to Linux. One should see the following traces there:

# When inserting a key to the TEE
I/TA: New key [F671A1B757] registered
# During TLS handshake
I/TA: Sign for a key ID [F671A1B757] requested
I/TA: Message signed with key ID [F671A1B757]

At this point the VPN tunnel should be correctly created.

Conclusion

Hopefully, this example shows how to utilize ARM TrustZone from OpenSSL to secure private keys for OpenVPN. Ideas similar to those implemented by optee_eng can be used with any software using OpenSSL: the same engine can be used by Nginx, ssh-agent or strongSwan, on both the client and the server side. The solution is fully “pluggable”; it doesn’t require any modification to existing software. It’s worth noting that such isolation of private keys from internet-facing applications may help to avoid security incidents. For example, it would be enough to use optee_eng to protect the private key from Heartbleed, as the private key is not stored in the process running the OpenSSL library.

As an improvement to this idea, one could think of using the PKCS#11 standard for communication with the TEE. It wasn’t done here for 2 reasons. First, PKCS#11 would require a TA implementing the standard, which is not finished yet (but ongoing). The other reason is that my ultimate goal (which wasn’t presented here) is to use post-quantum cryptography, and those new schemes are not yet properly incorporated into the PKCS#11 standard.

Finally, the upcoming OpenSSL 3.0 deprecates the ENGINE API. Instead, there is a new concept called providers. Hence, an implementation of optee_eng for the upcoming version of OpenSSL will probably look slightly different. But on the one hand OpenVPN doesn’t support this new version yet, and on the other hand it doesn’t seem that 3.0 yet provides similar functionality for loading private keys.

On using Trusted Execution Environment for TLS session signing

Mon, 15 Apr 2019 00:00:00 GMT

Problem description

Typically, a TLS server uses a certificate and an associated private key in order to sign a TLS session. From now on I’ll call this private key a “traffic-private-key”. The certificate and the traffic-private-key together form an asymmetric cryptographic key pair. Revealing the traffic-private-key makes it possible to perform man-in-the-middle attacks. Typically the traffic-private-key is stored on the server’s hard disk. Even if the traffic-private-key is stored in encrypted form, at some point the HTTPS server needs to be able to decrypt it in order to use it for signing. It means that at runtime the key will be available in plaintext in the memory of the HTTPS process. At this point an attacker with access to the machine may be able to dump the memory of the process and learn the traffic-private-key.

Hence, server operators need to take special care in order to make sure traffic-private-keys are not revealed.

This situation gets more complicated when the server operator and the domain owner are 2 different entities. For example, in the case of a CDN, TLS offloading happens on the edge system, which is often a completely different machine than the actual application server. It is also often the case that the servers (physical machines) of a CDN provider are spread over the world and located in remote data centers. Those data centers may be owned by multiple different entities.

In such situations, ensuring that the traffic-private-key is not copied and used by an attacker may be challenging and not obvious to solve. Clients of the CDN may also be concerned about the idea of spreading the traffic-private-key over the world.

Solution proposed

For brevity I’m assuming the server uses only TLS 1.3 as specified in [RFC8446], but the solution can be adapted to any version of TLS.

The idea is to perform TLS session signing inside a Trusted Execution Environment. The traffic-private-key will be accessible only to the TEE. Additionally, the solution ensures that the key is stored in encrypted form in trusted storage. The storage is bound to the physical machine, hence a copy of the storage can’t be used on a different machine.

The solution, as implemented in the PoC and described below, is based on ARM TrustZone and uses an open source TEE called OP-TEE (see [OP-TEE]); sources of OP-TEE are stored on GitHub (see [OP-TEE-SRC]). The choice of OP-TEE was driven by the fact that the author is quite familiar with the environment; nevertheless, the solution can be implemented with other TEEs which provide device-bound trusted storage. The author is convinced that Intel SGX with Asylo would be a better choice here. The solution also makes heavy use of BoringSSL for handling TLS traffic.

The points below describe the implementation in more detail:

Key provisioning server

It is assumed that the machine is initially provisioned with software which acts as a server for traffic-private-key provisioning.

In order to install a traffic-private-key on a machine, the operator connects to the key provisioning server and sends the traffic-private-key to be installed on the machine. This operation is done over a TLS connection which uses client authentication. The operator is required to possess some form of provisioning key. The key provisioning server must be able to verify the provisioning key, hence a verification-provisioning-key is also preinstalled.

After successful TLS authentication, the operator sends a pair consisting of the traffic-private-key and the domain name for which the key must be used. This pair is installed in secure storage, which is accessible from the TEE only. The TEE ensures the traffic-private-key can’t be read from outside of the TEE.

TLS session signing

The solution uses BoringSSL to offload TLS traffic. The BoringSSL API gives a possibility to register a function which is called during the TLS handshake when the server needs to sign a session with the traffic-private-key.

It means that no modifications to BoringSSL are needed in order to use it for signing a TLS session with a traffic-private-key stored in the TEE.

The code which registers signing operation looks like this:
```
// Pseudocode: the callback forwards the signing request to the TEE
void signing_operation(message_to_sign, domain_name, *signature) {
    ... calls TEE for signing ...
}

SSL_PRIVATE_KEY_METHOD private_key_methods = {
    .sign = signing_operation,
    .decrypt = ...,
    .complete = ...,
};
SSL_CTX_set_private_key_method(ssl_ctx, &private_key_methods);
```
The TLS server calls the signing_operation function when a TLS session needs to be signed. This function passes message_to_sign and domain_name to the TEE. In the TEE, the domain_name is used as an index to retrieve the right traffic-private-key (many domains can be handled by the server). The TEE performs the signing and the signature is returned to BoringSSL, which continues the TLS handshake as normal.
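
The lookup by domain name can be pictured with a minimal C sketch. The key_entry structure, the find_key function and the integer key_slot standing in for a handle to the stored key are all hypothetical; the PoC keeps keys in OP-TEE trusted storage rather than in an in-memory table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical key table: one entry per provisioned domain. */
struct key_entry {
    const char *domain_name; /* index used for the lookup */
    int key_slot;            /* stand-in for a handle to the stored key */
};

/*
 * Returns the entry whose domain matches exactly, or NULL when no key
 * was provisioned for that domain (signing must then be refused).
 */
const struct key_entry *find_key(const struct key_entry *table, size_t n,
                                 const char *domain_name) {
    if (table == NULL || domain_name == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].domain_name, domain_name) == 0) {
            return &table[i];
        }
    }
    return NULL;
}
```

A request for a domain without an installed key simply finds no entry, which is why a handshake for an unknown server name cannot be signed.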

Key can’t be used on another machine.

Trusted storage in OP-TEE is bound to the physical device. It means that even if the storage is copied to another device, it won’t be possible to decrypt the stored data.

In more detail, OP-TEE implements the GlobalPlatform Trusted Storage API. Device binding is one of the requirements for trusted storage. To make it possible, each device needs to come with a preinstalled Hardware Unique Key (HUK).

More details about trusted storage can be found in the OP-TEE documentation (see [OP-TEE-STORAGE]).

It must be mentioned that, in order to use trusted storage, SoC-specific customization is needed (see the comment in orange at the bottom of [OP-TEE-STORAGE]).

PoC implementation

As mentioned before, the implementation uses OP-TEE as the TEE. The example was tested with OP-TEE running inside QEMU emulating ARMv8.

The PoC is composed of:

admin_cli: A client used for installing private keys inside the TEE. This component is used instead of the key provisioning server, as such a server was not implemented in the PoC.
server: A TLS offloading server. The server listens on 127.0.0.1:1443 and uses BoringSSL to accept and handle TLS connections. The server implements a callback function, which calls the TEE when a private key operation needs to be done. Only ECDSA/P256 signing is currently supported.
ta: A Trusted Application running inside the TEE. The application is responsible for processing requests from admin_cli, that is storing keys on trusted storage and deleting them if requested, as well as for processing signing requests from the server.

The section called “Example of usage” explains how to use the software in detail.

Compilation and installation

The following steps need to be taken to install the software:

OP-TEE building. This step is explained in detail here. It is required to perform steps 1 to 5. The TARGET (see the building instructions) used by this example is called QEMUv8. In case OP-TEE is started after step 5, it has to be stopped.
The next steps assume that a Linux operating system is used and OP-TEE has been cloned to the directory called OPTEE_DIR.
Create the directory /tmp/tee_share
Go to the OPTEE_DIR directory.
Clone the repository: git clone https://git.amongbytes.com/kris/c3-tls-sign-delegator.git projects
Compile BoringSSL for aarch64 and the native system: cd OPTEE_DIR/projects/bssl; make. The Makefile is configured to use the toolchain built in step 1. This step also builds BoringSSL for the host machine, so it requires that all dependencies for building BoringSSL are installed (see [BORING-BUILD]).
Compile the solution: cd OPTEE_DIR/projects/delegator; make

Start process

Starting OP-TEE: The user needs to:

Enter build directory: cd OPTEE_DIR/build
Start QEMU with OP-TEE emulation: make QEMU_VIRTFS_ENABLE=y QEMU_USERNET_ENABLE=y QEMU_VIRTFS_HOST_DIR=/tmp/tee_share HOSTFWD=",hostfwd=tcp::1443-:1443" run-only.
Just after QEMU starts, it will pause with the following prompt:

cd /home/hdc/repos/optee/qemuv8/build/../out/bin && /home/hdc/repos/optee/qemuv8/build/../qemu/aarch64-softmmu/qemu-system-aarch64 \
    -nographic \
    -serial tcp:localhost:54320 -serial tcp:localhost:54321 \
    -smp 2 \
    -s -S -machine virt,secure=on -cpu cortex-a57 \
    -d unimp -semihosting-config enable,target=native \
    -m 1057 \
    -bios bl1.bin \
    -initrd rootfs.cpio.gz \
    -kernel Image -no-acpi \
    -append 'console=ttyAMA0,38400 keep_bootcon root=/dev/vda2' \
    -fsdev local,id=fsdev0,path=/tmp/tee_share,security_model=none -device virtio-9p-device,fsdev=fsdev0,mount_tag=host -netdev user,id=vmnic,hostfwd=tcp::1443-:1443 -device virtio-net-device,netdev=vmnic
QEMU 3.0.93 monitor - type 'help' for more information
(qemu)

The user continues the process by entering c:

(qemu) c

After a while 2 additional terminals should appear: one terminal labeled “Normal”, running Linux, and another terminal labeled “Secure”, with output from the TEE.

In the “Normal World” terminal the user needs to mount a file system to share data between the guest and the host machine. The following command needs to be used:
```
mount -t 9p -o trans=virtio host /mnt
```
Install the Trusted Application inside OP-TEE:

In the “Normal” terminal invoke:
```
sh /mnt/out/etc/tee_install
```

At this point the installation and startup process is completed and the solution can be used.

Example of usage

Installing a key on secure storage

The first step is to install the key on secure storage. Ideally this step is done by a “Key provisioning server”. This PoC doesn’t implement such a server, so admin_cli can be used to install the key instead.

In the “Normal” terminal, go to /mnt/out/ and invoke:
```
cd /mnt/out
./admin_cli/admin_cli put www.test.com etc/ecdsa_256.key
```
This command installs the private key for www.test.com. In the “Secure” terminal you should see the message E/TA: install_key:156 Storing a key. After this step etc/ecdsa_256.key can be removed.

Start a TLS server and perform a TLS handshake:

With the private key installed, the TLS server can be started. In the “Normal” terminal invoke:

> cd /mnt/out
> ./server/server

The server will start listening on 127.0.0.1:1443. On the host machine one can try to connect to the TLS server:

> cd OPTEE_DIR

> ./projects/bssl/src/build.native/tool/bssl client -connect 127.0.0.1:1443 -server-name "www.test.com"
Connecting to 127.0.0.1:1443
Connected.
Version: TLSv1.3
Resumed session: no
Cipher: TLS_AES_128_GCM_SHA256
ECDHE curve: X25519
Signature algorithm: ecdsa_secp256r1_sha256
Secure renegotiation: yes
Extended master secret: yes
Next protocol negotiated:
ALPN protocol:
OCSP staple: no
SCT list: no
Early data: no
Cert subject: CN = www.dmv.com
Cert issuer: C = FR, ST = PACA, L = Cagnes sur Mer, OU = Domain Control Validated SARL, CN = Domain Control Validated SARL

An attempt to access a different domain fails, as no traffic-private-key is available for it.

Extensions to the idea

First of all, key storage is bound to the device. In order to use a stolen key for MITM, an attacker needs to steal the whole machine, which is much more difficult and is a risk that is easier to control. To implement such a solution the user doesn’t need an expensive HSM; an Intel CPU with SGX and Asylo can be used instead. It’s also easy to imagine some extensions to this idea. For example, instead of calling the TEE for session signing during each TLS handshake, the solution could use “Delegated Credentials for TLS” (see https://tools.ietf.org/html/draft-rescorla-tls-subcerts-02). In this case the TEE would be responsible for generating short-lived certificates, and the TLS server would request such a certificate at a fixed interval (every few minutes). This idea could be combined with another: instead of storing the traffic-private-key on multiple machines, one could store the key in a central location with more restricted access (but still in a TEE). Combining those two ideas improves the security of traffic-private-key storage without degrading the time needed to perform a TLS handshake. It should be noted that “Delegated Credentials for TLS” are already implemented in BoringSSL.

i2c-stub: Playing with I2C on linux

Kris Kwiatkowski — Fri, 08 Feb 2019 00:00:00 GMT

Recently I had a chance to play with i2c-stub. The goal was to send and receive encrypted data to/from an I2C-connected device. I didn’t want to play with a real I2C device, so I needed to emulate it somehow, which is possible with i2c-stub on Linux. Below is a description of how it was done.

Requirements

The solution needs to be implemented in C and have the following functionality:

Possibility to connect to an I2C slave
Send encrypted data
Receive and decrypt data
Possibility to check the connection status

The code

The code itself is here. To compile it with gcc, simply download it and run make.

Initialization

In order to use the code (read/write data to I2C) I’m using the i2c-stub Linux module and the i2c-tools package (ArchLinux). i2c-stub creates a fake I2C adapter (controller/master) and emulates I2C hardware (using an array to store data). We will also need the i2c-dev module as a frontend.

The following commands load the modules, initialize a slave device with address 0x03, and read its initial state:

> modprobe i2c-dev
> modprobe i2c-stub chip_addr=0x03

> i2cdetect -l
i2c-1   i2c         i915 gmbus dpc                      I2C adapter
i2c-2   i2c         i915 gmbus dpd                      I2C adapter
...
i2c-8   smbus       SMBus stub driver                   SMBus adapter <-- this one
...

> i2cdump -y 8 0x03
No size specified (using byte-data access)
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................

We can see that the module was loaded. i2cdetect has detected it as adapter i2c-8 (character device /dev/i2c-8), and i2cdump shows the memory state of the slave device with address 0x03.
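
The bus number assigned to the stub adapter (8 above) can differ between machines, so it can be handy to extract it from the i2cdetect -l listing rather than hard-coding it. A minimal sketch, assuming the adapter description contains the text "SMBus stub driver" as in the output above; the function name stub_bus is made up for this example.

```shell
# stub_bus: read `i2cdetect -l` output on stdin and print the bus number of
# the first adapter whose description mentions the SMBus stub driver.
stub_bus() {
    awk '/SMBus stub driver/ { sub(/^i2c-/, "", $1); print $1; exit }'
}

# Usage (needs the i2c-stub module loaded):
#   BUS=$(i2cdetect -l | stub_bus)
#   i2cdump -y "$BUS" 0x03
```

Nothing is printed when no stub adapter is listed, so an empty BUS is a sign that the module is not loaded.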

Sending data

The test program has a -s option that needs to be used in order to send data. The device ID needs to be provided as an argument (here 8, the number of the i2c-8 adapter).

> ./bin/main -s 8 && echo $?
0

On success the program returns 0. We can now verify with i2cdump that the data has been stored in the I2C device.

     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 1f f8 47 8e 7f 24 1d 2b 47 ca 64 be ce 0a 3f bd    ??G??$?+G?d?????
10: 08 1c 05 87 b0 31 6c 85 46 94 6f c8 9e 49 dd b2    ?????1l?F?o??I??
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................

Before sending, the program uses ChaCha20-Poly1305 to encrypt and authenticate the data.

Receiving data

The test program has a -r option to indicate that the user wants to receive data from the I2C device. On exit the program prints the received data.

[root@cryptoden final]# ./bin/main -r 8
RECEIVED DATA:
HELLO WORLD!!!

As the data is authenticated, any change to the data stored in the I2C device will result in a decryption error. To see this behaviour one can dump the I2C memory, change it, load it back to I2C and try to read again. Let’s see this:

> i2cdump -y 8 0x03 b > dump
> cat dump
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 1f f8 47 8e 7f 24 1d 2b 47 ca 64 be ce 0a 3f bd    ??G??$?+G?d?????
10: 08 1c 05 87 b0 31 6c 85 46 94 6f c8 9e 49 dd b3    ?????1l?F?o??I??
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................

## Here I modify the 32nd byte from b2 to b3 and load the data to i2c-stub
> i2c-stub-from-dump 0x03 dump
256 byte values written to 8-0003

## Trying to read
> ./bin/main -r 8
[i2c_recv() src/i2c.c:165] Error occured when decrypting
[i2c_recv() src/i2c.c:172] ERROR: can't receive encrypted data
[test_receive() src/main.c:88] Error occured when receiving data
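
When experimenting with tampering like this, it helps to see exactly which bytes differ between two dumps. The sketch below compares two text dumps saved with i2cdump -y BUS ADDR b and prints the offset of every changed byte. The function name dump_diff and the file names in the usage comment are hypothetical, and the parsing assumes the 16-bytes-per-row layout shown above.

```shell
# dump_diff OLD NEW: print one line per changed byte as "0xOFFSET: old -> new".
dump_diff() {
    awk '
        $1 !~ /^[0-9a-f]0:$/ { next }     # skip the header and any non-data line
        NR == FNR {                        # first file: remember every byte
            for (i = 2; i <= 17; i++) old[$1, i] = $i
            next
        }
        {                                  # second file: report differences
            for (i = 2; i <= 17; i++)
                if ((old[$1, i] "") != ($i ""))
                    printf "0x%s%x: %s -> %s\n", substr($1, 1, 1), i - 2, old[$1, i], $i
        }
    ' "$1" "$2"
}

# Usage (hypothetical file names):
#   i2cdump -y 8 0x03 b > dump.orig
#   cp dump.orig dump        # then edit one byte in dump
#   dump_diff dump.orig dump
```

Offsets are zero-based, so a change in the 32nd byte is reported at offset 0x1f.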

Testing

The program has a -t option which can be used to test the program and check that the connection stays persistent after connecting to the device.

> ./bin/main -t 8 && echo $?
0

Additional links

https://robot-electronics.co.uk/i2c-tutorial
http://alokprasad7.blogspot.com/2018/01/fake-i2c-device-i2c-stub.html
https://electronicayciencia.github.io/wPi_soft_i2c/
https://learn.sparkfun.com/tutorials/i2c/all

Building HTTPS server with quantum-safe TLS in baby steps

Wed, 10 Oct 2018 00:00:00 GMT

This post is a step-by-step instruction for building an HTTPS web server that uses a quantum-resistant algorithm for TLS key exchange. The solution is written in Go. For the web server I’ve used Caddy. Caddy uses the TLS implementation from the standard Go library. As this doesn’t support any quantum-resistant algorithms, I’ll swap it for another implementation.

The quantum-resistant algorithm of my choice is SIDH. Quite some effort has already been put into security research of SIDH. The basics of the algorithm are explained in more detail by @LVH in his blog post as well as on Wiki. I’ll use Cloudflare’s implementation of SIDH available from here. One interesting characteristic of the algorithm is that it can be used as a drop-in replacement for ECDH.

Finally, I’ll need a TLS implementation which supports SIDH. This has been done in tls-tris. The library provides built-in support for SIDH/P503-X25519, a key exchange based on the IETF draft Hybrid ECDHE-SIDH Key Exchange for TLS. This is done over TLS v1.3, which the library now also supports. tls-tris is code compatible with the TLS implementation from standard Go, which makes it possible to simply swap one implementation for the other.

Step 1: SIDH support in TLS

With tls-tris it is possible to perform SIDH key exchange. In order to link an application (Caddy in this case) with tls-tris, one needs to swap the TLS implementation in the standard library and recompile Go from source. All this is implemented in the Makefile that comes with the library. Binaries are placed in the tls-tris/_dev/GOROOT folder, and GOROOT needs to point to this folder.

Let’s first download needed sources:

# Create some workspace
WORKSPACE=/tmp/workspace/
mkdir -p ${WORKSPACE}

# Get Go 1.10 sources (if not already done)
cd ${WORKSPACE} 
wget https://dl.google.com/go/go1.10.4.linux-amd64.tar.gz -O - | tar -xz

# Clone tls-tris 
git clone https://github.com/cloudflare/tls-tris

The next step is to build Go with TLSv1.3 and SIDH support. This is done automatically, as the Makefile implements all the needed steps. The build-all target will basically:

Swap the standard library TLS with tls-tris
Download the SIDH crypto library and vendor it into the ${WORKSPACE}/go/src/vendor directory. This way SIDH is available as if it were part of the standard library.

# By specifying GOROOT_ENV makefile knows where to look for Go sources
cd tls-tris; GOROOT_ENV=${WORKSPACE}/go make -f _dev/Makefile build-all

Finally GOROOT needs to be adjusted:

export GOROOT=${WORKSPACE}/tls-tris/_dev/GOROOT/linux_amd64

Step 2: Patching Caddy

By default Caddy supports TLS up to version 1.2. Thanks to the steps above, TLSv1.3 and the hybrid SIDH/503-X25519 key exchange are ready to use. Nevertheless, some code changes to Caddy are needed in order to benefit from those new features.

Let’s start by cloning the sources:

mkdir -p /tmp/gopath
export GOPATH=/tmp/gopath
go get github.com/mholt/caddy/caddy 
go get github.com/caddyserver/builds
cd $GOPATH/src/github.com/mholt/caddy

Now we need to add 2 lines of code in order to:

1. Enable TLS v1.3: in the file caddytls/config.go Caddy keeps a map of supported TLS protocols. TLSv1.3 needs to be added to this map.
2. Enable SIDH: in the same file Caddy also keeps a map of supported curves. The scheme called tls.HybridSIDHp503Curve25519 needs to be added to this map.

The whole diff should look something like this:

> git diff
diff --git a/caddytls/config.go b/caddytls/config.go
index 8cf61e4..fc7510d 100644
--- a/caddytls/config.go
+++ b/caddytls/config.go
@@ -583,6 +583,7 @@ var SupportedProtocols = map[string]uint16{
        "tls1.0": tls.VersionTLS10,
        "tls1.1": tls.VersionTLS11,
        "tls1.2": tls.VersionTLS12,
+       "tls1.3": tls.VersionTLS13,
 }
 
 // GetSupportedProtocolName returns the protocol name
@@ -682,6 +683,7 @@ var supportedCurvesMap = map[string]tls.CurveID{
        "P256":   tls.CurveP256,
        "P384":   tls.CurveP384,
        "P521":   tls.CurveP521,
+       "SIDH/503-X25519": tls.HybridSIDHp503Curve25519,
 }
 
 // List of all the curves we want to use by default.

With those changes applied, Caddy can be built:

# Once again, just to make sure right Go version is used
export GOROOT=${WORKSPACE}/tls-tris/_dev/GOROOT/linux_amd64

# And let's build caddy
cd $GOPATH/src/github.com/mholt/caddy/caddy
go run build.go

Caddy configuration and server bring up

At this point hybrid post-quantum key exchange should be supported by Caddy. Obviously that’s not the default configuration, so one needs to tell Caddy to use it. Caddy is configured by providing a Caddyfile. In this file the maximum protocol version is set to “tls1.3”, and the elliptic curve preferences are changed by specifying “SIDH/503-X25519” as a key exchange algorithm. My minimal Caddyfile looks like this:

localhost:2015 {
        tls self_signed
        tls {
                protocols tls1.2 tls1.3
                curves X25519 P256 "SIDH/503-X25519"
        }
        log stdout
        proxy / http://www.amongbytes.com
}

This file is placed in the same folder where the “caddy” binary lives, namely $GOPATH/src/github.com/mholt/caddy/caddy/Caddyfile. It will basically open port 2015 on localhost and forward all traffic to http://www.amongbytes.com. A self-signed certificate will be used in this test configuration. Let’s start it:

> cd $GOPATH/src/github.com/mholt/caddy/caddy
> ./caddy
Activating privacy features... done.
https://localhost:2015

Looks like it’s working. Now it would be good to actually test whether post-quantum key exchange is working. There are probably not too many browsers supporting SIDH/503-X25519 (if any). Nevertheless, tls-tris contains a patch for BoringSSL which adds SIDH and is used for interoperability testing. We can reuse it to test our setup.

cd ${WORKSPACE}

# Patch for BoringSSL adding SIDH/P503-X25519 support
wget https://raw.githubusercontent.com/cloudflare/tls-tris/master/_dev/boring/sidh_ff433815b51c34496bb6bea13e73e29e5c278238.patch

# Clone and checkout BoringSSL. I'm using specific commit as I want to make sure patch applies correctly
git clone https://boringssl.googlesource.com/boringssl
cd boringssl
git fetch && git checkout ff433815b51c34496bb6bea13e73e29e5c278238 
patch -p1 < ../sidh_ff433815b51c34496bb6bea13e73e29e5c278238.patch

# When building, make sure EXP_SIDH is defined as it enables SIDH
mkdir build && cd build; cmake -DEXP_SIDH=1 -GNinja .. && ninja

Assuming the server is running, we can now check whether the quantum-resistant TLS handshake works.

> ./tool/bssl client -curves x25519sidh503 -connect localhost:2015                                         
Connecting to [::1]:2015
Connected.
  Version: TLSv1.3
  Resumed session: no
  Cipher: TLS_AES_128_GCM_SHA256
  ECDHE curve: x25519sidh503
  Signature algorithm: ecdsa_secp256r1_sha256
  Secure renegotiation: yes
  Extended master secret: yes
  Next protocol negotiated: 
  ALPN protocol: 
  OCSP staple: no
  SCT list: no
  Early data: no
  Cert subject: O = Caddy Self-Signed
  Cert issuer: O = Caddy Self-Signed

BoringSSL reports that the ECDHE curve used for key exchange was “x25519sidh503”. It seems post-quantum key exchange works just fine!
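
For repeated experiments the check can be scripted instead of read by eye. A small sketch, assuming the client summary keeps the "ECDHE curve:" label shown above; negotiated_curve is a made-up helper name, and stderr is merged into stdout in the usage example because the tool may print the summary there.

```shell
# negotiated_curve: read `bssl client` output on stdin and print the value of
# the "ECDHE curve:" line (nothing is printed if the line is absent).
negotiated_curve() {
    sed -n 's/^ *ECDHE curve: *//p'
}

# Usage (needs the running Caddy server and the patched bssl binary):
#   curve=$(./tool/bssl client -curves x25519sidh503 -connect localhost:2015 2>&1 </dev/null | negotiated_curve)
#   [ "$curve" = "x25519sidh503" ] && echo "hybrid key exchange negotiated"
```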

Conclusion and next steps

I’m neither an expert nor a big fan of Golang. Nevertheless, I think it is a great language for performing experiments - especially those related to networking. The setup presented here is used for experiments related to the implementation of post-quantum cryptographic primitives. I’m using it both on ARM and Intel, and thanks to Golang’s build tools the compilation process is very simple, fast and quite easy to perform.

SIDH looks interesting as a quantum-resistant replacement for ECDH. Nevertheless, a KEM construction providing IND-CCA2 security sounds to me like something worth trying. In the next steps I will be experimenting with other algorithms - my non-exhaustive list contains Round5 (which presents very interesting results), something NTRU based, Kyber and CSIDH. At a further step I’ll also try to tackle quantum-resistant signature schemes.

TrustZone Overview

Kris Kwiatkowski — Wed, 30 May 2018 00:00:00 GMT

Slides from presentation given at the Cloudflare office in London.

The goal of the presentation was to introduce the main concepts behind the Trusted Execution Environment on ARM and how it could potentially be used on the server side.

mbedTLS vs BoringSSL on ARM

Kris Kwiatkowski — Thu, 19 Apr 2018 00:00:00 GMT

Goals and assumptions

The goal is to choose the most suitable TLS library that can be statically linked with an application. The application will be running on a modern mobile operating system and a variety of ARM CPUs. I’m interested in the client side of TLS only. The ideal library is the one which ensures the best security, implements algorithms optimized for speed and compiles to a reasonably small binary. Additionally, I assume I can control both sides of the connection, meaning I’m free to choose the cipher(s) used for both symmetric and asymmetric cryptography (without using PSK). I also have some requirements regarding licences and being open source.

I’ve identified two libraries which seem to meet those requirements:

mbedTLS - a library formerly known as PolarSSL. It makes it fairly easy for developers to include cryptographic and TLS capabilities in embedded products. It is highly configurable, so TLS functionality can be included with a very small code footprint. It is currently maintained by ARM.
BoringSSL - a fork of OpenSSL maintained and used by Google. It is the default TLS library used by Android OS (starting from version M) and Chrome, and it is also used on Cloudflare systems. It has the advantage of originating from OpenSSL, which means that the library has received a lot of review and testing.

Testing application

It’s a simple C-based test application which compiles and links against the library under test and runs on an ARMv8 platform running the Android operating system. The app is composed of a client and a server. As I’m only interested in the client side of TLS, the server is fixed to always use the same library (it’s based on BoringSSL). The server is configured to support only TLSv1.2 (as 1.3 is not supported by mbedTLS yet [16]). In order to start the server, the user provides an argument which specifies the certificate type to be used (RSA, ECDSA or EdDSA based). Once running, it always enforces the same cipher suite - for example, in the case of RSA it will be ECDHE key agreement with RSA signature and AES-256 in GCM AEAD mode.

The client application is the one which I want to benchmark. I have implemented one which uses the mbedTLS API and links with this library, and a similar one for BoringSSL. The client always establishes the TCP connection in blocking mode (for simplicity). It implements 3 different tests:

Handshake: during this test the client opens a TCP connection and performs many handshakes without closing the connection. Performance of this test depends on the key type used for certificate signing and on the symmetric key agreement algorithm (as well as the elliptic curve used), hence this test is performed multiple times, once for each certificate type
Write: the client opens a TCP connection and sends a few hundred megabytes of data. This test is done mostly to assess the performance of symmetric encryption
Read: the client opens a TCP connection and sends a request to the server, which sends back a few hundred megabytes of data. This test is done mostly to assess the performance of symmetric decryption

Details regarding testing environment

Software version

Library	Commit
BoringSSL	`eb7c3008`
mbedTLS	`4ca9a457`

Compiler and environment settings

Name	Setting
Compiler	aarch64-linux-android-clang5.0 (as Google is deprecating `gcc`)
ABI	arm64-v8a
NDK version	16b
Android Native API level	27
Android Build type	Release

Testing platform

The hardware platform used for testing is a HiKey620 development board. It is powered by the Kirin 620 SoC (8 x ARM Cortex-A53) from HiSilicon. It is running Android 8 from AOSP (see build details in Appendix B here). Details about the board can be found here and here.

Details of the environment used:
```
Linux localhost 4.9.29-g23875fc #1 SMP PREEMPT Tue Jul 4 14:25:00 CEST 2017 aarch64
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
```

Preparation step

The following script is used to set up the platform for benchmarking. The most important step is to fix the CPU frequency so that it is not auto-regulated by things like EAS [11].

# Number of CPUs on the board
NUM_CPU=8
# CPU scaling governor.
GOVERNOR=userspace
# Requested CPU frequency
MAX_FREQ=1200000


adb root
adb remount
# Prevent system from suspending
adb shell "echo temporary > /sys/power/wake_lock"
# Probably useful only on qcom, but anyway...
adb shell stop thermal-engine
adb shell stop mpdecision

for ID in `seq 0 $((NUM_CPU-1))`
do
adb shell "echo 1 > /sys/devices/system/cpu/cpu${ID}/online"
adb shell "echo ${GOVERNOR} > /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_governor"
adb shell "echo ${MAX_FREQ} > /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_setspeed"
done

for ID in `seq 0 $((NUM_CPU-1))`
do
adb shell "cat /sys/devices/system/cpu/cpu${ID}/online"
adb shell "cat /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_governor"
adb shell "cat /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_cur_freq"
done
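
The second loop above only prints the values, so a mismatch is easy to overlook. A sketch of how the read-back could be turned into a pass/fail check; the function name all_at_freq is invented for this example, and the carriage-return stripping assumes that adb shell may emit CRLF line endings.

```shell
# all_at_freq KHZ: read one frequency per line on stdin and succeed only if
# at least one value was read and every value equals KHZ.
all_at_freq() {
    tr -d '\r' | awk -v want="$1" '
        NF == 0 { next }          # ignore empty lines
        { seen++ }
        $1 != want { bad = 1 }
        END { if (bad || !seen) exit 1 }
    '
}

# Usage with the variables from the script above:
#   for ID in $(seq 0 $((NUM_CPU-1))); do
#       adb shell "cat /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_cur_freq"
#   done | all_at_freq "$MAX_FREQ" || echo "CPU frequency is not pinned" >&2
```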

Binary size reduction

mbedTLS

mbedTLS makes it possible to select features of the TLS library at compile time. A configuration template is available in the config.h file and is managed by defining or disabling a number of preprocessor symbols (look for MBEDTLS_CONFIG_FILE for more details). This is an easy way for developers to include cryptographic and (optional) SSL/TLS capabilities in their products with a minimal code footprint. Indeed, it is an interesting feature for memory-constrained devices (for example microcontrollers).

mbedTLS compilation produces 3 separate libraries - crypto, ssl and x509. Compilation also outputs a number of test binaries.

As a first step I applied a set of obvious size optimizations provided by the compiler (-Os) and stripped all the symbols (they can be stored in a separate file if needed). I also applied the -ffunction-sections and -fdata-sections compiler options. These cause the compiler to place each function or data item into its own section. Then, thanks to -Wl,--gc-sections, the linker is able to keep only those sections which are actually used, which makes the resulting binary much smaller (one can add -Wl,--print-gc-sections in order to see the removed sections). This optimization may produce unexpected results, so I strongly advise looking at the documentation and getting familiar with its details.
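
As a concrete illustration, the flags named above could be kept in two variables and handed to the build. This is a sketch only: the variable names are invented for the example, and the use of CMAKE_C_FLAGS and CMAKE_EXE_LINKER_FLAGS in the usage comment assumes a CMake-based build, which the text does not state.

```shell
# Compiler flags: optimise for size and give every function and data item its own section.
SIZE_CFLAGS="-Os -ffunction-sections -fdata-sections"
# Linker flag: discard sections that nothing references.
SIZE_LDFLAGS="-Wl,--gc-sections"

# Usage (assumed CMake build, not taken from the post):
#   cmake -DCMAKE_C_FLAGS="$SIZE_CFLAGS" -DCMAKE_EXE_LINKER_FLAGS="$SIZE_LDFLAGS" ..
# Symbols can be removed from the final binary afterwards with the toolchain's strip tool.
```

Section garbage collection only takes effect at the final link, so the size gain shows up mainly in the test application rather than in the static libraries.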

In a second step I changed the config.h file and removed capabilities which are not needed by our client application, leaving only the following:

TLS client side code
TLS v1.2
AES-GCM used as a symmetric cipher
RSA, ECDSA and ECDH with curves P-256
SHA-256 and SHA-512
Code for key pre-sharing has been removed
Some additional features required by the client code

In a third step I removed support for RSA. On the one hand it isn’t actually necessary, as I control both sides of the connection, and on the other hand it’s interesting to see how much the binary size gets reduced.

Step	Optim	`libmbedx509.a`	`libmbedtls.a`	`libmbedcrypto.a`	Test app
0	Initial size (with -O2)	96K	260K	604K	464K
1	Removal of data and function sections, strip, -Os	68K	132K	380K	272K
2	Disabling not needed capabilities	52K	52K	236K	128K
3	Disabling RSA	40K	48K	208K	108K

The test client has been reduced more than 4 times (from 464K to 108K), to a very small size indeed. Further reductions are possible (see [8] for ideas); nevertheless, at this point I’m satisfied with the size and I don’t think it is possible to change it much. Also, removing RSA reduces the size of the test app by only 20 KB (from 128K to 108K), so I’ve decided to keep RSA and pay a little penalty.

It’s also worth noting that 48 KB for a TLSv1.2 implementation is a really small memory footprint. This is very interesting for small devices which implement most of the needed crypto in hardware.
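
The reduction factors follow directly from the "Test app" column of the table. A short worked calculation using only those sizes; shrink_factor is a hypothetical helper name.

```shell
# shrink_factor BEFORE AFTER: how many times smaller AFTER is than BEFORE
# (both in the same unit), rounded to one decimal place.
shrink_factor() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

shrink_factor 464 108   # step 0 -> step 3 (RSA disabled), prints 4.3
shrink_factor 464 128   # step 0 -> step 2 (RSA kept), prints 3.6
```

With RSA kept, the test application is therefore about 3.6 times smaller than the initial build; the factor above 4 applies to the RSA-free configuration.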

BoringSSL

A similar experiment has been done with BoringSSL. This library doesn’t offer as many configuration possibilities as mbedTLS; nevertheless it provides some.

In the first step I applied exactly the same compiler flags as for mbedTLS (-Os, symbol stripping, separate data and function sections).

In a second step I applied the OPENSSL_SMALL=1 configuration option. This selects algorithm implementations which are optimized for size rather than for speed (see [12] for more details).

In a third step I tried to remove the assembly implementations. For some algorithms this causes a huge performance degradation, as some optimizations are written in assembly and hardware acceleration needs “glue” code which is also written in assembly. Nevertheless, it is an interesting step when comparing against mbedTLS, as it doesn’t currently have any such optimizations (see [6]).

BoringSSL provides the concept of crypto buffers, which can be used instead of some functions from the memory-hungry X509 and ASN.1 implementation. This feature, together with separate data and function sections (done in the first step), greatly reduces binary size. We have used it in step 4. In step 5 we go a bit further - the need for X509 and ASN.1 can be completely removed, assuming the user provides a custom certificate verification function. My client doesn’t implement such a function, but it shouldn’t be very complicated to implement, and its code size won’t change the final binary size much. Hence it’s interesting to see the resulting size reduction.

In the last step I tried more aggressive changes to the code: commenting out (with preprocessor symbols) the RSA and DH implementations.

Step	Optim	`libcrypto.a`	`libssl.a`	Test App
0	Initial size	12.4M	7.6M	7.5M
1	Removal of data and function sections, strip, -Os	1244K	356K	796K
2	OPENSSL_SMALL=1	1200K	356K	756K
3	OPENSSL_NO_ASM	1184K	356K	736K
4	BoringSSL crypto buffers	1184K	356K	700K
5	Complete elimination of X.509 and ASN.1 code	1184K	356K	392K
6	Disabling RSA and DH	1144K	352K	356K

I’m positively surprised by the fact that it is possible to remove the X509 and ASN.1 code; it gives you a really small library. At the moment I don’t want to implement my own certificate verification function, and I want to perform certificate verification during performance benchmarking. But it’s worth noting that at a fairly small cost the BoringSSL test app can be almost halved (from 700K to 392K), to a binary that’s a bit more than 3 times bigger than the one produced with mbedTLS, which is quite interesting.

Removing ASM hits performance a lot, so I will keep it. Removing RSA and DH gives a binary that is only 36 KB smaller, but it introduces a very high maintenance cost - it will be hard and error-prone to reapply those changes to the code after updating the library to a newer version. As a side note, OpenSSL has a switch which removes RSA (OPENSSL_NO_RSA); FWIW it might be that this code could be ported to BoringSSL.

Finally, for my further analysis I’ll apply steps 1, 2 and 4 (and I again encourage applying step 5).

Notes on size reduction

Something that wasn’t tried is Link Time Optimization, which may produce a binary of reduced size (see [3], [4] and [5]). It might be interesting to see how the results differ when using this feature instead of, or together with, separate sections.
I’ve also calculated the size of the shared libraries for BoringSSL - libcrypto.so: 1072KB; libssl.so: 276KB
mbedTLS doesn’t implement hardware acceleration, so performance won’t be as good as for BoringSSL. I wonder if it would make sense to take the extremely small SSL implementation from mbedTLS and use the crypto from BoringSSL.

Performance comparison

Results from tools provided by the libraries

Both libraries provide tools for benchmarking. This subsection compares the results reported by those tools. I compare the default compilation against the binary I got after applying the tricks which reduce the size of the client application.

mbedTLS: default vs reduced

mbedTLS provides a tool for performance benchmarking called benchmark. The table below shows results for the most interesting algorithms (for results of all algorithms see Appendix C here).

Algo	Reduced	Default (-O2)
SHA-256	46809 KiB/s	52044 KiB/s
AES-GCM-128	16399 KiB/s	16398 KiB/s
AES-GCM-256	14287 KiB/s	14286 KiB/s
RSA-2048	652 public/s	653 public/s
RSA-2048	17 private/s	17 private/s
RSA-4096	168 public/s	168 public/s
RSA-4096	3 private/s	3 private/s
ECDSA-secp256r1	189 sign/s	195 sign/s
ECDHE-secp256r1	57 handshake/s	60 handshake/s
ECDH-secp256r1	77 handshake/s	81 handshake/s
ECDHE-Curve25519	41 handshake/s	41 handshake/s
ECDH-Curve25519	80 handshake/s	82 handshake/s

One thing to notice is that (for the algorithms above) there is not much difference between -Os and -O2, as -Os enables all -O2 optimizations that do not typically increase code size. It’s also worth noticing the performance difference between static and ephemeral ECDH. It seems quite weird, and the root cause should probably be studied further.

BoringSSL: default vs reduced

Performance results are provided by the bssl speed tool from BoringSSL. The table shows the most interesting algorithms (for results of all algorithms see Appendix C here).

Operation	Reduced	Default (-O2)
RSA 2048 signing	(59.5 ops/sec)	(108.1 ops/sec)
RSA 2048 verify	(2377.5 ops/sec)	(4078.3 ops/sec)
RSA 4096 signing	(8.3 ops/sec)	(14.9 ops/sec)
RSA 4096 verify	(668.0 ops/sec)	(1088.4 ops/sec)
AES-128-GCM (1350 bytes)	(17675.4 ops/sec): 23.9 MB/s	(291430.3 ops/sec): 393.4 MB/s
AES-256-GCM (1350 bytes)	(14792.6 ops/sec): 20.0 MB/s	(254718.5 ops/sec): 343.9 MB/s
ChaCha20-Poly1305 (1350 bytes)	(33108.8 ops/sec): 44.7 MB/s	(67622.8 ops/sec): 91.3 MB/s
SHA-256 (8192 bytes)	(6824.7 ops/sec): 55.9 MB/s	(63214.7 ops/sec): 517.9 MB/s
SHA-512 (8192 bytes)	(14014.7 ops/sec): 114.8 MB/s	(14759.6 ops/sec): 120.9 MB/s
RNG (8192 bytes)	(4058.7 ops/sec): 33.2 MB/s	(55705.4 ops/sec): 456.3 MB/s
ECDH P-256 operations	(594.7 ops/sec)	(642.8 ops/sec)
ECDSA P-256 signing	(1396.6 ops/sec)	(1738.5 ops/sec)
ECDSA P-256 verify	(672.1 ops/sec)	(704.2 ops/sec)
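
The throughput column in the table above follows directly from the operation rate and the block size. The sketch below reproduces that conversion and the default-to-reduced ratio for a few rows; it assumes 1 MB means 10^6 bytes, which is the convention that matches the reported figures, and the function names are made up for illustration.

```python
def throughput_mb_s(ops_per_sec, block_bytes):
    """Convert operations per second on fixed-size blocks to MB/s (1 MB = 10**6 bytes)."""
    return ops_per_sec * block_bytes / 1_000_000


def slowdown(default_ops, reduced_ops):
    """How many times faster the default build is than the reduced one."""
    return default_ops / reduced_ops


# (operation, block size in bytes, reduced ops/s, default ops/s), copied from the table
ROWS = [
    ("AES-128-GCM", 1350, 17675.4, 291430.3),
    ("SHA-256", 8192, 6824.7, 63214.7),
    ("SHA-512", 8192, 14014.7, 14759.6),
]

for name, size, reduced, default in ROWS:
    print(f"{name}: {throughput_mb_s(reduced, size):.1f} MB/s reduced, "
          f"{throughput_mb_s(default, size):.1f} MB/s default, "
          f"x{slowdown(default, reduced):.1f}")
```

Seen this way, the reduced build loses most on AES-GCM, SHA-256 and the RNG, while SHA-512 and the elliptic-curve operations are barely affected.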

Comparing `mbedTLS` and `BoringSSL` based client

Default compilation

These results should be close to the best performance we can expect on ARMv8 when using BoringSSL as a client.

Performance:

Test	mbedTLS	BoringSSL
Handshakes - RSA_2048 (x200)	0m21.69s	0m03.12s
Handshakes - ECDSA_256 (x200)	0m24.34s	0m01.38s
Write - ECDSA_256 (AES-GCM)	0m16.28s	0m03.94s
Read - ECDSA_256 (AES-GCM)	0m17.49s	0m03.92s

I found the following reasons for the difference in performance:

BoringSSL supports the ARMv8 crypto extensions implemented in hardware (AES, PMULL, SHA256), which mbedTLS doesn’t support yet [6]. BoringSSL also uses vector instructions (NEON) for some algorithms; NEON is available on both ARMv7 (optional) and ARMv8 (mandatory). The algorithms used in this test do not use NEON, but ChaCha20-Poly1305 does, which matters for ARMv7-based devices: they do not offer hardware-accelerated AES, so AES is much slower there. The ChaCha20-Poly1305 implementation is available only in BoringSSL. One more comment on hardware support: it is detected at runtime, and BoringSSL falls back to a software implementation (or to NEON and then software) if the CPU doesn’t support the required extension.
The BoringSSL client supports the X25519 curve. mbedTLS, on the other hand, doesn’t support this curve in TLS (it supports it only as a primitive [10]). In the test above mbedTLS used NIST P-384. Arithmetic on X25519 is much more efficient than on P-384. It’s obviously wrong to compare two different curves, so one of the tests below enforces the use of P-256.
It seems mbedTLS does more I/O - it sends more TCP packets than BoringSSL:

the exchanged TCP packets were generally bigger (for example ClientHello: 470B for mbedTLS and 213B for BoringSSL)
mbedTLS sends “Client Key Exchange” and “Change Cipher Spec” in separate TCP packets, which is not the case for BoringSSL

According to the mbedTLS forum, every TLS message is sent using the send BIO callback, and the default implementation sends each message separately. One could supply a custom send callback that concatenates consecutive messages and sends them as one TCP packet. This wasn’t done during this analysis.
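
The idea of a coalescing send callback can be sketched independently of any TLS library. The snippet below is only an illustration in Python, not mbedTLS code: `CoalescingSender` and its methods are hypothetical names, and the transport is a plain callable standing in for something like `socket.sendall`. A single transport call also does not strictly guarantee a single TCP packet; it only gives the network stack the chance to send one.

```python
class CoalescingSender:
    """Collects outgoing records and hands them to the transport in one call."""

    def __init__(self, send):
        self._send = send            # callable taking bytes, e.g. socket.sendall
        self._pending = bytearray()  # records waiting to be flushed

    def write(self, record):
        """Queue a record instead of sending it immediately."""
        self._pending += record
        return len(record)

    def flush(self):
        """Send everything queued so far with a single transport call."""
        if self._pending:
            self._send(bytes(self._pending))
            self._pending.clear()


# Usage: two handshake messages leave the application as one write.
sent = []
sender = CoalescingSender(sent.append)
sender.write(b"ClientKeyExchange")
sender.write(b"ChangeCipherSpec")
sender.flush()
print(len(sent), sent[0])
```

The trade-off is that the caller has to decide when to flush, typically at the end of a handshake flight.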

The following two tests build the libraries and TLS clients with different profiles, to eliminate as many of the differences described above as possible.

Software implementation only

For this test I built a BoringSSL client that uses only software crypto implementations and no hardware acceleration. These results should help to understand how BoringSSL behaves on CPUs which don’t provide such features.

Test	mbedTLS	BoringSSL
Handshakes - RSA_2048 (x200)	0m20.89s	0m04.89s
Handshakes - ECDSA_256 (x200)	0m23.80s	0m01.66s
Write - ECDSA_256 (AES-GCM)	0m16.41s	0m13.79s
Read - ECDSA_256 (AES-GCM)	0m17.51s	0m13.72s

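OK, so it is mostly symmetric encryption that is affected: the AES-GCM read and write times more than triple for BoringSSL, while the handshake times change far less.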

Enforcing usage of NIST P-256 curve for ECDHE

This test enforces the use of the NIST P-256 curve. This mostly affects handshake time and eliminates some of the differences seen in the first performance test.

Test	mbedTLS	BoringSSL
Handshakes - RSA_2048 (x200)	0m16.88s	0m03.56s
Handshakes - ECDSA_256 (x200)	0m20.26s	0m01.89s

Other things

BoringSSL seems to be the better choice, so let’s see what else it offers.

Using EdDSA with X25519 for ECDHE

Along the way, I found out that BoringSSL can use Ed25519 with TLSv1.2. The results below show the differences between performing 500 handshakes with Ed25519, ECDSA/P-256 and RSA/2048. The CA certificate is still RSA/2048 (the same as was used for the other tests).

Performance:

Handshake x500	TLS handshake time	Public key size	Signature size	Degradation vs Ed25519
Handshakes - Ed25519	0m02.72s	256 bits	512 bits	-
Handshakes - ECDSA	0m03.47s	256 bits	512 bits	27.6%
Handshakes - RSA	0m07.83s	2048 bits	2048 bits	187.9%
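
The degradation column can be recomputed from the handshake times. This is a small sketch that assumes "degradation" means the extra time relative to the Ed25519 run, expressed as a percentage of the Ed25519 time; the function name is made up for illustration.

```python
def degradation_percent(baseline_s, measured_s):
    """Extra time relative to the baseline, as a percentage of the baseline."""
    return (measured_s - baseline_s) / baseline_s * 100.0


# Times for 500 handshakes, in seconds, copied from the table above
timings = {"Ed25519": 2.72, "ECDSA": 3.47, "RSA": 7.83}
baseline = timings["Ed25519"]

for name, seconds in timings.items():
    print(f"{name}: {degradation_percent(baseline, seconds):.1f}%")
```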

It’s worth noting that Ed25519 and ECDSA/P-256 offer the same security level, while RSA/2048 is a bit weaker. Nevertheless, Ed25519 certificates are not yet very popular.

TLS 1.3 & 0-RTT

Only BoringSSL supports TLS 1.3; at the moment it implements the latest draft of the standard (28). The gains from using TLS 1.3 (and 0-RTT) are well described in [13].

Out of scope / left for further analysis:

A few points that were not checked:

32-bit (armeabi-v7a) code may be smaller and still run on ARM64. Thumb mode (variable-length instruction set) will produce even more compact code. Thumb mode is the default setting in the NDK.
I haven’t checked power consumption, which is important for mobile applications. It’s not complicated but requires specific hardware (see [18] and [19]). I assumed that code which executes in less time consumes less power, but this assumption should be verified, as it’s probably not always true.
Implementing hardware acceleration in mbedTLS is an obvious improvement which should be considered. See here for more details. It is also a highly time-consuming task.
mbedTLS supports so-called “alternative” implementations. One idea would be to swap the existing ECC implementation for either a smaller or a faster one (for a smaller implementation I would recommend uECC, which can be as small as 4KB). Another option could be to use the small SSL implementation from mbedTLS together with a fast crypto implementation from BoringSSL or NaCl [17].
mbedTLS has a configuration option called MBEDTLS_SSL_MAX_CONTENT_LEN which determines the size of the internal I/O buffer. Playing with this value may help improve performance or reduce size.
Performance of ChaCha20-Poly1305 on ARMv7

Conclusion

My preference goes to BoringSSL for the following reasons:

It offers much better performance on ARM
It offers more features like TLSv1.3 and Curve25519
It compiles to a binary of reasonable size. The smallest possible resulting library is 3 times bigger than the one based on mbedTLS, but the overall result is just 350KB. The difference between the smallest possible mbedTLS-based client and the BoringSSL one is just 248KB. Let’s say the library is linked into each and every application on the phone. Assuming the user has 100 apps on the phone, the difference in size is about 24MB, which nowadays is negligible. Also, according to a report by Statista [14], users have on average 27 apps installed on the phone (which is less an argument and more an interesting piece of information).
BoringSSL is the default TLS library on Android and is a Google product. It means there is a lot of interest in making it even more secure and fast.
Recently BoringSSL received formally verified implementations of Curve25519 and P-256 (see [15])

It seems both libraries have very different design goals. mbedTLS is made for resource-constrained embedded systems, which face challenges in terms of memory availability. Embedded platforms often have no more than 256KB of RAM, often lack a memory management unit and cannot support virtual memory; as a result, dynamic allocation is avoided. I believe for such systems mbedTLS is unbeatable and a great choice.

BoringSSL doesn’t seem to have a similar design goal. It seems to be designed for devices which offer more RAM and storage space and in general have a much different profile than resource-constrained embedded systems. Mobile devices offer all of that, and it would be a huge mistake not to make use of it.

When thinking about software design, there is a great difference between aiming for “reasonably small” and “smallest possible binary size” - those are basically two different goals.

Finally

I would like to thank Ron E. from the mbedTLS team for answering all my questions.

UPDATE: Recently one of my co-workers implemented a performance improvement for ARM64. It is a small change which gives a good speedup - see https://github.com/ARMmbed/mbedtls/pull/1964 for more details.

Footnotes

[0] Android NDK: reducing binary sizes: https://blog.algolia.com/android-ndk-how-to-reduce-libs-size/
[1] to check: https://stackoverflow.com/questions/6771905/how-to-decrease-the-size-of-generated-binaries
[2] C/C++ reducing size http://ptspts.blogspot.co.uk/2013/12/how-to-make-smaller-c-and-c-binaries.html
[3] “Link time optimization” in https://www.iecc.com/linker/linker11.html
[4] LTO GCC: https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html
[5] LTO LLVM: https://llvm.org/docs/LinkTimeOptimization.html
[6] https://github.com/ARMmbed/mbedtls/pull/1424
[7] “Link time garbage collection” in https://www.iecc.com/linker/linker11.html
[8] https://github.com/android-ndk/ndk/issues/436
[9] https://tls.mbed.org/kb/how-to/reduce-mbedtls-memory-and-storage-footprint
[10] https://github.com/ARMmbed/mbedtls/issues/941
[11] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/EAS
[12] https://boringssl.googlesource.com/boringssl/+/HEAD/BUILDING.md
[13] https://blog.cloudflare.com/introducing-0-rtt/
[14] https://www.apptentive.com/blog/2017/06/22/how-many-mobile-apps-are-actually-used/
[15] https://boringssl.googlesource.com/boringssl/+/HEAD/third_party/fiat/
[16] https://tls.mbed.org/discussions/feature-request/any-plans-for-tls-1-3-support
[17] https://eprint.iacr.org/2018/354/20180418:202819
[18] https://source.android.com/devices/tech/power/component
[19] https://developer.arm.com/products/software-development-tools/ds-5-development-studio/streamline/arm-energy-probe

How to run GDB on Android

Kris Kwiatkowski — Sun, 15 Apr 2018 19:51:13 GMT

The Android NDK comes with GDB: somewhere in the NDK folder one can find the gdbserver and gdb binaries. The idea is to run gdbserver on the device and then connect to it from the local host with gdb. For that to work, both server and client need access to the binary being debugged (the server to run it, the client to read its debugging symbols).

Let’s say I want to debug a binary called main. The first step is to set some variables:

# Change line below to wherever you keep NDK
NDK_DIR=/opt/android-ndk

HOST_GDBSERVER=${NDK_DIR}/prebuilt/android-arm64/gdbserver/gdbserver
HOST_GDB=${NDK_DIR}/prebuilt/linux-x86_64/bin/gdb

HOST_APP=/tmp/main
TARGET_APP=/data/app/main
TARGET_GDBSERVER=/data/app/gdbserver
PORT=5039

Then in one terminal I would start gdbserver

adb forward tcp:${PORT} tcp:${PORT}
adb push ${HOST_GDBSERVER} ${TARGET_GDBSERVER}
adb shell ${TARGET_GDBSERVER} :${PORT} ${TARGET_APP}

And gdb in another terminal:

${HOST_GDB} ${HOST_APP}

While in gdb, you can connect to the gdbserver:

target remote :5039

That’s it, easy-peasy. Happy debugging!

Creating certificates for TLS testing

Kris Kwiatkowski — Sun, 15 Apr 2018 00:00:00 GMT

In some cases you need to create your own chain of certificates - CA and server (for example for TLS testing). There are many descriptions out there of how to do it; nevertheless, I couldn’t find any copy-paste examples which would give me RSA, ECDSA and EdDSA certificates. Hence, below are some instructions on how to use openssl to quickly create your own certs, which can then be used during TLS verification.

This post doesn’t explain the meaning of the configuration used. If such an explanation is needed, I would suggest reading “Network Security with OpenSSL: Cryptography for Secure Communications” by J. Viega, or looking for the required information at this blog.

Configuration file

OpenSSL uses a configuration file to store information required during certificate creation. The configuration file contains things like organization name, address, location, internet address, default hash algorithm used to produce signatures, etc.

Both my example CA and the organization for which the server certificate will be created are called “Cert Testing Organization”, with the address www.cert_testing.com.

Below is the configuration file used in this example. Copy & paste it into a file named openssl.cnf:

[ ca ]
# `man ca`
default_ca = CA_default

[ CA_default ]
# Directory and file locations.
dir               = .
certs             = $dir/certs
crl_dir           = $dir/crl
new_certs_dir     = $dir/newcerts
database          = $dir/index.txt
serial            = $dir/serial
RANDFILE          = $dir/private/.rand

# The root key and root certificate.
private_key       = $dir/root.key
certificate       = $dir/root.pem

# For certificate revocation lists.
crlnumber         = $dir/crlnumber
crl               = $dir/crl/intermediate.crl.pem
crl_extensions    = crl_ext
default_crl_days  = 30

# SHA-1 is deprecated, so use SHA-2 instead.
default_md        = sha256

name_opt          = ca_default
cert_opt          = ca_default
default_days      = 9999
preserve          = no
policy            = policy_loose

[ policy_strict ]
# The root CA should only sign intermediate certificates that match.
# See the POLICY FORMAT section of `man ca`.
countryName             = match
stateOrProvinceName     = match
organizationName        = match
organizationalUnitName  = optional
commonName              = supplied
emailAddress            = optional

[ policy_loose ]
# Allow the intermediate CA to sign a more diverse range of certificates.
# See the POLICY FORMAT section of the `ca` man page.
countryName             = optional
stateOrProvinceName     = optional
localityName            = optional
organizationName        = optional
organizationalUnitName  = optional
commonName              = supplied
emailAddress            = optional

[ req ]
# Options for the `req` tool (`man req`).
default_bits        = 4096
distinguished_name  = req_distinguished_name
string_mask         = utf8only

[ req_distinguished_name ]
countryName                     = Country Name (2 letter code)
stateOrProvinceName             = State or Province Name (full name)
localityName                    = Locality Name (eg, city)
organizationalUnitName          = Organizational Unit Name (eg, section)
commonName                      = Common Name

stateOrProvinceName_default     = PACA
countryName_default             = FR
localityName_default            = Cagnes sur Mer
organizationalUnitName_default  = Cert Testing Organization
commonName_default              = Cert Testing Organization
commonName_max                  = 64

[ v3_ca ]
# Extensions for a typical CA (`man x509v3_config`).
subjectKeyIdentifier        = hash
authorityKeyIdentifier      = keyid:always,issuer
basicConstraints            = critical, CA:true
keyUsage                    = critical, digitalSignature, cRLSign, keyCertSign

[ v3_intermediate_ca ]
# Extensions for a typical intermediate CA (`man x509v3_config`).
subjectKeyIdentifier        = hash
authorityKeyIdentifier      = keyid:always,issuer
basicConstraints            = critical, CA:true, pathlen:0
keyUsage                    = critical, digitalSignature, cRLSign, keyCertSign

[ usr_cert ]
# Extensions for client certificates (`man x509v3_config`).
basicConstraints        = CA:FALSE
nsCertType              = client, email
nsComment               = 'Cert Testing Intermediate - Client'
subjectKeyIdentifier    = hash
authorityKeyIdentifier  = keyid,issuer
keyUsage                = critical, nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage        = clientAuth, emailProtection

[ server_cert ]
# Extensions for server certificates (`man x509v3_config`).
basicConstraints        = CA:FALSE
nsCertType              = server
nsComment               = 'Cert Testing Intermediate - Server'
subjectKeyIdentifier    = hash
authorityKeyIdentifier  = keyid,issuer:always
keyUsage                = critical, digitalSignature, keyEncipherment
extendedKeyUsage        = serverAuth
subjectAltName          = @alt_names

[ client_cert ]
# Extensions for client certificates (`man x509v3_config`).
basicConstraints        = CA:FALSE
nsCertType              = client, email
nsComment               = 'Cert Testing EE - Client'
subjectKeyIdentifier    = hash
authorityKeyIdentifier  = keyid,issuer
keyUsage                = critical, nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage        = clientAuth, emailProtection

[ crl_ext ]
# Extension for CRLs (`man x509v3_config`).
authorityKeyIdentifier  = keyid:always

[ ocsp ]
# Extension for OCSP signing certificates (`man ocsp`).
basicConstraints        = CA:FALSE
subjectKeyIdentifier    = hash
authorityKeyIdentifier  = keyid,issuer
keyUsage                = critical, digitalSignature
extendedKeyUsage        = critical, OCSPSigning

[alt_names]
DNS.1   = *.cert_testing.com
IP.1    = 127.0.0.1

Preparation

We will need some directories where output of cert generation will be stored:

mkdir -p private
mkdir -p certs
mkdir -p csr

CA cert creation

CA private key

The first step is to create the private key for the CA cert. The root cert will use an RSA key pair with a key length of 4096 bits.

OpenSSL will ask for a password - provide test123.

    openssl genrsa -aes256 -out private/ca.key 4096

or in case of ECDSA certificates:

    openssl ecparam -name prime256v1 -genkey -noout -out private/ca.key
    openssl ec -in private/ca.key -out private/ca.key -aes256

Here the second line (encrypting ca.key) is needed only to keep the rest of the article copy-paste'able.

Create CA cert

This command uses the key created above to create a self-signed CA certificate. The certificate will be valid for 9999 days.

Provide the password test123 and hit enter on everything else. openssl will use the values defined in openssl.cnf.

     openssl req -config openssl.cnf \
        -extensions v3_ca -new -x509 -days 9999 \
        -key private/ca.key \
        -out certs/ca.cert

One interesting option to notice is ``-extensions v3_ca`` - it is a reference to the section with the same name in ``openssl.cnf``. This section tells ``openssl`` that the created certificate must be a CA cert (``CA:true``).

Server cert creation

In this example, certificate signing is done in 3 steps.

Create server certificate private key
Create certificate signing request
Sign the request with CA private key

So let’s do it.

Server’s private key (I skip the creation of intermediate certs for brevity).

* RSA/2048 with e=3, for fast verification

        openssl genpkey -algorithm RSA \
            -pkeyopt rsa_keygen_bits:2048 \
            -pkeyopt rsa_keygen_pubexp:3 \
            -out private/rsa_2048.key

* ECDSA/P-256

        openssl genpkey -algorithm EC \
            -pkeyopt ec_paramgen_curve:P-256 \
            -pkeyopt ec_param_enc:named_curve \
            -out private/ecdsa_p256.key

* EdDSA/25519 (supported by newer version of ``openssl`` and in TLS 1.3 only)

        openssl genpkey -algorithm Ed25519 \
            -out private/ed25519.key

Create certificate signing request - intermediary step

     openssl req -config openssl.cnf -new \
       -sha256 \
       -passin pass:test123 \
       -key private/rsa_2048.key \
       -out csr/rsa_2048.csr \
       -days 9999

ECDSA

     openssl req -config openssl.cnf -new \
       -sha256 \
       -passin pass:test123 \
       -key private/ecdsa_p256.key  \
       -out csr/ecdsa_p256.csr \
       -days 9999

EdDSA

     openssl req -config openssl.cnf -new \
       -passin pass:test123 \
       -key private/ed25519.key  \
       -out csr/ed25519.csr \
       -days 9999

Create server cert

Finally we can create set of server certificates.

     openssl x509 \
       -extfile openssl.cnf \
       -extensions server_cert -sha256 -req  \
       -CA certs/ca.cert -CAkey private/ca.key -CAcreateserial \
       -passin pass:test123 \
       -in csr/rsa_2048.csr \
       -out certs/rsa_2048.cert \
       -days 9999

ECDSA

     openssl x509 \
       -extfile openssl.cnf \
       -extensions server_cert -sha256 -req  \
       -CA certs/ca.cert -CAkey private/ca.key -CAcreateserial \
       -passin pass:test123 \
       -in csr/ecdsa_p256.csr \
       -out certs/ecdsa_256.cert \
       -days 9999

EdDSA

     openssl x509 \
       -extfile openssl.cnf \
       -extensions server_cert -req  \
       -CA certs/ca.cert -CAkey private/ca.key -CAcreateserial \
       -passin pass:test123 \
       -in csr/ed25519.csr \
       -out certs/ed25519.cert \
       -days 9999
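
The CSR and signing commands above differ only in the key name and in whether -sha256 is passed. As a rough sketch of that pattern, the helper below builds the command lines instead of repeating them by hand. The function names are made up for illustration, it only prints the commands rather than running them, and it names every certificate after its key (so the ECDSA cert would be certs/ecdsa_p256.cert rather than certs/ecdsa_256.cert as above).

```python
import shlex

# Key name -> whether the commands above pass -sha256 for it.
KEYS = {"rsa_2048": True, "ecdsa_p256": True, "ed25519": False}


def csr_command(name, use_sha256):
    """Build the `openssl req` call that creates csr/<name>.csr."""
    cmd = ["openssl", "req", "-config", "openssl.cnf", "-new"]
    if use_sha256:
        cmd.append("-sha256")
    return cmd + ["-key", f"private/{name}.key", "-out", f"csr/{name}.csr"]


def sign_command(name, use_sha256):
    """Build the `openssl x509` call that signs the CSR with the CA key."""
    cmd = ["openssl", "x509", "-req", "-extfile", "openssl.cnf",
           "-extensions", "server_cert"]
    if use_sha256:
        cmd.append("-sha256")
    return cmd + ["-CA", "certs/ca.cert", "-CAkey", "private/ca.key",
                  "-CAcreateserial", "-passin", "pass:test123",
                  "-in", f"csr/{name}.csr", "-out", f"certs/{name}.cert",
                  "-days", "9999"]


for key_name, sha256 in KEYS.items():
    print(shlex.join(csr_command(key_name, sha256)))
    print(shlex.join(sign_command(key_name, sha256)))
```

Printing the commands first makes it easy to review them before pasting into a shell.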

It is currently believed that the ECDSA/P-256 and Ed25519 keys created above provide attack resistance comparable to a 128-bit symmetric cipher, while RSA/2048 is usually rated somewhat lower (about 112 bits). It’s also worth noting that the sizes of those keys differ considerably.

Client cert creation

The commands below create a client private key and certificate that can be used for mutual TLS (client authentication). The procedure is similar to creating a server certificate, so I’ll do it only for ECDSA.

Client’s private key

openssl genpkey -algorithm EC \
          -pkeyopt ec_paramgen_curve:P-256 \
          -pkeyopt ec_param_enc:named_curve \
          -out private/cli_ecdsa_p256.key

Create certificate signing request - intermediary step

openssl req -config openssl.cnf -new \
          -sha256 \
          -passin pass:test123 \
          -key private/cli_ecdsa_p256.key  \
          -out csr/cli_ecdsa_p256.csr \
          -subj "/O=Cert Testing ORG/CN=Client Cert"

Create client cert

openssl x509 \
          -extfile openssl.cnf \
          -extensions client_cert \
          -req  \
          -CA certs/ca.cert \
          -CAkey private/ca.key \
          -CAcreateserial \
          -in csr/cli_ecdsa_p256.csr \
          -passin pass:test123 \
          -out certs/cli_ecdsa_p256.cert \
          -days 9999

Verification

To verify the server certificate against the CA, the following command can be used:

> openssl verify -CAfile certs/ca.cert certs/ecdsa_256.cert
certs/ecdsa_256.cert: OK

That’s it, I hope it helps, but most of all I hope I won’t have to look for this stuff ever again.

Also

Thank you to @mattcaswell from OpenSSL team, for helping to figure out how to create EdDSA certs.

Looking for C++ object in a memory dump

Kris Kwiatkowski — Mon, 15 Jan 2018 00:00:00 GMT

When analyzing the core dump of a C++ based, long-running server application, it may be helpful to know the exact state of some objects created by the process. The question, then, is how to find such an object. Core files consist of the recorded state of the working memory, so the task may not be trivial if the process uses lots of memory.

I’ll assume that the C++ object has some virtual method. In that case, the object must contain a virtual pointer to the V-table of its class. By using the nm tool, it is easy to determine the address of a V-table. I can use that address to determine the exact locations in a core dump of all the objects of that class, as all those objects will contain an address into the V-table.

To demonstrate how the procedure works, let’s use the code below as the “server application”. We are looking for the object t.

#include <cstdlib>  // abort()

class test {
public:
  virtual ~test(){}
};
int main() {
  test t;
  abort();  // raises SIGABRT so that a core file can be produced
  return 0;
}

As mentioned, the nm command shows an address of test class V-table.

$ nm -C myapp | grep "vtable for test"
0000000000400980 V vtable for test

OK, so the V-table address is 0x400980. Now I need to find a V-pointer in the compiled binary. The value of a V-pointer is the address of the V-table + 16 (on a 64-bit system). To understand where this +16 comes from, we need to understand what the layout of the V-table looks like.

V-Table

The graph above shows 5 segments. Typically, on a 64-bit system, each of those segments is 8 bytes long (4 on a 32-bit system). The V-table starts with an empty segment, storing the value 0x00. The following segment contains the address of the typeinfo object of the class (used by the typeid function). The next segment is the address of the first virtual function declared in the class - in our case, the destructor of the test class. The V-pointer points at this segment, which is why the value of the V-pointer is the address of the V-table + 16 (V-table + 2 segments).

The next step in my investigation is to find the V-pointer value in the program binary. The address of the V-table is 0x400980, so the value to look for is 0x00400980 + 0x10 = 0x00400990.

> hexdump -C myapp | grep "90 09 40 00"
00001860  48 c7 00 90 09 40 00 5d  c3 90 90 90 90 90 90 90  |H....@.]........|
00002860  48 c7 00 90 09 40 00 5d  c3 90 90 90 90 90 90 90  |H....@.]........|
0005e410  90 09 40 00 00 00 00 00  00 00 00 00 00 00 00 00  |..@.............|
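
The byte pattern passed to grep above is simply the V-pointer value in little-endian byte order. A minimal sketch of that conversion follows, assuming a little-endian 64-bit target as in this example; the function names are made up for illustration.

```python
import struct


def vptr_value(vtable_addr, word_size=8):
    """Skip the first two V-table segments (empty slot and typeinfo pointer)."""
    return vtable_addr + 2 * word_size


def hexdump_pattern(value, nbytes=4):
    """Little-endian bytes of value, formatted the way `hexdump -C` prints them."""
    raw = struct.pack("<Q", value)[:nbytes]
    return " ".join(f"{b:02x}" for b in raw)


vptr = vptr_value(0x400980)
print(hex(vptr), "->", hexdump_pattern(vptr))
```

Note that `hexdump -C` prints an extra space between the 8th and 9th byte of each row, so a pattern that happens to straddle the middle of a row will not be matched by a plain grep.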

We have 3 possible places where an object may be located. I’ll use the third one for the further description. I know that it is the one I’m looking for, but normally at this point one needs to determine which object is the interesting one by examining all of them. The offset of this object is 0x005e410.

Now we need to find out the address of this object in the core file. This takes a small calculation, because:

object address in a core file = offset to the V-pointer from program binary + VMA address - VMA offset

The VMA address and VMA offset can be obtained with the objdump or readelf commands.

> objdump -h corefile
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
...
36 load26        00001000  00007f7fbc5c5000  0000000000000000  0003e000  2**12
                 CONTENTS, ALLOC, LOAD
37 load27        00022000  00007fff4f143000  0000000000000000  0003f000  2**12
                 CONTENTS, ALLOC, LOAD
38 load28        00001000  00007fff4f1f9000  0000000000000000  00061000  2**12
                 CONTENTS, ALLOC, LOAD, READONLY, CODE

Section 37, named load27, is the interesting one. That’s because the V-pointer offset value from the program binary is 0x005e410. This value is between 0x3f000 (“File off” column for section 37) and 0x61000 (“File off” column for section 38). The VMA address for this section is 00007fff4f143000 and the VMA offset is 0003f000. According to the formula above, the address of the object will be:

0x5e410 + 0x7fff4f143000 - 0x3f000 = 0x7FFF4F162410
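
The section lookup and the formula can be combined into a few lines. The sketch below hard-codes the three sections from the objdump output above; the function name is made up for illustration.

```python
# (name, size, VMA, file offset) taken from the `objdump -h corefile` output above
SECTIONS = [
    ("load26", 0x1000, 0x7F7FBC5C5000, 0x3E000),
    ("load27", 0x22000, 0x7FFF4F143000, 0x3F000),
    ("load28", 0x1000, 0x7FFF4F1F9000, 0x61000),
]


def file_offset_to_address(offset, sections):
    """Find the section containing a file offset and map it to a virtual address."""
    for name, size, vma, file_off in sections:
        if file_off <= offset < file_off + size:
            # address = offset + VMA address - VMA offset
            return name, offset + vma - file_off
    raise ValueError(f"offset {offset:#x} is not inside any listed section")


section, address = file_offset_to_address(0x5E410, SECTIONS)
print(section, hex(address))
```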

Let’s now check with GDB if the described procedure is correct:

> gdb myapp core
(gdb) p &t
$1 = (test *) 0x7fff4f162410

As we can see, the address of the t variable matches the result of the calculation, so the procedure is correct.

Compile C code in Android NDK

Thu, 03 Mar 2016 00:00:00 GMT

I have limited love for the Android tools provided by Google and never understood why Google makes it so complicated to run native code on the device. In the end Android is a form of Linux, and some parts of the Android framework are implemented in C/C++. I also have limited love for (and knowledge of) Java and don’t really like to use it.

Anyway, below I present 2 methods of compiling C programs with the Android NDK.

Let’s use the standard “hello world” as the application that we want to run on an Android dev board (main.c):

#include <stdio.h>

int main()
{
  printf("Hello World\n");
  return 0;
}

Method 1: Using `ndk-build`

This method follows the Android way of doing things:

Create required directories

mkdir -p hello_world/jni
mkdir -p hello_world/libs

In the jni directory create

Android.mk

    LOCAL_PATH := $(call my-dir)
    include $(CLEAR_VARS)
    # give module name
    LOCAL_MODULE := hello_world
    # list your C files to compile
    LOCAL_SRC_FILES := main.c
    include $(BUILD_EXECUTABLE)

Copy/create main.c to jni directory
Go to jni directory, call ndk-build. Compilation result should be in hello_world/libs/armeabi/hello_world

Method 2: Makefile

The second (and my preferred) way gives you better control over the files being compiled, compiler settings, etc. There is also none of the “magic” that ndk-build provides.

The following Makefile uses clang from NDK 16b to compile a file for Android with API version 27 and for an ARMv8 CPU. The Makefile can be used as a template.

# Change this to wherever you keep NDK
NDK            = /opt/android-ndk
SRCDIR         = .
OBJDIR         = .
DBG           ?= 0

# Debug/Release configuration
ifeq ($(DBG),1)
MODE_FLAGS     = -DDEBUG -g -O0
else
MODE_FLAGS     = -Os -fdata-sections -ffunction-sections
endif

## NDK configuration (clang)

# NDK Version
NDK_TARGETVER  = 27

# Target arch - here aarch64 for android
NDK_TARGETARCH = aarch64-linux-android

# Target CPU (ARMv8)
NDK_TARGETSHORTARCH = arm64

# Toolchain version
NDK_TOOLVER  = 4.9

# Architecture of a machine that does cross compilation
NDK_HOSTARCH = linux-x86_64

# Toolchain, library and include paths
NDK_TOOLS    = $(NDK)/toolchains/llvm/prebuilt/$(NDK_HOSTARCH)/bin
NDK_TOOL     = $(NDK_TOOLS)/clang
NDK_LIBS     = $(NDK)/toolchains/$(NDK_TARGETARCH)-$(NDK_TOOLVER)/prebuilt/linux-x86_64/lib/gcc/$(NDK_TARGETARCH)/4.9.x
NDK_INCLUDES = -I$(NDK)/sysroot/usr/include \
               -I$(NDK)/sysroot/usr/include/$(NDK_TARGETARCH)
# Platform sysroot used at link time
NDK_SYSROOT  = $(NDK)/platforms/android-$(NDK_TARGETVER)/arch-$(NDK_TARGETSHORTARCH)

# Options common to compiler and linker
OPT          = $(MODE_FLAGS) \
               -std=c99 \
               -fPIE \
               -Wall \
               -target $(NDK_TARGETARCH)

# Compiler options
CFLAGS       = $(OPT) \
               $(NDK_INCLUDES)

# Linker options
LDFLAGS      = $(OPT) \
               -pie \
               --sysroot=$(NDK_SYSROOT) \
               -B $(NDK)/toolchains/$(NDK_TARGETARCH)-$(NDK_TOOLVER)/prebuilt/linux-x86_64/$(NDK_TARGETARCH)/bin \
               -L$(NDK_LIBS)

all:
	$(NDK_TOOL) -c $(SRCDIR)/main.c -o $(OBJDIR)/main.o $(CFLAGS)
	$(NDK_TOOL) -o main $(OBJDIR)/main.o $(LDFLAGS)

adb-prepare:
	adb root
	adb remount

push: adb-prepare
	adb push main /data/app/

run: adb-prepare push
	adb shell /data/app/main

Copy this file to the same directory as main.c and try:

make all
make run

This should compile the file, push it to target and run (if target is connected).

hdc@cryptoden 23:49 > ~/example 
> make run   
adb root
adb remount
remount succeeded
adb push main /data/app/
main: 1 file pushed. 0.7 MB/s (6000 bytes in 0.008s)
adb shell /data/app/main
Hello World

Encrypting RaspberryPI root partition

Wed, 20 May 2015 00:00:00 GMT

This post describes how to encrypt the root partition of an already installed ArchLinux running on a Raspberry Pi. I assume that ArchLinux is already installed on the SD card and the Pi boots correctly.

Tested on:

Kernel 4.1.6 (it may not work with much older kernels)
Raspberry Pi model B revision 2

Creating initrd

It is best to start with the steps that need to be done on the Raspberry Pi. We need to install mkinitcpio and create an initramfs image.

pacman -S mkinitcpio
cp /etc/mkinitcpio.conf ~/mkinitcpio.ripi.conf
vi ~/mkinitcpio.ripi.conf

Make sure that the HOOKS and MODULES variables in the configuration file are set as below:

MODULES="dm_mod hid usbhid usbcore"
HOOKS="base udev autodetect modconf block filesystems keyboard encrypt fsck"

In MODULES the most important entry is dm_mod, and in HOOKS it is encrypt. The order of the entries in HOOKS also matters. Once done, generate the new initramfs image.

mkinitcpio -k `uname -r` -c ~/mkinitcpio.ripi.conf -g /boot/initrd-crypt

Creating encrypted volume

This must be done on a PC. Insert the SD card, mount the root partition and copy its content to some temporary location. Don’t forget the trailing / after the source directory: without it rsync copies the directory itself rather than its content.

mount /dev/mmcblk0p2 /media
mkdir /temporary_location
rsync --progress -axv /media/ /temporary_location/

The next step is to create the encrypted volume, format it and copy the root partition content back:

cryptsetup luksFormat /dev/mmcblk0p2
cryptsetup luksOpen /dev/mmcblk0p2 root-raspberry
mkfs.ext4 /dev/mapper/root-raspberry
mount /dev/mapper/root-raspberry /mnt
rsync --progress -axv /temporary_location/ /mnt

Modification in /etc/fstab, /mnt/boot/config.txt and /mnt/boot/cmdline.txt file

Watch out here - many sources on internet says that you need to specify and address on which initram is loaded (something like initramfs initrd-crypt 0x0a000000, in config.txt). This doesn’t work with kernel 4.1. It’s enough to specify name of the init-ram file in config.txt and cmdline.txt

/mnt/etc/fstab: Change the device that is mounted on /. The file must have the following entry (remove the entry that starts with /dev/mmcblk0p2):
```
/dev/mapper/root / ext4 defaults,discard,commit=120 0 1
```
/mnt/boot/config.txt: Set initramfs. This file needs to have the following line:
```
initramfs initrd-crypt
```

/mnt/boot/cmdline.txt: Add the following kernel command line arguments:

cryptdevice=/dev/mmcblk0p2:root:allow-discards root=/dev/mapper/root rootwait rootfstype=ext4 initrd=initrd-crypt
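The mapping name (root in this post) has to agree between the cryptdevice= argument and the root= argument, and a mismatch is easy to introduce when editing by hand. Below is a minimal sketch of a consistency check. The helper name check_cmdline is hypothetical, and it assumes the cryptdevice=device:name:options layout shown above.

```shell
# check_cmdline FILE
# Hypothetical helper: verifies that the mapping name given in cryptdevice=
# is the one that root= points at.
check_cmdline() {
    # cmdline.txt is a single line; split it into one argument per line.
    args=$(tr -s ' \t' '\n\n' < "$1")
    cryptdevice=$(printf '%s\n' "$args" | sed -n 's/^cryptdevice=//p' | head -n 1)
    root=$(printf '%s\n' "$args" | sed -n 's/^root=//p' | head -n 1)
    [ -n "$cryptdevice" ] || { echo "cryptdevice= is missing" >&2; return 1; }

    # cryptdevice=<device>:<mapping name>[:<options>]
    name=$(printf '%s\n' "$cryptdevice" | cut -d: -f2)
    if [ "$root" != "/dev/mapper/$name" ]; then
        echo "root=$root does not match /dev/mapper/$name" >&2
        return 1
    fi
}
```

Running check_cmdline /mnt/boot/cmdline.txt before unmounting gives a quick answer. The same name must also appear in the /dev/mapper/ entry of fstab.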

Unmount and close crypto device:

sync
umount /mnt
cryptsetup luksClose root-raspberry

Now you can put the SD card back into the Raspberry Pi and boot the device. It should ask for the password while booting.

Password on USB key

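The Raspberry Pi can also read the password directly from a file on a USB key while booting. To set this up, create a key file of fixed size (without bs and count, dd would keep writing until the USB key is full) and add it to the LUKS volume:

dd if=/dev/urandom of=/mnt/sdb1/ripi.txt bs=512 count=4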
cryptsetup luksAddKey /dev/mmcblk0p2 /mnt/sdb1/ripi.txt
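A slightly more defensive variant, sketched here as a hypothetical make_keyfile helper, writes a bounded number of random bytes and verifies the resulting file size before the key is enrolled. The 2048-byte default is an arbitrary choice for illustration, not a requirement of cryptsetup.

```shell
# make_keyfile PATH [BYTES]
# Hypothetical helper: writes BYTES random bytes (default 2048) to PATH
# and fails if the resulting size is not what was asked for.
make_keyfile() {
    keyfile=$1
    bytes=${2:-2048}
    # bs=1 keeps the arithmetic obvious; count bounds the output size.
    dd if=/dev/urandom of="$keyfile" bs=1 count="$bytes" 2>/dev/null || return 1
    # wc -c may pad its output with spaces on some systems; strip them.
    [ "$(wc -c < "$keyfile" | tr -d ' ')" -eq "$bytes" ]
}
```

It could then be combined with the command above, for example: make_keyfile /mnt/sdb1/ripi.txt && cryptsetup luksAddKey /dev/mmcblk0p2 /mnt/sdb1/ripi.txt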

Then add the following entry to cmdline.txt:

cryptkey=/dev/disk/by-uuid/ABCD-EFGH:vfat:/ripi.txt

The value to use in place of ABCD-EFGH is obtained by running blkid on the partition of the USB key that contains the password file:

blkid /dev/sdb1
/dev/sdb1: UUID="ABCD-EFGH" TYPE="vfat"
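To avoid copying the UUID by hand, blkid can print a single tag with -s and -o value. The sketch below assembles the cryptkey= argument from those values. The helper name cryptkey_param is hypothetical, and the device and file names are the ones used in this post.

```shell
# cryptkey_param UUID FSTYPE PATH
# Hypothetical helper: prints the cryptkey= argument in the
# device:fstype:path form used above.
cryptkey_param() {
    printf 'cryptkey=/dev/disk/by-uuid/%s:%s:%s\n' "$1" "$2" "$3"
}

# On the PC, with the USB key partition at /dev/sdb1 (as in the text):
#   uuid=$(blkid -s UUID -o value /dev/sdb1)
#   fstype=$(blkid -s TYPE -o value /dev/sdb1)
#   cryptkey_param "$uuid" "$fstype" /ripi.txt
```

The printed string is what gets appended to the single line in cmdline.txt.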

Interesting links

https://www.pavelkogan.com/2014/05/23/luks-full-disk-encryption/
https://outflux.net/blog/archives/2017/08/30/grub-and-luks/
