James Routley

Worried that your latest ask to a cloud-based AI reveals a bit too much about you? Want to know your genetic risk of disease without revealing it to the services that compute the answer?

There is a way to do computing on encrypted data without ever having it decrypted. It’s called fully homomorphic encryption, or FHE. But there’s a rather large catch. It can take thousands—even tens of thousands—of times longer to compute on today’s CPUs and GPUs than simply working with the decrypted data.

So universities, startups, and at least one processor giant have been working on specialized chips that could close that gap. Last month at the IEEE International Solid-State Circuits Conference (ISSCC) in San Francisco, Intel demonstrated its answer, Heracles, which sped up FHE computing tasks as much as 5,000-fold compared to a top-of the-line Intel server CPU.

Startups are racing to beat Intel and each other to commercialization. But Sanu Mathew, who leads security circuits research at Intel, believes the CPU giant has a big lead, because its chip can do more computing than any other FHE accelerator yet built. “Heracles is the first hardware that works at scale,” he says.

The scale is measurable both physically and in compute performance. While other FHE research chips have been in the range of 10 square millimeters or less, Heracles is about 20 times that size and is built using Intel’s most advanced, 3-nanometer FinFET technology. And it’s flanked inside a liquid-cooled package by two 24-gigabyte high-bandwidth memory chips—a configuration usually seen only in GPUs for training AI.

In terms of scaling compute performance, Heracles showed muscle in live demonstrations at ISSCC. At its heart the demo was a simple private query to a secure server. It simulated a request by a voter to make sure that her ballot had been registered correctly. The state, in this case, has an encrypted database of voters and their votes. To maintain her privacy, the voter would not want to have her ballot information decrypted at any point; so using FHE, she encrypts her ID and vote and sends it to the government database. There, without decrypting it, the system determines if it is a match and returns an encrypted answer, which she then decrypts on her side.

On an Intel Xeon server CPU, the process took 15 milliseconds. Heracles did it in 14 microseconds. While that difference isn’t something a single human would notice, verifying 100 million voter ballots adds up to more than 17 days of CPU work versus a mere 23 minutes on Heracles.

Looking back on the five-year journey to bring the Heracles chip to life, Ro Cammarota, who led the project at Intel until last December and is now at University of California Irvine, says “we have proven and delivered everything that we promised.”

FHE Data Expansion

FHE is fundamentally a mathematical transformation, sort of like the Fourier transform. It encrypts data using a quantum-computer-proof algorithm, but, crucially, uses corollaries to the mathematical operations usually used on unencrypted data. These corollaries achieve the same ends on the encrypted data.

One of the main things holding such secure computing back is the explosion in the size of the data once it’s encrypted for FHE, Anupam Golder, a research scientist at Intel’s circuits research lab, told engineers at ISSCC. “Usually, the size of cipher text is the same as the size of plain text, but for FHE it’s orders of magnitude larger,” he said.

While the sheer volume is a big problem, the kinds of computing you need to do with that data is also an issue. FHE is all about very large numbers that must be computed with precision. While a CPU can do that, it’s very slow going—integer addition and multiplication take about 10,000 more clock cycles in FHE. Worse still, CPUs aren’t built to do such computing in parallel. Although GPUs excel at parallel operations, precision is not their strong suit. (In fact, from generation to generation, GPU designers have devoted more and more of the chip’s resources to computing less and less-precise numbers.)

FHE also requires some oddball operations with names like “twiddling” and “automorphism,” and it relies on a compute-intensive noise-cancelling process called bootstrapping. None of these things are efficient on a general-purpose processor. So, while clever algorithms and libraries of software cheats have been developed over the years, the need for a hardware accelerator remains if FHE is going to tackle large-scale problems, says Cammarota.

The Labors of Heracles

Heracles was initiated under a DARPA program five years ago to accelerate FHE using purpose-built hardware. It was developed as “a whole system-level effort that went all the way from theory and algorithms down to the circuit design,” says Cammarota.

Among the first problems was how to compute with numbers that were larger than even the 64-bit words that are today a CPU’s most precise. There are ways to break up these gigantic numbers into chunks of bits that can be calculated independently of each other, providing a degree of parallelism. Early on, the Intel team made a big bet that they would be able to make this work in smaller, 32-bit chunks, yet still maintain the needed precision. This decision gave the Heracles architecture some speed and parallelism, because the 32-bit arithmetic circuits are considerably smaller than 64-bit ones, explains Cammarota.

At Heracles’ heart are 64 compute cores—called tile-pairs—arranged in an eight-by-eight grid. These are what are called single instruction multiple data (SIMD) compute engines designed to do the polynomial math, twiddling, and other things that make up computing in FHE and to do them in parallel. An on-chip 2D mesh network connects the tiles to each other with wide, 512 byte, buses.

Important to making encrypted computing efficient is feeding those huge numbers to the compute cores quickly. The sheer amount of data involved meant linking 48-GB-worth of expensive high-bandwidth memory to the processor with 819 GB per second connections. Once on the chip, data musters in 64 megabytes of cache memory—somewhat more than an Nvidia Hopper-generation GPU. From there it can flow through the array at 9.6 terabytes per second by hopping from tile-pair to tile-pair.

To ensure that computing and moving data don’t get in each other’s way, Heracles runs three synchronized streams of instructions simultaneously, one for moving data onto and off of the processor, one for moving data within it, and a third for doing the math, Golder explained.

It all adds up to some massive speed ups, according to Intel. Heracles—operating at 1.2 gigahertz—takes just 39 microseconds to do FHE’s critical math transformation, a 2,355-fold improvement over an Intel Xeon CPU running at 3.5 GHz. Across seven key operations, Heracles was 1,074 to 5,547 times as fast.

The differing ranges have to do with how much data movement is involved in the operations, explains Mathew. “It’s all about balancing the movement of data with the crunching of numbers,” he says.

FHE Competition

“It’s very good work,” Kurt Rohloff, chief technology officer at FHE software firm Duality Technology, says of the Heracles results. Duality was part of a team that developed a competing accelerator design under the same DARPA program that Intel conceived Heracles under. “When Intel starts talking about scale, that usually carries quite a bit of weight.”

Duality’s focus is less on new hardware than on software products that do the kind of encrypted queries that Intel demonstrated at ISSCC. At the scale in use today “there’s less of a need for [specialized] hardware,” says Rohloff. “Where you start to need hardware is emerging applications around deeper machine-learning oriented operations like neural net, LLMs, or semantic search.”

Last year, Duality demonstrated an FHE-encrypted language model called BERT. Like more famous LLMs such as ChatGPT, BERT is a transformer model. However it’s only one tenth the size of even the most compact LLMs.

John Barrus, vice president of product at Dayton, Ohio-based Niobium Microsystems, an FHE chip startup spun out of another DARPA competitor, agrees that encrypted AI is a key target of FHE chips. “There are a lot of smaller models that, even with FHE’s data expansion, will run just fine on accelerated hardware,” he says.

With no stated commercial plans from Intel, Niobium expects its chip to be “the world’s first commercially viable FHE accelerator, designed to enable encrypted computations at speeds practical for real-world cloud and AI infrastructure.” Although it hasn’t announced when a commercial chip will be available, last month the startup revealed that it had inked a deal worth 10 billion South Korean won (US $6.9 million) with Seoul-based chip design firm Semifive to develop the FHE accelerator for fabrication using Samsung Foundry’s 8-nanometer process technology.

Other startups including Fabric Cryptography, Cornami, and Optalysys have been working on chips to accelerate FHE. Optalysys CEO Nick New says Heracles hits about the level of speedup you could hope for using an all-digital system. “We’re looking at pushing way past that digital limit,” he says. His company’s approach is to use the physics of a photonic chip to do FHE’s compute-intensive transform steps. That photonics chip is on its seventh generation, he says, and among the next steps is to 3D integrate it with custom silicon to do the non-transform steps and coordinate the whole process. A full 3D-stacked commercial chip could be ready in two or three years, says New.

While competitors develop their chips, so will Intel, says Mathew. It will be improving on how much the chip can accelerate computations by fine tuning the software. It will also be trying out more massive FHE problems, and exploring hardware improvements for a potential next generation. “This is like the first microprocessor… the start of a whole journey,” says Mathew.

Intel Demos Chip to Compute with Encrypted Data

FHE Data Expansion

The Labors of Heracles

FHE Competition