JETC1204-39 - AhsanJETC2015
Designing a Million-Qubit Quantum Computer Using a Resource Performance Simulator
MUHAMMAD AHSAN, Duke University
RODNEY VAN METER, Keio University, Japan
JUNGSANG KIM, Duke University
The optimal design of a fault-tolerant quantum computer involves finding an appropriate balance between
the burden of large-scale integration of noisy components and the load of improving the reliability of hard-
ware technology. This balance can be evaluated by quantitatively modeling the execution of quantum logic
operations on a realistic quantum hardware containing limited computational resources. In this work, we
report a complete performance simulation software tool capable of (1) searching the hardware design space
by varying resource architecture and technology parameters, (2) synthesizing and scheduling a fault-tolerant
quantum algorithm within the hardware constraints, (3) quantifying the performance metrics such as the
execution time and the failure probability of the algorithm, and (4) analyzing the breakdown of these metrics
to highlight the performance bottlenecks and visualizing resource utilization to evaluate the adequacy of the
chosen design. Using this tool, we investigate a vast design space for implementing key building blocks of
Shor’s algorithm to factor a 1,024-bit number with a baseline budget of 1.5 million qubits. We show that a
trapped-ion quantum computer designed with twice as many qubits and one-tenth of the baseline infidelity
of the communication channel can factor a 2,048-bit integer in less than 5 months.
CCS Concepts: Computer systems organization → Quantum computing; Hardware →
Quantum error correction and fault tolerance
Additional Key Words and Phrases: Quantum architecture, architecture scalability, resource performance
tradeoffs, performance simulation tool, hardware constraints
ACM Reference Format:
Muhammad Ahsan, Rodney Van Meter, and Jungsang Kim. 2015. Designing a million-qubit quantum com-
puter using a resource performance simulator. J. Emerg. Technol. Comput. Syst. 12, 4, Article 39 (December
2015), 25 pages.
DOI: http://dx.doi.org/10.1145/2830570
1. INTRODUCTION
Although quantum computers (QCs) can in principle solve important problems such as
factoring a product of large prime numbers efficiently, the prospect of constructing a
practical system is hampered by the need to build reliable systems out of faulty com-
ponents [Van Meter and Horsman 2013]. Fault-tolerant procedures utilizing quantum
error correcting codes (QECCs) achieve adequate error performance by protecting the
quantum information from noise, but this comes at the expense of substantial resource
This work was funded by Intelligence Advanced Research Projects Activity (IARPA) under the Multi-Qubit
Coherent Operation (MQCO) Program and the Quantum Computer Science (QCS) program.
Authors’ addresses: M. Ahsan, Department of Computer Science, Duke University, Durham, North Carolina,
United States of America 27708; email: [email protected]; R. Van Meter, Faculty of Environment and Infor-
mation Studies, 1-8-13 Zaimokuza Kamakura, Kanagawa-ken, Japan 248-0013; email: [email protected];
J. Kim, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina,
United States of America 27708; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned
by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from [email protected].
2015 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 1550-4832/2015/12-ART39 $15.00
ACM Journal on Emerging Technologies in Computing Systems, Vol. 12, No. 4, Article 39, Publication date: December 2015.
investment [Nielsen and Chuang 2000]. The threshold theorem (or the quantum fault
tolerance theorem) says that a quantum computation of arbitrary size can be performed
as long as the error probability of each operation is kept below a certain threshold value
and sufficient computational resources, such as the number of quantum bits (qubits),
can be provided to implement adequate fault tolerance [Aharonov and Ben-Or 1997].
Although this is an encouraging theoretical result, an accurate estimate of the resource
overhead remains an extremely complex task, as it depends on the details of the hard-
ware (qubit connectivity, gate speeds and coherence time, etc.), the choice of protocols
(QECC, etc.), and the nature of the target algorithms. Several application-optimized ar-
chitectures have been proposed and analyzed [Metodi et al. 2005; Van Meter et al. 2008;
Whitney et al. 2009; Kim and Kim 2009; Monroe et al. 2014; Galiautdinov et al. 2012;
Fowler et al. 2012], yet the accurate quantification of resource performance scaling for
various benchmarks remains a challenging problem.
In this work, we quantitatively define the scalability of a quantum architecture to
mean that the resource overhead of running a quantum algorithm, while sustaining
expected behavior in execution time and success probability (of order unity, ∼ O(1)),
increases linearly with the problem size. We propose a modular ion-trap-based architec-
ture and quantify its scalability for three different benchmark circuits crucial for Shor’s
factoring algorithm [Shor 1997]: a quantum carry look-ahead adder (QCLA) [Draper
et al. 2006], the CDKM quantum ripple-carry adder (QRCA) [Cuccaro et al. 2004], and
an approximate quantum Fourier transform (AQFT) [Fowler and Hollenberg 2004].
This architecture features fast and reliable interconnects to ensure efficient access to
computational resources and enables flexible distribution of computational resources
to various workload-intensive parts of the system depending on the circuit being ex-
ecuted. By evaluating this architecture for a variety of benchmarks, we show that it
can achieve highly optimized performance by flexible and efficient utilization of given
resources over a range of interesting quantum circuits.
To quantify the performance of an architecture as a function of available resources,
we develop a performance simulation tool similar to those reported in Svore et al.
[2006], Whitney et al. [2007], Balensiefer et al. [2005], and Whitney et al. [2009] that
(1) maps application circuits on to the quantum hardware, (2) generates and schedules
the sequence of quantum logic gates from the algorithm operating on the qubits mapped
to the hardware, and (3) estimates performance metrics such as total execution time
and failure probability. Unlike the tools reported previously, our tool features unique
capabilities to (1) simulate performance over varying hardware device parameters,
(2) allow dynamic resource allocation in the architecture, (3) provide a detailed breakdown
of resource and performance variables, (4) enable visualization of resource
utilization, and (5) support a range of benchmark applications. By leveraging these unique
attributes, we search the architecture space for a suitable QC design while providing
valuable insights into the factors limiting performance in a large-scale QC.
This article is organized as follows: Section 2 describes benchmark quantum circuits
and their characteristics. Section 3 describes the underlying quantum hardware technology
and the modular, reconfigurable architecture used in our simulation. The toolset
and its main features are outlined in Section 4. Simulation results along with detailed
discussions are given in Section 5, and Section 6 describes extensibility of the tool.
Section 7 puts our work in the context of other previous quantum architecture studies,
and Section 8 summarizes the main insights gained from our study.
2. QUANTUM CIRCUITS
2.1. Universal Quantum Gates
Quantum circuits consist of a sequence of gates on qubit operands. An n-qubit quantum
gate performs a deterministic unitary transformation on n operand qubits. In the
Fig. 1. Fault-tolerant circuit for Toffoli gate used in adder benchmarks. $|\alpha\rangle_L$ denotes a logical qubit block
representing the state $|\alpha\rangle$.
from T gates. We consider a QC architecture where both Toffoli and T gates can be
executed and optimize the architecture for executing all required quantum circuits for
running Shor’s algorithm.
2.2.1. Quantum Adders. A large number of quantum adder circuits must be called to
complete the modular exponentiation that constitutes the bulk of Shor’s algorithm.
We select two candidate adders QRCA and QCLA, representing two vastly different
addition strategies, analogous to classical adders. QRCA is a linear-depth circuit, con-
taining serially connected CNOT and Toffoli gates: an n-bit addition will require about
2n qubits to perform 2n Toffoli and 5n CNOT gates [Cuccaro et al. 2004]. The sequence
of these gates is inherently local, and nearest-neighbor connectivity among the qubits
is sufficient to implement this circuit. On the other hand, QCLA is a logarithmic-depth
[$\sim 4\log_2 n$] circuit connecting 4n qubits utilizing up to n concurrently executable
gates [Draper et al. 2006]. This circuit roughly contains $5n - 3\log_2 n$ CNOT and Toffoli
gates for n-bit addition. The exponential gain in performance (execution time) comes at
the cost of sufficient availability of ancilla qubits and rapid communication channels
among distant qubits to exploit parallelism. The QC hardware model considered here
is unique in providing the global connectivity necessary for implementing QCLA. We
study the resource-performance tradeoff in selecting QCLA versus QRCA in Section 5.
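The resource counts quoted above can be tabulated side by side. The following is a minimal Python sketch, not part of the simulation tool: the class and function names are hypothetical, the formulas are the approximate counts stated in the text, and the QRCA depth is taken to equal its gate count on the assumption that its gates execute strictly in series.

```python
import math
from dataclasses import dataclass


@dataclass
class AdderEstimate:
    name: str
    qubits: int
    depth: float   # approximate circuit depth, in gate steps
    gates: float   # approximate CNOT + Toffoli count


def qrca_estimate(n: int) -> AdderEstimate:
    """Ripple-carry adder: about 2n qubits, 2n Toffoli and 5n CNOT gates."""
    gates = 2 * n + 5 * n
    # Assumption: fully serial execution, so depth equals the gate count.
    return AdderEstimate("QRCA", 2 * n, float(gates), float(gates))


def qcla_estimate(n: int) -> AdderEstimate:
    """Carry look-ahead adder: 4n qubits, depth ~4 log2(n), ~5n - 3 log2(n) gates."""
    log_n = math.log2(n)
    return AdderEstimate("QCLA", 4 * n, 4 * log_n, 5 * n - 3 * log_n)


if __name__ == "__main__":
    for estimate in (qrca_estimate(1024), qcla_estimate(1024)):
        print(estimate)
```

For a 1,024-bit addition, the sketch reports twice as many qubits for QCLA as for QRCA but a depth of about 40 gate steps instead of several thousand, which is the tradeoff the text describes.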
2.2.2. Approximate Quantum Fourier Transform. The quantum Fourier transform (QFT)
circuit is often used as the keystone of the order-finding routine in Shor’s algorithm
[Nielsen and Chuang 2000]. It contains controlled-rotation gates $R_z(\pi/2^k)$, where
the phase of the target qubit is shifted by $\pi/2^k$ for the $|1\rangle$ state if the control qubit is
in the $|1\rangle$ state, for 1 ≤ k ≤ n, in an n-qubit Fourier transform. Figure 2 shows that
the controlled-rotation gates can first be decomposed into CNOTs and single-qubit rotations
with twice the angle. These rotation operations are not in the Clifford group
for k > 1 and must be approximated using gates from the universal quantum gate
set [Nielsen and Chuang 2000]. A recent theoretical breakthrough provides an asymptotically
optimal way of approximating an arbitrary quantum gate with a precision
of $\epsilon$ using only $O(\log(1/\epsilon))$ Clifford group and T gates, and a concrete algorithm for
generating the approximation circuit [Kliuchnikov et al. 2013; Giles and Selinger 2013].
It has been shown that a QFT circuit can yield the correct result with high enough
probability even if one eliminates all small-angle-rotation gates with k > 8, sufficient
Fig. 2. Fault-tolerant circuit for a controlled-rotation gate used in AQFT benchmark circuit. Small-angle-
rotation gates are approximated by a sequence of T and Clifford gates, and the T gates are performed by
magic state preparation and data teleportation.
to factor numbers as large as 4,096 bits [Fowler and Hollenberg 2004]. The resulting
truncated QFT is called the approximate QFT (AQFT). The depth of this benchmark
circuit is linear in the size of the problem n, and the total number of controlled-rotation
gates scales as 16n. Using the method outlined in Kliuchnikov et al. [2013], we approximate
rotations in our AQFT circuit with a sequence of 375 gates (containing 150 T
gates), with a precision of $10^{-16}$. The resulting approximation sequence consists of T
(or $T^\dagger$) gates sandwiched between one or two Clifford gates, whose execution time is
negligible compared to the T gate. The execution of the T gate proceeds in two steps:
preparation of the magic state $T|+\rangle$ and teleportation of data into the magic state.
Since state preparation takes a much longer time than teleportation in our system
(78ms vs. 12ms, see Table III), we can employ multiple QC units to prepare magic
states to simulate pipelined execution of T gates (Figure 2). When multiple ancilla
qubits are available for the magic state preparation, we can reduce the delay in the
execution of the approximation sequence. Using a simple calculation, we can show
that the availability of eight logical ancilla qubits completely eliminates any delay.
When the error correction procedure is inserted in the approximation sequence, its
latency can be leveraged to eliminate the preparation delay with even fewer ancilla
qubits.
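The text does not spell out the simple calculation behind the figure of eight ancilla qubits. One reading that reproduces it is sketched below in Python. It rests on an assumption that is not stated in the source: each ancilla is occupied for the full preparation time plus the teleportation time of the T gate that consumes it, and a fresh magic state must be ready every teleportation period. The function names are hypothetical.

```python
import math


def ancillas_to_hide_preparation(t_prep_ms: float, t_teleport_ms: float) -> int:
    """Smallest number of ancillas that keeps the T-gate pipeline full.

    Assumption (not from the source): an ancilla is busy for
    t_prep + t_teleport, and one T gate is issued every t_teleport.
    """
    busy_ms = t_prep_ms + t_teleport_ms
    return math.ceil(busy_ms / t_teleport_ms)


def stall_per_t_gate_ms(t_prep_ms: float, t_teleport_ms: float, ancillas: int) -> float:
    """Average extra wait per T gate when only `ancillas` ancillas are available."""
    busy_ms = t_prep_ms + t_teleport_ms
    return max(0.0, busy_ms / ancillas - t_teleport_ms)


if __name__ == "__main__":
    # 78 ms preparation and 12 ms teleportation are the values quoted in the text.
    print(ancillas_to_hide_preparation(78.0, 12.0))
    print(stall_per_t_gate_ms(78.0, 12.0, ancillas=4))
```

Under this occupancy model, four ancillas leave an average stall of 10.5 ms per T gate, while eight remove it entirely. Latency contributed by interleaved error correction would lower the requirement, as the text notes.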
they can be achieved in the near future through rapid technology advancement. Once
quantum hardware technology is specified, we arrange the qubit resources according to
their specific roles and their interconnection in order to assemble a large-scale QC. This
is captured by the parameters of the quantum architecture. For example, the number
of qubits dedicated to perform fault-tolerant quantum operations and the specification
of communication channels are considered architecture parameters.
3.1. Quantum Hardware Model
We choose to model quantum hardware based on trapped ions for its prominent proper-
ties that have been demonstrated experimentally. First, the qubit can be represented by
two internal states of the atomic ion (e.g., $^{171}\mathrm{Yb}^{+}$ ion [Olmschenk et al. 2007]), described
as a two-level spin system, manipulated by focusing adequate laser beams at the target
ion(s). The physical ion qubits can be individually accessible for computation [Crain
et al. 2014]. These qubits can be reliably initialized to the desired computational state
and measured with very high accuracy using standard techniques. Most importantly,
by virtue of the very long coherence time of the ions, qubits can retain their state
(memory) for a period of time unparalleled by any other quantum technology. The
qubit memory error is modeled as an exponential decay in its fidelity $F \sim \exp(-at)$,
where $a$ ($= 1/T_{coh}$) is determined by the coherence time of the qubit, and t is the time
between quantum gates over which the qubit sits idle (no-op). The corruption of the qubit
state is modeled using a depolarizing channel [Nielsen and Chuang 2000] (equal probability
of bit flip, phase flip, and bit-and-phase flip errors). Arbitrary single qubit gates,
CNOT, and measurement can be performed with adequate reliability, making trapped
ions a suitable candidate for large-scale universal QC.
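The memory error model described above can be written down in a few lines. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, and converting the fidelity into a depolarizing error probability as 1 − F, split equally among the three error types, is an illustrative choice rather than something the text specifies.

```python
import math
from typing import Dict


def idle_fidelity(t_idle: float, t_coh: float) -> float:
    """F = exp(-t / T_coh) for a qubit idling for t_idle (same units as t_coh)."""
    return math.exp(-t_idle / t_coh)


def depolarizing_error_probabilities(t_idle: float, t_coh: float) -> Dict[str, float]:
    """Split the idle error equally among the three Pauli error types.

    Assumption: total error probability is 1 - F (not stated in the source).
    """
    p_error = 1.0 - idle_fidelity(t_idle, t_coh)
    share = p_error / 3.0
    return {"bit_flip": share, "phase_flip": share, "bit_and_phase_flip": share}


if __name__ == "__main__":
    # Hypothetical numbers: idle for 1 time unit with a coherence time of 10 units.
    print(depolarizing_error_probabilities(1.0, 10.0))
```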
A single-qubit quantum gate is accomplished by a simple application of laser pulse(s)
on the qubit in its original location. A two-qubit gate, on the other hand, requires that
both ions are brought in proximity before the laser pulse(s) are applied. In our model,
there are two ways to achieve this proximity using two different types of physical
resources: the ballistic shuttling channel (BSC) and the entanglement link (EL) [Monroe
et al. 2014]. BSC provides a physical channel through which an ion can be physically
transported from its original location to the target location by carefully controlling
the voltages of the electrodes on the ion trap chip. This chip can be modeled as a 2D
grid of ion-trap cells, as shown in Figure 3. The dimensions of the state-of-the-art ion-trap
cell described in Monroe and Kim [2013] fall in the ∼mm size range, and we use
$T_{shutt}$ = 1 μs as the time it takes for an ion to be shuttled through a single cell. In the EL
case, an entangled qubit pair (also known as the Einstein-Podolsky-Rosen, or EPR, pair)
is established between designated proxy “entangling ions” (e-ions) that belong to two
independent ion trap chips using a photonic channel. This process is called heralded
entanglement generation [Duan et al. 2004], since the successful EPR pair generation
is announced by the desired output of detectors collecting ion-emitted photons. The
resulting EPR pair is used by the actual operand ions as a resource to perform the
desired gate via quantum teleportation between two ions that cannot be connected by
BSC [Gottesman and Chuang 1999]. It should be noted that EPR pair generation
is currently a slow process due to technology limitations. This slowness
can be compensated for by generating several EPR pairs in parallel using dedicated
qubits and hardware. Table I summarizes the device parameters (DPs) used for all the
analyses in this article.
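The two communication resources lend themselves to simple latency estimates. The Python sketch below is illustrative only: the function names are hypothetical, shuttling is assumed to follow an uncongested Manhattan path across the 2D grid at 1 μs per cell, and each EPR generator is assumed to deliver one pair per fixed generation period, which ignores the probabilistic nature of heralded generation.

```python
import math
from typing import Tuple

T_SHUTT_US = 1.0  # time to shuttle an ion through one cell, as given in the text


def shuttle_time_us(src: Tuple[int, int], dst: Tuple[int, int]) -> float:
    """Shuttling latency between two grid cells.

    Assumption: Manhattan path with no congestion or turn penalty.
    """
    cells = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return T_SHUTT_US * cells


def epr_batch_time(pairs_needed: int, t_epr: float, generators: int) -> float:
    """Time to produce `pairs_needed` EPR pairs with parallel generators.

    Assumption: every generator yields exactly one pair per t_epr.
    """
    rounds = math.ceil(pairs_needed / generators)
    return rounds * t_epr


if __name__ == "__main__":
    print(shuttle_time_us((0, 0), (3, 4)))
    # Hypothetical generation period of 5 time units per pair.
    print(epr_batch_time(10, 5.0, generators=1), epr_batch_time(10, 5.0, generators=4))
```

Going from one generator to four cuts the batch time for ten pairs from 50 to 15 time units in this model, which illustrates how parallel hardware compensates for slow pair generation.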
3.2. Quantum Architecture Model
Our model is similar to the modular universal scalable ion-trap QC (MUSIQC) archi-
tecture [Monroe et al. 2014] shown in Figure 3. It features a hierarchical construction of
larger blocks of qubits (called segments) composed of smaller units (called Tiles), which
Fig. 3. Overview of the reconfigurable quantum computer architecture analyzed in our performance
simulation tool.
are connected by an optical switch network. We use the Steane [[7,1,3]] code [Steane
1996] to encode one logical qubit using 7 physical qubits. Additional (ancilla) qubits are
supplied to perform error correction and fault-tolerant operations on the logical qubit.
We first construct the first-layer (L1) logical qubit block containing 22 physical qubits
(7 data and 15 ancilla qubits) using Steane encoding. As the size of computation grows,
multiple layers of encoding are needed to minimize the impact of increasing noise:
with each new layer, the qubit and gate count increase by about a factor of 7. At least
two layers of encoding are essential for reliable execution of the sizable benchmark
circuits analyzed in our simulations. Therefore, we cluster seven L1 blocks to construct
the second layer (L2) logical qubit block containing dedicated qubits to simultaneously
carry out error correction operations at the L1 level after every L1 gate. We find that
the error correction operation at the L2 level occurs much less frequently, and a dedicated
error-correcting ancilla resource at L2 is not necessary for each L2 logical qubit.
Therefore, we allocate fewer ancilla qubits at the L2 level for error correction and rely
on resource sharing to accomplish fault tolerance at the L2 level.
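The physical qubit cost of the two encoding layers follows from the block sizes given above. The following Python sketch shows the arithmetic. The function names are hypothetical, and the count of shared L2 ancilla qubits is left as a parameter because the text does not give a number for it.

```python
L1_DATA_QUBITS = 7        # Steane [[7,1,3]] data qubits
L1_ANCILLA_QUBITS = 15    # ancilla qubits per L1 block, as stated in the text
L1_BLOCKS_PER_L2 = 7      # seven L1 blocks are clustered into one L2 block


def l1_block_qubits() -> int:
    """Physical qubits in one first-layer logical block."""
    return L1_DATA_QUBITS + L1_ANCILLA_QUBITS


def l2_block_qubits(shared_ancilla: int = 0) -> int:
    """Physical qubits in one second-layer block.

    `shared_ancilla` is a placeholder for the L2 error-correction ancillas
    that are shared between blocks; its value is not given in the source.
    """
    return L1_BLOCKS_PER_L2 * l1_block_qubits() + shared_ancilla


def max_l2_blocks(qubit_budget: int, shared_ancilla: int = 0) -> int:
    """Upper bound on L2 blocks that fit in a budget, ignoring all other overhead."""
    return qubit_budget // l2_block_qubits(shared_ancilla)


if __name__ == "__main__":
    print(l1_block_qubits(), l2_block_qubits(), max_l2_blocks(1_500_000))
```

With no shared ancillas counted, one L2 block uses 154 physical qubits, so a 1.5-million-qubit budget bounds the design at 9,740 such blocks before any communication or other overhead is accounted for.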
At the L2 level, we construct four different types of logical qubit blocks using L1
logical blocks, called L2 Tiles, that serve different functions in the computation [Metodi
et al. 2005]. Each Tile consists of memory cells that provide storage and manipulation
of qubits for quantum gates, and BSCs that allow rapid transportation of ions across
the memory cells to support the qubit interaction necessary for multi-qubit gates.
Tiles are specied by their tasks, such as data storage (Data Tile), state preparation
for non-Clifford gates (Ancilla Tile), error correction (EC Tile), and communication
Fig. 4. An example demonstration of cross-layer resource optimization in L2 Toffoli magic state preparation
circuit. In (a), Type 1 and Type 2 L1 Ancilla Tiles are vertically grouped to enact a transversal L2 Toffoli
gate, leading to the preparation of the L2 magic state $|\phi_{+}\rangle_L$ stored in Type 1 L1 Ancilla Tiles. In (b), Type 2
L1 Ancilla Tiles can be horizontally regrouped to perform L2 error correction on $|\phi_{+}\rangle_L$.
contains (1) NCS and (2) configuration of CSs specified by three numbers (NData,
NAnc, NComm). For SSs, we replace NAnc with 2×NData. Our analysis framework
will involve changing architectural parameters and studying their impact on
resource-performance tradeoffs. The performance metrics consist of execution time $T_{exec}$ and
failure probability $P_{fail} = 1 - P_{succ}$ of the circuit, where $P_{succ}$ is the probability that
circuit execution yields a correct result.
4. TOOL DESCRIPTION
4.1. Design Flow and Tool Components
The toolbox flow shown in Figure 5 has two main components: Tile Designer and
Performance Analyzer (TDPA) and Architecture Designer and Performance Analyzer
(ADPA). Both components share common critical tasks, namely, mapping, scheduling,
and error analysis, but the application of these tasks differs according to their objectives
and the constraints.
4.1.1. TDPA. TDPA works in the back end of the tool and simulates the fault-tolerant
construction of the logical qubit operations using specified DPs. It builds Tiles using the
Tile Builder by allocating sufficient qubits that can perform the operations specified
in the Steane Fault-Tolerant Circuit Generator and maps qubits in the circuit to
the physical qubits in the Tile using the Low-Level Mapper. Then, the Low-Level
Scheduler generates the sequence of quantum gate operations to be executed in the
circuit, including transversal gates, magic state preparation for non-Clifford gates, er-
ror correction, and EPR pair generation. Each logical operation is broken down into
constituent physical operations, whose performance is simulated on the Tile by adding
up the execution time of each gate subject to circuit dependencies and resource constraints.
The Low-Level Error Analyzer computes the failure probability of the
specified fault-tolerant quantum gates based on the scheduled circuit by counting the
number of ways in which physical errors can propagate to cause a logical error in
the qubit [Aliferis et al. 2005; Ahsan et al. 2013]. The Tile parameterized by DP and
the computed performance metrics is stored in the Tile Database. Table III shows the
performance of the unified L2 Tile (which can act as Data, EC, Ancilla, and Communication
Tiles) computed by the TDPA for baseline DPs given in Table I.
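The idea of adding up gate execution times subject to circuit dependencies can be illustrated with a small as-soon-as-possible pass over a gate list. This Python sketch is a simplification and not the Low-Level Scheduler itself: the function name and the input format are hypothetical, gates are assumed to arrive in topological order, and resource constraints are ignored.

```python
from typing import Dict, List, Sequence, Tuple

# Each gate is (name, duration, names of gates it depends on).
Gate = Tuple[str, float, Sequence[str]]


def circuit_latency(gates: List[Gate]) -> float:
    """Finish time of the last gate when every gate starts as soon as
    all of its dependencies have finished.

    Assumptions: `gates` is topologically ordered and resources are unlimited.
    """
    finish: Dict[str, float] = {}
    for name, duration, deps in gates:
        start = max((finish[d] for d in deps), default=0.0)
        finish[name] = start + duration
    return max(finish.values(), default=0.0)


if __name__ == "__main__":
    # Hypothetical fragment: a magic state is prepared (78 ms) in parallel with
    # a 10 ms gate on the data, then consumed by a 12 ms teleportation.
    example = [
        ("prepare_magic_state", 78.0, []),
        ("data_gate", 10.0, []),
        ("teleport", 12.0, ["prepare_magic_state", "data_gate"]),
    ]
    print(circuit_latency(example))
```

In the example the teleportation cannot start until the slower of its two predecessors finishes, so the preparation step determines the overall latency.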
4.1.2. ADPA. ADPA is the front end of the tool that interfaces with the user. It takes
architecture parameters specied by the user (e.g., NCS, NEC, and NComm) as inputs
and (1) builds and connects segments using Tiles supplied by TDPA to implement the
benchmark application on the hardware configuration and (2) evaluates performance of the
benchmark for given architecture parameters. First, the Quantum Circuit Genera-
tor generates the benchmark circuits from the given algorithms (QCLA, QRCA, and
AQFT). Then, the High-Level Mapper maps logical qubits to Tiles in the segments,
maximizing the locality by analyzing their connectivity patterns in the circuit, assign-
ing frequently interacting qubits to the Tiles in the same segment. This is achieved
by solving an optimal linear arrangement problem using an efficient graph-theoretic
algorithm [Juvan and Mohar 1992] to generate the initial map of the Data Tiles in
the segments. Using this map, the High-Level Scheduler generates the sequence of
gates for the circuit execution by solving the standard resource-constrained scheduling
problem in which resources and constraints are given by architecture parameters. The
Scheduler minimizes the execution time by reducing the circuit critical path through
maximum utilization of available resources (Ancilla and Communication Tiles) in the
segments. The non-Clifford gates require operands to be available in the same CS before
being scheduled. Therefore, operands located in remote Data Tiles need to be
teleported into the local Data Tile of the CS, while Ancilla Tiles prepare the magic state
for execution. NCS determines how many non-Clifford gates can be scheduled in paral-
lel, while NComm determines how quickly Tiles can be teleported across the segments.
Therefore, the delays in gate scheduling depend mainly on architecture parameters.
The critical path of the circuit consists of these delays and the gate execution time.
The complete list of latencies arising due to insufficient resources and architecture
configuration is as follows:
—Ancilla Delay ($D_{ANC}$): Delay due to the magic state preparation (fewer NCSs or fewer
Ancilla Tiles per segment)
—Shuttling Delay ($D_{SHUT}$): Delay due to the transportation of operand qubits of the
gate, through BSC inside the segment
—Tel Delay ($D_{TEL}$): Delay due to the logical EPR pair generation for communication
(fewer NComms)
—Cross-Seg-Swap Delay ($D_{SWP}$): Delay due to the cross-segment swapping (fewer
NComms or large number of smaller segments)
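The locality objective of the High-Level Mapper described earlier in this section, placing frequently interacting qubits in the same segment, can be illustrated on a toy instance of the linear arrangement problem. The Python sketch below uses exhaustive search, which is only feasible for a handful of qubits. It is a stand-in for illustration and is not the graph-theoretic algorithm of Juvan and Mohar [1992] that the tool uses. All names and the interaction weights are hypothetical.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

Weights = Dict[Tuple[str, str], int]  # (qubit a, qubit b) -> interaction count


def arrangement_cost(order: Sequence[str], weights: Weights) -> int:
    """Sum of interaction weight times distance in the linear order."""
    position = {qubit: index for index, qubit in enumerate(order)}
    return sum(w * abs(position[a] - position[b]) for (a, b), w in weights.items())


def best_linear_arrangement(qubits: Sequence[str], weights: Weights) -> Tuple[str, ...]:
    """Exhaustive search over all orders; exponential, for tiny examples only."""
    return min(permutations(qubits), key=lambda order: arrangement_cost(order, weights))


def assign_segments(order: Sequence[str], tiles_per_segment: int) -> Dict[str, int]:
    """Cut the linear order into consecutive segments of equal size."""
    return {qubit: index // tiles_per_segment for index, qubit in enumerate(order)}


if __name__ == "__main__":
    weights = {("a", "b"): 5, ("c", "d"): 5, ("b", "c"): 1}
    order = best_linear_arrangement(["a", "c", "b", "d"], weights)
    print(order, arrangement_cost(order, weights))
    print(assign_segments(order, tiles_per_segment=2))
```

In the example the heavily interacting pairs end up adjacent, so cutting the order into two-tile segments keeps each pair inside one segment and leaves only the weak interaction crossing a segment boundary.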
The Scheduler also minimizes $P_{fail}$ by scheduling error correction on Data Tiles
at regular intervals when they sit idle (no-op). Once a complete schedule of logical
operations is obtained, the High-Level Error Analyzer computes the overall $P_{fail}$.
Since we cannot correct for logical failure of the operation, the Error Analyzer simply
computes $P_{succ} = \prod_{i=1}^{N} (1 - P_{L_i})$, where $P_{L_i}$ is the failure probability of the ith logical gate
and N is the total number of logical operations in the circuit. The Error Analyzer also
tracks the operational source of each $P_{L_i}$ so that $1 - P_{succ}$ can be broken down into the
following important noise components:
—Shuttling Error $P_{SHUT}$: Errors due to the qubit shuttling through noisy BSC
—Teleportation Noise $P_{TEL}$: Errors due to the infidelity of an EPR pair for communication
—Memory Noise $P_{MEM}$: Errors due to the fidelity degradation of a qubit during no-op
—Gate Noise $P_{GATE}$: Errors due to the noisy quantum gates and measurements
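The product form of the failure probability and its breakdown by noise source are straightforward to compute once each scheduled operation carries a source label. The following Python sketch shows one way to do so. The function names, the input format, and the example probabilities are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def failure_probability(op_failures: Iterable[float]) -> float:
    """1 - prod(1 - p_i): probability that at least one operation fails."""
    return 1.0 - math.prod(1.0 - p for p in op_failures)


def failure_by_source(ops: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Apply the same product formula separately to each noise source.

    `ops` yields (source, failure probability) pairs, where source is a label
    such as "SHUT", "TEL", "MEM", or "GATE".
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for source, p in ops:
        grouped[source].append(p)
    return {source: failure_probability(ps) for source, ps in grouped.items()}


if __name__ == "__main__":
    # Hypothetical schedule with made-up logical failure probabilities.
    schedule = [("GATE", 1e-9), ("MEM", 2e-10), ("TEL", 5e-9), ("GATE", 1e-9)]
    print(failure_probability(p for _, p in schedule))
    print(failure_by_source(schedule))
```

The per-source values do not sum exactly to the overall failure probability because the product formula is not additive, although the difference is negligible when every individual probability is small.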
4.2. Splitting Performance Metrics
Our tool can output the constituents of Texec by keeping track of the different types
of latency overhead composing the critical path of the quantum circuit execution. Our
iterative scheduler selects a gate for execution during each iteration and updates the
critical path. When a gate is selected for execution, the operand qubits should be
available for computation; otherwise, it will be delayed. If we let $T_{start}$ be the time at
which the gate was executable but operands were available at $T'_{start}$, where $T'_{start} \geq T_{start}$,
then the time at which the gate execution is complete is given by $T_{finish} = T'_{start} + T_{exec}$.
The gate execution selected in scheduling iteration i will update the critical path if
$T_{finish} > T^{i-1}_{TotalExec}$, where $T^{i-1}_{TotalExec}$ is the total execution time (the length of the critical
path) computed in iteration i − 1.
When the gate lies on the critical path, the Scheduler computes $\Delta T = T_{finish} - T^{i-1}_{TotalExec}$
and $D = T'_{start} - T_{start}$, which can be broken down as $D = D_{ANC} + D_{SHUT} + D_{TEL} + D_{SWP}$.
To define the components of the critical path in iteration i − 1, we split
$T^{i-1}_{TotalExec}$ into its components as
$T^{i-1}_{TotalExec} = T^{i-1}_{ANC} + T^{i-1}_{SHUT} + T^{i-1}_{TEL} + T^{i-1}_{SWP} + T^{i-1}_{GATE}$.
In the case of the critical path, we also update the total execution time $T^{i}_{TotalExec} = T_{finish}$.
An example breakdown of execution time is shown in Figure 6, where magic
state preparation for Toffoli gates and cross-segment swapping delay the execution of
gates in turn and make up the bulk of the critical path. Our tool can also decompose
the failure probability $P_{fail}$ into its components since the Error Analyzer tracks noise
sources from the schedule. For any operation type op, we compute
$P^{op}_{fail} = 1 - \prod_{i=1}^{n^{op}} (1 - P^{op}_{L_i})$,
where op can be shuttling, memory, teleportation, or gate, and $n^{op}$ and $P^{op}_{L_i}$ are the
total operation count and failure probability for op, respectively.
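The bookkeeping for the execution-time breakdown can be sketched as a small accumulator that is updated once per scheduled gate. The text defines when the critical path grows and which components it has, but not how the increase is divided among them. The Python sketch below assumes, purely for illustration, that the increase is shared among the delay components and the gate time in proportion to their sizes. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

DELAY_KINDS = ("ANC", "SHUT", "TEL", "SWP")


def _zero_parts() -> Dict[str, float]:
    return {kind: 0.0 for kind in DELAY_KINDS + ("GATE",)}


@dataclass
class CriticalPath:
    total: float = 0.0                                   # total execution time so far
    parts: Dict[str, float] = field(default_factory=_zero_parts)

    def update(self, t_start: float, delays: Dict[str, float], t_exec: float) -> bool:
        """Account for one scheduled gate; return True if the critical path grew.

        t_start: time at which the gate became executable.
        delays:  delay per kind (keys from DELAY_KINDS); their sum is D.
        t_exec:  execution time of the gate itself.
        """
        if t_exec <= 0:
            raise ValueError("gate execution time must be positive")
        d = sum(delays.values())
        t_finish = t_start + d + t_exec      # delayed start plus gate time
        if t_finish <= self.total:
            return False                     # gate is off the critical path
        delta = t_finish - self.total
        span = d + t_exec
        # Assumption: split the increase proportionally (not from the source).
        for kind, value in delays.items():
            self.parts[kind] += delta * value / span
        self.parts["GATE"] += delta * t_exec / span
        self.total = t_finish
        return True


if __name__ == "__main__":
    path = CriticalPath()
    path.update(0.0, {"ANC": 78.0}, 12.0)
    path.update(90.0, {"SWP": 5.0}, 5.0)
    print(path.total, path.parts)
```

By construction the components always add up to the total, so the breakdown can be plotted as a stacked decomposition of the execution time.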
4.3. Tool Validation and Performance
Individual components of the tool can easily be verified for correctness by running
these for known circuits and comparing their output with anticipated results. Overall
validation can be performed by using visualization and the breakdown of performance
Fig. 6. An example shows tracking of latency overheads composing the critical path of the circuit.
metrics for different types of benchmarks. Tool efficiency mainly arises from taking
advantage of the repetitive nature of fault-tolerant procedures and circuit breakdown
of universal quantum gates. Performance of low-level circuit blocks is precomputed
and stored in the database, and used for simulating the behavior of high-level circuits.
For instance, TDPA can be run offline to generate parameterized Tiles, which are
used to efficiently run components of ADPA such as the High-Level Scheduler, Error
Analyzer, and Visualizer. Similarly, the initial mapping of L2 qubits on these Tiles is
generated from a computationally intensive optimization algorithm [Juvan and Mohar
1992], but once generated, can be efficiently processed by the High-Level Scheduler to
generate subsequent gate schedules. Thus, we can run the High-Level Mapper offline
as well. Consequently, the running time of the tool is decided by that of the High-
Level Scheduler and Error Analyzer, which mainly depends on architecture resources
and benchmark size. The results discussed in Section 5 show that the performance
improvement saturates once resource investment exceeds a certain value, and the
maximum size of the overall system we have to simulate is mainly dictated by the size
of the application circuit.
Figure 7 shows the running time of the tool as a function of circuit size. The data
is collected by running the tool on a computer system containing an Intel(R) Core(TM)
i3 2.4GHz processor and 2GB RAM. To incorporate the dependency of tool running
time on the available architecture resources, we choose the configuration containing
maximum resources in order to obtain the typical worst-case running time of the tool.
In this configuration, we allocate maximum Ancilla and Communication Tiles per Data
Tile and allow all Segments to act as Computational Segments. Under these conditions,
Figure 7 shows that the performance simulation of a 2,048-bit circuit can be completed
in less than 1.5 minutes. Thanks to the efficiency in the performance simulation, our
tool can explore a large QC design space in a reasonable amount of time.
Fig. 7. The running time of ADPA (excluding Visualizer) as a function of benchmark application size. With
full Visualization, the running times increase no more than seven times.
5. SIMULATION RESULTS
We first analyze the relationship between resources (qubits) and performance as a function
of benchmark size for scalability. We consider the system architecture resource-performance
(RP) scalable if the increase in resources necessary to achieve the expected
behavior of the performance (execution time) grows linearly with the size of the bench-
mark while maintaining Psucc ∼ O(1). In the absence of hardware resource constraints,
the expected execution time for the QRCA and AQFT grows linearly, while that for
QCLA grows logarithmically, as the problem size grows. The execution time could grow
much more quickly in the presence of resource constraints, in which case the system is
not considered RP scalable.
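One way to make this criterion concrete is to estimate how a measured quantity grows with benchmark size from the slope of a log-log fit. The sketch below is only an illustration: the helper names (`loglog_slope`, `looks_rp_scalable`), the tolerance, and the synthetic data are assumptions introduced here and are not part of the tool or its results.

```python
import math

def loglog_slope(sizes, values):
    """Least-squares slope of log(values) against log(sizes).

    A slope near 1 suggests linear growth, near 2 quadratic growth,
    and a slope well below 1 suggests sublinear (e.g., logarithmic) growth.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def looks_rp_scalable(sizes, qubits, tolerance=0.1):
    """Heuristic (an assumption of this sketch): resources grow at most
    about linearly with benchmark size."""
    return loglog_slope(sizes, qubits) <= 1.0 + tolerance

# Synthetic, purely illustrative data (not simulation results).
sizes = [128, 256, 512, 1024, 2048]
linear_qubits = [700 * n for n in sizes]       # linear resource growth
quadratic_qubits = [n * n for n in sizes]      # superlinear resource growth
log_time = [math.log2(n) for n in sizes]       # logarithmic execution time

print(round(loglog_slope(sizes, linear_qubits), 3))
print(looks_rp_scalable(sizes, linear_qubits))
print(looks_rp_scalable(sizes, quadratic_qubits))
```

A slope well below 1 for the execution-time series would correspond to the logarithmic behavior expected of QCLA, while a slope near 1 would correspond to the linear behavior expected of QRCA and AQFT.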
In the first step, we present a set of simulations to quantify the RP scalability of the
proposed MUSIQC architecture for benchmark circuits and analyze the constituents
of performance metrics. In the next step, we study the impact of limited resources
and architecture parameters on the performance of fixed-size benchmarks. This will
provide guidelines to find an optimized design under limited resources. In the last set,
optimum designs are obtained under resource constraints, for effectively executing the
benchmark circuits.
Designing a Million-Qubit Quantum Computer Using a Resource Performance Simulator 39:15
Fig. 8. QCLA execution time (a) scaling under no constraints on physical resources; Texec and total physical
qubits (NTQ) consumed are plotted as a function of benchmark size, NCS = NSeg. (b) Variation with NCS
for different NComms, showing tradeoffs between resources and Texec.
Fig. 9. QRCA execution time (a) scaling under no constraints on physical resources; Texec and total physical
qubits (NTQ) consumed are plotted as a function of benchmark size, NCS = NSeg. (b) Variation with NCS
for different NComms, showing tradeoffs between resources and Texec.
Fig. 10. AQFT execution time (a) scaling under no constraints on physical resources; Texec and total physical
qubits (NTQ) consumed are plotted as a function of benchmark size, NCS = NSeg. (b) Variation with NCS
for different NComms, showing tradeoffs between resources and Texec.
in Table V. The significant contribution of the overhead TANC for adder configuration
(12,4,3) and AQFT configuration (1,4,1) shows that magic state preparation is the
dominant component of Texec due to insufficient Ancilla Tiles in CS. This overhead can be
substantially reduced either by increasing NAnc (configuration (1,8,1) for AQFT) or
by increasing the ratio of NData to NAnc (configuration (3,4,1) for adders). However,
the configuration (3,4,1) exposes EPR pair generation overhead captured by TTEL and
TSWP. This is due to the large number of cross-segment CNOT gates and qubit swapping
operations required to bring all Toffoli operands to the same segment. Hence,
for adders, frequent cross-segment communication explains the higher contribution of
teleportation error (PTEL) in the failure probability. For AQFT, configuration (1,8,1)
highlights shuttling overhead TSHUT as T gates make up the bulk of the operations.
The scheduling of T gates in the long approximation sequence leads to a large number
of interactions between Data and Ancilla Tiles through BSC in the segment. This
intensive localized communication makes PSHUT the only noticeable component of Pfail.
In conclusion, we have shown RP scalability of the architecture when performance is
bottlenecked by different hardware constraints for various benchmarks. This shows
that the architecture can utilize additional resources efficiently to achieve adequate
performance when running quantum circuits of larger sizes.
5.2. Resource-Performance Tradeoffs
Now that we have quantified RP scalability, we examine the impact of reduced
resources on the performance by constraining architecture parameters. We fix the size
of the benchmarks to 1,024 bits and vary NComm, NCS, and the configuration to
observe changes in Texec. Figure 10(b) shows that for AQFT, (1) Texec does not change
with NComm since it is not restricted by cross-segment communication resources, and
(2) Texec initially decreases sharply as NCS increases but flattens when NCS reaches
∼16, mainly due to insufficient parallelism in the circuit. The QRCA curves in
Figure 9(b) remain unchanged as NCS increases until NCS approaches NSeg, where Texec
shows a noticeable decline. However, the overall decrease remains within a factor of only
about 3, due to the serial nature of the circuit dependencies. By comparing curves for
different configurations ((12,4,1) vs. (3,4,1) for QRCA and (1,8,x) vs. (4,8,x) for AQFT),
we find that clustering more Data Tiles in the CS generally reduces Texec due to fewer
delays in cross-segment operand swapping.
Fig. 11. (LEFT) Sample QCLA circuit, and (RIGHT) visual representation of latency overhead for
Computational Segments of 1,024-bit QCLA architecture with (a) configuration (3,4,4), NCS = NSeg = 1,362,
(b) configuration (3,4,1), NCS = NSeg = 1,362, and (c) configuration (3,4,1), NCS = 341.
Figure 8(b) shows that Texec of QCLA decreases exponentially with NCS. In contrast
to QRCA, the large number of concurrently executable Toffoli gates in QCLA
demands much higher NCS. Furthermore, Texec also decreases with higher NComm
as the large number of cross-segment teleportations consumes more communication
resources. Figures 11(a) through 11(c) provide a pictorial description of these trends
generated through our visualization tool. The visualization highlights different types of
latencies arising from the resource constraints in scheduling within each CS. The
visualization tool draws a line, in the plane of execution time (horizontal axis) versus
qubit location (vertical axis), between each point where a logical qubit in the circuit
requires additional resources to proceed to the next step (such as a magic state for
non-Clifford group gates prepared in an Ancilla Tile, or EPR pairs for teleportation in
Communication Tiles), and a point where the required resource becomes available. As a
consequence, the horizontal lines represent delay in magic state preparation, while
nonhorizontal lines indicate delays in cross-segment teleportation. The corresponding
latencies can be derived by projecting these lines on the horizontal axis. When sufficient
NCS and NComm resources are provided, as in Figure 11(a), there is little delay and the
circuit execution is fast. However, when NComm is reduced from 4 to 1, as in Figure 11(b),
teleportation latency causes a 4.5-fold increase in Texec. The same amount of increase
occurs in Figure 11(c) when NCS is reduced from 1,362 to 341. Long horizontal lines
indicate delays due to fewer Ancilla Tiles available for Toffoli magic state preparation,
which increases the overall Texec by another factor of about 6.
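The reading rule for these delay lines can be sketched in a few lines of code. The record layout below (`DelayLine` and its fields) and the sample values are hypothetical and chosen only to illustrate the rule stated above: a horizontal line is attributed to magic state preparation, a nonhorizontal line to cross-segment teleportation, and the latency is the projection on the time axis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelayLine:
    """One line of the visualization (hypothetical record layout)."""
    t_request: float      # time at which the qubit needs the resource
    loc_request: int      # qubit location when the request is made
    t_available: float    # time at which the resource becomes available
    loc_available: int    # qubit location when the resource is available

    @property
    def latency(self) -> float:
        # Projection of the line on the horizontal (time) axis.
        return self.t_available - self.t_request

    @property
    def kind(self) -> str:
        # Horizontal line: same location, so the wait is for a magic state.
        # Nonhorizontal line: location changes, so the wait is for teleportation.
        if self.loc_request == self.loc_available:
            return "magic_state"
        return "teleportation"

def total_latency_by_kind(lines):
    """Sum the projected latencies separately for each delay type."""
    totals = {"magic_state": 0.0, "teleportation": 0.0}
    for line in lines:
        totals[line.kind] += line.latency
    return totals

# Illustrative values only, not taken from Figure 11.
sample = [
    DelayLine(0.0, 3, 2.5, 3),   # horizontal: magic state delay of 2.5
    DelayLine(1.0, 3, 4.0, 7),   # nonhorizontal: teleportation delay of 3.0
    DelayLine(5.0, 7, 6.0, 7),   # horizontal: magic state delay of 1.0
]
print(total_latency_by_kind(sample))
```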
By comparing Figures 8, 9, and 10, it is easy to conclude that QCLA is the most
resource-hungry benchmark, while AQFT is the least. We also note that Pfail values
do not tend to improve substantially when we provide more architecture resources.
It can be shown that a substantial decrease in Pfail can be achieved by improving
the relevant device parameters (DPs) that contribute to the dominant noise sources
[Ahsan and Kim 2015]. These sources can be correctly identified once we have invested
sufficient resources for scheduling error correction and chosen an optimized architecture
configuration that minimizes Texec. In the following subsection, we concentrate on Texec
only and return to optimizing Pfail in the last subsection.
Fig. 12. Texec for optimized architectures plotted against benchmark size for different segment sizes. The
resource budget is 1.5 million physical qubits. Optimized architecture configurations are shown in Table VI.
Fig. 13. The breakdown of and reduction in the failure probability of 2,048-bit QCLA for the configuration
[30,8,5], NCS = 273, with Tshutt = 1 μs and the baseline physical measurement and multi-qubit gate times
reduced by 90%. The original failure probability shown in (a) is reduced from 2.77 × 10^-7 to 2.37 × 10^-9 in
(b) only by decreasing the infidelity of the EPR pair for cross-segment communication.
a certain threshold value so that overall failure probability is reduced far enough to
meet the design criterion. The 2,048-bit integer factorization consumes 16 million calls
to the adder, and therefore we require Pfail << 6.67 × 10^-8 for each adder execution.
The Pfail for the design optimized for Texec ((30,8,5), NCS = 273) is 2.77 × 10^-7.
In order to lower the failure probability, we can either add one more layer of encoding
or reduce the noise level in the physical device components. Adding a layer of encoding
will require at least a 7x increase in qubit resources, which enormously expands the
scale of integration. In addition, the Texec is also significantly inflated (Table III shows
that L2 error correction takes 70x more time than L1 error correction). Therefore,
increasing the number of layers of encoding achieves a reduction in Pfail at the expense
of far greater Texec and resources. On the other hand, reducing the noise level in physical
device components can decrease Pfail without compromising Texec. We exploit the
tool-supplied breakdown of Pfail based on the fidelity of component physical operations to
systematically improve the success probability of the factorizing task.
Figure 13 shows the breakdown of failure probability for 2,048-bit QCLA, where
over 99% of the failure originates from Teleportation Noise. The device parameter that
directly affects the Teleportation Noise is the infidelity of EPR pairs for cross-segment
communication. By reducing the infidelity from 10^-4 to 10^-5, we gain more than a 100x
reduction in the failure probability as shown in Figure 13(b). A further decrease in
Pfail can be obtained by tuning DPs affecting the Gate and Memory Noise.
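The size of this reduction can be checked directly from the two failure probabilities reported in Figure 13. The short calculation below also estimates an effective scaling exponent; reading that exponent as roughly quadratic suppression is an interpretation offered here, not a statement made by the simulation itself.

```python
import math

# Values quoted in the text and in the caption of Figure 13.
p_fail_before = 2.77e-7    # EPR pair infidelity 1e-4
p_fail_after = 2.37e-9     # EPR pair infidelity 1e-5
infidelity_ratio = 1e-4 / 1e-5

# Factor by which the adder failure probability drops.
reduction = p_fail_before / p_fail_after

# Effective exponent k in Pfail ~ infidelity**k over this single step
# (an illustrative estimate from two data points only).
exponent = math.log(reduction) / math.log(infidelity_ratio)

print(f"reduction factor: {reduction:.1f}")
print(f"effective exponent: {exponent:.2f}")
```

A tenfold improvement in EPR pair infidelity thus yields a reduction factor slightly above 100, which corresponds to an effective exponent slightly above 2.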
In conclusion, we showed that we can lower the failure probability of a 2,048-bit
QCLA circuit to 2.37 × 10^-9 (<< 6.67 × 10^-8) by tuning the DPs. This gives an overall
failure probability of modular exponentiation of about 3.8%. The Pfail of the 4,096-bit
AQFT is negligible compared to this value, and the overall failure probability does
not exceed 4%. Hence, we have shown that the optimized adder architecture with the
appropriately tuned DPs can be used to construct a reliable QC to execute the 2,048-bit
Shor algorithm.
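The 3.8% figure follows from accumulating the per-adder failure probability over the 16 million adder calls. The sketch below reproduces it with a simple union bound and, for comparison, with an exact expression that assumes statistically independent adder failures; the independence assumption and the function names are introduced here for illustration only.

```python
import math

ADDER_CALLS = 16_000_000   # adder calls in 2,048-bit factorization (from the text)

def union_bound(p_fail: float, calls: int) -> float:
    """Upper bound on the overall failure probability: calls * p_fail."""
    return min(1.0, calls * p_fail)

def exact_independent(p_fail: float, calls: int) -> float:
    """1 - (1 - p_fail)**calls, computed stably; assumes independent failures."""
    return -math.expm1(calls * math.log1p(-p_fail))

for label, p in [("before tuning", 2.77e-7), ("after tuning", 2.37e-9)]:
    print(f"{label}: bound = {union_bound(p, ADDER_CALLS):.4f}, "
          f"independent = {exact_independent(p, ADDER_CALLS):.4f}")
```

With the tuned value of 2.37 × 10^-9, the union bound gives about 3.8%, while the untuned value of 2.77 × 10^-7 would make failure of the modular exponentiation almost certain.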
ACM Journal on Emerging Technologies in Computing Systems, Vol. 12, No. 4, Article 39, Publication date: December 2015.
39:22 M. Ahsan et al.
different class of QECCs known as Topological Quantum Code (TQC) [Kitaev 2003]
can be incorporated in the general Tile framework used in our tool (Section 3.2), which
is adequately equipped to handle long-distance and nearest-neighbor qubit interac-
tions. The 2D-grid-based layout of our Tile acts as a substrate for the fault-tolerant
TQC operations. The “hole”-induced creation of the logical qubit(s) in the substrate is
accomplished by the Low-Level Scheduler, which sequences the deactivation (or acti-
vation) of appropriate stabilizer measurements. The same scheduler can also generate
a single-qubit logical gate, which constitutes operations local to the TQC Tile. For a
two-qubit logical gate such as CNOT, the “braiding” procedure can be simulated by
the High-Level Scheduler, which controls the movement of logical qubit data (holes
in case of TQC) across the Tiles. Finally, the error decoding and recovery procedure
can be implemented by simply integrating the well-known minimum weight matching
algorithm [Edmonds 1965] with the existing Error Analyzer of the tool to calculate
overall logical failure probability.
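To illustrate the decoding step mentioned above, the following toy example pairs syndrome defects on a 2D grid so that the total Manhattan distance is minimal. It is a brute-force search over all pairings, used here in place of the polynomial-time algorithm of [Edmonds 1965], and it ignores boundaries and measurement errors; the function names and the defect coordinates are hypothetical.

```python
def manhattan(a, b):
    """Grid distance between two defect coordinates (row, column)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def min_weight_pairing(defects):
    """Return (total_weight, pairs) for the cheapest perfect pairing.

    Brute force with exponential running time, so it is suitable only
    for a handful of defects. A practical decoder would use a
    polynomial-time minimum weight matching algorithm instead.
    """
    if len(defects) % 2:
        raise ValueError("an even number of defects is required")
    if not defects:
        return 0, []
    first, rest = defects[0], defects[1:]
    best = None
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        sub_weight, sub_pairs = min_weight_pairing(remaining)
        weight = manhattan(first, partner) + sub_weight
        if best is None or weight < best[0]:
            best = (weight, [(first, partner)] + sub_pairs)
    return best

# Four illustrative defects: two close pairs far apart from each other.
defects = [(0, 0), (0, 1), (5, 5), (5, 7)]
weight, pairs = min_weight_pairing(defects)
print(weight, pairs)
```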
6.2. Different Quantum Algorithm Circuits
Our benchmark circuits constitute crucial components of not only the Shor algorithm but
also a wide range of other important applications such as quantum chemistry, solving
systems of linear equations, and the triangle finding algorithm [Jordan 2011]. The oracle
part in a general quantum algorithm can be separately compiled using off-the-shelf
quantum compilers (e.g., JavadiAbhari et al. [2014]), which offer optimized reversible
circuit synthesis. The resulting circuit can be merged with the quantum part of the
algorithm by using the circuit integration feature of the Quantum Application Circuit
Generator in ADPA. This way, a general quantum algorithm can be simulated for the
investigation of resource-performance tradeoffs.
6.3. Alternative Quantum Device Technologies and Architectures
The Tile construct used to abstract trapped-ion hardware is powerful enough to
adequately model QC architecture based on other hardware technologies such as
superconductors, quantum dots, and neutral atoms [Ladd et al. 2010]. The technology-specific
storage and manipulation of qubits are described by a new set of DPs. The hardware
constraints characterizing physical layout can be captured by a weighted directed (or
undirected) graph, wherein standard graph traversal algorithms can be leveraged to
model the movement of quantum information required for the physical gates between
nonlocal qubits. The convenient implementation of these algorithms in the current tool
software will enable TDPA to produce Tiles that can schedule fault-tolerant operation
for various device technologies. Once the performance of technology-specific Tiles is
parameterized using new DPs, ADPA specifies the mechanism by which application-level
qubit operands contained in different Tiles communicate to allow nonlocal logical
gates. As such, the Tiles are connected by the communication channels compatible
with the given hardware technology. For example, ballistic shuttling in trapped ions,
successive swapping in superconductors, and optical interconnection in photonic systems
can be modeled either as a Tile Port, which interfaces one Tile to another, or as a dedicated
Communication Tile, presented in Section 3.2. Once the communication component is
appropriately modeled by ADPA, application-level QC architecture can be defined for
the chosen quantum physical hardware.
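As a minimal sketch of the graph-based layout model described above, the example below uses Dijkstra's algorithm to find the cheapest path along which quantum information could be moved between two nonlocal qubit sites. The layout, the site names, and the edge weights (standing for per-hop latency in arbitrary units) are assumptions made for illustration and do not describe any particular device or the tool's internal data structures.

```python
import heapq

def shortest_transport_cost(graph, source, target):
    """Dijkstra's algorithm on a weighted directed graph.

    `graph` maps each site to a dict {neighbor: non-negative weight}.
    Returns the minimum total weight from source to target, or
    infinity if the target cannot be reached.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")

# Hypothetical four-site layout with directed, weighted links.
layout = {
    "q0": {"q1": 1.0, "q2": 5.0},
    "q1": {"q2": 1.5},
    "q2": {"q3": 1.0},
    "q3": {},
}
print(shortest_transport_cost(layout, "q0", "q3"))
```

An undirected layout can be expressed in the same structure by listing each link in both directions.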
7. COMPARISON WITH THE RELATED ARCHITECTURE WORK
A generation of quantum architectures for large monolithic ion traps has been analyzed
using area as a metric for resource utilization [Metodi et al. 2005; Thaker et al. 2006;
Whitney et al. 2009]. Unfortunately, the sizable trap chip envisioned in these studies
is difficult to fabricate due to limitations of fabrication technology [Guise et al. 2014].
8. CONCLUSION
We presented a complete performance simulation toolset capable of designing a
resource-efficient, scalable QC. Our tool is capable of analyzing the performance metrics
of a flexible, reconfigurable computer model and deepens our insights into the quantum
architecture design by providing a comprehensive breakdown of performance metrics
and visualization of resource utilization. Using this tool, we were able to quantify, for
the first time, the resource-performance scalability of a proposed architecture, featuring
unique properties such as (1) cross-layer optimization, where qubit resources providing
L2-level functions are shared throughout the computer; (2) resource-constrained
hardware performance, where optimized architectural design for resource allocation is
considered as a function of the problem size; (3) complete visualization of the resource
utilization that provides a means to validate the optimality of the performance; and
(4) a hardware architecture that provides global connectivity among all the qubits
in the system.
Due to the macromodeling approach used in our tool, we achieve highly efficient
runtime for the performance simulation, which allows us to carry out a comprehensive
search for an optimized system design under given resource constraints, over a range
of architecture configurations and benchmark circuits. Our benchmarks included crucial
building blocks of Shor’s algorithm, including the approximate quantum Fourier
transform and two types of quantum adders. We found that the optimized designs vary
across the benchmark applications depending on the types of gates used, the depth and
parallelism of circuit structure, and resource budget. By comparing their performance
across these benchmark circuits, we present a concrete quantum computer design
capable of executing the 2,048-bit Shor algorithm in less than 5 months. A free copy of
our toolset can be obtained by contacting the first author and is currently available at
http://www.cs.duke.edu/~ahsan/SQrIpT.
REFERENCES
D. Aharonov and M. Ben-Or. 1997. Fault-tolerant quantum computation with constant error. In
Proceedings of the 29th Annual Symposium on Theory of Computing, 176–188.
M. Ahsan, B.-S. Choi, and J. Kim. 2013. Performance simulator based on hardware resources constraints
for ion trap quantum computer. In IEEE 31st International Conference on Computer Design (ICCD’13),
411–418.
M. Ahsan and J. Kim. 2015. Optimization of quantum computer architecture using a resource-performance
simulator. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition
(DATE’15). EDA Consortium, San Jose, CA, 1108–1113.
P. Aliferis, D. Gottesman, and J. Preskill. 2005. Quantum accuracy threshold for concatenated distance-3
codes. arXiv:quant-ph/0504218 (2005).
S. Balensiefer, L. Kreger-Stickles, and M. Oskin. 2005. QUALE: Quantum architecture layout evaluator. In
Proceedings of the SPIE, Vol. 5815, 103–114. DOI:http://dx.doi.org/10.1117/12.604073
D. Beckman, A. N. Chari, S. Devabhaktuni, and J. Preskill. 1996. Efficient networks for quantum factoring.
Phys. Rev. A 54 (1996), 1034–1063.
S. Crain, E. Mount, S.-Y. Baek, and J. Kim. 2014. Individual addressing of trapped 171Yb+ ion qubits using
a microelectromechanical systems-based beam steering system. Appl. Phys. Lett. 105 (2014), 181115.
S. A. Cuccaro, T. G. Draper, S. A. Kutin, and D. P. Moulton. 2004. A new quantum ripple-carry addition
circuit. arXiv preprint quant-ph/0410184 (2004).
T. G. Draper, S. A. Kutin, E. M. Rains, and K. M. Svore. 2006. A logarithmic-depth quantum carry-lookahead
adder. Quantum Inf. Comput. 6 (2006), 351–369.
L.-M. Duan, B. B. Blinov, D. L. Moehring, and C. Monroe. 2004. Scaling trapped ions for quantum computation
with probabilistic ion-photon mapping. Quant. Inf. Comp. 4 (2004), 165–173.
J. Edmonds. 1965. Paths, trees, and flowers. Can. J. Math. 17, 3 (1965), 449–467.
A. G. Fowler and L. C. L. Hollenberg. 2004. Scalability of Shor’s algorithm with a limited set of rotation gates.
Phys. Rev. A 70, 3 (2004), 032329.
A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland. 2012. Surface codes: Towards practical
large-scale quantum computation. Phys. Rev. A 86, 3 (2012), 032324. DOI:http://dx.doi.org/10.1103/
PhysRevA.86.032324.
A. Galiautdinov, A. N. Korotkov, and J. M. Martinis. 2012. Resonator-zero-qubit architecture for supercon-
ducting qubits. Phys. Rev. A 85 (2012), 042321.
B. Giles and P. Selinger. 2013. Exact synthesis of multiqubit Clifford+T circuits. Phys. Rev. A 87 (2013),
032332.
D. Gottesman and I. L. Chuang. 1999. Demonstrating the viability of universal quantum computation using
teleportation and single-qubit operations. Nature 402 (October 1999), 390–393.
N. D. Guise, S. D. Fallek, K. E. Stevens, K. R. Brown, C. Volin, A. W. Harter, J. M. Amini, R. E. Higashi, S. T.
Lu, H. M. Chanhvongsak, et al. 2014. Ball-grid array architecture for microfabricated ion traps. arXiv
preprint arXiv:1412.5576 (2014).
J. Eisert, K. Jacobs, P. Papadopoulos, and M. B. Plenio. 2000. Optimal local implementation of non-local
quantum gates. Phys. Rev. A 62 (2000), 052317.
A. JavadiAbhari, S. Patil, D. Kudrow, J. Heckey, A. Lvov, F. T. Chong, and M. Martonosi. 2014. ScaffCC: A
framework for compilation and analysis of quantum computing programs. In Proceedings of the 11th
ACM Conference on Computing Frontiers. ACM, 1.
L. Jiang, J. M. Taylor, K. Nemoto, W. J. Munro, R. Van Meter, and M. D. Lukin. 2009. Quantum repeater with
encoding. Phys. Rev. A 79, 3 (Mar 2009), 032325. DOI:http://dx.doi.org/10.1103/PhysRevA.79.032325
S. Jordan. 2011. Quantum algorithm zoo. Retrieved June 27, 2013 from http://math.nist.gov/quantum/zoo/.
M. Juvan and B. Mohar. 1992. Optimal linear labelings and eigenvalues of graphs. Discrete Appl. Math. 36
(1992), 153–168.
J. Kim and C. Kim. 2009. Integrated optical approach to trapped ion quantum computation. Quantum Inf.
Comput. 9 (2009), 181–202.
J. Kim, C. J. Nuzman, B. Kumar, D. F. Lieuwen, J. S. Kraus, A. Weiss, C. P. Lichtenwalner, A. R. Papazian, R.
E. Frahm, N. R. Basavanhally, D. A. Ramsey, V. A. Aksyuk, F. Pardo, M. E. Simon, V. Lifton, H. B. Chan,
M. Haueis, A. Gasparyan, H. R. Shea, S. Arney, C. A. Bolle, P. R. Kolodner, R. Ryf, D. T. Neilson, and
J. V. Gates. 2003. 1100 X 1100 port MEMS-based optical crossconnect with 4-dB maximum loss. IEEE
Photon. Technol. Lett. 15 (2003), 1537–1539.
A. Y. Kitaev. 2003. Fault-tolerant quantum computation by anyons. Ann. Phys. 303, 1 (2003), 2–30.
T. Kleinjung, K. Aoki, J. Franke, A. K. Lenstra, E. Thomé, J. W. Bos, P. Gaudry, A. Kruppa, P. L. Montgomery,
D. A. Osvik, et al. 2010. Factorization of a 768-bit RSA modulus. In Advances in Cryptology (CRYPTO’10).
Springer, 333–350.
V. Kliuchnikov, D. Maslov, and M. Mosca. 2013. Asymptotically optimal approximation of single qubit
unitaries by Clifford and T circuits using a constant number of ancillary qubits. Phys. Rev. Lett. 110 (2013),
190502.
T. D. Ladd, F. Jelezko, R. Laflamme, Y. Nakamura, C. Monroe, and J. L. O’Brien. 2010. Quantum computers.
Nature 464, 7285 (2010), 45–53.
T. S. Metodi, D. D. Thaker, A. W. Cross, F. T. Chong, and I. L. Chuang. 2005. A quantum logic array mi-
croarchitecture: Scalable quantum data movement and computation. In Proceedings of the 38th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO’38), 12–23.
C. Monroe and J. Kim. 2013. Scaling the ion trap quantum processor. Science 339 (2013), 1164.
C. Monroe, R. Raussendorf, A. Ruthven, K. R. Brown, P. Maunz, L.-M. Duan, and J. Kim. 2014. Large scale
modular quantum computer architecture with atomic memory and photonic interconnects. Phys. Rev. A
89 (2014), 022317.
M. A. Nielsen and I. L. Chuang. 2000. Quantum Computation and Quantum Information. Cambridge
University Press.
S. Olmschenk, K. C. Younge, D. L. Moehring, D. N. Matsukevich, P. Maunz, and C. Monroe. 2007.
Manipulation and detection of a trapped Yb+ hyperfine qubit. Phys. Rev. A 76 (2007), 052314.
P. W. Shor. 1997. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum
computer. SIAM J. Comput. 26, 5 (1997), 1484–1509.
A. M. Steane. 1996. Error correcting codes in quantum theory. Phys. Rev. Lett. 77 (1996), 793–797.
K. M. Svore, A. V. Aho, A. W. Cross, I. Chuang, and I. L. Markov. 2006. A layered software architecture for
quantum computing design tools. Computer 39 (2006), 74–83. DOI:http://dx.doi.org/10.1109/MC.2006.4
D. D. Thaker, T. S. Metodi, A. W. Cross, I. L. Chuang, and F. T. Chong. 2006. Quantum memory hierarchies:
Efficient designs to match available parallelism in quantum computing. In ACM SIGARCH Computer
Architecture News, Vol. 34. IEEE Computer Society, 378–390.
R. Van Meter and C. Horsman. 2013. A blueprint for building a quantum computer. Commun. ACM 56, 10
(2013), 84–93.
R. Van Meter and K. M. Itoh. 2005. Fast quantum modular exponentiation. Phys. Rev. A 71, 5 (May 2005),
052320.
R. Van Meter, W. J. Munro, K. Nemoto, and K. M. Itoh. 2008. Arithmetic on a distributed-memory quantum
multicomputer. J. Emerg. Technol. Comput. Syst. 3, Article 2 (2008), 23 pages.
V. Vedral, A. Barenco, and A. Ekert. 1996. Quantum networks for elementary arithmetic operations. Phys.
Rev. A 54 (1996), 147–153.
M. Whitney, N. Isailovic, Y. Patel, and J. Kubiatowicz. 2007. Automated generation of layout and control for
quantum circuits. In Proceedings of the 4th International Conference on Computing Frontiers, 83–94.
M. G. Whitney, N. Isailovic, Y. Patel, and J. Kubiatowicz. 2009. A fault tolerant, area efficient architecture
for Shor’s factoring algorithm. ACM SIGARCH Comput. Arch. News 37, 3 (2009), 383–394.
B. Zeng, A. Cross, and I. L. Chuang. 2011. Transversality versus universality for additive quantum codes.
IEEE Trans. Inf. Theory 57, 9 (2011), 6272–6284.
X. Zhou, D. W. Leung, and I. L. Chuang. 2000. Methodology for quantum logic gate construction. Phys. Rev.
A 62 (2000), 052316.