VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22
An Efficient Implementation of Advanced Encryption
Standard on the Coarse-grained Reconfigurable Architecture
Hung K. Nguyen*, Xuan-Tu Tran
SIS Laboratory, VNU University of Engineering and Technology,
144 Xuan Thuy road, Cau Giay district, Hanoi, Vietnam
Abstract
The Advanced Encryption Standard (AES) is currently considered as one of the best symmetric-key block
ciphers. The hardware implementation of the AES for hand-held mobile devices or wireless sensor network
nodes is always required to meet strict constraints in terms of performance, power and cost. Coarse-grained
reconfigurable architectures have recently been proposed as a solution that provides high flexibility, high performance
and low power consumption for next-generation embedded systems. This paper presents a flexible, high-
performance implementation of the AES algorithm on a coarse-grained reconfigurable architecture, called
MUSRA (Multimedia Specific Reconfigurable Architecture). First, we propose a hardware-software partitioning
method for mapping the AES algorithm onto the MUSRA. Second, the parallel and pipelining techniques are
considered thoughtfully to increase total computing throughput by efficiently utilizing the computing resources
of the MUSRA. Some optimizations at both loop transformation level and scheduling level are performed in
order to make better use of instruction-, loop- and task-level parallelism. The proposed implementation has been
evaluated by the cycle-accurate simulator of the MUSRA. Experimental results show that the MUSRA can be
reconfigured to support both encryption and decryption with all key lengths specified in the AES standard. The
performance of the AES algorithm on the MUSRA is better than that of the ADRES reconfigurable processor,
Xilinx Virtex-II, and the TI C64+ DSP.
Received 24 November 2015, revised 06 January 2015, accepted 13 January 2016
Keywords: Coarse-grained Reconfigurable Architecture (CGRA), Advanced Encryption Standard (AES), Reconfigurable
Computing, Parallel Processing.
1. Introduction
The fast development of communication technology enables information to be easily shared globally via the internet, especially with the Internet of Things (IoT). However, it also raises the requirement of securing that information, especially sensitive data such as passwords, bank accounts, personal information, etc. One method to protect sensitive data is to apply a symmetric-key block cipher before and after sending it over the network. The Advanced Encryption Standard (AES), which has been standardized by the National Institute of Standards and Technology (NIST) [1], is currently considered one of the best symmetric-key block ciphers. With a block size of 128 bits and a variable key length of 128, 192 or 256 bits, the AES has been proven to be a robust cryptographic algorithm against illegal access.
________
* Corresponding author. E-mail: kiemhung@vnu.edu.vn
The hardware implementation of the AES
for modern embedded systems such as hand-
held mobile devices or wireless sensor network
(WSN) nodes always gives designers some
challenges such as reducing chip area and
power consumption, increasing application performance, shortening time-to-market, and simplifying the updating process. Besides, these systems are often designed not only for a specific application but also for multiple applications. Such sharing of resources by several applications makes the system cheaper and more versatile. Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), and Application-Specific Instruction Set Processors (ASIPs) have been used for implementing mobile multimedia systems. However, none of them meets all of the above challenges [2]. The software implementation of the AES algorithm using processors (e.g. [3]) is usually very flexible and usually targets applications where flexibility has a higher priority than implementation efficiency in terms of power
consumption, area, and performance. In contrast,
the ASIC implementation of the AES algorithm
(e.g. [4]) usually offers optimized performance and power consumption. However, the drawback of ASIC is lower flexibility. Moreover, the high cost of designing and manufacturing the chip masks is becoming an increasingly important factor that limits the
application scope of ASIC. Recently, a very promising solution is reconfigurable computing systems (e.g. Zynq-7000 [5], ADRES [6], etc.) that integrate many heterogeneous processing resources such as software-programmable microprocessors (µP), hardwired IP (Intellectual Property) cores, reconfigurable hardware architectures, etc., as shown in Figure 1. To program such a system, a target application is first represented intermediately as a series of tasks that depend on each other through a Control and Data Flow Graph (CDFG) [7], and then partitioned and
mapped onto the heterogeneous computational
and routing resources of the system. Especially,
computation-intensive kernel functions of the
application are mapped onto the reconfigurable
hardware so that they can achieve high
performance approximately equivalent to that
of ASIC while maintaining a degree of
flexibility close to that of DSP processors. By
dynamically reconfiguring hardware,
reconfigurable computing systems allow many hardware tasks to be mapped onto the same hardware platform, thus reducing the area and power consumption of the design [8].

Figure 1. System-level application model of CGRA (a µP with data and instruction memories, a DPLL, hardwired IP cores, and the CGRA connected to an AMBA AHB bus through the AHB/CGRA interface).
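To make the CDFG representation mentioned above concrete, the following is a minimal, hypothetical C sketch of a task node carrying control/data dependences; the field names are illustrative only and do not reflect the actual MUSRA tool flow.

    /* Hypothetical CDFG task node used before HW/SW partitioning.
       Field names are illustrative only. */
    typedef struct cdfg_node {
        const char        *name;       /* task or kernel-loop name           */
        int                is_kernel;  /* computation-intensive loop -> RCA  */
        struct cdfg_node **preds;      /* tasks producing this task's inputs */
        int                n_preds;    /* number of predecessor tasks        */
    } cdfg_node_t;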
The reconfigurable hardware is generally
classified into the Field Programmable Gate
Array (FPGA) and coarse-grained dynamically
reconfigurable architecture (CGRA). A typical
example of the FPGA-based reconfigurable
SoC is Xilinx Zynq-7000 devices [5]. Generally,
FPGAs support the fine-grained reconfigurable
fabric that can operate and be configured at bit-
level. FPGAs are extremely flexible due to their
higher reconfigurable capability. However, the
FPGAs consume more power and have more
delay and area overhead due to greater quantity
of routing required per configuration [9]. This
limits the applicability of FPGAs to mobile
devices. To overcome the limitation of the
FPGA-like fine-grained reconfigurable devices,
we developed and modeled a coarse-grained
dynamically reconfigurable architecture, called
MUSRA (Multimedia Specific Reconfigurable
Architecture) [10]. The MUSRA is a high-
performance, flexible platform for a domain of
applications in multimedia processing. In
contrast with FPGAs, the MUSRA aims at
reconfiguring and manipulating data at the word level. The MUSRA was proposed to
exploit high data-level parallelism (DLP),
instruction-level parallelism (ILP), and task-level parallelism (TLP) of the computation-
intensive loops of an application. The MUSRA
also supports the capability of dynamic
reconfiguration by enabling the hardware fabrics to be reconfigured into different functions even while the system is running.
In this paper, we propose a solution for
implementing the AES algorithm on the
platform of the MUSRA-based system. The
AES algorithm is firstly analyzed and optimized,
and then HW/SW (Hardware/Software)
partitioned and scheduled to be executed on the
MUSRA-based system. The experimental
results show that our proposal achieves a throughput of 29.71 instructions per cycle on average. Our implementation has been
compared to the similar works on ADRES
reconfigurable processor [6], Xilinx Virtex-II
[11], and TI C64+ DSP [3]. Our
implementation is about 6.9 times, 2.2 times,
and 1.6 times better than that of TI C64+ DSP,
Xilinx Virtex-II, and ADRES, respectively.
The rest of the paper is organized as follows.
The MUSRA architecture and the AES
algorithm are presented in Section 2 and
Section 3, respectively. Section 4 presents the
mapping of the AES algorithm onto the MUSRA-
based system. In Section 5, simulation results
and the evaluation of the AES algorithm on the
MUSRA-based system in terms of flexibility
and performance are reported and discussed.
Finally, conclusions are given in Section 6.
2. MUSRA Architecture

2.1. Architecture Overview

Figure 2. MUSRA architecture (the AHB/CGRA interface, input and output DMAs with the DDMAC, the context memory with its CDMAC and context parser, the data memory, the GRF, the input/output FIFOs, and the 8×8 RC array whose rows are linked by crossbar switches).

The MUSRA is composed of a Reconfigurable Computing Array (RCA), input/output FIFOs, a Global Register File (GRF), data/context memory subsystems, DMA (Direct Memory Access) controllers, etc. (Figure 2). The data/context memory subsystems consist of storage blocks and DMA controllers (i.e. CDMAC and DDMAC). The RCA is an array of 8×8 RCs (Reconfigurable Cells) that can be partially configured to implement computation-intensive tasks. The input and output FIFOs are the I/O buffers between the data memory and the RCA. Each RC can get its input data from the input FIFO and/or the GRF, and store its results back to the output FIFO. These FIFOs are all 512 bits wide and 8 rows deep, and can load/store sixty-four bytes or thirty-two 16-bit words per cycle. Notably, the input FIFO can broadcast data to every RC that has been configured to receive data from the input FIFO. This mechanism aims at exploiting the data that is reusable between several iterations. The interconnection between two neighboring rows of RCs is implemented by a crossbar switch. Through the crossbar switch, an RC can get results from an arbitrary RC in the row above it. The Parser decodes the configuration information read from the Context Memory, and then generates the control signals that ensure the RCA executes accurately and automatically.
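As a quick illustration of the FIFO geometry described above (512-bit rows, 8 rows deep, 64 bytes or 32 16-bit words per cycle), a trivial C view of one FIFO row might look as follows; the type names are ours, not part of the MUSRA specification.

    #include <stdint.h>

    /* One 512-bit FIFO row can be read as 64 bytes or 32 16-bit words. */
    typedef union {
        uint8_t  bytes[64];
        uint16_t words[32];
    } fifo_row_t;

    /* An input/output FIFO is 8 such rows deep. */
    typedef struct {
        fifo_row_t row[8];
    } io_fifo_t;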
The RC (Figure 3) is the basic processing unit of the RCA. Each RC includes a data-path that can execute signed/unsigned fixed-point 8/16-bit operations with two or three source operands, such as arithmetic and logical operations, multiplication, and multimedia application-specific operations (e.g. barrel shift, shift-and-round, absolute differences, etc.). Each RC also includes a local register called LOR. This register can be used either to adjust the operating cycles of the pipeline or to store coefficients when a loop is mapped onto the RCA. A set of configuration registers, which stores the configuration information for the RC, is called a layer. Each RC contains two layers that can operate in a ping-pong fashion to reduce the configuration time.
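A rough C model of one RC with its two ping-pong configuration layers is sketched below; the field names and widths are assumptions made for illustration, since the exact context encoding is not given in this excerpt.

    #include <stdint.h>

    /* One configuration layer of an RC (illustrative encoding). */
    typedef struct {
        uint8_t opcode;   /* ALU, multiplier, or application-specific operation */
        uint8_t src_a;    /* operand A: input FIFO, GRF, or an RC in the row above */
        uint8_t src_b;    /* operand B */
        uint8_t src_c;    /* optional third operand */
        uint8_t use_lor;  /* take a coefficient from the local register LOR */
    } rc_layer_t;

    /* The RC itself: two layers operating in ping-pong fashion. */
    typedef struct {
        rc_layer_t layer[2];
        uint16_t   lor;      /* pipeline-balancing delay or loop coefficient */
        uint16_t   out_reg;  /* registered data-path output */
        int        active;   /* layer currently driving the data-path */
    } rc_t;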
Figure 3. RC architecture (configuration registers organized in two layers, operand multiplexers for inputs A/B/C, the DATAPATH with its output register, and the local register LOR).

The data processed by the RCA are classified into two types: variables are streamed into the RCA through the input FIFO, while constants are fed into the RCA via either the GRF for scalar constants or the LOR array for array constants. Constants are further classified into global constants and local constants. Global constants are determined at compile-time; therefore, they are initialized in the context memory of the MUSRA at compile-time and loaded into the GRF/LORs while configuring the RCA. Local constants (or immediate values) are not determined at compile-time but are results generated by other tasks at run-time; therefore, they need to be loaded dynamically into the GRF/LCRs by configuration words.

2.2. Configuration Model

The configuration information for the MUSRA is organized into packets called contexts. A context specifies a particular operation of the RCA core (i.e. the operation of each RC, the interconnection between RCs, the input sources, the output locations, etc.) as well as the control parameters that control the operation of the RCA core. The total length of a context is 128 32-bit words. An application is composed of one or more contexts that are stored in the context memory of the MUSRA. The function of the MUSRA is reconfigured dynamically at run-time according to the required hardware tasks. To deal with the huge configuration overhead of reconfigurable hardware, the proposed design of the MUSRA supports a mechanism to pre-load and pre-decode the configuration context from the context memory into the configuration layers of the RCA. By this method, the configuration of the MUSRA can take place behind the execution of the RCA. As a result, once the RCA finishes calculating with the current context, it can immediately be changed to the next context.
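The pre-load/pre-decode mechanism can be pictured with the following C sketch: while the RCA runs with the active layer, the next 128-word context is copied into the shadow layer so that the context switch itself costs no stall. The function and type names are assumptions, not the real MUSRA interface.

    #include <stdint.h>
    #include <string.h>

    #define CONTEXT_WORDS 128           /* one context = 128 32-bit words */

    typedef struct { uint32_t word[CONTEXT_WORDS]; } context_t;

    typedef struct {
        context_t layer[2];             /* ping-pong configuration layers  */
        int       active;               /* layer currently driving the RCA */
    } rca_config_t;

    /* Pre-load the next context behind the execution of the current one. */
    static void preload_next_context(rca_config_t *cfg, const context_t *next) {
        memcpy(&cfg->layer[cfg->active ^ 1], next, sizeof(context_t));
    }

    /* Once the RCA finishes with the current context, switch immediately. */
    static void switch_context(rca_config_t *cfg) {
        cfg->active ^= 1;
    }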
For purpose of mapping, a kernel loop is first analyzed and loop transformed (e.g. loop unrolling, loop pipelining, loop blocking, etc.) in order to expose inherent parallelism and data locality that are then exploited to maximize the computation performance on the target architecture. Next, the body of the loop is represented by data-flow graphs (DFGs) as shown in Figure 4. Thereafter, DFGs are mapped onto RCA by generating configuration information, which relate to binding nodes to the RCs and edges to the interconnections. Finally, these DFGs are scheduled in order to execute automatically on RCA by generating the corresponding control parameters for the CGRA’s controller. Once configured for a certain loop, RCA operates as the hardware dedicated for this loop. When all iterations of loop have completed, this loop is removed from the RCA, and the other loops are mapped onto the RCA. 14 H.K. Nguyen, X.T. Tran / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 2 (2016) 10-22 x y InputFIFO z t x y Stage1 × PE LOR PE LOR NI = 2 Data broadcasted CLK1 directly to every RC CLK2 x y Input #1 × z t Input #2 LOAD -EXECUTION z t Stage2 + PE LOR PE LOR + CLK3 EXECUTION & 35 Stage3 & PE LOR PE LOR GRF(0) 0 v 1 w OutputFIFO CLK4 CLK5 - v A Output #1 CLK6 w Output #2 NO = 2 (a) STORE-EXECUTION Stage4 - PE TD PE TD v OUT_FIFO(0) Stage4 A PE TD PE TD w OUT_FIFO(0) (b) Figure 4. (a) DFG representation of a simple loop body, and (b) its map onto RCA. The execution of a loop is scheduled so that the different phases of successive iterations are overlapped each other as much as possible. Scheduling also needs to ensure that there are not any conflicts between resources as multiple phases take place simultaneously. Parallel processing increases not only the computation performance but also the pressure on the data bandwidth. The system’s bandwidth is necessary to ensure that data is always available for all resources running concurrently without the IDLE state. One way to increase data availability is to exploit the data locality that refers to capability of data reuse within a short period of time [12]. Exploiting the data locality has the potential to increase the processing efficiency of the system because the data can be cached in the internal memory for reuse later, thus reducing stalled times due to waiting for external memory accesses. Moreover, the data reuse also has the potential to minimize the number of access to external memory, thus achieves a significant reduction in the power consumption [13]. Compared with the execution model in [14], the MUSRA’s execution model exploits the overlapping data between two successive iterations, so it can enhance the performance and reduce the input data bandwidth [10]. In this model, RCA core can start computing as soon as the data of the first input appears on the input of the RCA, so LOAD phase and EXECUTION phase of the same iteration can happen simultaneously. In other words, our execution model allows overlapping three phases LOAD, EXECUTION, STORE of the same iteration as much as possible. As shown in Figure 4, an iteration of RCA core in the MUSRA’s model is started by LOAD-EXECUTION phase, and then is EXECUTION phase, finally finished by STORE-EXECUTION phase. On the other hand, this model also allows the data of the next iteration be LOADed simultaneously with the data of the current iteration, so it maximizes not only the level of overlapping between the consecutive iterations but also the data reuse [10]. 3. 
3. Advanced Encryption Standard

The overall structure of the Advanced Encryption Standard (AES) algorithm, which includes both the encryption and the decryption process, in Electronic Codebook (ECB) mode is depicted in Figure 5 [1]. The AES is an iterated cryptographic block cipher with a block length of 128 bits, which means that the input data is divided into 128-bit blocks and encrypted independently through a sequence of rounds. During the cipher process, the 128-bit input block is arranged into a 4×4 matrix of bytes so that the first four bytes of the input block are located in the first column of the 4×4 matrix, the next four bytes are located in the second column, and so on. At the output of the last round, the 4×4 matrix of bytes is rearranged into a 128-bit output block. This 4×4 matrix is referred to as the state array in the context of the AES algorithm. The AES standard supports three key lengths: 128, 192 or 256 bits. The number of rounds to be executed in an AES encryption or decryption process depends on the used key length, as shown in Eq. (1). The round keys are derived from the original key thanks to the key expansion unit.

n = Key_Length / 32 + 6    (1)

Figure 5. The overall structure of the AES algorithm [1] (128-bit plaintext and ciphertext blocks, an initial Add Round Key with w0-w3, then rounds 1 to n keyed by the words w4-w7 up to w4n-w4n+3 expanded from the 128/192/256-bit key; decryption applies the rounds in reverse order).

Figure 6. The overall structure of a round: (a) an encryption round (Substitute Bytes, Shift Rows, Mix Columns, Add Round Key) and (b) a decryption round (the corresponding inverse steps together with Add Round Key, in a different order).

Except for the last round, all rounds are identical and include four steps as shown in Figure 6. Notice that the last round (Round n) does not have the "Mix Columns" and "Inverse Mix Columns" steps for the encryption and the decryption, respectively. Also notice that the sequence in which the steps are performed is different for the encryption and the decryption.
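The round-count formula of Eq. (1) and the column-major state arrangement described above can be written directly in C; this is a sketch of the standard definitions (the function names are ours), not of the MUSRA mapping.

    #include <stdint.h>

    /* Eq. (1): n = Key_Length/32 + 6, so 128 -> 10, 192 -> 12, 256 -> 14 rounds. */
    static int num_rounds(int key_length_bits) {
        return key_length_bits / 32 + 6;
    }

    /* The first four input bytes fill column 0 of the 4x4 state, the next four
       fill column 1, and so on. */
    static void load_state(const uint8_t in[16], uint8_t state[4][4]) {
        for (int c = 0; c < 4; c++)
            for (int r = 0; r < 4; r++)
                state[r][c] = in[4 * c + r];
    }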
4. Implementation

Motivated by the demand for higher throughput and flexibility, as well as low power consumption, for applications such as video conferencing, security IP cameras, etc., in this section we describe our optimization method for improving the performance of the AES algorithm on the architecture of the MUSRA-based system. In this work, we have mapped both the AES encryption and the AES decryption with all key-length options onto the MUSRA-based system. However, to simplify the presentation in this section, we will focus on the AES encryption and assume that the key length is 128 bits. We started with the C software implementation of the AES algorithm and then paid attention to analyzing the source code to identify the computation-intensive loops of the C software. Besides, since no more parallelism is available in the application when processing a single block, loop transformations and source-level transformations are applied to the kernel loops to improve parallelism. Next, the kernel loops are represented intermediately by DFGs and mapped onto the RCA to increase the total computing throughput. Finally, we propose a scheduling scheme to manage the dynamically reconfigurable operation of the system. The scheduling scheme also takes charge of synchronizing the data communication between tasks and managing the conflicts between hardware resources.

4.1. Hardware/Software Partition

The structure of the AES encryption algorithm in Figure 5 can be modeled by the C source code shown in Figure 7(a). The AES encryption program is represented by two FOR loops that are denoted as block_loop and round_loop.

(a) Original code:

    KeyExpansion();
    // process all blocks of the plaintext input file
    for (block = 1; block <= Nb; block++)
    {   // block_loop
        AddRoundKey(0);
        // first Nr-1 rounds
        for (round = 1; round < Nr; ++round)
        {   // round_loop
            SubBytes();
            ShiftRows();
            MixColumns();
            AddRoundKey(round);
        }
        // the last round
        SubBytes();
        ShiftRows();
        AddRoundKey(Nr);
    }

(b) Code after loop transformations:

    KeyExpansion();
    for (block = 1; block <= Nb; block++)
    {   // Hardware
        AddRoundKey(0);
    }
    // first Nr-1 rounds
    for (round = 1; round < Nr; ++round)
    {
        for (block = 1; block <= Nb; block++)
        {   // Software
            SubBytes();
            ShiftRows();
        }
        for (block = 1; block <= Nb; block++)
        {   // Hardware
            MixColumns();
            AddRoundKey(round);
        }
    }
    // the last round
    for (block = 1; block <= Nb; block++)
    {   // Software
        SubBytes();
        ShiftRows();
    }
    for (block = 1; block <= Nb; block++)
    {   // Hardware
        AddRoundKey(Nr);
    }

Figure 7. C code for the AES encryption algorithm.

    Each sample counts as 0.01 seconds.
      %       cumulative    self                    self
     time      seconds     seconds      calls      us/call
    32.82       29.75       29.75    184549365      0.16
    29.72       56.69       26.94    150994935      0.18
    26.85       81.03       24.34    167772150      0.15
     5.57       86.08        5.05    167772150      0.03
     2.06       87.95        1.87     16777215      0.11
     1.86       89.64        1.69     16777215      0.10
     0.82       90.38        0.74
     0.00       90.38        0.00           40      0.00
     0.00       90.38        0.00            1      0.00
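For reference, the two steps that Figure 7(b) assigns to hardware, MixColumns and AddRoundKey, have the following standard C definitions over the 4×4 state; this is a textbook sketch of the FIPS-197 operations, not the MUSRA data-flow graphs derived from them.

    #include <stdint.h>

    /* Multiply by x (i.e., by 2) in GF(2^8) with the AES reduction polynomial. */
    static uint8_t xtime(uint8_t x) {
        return (uint8_t)((x << 1) ^ ((x >> 7) * 0x1b));
    }

    /* MixColumns: each state column is multiplied by the fixed AES matrix. */
    static void MixColumns(uint8_t state[4][4]) {
        for (int c = 0; c < 4; c++) {
            uint8_t a0 = state[0][c], a1 = state[1][c];
            uint8_t a2 = state[2][c], a3 = state[3][c];
            uint8_t t  = a0 ^ a1 ^ a2 ^ a3;
            state[0][c] ^= t ^ xtime(a0 ^ a1);
            state[1][c] ^= t ^ xtime(a1 ^ a2);
            state[2][c] ^= t ^ xtime(a2 ^ a3);
            state[3][c] ^= t ^ xtime(a3 ^ a0);
        }
    }

    /* AddRoundKey: XOR the state with the 16-byte round key (column-major). */
    static void AddRoundKey(uint8_t state[4][4], const uint8_t rk[16]) {
        for (int c = 0; c < 4; c++)
            for (int r = 0; r < 4; r++)
                state[r][c] ^= rk[4 * c + r];
    }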
