Hardware-Focused Performance<br />
Comparison for the Standard Block<br />
Ciphers AES, Camellia, and Triple-DES<br />
Akashi Satoh and Sumio Morioka<br />
Tokyo Research Laboratory<br />
IBM Japan Ltd.
Contents<br />
• Compact and High-Speed Architectures for AES<br />
• Compact and High-Speed Architectures for Camellia<br />
• Compact and High-Speed Architectures for Triple-DES<br />
• S-Box Comparison between AES and Camellia<br />
• Hardware Performance Comparison in ASIC<br />
• Conclusion
Compact and High-Speed<br />
Architectures for AES
AES Algorithm<br />
• 128-bit data blocks with 128-/192-/256-bit keys<br />
• SPN structure using 4 primitive functions takes 11 rounds (for a 128-bit key: the initial AddRoundKey plus 10 cipher rounds)<br />
Primitive functions and their building blocks:<br />
• AddRoundKey: 128 x 1-bit XOR<br />
• SubBytes: 16 x 8-bit S-Box<br />
• ShiftRows: 4 x 32-bit rotation<br />
• MixColumns: 4 x 32-bit permutation<br />
[Figure: the four primitive functions applied to the 4x4 byte state a_ij, producing b_ij. SubBytes passes each byte a_ij through the S-Box; ShiftRows leaves row 0 unshifted and rotates rows 1, 2, and 3 left by 1, 2, and 3 bytes; MixColumns transforms each column a_0j..a_3j into b_0j..b_3j using c(x); AddRoundKey combines a_ij with the round key bytes k_ij.]<br />
Ciphering Block<br />
128-bit plain text<br />
AddRoundKey<br />
SubBytes, ShiftRows, MixColumns, AddRoundKey (repeated for every round except the last)<br />
SubBytes, ShiftRows, AddRoundKey (final round, no MixColumns)<br />
128-bit cipher text<br />
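The round flow of the ciphering block can be written out as a short Python sketch. The function names `shift_rows` and `round_schedule` and the row-major 4x4 state layout are assumptions made for illustration; only the ordering of the primitive functions and the row rotation amounts come from the slide.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r positions (row 0 is not shifted)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def round_schedule(rounds=10):
    """List the primitive functions in execution order for one encryption."""
    ops = ["AddRoundKey"]  # initial key addition
    for _ in range(rounds - 1):  # full rounds
        ops += ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    ops += ["SubBytes", "ShiftRows", "AddRoundKey"]  # final round skips MixColumns
    return ops


# Symbolic state using the a_ij names from the slide.
state = [[f"a{r}{c}" for c in range(4)] for r in range(4)]
shifted = shift_rows(state)
schedule = round_schedule(10)
```

With `rounds=10` (the 128-bit key case) the schedule contains 11 AddRoundKey steps, which is one way to read the "11 rounds" figure quoted on the slide.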
Compact Architecture for AES<br />
• Primitive components are shared between encryption, decryption, and key scheduling<br />
• A 32-bit data path is used repeatedly to process 128-bit data<br />
• The key scheduler reuses the S-Box in the ciphering block while ShiftRows is executing<br />
• Encryption and decryption take 54 clock cycles<br />
[Figure: ciphering block of the compact AES architecture. A 32-bit data path with ShiftRows/InvShiftRows, four S-Boxes built from affine^-1, x^-1, and affine stages on 8-bit lanes, MixColumns and inverse MixColumns units (MxCo, MxCo^-1), 2:1 and 5:1 selectors, and 8-bit data registers.]<br />
High-speed Architecture for AES<br />
• Straightforward implementation with a 128-bit data path<br />
• Encryption and decryption take 10 clocks each<br />
[Figure: high-speed AES architecture. Ciphering block: data input, data register, AddRoundKey, SubBytes/InvSubBytes (affine^-1, x^-1, and affine stages on four 8-bit lanes per 32-bit slice), ShiftRows/InvShiftRows, and MixColumns with its inverse (MxCo, MxCo^-1) selected through 2:1 multiplexers, arranged as 32-bit slices. Key scheduler: secret key register organized as four 32-bit words, 2:1 multiplexers, and MxCo^-1 units.]<br />
Compact and High-Speed<br />
Architectures for Camellia
Camellia Algorithm<br />
• 128-bit data blocks with 128-/192-/256-bit keys<br />
• Two FL/FL^-1 function layers are inserted between three Feistel network blocks<br />
• It takes 22 rounds for both encryption and decryption<br />
[Figure: Camellia structure. The plain text is split into two 64-bit halves and whitened with kw1 and kw2, then passes through three Feistel networks (round keys k1~6, k7~12, k13~18) separated by FL/FL^-1 layers (keys kl1, kl2 and kl3, kl4), and is whitened with kw3 and kw4 to give the cipher text. Each Feistel network applies the F function six times with keys k1 to k6. The F function XORs the 64-bit input with the round key k, passes the eight bytes x1..x8 through the S-Boxes S1, S2, S3, S4, S2, S3, S4, S1 to obtain z1..z8, and applies the P function to produce z'1..z'8. The FL and FL^-1 functions operate on 32-bit halves with AND, OR, and XOR using the 64-bit key kl.]
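The FL and FL^-1 functions are small enough to sketch directly. The version below follows the definition in the published Camellia specification as I understand it (AND with the upper key half followed by a 1-bit left rotation, then OR with the lower key half); the Python names `fl`, `fl_inv`, and `rotl32` are illustrative and not taken from the slides.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def fl(x, kl):
    """FL function on a 64-bit value x with a 64-bit key kl."""
    xl, xr = x >> 32, x & MASK32
    kl_hi, kl_lo = kl >> 32, kl & MASK32
    xr ^= rotl32(xl & kl_hi, 1)  # AND, rotate by 1, XOR into the right half
    xl ^= xr | kl_lo             # OR, XOR into the left half
    return (xl << 32) | xr


def fl_inv(y, kl):
    """FL^-1: the same two steps applied in the opposite order."""
    yl, yr = y >> 32, y & MASK32
    kl_hi, kl_lo = kl >> 32, kl & MASK32
    yl ^= yr | kl_lo
    yr ^= rotl32(yl & kl_hi, 1)
    return (yl << 32) | yr
```

Because FL^-1 only reverses the order of the two steps, both functions can share the same AND, OR, rotate, and XOR logic, which is what makes merging them in hardware attractive.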
32-bit Slice of F function<br />
• A 32-bit S-Box block is used twice to generate the 64-bit output of the F function<br />
• The two 32-bit S-Box outputs are combined through the permutation layer and an XOR operation<br />
[Figure: the F function rebuilt from a 32-bit slice. Two passes through a four-S-Box block, each followed by a permutation with the unused input half set to 0, are XORed to form the 64-bit output.]
Divide and Merge Primitive Functions<br />
• The number of S-Boxes is halved by using a 32-bit slice of the F function<br />
[Figure: F function with key input k, S-Boxes S1, S4, S3, S2, S4, S3, S2, S1, and the P function.]<br />
• The FL/FL^-1 functions and key whitening are merged<br />
[Figure: merged FL/FL^-1 and key whitening circuit built from a 1-bit rotation, an adder, and 2:1 selectors.]<br />
Data Path using 32-bit Slice of F function<br />
• Data are processed as 32-bit blocks in each round<br />
• The right half of the data is always processed, to simplify the controller<br />
[Figure: Camellia data path before and after rearrangement. Original: plain text in two 64-bit halves, key whitening with kw1/kw2, three Feistel networks (k1~6, k7~12, k13~18) separated by FL/FL^-1 layers (kl1 to kl4), final whitening with kw3/kw4, cipher text; each Feistel network applies F six times with k1 to k6. Rearranged: the same sequence with each 64-bit round key split into high and low halves (k1H, k1L, ..., k6H, k6L) and each F replaced by two passes through the 32-bit slice F32.]
Compact Architecture<br />
• The F function is shared between the data randomization block and the key scheduler<br />
• Round keys are generated by repeating 16-bit and 1-bit rotations<br />
• Encryption and decryption take 44 clocks<br />
[Figure: compact Camellia architecture. Key scheduler: 128-bit K_L and K_A registers with 16-bit and 1-bit rotation paths and 4:1/3:1 selectors, delivering 64-bit high (H) and low (L) key halves. Ciphering block: key/data input, 128-bit data register split into 64-bit H and L halves, F32 (32-bit slice of the F function), FL/FL^-1, KeyAdd, and data output.]
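The statement that round keys come from repeated 16-bit and 1-bit rotations can be checked with a small model. In the Camellia specification for 128-bit keys, the subkeys are taken from K_L and K_A rotated left by 0, 15, 30, 45, 60, 77, 94, and 111 bits, so each step advances by 15 or 17 bits. The sketch below assumes a step is realized as a 16-bit left rotation followed by a 1-bit rotation right or left; that decomposition and the helper names are assumptions, not details given on the slide.

```python
MASK128 = (1 << 128) - 1
AMOUNTS = [0, 15, 30, 45, 60, 77, 94, 111]  # cumulative left rotations


def rotl128(x, n):
    """Rotate a 128-bit value left by n bits."""
    n %= 128
    return ((x << n) | (x >> (128 - n))) & MASK128


def rotr128(x, n):
    """Rotate a 128-bit value right by n bits."""
    return rotl128(x, 128 - (n % 128))


def step(x, delta):
    """Advance by 15 or 17 bits using only a 16-bit and a 1-bit rotator."""
    x = rotl128(x, 16)
    return rotr128(x, 1) if delta == 15 else rotl128(x, 1)


def rotated_keys(k):
    """Return the key register contents at every rotation amount in AMOUNTS."""
    out = [k]
    for prev, cur in zip(AMOUNTS, AMOUNTS[1:]):
        out.append(step(out[-1], cur - prev))
    return out
```

Keeping only the two small rotators in the key register avoids a full barrel shifter, at the cost of spending extra clocks on rotation.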
High-Speed Architecture<br />
• The original 64-bit F function is used<br />
• The F function and the FL/FL^-1 functions or key whitening linear functions are executed in the same cycle to reduce the number of clocks<br />
• Encryption and decryption take 18 clocks<br />
[Figure: high-speed Camellia architecture. Key scheduler: 128-bit K_L register with 1-bit, 16-bit, and 17-bit rotation paths and a >>15 path, selected through 2:1 and 4:1 multiplexers. Ciphering block: data/key input, 128-bit data register split into 64-bit H and L halves, 64-bit F function, FL and FL^-1, 2:1/3:1/4:1 selectors, and data output.]
Compact and High-Speed<br />
Architectures for Triple-DES
DES Algorithm<br />
• 64-bit data block with 56-bit key<br />
• 16-round Feistel network<br />
• Triple-DES takes 48 rounds<br />
• S-Box is a random substitution table<br />
• A straightforward implementation yields compact hardware with a high operating frequency<br />
[Figure: DES structure. F function: expansion E (32 to 48 bits), XOR with a 48-bit round key, eight S-Boxes S0 to S7 (6 bits in, 4 bits out each), and permutation P (32 bits). Data path: 64-bit plain text, initial permutation IP, Feistel rounds on 32-bit halves using F, final permutation IP^-1, 64-bit cipher text. The key scheduler supplies a 48-bit key to each round.]
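The Feistel structure explains why one data path serves both encryption and decryption. The sketch below is a generic 16-round Feistel network on a 64-bit block with a deliberately simple placeholder round function; `toy_f` is NOT the DES F function (no E, S-Boxes, or P), and the round keys are arbitrary values chosen for illustration.

```python
MASK32 = 0xFFFFFFFF


def toy_f(half, key):
    """Placeholder round function; any 32-bit mixing function works here."""
    return ((half * 0x9E3779B1) ^ key ^ (half >> 7)) & MASK32


def feistel(block, round_keys):
    """Run one Feistel pass; the last swap is undone, as in DES."""
    left, right = block >> 32, block & MASK32
    for key in round_keys:
        left, right = right, left ^ toy_f(right, key)
    return (right << 32) | left


def triple(block, keys1, keys2, keys3):
    """Encrypt-decrypt-encrypt: three 16-round passes, 48 rounds in total."""
    step1 = feistel(block, keys1)
    step2 = feistel(step1, keys2[::-1])  # decryption = reversed round keys
    return feistel(step2, keys3)


keys = list(range(1, 17))  # 16 arbitrary round keys
plain = 0x0123456789ABCDEF
cipher = feistel(plain, keys)
recovered = feistel(cipher, keys[::-1])
```

Decryption reuses `feistel` unchanged and only reverses the key order, so the hardware needs no separate inverse of the round function.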
Compact and High-Speed Architectures for DES<br />
• Three architectures containing 1-/2-/4-round functions execute Triple-DES in 48/24/12 clocks<br />
• The multi-round-per-clock versions achieve higher throughput because registers are accessed less often and the stacked round functions allow logic compression<br />
[Figure: three DES architectures, ordered from compact to high-speed. 1 round/clock (48 cycles in total), 2 rounds/clock (24 cycles in total), and 4 rounds/clock (12 cycles in total). Each has a 64-bit data register, a 56-bit key register, a key schedule block, and one, two, or four stacked round functions, each fed a 48-bit round key.]
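The cycle counts follow directly from the number of round functions stacked per clock. The sketch below reproduces that arithmetic and shows how throughput would be derived from a clock frequency; the function names are illustrative, and any clock frequency passed in is a hypothetical input, since the slide does not list the operating frequencies.

```python
TOTAL_ROUNDS = 3 * 16  # Triple-DES: three 16-round DES passes


def cycles(rounds_per_clock):
    """Clock cycles needed for one 64-bit block."""
    return TOTAL_ROUNDS // rounds_per_clock


def throughput_mbps(clock_mhz, rounds_per_clock, block_bits=64):
    """Throughput in Mbps for a given (hypothetical) clock frequency in MHz."""
    return block_bits * clock_mhz / cycles(rounds_per_clock)


cycle_counts = [cycles(n) for n in (1, 2, 4)]
```

Stacking rounds lengthens the critical path, so the achievable clock frequency drops as `rounds_per_clock` grows; the slide attributes the net throughput gain to fewer register accesses and to logic compression across the stacked rounds.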
S-Box Comparison between<br />
AES and Camellia
Compact AES S-Box Architecture<br />
• Field conversion from GF(2^8) to GF(((2^2)^2)^2) by isomorphism functions<br />
• The hierarchical architecture of the GF(((2^2)^2)^2) inverter is very compact<br />
• The isomorphism function and the affine transformation are merged into a single XOR matrix<br />
[Figure: compact AES S-Box. The GF(2^8) input passes through the isomorphism into GF(((2^2)^2)^2), the inversion, and the inverse isomorphism merged with the affine transformation, returning to GF(2^8). The GF(((2^2)^2)^2) inverter works on 4-bit halves and contains a GF((2^2)^2) inverter working on 2-bit halves, which in turn contains a GF(2^2) inverter; the levels use x^2 and x^-1 blocks and multipliers by the constants λ and φ.]
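As a functional reference for what the inverter and affine stages compute, the sketch below evaluates the AES S-Box directly in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1. It deliberately does not implement the composite-field GF(((2^2)^2)^2) circuit from the slide: inversion is done by plain exponentiation (x^254), which is slow but easy to verify. The helper names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse as a^254; maps 0 to 0, as AES requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine(x):
    """AES affine transformation: b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i, c = 0x63."""
    out = 0
    for i in range(8):
        bit = (x >> i) ^ (0x63 >> i)
        for offset in (4, 5, 6, 7):
            bit ^= x >> ((i + offset) % 8)
        out |= (bit & 1) << i
    return out


def sbox(x):
    """AES SubBytes for one byte: inversion followed by the affine transformation."""
    return affine(gf_inv(x))
```

A composite-field implementation must produce exactly the same 256 outputs, so a table built from `sbox` is a convenient golden reference when checking such a circuit.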
Small Camellia S-Box Architecture<br />
• Camellia uses a GF((2^4)^2) inverter<br />
• The GF(2^4) inverter is implemented as a lookup table<br />
[Figure: small Camellia S-Box. Affine transformation F, inversion in GF((2^4)^2), affine transformation H. The GF((2^4)^2) inverter works on 4-bit halves with an x^2 block, a multiplier by the constant ω, and a GF(2^4) inverter (x^-1).]
S-Box Performance in ASIC<br />
• Synthesized using a 0.13 um ASIC library<br />
• The GF(((2^2)^2)^2) inverter is 26% smaller than the GF((2^4)^2) inverter<br />
• The lookup table inverter is 2 times faster but 3 times bigger<br />
• The performance of the S-Boxes is almost the same for AES and Camellia<br />
Component | Method | Size (gates) | Delay (ns)<br />
Inverter | GF((2^4)^2) | 227 | 2.28<br />
Inverter | GF(((2^2)^2)^2) | 169 | 2.61<br />
Inverter | Table | 540~549 | 1.31~1.39<br />
AES SubBytes | GF(((2^2)^2)^2) | 230 | 3.67<br />
AES SubBytes | Table | 562 | 1.30<br />
AES InvSubBytes | GF(((2^2)^2)^2) | 219 | 3.89<br />
AES InvSubBytes | Table | 558 | 1.33<br />
Camellia S1~S4 | GF((2^4)^2) | 256 | 3.45<br />
Camellia S1~S4 | Table | 540~562 | 1.31~1.40
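The three ratios quoted in the bullets can be recomputed from the inverter figures on this slide. The sketch assumes that 227 gates / 2.28 ns belongs to the GF((2^4)^2) inverter and 169 gates / 2.61 ns to the GF(((2^2)^2)^2) inverter (the reading consistent with the 26% claim), and it uses the lower end of the lookup table ranges.

```python
# (gates, delay in ns) per inverter implementation, as read from the slide
inverter = {
    "GF((2^4)^2)": (227, 2.28),
    "GF(((2^2)^2)^2)": (169, 2.61),
    "table": (540, 1.31),  # lower end of 540~549 gates and 1.31~1.39 ns
}

small_gates, small_delay = inverter["GF(((2^2)^2)^2)"]
table_gates, table_delay = inverter["table"]

# Fraction of area saved relative to the GF((2^4)^2) inverter.
area_saving = 1 - small_gates / inverter["GF((2^4)^2)"][0]
# How much bigger and faster the lookup table is than the smallest inverter.
table_area_ratio = table_gates / small_gates
table_speedup = small_delay / table_delay
```

The computed values land close to the rounded figures on the slide: roughly a quarter less area, about three times the area for the table, and about twice the speed.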
Hardware Performance<br />
Comparison in ASIC
AES/Camellia/Triple-DES in ASIC<br />
• In the compact architectures, the three algorithms show similar performance<br />
• In the high-speed architectures, the 128-bit Feistel cipher Camellia is 2 times faster than the 64-bit Feistel cipher Triple-DES<br />
• The SPN cipher AES is 1.6 times faster than Camellia<br />
[Figure: throughput (Mbps, 0 to 3500) versus gate count (Kgates, 0 to 60), 0.13 um ASIC, worst case. Series: AES (Ours), Camellia (Ours), Triple-DES (Ours), AES (Conventional), Camellia (Conventional). Labeled points: 311 Mbps/5.4 Kgates, 326 Mbps/6.5 Kgates, 251 Mbps/5.5 Kgates, 3.46 Gbps/36.9 Kgates, 2.15 Gbps/29.8 Kgates, 1.07 Gbps/16.9 Kgates. The upper-left region is marked "Small & Fast".]
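The speed ratios in the bullets can be cross-checked against the labeled high-speed points. The sketch assumes that 3.46 Gbps, 2.15 Gbps, and 1.07 Gbps belong to AES, Camellia, and Triple-DES respectively (the assignment that reproduces the stated 1.6x and 2x ratios) and combines them with the cycle counts from the high-speed architecture slides. The implied clock frequency is a derived estimate that assumes one block completes every `cycles` clocks with no overhead; it is not a figure stated in the slides.

```python
# cipher: (throughput in Mbps, block size in bits, clocks per block)
high_speed = {
    "AES": (3460.0, 128, 10),
    "Camellia": (2150.0, 128, 18),
}
TRIPLE_DES_MBPS = 1070.0


def implied_clock_mhz(mbps, block_bits, cycles):
    """Clock frequency implied by throughput = block_bits * f / cycles."""
    return mbps * cycles / block_bits


aes_vs_camellia = high_speed["AES"][0] / high_speed["Camellia"][0]
camellia_vs_tdes = high_speed["Camellia"][0] / TRIPLE_DES_MBPS
clocks = {name: implied_clock_mhz(*vals) for name, vals in high_speed.items()}
```

Under these assumptions both designs would run at a few hundred MHz, and the throughput gap comes mainly from the number of clocks per block (10 versus 18) rather than from the clock rate.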
Hardware Efficiency<br />
• Hardware efficiency, defined as throughput per gate, is compared<br />
• High-speed versions show higher efficiency because no additional circuits or delays for component sharing are required<br />
• High-speed versions with composite field S-Box show the best performance<br />
[Figure: bar chart of hardware efficiency (Kbps/gate, 0 to 140), 0.13 um ASIC, worst case, with bars for area optimized, speed optimized, and lookup table versions. Bars are labeled with cycle counts, in extraction order: 54, 44, 32, 21, 11, 10, 10 (lookup table); 44, 18, 18 (lookup table) for Camellia; 48, 24, 12 for Triple-DES.]
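The efficiency metric is simple to recompute from the throughput and gate count pairs labeled on the preceding throughput chart. The sketch keys each point by its label rather than by cipher, since the pairing of the compact points with individual ciphers is not stated in the text; splitting them into "compact" and "high-speed" groups by gate count is an assumption.

```python
# (throughput in Mbps, size in Kgates), as labeled on the throughput chart
compact = {
    "311Mbps/5.4Kgates": (311, 5.4),
    "326Mbps/6.5Kgates": (326, 6.5),
    "251Mbps/5.5Kgates": (251, 5.5),
}
high_speed = {
    "3.46Gbps/36.9Kgates": (3460, 36.9),
    "2.15Gbps/29.8Kgates": (2150, 29.8),
    "1.07Gbps/16.9Kgates": (1070, 16.9),
}


def kbps_per_gate(mbps, kgates):
    """Hardware efficiency: throughput in Kbps divided by the gate count."""
    return (mbps * 1000) / (kgates * 1000)


compact_eff = {k: kbps_per_gate(*v) for k, v in compact.items()}
high_speed_eff = {k: kbps_per_gate(*v) for k, v in high_speed.items()}
```

On these six points every high-speed design comes out more efficient than every compact one, which matches the second bullet above.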
Conclusion