Hardware-Focused Performance Comparison for the Standard Block Ciphers AES, Camellia, and Triple-DES

Akashi Satoh and Sumio Morioka 

Tokyo Research Laboratory 

IBM Japan Ltd.

Contents

• Compact and High-Speed Architectures for AES
• Compact and High-Speed Architectures for Camellia
• Compact and High-Speed Architectures for Triple-DES
• S-Box Comparison between AES and Camellia
• Hardware Performance Comparison in ASIC
• Conclusion

Compact and High-Speed Architectures for AES

AES Algorithm

• 128-bit data blocks with 128-/192-/256-bit keys
• SPN structure built from 4 primitive functions; with a 128-bit key it takes 11 steps (an initial AddRoundKey followed by 10 rounds)

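Primitive functions (width × count):

• XOR (AddRoundKey): 1-bit × 128
• S-Box (SubBytes): 8-bit × 16
• Rotation (ShiftRows): 32-bit × 4
• Permutation (MixColumns): 32-bit × 4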

[Figure: the four AES primitive functions on the 4×4 byte state a_ij. SubBytes applies the S-Box to each byte (a_ij → b_ij). ShiftRows rotates the rows (no shift for row 0, left rotation by 1 for row 1, and so on). MixColumns multiplies each column a_0j..a_3j by c(x) to give b_0j..b_3j. AddRoundKey XORs the state with the round key k_ij.]

Ciphering Block

[Figure: 128-bit plain text → AddRoundKey → (SubBytes → ShiftRows → MixColumns → AddRoundKey), repeated → SubBytes → ShiftRows → AddRoundKey → 128-bit cipher text. The data path is drawn as 8-bit lanes.]
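The order of primitive functions in the ciphering block can be written out as a short Python sketch. The function name `aes_step_sequence` and the default of 10 rounds (the 128-bit-key case) are assumptions made for illustration; the sketch only lists the steps and does not perform any encryption.

```python
def aes_step_sequence(num_rounds=10):
    """Return the ordered primitive-function names applied to one block.

    num_rounds=10 is the 128-bit-key case (assumed here for illustration).
    """
    steps = ["AddRoundKey"]  # initial key addition before the first round
    for _ in range(num_rounds - 1):
        # Full rounds use all four primitive functions.
        steps += ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    # The final round omits MixColumns.
    steps += ["SubBytes", "ShiftRows", "AddRoundKey"]
    return steps


steps = aes_step_sequence()
print(steps.count("AddRoundKey"), steps.count("MixColumns"))
```

Counting the AddRoundKey entries gives the number of round keys a key scheduler has to supply for one block.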

Compact Architecture for AES

• Primitive components are shared between encryption, decryption, and key scheduling
• A 32-bit data path is used repeatedly to process 128-bit data
• The key scheduler reuses the S-Box in the ciphering block while ShiftRows is executing
• Encryption and decryption take 54 clock cycles

[Figure: compact AES ciphering block. A 32-bit ShiftRows/InvShiftRows stage feeds four 8-bit S-Boxes (affine^-1, x^-1 inverter, affine), followed by MixColumns/InvMixColumns (MxCo, MxCo^-1), with 2:1 and 5:1 selectors and 8-bit data registers.]

High-Speed Architecture for AES

• Straightforward implementation with a 128-bit data path
• Encryption and decryption take 10 clocks each

[Figure: high-speed AES architecture. The ciphering block has a data input, a data register, and four 32-bit slices, each containing AddRoundKey, SubBytes/InvSubBytes (affine^-1, x^-1, affine), ShiftRows/InvShiftRows, and MixColumns/InvMixColumns (MxCo, MxCo^-1) with 2:1 selectors. The key scheduler has a secret key register, 2:1 selectors, and MxCo^-1 blocks on four 32-bit words.]
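The effect of the cycle count on throughput can be illustrated with the usual relation throughput = block size × clock frequency / cycles per block. The sketch below uses the 54-cycle and 10-cycle figures from the slides; the 100 MHz clock is a hypothetical value, and applying the same clock to both designs is a simplification, since a wider data path normally changes the critical path.

```python
def throughput_mbps(block_bits, cycles_per_block, clock_mhz):
    """Throughput in Mbps for a design that needs cycles_per_block clocks per block."""
    return block_bits * clock_mhz / cycles_per_block


ASSUMED_CLOCK_MHZ = 100.0  # hypothetical clock, not a figure from the slides

compact = throughput_mbps(128, 54, ASSUMED_CLOCK_MHZ)     # 32-bit data path
high_speed = throughput_mbps(128, 10, ASSUMED_CLOCK_MHZ)  # 128-bit data path

print(f"compact:    {compact:.1f} Mbps")
print(f"high-speed: {high_speed:.1f} Mbps")
print(f"ratio:      {high_speed / compact:.1f}x")
```

At equal clock frequency the ratio is simply 54/10, so any further difference in measured throughput comes from the achievable clock rate of each design.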


Compact and High-Speed Architectures for Camellia

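Camellia Algorithm

• 128-bit data blocks with 128-/192-/256-bit keys
• Two FL/FL^-1 function layers are inserted between three 6-round Feistel network blocks
• It takes 22 steps for a 128-bit key (18 Feistel rounds, 2 FL/FL^-1 layers, and 2 key-whitening steps) for both encryption and decryption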

[Figure: Camellia structure. Overall flow: plain text (two 64-bit halves) → key whitening (kw1, kw2) → Feistel network (k1~6) → FL/FL^-1 (kl1, kl2) → Feistel network (k7~12) → FL/FL^-1 (kl3, kl4) → Feistel network (k13~18) → key whitening (kw3, kw4) → cipher text. Feistel network: six F functions with round keys k1 to k6. F function: the 64-bit input is XORed with the round key, split into bytes x1..x8, passed through S-Boxes S1, S2, S3, S4, S2, S3, S4, S1, and mixed by the P function into z'1..z'8. FL and FL^-1 functions: 64-bit input split into 32-bit halves and combined with kl using AND, OR, and XOR.]
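The FL and FL^-1 functions are small enough to sketch directly. The Python below follows the definition in the public Camellia specification (RFC 3713) rather than anything recoverable from these slides; the names `fl`, `fl_inv`, and `rotl32` and the test values are chosen here for illustration.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def fl(x, kl):
    """Camellia FL function on a 64-bit value x with 64-bit subkey kl."""
    xl, xr = x >> 32, x & MASK32
    kl_l, kl_r = kl >> 32, kl & MASK32
    xr ^= rotl32(xl & kl_l, 1)  # AND with left key half, rotate by 1, XOR
    xl ^= xr | kl_r             # OR with right key half, XOR
    return (xl << 32) | xr


def fl_inv(y, kl):
    """Inverse of fl: the same two steps applied in the opposite order."""
    yl, yr = y >> 32, y & MASK32
    kl_l, kl_r = kl >> 32, kl & MASK32
    yl ^= yr | kl_r
    yr ^= rotl32(yl & kl_l, 1)
    return (yl << 32) | yr


x = 0x0123456789ABCDEF     # arbitrary test data
kl = 0xFEDCBA9876543210    # arbitrary test subkey
assert fl_inv(fl(x, kl), kl) == x
```

Because FL and FL^-1 use the same AND, OR, rotate, and XOR operations in opposite order, a hardware implementation can share most of the logic between the two.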

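32-bit Slice of F function

• A 32-bit S-Box block is used twice to generate the 64-bit F function output
• The two 32-bit S-Box output blocks are combined through the permutation layer and an XOR operation

[Figure: the original F function (S-Box layer followed by the permutation) next to the 32-bit slice, in which one S-Box block is used in two passes with zero-padded inputs to the permutation and the two results are XORed.]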

Divide and Merge Primitive Functions

• The number of S-Boxes is halved by using the 32-bit slice of the F function
• The FL/FL^-1 functions and key whitening are merged

[Figure: F function with key input k, S-Boxes S1, S4, S3, S2, S4, S3, S2, S1, and the P function; merged FL/FL^-1 and key-whitening circuit with a 1-bit rotation and 2:1 selectors.]

Data Path using 32-bit Slice of F function

• Data are processed as 32-bit blocks in each round
• The right half of the data is always processed, to simplify the controller

[Figure: the original Camellia data flow (plain text, key whitening kw1 to kw4, three 6-round Feistel networks with k1~6, k7~12, k13~18, FL/FL^-1 layers with kl1 to kl4, cipher text) next to the rearranged flow for the 32-bit slice, in which each round key is split into high and low halves (k1H, k1L, ..., k6H, k6L) and each F function is replaced by two passes through F32.]

Compact Architecture

• The F function is shared between the data randomization block and the key scheduler
• Round keys are generated by repeating 16-bit and 1-bit rotations
• Encryption and decryption take 44 clocks

[Figure: compact Camellia architecture. Key scheduler: 128-bit K_L and K_A registers, 16-bit and 1-bit rotators, 4:1 and 3:1 selectors. Ciphering block: 128-bit data register (H and L halves), F32, FL/FL^-1, KeyAdd, and 2:1/3:1 selectors, with a shared key/data input and a data output.]
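Generating round keys from repeated 16-bit and 1-bit rotations works because rotations of a 128-bit register compose additively. The sketch below shows the idea in Python; the helper names, the test key, and the choice of 15, 17, and 30 as target rotation amounts are assumptions made for illustration and are not taken from the slides.

```python
MASK128 = (1 << 128) - 1


def rotl128(x, n):
    """Rotate a 128-bit value left by n bits."""
    n %= 128
    return ((x << n) | (x >> (128 - n))) & MASK128


def rotate_by_steps(x, steps):
    """Apply signed rotation steps in order (positive = left, negative = right)."""
    for s in steps:
        x = rotl128(x, s % 128)  # a right rotation by 1 is a left rotation by 127
    return x


key = 0x0123456789ABCDEFFEDCBA9876543210  # arbitrary 128-bit test value

# 15 = 16 - 1, 17 = 16 + 1, 30 = 16 + 16 - 1 - 1
assert rotate_by_steps(key, [16, -1]) == rotl128(key, 15)
assert rotate_by_steps(key, [16, 1]) == rotl128(key, 17)
assert rotate_by_steps(key, [16, 16, -1, -1]) == rotl128(key, 30)
```

With only two fixed rotators the scheduler avoids a general barrel shifter, at the cost of spending more than one clock on some rotation amounts.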

High-Speed Architecture

• The original 64-bit F function is used
• The F function and either the FL/FL^-1 functions or the key-whitening linear functions are executed in the same cycle to reduce the number of clocks
• Encryption and decryption take 18 clocks

[Figure: high-speed Camellia architecture. Key scheduler: 128-bit K_L register with 16-bit, 1-bit, 15-bit, and 17-bit rotations and 2:1/4:1 selectors. Ciphering block: 128-bit data register (H and L halves), 64-bit F function, FL, FL^-1, and 2:1/3:1/4:1 selectors, with a shared data/key input and a data output.]


Compact and High-Speed Architectures for Triple-DES

DES Algorithm

• 64-bit data block with 56-bit key
• 16-round Feistel network
• Triple-DES takes 48 rounds
• The S-Box is a random substitution table
• A straightforward implementation yields compact hardware with a high operating frequency

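[Figure: DES structure. Overall flow: 64-bit plain text → IP → Feistel rounds on two 32-bit halves using the F function → IP^-1 → 64-bit cipher text, with 48-bit round keys supplied by the key scheduler. F function: 32-bit input → expansion E to 48 bits → key addition → eight S-Boxes S0 to S7 (6-bit input, 4-bit output each) → permutation P → 32-bit output.]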

Compact and High-Speed Architectures for DES

• Three architectures containing 1, 2, or 4 round functions execute Triple-DES in 48, 24, or 12 clocks
• The multi-round-per-clock versions obtain higher throughput because register accesses decrease and the logic of the stacked round functions can be compressed

[Figure: three data paths ordered from compact to high-speed. 1 round/clock (48 cycles in total), 2 rounds/clock, and 4 rounds/clock. Each has a 64-bit data register, a 56-bit key register with key schedule, and one, two, or four stacked round functions fed with 48-bit round keys.]
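The clock counts for the three data paths follow directly from the number of rounds. A minimal Python sketch of that arithmetic, with function and constant names chosen here for illustration:

```python
DES_ROUNDS = 16
TRIPLE_DES_ROUNDS = 3 * DES_ROUNDS  # three DES passes, 48 rounds in total


def clocks_per_block(rounds_per_clock):
    """Clocks needed for one Triple-DES block when several rounds are stacked per clock."""
    if TRIPLE_DES_ROUNDS % rounds_per_clock != 0:
        raise ValueError("rounds_per_clock must divide the total round count")
    return TRIPLE_DES_ROUNDS // rounds_per_clock


for r in (1, 2, 4):
    print(f"{r} round(s)/clock -> {clocks_per_block(r)} clocks")
```

Stacking rounds lengthens the critical path, so the clock count alone does not determine throughput; the slides attribute the net gain to fewer register accesses and logic compression across the stacked rounds.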

S-Box Comparison between AES and Camellia

Compact AES S-Box Architecture

• Field conversion from GF(2^8) to GF(((2^2)^2)^2) by isomorphism functions
• The hierarchical architecture of the GF(((2^2)^2)^2) inverter is very compact
• The isomorphism function and the affine transformation are merged into a single XOR matrix

[Figure: AES S-Box data flow. GF(2^8) input → isomorphism → inversion in GF(((2^2)^2)^2) → inverse isomorphism merged with the affine transformation → GF(2^8) output. The GF(((2^2)^2)^2) inverter is built hierarchically: it contains a GF((2^2)^2) inverter, which in turn contains a GF(2^2) inverter, together with squarers (x^2) and the constant multipliers λ and φ on 4-bit and 2-bit paths.]
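As a functional reference for what the S-Box computes, the sketch below evaluates the AES S-Box as an inversion in GF(2^8) followed by the affine transformation. It works directly in GF(2^8) with the AES polynomial and inverts by exponentiation (x^254); it is not the composite-field circuit described on the slide, only a model that such a circuit should agree with. All function names are chosen here for illustration.

```python
def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(x):
    """Multiplicative inverse in GF(2^8) as x^254; maps 0 to 0 as AES requires."""
    result, power, e = 1, x, 254
    while e:
        if e & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        e >>= 1
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits (0 < n < 8)."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def aes_sbox(x):
    """AES SubBytes for one byte: field inversion, then the affine transformation."""
    s = gf_inv(x)
    return s ^ rotl8(s, 1) ^ rotl8(s, 2) ^ rotl8(s, 3) ^ rotl8(s, 4) ^ 0x63


assert aes_sbox(0x00) == 0x63
assert aes_sbox(0x53) == 0xED
```

A table-based S-Box stores all 256 outputs of `aes_sbox`, whereas a composite-field S-Box computes the same mapping with a much smaller inverter.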

Small Camellia S-Box Architecture

• Camellia uses a GF((2^4)^2) inverter
• The GF(2^4) inverter is implemented as a lookup table

[Figure: Camellia S-Box data flow. Affine transformation F → inversion in GF((2^4)^2) → affine transformation H. The GF((2^4)^2) inverter contains a GF(2^4) inverter, a squarer (x^2), and the constant multiplier ω on 4-bit paths.]

S-Box Performance in ASIC

• Synthesized using a 0.13 µm ASIC library
• The GF(((2^2)^2)^2) inverter is 26% smaller than the GF((2^4)^2) inverter
• The lookup-table inverter is 2 times faster but 3 times bigger
• S-Box performance is almost the same for AES and Camellia

Group      Component     Method             Size (gates)   Delay (ns)
Inverter   Inverter      GF((2^4)^2)        227            2.28
Inverter   Inverter      GF(((2^2)^2)^2)    169            2.61
Inverter   Inverter      Table              540~549        1.31~1.39
AES        SubBytes      GF(((2^2)^2)^2)    230            3.67
AES        SubBytes      Table              562            1.30
AES        InvSubBytes   GF(((2^2)^2)^2)    219            3.89
AES        InvSubBytes   Table              558            1.33
Camellia   S1~S4         GF((2^4)^2)        256            3.45
Camellia   S1~S4         Table              540~562        1.31~1.40

Hardware Performance Comparison in ASIC

AES/Camellia/Triple-DES in ASIC

• In the compact architectures, the three algorithms show comparable performance
• In the high-speed architectures, the 128-bit Feistel cipher Camellia is 2 times faster than the 64-bit Feistel cipher Triple-DES
• The SPN cipher AES is 1.6 times faster than Camellia

[Figure: throughput (Mbps, 0 to 3500) versus gate count (Kgates, 0 to 60) for a 0.13 µm ASIC library, worst case. Series: AES (Ours), Camellia (Ours), Triple-DES (Ours), AES (Conventional), Camellia (Conventional). Labelled points: 311 Mbps / 5.4 Kgates and 3.46 Gbps / 36.9 Kgates. The region of low gate count and high throughput is marked "Small & Fast".]

Hardware Efficiency

• Hardware efficiency, defined as throughput per gate, is compared
• The high-speed versions show higher efficiency because they need no additional circuits or delays for component sharing
• The high-speed versions with a composite-field S-Box show the best performance

[Figure: bar chart of hardware efficiency (Kbps/gate, 0 to 140) for AES, Camellia, and Triple-DES on a 0.13 µm ASIC library, worst case. Bars are annotated with cycle counts (in extraction order: 54, 44, 32, 21, 11, 10, 10, 44, 18, 18, 48, 24, 12) and with the labels "Lookup table", "Area optimized", and "Speed optimized".]
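The throughput-per-gate metric is simple to compute. The Python sketch below applies it to the two data points that are labelled on the throughput chart (311 Mbps at 5.4 Kgates and 3.46 Gbps at 36.9 Kgates); which design each point belongs to is not recoverable from the extracted text, so the dictionary keys are placeholders, and the function name is an assumption.

```python
def efficiency_kbps_per_gate(throughput_mbps, kgates):
    """Hardware efficiency as throughput per gate, in Kbps/gate."""
    throughput_kbps = throughput_mbps * 1000.0
    gates = kgates * 1000.0
    return throughput_kbps / gates


# (throughput in Mbps, size in Kgates) for the two labelled chart points
points = {
    "small labelled point": (311.0, 5.4),
    "large labelled point": (3460.0, 36.9),
}

for name, (mbps, kgates) in points.items():
    print(f"{name}: {efficiency_kbps_per_gate(mbps, kgates):.1f} Kbps/gate")
```

The larger design comes out ahead on this metric, in line with the observation that the high-speed versions are more efficient per gate.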

Conclusion
