18.02.2013 Views


Hardware-Focused Performance Comparison for the Standard Block Ciphers AES, Camellia, and Triple-DES

Akashi Satoh and Sumio Morioka

Tokyo Research Laboratory, IBM Japan Ltd.


Contents

• Compact and High-Speed Architectures for AES
• Compact and High-Speed Architectures for Camellia
• Compact and High-Speed Architectures for Triple-DES
• S-Box Comparison between AES and Camellia
• Hardware Performance Comparison in ASIC
• Conclusion


Compact and High-Speed Architectures for AES


AES Algorithm

• 128-bit data blocks with 128-/192-/256-bit keys
• SPN structure using 4 primitive functions; with a 128-bit key, the data passes through 11 AddRoundKey steps (an initial key addition followed by 10 rounds)

Primitive functions (operand width and number of instances):

Primitive      Width     Count
XOR            1-bit     128
S-Box          8-bit     16
Rotation       32-bit    4
Permutation    32-bit    4

[Figure: the four primitive functions applied to the 4x4 byte state a_ij. SubBytes passes each byte through the S-Box; ShiftRows leaves row 0 unshifted and rotates rows 1, 2, and 3 left by 1, 2, and 3 bytes; MixColumns multiplies each column by c(x); AddRoundKey XORs the state with the round key k_ij. Each function maps the input state a_ij to the output state b_ij.]

Ciphering Block

[Figure: data flow of the ciphering block, processed in 8-bit units. 128-bit plain text → AddRoundKey → (SubBytes → ShiftRows → MixColumns → AddRoundKey), repeated for each full round → SubBytes → ShiftRows → AddRoundKey → 128-bit cipher text.]
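As a rough illustration of the primitive functions listed on this slide, the following Python sketch implements ShiftRows, MixColumns on one column, and AddRoundKey on a 4x4 byte state. The function names and the row-major `state[r][c]` layout are choices made for this example, not part of the slides, and the S-Box is left out here.

```python
def xtime(b: int) -> int:
    """Multiply a byte by x (that is, by 02) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # reduce by the AES field polynomial
    return b


def shift_rows(state):
    """state[r][c] holds the byte in row r, column c; row r rotates left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def mix_column(col):
    """Multiply one 4-byte column by c(x) = 03*x^3 + 01*x^2 + 01*x + 02."""
    a0, a1, a2, a3 = col

    def mul3(v):
        return xtime(v) ^ v  # 03*v = 02*v XOR v

    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def add_round_key(state, round_key):
    """XOR the state with a round key of the same shape, byte by byte."""
    return [[s ^ k for s, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]
```

For example, `mix_column([0xDB, 0x13, 0x53, 0x45])` returns `[0x8E, 0x4D, 0xA1, 0xBC]`, a commonly quoted MixColumns test column.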


Compact Architecture for AES

• Primitive components are shared between encryption, decryption, and key scheduling
• A 32-bit data path is used repeatedly to process 128-bit data
• The key scheduler reuses the S-Box in the ciphering block while ShiftRows is executing
• Encryption and decryption take 54 clock cycles each

[Figure: ciphering block of the compact AES architecture. 8-bit data registers feed a 32-bit data path that passes through ShiftRows/InvShiftRows, four 8-bit S-Box slices (affine^-1, x^-1 inverter, affine), and MixColumns/InvMixColumns (MxCo, MxCo^-1), with 2:1 and 5:1 multiplexers selecting the path.]


High-Speed Architecture for AES

• Straightforward implementation with a 128-bit data path
• Encryption and decryption take 10 clocks each

[Figure: high-speed AES architecture. The ciphering block takes the data input into a data register and applies AddRoundKey, SubBytes/InvSubBytes (affine^-1, x^-1 inverter, affine), ShiftRows/InvShiftRows, and MixColumns/InvMixColumns (MxCo, MxCo^-1) on four parallel 32-bit slices selected by 2:1 multiplexers. The key scheduler, fed from the secret key register, supplies four 32-bit round-key words.]

Compact and High-Speed Architectures for Camellia


Camellia Algorithm

• 128-bit data blocks with 128-/192-/256-bit keys
• Two FL/FL^-1 layers are inserted between the three Feistel network blocks
• Encryption and decryption each take 22 rounds (18 Feistel rounds plus 2 FL/FL^-1 layers and 2 key-whitening steps, for a 128-bit key)

[Figure: Camellia structure. The plain text is split into two 64-bit halves and passes through key whitening (kw1, kw2), three Feistel network blocks of six F functions each (round keys k1~6, k7~12, k13~18) separated by FL/FL^-1 layers (kl1, kl2 and kl3, kl4), and final key whitening (kw3, kw4) to give the cipher text. Each F function adds the 64-bit round key k, applies eight 8-bit S-Boxes (S1, S2, S3, S4, S2, S3, S4, S1 on bytes x1 to x8), and mixes the outputs z1 to z8 in the P function to give z'1 to z'8. The FL and FL^-1 functions combine the 32-bit halves of the 64-bit input with the key kl using AND and OR operations.]
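The FL and FL^-1 functions shown on this slide can be sketched in a few lines of Python. The definitions below follow the published Camellia specification (RFC 3713) rather than anything stated on the slide, and the names `fl`, `fl_inv`, and `rotl32` are chosen for this example.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def fl(x: int, ke: int) -> int:
    """FL function on a 64-bit value x with a 64-bit subkey ke (per RFC 3713)."""
    x1, x2 = x >> 32, x & MASK32
    k1, k2 = ke >> 32, ke & MASK32
    x2 ^= rotl32(x1 & k1, 1)  # AND with the key, rotate by 1 bit, XOR into the right half
    x1 ^= x2 | k2             # OR with the key, XOR into the left half
    return (x1 << 32) | x2


def fl_inv(y: int, ke: int) -> int:
    """FL^-1 function: the same two steps as FL applied in reverse order."""
    y1, y2 = y >> 32, y & MASK32
    k1, k2 = ke >> 32, ke & MASK32
    y1 ^= y2 | k2
    y2 ^= rotl32(y1 & k1, 1)
    return (y1 << 32) | y2
```

Because FL^-1 undoes the two steps of FL in reverse order, `fl_inv(fl(x, ke), ke) == x` for any 64-bit `x` and `ke`, which is what allows one circuit to serve both directions.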


32-bit Slice of F Function

• A 32-bit S-Box block is used twice to generate the 64-bit output of the F function
• The two 32-bit S-Box output blocks are combined through the permutation layer and an XOR operation

[Figure: the F function computed with a 32-bit slice. Each pass sends four bytes through the S-Boxes and the permutation layer, with the other four permutation inputs set to 0; the two permutation outputs are then XORed.]
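The two-pass evaluation works because the P function is linear over XOR. The sketch below, which assumes the P-function equations from the Camellia specification (RFC 3713) and omits the S-Box stage, checks that permuting each 4-byte half with the other half zeroed and XORing the results gives the same output as a single 64-bit pass. How the authors' hardware sequences the two halves is not given on the slide, so `p_two_pass` is only one possible arrangement.

```python
def p_function(z):
    """Camellia P function on eight S-Box output bytes z = [z1, ..., z8] (per RFC 3713)."""
    z1, z2, z3, z4, z5, z6, z7, z8 = z
    return [
        z1 ^ z3 ^ z4 ^ z6 ^ z7 ^ z8,
        z1 ^ z2 ^ z4 ^ z5 ^ z7 ^ z8,
        z1 ^ z2 ^ z3 ^ z5 ^ z6 ^ z8,
        z2 ^ z3 ^ z4 ^ z5 ^ z6 ^ z7,
        z1 ^ z2 ^ z6 ^ z7 ^ z8,
        z2 ^ z3 ^ z5 ^ z7 ^ z8,
        z3 ^ z4 ^ z5 ^ z6 ^ z8,
        z1 ^ z4 ^ z5 ^ z6 ^ z7,
    ]


def p_two_pass(z):
    """Evaluate P in two passes, each with only four non-zero input bytes."""
    first = p_function(list(z[:4]) + [0, 0, 0, 0])   # pass 1: bytes z1..z4
    second = p_function([0, 0, 0, 0] + list(z[4:]))  # pass 2: bytes z5..z8
    return [a ^ b for a, b in zip(first, second)]
```

For any list of eight bytes, `p_two_pass(z) == p_function(z)`, so halving the S-Box count costs an extra pass but no change in the result.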


Divide and Merge Primitive Functions

• The number of S-Boxes is halved by using a 32-bit slice of the F function
• The FL/FL^-1 functions and key whitening are merged

[Figure: the F function (key addition with k, S-Boxes S1 to S4, P function) divided into a 32-bit slice, and the merged FL/FL^-1 and key-whitening circuit built from 1-bit rotations and 2:1 multiplexers.]


Data Path Using 32-bit Slice of F Function

• Data are processed as 32-bit blocks in each round
• The right half of the data is always the half that is processed, which simplifies the controller

[Figure: the original Camellia data path (64-bit F functions with round keys k1 to k18, FL/FL^-1 layers with kl1 to kl4, key whitening with kw1 to kw4) next to the rearranged data path, in which each F function is evaluated with the 32-bit slice F32 using the round-key halves k1H, k1L, ..., k6H, k6L, and the FL/FL^-1 and key-whitening stages are repositioned so that the right half of the data is always the half being processed.]


Compact Architecture

• The F function is shared between the data randomization block and the key scheduler
• Round keys are generated by repeating 16-bit and 1-bit rotations
• Encryption and decryption take 44 clocks each

[Figure: compact Camellia architecture. The key scheduler holds K_L and K_A in 128-bit registers with 16-bit and 1-bit rotators behind 4:1 and 3:1 multiplexers. The ciphering block takes a 64-bit key/data input, keeps the data in 64-bit H and L halves, and routes it through the F32 slice, the FL/FL^-1 function, and the key adder (KeyAdd) before the data output.]
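One way to see why 16-bit and 1-bit rotators are enough for the key schedule: the sketch below assumes the left-rotation amounts used for 128-bit keys in the Camellia specification (15, 30, 45, 60, 77, 94, 111; taken from RFC 3713, not from the slide). Consecutive amounts differ by 15 or 17, and each such step is a 16-bit rotation combined with a 1-bit rotation in one direction or the other. The sequencing in `rotation_chain` is an assumption made for illustration, not the authors' controller.

```python
MASK128 = (1 << 128) - 1

# Left-rotation amounts applied to the key material for 128-bit keys (RFC 3713).
ROTATIONS = [15, 30, 45, 60, 77, 94, 111]


def rotl128(x: int, n: int) -> int:
    """Rotate a 128-bit value left by n bits."""
    n %= 128
    return ((x << n) | (x >> (128 - n))) & MASK128


def rotr128(x: int, n: int) -> int:
    """Rotate a 128-bit value right by n bits."""
    return rotl128(x, 128 - (n % 128))


def step(x: int, delta: int) -> int:
    """Rotate left by 15 or 17 using only a 16-bit and a 1-bit rotation."""
    if delta == 15:
        return rotr128(rotl128(x, 16), 1)  # 16 left, then 1 right
    if delta == 17:
        return rotl128(rotl128(x, 16), 1)  # 16 left, then 1 left
    raise ValueError("only steps of 15 or 17 bits are supported")


def rotation_chain(x: int):
    """Return x rotated left by each amount in ROTATIONS, built incrementally."""
    out, prev = [], 0
    for amount in ROTATIONS:
        x = step(x, amount - prev)
        prev = amount
        out.append(x)
    return out
```

Each entry of `rotation_chain(x)` equals `rotl128(x, r)` for the matching `r` in `ROTATIONS`, so no general-purpose barrel shifter is needed.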


High-Speed Architecture

• The original 64-bit F function is used
• The F function and the FL/FL^-1 functions or the key-whitening linear functions are executed in the same cycle to reduce the number of clocks
• Encryption and decryption take 18 clocks each

[Figure: high-speed Camellia architecture. The key scheduler rotates its 128-bit key registers by 1, 16, and 17 bits (with a 15-bit right rotation) through 2:1 and 4:1 multiplexers. The ciphering block takes the data/key input, keeps the 128-bit data in 64-bit H and L halves, and applies the 64-bit F function together with FL and FL^-1 before the data output.]


Compact and High-Speed Architectures for Triple-DES


DES Algorithm

• 64-bit data block with a 56-bit key
• 16-round Feistel network
• Triple-DES takes 48 rounds (three 16-round DES passes)
• The S-Box is a random substitution table
• A straightforward implementation yields compact hardware with a high operating frequency

[Figure: DES structure. The F function expands the 32-bit input to 48 bits (E), adds the 48-bit round key, passes the result through eight 6-bit-to-4-bit S-Boxes (S0 to S7), and applies the 32-bit permutation P. The 64-bit plain text goes through IP, the rounds of the Feistel network on 32-bit halves, and IP^-1 to give the 64-bit cipher text; the key scheduler supplies a 48-bit key to each round.]
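The bit widths in the F function (32 bits expanded to 48, then eight 6-bit S-Box inputs) can be illustrated with the E expansion alone. The sketch below generates the standard DES E table from its regular pattern and splits the expanded value into S-Box inputs; the S-Box contents and the P permutation are omitted, and the helper names are chosen for this example.

```python
# 1-indexed source bit positions for the DES E expansion; bit 1 is the most
# significant bit of the 32-bit half block. Each group of six output bits
# covers four input bits plus one neighbor on either side, wrapping around.
E_TABLE = [((4 * group + j - 1) % 32) + 1 for group in range(8) for j in range(6)]


def expand(r: int) -> int:
    """Expand a 32-bit half block to 48 bits using E_TABLE."""
    out = 0
    for pos in E_TABLE:
        out = (out << 1) | ((r >> (32 - pos)) & 1)
    return out


def sbox_inputs(v48: int):
    """Split a 48-bit value into eight 6-bit S-Box inputs, most significant first."""
    return [(v48 >> (42 - 6 * i)) & 0x3F for i in range(8)]
```

In hardware the expansion is pure wiring, which is one reason a straightforward DES data path stays small.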


Compact and High-Speed Architectures for DES

• Three architectures containing 1, 2, or 4 round functions execute Triple-DES in 48, 24, or 12 clocks
• The multi-round-per-clock versions achieve higher throughput because register accesses decrease and the logic of the stacked round functions is compressed

[Figure: three DES data paths, ordered from compact to high-speed. Each takes 64-bit data and a 56-bit key, and the key schedule supplies one 48-bit round key per round function.]

Architecture      Round functions per clock   Total cycles
1 round/clock     1                           48
2 rounds/clock    2                           24
4 rounds/clock    4                           12
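The cycle counts for the three architectures follow directly from the 48 Triple-DES rounds. A minimal sketch of that arithmetic, with names chosen for this example:

```python
DES_ROUNDS = 16        # rounds in one DES pass
TRIPLE_DES_PASSES = 3  # Triple-DES runs three DES passes back to back
BLOCK_BITS = 64        # DES block size


def triple_des_cycles(rounds_per_clock: int) -> int:
    """Clock cycles per block when this many round functions are stacked per clock."""
    total_rounds = DES_ROUNDS * TRIPLE_DES_PASSES
    if rounds_per_clock <= 0 or total_rounds % rounds_per_clock:
        raise ValueError("rounds_per_clock must evenly divide 48")
    return total_rounds // rounds_per_clock


def bits_per_clock(rounds_per_clock: int) -> float:
    """Average number of block bits processed per clock cycle."""
    return BLOCK_BITS / triple_des_cycles(rounds_per_clock)
```

Stacking round functions lengthens the critical path, so the clock frequency drops as `rounds_per_clock` grows; the throughput gain is therefore smaller than the ratio of cycle counts alone would suggest.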


S-Box Comparison between AES and Camellia


Compact AES S-Box Architecture

• Field conversion from GF(2^8) to GF(((2^2)^2)^2) by isomorphism functions
• The hierarchical architecture of the GF(((2^2)^2)^2) inverter is very compact
• The isomorphism function and the affine transformation are merged into a single XOR matrix

[Figure: compact AES S-Box. An isomorphism maps GF(2^8) to GF(((2^2)^2)^2), the inversion is performed there, and the inverse isomorphism is merged with the affine transformation on the way back to GF(2^8). The GF(((2^2)^2)^2) inverter is built hierarchically from a GF((2^2)^2) inverter, which in turn contains a GF(2^2) inverter, using squaring (x^2), constant multiplications (λ, φ), and x^-1 blocks on 4-bit and 2-bit data paths.]
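For reference, the function that the compact circuit has to reproduce is an inversion in GF(2^8) followed by the affine transformation. The sketch below computes it directly with exponentiation in GF(2^8); it does not use the composite-field decomposition from the slide and is meant only as a behavioral model to check such a circuit against. The function names are chosen for this example.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def gf256_inv(a: int) -> int:
    """Multiplicative inverse as a^254, with 0 mapped to 0 as in AES."""
    result, base, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        exponent >>= 1
    return result


def affine(x: int) -> int:
    """AES affine transformation: each output bit XORs five input bits, then adds 0x63."""
    y = 0
    for i in range(8):
        bit = 0
        for offset in (0, 4, 5, 6, 7):
            bit ^= (x >> ((i + offset) % 8)) & 1
        y |= bit << i
    return y ^ 0x63


def sub_byte(x: int) -> int:
    """AES S-Box: field inversion followed by the affine transformation."""
    return affine(gf256_inv(x))
```

For example, `sub_byte(0x00)` is `0x63` and `sub_byte(0x53)` is `0xED`, matching the AES standard.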


Small Camellia S-Box Architecture

• Camellia uses a GF((2^4)^2) inverter
• The GF(2^4) inverter is implemented as a lookup table

[Figure: small Camellia S-Box. The input passes through affine transformation F, an inversion in GF((2^4)^2), and affine transformation H. The GF((2^4)^2) inverter works on two 4-bit halves using squaring (x^2), a constant multiplication (ω), and a GF(2^4) inverter (x^-1).]
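The idea of a GF((2^4)^2) inverter built around a small GF(2^4) lookup table can be sketched as follows. The field polynomials here (x^4 + x + 1 for GF(2^4), and y^2 + y + λ with the smallest suitable λ for the extension) are assumptions chosen for the example; they are not claimed to be the basis used in Camellia or by the authors. A byte is treated as a pair (h, l) representing h·y + l.

```python
def gf16_mul(a: int, b: int) -> int:
    """Multiply in GF(2^4) modulo x^4 + x + 1 (an assumed polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


# 16-entry inversion lookup table for GF(2^4); 0 maps to 0.
GF16_INV = [0] * 16
for a in range(1, 16):
    for b in range(1, 16):
        if gf16_mul(a, b) == 1:
            GF16_INV[a] = b

# Smallest constant for which y^2 + y + LAMBDA has no root in GF(2^4),
# so that it is irreducible and defines GF((2^4)^2).
LAMBDA = next(lam for lam in range(1, 16)
              if all(gf16_mul(t, t) ^ t ^ lam for t in range(16)))


def gf16x2_mul(x: int, y: int) -> int:
    """Multiply two bytes viewed as elements h*y + l of GF((2^4)^2)."""
    h1, l1, h2, l2 = x >> 4, x & 0xF, y >> 4, y & 0xF
    hh = gf16_mul(h1, h2)
    high = hh ^ gf16_mul(h1, l2) ^ gf16_mul(h2, l1)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(l1, l2)
    return (high << 4) | low


def gf16x2_inv(x: int) -> int:
    """Invert in GF((2^4)^2) with a single GF(2^4) table lookup."""
    h, l = x >> 4, x & 0xF
    # Norm of the element, which lies in the subfield GF(2^4).
    d = gf16_mul(gf16_mul(h, h), LAMBDA) ^ gf16_mul(h, l) ^ gf16_mul(l, l)
    d_inv = GF16_INV[d]
    return (gf16_mul(h, d_inv) << 4) | gf16_mul(h ^ l, d_inv)
```

The 8-bit inversion reduces to a handful of 4-bit multiplications, a squaring, a constant multiplication, and one 16-entry table, which is the structure the slide's figure shows.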


S-Box Performance in ASIC

• Synthesized using a 0.13 µm ASIC library
• The GF(((2^2)^2)^2) inverter is 26% smaller than the GF((2^4)^2) inverter
• The lookup-table inverter is 2 times faster but 3 times bigger
• The performance of the S-Boxes is almost the same for AES and Camellia

Component          Method             Size (gates)   Delay (ns)
Inverter           GF((2^4)^2)        227            2.28
Inverter           GF(((2^2)^2)^2)    169            2.61
Inverter           Table              540~549        1.31~1.39
AES SubBytes       GF(((2^2)^2)^2)    230            3.67
AES SubBytes       Table              562            1.30
AES InvSubBytes    GF(((2^2)^2)^2)    219            3.89
AES InvSubBytes    Table              558            1.33
Camellia S1~S4     GF((2^4)^2)        256            3.45
Camellia S1~S4     Table              540~562        1.31~1.40
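The percentages quoted in the bullets can be checked against the inverter figures. The sketch below assumes that the 227-gate inverter is the GF((2^4)^2) design and the 169-gate inverter is the GF(((2^2)^2)^2) design, which is the assignment consistent with the 26% claim, and it uses the lower bounds of the lookup-table ranges.

```python
# (size in gates, delay in ns); the column assignment is an assumption
# inferred from the 26% claim, not stated explicitly in the extracted table.
INVERTERS = {
    "GF((2^4)^2)": (227, 2.28),
    "GF(((2^2)^2)^2)": (169, 2.61),
    "Table": (540, 1.31),  # lower bounds of 540~549 gates and 1.31~1.39 ns
}

small_size, small_delay = INVERTERS["GF(((2^2)^2)^2)"]
other_size, _ = INVERTERS["GF((2^4)^2)"]
table_size, table_delay = INVERTERS["Table"]

size_saving = 1 - small_size / other_size    # fraction of gates saved
table_speedup = small_delay / table_delay    # how much faster the table is
table_size_ratio = table_size / small_size   # how much bigger the table is
```

With these inputs the saving comes out close to 26%, the table is roughly twice as fast, and it is a little over three times as large.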


Hardware Performance Comparison in ASIC


AES/Camellia/Triple-DES in ASIC

• In the compact architectures, the three algorithms show about the same performance
• In the high-speed architectures, the 128-bit Feistel cipher Camellia is 2 times faster than the 64-bit Feistel cipher Triple-DES
• The SPN cipher AES is 1.6 times faster than Camellia

[Figure: throughput (Mbps, 0 to 3500) versus gate count (Kgates, 0 to 60; 0.13 µm ASIC, worst case) for AES (Ours), Camellia (Ours), Triple-DES (Ours), AES (Conventional), and Camellia (Conventional). Labeled points: 311 Mbps/5.4 Kgates, 326 Mbps/6.5 Kgates, 251 Mbps/5.5 Kgates, 3.46 Gbps/36.9 Kgates, 2.15 Gbps/29.8 Kgates, and 1.07 Gbps/16.9 Kgates. One region of the plot is marked "Small & Fast".]
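The speed ratios in the bullets can be reproduced from the three Gbps-range labels on the chart. The sketch below assigns 3.46, 2.15, and 1.07 Gbps to AES, Camellia, and Triple-DES, which is the assignment implied by the stated 1.6x and 2x ratios. The clock estimate assumes an iterative design that processes one block at a time, so that throughput = block bits × clock / cycles; the resulting frequencies are back-of-the-envelope values, not figures from the slides.

```python
def throughput_mbps(block_bits: int, cycles: int, clock_mhz: float) -> float:
    """Throughput of a non-pipelined design processing one block per `cycles` clocks."""
    return block_bits * clock_mhz / cycles


def implied_clock_mhz(mbps: float, block_bits: int, cycles: int) -> float:
    """Clock frequency implied by a throughput figure under the same assumption."""
    return mbps * cycles / block_bits


# High-speed throughput labels (Mbps); the algorithm assignment is inferred
# from the ratios quoted in the bullets.
HIGH_SPEED_MBPS = {"AES": 3460.0, "Camellia": 2150.0, "Triple-DES": 1070.0}

aes_vs_camellia = HIGH_SPEED_MBPS["AES"] / HIGH_SPEED_MBPS["Camellia"]
camellia_vs_tdes = HIGH_SPEED_MBPS["Camellia"] / HIGH_SPEED_MBPS["Triple-DES"]

# Cycle counts from the architecture slides: 10 for AES, 18 for Camellia.
aes_clock = implied_clock_mhz(HIGH_SPEED_MBPS["AES"], 128, 10)
camellia_clock = implied_clock_mhz(HIGH_SPEED_MBPS["Camellia"], 128, 18)
```

Under these assumptions AES gains its advantage from needing fewer cycles per block, not from a faster clock.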


Hardware Efficiency

• Hardware efficiency, defined as throughput per gate, is compared
• High-speed versions show higher efficiency because no additional circuits or delays for component sharing are required
• High-speed versions with a composite-field S-Box show the best performance

[Figure: bar chart of hardware efficiency (Kbps/gate, 0 to 140; 0.13 µm ASIC, worst case) for the AES, Camellia, and Triple-DES implementations. Bars are labeled with their cycle counts (in reading order: 54, 44, 32, 21, 11, 10, 10, 44, 18, 18, 48, 24, 12) and tagged as area optimized, speed optimized, or lookup table where applicable.]
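The efficiency metric can be recomputed for the six labeled points on the throughput chart. The bar heights of the efficiency chart itself are not recoverable from the extracted text, so the sketch below uses only those six throughput/gate-count pairs, grouped as compact or high-speed by their magnitude.

```python
# (throughput in Mbps, size in Kgates) as labeled on the throughput chart.
COMPACT = [(311.0, 5.4), (326.0, 6.5), (251.0, 5.5)]
HIGH_SPEED = [(3460.0, 36.9), (2150.0, 29.8), (1070.0, 16.9)]


def efficiency_kbps_per_gate(mbps: float, kgates: float) -> float:
    """Throughput per gate; Mbps divided by Kgates is numerically Kbps per gate."""
    return mbps / kgates


compact_eff = [efficiency_kbps_per_gate(m, g) for m, g in COMPACT]
high_speed_eff = [efficiency_kbps_per_gate(m, g) for m, g in HIGH_SPEED]
```

For these six points, every high-speed design has a higher throughput per gate than every compact design, in line with the second bullet.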


Conclusion
