Symmetric Key Cryptography on Modern Graphics Hardware


Jason Yang

Graphics Products Group

Jim Goodman


Motivation

Digital Rights Management

Advanced Access Content System (AACS)
- Blu-Ray / HD-DVD

Key Searching


Outline

• Why Graphics Hardware (GPU)?
• GPGPU Programming Model
• Block-Based AES
  - GPU Programming Example
• Key Searching
  - Bitsliced DES and AES
• Conclusions


Why GPUs?


Why use GPUs instead of CPUs?

Potential for significant speedup for data parallel problems
– Basic: 5-10x
– Tuned: 20-100x or even more

Designed to handle processing massive amounts of data efficiently
– Supports 1000’s of concurrent threads
– Massive memory bandwidth
– Memory latency hiding

~1-2 orders of magnitude better in several key metrics vs. current CPUs
– Memory bandwidth
– GFLOPS per Watt
– GFLOPS per $

[Figure: bar chart of the ratio RGPU/RCPU (y-axis, 0 to 80) for Memory BW, GFLOP/W, and GFLOP/$, comparing the 2900XT, 2950XT, and 2950XT2 against six CPUs: AMD 6000+, AMD FX-62, Intel E6850, Intel Q6600, Intel QX6700, Intel QX6800.]


GPU vs. CPU: Quick Comparison

                 GPU (HD2900XT)                            CPU (Barcelona)
# Processors     64+                                       4
ALU area         ~40% of die                               ~5% of die
Memory System    Max bandwidth (10x)                       Min latency (0.1x)
Memory Access    Complex (tiling + arithmetic in memory)   Simple LD/ST
Cache            Small cache                               Large cache (10x)
FP Compliance    Partial IEEE SP**                         Full IEEE 754 DP/SP

** FS670/680 have DP/SP


GPU vs. CPU: Design Points

                               GPU                              CPU
Program Style / Control Flow   Few instructions, lots of data   Lots of instructions, little data
                               SIMD                             Out of order execution
                               Hardware threading               Branch prediction
Access Patterns                Little reuse                     Reuse + locality
Program Model                  Data parallel                    Task parallel
Synchronization                Very simple sync                 Complex sync
Legacy Support                 Not necessarily                  Backwards compatible
ISA                            Proprietary                      Standardized
Functional Deltas              Large and frequent               Small and infrequent


GPU Internals: ATI Radeon HD2900XT

[Figure: block diagram of the HD2900XT. Labeled blocks: Command Processor, Setup Unit, Vertex Assembler, Geometry Assembler, Tessellator, Rasterizer, Interpolators, Hierarchical Z, Ultra-Threaded Dispatch Processor, Unified Shader Processors, Shader Export, Render Back-Ends, Vertex Index Fetch, Texture Units, Shader Caches (Instruction & Constant), L1 Texture Cache, L2 Texture Cache, Memory Read/Write Cache, Stream Out, Z/Stencil Cache, Color Cache.]

>100 GB/s memory bandwidth
– 512b DDR3/4 interface

Targeted for handling thousands of simultaneous lightweight threads

Instruction cache and constant cache for unlimited program size

Scalar ALU implementation with 320 (64x5) independent stream processors
– 256 (64x4) basic units (FMAC, ADD/SUB, SIN, etc.)
– 64 enhanced transcendental units (adds COS, LOG, EXP, RSQ, etc.)
– Support for INT/UINT in all units (ADD/SUB, AND, XOR, NOT, OR, etc.)


GPU Programming Model


General Purpose GPU (GPGPU) Computing

Not a new idea, graphics APIs (e.g., OpenGL) have been used for general purpose computation for years
• VERY difficult to use due to constraints of graphics APIs and the graphics programming model itself
• High overhead of graphics APIs yielded generally poor performance

Key enabler today is that GPU developers are investing effort to improve usability and GPGPU performance
• Introduction of high level C-like programming languages that access GPUs’ features (e.g., Brook+/CUDA)
• Support for simpler programming model
• Exposure of proprietary internal GPU features (e.g., ISA and IL specs)
• Creation of toolsets for emulation, performance tuning, and debugging
• Explicit architectural support for GPGPU (e.g., shared memory)
• Industry standardization efforts

GPUs are also starting to more closely resemble CPUs
• Native integer and DP support
• Advanced control flow features for branching/looping


GPU Hardware Simplified

[Figure: simplified GPU block diagram with a Command Queue, ALU Units, Texture Fetch Units, a Memory Controller, and Local Memory.]


Simplified GPGPU Programming Model

[Figure: a virtualized SPMD array of threads T(i, j), with Width = x and Height = y (T(0,0) through T(x-1, y-1)), reads input arrays from memory and writes output arrays to memory. Inside the GPU, a scheduler dispatches threads to shader processing cores SP0 through SPN-1, each with ALUs 0 through n-1, registers/constants, and a memory interface.]

Split problem into virtual array of independent “pieces” processed by the virtualized SPMD array

Scheduler maps thread (i, j) of virtual SPMD array onto physical shader processor k

Data parallel model scales across multiple GPUs for massively parallel problems
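The thread-to-processor mapping described above can be illustrated with a minimal Python sketch. It assumes a round-robin scheduler, which the slides do not specify, and the names `run_spmd`, `xor_with_constant`, and `num_processors` are hypothetical and used only for illustration.

```python
def run_spmd(kernel, inputs, width, height, num_processors):
    """Apply `kernel` to every element of a width x height virtual thread array.

    Assumption: threads are assigned to physical processors round-robin;
    real GPU schedulers are more sophisticated.
    """
    outputs = [[None] * width for _ in range(height)]
    work_per_processor = [0] * num_processors
    for j in range(height):
        for i in range(width):
            k = (j * width + i) % num_processors  # thread (i, j) -> processor k
            work_per_processor[k] += 1
            # Each thread reads only its own input piece and writes only
            # its own output piece, so the threads are independent.
            outputs[j][i] = kernel(inputs[j][i])
    return outputs, work_per_processor


def xor_with_constant(value):
    """Hypothetical per-thread kernel: a stand-in for real per-block work."""
    return value ^ 0x5A


grid = [[i + 4 * j for i in range(4)] for j in range(3)]
result, load = run_spmd(xor_with_constant, grid, width=4, height=3, num_processors=4)
```

Because no thread depends on another thread's output, the same kernel runs unchanged whether the array is spread over 4 processors or several hundred.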


Block-Based AES


Prior Work

Cook, et al., “CryptoGraphics: Secret Key Cryptography Using Graphics Cards”, RSA Conference, 2005
- Uses native XOR on Memory Output (ROP)

Harrison and Waldron, “AES Encryption Implementation and Analysis on Commodity Graphics Processing Units”, CHES, Sept. 2007
- Cook method and programmable floating point units

Takeshi Yamanouchi, “AES Encryption and Decryption on the GPU”, GPU Gems 3, Aug. 2007
- Uses integer operations


GPU Programming Example: Block-Based AES

Review: AES Round

1) SubBytes
2) ShiftRows
3) MixColumns
4) AddRoundKey
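The four steps of the round can be sketched in plain Python as follows. This is a reference-style illustration rather than the GPU code from the talk: the S-box is generated from its mathematical definition, the state is assumed to be 16 bytes in column-major order, and all function names are chosen here for illustration.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def sbox_entry(x):
    """S-box: multiplicative inverse in GF(2^8), then the AES affine map."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):  # x^254 equals x^-1 in GF(2^8)
            inv = gf_mul(inv, x)
    out = inv
    for shift in range(1, 5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def sub_bytes(state):
    return [SBOX[b] for b in state]


def shift_rows(state):
    # state[r + 4 * c] is row r, column c; row r rotates left by r columns.
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_columns(state):
    out = []
    for c in range(4):
        a = state[4 * c: 4 * c + 4]
        for r in range(4):
            out.append(gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


def add_round_key(state, round_key):
    return [s ^ k for s, k in zip(state, round_key)]


def aes_round(state, round_key):
    """One full AES round: the four steps listed above, in order."""
    return add_round_key(mix_columns(shift_rows(sub_bytes(state))), round_key)
```

The final AES round omits MixColumns, and the round keys come from a separate key expansion that this sketch does not include.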


Code Example: Integer Operations

    …
    int4 c0, c1, c2, c3;
    for(int i=0; i


“Conventional” Block Based AES

    float4 c0, r0;
    c0 = txMcol[r0.w].wzyx
       ^ txMcol[r3.z].xwzy
       ^ txMcol[r2.y].yxwz
       ^ txMcol[r1.x].zyxw;
    r0 = c0 ^ tKeyadd[round_offset];

Annotations on the code:
– SubBytes + MixColumns: txMcol table lookups
– ShiftRows: component swizzling
– Add RoundKey: pre-computed round key lookup
– Native XOR support

Non-bitsliced implementation via DX10 HLSL

Utilize T-table based implementation
– Texture fetch does T i[•] lookups
– Component swizzling reduces table count to 1 w/o performance hit
– XOR supported natively in DX10 (not true in DX9)

Need to pre-compute key expansion
– Do on CPU in parallel with computations on GPU

27 Mops/s on HD 2900XT @ 750MHz ~ 3.5 Gb/s
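The single-table idea can be sketched in Python as follows. The sketch assumes a column-major 16-byte state and models the shader swizzle as a rotation of a 4-tuple; the component order therefore need not match the `.wzyx`-style swizzles in the HLSL above, and names such as `T0`, `swizzle`, and `ttable_column` are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def sbox_entry(x):
    """S-box: multiplicative inverse in GF(2^8), then the AES affine map."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
    out = inv
    for shift in range(1, 5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]

# T0[x] = SubBytes(x) times the first MixColumns matrix column (2, 1, 1, 3).
T0 = [(gf_mul(s, 2), s, s, gf_mul(s, 3)) for s in SBOX]


def swizzle(entry, k):
    """Rotate the four components by k; T1..T3 are rotations of T0."""
    return tuple(entry[(r - k) % 4] for r in range(4))


def ttable_column(state, c):
    """Column c of SubBytes + ShiftRows + MixColumns using only table T0."""
    column = (0, 0, 0, 0)
    for r in range(4):
        byte = state[r + 4 * ((c + r) % 4)]  # ShiftRows folded into indexing
        term = swizzle(T0[byte], r)          # one table, four rotations
        column = tuple(x ^ y for x, y in zip(column, term))
    return list(column)
```

Four lookups and three XORs per column replace the separate SubBytes, ShiftRows, and MixColumns passes; only AddRoundKey remains as a fourth XOR.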


Floating Point Implementation

Version with native XOR:

    float4 c0, r0;
    c0 = txMcol[r0.w].wzyx
       ^ txMcol[r3.z].xwzy
       ^ txMcol[r2.y].yxwz
       ^ txMcol[r1.x].zyxw;

Version with each XOR replaced by a lookup in the Txor table:

    float4 a0,a1,b0,b1,c0,t0,t1;
    a0 = txMcol[r0.w].wzyx;
    a1 = txMcol[r3.z].xwzy;
    t0 = XOR(a0, a1);
    b0 = txMcol[r2.y].yxwz;
    b1 = txMcol[r1.x].zyxw;
    t1 = XOR(b0, b1);
    c0 = XOR(t0, t1);

    // Per-component XOR via table lookup
    float4 XOR(float4 a, float4 b)
    {
        float4 res;
        res.x = Txor[a.x][b.x];
        res.y = Txor[a.y][b.y];
        res.z = Txor[a.z][b.z];
        res.w = Txor[a.w][b.w];
        return res;
    }

Version with two-index txMcol lookups, leaving a single XOR:

    float4 a, b, c0, r0;
    a = txMcol[r0.w][r3.z];
    b = txMcol[r2.y][r1.x];
    c0 = XOR(a, b);
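The table-based XOR can be mimicked in Python as below. This sketch assumes each float component carries an exact integer byte value (0 to 255) and that `Txor` is a 256 x 256 table; the slides do not state the table dimensions, so both are assumptions, as is the name `xor4`.

```python
# Assumption: components are integer-valued floats in the range 0..255.
TXOR = [[float(a ^ b) for b in range(256)] for a in range(256)]


def xor4(a, b):
    """Component-wise XOR of two 4-component float vectors via table lookup."""
    return tuple(TXOR[int(x)][int(y)] for x, y in zip(a, b))


example = xor4((1.0, 2.0, 255.0, 0.0), (3.0, 2.0, 15.0, 7.0))
```

Every XOR becomes a memory fetch in this scheme, which is why trading some of those fetches for ALU instructions matters for throughput.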


Floating Point Performance

Same approach as Harrison and Waldron, CHES, Sept. 2007, using floating point hardware to emulate integer operations

Harrison achieves rates of 300 Mb/s using XOR lookup tables, but …

“ALU instructions are not presenting a bottleneck. This could be shown by the removal of all ALU instructions within the algorithm implementation which resulted in no performance difference.”

Solution is to use both lookup tables and ALU instructions, up to 990 Mb/s


AES Results Comparison

Method              GPU                      Paper        Year      Mbit/s
ROP                 Nvidia TNT2              Cook         1999        0.73
                    Nvidia Geforce3          Cook         2001        1.53
                    Nvidia Geforce 6600 GT   Harrison     2004      361.20
                    Nvidia Geforce 7900 GT   Harrison     2006      870.88
Floating Point Ops  Nvidia Geforce 6600 GT   Harrison     2004       80.29
                    Nvidia Geforce 7900 GT   Harrison     2006      313.84
                    ATI Radeon X1950 XTX     Yang         2006      840.00
                    ATI Radeon HD 2900 XT    Yang         2007      990.00
Integer Ops         Nvidia 8800 GTS          Yamanouchi   2007    3,000.00
                    ATI Radeon HD 2900 XT    Yang         2007    3,500.00
Bit-Sliced          ATI Radeon HD 2900 XT    Yang         2007   18,500.00
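The Mbit/s figures for the HD 2900 XT can be related to the Mops/s rates quoted elsewhere in the slides with a small conversion. The sketch below assumes that one "op" is one 128-bit AES block, which is consistent with the pairing of 27 Mops/s and ~3.5 Gb/s but is not stated explicitly.

```python
AES_BLOCK_BITS = 128


def mops_to_mbit(mops_per_s):
    """Convert millions of AES block operations per second to Mbit/s."""
    return mops_per_s * AES_BLOCK_BITS


integer_ops_mbit = mops_to_mbit(27)   # T-table (integer ops) rate
bitsliced_mbit = mops_to_mbit(145)    # bitsliced peak rate
```

Under that assumption the two values are 3,456 and 18,560 Mbit/s, which the table reports rounded as 3,500 and 18,500.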


Key Searching

Bitsliced DES and AES


DES KEYSEARCH Application

[Figure: the key space is tiled as a 2048 x 2048 array of jobs J(0,0) through J(2047,2047) (Width = 2048, Height = 2048); each job is a 64 x 64 array of threads T(0,0) through T(63,63) (Width = 64, Height = 64).]

Each thread checks $2^{22}$ keys

Each job checks $2^{22} \cdot 2^{6} \cdot 2^{6} = 2^{34}$ keys

KEYSEARCH application checks $2^{34} \cdot 2^{11} \cdot 2^{11} = 2^{56}$ keys

Perfectly parallelizable = ideally suited to GPUs
– Job is basic unit of computation
– Each thread performs $2^{16}$ iterations, each of which checks $2^{6}$ keys
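The partitioning arithmetic can be checked with a short Python sketch. The bit widths come directly from the slide; the packing order of the fields into a key index is a hypothetical layout chosen here for illustration, since the slides do not say how coordinates map to key values.

```python
JOB_BITS = 11        # 2048 jobs per axis
THREAD_BITS = 6      # 64 threads per axis within a job
ITERATION_BITS = 16  # iterations per thread
LANE_BITS = 6        # 64 keys checked per bitsliced iteration


def key_index(job_x, job_y, thread_x, thread_y, iteration, lane):
    """Pack the coordinates into a 56-bit key index (hypothetical layout)."""
    index = 0
    fields = ((job_y, JOB_BITS), (job_x, JOB_BITS),
              (thread_y, THREAD_BITS), (thread_x, THREAD_BITS),
              (iteration, ITERATION_BITS), (lane, LANE_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("coordinate out of range")
        index = (index << bits) | value
    return index


keys_per_thread = 2 ** ITERATION_BITS * 2 ** LANE_BITS
keys_per_job = keys_per_thread * 2 ** THREAD_BITS * 2 ** THREAD_BITS
keys_total = keys_per_job * 2 ** JOB_BITS * 2 ** JOB_BITS
```

Since the field widths sum to 56 bits, every DES key is visited exactly once and no two threads overlap.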


Bitsliced DES Implementation

Bitsliced to compute 64 blocks per iteration
– 32 x 4 x 32-bit registers hold cipher state
– Key schedule computed explicitly in instructions via SetupSbox

Shader program structure (instruction counts):

Init
  Init Program                 6
  Sample Data
Iteration
  Init State                  32
  Odd rounds (1, 3, ..., 15)  282 each
    Setup Sbox 6, Sbox15_odd 69, Setup Sbox 6, Sbox26_odd 65,
    Setup Sbox 6, Sbox37_odd 63, Setup Sbox 6, Sbox48_odd 61
  Even rounds (2, 4, ..., 16) 284 each
    Setup Sbox 6, Sbox15_even 72, Setup Sbox 6, Sbox26_even 64,
    Setup Sbox 6, Sbox37_even 63, Setup Sbox 6, Sbox48_even 61
  Finish                     131
    Check Result 99, Increment Key 32

4691 Instructions per iteration (32 + 8 x 282 + 8 x 284 + 131)
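Bitslicing itself can be illustrated with a minimal Python sketch. It transposes a batch of blocks into bit planes so that a single bitwise instruction acts on the same bit position of every block at once. The gate network here is a toy function, not the DES S-boxes, and all names are illustrative.

```python
def to_bitslices(blocks, width):
    """Transpose: bit j of planes[i] is bit i of blocks[j]."""
    planes = [0] * width
    for j, block in enumerate(blocks):
        for i in range(width):
            planes[i] |= ((block >> i) & 1) << j
    return planes


def from_bitslices(planes, count):
    """Inverse transpose back to one integer per block."""
    blocks = [0] * count
    for i, plane in enumerate(planes):
        for j in range(count):
            blocks[j] |= ((plane >> j) & 1) << i
    return blocks


def toy_gate_network(planes):
    """Toy circuit: out[i] = (in[i] AND in[i+1]) XOR in[i+2], for all blocks."""
    w = len(planes)
    return [(planes[i] & planes[(i + 1) % w]) ^ planes[(i + 2) % w]
            for i in range(w)]


def toy_reference(block, width):
    """The same toy circuit evaluated on a single block, bit by bit."""
    out = 0
    for i in range(width):
        bit = (block >> i) & 1
        bit &= (block >> ((i + 1) % width)) & 1
        bit ^= (block >> ((i + 2) % width)) & 1
        out |= bit << i
    return out
```

In a real bitsliced cipher the S-boxes and permutations are expressed as such AND/XOR/OR/NOT networks, which is why the implementation above is measured in instruction counts rather than table lookups.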


Bitsliced DES Performance

[Figure: two plots against Iterations/Thread (N), log scale from 1 to 100000: Rate (Mops/s), 0 to 600, and Performance Ratio (RGPU/RCPU), 0 to 70.]

GPU (ATI 2900XT @ 750MHz) peak rate of 545Mops/s
– 89% of GPU peak rate at 16 iterations/thread

CPU (Athlon 64 FX-62 @ 2.8GHz) peak rate of 9.0Mops/s

19-60x performance improvement over CPU
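The peak figures can be turned into a rough search-time estimate. The sketch below assumes that one "op" corresponds to one key trial and that the peak rate is sustained for the whole search; both are simplifying assumptions, not claims from the slides.

```python
GPU_RATE = 545e6   # key trials per second at peak (assumed 1 op = 1 key)
CPU_RATE = 9.0e6
DES_KEYSPACE = 2 ** 56
SECONDS_PER_DAY = 86400


def exhaustive_search_days(rate_keys_per_s, keyspace=DES_KEYSPACE):
    """Days needed to try every key at a constant rate."""
    return keyspace / rate_keys_per_s / SECONDS_PER_DAY


speedup = GPU_RATE / CPU_RATE
gpu_days = exhaustive_search_days(GPU_RATE)
cpu_days = exhaustive_search_days(CPU_RATE)
```

The peak ratio works out to about 60x, matching the upper end of the quoted range, and a single GPU at that rate would need roughly four years to sweep the full DES key space.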


Bitsliced AES Implementation

Utilized in AES KEYSEARCH application

Bitsliced to compute 32 blocks per iteration
– Key schedule computed on the fly

Shader program structure (instruction counts):

Init
  Init Program                 6
  Sample Data
Iteration
  Setup                       96
    Init State 32, Init Key 32, Add Key 32
  Rounds 1 through 9         849 each
    Byte Sub/Shift 126 (x4), Mix Columns 153, Update Key 160, Add Key 32
  Round 10                   696
    Byte Sub/Shift 126 (x4), Update Key 160, Add Key 32
  Finish                     131
    CheckResult 99, IncrementKey 32

8560 Instructions


Composite Sbox Implementation

[Figure: Composite Normal Basis Sbox. The input x[7:0] passes through XFORM from the $GF(2^8)$ representation into a composite $GF(2^4)$ representation ($\gamma_1$, $\gamma_0$), then through a network of $GF(2^4)$ blocks (including $\nu \times \gamma^2$ and $1/\gamma$), and back through XFORM$^{-1}$ to the output y[7:0].]

Optimized implementation of Canright’s composite Sbox
– 126 instructions
– Previous best-reported bitsliced Sbox was 205 instructions


Bitsliced AES Performance Results

[Figure: two plots against Iterations/Thread (N), log scale from 1 to 100000. Rate (Mops/s), 0 to 200, with series “w/RoundKey Update” and “w/o RoundKey Update”. Performance Ratio (RGPU/RCPU), 0 to 18, with series “w/o Update vs. CPU @ 13Mops/s” and “w/o Update vs. CPU @ 9Mops/s”.]

GPU (ATI 2900XT @ 750MHz) peak rate of 145Mops/s
– 92% of GPU peak rate at 16 iterations/thread

CPU (Athlon 64 3500+ @ 2.2GHz) peak rate of 13Mops/s
– Pre-computed key schedule, algorithm-only w/no key checking

6-16x performance improvement over CPU


Conclusions


Limitations

GPUs are not optimized for serial operations
- Protection modes (CBC)

GPUs are still designed for graphics
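Why CBC is a poor fit can be seen in a short Python sketch that contrasts it with independent per-block encryption. The block function here is a toy stand-in built from SHA-256, not AES and not invertible; it only serves to expose the data dependencies, and all names are illustrative.

```python
import hashlib

BLOCK = 16


def toy_block_encrypt(key, block):
    """Toy stand-in for a block cipher (NOT AES, not invertible)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key, iv, blocks):
    """Serial: block i cannot start until ciphertext block i-1 exists."""
    out, previous = [], iv
    for block in blocks:
        previous = toy_block_encrypt(key, xor_bytes(block, previous))
        out.append(previous)
    return out


def independent_encrypt(key, blocks, order=None):
    """Each block depends only on itself, so any processing order works."""
    order = range(len(blocks)) if order is None else order
    out = [None] * len(blocks)
    for i in order:
        out[i] = toy_block_encrypt(key, blocks[i])
    return out
```

In the independent case each block could be handed to its own GPU thread, while CBC encryption forces one long dependency chain per message.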


Future Work

Moss, et al., “Toward Acceleration of RSA Using 3D Graphics Hardware”, Cryptography and Coding, Dec. 2007

“ElcomSoft Files Patent for Revolutionary Technique to Recover Lost Passwords Quickly”, Oct. 2007

Take advantage of future hardware features
- Scatter
- Double precision
- PCI-Express 2.0


Conclusion

GPUs are ready for cryptography

More info:
jasonc.yang@amd.com
streamcomputing@amd.com
http://ati.amd.com/technology/streamcomputing/
http://developer.amd.com


The Market – For the curious…

Desktop 3D Graphics - Units in 1,000s

                               2005      2006      2007      2008
7th Gen 3D (DX9)             25,437    18,547     7,070     2,553
8th Gen 3D (DX9c)            32,657    22,866     9,071     1,708
9th Gen 3D (WGF 1.0)            681    29,883    39,319    20,689
10th Gen 3D (WGF 2.0)             0       340    28,728    56,165
Total Desktop                58,775    71,636    84,188    81,115

Portable Graphics - Units in 1,000s

Portable 3D Acc.             13,550    18,210    23,688    26,768

Total GPU w/ acceleration    72,325    89,846   107,876   107,883
