Symmetric Key Cryptography on Modern Graphics Hardware

Jason Yang 

Graphics Products Group 

Jim Goodman

Motivation 

Digital Rights Management 

Advanced Access Content System (AACS) 

- Blu-Ray / HD-DVD

Key Searching

Outline 

• Why Graphics Hardware (GPU)? 

• GPGPU Programming Model 

• Block-Based AES 

- GPU Programming Example

• Key Searching

- Bitsliced DES and AES

• Conclusions

Why GPUs?

Why use GPUs instead of CPUs?

Potential for significant speedup for data parallel problems

– Basic: 5-10x

– Tuned: 20-100x or even more

Designed to process massive amounts of data efficiently

– Supports thousands of concurrent threads

– Massive memory bandwidth

– Memory latency hiding

~1-2 orders of magnitude better in several key metrics vs. current CPUs

– Memory bandwidth

– GFLOPS per Watt

– GFLOPS per $

[Chart: ratio RGPU/RCPU (axis 0 to 80) of memory bandwidth, GFLOP/W, and GFLOP/$ for the 2900XT, 2950XT, and 2950XT2 against six CPUs: AMD 6000+, AMD FX-62, Intel E6850, Intel Q6600, Intel QX6700, Intel QX6800]

GPU vs. CPU: Quick Comparison 

                 GPU (HD2900XT)                             CPU (Barcelona)
# Processors     64+                                        4
ALU area         ~40% of die                                ~5% of die
Memory System    Max bandwidth (10x)                        Min latency (0.1x)
Memory Access    Complex (tiling + arithmetic in memory)    Simple LD/ST
Cache            Small cache                                Large cache (10x)
FP Compliance    Partial IEEE SP**                          Full IEEE 754 DP/SP

** FS670/680 have DP/SP

GPU vs. CPU: Design Points 

                    GPU                                   CPU
Program Style       SIMD: few instructions, lots of data  Lots of instructions, little data
Control Flow        Hardware threading                    Out of order execution, branch prediction
Access Patterns     Little reuse                          Reuse + locality
Program Model       Data parallel                         Task parallel
Synchronization     Very simple sync                      Complex sync
Legacy Support      Not necessarily                       Backwards compatible
ISA                 Proprietary                           Standardized
Functional Deltas   Large and frequent                    Small and infrequent

GPU Internals: ATI Radeon HD2900XT 

[Figure: HD2900XT block diagram: Command Processor, Setup Unit, Vertex Index Fetch, Vertex Assembler, Geometry Assembler, Tessellator, Rasterizer, Interpolators, Hierarchical Z, Ultra-Threaded Dispatch Processor, Unified Shader Processors, Texture Units, L1 Texture Cache, L2 Texture Cache, Shader Caches (Instruction & Constant), Memory Read/Write Cache, Stream Out, Shader Export, Render Back-Ends, Z/Stencil Cache, Color Cache]

>100 GB/s memory bandwidth

– 512b DDR3/4 interface

Targeted for handling thousands of simultaneous lightweight threads

Instruction cache and constant cache for unlimited program size

Scalar ALU implementation with 320 (64x5) independent stream processors

– 256 (64x4) basic units (FMAC, ADD/SUB, SIN, etc.)

– 64 enhanced transcendental units (adds COS, LOG, EXP, RSQ, etc.)

– Support for INT/UINT in all units (ADD/SUB, AND, XOR, NOT, OR, etc.)

GPU Programming Model

General Purpose GPU (GPGPU) Computing

Not a new idea: graphics APIs (e.g., OpenGL) have been used for general purpose computation for years

• VERY difficult to use due to constraints of graphics APIs and the graphics programming model itself

• High overhead of graphics APIs yielded generally poor performance

Key enabler today is that GPU developers are investing effort to improve usability and GPGPU performance

• Introduction of high level C-like programming languages that access GPUs’ features (e.g., Brook+/CUDA)

• Support for simpler programming model

• Exposure of proprietary internal GPU features (e.g., ISA and IL specs)

• Creation of toolsets for emulation, performance tuning, and debugging

• Explicit architectural support for GPGPU (e.g., shared memory)

• Industry standardization efforts

GPUs are also starting to more closely resemble CPUs

• Native integer and DP support

• Advanced control flow features for branching/looping

GPU Hardware Simplified 

[Figure: simplified GPU: Command Queue, ALU Units, Texture Fetch Units, Memory Controller, Local Memory]

Simplified GPGPU Programming Model 

Virtualized SPMD Array of Threads

[Figure: virtual thread array T0,0 ... Tx-1,y-1 (Width = x, Height = y) mapped by the Scheduler onto the GPU's Shader Processing Cores SP0 ... SPN-1, each with ALUs 0 ... n-1, Registers/Constants, and a Memory Interface; Input Arrays and Output Arrays reside in Memory]

Split problem into virtual array of independent “pieces” processed by the virtualized SPMD array

Scheduler maps thread (i, j) of virtual SPMD array onto physical shader processor k

Data parallel model scales across multiple GPUs for massively parallel problems
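The mapping from virtual threads to physical processors can be emulated on a CPU. The sketch below is an illustration only: `run_spmd`, `invert_byte`, and the round-robin placement policy are assumptions, since the slides do not say how the real scheduler picks processor k.

```python
def run_spmd(kernel, inputs, num_processors):
    """Emulate a virtual SPMD thread array on the CPU (illustration only)."""
    height, width = len(inputs), len(inputs[0])
    outputs = [[None] * width for _ in range(height)]
    placement = {}
    for j in range(height):
        for i in range(width):
            # Hypothetical policy: round-robin over the physical processors.
            k = (j * width + i) % num_processors
            placement[(i, j)] = k
            # Every thread runs the same program on its own piece of data.
            outputs[j][i] = kernel(inputs[j][i])
    return outputs, placement


def invert_byte(value):
    """Example kernel: independent per-element work."""
    return value ^ 0xFF


grid = [[j * 4 + i for i in range(4)] for j in range(3)]  # width 4, height 3
result, placement = run_spmd(invert_byte, grid, num_processors=5)
```

Because no thread reads another thread's output, the loop order and the placement policy can change without affecting `result`. This independence is what lets the model scale across processors.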

Block-Based AES

Prior Work 

Cook, et al., “CryptoGraphics: Secret Key Cryptography Using Graphics Cards”, RSA Conference, 2005

- Uses native XOR on Memory Output (ROP)

Harrison and Waldron, “AES Encryption Implementation and Analysis on Commodity Graphics Processing Units”, CHES, Sept. 2007

- Cook method and programmable floating point units

Takeshi Yamanouchi, “AES Encryption and Decryption on the GPU”, GPU Gems 3, Aug. 2007

- Uses integer operations

GPU Programming Example: 

Block-Based AES 

Review: AES Round 

1) SubBytes 

2) ShiftRows 

3) MixColumns 

4) AddRoundKey

Code Example: Integer Operations 

… 

int4 c0, c1, c2, c3; 

for(int i=0; i

“Conventional” Block Based AES 

Non-bitsliced implementation via DX10 HLSL

float4 c0, r0;
// SubBytes + MixColumns: T-table lookups; ShiftRows: component swizzling
c0 = txMcol[r0.w].wzyx
   ^ txMcol[r3.z].xwzy
   ^ txMcol[r2.y].yxwz
   ^ txMcol[r1.x].zyxw;
// AddRoundKey: pre-computed round key lookup, native XOR support
r0 = c0 ^ tKeyadd[round_offset];

Utilize T-table based implementation

– Texture fetch does T_i[·] lookups

– Component swizzling reduces table count to 1 w/o performance hit

XOR supported natively in DX10 (not true in DX9)

Need to pre-compute key expansion

– Do on CPU in parallel with computations on GPU

27 Mops/s on HD 2900XT @ 750 MHz, ~3.5 Gbit/s
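The claim that swizzling reduces the table count to one can be checked on a CPU. The sketch below is not the HLSL shader above. It builds a single T-table, stands in for the swizzle with a tuple rotation, and compares one output column against a direct SubBytes, ShiftRows, MixColumns computation. All names (`T0`, `rot`, `column_ttable`, `column_reference`) are illustrative.

```python
def xtime(b):
    """Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
# One table only: entry = MixColumns coefficients (2, 1, 1, 3) times S[x].
T0 = [(gmul(2, s), s, s, gmul(3, s)) for s in SBOX]


def rot(vec, n):
    """Rotate a 4-component vector down by n places (stand-in for a swizzle)."""
    n %= 4
    return vec[-n:] + vec[:-n]


def column_ttable(state, j):
    """Column j after SubBytes + ShiftRows + MixColumns, using T0 and swizzles."""
    out = (0, 0, 0, 0)
    for row in range(4):
        byte = state[row][(j + row) % 4]      # ShiftRows: row r reads column j + r
        out = tuple(x ^ y for x, y in zip(out, rot(T0[byte], row)))
    return out


def column_reference(state, j):
    """Same column computed step by step, without tables."""
    s = [SBOX[state[row][(j + row) % 4]] for row in range(4)]
    return (
        gmul(2, s[0]) ^ gmul(3, s[1]) ^ s[2] ^ s[3],
        s[0] ^ gmul(2, s[1]) ^ gmul(3, s[2]) ^ s[3],
        s[0] ^ s[1] ^ gmul(2, s[2]) ^ gmul(3, s[3]),
        gmul(3, s[0]) ^ s[1] ^ s[2] ^ gmul(2, s[3]),
    )


# state[row][col], taken from the round 1 example in FIPS-197 Appendix B
state = [
    [0x19, 0xA0, 0x9A, 0xE9],
    [0x3D, 0xF4, 0xC6, 0xF8],
    [0xE3, 0xE2, 0x8D, 0x48],
    [0xBE, 0x2B, 0x2A, 0x08],
]
columns = [column_ttable(state, j) for j in range(4)]
```

The four classical tables T0 to T3 differ only by a rotation of their entries, so rotating the fetched value replaces three of them. On the GPU the rotation is a free register swizzle, which would explain the "w/o performance hit" note.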

Floating Point Implementation 

// Native XOR version, for reference
float4 c0, r0;
c0 = txMcol[r0.w].wzyx
   ^ txMcol[r3.z].xwzy
   ^ txMcol[r2.y].yxwz
   ^ txMcol[r1.x].zyxw;

// Without native XOR: each XOR becomes a lookup in Txor
float4 a0,a1,b0,b1,c0,t0,t1;
a0 = txMcol[r0.w].wzyx;
a1 = txMcol[r3.z].xwzy;
t0 = XOR(a0, a1);
b0 = txMcol[r2.y].yxwz;
b1 = txMcol[r1.x].zyxw;
t1 = XOR(b0, b1);
c0 = XOR(t0, t1);

float4 XOR(float4 a, float4 b)
{
    float4 res;   // "out" is a reserved word in HLSL
    res.x = Txor[a.x][b.x];
    res.y = Txor[a.y][b.y];
    res.z = Txor[a.z][b.z];
    res.w = Txor[a.w][b.w];
    return res;
}

// 2D-indexed txMcol: each lookup stands in for two fetches plus one XOR of the version above
float4 a, b, c0, r0;
a = txMcol[r0.w][r3.z];
b = txMcol[r2.y][r1.x];
c0 = XOR(a, b);
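The table-driven XOR can be emulated on a CPU. The sketch below assumes each float component holds an exact byte value between 0 and 255; `TXOR` and `xor_lookup` are illustrative names, not identifiers from the shader.

```python
# Built once on the CPU; a shader would only read it, for example as a texture.
TXOR = [[float(a ^ b) for b in range(256)] for a in range(256)]


def xor_lookup(a, b):
    """XOR two float4-like vectors whose components hold byte values 0..255."""
    return tuple(TXOR[int(x)][int(y)] for x, y in zip(a, b))


a = (18.0, 52.0, 86.0, 120.0)   # 0x12 0x34 0x56 0x78
b = (255.0, 15.0, 240.0, 0.0)   # 0xFF 0x0F 0xF0 0x00
c = xor_lookup(a, b)
```

Each four-component XOR costs four table fetches here. Folding two T-table fetches and one XOR into a single 2D lookup, as in the last code fragment above, reduces that fetch count.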

Floating Point Performance 

Same approach as Harrison and Waldron, CHES, Sept. 2007, using floating point hardware to emulate integer operations

Harrison achieves rates of 300 Mbit/s using XOR lookup tables, but …

“ALU instructions are not presenting a bottleneck. This could be shown by the removal of all ALU instructions within the algorithm implementation which resulted in no performance difference.”

Solution is to use both lookup tables and ALU instructions: up to 990 Mbit/s

AES Results Comparison 

Method              GPU                      Paper       Year   Mbit/s
ROP                 Nvidia TNT2              Cook        1999        0.73
                    Nvidia Geforce3          Cook        2001        1.53
Floating Point Ops  Nvidia Geforce 6600 GT   Harrison    2004      361.20
                    ATI Radeon X1950 XTX     Yang        2006      840.00
                    ATI Radeon HD 2900 XT    Yang        2007      990.00
Integer Ops         Nvidia 8800 GTS          Yamanouchi  2007    3,000.00
                    ATI Radeon HD 2900 XT    Yang        2007    3,500.00
Bit-Sliced          ATI Radeon HD 2900 XT    Yang        2007   18,500.00
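The Mbit/s figures for the HD 2900 XT appear to follow from the Mops/s rates quoted elsewhere in the talk (27 Mops/s for block-based AES, 145 Mops/s peak for bitsliced AES). The conversion below assumes that one "op" is one 128-bit AES block; the slides do not state this explicitly.

```latex
\begin{align*}
27 \times 10^{6}\ \text{blocks/s} \times 128\ \text{bit/block}
  &= 3.456 \times 10^{9}\ \text{bit/s} \approx 3{,}500\ \text{Mbit/s} \\
145 \times 10^{6}\ \text{blocks/s} \times 128\ \text{bit/block}
  &= 1.856 \times 10^{10}\ \text{bit/s} \approx 18{,}500\ \text{Mbit/s}
\end{align*}
```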

Key Searching

Bitsliced DES and AES

DES KEYSEARCH Application 

[Figure: 2048 x 2048 grid of jobs J0,0 ... J2047,2047 (Width = 2048, Height = 2048); each job is a 64 x 64 array of threads T0,0 ... T63,63 (Width = 64, Height = 64)]

Each thread checks $2^{22}$ keys

Each job checks $2^{22} \cdot 2^{6} \cdot 2^{6} = 2^{34}$ keys

KEYSEARCH application checks $2^{34} \cdot 2^{11} \cdot 2^{11} = 2^{56}$ keys

Perfectly parallelizable = ideally suited to GPUs

– Job is basic unit of computation

– Each thread performs $2^{16}$ iterations, each of which checks $2^{6}$ keys
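The key counts on this slide compose level by level, from iteration to thread to job to the whole application. Written out, using only the figures given above:

```latex
\begin{align*}
\text{keys per thread} &= 2^{16}\ \text{iterations} \times 2^{6}\ \text{keys/iteration} = 2^{22} \\
\text{threads per job} &= 64 \times 64 = 2^{6} \cdot 2^{6} = 2^{12} \\
\text{keys per job}    &= 2^{22} \cdot 2^{12} = 2^{34} \\
\text{jobs}            &= 2048 \times 2048 = 2^{11} \cdot 2^{11} = 2^{22} \\
\text{keys in total}   &= 2^{34} \cdot 2^{22} = 2^{56}
\end{align*}
```

The total matches the size of the 56-bit DES key space.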

Bitsliced DES Implementation 

Bitsliced to compute 64 blocks per iteration

– 32 x 4 x 32-bit registers hold cipher state

– Key schedule computed explicitly in instructions via SetupSbox

[Figure: program flow with instruction counts, summarized below]

Init Program: Sample Data (6)

Shader Program, per iteration (4691 instructions):

Stage                     Steps (instructions)                                                                          Total
Init State                                                                                                                 32
Odd rounds (1, 3, ...)    Setup Sbox (6) before each of Sbox15_odd (69), Sbox26_odd (65), Sbox37_odd (63), Sbox48_odd (61)      282
Even rounds (2, ..., 16)  Setup Sbox (6) before each of Sbox15_even (72), Sbox26_even (64), Sbox37_even (63), Sbox48_even (61)  284
Finish                    Check Result (99), Increment Key (32)                                                           131
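Bitslicing stores bit b of every block in one machine word, so a single logic instruction processes all blocks at once. The sketch below is not the DES shader: it uses 64 lanes of toy 8-bit blocks and a made-up gate, and `to_bitslices` and `from_bitslices` are illustrative names.

```python
def to_bitslices(blocks, width):
    """Transpose: bit `lane` of slices[bit] equals bit `bit` of blocks[lane]."""
    slices = [0] * width
    for lane, block in enumerate(blocks):
        for bit in range(width):
            if (block >> bit) & 1:
                slices[bit] |= 1 << lane
    return slices


def from_bitslices(slices, lanes):
    """Inverse transpose back to one integer per block."""
    blocks = [0] * lanes
    for bit, word in enumerate(slices):
        for lane in range(lanes):
            if (word >> lane) & 1:
                blocks[lane] |= 1 << bit
    return blocks


LANES = 64
MASK = (1 << LANES) - 1
blocks = [(37 * i + 11) & 0xFF for i in range(LANES)]  # arbitrary sample data
s = to_bitslices(blocks, 8)

# Toy gate network (not a real S-box): out = (bit0 AND bit1) XOR (NOT bit2).
# Three word-wide operations evaluate it for all 64 blocks simultaneously.
out0 = (s[0] & s[1]) ^ (~s[2] & MASK)
```

An S-box written as a network of AND, XOR, and NOT gates maps directly onto this form, which is why the instruction counts per S-box determine the cost of each round.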

Bitsliced DES Performance 

[Chart: rate (Mops/s, axis 0 to 600) vs. iterations/thread (N, 1 to 100000, log scale)]

[Chart: performance ratio (RGPU/RCPU, axis 0 to 70) vs. iterations/thread (N, 1 to 100000, log scale)]

GPU (ATI 2900XT @ 750MHz) peak rate of 545 Mops/s

– 89% of GPU peak rate at 16 iterations/thread

CPU (Athlon 64 FX-62 @ 2.8GHz) peak rate of 9.0 Mops/s

19-60x performance improvement over CPU
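To put the peak rates in perspective, the estimate below assumes that one "op" is one key trial (the slides do not define the unit) and that the full $2^{56}$ key space is swept; on average a search would finish in about half that time.

```latex
\begin{align*}
t_{\text{GPU}} &= \frac{2^{56}}{545 \times 10^{6}\ \text{s}^{-1}}
  \approx \frac{7.21 \times 10^{16}}{5.45 \times 10^{8}}\ \text{s}
  \approx 1.32 \times 10^{8}\ \text{s} \approx 4.2\ \text{years} \\
t_{\text{CPU}} &= \frac{2^{56}}{9.0 \times 10^{6}\ \text{s}^{-1}}
  \approx 8.0 \times 10^{9}\ \text{s} \approx 254\ \text{years} \\
\frac{R_{\text{GPU}}}{R_{\text{CPU}}} &= \frac{545}{9.0} \approx 60.6
\end{align*}
```

The last ratio corresponds to the upper end of the 19-60x range quoted above.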

Bitsliced AES Implementation 

Utilized in AES KEYSEARCH application

Bitsliced to compute 32 blocks per iteration

– Key schedule computed on the fly

[Figure: program flow with instruction counts, summarized below]

Init Program: Sample Data (6)

Shader Program, per iteration (8560 instructions):

Stage        Steps (instructions)                                                           Total
Setup        Init State (32), Init Key (32), Add Key (32)                                      96
Rounds 1-9   4 x Byte Sub/Shift (126), Mix Columns (153), Update Key (160), Add Key (32)      849
Round 10     4 x Byte Sub/Shift (126), Update Key (160), Add Key (32)                         696
Finish       CheckResult (99), IncrementKey (32)                                              131

Composite Sbox Implementation 

[Figure: Composite Normal Basis Sbox: input x[7:0] passes through XFORM from the $GF(2^{8})$ representation to the $GF(2^{4})$ representation as $(\gamma_1, \gamma_0)$, then through $GF(2^{4})$ blocks including $\nu \times \gamma^{2}$ and $1/\gamma$, then through XFORM$^{-1}$ to output y[7:0]]

Optimized implementation of Canright’s composite Sbox

– 126 instructions

– Previous best-reported bitsliced Sbox was 205 instructions

Bitsliced AES Performance Results 

[Chart: rate (Mops/s, axis 0 to 200) vs. iterations/thread (N, 1 to 100000, log scale); series: w/RoundKey Update, w/o RoundKey Update]

[Chart: performance ratio (RGPU/RCPU, axis 0 to 18) vs. iterations/thread (N, 1 to 100000, log scale); series: w/o Update vs. CPU @ 13Mops/s, w/o Update vs. CPU @ 9Mops/s]

GPU (ATI 2900XT @ 750MHz) peak rate of 145 Mops/s

– 92% of GPU peak rate at 16 iterations/thread

CPU (Athlon 64 3500+ @ 2.2GHz) peak rate of 13 Mops/s

– Pre-computed key schedule, algorithm-only w/no key checking

6-16x performance improvement over CPU

Conclusions

Limitations 

GPUs are not optimized for serial operations 

- e.g., chained protection modes such as CBC

GPUs are still designed for graphics 
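The CBC limitation comes from the data dependency between blocks. The sketch below uses a toy 32-bit permutation in place of a real cipher (it is not secure and not AES) to contrast chained encryption with independent per-block encryption. All function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def toy_encrypt(block, key):
    """Toy 32-bit permutation standing in for a block cipher. NOT secure."""
    return (((block ^ key) * 0x9E3779B1) + 1) & MASK32


def cbc_encrypt(blocks, key, iv):
    """Chained mode: block i needs ciphertext i - 1, so the loop is serial."""
    out, prev = [], iv
    for plain in blocks:
        prev = toy_encrypt(plain ^ prev, key)
        out.append(prev)
    return out


def independent_encrypt(blocks, key):
    """Every block stands alone, so each could be handled by its own thread."""
    return [toy_encrypt(plain, key) for plain in blocks]


key, iv = 0x0BADF00D, 0x12345678
original = [1, 2, 3, 4]
modified = [9, 2, 3, 4]            # only the first block differs

cbc_a = cbc_encrypt(original, key, iv)
cbc_b = cbc_encrypt(modified, key, iv)
ind_a = independent_encrypt(original, key)
ind_b = independent_encrypt(modified, key)
```

Changing the first plaintext block changes every CBC ciphertext block but only the first independent one. That chain is what keeps a single CBC stream from being spread across threads.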


Future Work 

Moss, et al., “Toward Acceleration of RSA Using 3D Graphics Hardware”, Cryptography and Coding, Dec. 2007

“ElcomSoft Files Patent for Revolutionary Technique to Recover Lost Passwords Quickly”, Oct. 2007

Take advantage of future hardware features

- Scatter

- Double precision

- PCI-Express 2.0

Conclusion 

GPUs are ready for cryptography 

More info: 

jasonc.yang@amd.com 

streamcomputing@amd.com 

http://ati.amd.com/technology/streamcomputing/ 

http://developer.amd.com 


The Market – For the curious… 

Desktop 3D Graphics - Units in 1,000s

                             2005      2006      2007      2008
7th Gen 3D (DX9)           25,437    18,547     7,070     2,553
8th Gen 3D (DX9c)          32,657    22,866     9,071     1,708
9th Gen 3D (WGF 1.0)          681    29,883    39,319    20,689
10th Gen 3D (WGF 2.0)           0       340    28,728    56,165
Total Desktop              58,775    71,636    84,188    81,115

Portable Graphics - Units in 1,000s

                             2005      2006      2007      2008
Portable 3D Acc.           13,550    18,210    23,688    26,768

Total GPU w/ acceleration  72,325    89,846   107,876   107,883
