Symmetric Key Cryptography on Modern Graphics Hardware
Jason Yang
Graphics Products Group
Jim Goodman
Motivation
Digital Rights Management
Advanced Access Content System (AACS)
- Blu-Ray / HD-DVD
Key Searching
Outline
• Why Graphics Hardware (GPU)?
• GPGPU Programming Model
• Block-Based AES
- GPU Programming Example
• Key Searching
- Bitsliced DES and AES
• Conclusions
Why GPUs?
Why use GPUs instead of CPUs?
Potential for significant speedup for data parallel problems
– Basic: 5-10x
– Tuned: 20-100x or even more
Designed to process massive amounts of data efficiently
– Supports 1000's of concurrent threads
– Massive memory bandwidth
– Memory latency hiding
~1-2 orders of magnitude better in several key metrics vs. current CPUs
– Memory bandwidth
– GFLOPS per Watt
– GFLOPS per $
[Chart: ratio RGPU/RCPU (axis 0 to 80) for Memory BW, GFLOP/W, and GFLOP/$ of the 2900XT, 2950XT, and 2950XT2 vs. each CPU: AMD 6000+, AMD FX-62, Intel E6850, Intel Q6600, Intel QX6700, Intel QX6800]
GPU vs. CPU: Quick Comparison
                 GPU (HD2900XT)                             CPU (Barcelona)
# Processors     64+                                        4
ALU area         ~40% of die                                ~5% of die
Memory System    Max bandwidth (10x)                        Min latency (0.1x)
Memory Access    Complex (tiling + arithmetic in memory)    Simple LD/ST
Cache            Small cache                                Large cache (10x)
FP Compliance    Partial IEEE SP**                          Full IEEE 754 DP/SP
** FS670/680 have DP/SP
GPU vs. CPU: Design Points
                    GPU                              CPU
Program Style       Few instructions, lots of data   Lots of instructions, little data
Control Flow        SIMD, hardware threading         Out of order execution, branch prediction
Access Patterns     Little reuse                     Reuse + locality
Program Model       Data parallel                    Task parallel
Synchronization     Very simple sync                 Complex sync
Legacy Support      Not necessarily                  Backwards compatible
ISA                 Proprietary                      Standardized
Functional Deltas   Large and frequent               Small and infrequent
GPU Internals: ATI Radeon HD2900XT
[Block diagram: Command Processor, Setup Unit, Vertex Assembler, Geometry Assembler, Tessellator, Vertex Index Fetch, Rasterizer, Interpolators, Hierarchical Z, Ultra-Threaded Dispatch Processor, Unified Shader Processors, Texture Units, L1 and L2 Texture Cache, Shader Caches (Instruction & Constant), Memory Read/Write Cache, Stream Out, Shader Export, Render Back-Ends, Z/Stencil Cache, Color Cache]
>100 GB/s memory bandwidth
– 512b DDR3/4 interface
Targeted for handling thousands of simultaneous lightweight threads
Instruction cache and constant cache for unlimited program size
Scalar ALU implementation with 320 (64x5) independent stream processors
– 256 (64x4) basic units (FMAC, ADD/SUB, SIN, etc.)
– 64 enhanced transcendental units (adds COS, LOG, EXP, RSQ, etc.)
– Support for INT/UINT in all units (ADD/SUB, AND, XOR, NOT, OR, etc.)
GPU Programming Model
General Purpose GPU (GPGPU) Computing
Not a new idea: graphics APIs (e.g., OpenGL) have been used for general purpose computation for years
• VERY difficult to use due to constraints of graphics APIs and the graphics programming model itself
• High overhead of graphics APIs yielded generally poor performance
Key enabler today is that GPU developers are investing effort to improve usability and GPGPU performance
• Introduction of high level C-like programming languages that access GPUs' features (e.g., Brook+/CUDA)
• Support for simpler programming model
• Exposure of proprietary internal GPU features (e.g., ISA and IL specs)
• Creation of toolsets for emulation, performance tuning, and debugging
• Explicit architectural support for GPGPU (e.g., shared memory)
• Industry standardization efforts
GPUs are also starting to more closely resemble CPUs
• Native integer and DP support
• Advanced control flow features for branching/looping
GPU Hardware Simplified
[Diagram: Command Queue, ALU Units, Texture Fetch Units, Memory Controller, Local Memory]
Simplified GPGPU Programming Model
[Diagram: a virtualized SPMD array of threads T0,0 ... Tx-1,y-1 (Width = x, Height = y) reads input arrays from memory, is mapped by the scheduler onto the GPU's shader processing cores (ALU 0 ... ALU n-1, shader processors SP0 ... SPk ... SPN-1, registers/constants, memory interface), and writes output arrays back to memory]
Split problem into virtual array of independent “pieces” processed by the virtualized SPMD array
Scheduler maps thread (i, j) of virtual SPMD array onto physical shader processor k
Data parallel model scales across multiple GPUs for massively parallel problems
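The model above can be sketched on a CPU in a few lines. This is only an illustration under stated assumptions: `run_spmd`, `xor_with_constant`, and the round-robin choice of processor `k` are hypothetical names and policies, not the hardware scheduler's actual behavior.

```python
from typing import Callable, Dict, List, Tuple

def run_spmd(kernel: Callable[[int, int, List[List[int]]], int],
             inputs: List[List[int]],
             num_processors: int = 4):
    """Run `kernel` once for every thread (i, j) of a virtual 2D thread array.

    The round-robin mapping to processor k is an assumption made for this
    sketch; the real scheduler's policy is not described in the slides.
    """
    height = len(inputs)
    width = len(inputs[0])
    output = [[0] * width for _ in range(height)]
    assignment: Dict[Tuple[int, int], int] = {}
    for j in range(height):
        for i in range(width):
            k = (j * width + i) % num_processors  # physical processor k
            assignment[(i, j)] = k
            # Each thread reads the input arrays and writes only its own
            # output element, so threads never depend on each other.
            output[j][i] = kernel(i, j, inputs)
    return output, assignment

def xor_with_constant(i: int, j: int, inputs: List[List[int]]) -> int:
    """Toy kernel: one independent 'piece' of work per thread."""
    return inputs[j][i] ^ 0x5A

data = [[1, 2, 3], [4, 5, 6]]
result, placement = run_spmd(xor_with_constant, data, num_processors=4)
print(result)
print(placement)
```

Because no thread reads another thread's output, the loop order and the mapping to processors can change freely without changing the result, which is what lets the same program scale across more processors or more GPUs.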
Block-Based AES
Prior Work
Cook, et al., “CryptoGraphics: Secret Key Cryptography Using Graphics Cards”, RSA Conference, 2005
- Uses native XOR on Memory Output (ROP)
Harrison and Waldron, “AES Encryption Implementation and Analysis on Commodity Graphics Processing Units”, CHES, Sept. 2007
- Cook method and programmable floating point units
Takeshi Yamanouchi, “AES Encryption and Decryption on the GPU”, GPU Gems 3, Aug. 2007
- Uses integer operations
GPU Programming Example: Block-Based AES
Review: AES Round
1) SubBytes
2) ShiftRows
3) MixColumns
4) AddRoundKey
Code Example: Integer Operations
…
int4 c0, c1, c2, c3;
for(int i=0; i
“Conventional” Block-Based AES
float4 c0, r0;
// SubBytes + MixColumns: one T-table fetch per input byte
// ShiftRows: component swizzling
// The four fetches are combined with native XOR support
c0 = txMcol[r0.w].wzyx
   ^ txMcol[r3.z].xwzy
   ^ txMcol[r2.y].yxwz
   ^ txMcol[r1.x].zyxw;
// AddRoundKey: pre-computed round key lookup
r0 = c0 ^ tKeyadd[round_offset];
Non-bitsliced implementation via DX10 HLSL
Utilize T-table based implementation
– Texture fetch does T_i[•] lookups
– Component swizzling reduces table count to 1 w/o performance hit
XOR supported natively in DX10 (not true in DX9)
Need to pre-compute key expansion
– Do on CPU in parallel with computations on GPU
27 Mops/s on HD 2900XT @ 750MHz ~ 3.5 Gb/s
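As a rough check of the T-table idea, the sketch below builds one table T0 from the AES S-box and shows that rotating its entries (a stand-in for HLSL component swizzling) reproduces SubBytes, ShiftRows, and MixColumns. The function names, the tuple layout, and the rotation direction are assumptions of this sketch; they do not reproduce the byte packing behind the `.wzyx`-style swizzles in the shader, and AddRoundKey is left out.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def make_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = [0] * 256
    for x in range(256):
        # Brute-force inverse; 0 maps to 0 by convention.
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox[x] = s ^ 0x63
    return sbox

SBOX = make_sbox()
# Single table: T0[x] = (2*S[x], S[x], S[x], 3*S[x]), one MixColumns column.
T0 = [(gf_mul(2, s), s, s, gf_mul(3, s)) for s in SBOX]

def swizzle(t, n):
    """Rotate a 4-tuple right by n positions (plays the role of a swizzle)."""
    return t[-n:] + t[:-n]

def round_direct(state):
    """SubBytes, ShiftRows, MixColumns applied step by step to state[row][col]."""
    sub = [[SBOX[b] for b in row] for row in state]
    shifted = [[sub[r][(c + r) % 4] for c in range(4)] for r in range(4)]
    out = [[0] * 4 for _ in range(4)]
    for c in range(4):
        a = [shifted[r][c] for r in range(4)]
        out[0][c] = gf_mul(2, a[0]) ^ gf_mul(3, a[1]) ^ a[2] ^ a[3]
        out[1][c] = a[0] ^ gf_mul(2, a[1]) ^ gf_mul(3, a[2]) ^ a[3]
        out[2][c] = a[0] ^ a[1] ^ gf_mul(2, a[2]) ^ gf_mul(3, a[3])
        out[3][c] = gf_mul(3, a[0]) ^ a[1] ^ a[2] ^ gf_mul(2, a[3])
    return out

def round_ttable(state):
    """Same three steps using four lookups into T0 per output column."""
    out = [[0] * 4 for _ in range(4)]
    for c in range(4):
        col = (0, 0, 0, 0)
        for r in range(4):
            # Picking column (c + r) % 4 performs ShiftRows; the rotation
            # replaces the separate tables T1, T2, T3.
            t = swizzle(T0[state[r][(c + r) % 4]], r)
            col = tuple(x ^ y for x, y in zip(col, t))
        for r in range(4):
            out[r][c] = col[r]
    return out

state = [[(17 * r + 31 * c + 5) & 0xFF for c in range(4)] for r in range(4)]
assert round_ttable(state) == round_direct(state)
```

The design point is that T1, T2, and T3 are byte rotations of T0, so hardware that can reorder vector components for free needs only one table in texture memory.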
Floating Point Implementation
With native XOR:
float4 c0, r0;
c0 = txMcol[r0.w].wzyx
   ^ txMcol[r3.z].xwzy
   ^ txMcol[r2.y].yxwz
   ^ txMcol[r1.x].zyxw;
With XOR emulated through a lookup table:
float4 a0,a1,b0,b1,c0,t0,t1;
a0 = txMcol[r0.w].wzyx;
a1 = txMcol[r3.z].xwzy;
t0 = XOR(a0, a1);
b0 = txMcol[r2.y].yxwz;
b1 = txMcol[r1.x].zyxw;
t1 = XOR(b0, b1);
c0 = XOR(t0, t1);
float4 XOR(float4 a, float4 b)
{
    // Per-component XOR through the Txor lookup table
    float4 result;  // renamed from "out", which is a reserved word in HLSL
    result.x = Txor[a.x][b.x];
    result.y = Txor[a.y][b.y];
    result.z = Txor[a.z][b.z];
    result.w = Txor[a.w][b.w];
    return result;
}
With two-index txMcol lookups, leaving a single XOR call:
float4 a, b, c0, r0;
a = txMcol[r0.w][r3.z];
b = txMcol[r2.y][r1.x];
c0 = XOR(a, b);
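The table-based XOR can be sketched as follows. This is an illustration under assumptions: `TXOR`, `xor_lookup`, and `xor4` are hypothetical names, byte values are stored as floats to mimic hardware without integer support, and the 256 x 256 table size is inferred from the byte-wide indices rather than stated in the slides.

```python
# 256 x 256 table of XOR results, stored as floats the way a shader
# without integer support would see them.
TXOR = [[float(a ^ b) for b in range(256)] for a in range(256)]

def xor_lookup(a: float, b: float) -> float:
    """XOR two byte values held in floats using only a table fetch."""
    return TXOR[int(a)][int(b)]

def xor4(a, b):
    """Component-wise XOR of two 4-vectors, like the XOR() helper above."""
    return tuple(xor_lookup(x, y) for x, y in zip(a, b))

print(xor4((1.0, 255.0, 16.0, 0.0), (1.0, 15.0, 1.0, 7.0)))
```

Every XOR costs four table fetches in this scheme, which is consistent with the observation quoted on the next slide that table lookups rather than ALU instructions limit the floating point version.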
Floating Point Performance
Same approach as Harrison and Waldron, CHES, Sept. 2007, using floating point hardware to emulate integer operations
Harrison achieves rates of 300 Mbit/s using XOR lookup tables, but …
“ALU instructions are not presenting a bottleneck. This could be shown by the removal of all ALU instructions within the algorithm implementation which resulted in no performance difference.”
Solution is to use both lookup tables and ALU instructions: up to 990 Mbit/s
AES Results Comparison
Method               GPU                       Paper        Year   Mbit/s
ROP                  Nvidia TNT2               Cook         1999   0.73
ROP                  Nvidia Geforce3           Cook         2001   1.53
ROP                  Nvidia Geforce 6600 GT    Harrison     2004   361.20
ROP                  Nvidia Geforce 7900 GT    Harrison     2006   870.88
Floating Point Ops   Nvidia Geforce 6600 GT    Harrison     2004   80.29
Floating Point Ops   Nvidia Geforce 7900 GT    Harrison     2006   313.84
Floating Point Ops   ATI Radeon X1950 XTX      Yang         2006   840.00
Floating Point Ops   ATI Radeon HD 2900 XT     Yang         2007   990.00
Integer Ops          Nvidia 8800 GTS           Yamanouchi   2007   3,000.00
Integer Ops          ATI Radeon HD 2900 XT     Yang         2007   3,500.00
Bit-Sliced           ATI Radeon HD 2900 XT     Yang         2007   18,500.00
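The Mbit/s figures for the HD 2900 XT appear to follow from the Mops/s rates quoted elsewhere in the deck, assuming one "op" is one 128-bit AES block encryption (that reading is an assumption; the slides do not define the unit). A quick check:

```python
AES_BLOCK_BITS = 128  # AES block size in bits

def mops_to_mbit_per_s(mops):
    """Convert block operations per second (in millions) to Mbit/s."""
    return mops * AES_BLOCK_BITS

integer_rate = mops_to_mbit_per_s(27)     # block-based integer AES, 27 Mops/s
bitsliced_rate = mops_to_mbit_per_s(145)  # bitsliced AES peak, 145 Mops/s
print(integer_rate, bitsliced_rate)
```

Under that assumption the results are 3,456 and 18,560 Mbit/s, which round to the 3,500 and 18,500 entries in the table.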
Key Searching
Bitsliced DES and AES
DES KEYSEARCH Application
[Diagram: the application is a 2048 x 2048 grid of jobs J0,0 ... J2047,2047 (Width = 2048, Height = 2048); each job is a 64 x 64 grid of threads T0,0 ... T63,63 (Width = 64, Height = 64)]
Each thread checks $2^{22}$ keys
Each job checks $2^{22} \cdot 2^{6} \cdot 2^{6} = 2^{34}$ keys
KEYSEARCH application checks $2^{34} \cdot 2^{11} \cdot 2^{11} = 2^{56}$ keys
Perfectly parallelizable = ideally suited to GPUs
– Job is basic unit of computation
– Each thread performs $2^{16}$ iterations, each of which checks $2^{6}$ keys
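The decomposition of the 56-bit key space can be verified with a short sketch. The constants come from the slide; the linear layout in `first_key` (job-major, then thread, then iteration) is a hypothetical choice made for illustration, and the time estimate assumes one "op" at the quoted 545 Mops/s peak is one key check.

```python
JOB_GRID = 2048              # jobs per side of the application grid
THREAD_GRID = 64             # threads per side of one job
ITERATIONS = 2 ** 16         # iterations per thread
KEYS_PER_ITERATION = 2 ** 6  # keys checked per bitsliced iteration

keys_per_thread = ITERATIONS * KEYS_PER_ITERATION
keys_per_job = keys_per_thread * THREAD_GRID ** 2
total_keys = keys_per_job * JOB_GRID ** 2

def first_key(job_x, job_y, thread_x, thread_y, iteration):
    """Index of the first key checked in one iteration.

    The ordering of jobs, threads, and iterations is an assumption of this
    sketch; any one-to-one layout covers the key space equally well.
    """
    job = job_y * JOB_GRID + job_x
    thread = thread_y * THREAD_GRID + thread_x
    return (job * keys_per_job
            + thread * keys_per_thread
            + iteration * KEYS_PER_ITERATION)

# Seconds to sweep the whole space on one GPU at the quoted peak rate,
# assuming one op equals one key check.
seconds_at_peak = total_keys / 545e6
print(total_keys == 2 ** 56, seconds_at_peak)
```

Every (job, thread, iteration) triple owns a disjoint run of 64 keys, so no communication between threads is needed, which is what makes the search perfectly parallelizable.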
Bitsliced DES Implementation
Bitsliced to compute 64 blocks per iteration
– 32 x 4 x 32-bit registers hold cipher state
– Key schedule computed explicitly in instructions via SetupSbox
[Diagram: shader program structure with instruction counts]
Init: Init Program 6, Sample Data
Iteration: 4691 instructions
– Init State: 32
– Round 1 (282): Setup Sbox 6, Sbox15_odd 69, Setup Sbox 6, Sbox26_odd 65, Setup Sbox 6, Sbox37_odd 63, Setup Sbox 6, Sbox48_odd 61
– Round 2 (284): Setup Sbox 6, Sbox15_even 72, Setup Sbox 6, Sbox26_even 64, Setup Sbox 6, Sbox37_even 63, Setup Sbox 6, Sbox48_even 61
– Round 3 . . . Round 16: 282 for odd rounds, 284 for even rounds
– Finish (131): Check Result 99, Increment Key 32
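Bitslicing transposes the data so that one machine word holds the same bit position from many blocks, and each logical instruction then acts on all blocks at once. The sketch below illustrates the idea with 64 lanes in Python integers; the function names are hypothetical and the layout is not meant to match the shader's register packing (32 x 4 x 32-bit registers, i.e. 4096 bits for 64 blocks of 64 bits).

```python
def to_bitslices(blocks, width=64):
    """Transpose blocks so that slices[i] holds bit i of every block,
    with block number `lane` stored in bit `lane` of the slice."""
    slices = [0] * width
    for lane, block in enumerate(blocks):
        for i in range(width):
            if (block >> i) & 1:
                slices[i] |= 1 << lane
    return slices

def from_bitslices(slices, lanes):
    """Inverse transpose: rebuild the individual blocks from the slices."""
    blocks = [0] * lanes
    for i, s in enumerate(slices):
        for lane in range(lanes):
            if (s >> lane) & 1:
                blocks[lane] |= 1 << i
    return blocks

def sliced_xor(a, b):
    """One XOR per bit position processes all lanes simultaneously."""
    return [x ^ y for x, y in zip(a, b)]

def sliced_permute(slices, order):
    """A bit permutation is only a renaming of slices; no data moves."""
    return [slices[src] for src in order]

MASK64 = 2 ** 64 - 1
a_blocks = [(0x0123456789ABCDEF * (n + 1)) & MASK64 for n in range(64)]
b_blocks = [0xFEDCBA9876543210 ^ n for n in range(64)]

xored = from_bitslices(sliced_xor(to_bitslices(a_blocks), to_bitslices(b_blocks)), 64)
assert xored == [x ^ y for x, y in zip(a_blocks, b_blocks)]
```

In this form the bit permutations of DES cost nothing at run time, and the S-boxes become fixed sequences of AND, OR, XOR, and NOT instructions, which matches the per-S-box instruction counts listed above.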
Bitsliced DES Performance
[Charts: Rate (Mops/s, axis 0 to 600) and Performance Ratio (RGPU/RCPU, axis 0 to 70), each plotted against Iterations/Thread (N) from 1 to 100000]
GPU (ATI 2900XT @ 750MHz) peak rate of 545 Mops/s
– 89% of GPU peak rate at 16 iterations/thread
CPU (Athlon 64 FX-62 @ 2.8GHz) peak rate of 9.0 Mops/s
19-60x performance improvement over CPU
Bitsliced AES Implementation
Utilized in AES KEYSEARCH application
Bitsliced to compute 32 blocks per iteration
– Key schedule computed on the fly
[Diagram: shader program structure with instruction counts]
Init: Init Program 6, Sample Data
Iteration: 8560 instructions
– Setup (96): Init State 32, Init Key 32, Add Key 32
– Round 1 . . . Round 9 (849 each): Byte Sub/Shift 4 x 126, Mix Columns 153, Update Key 160, Add Key 32
– Round 10 (696): Byte Sub/Shift 4 x 126, Update Key 160, Add Key 32
– Finish (131): CheckResult 99, IncrementKey 32
Composite Sbox Implementation
Composite Normal Basis Sbox
[Diagram: input x[7:0] passes through XFORM from the $GF(2^8)$ representation to the $GF(2^4)$ representation (γ1, γ0), then through $GF(2^4)$ multipliers, a ν × γ² block, and a 1/γ inverter in $GF(2^4)$, and finally through XFORM⁻¹ back to $GF(2^8)$ to give output y[7:0]]
Optimized implementation of Canright's composite Sbox
– 126 instructions
– Previous best-reported bitsliced Sbox was 205 instructions
Bitsliced AES Performance Results
[Charts: Rate (Mops/s, axis 0 to 200) with series w/ RoundKey Update and w/o RoundKey Update, and Performance Ratio (RGPU/RCPU, axis 0 to 18) with series w/o Update vs. CPU @ 13Mops/s and w/o Update vs. CPU @ 9Mops/s, each plotted against Iterations/Thread (N) from 1 to 100000]
GPU (ATI 2900XT @ 750MHz) peak rate of 145 Mops/s
– 92% of GPU peak rate at 16 iterations/thread
CPU (Athlon 64 3500+ @ 2.2GHz) peak rate of 13 Mops/s
– Pre-computed key schedule, algorithm-only w/no key checking
6-16x performance improvement over CPU
Conclusions
Limitations
GPUs are not optimized for serial operations
- Protection modes (CBC)
GPUs are still designed for graphics
Future Work
Moss, et al., “Toward Acceleration of RSA Using 3D Graphics Hardware”, Cryptography and Coding, Dec. 2007
“ElcomSoft Files Patent for Revolutionary Technique to Recover Lost Passwords Quickly”, Oct. 2007
Take advantage of future hardware features
- Scatter
- Double precision
- PCI-Express 2.0
Conclusion
GPUs are ready for cryptography
More info:
jasonc.yang@amd.com
streamcomputing@amd.com
http://ati.amd.com/technology/streamcomputing/
http://developer.amd.com
The Market – For the curious…
Desktop 3D Graphics - Units in 1,000s
                            2005     2006     2007      2008
7th Gen 3D (DX9)            25,437   18,547   7,070     2,553
8th Gen 3D (DX9c)           32,657   22,866   9,071     1,708
9th Gen 3D (WGF 1.0)        681      29,883   39,319    20,689
10th Gen 3D (WGF 2.0)       0        340      28,728    56,165
Total Desktop               58,775   71,636   84,188    81,115
Portable Graphics - Units in 1,000s
                            2005     2006     2007      2008
Portable 3D Acc.            13,550   18,210   23,688    26,768
Total GPU w/ acceleration   72,325   89,846   107,876   107,883