Fundamentals of CELL Programming

Seminar talk given as part of the seminar
"Ausgewählte Themen in Hardwareentwurf und Optik" (Selected Topics in Hardware Design and Optics)

Winter semester 2005/2006

by Benjamin Kalisch

Contents

What is Cell
First generation (hardware)
Special registers & channels
...
...
...
MFC
SPE
Linux and libraries
Programming example
Conclusions

Source: [16]

What is Cell

The Cell Broadband Engine Architecture (CBEA), Cell for short, is a heterogeneous multi-core processor developed by Sony, Toshiba and IBM.

Goals
High performance in multimedia applications
Energy efficiency (GFLOPS/watt)
• No out-of-order execution → higher clock rate

Data
• Transistors: 234 million
• Chip area: 235 mm²
• Technology: 90 nm SOI

Hardware

Sources: [1], [2], [6], [7], [14], [24]

First generation

[ Source: http://www.research.ibm.com/cell/cell_chip.html ] © IBM

Basic idea: 1 main processor supported by 8 "application accelerators"

[Figure: chip layout; each accelerator is labeled with its parts SXU, LS and MFC]

First generation - PPE

Power Processor Element (PPE)

L2 cache: 512 kB, coherent
Optimized 64-bit POWER processor
• 2-way simultaneous multithreading
• SIMD extension (AltiVec)
• Compatible with POWER applications
• Runs the operating system

[ Source: http://www.research.ibm.com/cell/cell_chip.html ] © IBM

First generation - SPE (SXU)

Synergistic Processing Element (SPE)

Synergistic Execution Unit (SXU)
SIMD processor operating on 128-bit vectors
4-way SIMD unit
Register file: 128 entries of 128 bits each
Loop unrolling
Instructions similar to AltiVec
Main memory access only via DMA

© IBM

First generation - SPE (LS)

Local Store (LS)
256 kB for instructions and data
Single-ported SRAM

Access priorities and granularity
1. DMA: 128 B
2. Load + store: 16 B
3. Instruction fetch: 128 B

Mapped into the main memory domain
Local access: no address translation
No address protection

© IBM
Source: [24]

First generation - SPE (MFC)

Memory Flow Controller (MFC)
DMA controller including MMU
Transfer sizes: 1 B to 16 kB
Alignment: 128 bytes

2 command queues:
local: SPU queue (16 entries)
global: proxy queue (8 entries)

Communication with the MFC:
local: channel interface
global: MMIO registers

© IBM
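The size and alignment limits above can be turned into a small parameter check. The sketch below is plain C that runs on any host, not SPE code. The function names are hypothetical, and the detailed rule set (naturally aligned transfers of 1, 2, 4 or 8 bytes, otherwise multiples of 16 bytes on 16-byte boundaries) is an assumption that the slide does not state. The sketch also treats the 128-byte figure as the alignment for best performance rather than as a hard requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest single DMA transfer named on the slide: 16 kB. */
#define DMA_MAX_SIZE 16384u

/* Assumed rule set (not stated on the slide): sizes of 1, 2, 4 or 8 bytes
 * must be naturally aligned; larger transfers must be a multiple of
 * 16 bytes with both addresses 16-byte aligned. */
bool dma_params_ok(uint32_t ls, uint64_t ea, uint32_t size)
{
    if (size == 0 || size > DMA_MAX_SIZE)
        return false;
    if (size < 16) {
        bool pow2 = (size & (size - 1)) == 0;   /* 1, 2, 4 or 8 */
        return pow2 && (ls % size == 0) && (ea % size == 0);
    }
    return (size % 16 == 0) && (ls % 16 == 0) && (ea % 16 == 0);
}

/* True when both addresses sit on the 128-byte boundary from the slide. */
bool dma_is_128_aligned(uint32_t ls, uint64_t ea)
{
    return (ls % 128 == 0) && (ea % 128 == 0);
}
```

A caller would run `dma_params_ok` before queueing a transfer and could use `dma_is_128_aligned` to flag buffers that are legal but possibly slower.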

Sources

General: [22], [13], [17], [25], [26]
Scalar, SIMDization: [19], [20], [21]
MFC, channels: [24]

MMIO registers (1/3)

32-bit special-purpose registers are accessible in the main memory domain (memory-mapped I/O registers).

Communication registers (C mnemonics)

SPU_Out_Mbox: messages SPE → PPE
SPU_In_Mbox: messages PPE → SPE (queue)
SPU_Sig_Notify_1(/2): communication SPE → SPE or I/O
    Modes: "Logical-Or" (many-to-one), "Overwrite" (one-to-one)
MFC_MSSync: ensures completion of outstanding DMA commands
    Important for a safe process switch
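The difference between the two signal notification modes can be shown with a small software model. The sketch below is plain C for any host; the type and function names are hypothetical and do not belong to any Cell API. It also assumes that reading the register clears it, which the slide does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SIG_MODE_OVERWRITE, SIG_MODE_LOGICAL_OR } sig_mode;

/* Hypothetical software model of one signal notification register. */
typedef struct {
    sig_mode mode;
    uint32_t value;
} sig_notify_reg;

/* A sender writes to the register. */
void sig_write(sig_notify_reg *r, uint32_t v)
{
    assert(r != NULL);
    if (r->mode == SIG_MODE_LOGICAL_OR)
        r->value |= v;   /* many-to-one: bits from all senders accumulate */
    else
        r->value = v;    /* one-to-one: the last write wins */
}

/* The receiver reads the register; this model clears it on read
 * (an assumption, see the text above). */
uint32_t sig_read(sig_notify_reg *r)
{
    assert(r != NULL);
    uint32_t v = r->value;
    r->value = 0;
    return v;
}
```

In logical-OR mode several senders can each own one bit and the receiver sees all of them in a single read; in overwrite mode an earlier value is lost if a second write arrives before the read.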


MMIO registers (2/3)

DMA command registers
Addressing relative to BE_MMIO_Base
Address for SPE(n): BE_MMIO_Base + n × 0x80000 + 0x43004

Example: sequence for a DMA command initiated by the PPE (S = store, L = load)
1. S: local store address → MFC_LSA
2. S: main memory address → MFC_EAL (+ MFC_EAH)
3. S: size and tag for tracking → MFC_Size_Tag
4. S: command to execute → MFC_ClassID_CMD (starts the enqueue attempt)
5. L: was the enqueue successful? → MFC_CMDStatus

DMA information registers
Tag group completed: Prxy_TagStatus
PPE interrupt possible
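The address formula from this slide is simple enough to write down as code. The sketch below is plain C for any host; the function and macro names are hypothetical, and only the two constants 0x80000 and 0x43004 come from the slide. The limit of 8 SPEs is taken from the first-generation chip described earlier.

```c
#include <assert.h>
#include <stdint.h>

#define SPE_MMIO_STRIDE  0x80000u  /* per-SPE stride, from the slide */
#define SLIDE_REG_OFFSET 0x43004u  /* register offset, from the slide */
#define NUM_SPES         8u        /* first-generation Cell */

/* Address of the DMA command register of SPE n, relative to a given base. */
uint64_t spe_dma_reg_addr(uint64_t be_mmio_base, unsigned n)
{
    assert(n < NUM_SPES);
    return be_mmio_base + (uint64_t)n * SPE_MMIO_STRIDE + SLIDE_REG_OFFSET;
}
```

The base value itself is platform specific and has to be supplied by the caller; the sketch only reproduces the arithmetic.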


MMIO registers (3/3)

SPE control registers

SPU_NPC: LS address of the next instruction
    Only valid while the SPU is idle (SPU_Status[R] = 0)
SPU_RunCntl: starts and stops the SPU
    • Stop request: stops the SPE
    • Run request: starts the SPE at SPU_NPC
SPU_Status: SPE status register, important bit fields:

Bit   Meaning (if bit = 1)
R     SPU running
I     Invalid instruction → SPU halted
P     Stop-and-signal instruction → SPU halted

If P is set, SPU_Status also contains the stop code.

SPU channel interface

The register contents of the MMIO and channel interfaces are largely identical.
Difference: store + load / DMA commands → channel read/write instructions

MMIO register     Channel equivalent
MFC_Size_Tag      MFC_Size & MFC_TagID
MFC_CMDStatus     none; wrch blocks until there is space

SPU assembler
RDCH RT, CH       channel CH → register RT
WRCH CH, RA       register RA → channel CH
RCHCNT RT, CH     number of entries in CH → register RT

SPU channel C intrinsics (d, a = u32)
d = spu_readch( channel );
d = spu_readchcnt( channel );
spu_writech( channel, a );

MFC programming

SPU DMA C intrinsics
spu_mfcdma64 (ls, eahi, ealow, size, tag, cmd);
d = spu_mfcstat( type );

Channel writes behind the arguments:
ls → MFC_LSA
eahi → MFC_EAH
ealow → MFC_EAL
size → MFC_Size
tag → MFC_TagID
cmd → MFC_CMD

MFC assembler commands (cmd):
put       MFC_PUT_CMD       Copies LS → main memory
puts      MFC_PUTS_CMD      Starts the SPU after copying (proxy only)
putb      MFC_PUTB_CMD      Orders this command relative to previous ones
putl      MFC_PUTL_CMD      DMA command list from the LS (SPU only)
get*      MFC_GET*_CMD      Like put*, but copies main memory → LS
barrier   MFC_BARRIER_CMD   Creates a memory barrier
getllar   MFC_GETLLAR_CMD   Atomic get, corresponds to PowerPC lwarx
putllc    MFC_PUTLLC_CMD    Atomic put, corresponds to PowerPC stwcx
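Two of the arguments of spu_mfcdma64 are derived values: the 64-bit main memory address is passed as two 32-bit halves, and a tag is later referred to through a bit mask. The sketch below shows both conversions in plain C for any host. The helper names are hypothetical, and the limit of 32 tag groups is an assumption that does not appear on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* The two halves of an effective address, named after the intrinsic's arguments. */
typedef struct {
    uint32_t eahi;
    uint32_t ealow;
} ea_parts;

/* Split a 64-bit effective address into its upper and lower 32 bits. */
ea_parts split_ea(uint64_t ea)
{
    ea_parts p;
    p.eahi  = (uint32_t)(ea >> 32);
    p.ealow = (uint32_t)(ea & 0xFFFFFFFFu);
    return p;
}

/* Inverse of split_ea, useful for checking the split. */
uint64_t join_ea(ea_parts p)
{
    return ((uint64_t)p.eahi << 32) | p.ealow;
}

/* Bit mask that selects a single tag group (assumes tags 0..31). */
uint32_t tag_mask(unsigned tag)
{
    assert(tag < 32);
    return (uint32_t)1 << tag;
}
```

On a system with 32-bit addresses the upper half is simply zero, which is the case the shorter spu_mfcdma32 form in the programming example covers.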

SPE program assignment

[Figure: program assignment to an SPE, with its steps numbered 1 to 4]

Some alternatives
1. Instructions & data can be written directly via store
2. Interrupt on the PPE is possible
3. Can be saved by using getbs
4. Stop-and-signal instruction (assembler: stop u14), interrupt on the PPE

SPE instruction set (excerpt)

LS instructions
lqx rt, ra, rb        Load Quadword
stqx rt, ra, rb       Store Quadword

Integer instructions
a rt, ra, rb          Add Word

Logic instructions
xor rt, ra, rb        Logical XOR

Shift and rotate instructions
rotqby rt, ra, rb     Rotate Quadword left by Bytes

Compare and branch instructions
ceq rt, ra, rb        Compare Equal Word
brz rt, i16           Branch if zero

Floating point instructions
fma rt, ra, rb, rc    Floating Point Multiply and Add
fcgt rt, ra, rb       Floating Point Compare Greater Than

Control instructions

Channel instructions

SPE specifics

SPE error handling & system calls are only available on the PPE → forwarding
Run-to-completion usage recommended (preemptive use is possible)

Branch hint instruction: hbr s11, ra
Improves on the static branch prediction ('not taken')
• A timely and correct hint → no penalty
• At most one outstanding branch hint

hbrp: instruction fetch hint
Raises the instruction fetch priority

Predication on SPEs

Bitwise select
Avoids small branches through predication

Assembler:    selb rt, ra, rb, rc
C intrinsic:  d = spu_sel(a, b, pattern);

Example:

With branches:
      cmp  x ← cond
      bra  x, else
then: add  d ← d+1
      bra  done
else: add  d ← d+a
done:

With predication:
      cmp  x ← cond
      add  y ← d+a
      add  z ← d+1
      selb d, z, y, x
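The select operation itself is ordinary bit logic and can be tried out on any machine. The sketch below is plain C, not SPU code; the function names are hypothetical and it works on one 32-bit word instead of a 128-bit register. It follows the slide's example: both candidate results are computed and the compare mask picks one of them.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise select: each result bit comes from b where the mask bit is 1,
 * and from a where it is 0. */
uint32_t sel32(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Branch-free form of: if (cond) d = d + a; else d = d + 1; */
uint32_t update_predicated(uint32_t d, uint32_t a, int cond)
{
    uint32_t x = 0u - (uint32_t)(cond != 0); /* all ones or all zeros */
    uint32_t y = d + a;                      /* result of the "else" label */
    uint32_t z = d + 1;                      /* result of the "then" label */
    return sel32(z, y, x);                   /* selb d, z, y, x */
}

/* The same computation with an ordinary branch, for comparison. */
uint32_t update_branching(uint32_t d, uint32_t a, int cond)
{
    if (cond)
        return d + a;
    return d + 1;
}
```

The predicated version always executes both additions, so it pays off for short bodies like this one, where a mispredicted branch would cost more than the extra instruction.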

Scalar arithmetic on SPEs

SPEs have no dedicated scalar logic
LS accesses are 16 B aligned and registers are 16 B wide
Position in the register = offset in the LS

Ex. 1: a[0] = c[0] + b[0]   (works directly: all operands sit in the same slot)
Ex. 2: a[1] = c[2] + b[3]   (operands sit in different slots)
[i] = offset relative to the 16 B alignment

Solution for Ex. 2: rotate the registers, then add
Storing is problematic as well: a[1] is changed, but what happens to a[0], a[2], a[3]?
Solution: read, insert, write

Simpler solution for both: 16 B per scalar, placed at position 0
GCC: float example __attribute__ ((aligned (16)));

Source: [19]
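The rotate-then-add and read-insert-write steps for Ex. 2 can be modeled with a four-word array standing in for a 128-bit register. The sketch below is plain C for any host; the type and function names are hypothetical, and whole-word rotation stands in for a byte rotate with a byte count of four times the word count.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of one 128-bit register holding four 32-bit words. */
typedef struct {
    uint32_t w[4];
} quad;

/* Rotate the quadword left by n whole words. */
quad rot_words_left(quad q, unsigned n)
{
    quad r;
    for (unsigned i = 0; i < 4; i++)
        r.w[i] = q.w[(i + n) % 4];
    return r;
}

/* Lane-wise add of all four words, as a SIMD add does. */
quad add_words(quad x, quad y)
{
    quad r;
    for (unsigned i = 0; i < 4; i++)
        r.w[i] = x.w[i] + y.w[i];
    return r;
}

/* a[1] = c[2] + b[3] without disturbing a[0], a[2] and a[3]. */
quad scalar_add_slot1(quad a, quad b, quad c)
{
    quad c_rot = rot_words_left(c, 1);  /* c[2] moves to slot 1 */
    quad b_rot = rot_words_left(b, 2);  /* b[3] moves to slot 1 */
    quad sum = add_words(c_rot, b_rot); /* only slot 1 is meaningful */
    a.w[1] = sum.w[1];                  /* read, insert, write */
    return a;
}
```

The two rotations and the insert are the overhead that disappears when every scalar gets its own 16-byte slot at position 0.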

Linux and libraries

Sources
Linux: [3], [4], [18]
libspe: [11]
spu_intrinsics: [13]

Linux on Cell

Adaptations: special interrupt controller and SPEs
SPE thread management through the SPU file system (spufs)
Directories correspond to a logical SPE context
The OS maps contexts onto SPEs (virtualization)

[Figure: contents of an SPE context: Outbound Interrupt MBox, Outbound MBox, Inbound MBox, Local Store, Register File, RUN. Source: [3]]

Starting an SPU: system call with parameters on run
Pointer passed to:
struct spufs_run_arg {
    u32 npc;
    u32 status; };
The calling thread blocks

libspe.h (1/2)

libspe.h = SPE Runtime Management Library
A C library for the PPE
POSIX-thread-like use of SPEs at user level

Distinction: SPE group (spe_gid_t) and SPE thread (speid_t)
Gang scheduling is possible for groups

Communication functions:
u32 = spe_read_out_mbox( speid );
spe_write_in_mbox( speid, u32 );
spe_write_signal( speid, reg, u32 );

libspe.h (2/2)

Management functions:
program* = spe_open_image( *filename );
gid = spe_create_group( policy, priority, spe_events );
speid = spe_create_thread( gid, *program, *argp, *envp, mask, flags );
succ = spe_wait( speid, *status, options );

Functions for accessing SPE resources (e.g.)
void* = spe_get_ls( speid );

Changing selected settings (e.g.)
succ = spe_set_affinity( speid, mask );

Form of a thread function
int main( speid, argp, envp );

spu_intrinsics.h

A C library for the SPE
Contains all C intrinsics shown so far

Introduces vector data types
vector unsigned int    vec_uint4
vector float           vec_float4
qword

Enables direct vector operations (a, b, c, d are vectors)
d = spu_add(a, b);
d = spu_madd(a, b, c);
d = spu_cmpeq(a, b);

Programming example: PPE

/* --------------------------------------- context.h --------------------------------------- */
typedef struct {
    vector float pos;
    vector float vel;
    float delta_t;
} context;

/* ------------------------------------ ppeprogram.c ------------------------------------ */
#include <stddef.h>
#include <libspe.h>
#include "context.h"

extern spe_program_handle_t speprogram;

int main() {
    int status;
    context ctx;
    speid_t speid;
    // [ . . . ]
    /* Start the SPE program with a pointer to ctx, then wait for it to finish. */
    speid = spe_create_thread(0, &speprogram, &ctx, NULL, -1, 0);
    (void) spe_wait(speid, &status, 0);
    return 0;
}

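Programming example: SPE

/* ------------------------------------ speprogram.c ------------------------------------ */
#include <spu_intrinsics.h>
#include "context.h"
/* The MFC_*_CMD constants come from the MFC header of the SDK in use. */

int main(unsigned long long spu_id, unsigned long long parm) {
    unsigned int tag_id = 0;
    vector float delta_t_vec;
    context ctx;

    /* Fetch the context from main memory and wait until the transfer is done. */
    spu_mfcdma32(&ctx, (unsigned int) parm, sizeof(context), tag_id, MFC_GET_CMD);
    spu_writech(MFC_WrTagMask, 1 << tag_id);
    (void) spu_mfcstat(2);

    /* pos = vel * delta_t + pos, computed for all four lanes at once. */
    delta_t_vec = spu_splats(ctx.delta_t);
    ctx.pos = spu_madd(ctx.vel, delta_t_vec, ctx.pos);

    /* Write the updated context back and wait again. */
    spu_mfcdma32(&ctx, (unsigned int) parm, sizeof(context), tag_id, MFC_PUT_CMD);
    (void) spu_mfcstat(2);
    return 0;
}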
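For checking the SPE result on an ordinary machine, the same update can be written as a scalar loop. The sketch below is plain C without any Cell headers. The struct `context_ref` and the function `step_reference` are hypothetical stand-ins, with a four-element float array replacing each `vector float`.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* a vector float holds four single-precision values */

/* Hypothetical scalar stand-in for the context struct of the example. */
typedef struct {
    float pos[LANES];
    float vel[LANES];
    float delta_t;
} context_ref;

/* pos = vel * delta_t + pos for every lane, one lane at a time. */
void step_reference(context_ref *ctx)
{
    assert(ctx != NULL);
    for (int i = 0; i < LANES; i++)
        ctx->pos[i] = ctx->vel[i] * ctx->delta_t + ctx->pos[i];
}
```

Comparing the output of this loop with the context written back by the SPE is a quick way to validate the DMA round trip as well as the arithmetic.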


Conclusions

MMIO registers and channels are an important part of exploiting the architecture
Programmable in a high-level language (C/C++)

Higher programming effort
• Simplified by libraries
• Extra effort = performance

SP FP matrix multiplication
P4 (SSE3, 3.2 GHz): 25.6 GFLOPS
3.2 GHz Cell:
• XL C compiler

High peak performance
• SP FP: 204 GFLOPS
• DP FP: 20 GFLOPS

Source: [26]
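The single-precision figures are consistent with a simple peak estimate: cores × SIMD lanes × floating-point operations per lane and cycle × clock rate. The sketch below is plain C with a hypothetical function name. It assumes a 3.2 GHz clock for the Cell peak figure and that each unit completes one 4-wide operation per cycle that counts as 2 flops per lane; neither assumption is stated on the slide.

```c
#include <assert.h>

/* Peak GFLOPS = cores * lanes * flops per lane and cycle * clock in GHz. */
double peak_gflops(unsigned cores, unsigned lanes,
                   unsigned flops_per_lane, double ghz)
{
    assert(ghz > 0.0);
    return (double)cores * lanes * flops_per_lane * ghz;
}
```

Under these assumptions `peak_gflops(8, 4, 2, 3.2)` gives 204.8 for the eight SPEs and `peak_gflops(1, 4, 2, 3.2)` gives 25.6 for a single 4-wide core, which matches the slide's numbers up to rounding.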

Sources

[1] "A Streaming Processing Unit for a CELL Processor"
    Authors: B. Flachs, S. Asano, S.H. Dhong, P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Leenstra, J. Liberty, B. Michael, H. Oh, S.M. Mueller, O. Takahashi, A. Hatakeyama, Y. Watanabe, N. Yano
    http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/E815CC047A60914687256FC000734156

[2] "The Design and Implementation of a First-Generation CELL Processor"
    Authors: D. Pham, S. Asano, M. Bollinger, M.N. Day, H.P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, K. Yazawa
    http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/7FB9EC5D5BBF51ED87256FC000742186

[3] "Spufs: The Cell Synergistic Processing Unit as a virtual file system"
    Author: Arnd Bergmann
    http://www-128.ibm.com/developerworks/power/library/pa-cell/

[4] "Meet the Experts: Arnd Bergmann on Cell"
    Authors: Arnd Bergmann, developerWorks
    http://www-128.ibm.com/developerworks/power/library/pa-expert4

[5] "Just like being there: Papers from the Fall Processor Forum 2005: Unleashing the power of the Cell Broadband Engine"
    Author: developerWorks
    http://www-128.ibm.com/developerworks/power/library/pa-fpfunleashing/

[6] "Cell Architecture Explained Version 2"
    Author: Nicholas Blachford
    http://www.blachford.info/computer/Cell/Cell0_v2.html

[7] "Introduction to the Cell multiprocessor"
    Authors: J.A. Kahle, M.N. Day, H.P. Hofstee, C.R. Johns, T.R. Maeurer, D. Shippy
    http://www.research.ibm.com/journal/rd/494/kahle.html

[8] "Hardware and Software Architectures for the Cell Broadband Engine processor"
    Authors: Michael Day, Peter Hofstee
    http://www.casesconference.org/cases2005/pdf/Cell-tutorial.pdf

[9] "A remote Procedure Call Implementation for the Cell Broadband Architecture"
    Part of: "Cell Broadband Engine (Cell BE) Software Sample and Library Source Code"
    http://www.alphaworks.ibm.com/tech/cellswopen&S_TACT=105AGX16&S_CMP=DWPA

[10] "CELL: A New Platform for Digital Entertainment"
    Authors: Dominic Mallinson, Mark DeLoura
    http://www.research.scea.com/research/html/CellGDC05/index.html

[11] "SPE Runtime Management Library"
    http://www.bsc.es/projects/deepcomputing/linuxoncell/development/release2.0/libspe/libspe_v1.0.pdf

[12] "Optimizing Compiler for a CELL Processor"
    Authors: Alexandre E. Eichenberger, Kathryn O'Brien, Kevin O'Brien, Peng Wu, Tong Chen, Peter H. Oden, Daniel A. Prenner, Janice C. Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, Michael Gschwind
    http://cag.csail.mit.edu/crg/papers/eichenberger05cell.pdf

[13] "SPU C/C++ Language Extensions"
    http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E

[14] "Cell Moves into the Limelight"
    Author: Kevin Krewell
    http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D

[15] "Porting the GNU Tool Chain to the Cell Architecture"
    Author: Ulrich Weigand
    http://www.gccsummit.org/2005/2005-GCC-Summit-Proceedings.pdf

[16] "Unleashing the Power: A programming example of large FFTs on Cell"
    Authors: Alex Chow, Gordon Fossum, Daniel A. Brokenshire
    http://www.power.org/news/events/barcelona/

[17] "SPU Application Binary Interface Specification"
    http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/02E544E65760B0BF87257060006F8F20

[18] "Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification"
    http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/44DA30A1555CBB73872570B20057D5C8/

[19] "Efficient SIMD Code Generation for Runtime Alignment & Length Conversion"
    Authors: Peng Wu, Alexandre Eichenberger, Amy Wang
    http://www.research.ibm.com/cellcompiler/slides/cgo05.pdf

[20] "An Integrated Simdization Framework using virtual vectors"
    Authors: Peng Wu, Alexandre Eichenberger, Amy Wang, Peng Zhao
    http://www.research.ibm.com/cellcompiler/slides/ics05.pdf

[21] "Vectorization for SIMD Architectures with Alignment Constraints"
    Authors: Alexandre Eichenberger, Peng Wu, Kevin O'Brien
    http://www.research.ibm.com/cellcompiler/slides/pldi04.pdf

[22] "Cell Broadband Engine Programming Tutorial"
    Cell Broadband Engine Architecture Joint Software Reference Environment Series

[23] "Meet the experts: Alex Chow on Cell Broadband Engine programming models"
    Author: Alex Chow
    http://www-128.ibm.com/developerworks/power/library/pa-expert8/

[24] "Cell Broadband Engine Architecture"
    http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/1AEEE1270EA2776387257060006E61BA

[25] "Synergistic Processor Unit Instruction Set Architecture"
    http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44/

[26] "Cell Broadband Engine Architecture and its first implementation – A Performance View"
    http://www-128.ibm.com/developerworks/power/library/pa-cellperf/

Questions
