Design of a Hardware/Software RTOS for FPGAs with Processors
Design of a Hardware/Software RTOS for FPGAs with Processors
Design of a Hardware/Software RTOS for FPGAs with Processors
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<strong>Design</strong> <strong>of</strong> a <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong> <strong>for</strong><br />
<strong>FPGAs</strong> <strong>with</strong> <strong>Processors</strong><br />
The δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong> Generation Framework<br />
Vincent J. Mooney III<br />
http://codesign.ece.gatech.edu<br />
http:// codesign.ece.gatech.edu<br />
http://www.crest.gatech.edu<br />
http:// www.crest.gatech.edu<br />
Associate Pr<strong>of</strong>essor, School <strong>of</strong> Electrical and Computer Engineering<br />
Engineering<br />
Adjunct Associate Pr<strong>of</strong>essor, College <strong>of</strong> Computing<br />
Associate Director, Center <strong>for</strong> Research on Embedded Systems and Technology<br />
Georgia Institute <strong>of</strong> Technology<br />
Atlanta, Georgia, USA<br />
FPGA World 2004 HW/SW <strong>RTOS</strong> Project <strong>of</strong> the HW/SW Codesign Group at GT<br />
©Vincent J. Mooney III, 2004
Outline<br />
• Trends<br />
• Vision: <strong>Hardware</strong>/S<strong>of</strong>tware Real-Time<br />
Operating System<br />
• Custom <strong>RTOS</strong> <strong>Hardware</strong> IP Components<br />
• System-on-a-Chip Lock Cache (SoCLC)<br />
• SoC Deadlock Detection Unit (SoCDDU) and SoC<br />
Deadlock Avoidance Unit (SoCDAU)<br />
• The δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong> Generation<br />
Framework<br />
• Comparison <strong>with</strong> the RTU <strong>Hardware</strong> <strong>RTOS</strong><br />
• Conclusion<br />
FPGA World 2004 2<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Digital Silicon CMOS VLSI Trends<br />
Yesterday (1980s) Today Tomorrow<br />
memory<br />
processors<br />
gate arrays<br />
ASICs<br />
memory<br />
processors<br />
SoC<br />
reconfigurable<br />
gate arrays<br />
ASICs<br />
memory<br />
processors<br />
Plat<strong>for</strong>m SoC<br />
Custom SoC ≤ 9<br />
products<br />
reconfigurable<br />
gate arrays<br />
ASICs<br />
FPGA World 2004 3<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
≥ 10<br />
products
Why Should SoC Customizers/Programmers<br />
Customizers/Programmers<br />
Worry About <strong>RTOS</strong>es?<br />
<strong>RTOS</strong>es<br />
• Trend: logic design → RTL design → processors +<br />
custom RTL design → embedded s<strong>of</strong>tware design!<br />
• The <strong>RTOS</strong> is the state-<strong>of</strong>-the-art embedded<br />
s<strong>of</strong>tware manager<br />
• Prediction: the best embedded s<strong>of</strong>tware designers<br />
will work closely <strong>with</strong> a custom <strong>RTOS</strong> design =><br />
the best SoC hardware/s<strong>of</strong>tware designs will<br />
include trade<strong>of</strong>fs in the <strong>RTOS</strong><br />
FPGA World 2004 4<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Vision: Dynamic S<strong>of</strong>tware/<br />
<strong>Hardware</strong> <strong>RTOS</strong> <strong>Design</strong><br />
Key to System-on-a-Chip architecture<br />
optimization and customization<br />
FPGA World 2004 5<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
SoC Example: Broadcom BCM1400<br />
• Four Processor<br />
Cores<br />
–MIPS64<br />
– 1GHz<br />
– 8-way 1MB<br />
Shared L2<br />
• On-chip<br />
ZBbus<br />
– maintains<br />
coherency<br />
– proprietary<br />
• Off-chip HT/<br />
SPI-4 19Gb/s<br />
MIPS64<br />
Core0<br />
Memory<br />
Bridge<br />
MIPS64<br />
Core1<br />
HT/<br />
SPI-4<br />
MIPS64<br />
Core2<br />
MIPS64<br />
Core3<br />
ZBbus: 128Gb/s @ 1GHz<br />
L2<br />
Shared<br />
Mem<br />
Arbiter<br />
[Levy02] M. Levy, “Chip Combines Four 1GHz Cores,”<br />
Microprocessor Report, pp. 12-14, October 2002.<br />
FPGA World 2004 6<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
X<br />
HT/<br />
SPI-4<br />
HT/<br />
SPI-4<br />
Port0 Port1 Port2<br />
Packet<br />
DMA<br />
SoC<br />
Interfaces
Motivational Example: Home 2005<br />
Programmable<br />
PDA<br />
Projectors<br />
SoC Device Central Storage Unit (e.g., PC)<br />
wireless link wired link<br />
Projected Light Displays<br />
FPGA World 2004 7<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
• Microprocessor design<br />
– Compiler<br />
– Computer architecture<br />
Analogy<br />
• SoC design<br />
– Dynamic hw/sw <strong>RTOS</strong><br />
– SoC architecture<br />
FPGA World 2004 8<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Building Blocks<br />
• SoC Programming Model<br />
– multi-threading, shared mem., message passing,<br />
control-data flow graph<br />
• SoC Programming Environment<br />
– δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong><br />
• Microprocessor Programming Model<br />
– C/C++/Java/other serial language<br />
• Microprocessor Programming Environment<br />
– gcc, various<br />
FPGA World 2004 9<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Approach<br />
• δ Hw/Sw <strong>RTOS</strong> made up <strong>of</strong> library components<br />
• Library component = predefined C code, assembly<br />
code or HDL code<br />
• Similar to existing <strong>RTOS</strong>’s, except <strong>for</strong> the HDL code<br />
– ex.: SoC Lock Cache in hardware<br />
• <strong>RTOS</strong> HDL code can be automatically generated by<br />
a custom “IP Generator”<br />
– ex.: PARLAK SoC Lock Cache generator<br />
FPGA World 2004 10<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
The δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong><br />
Generation Framework<br />
SW <strong>RTOS</strong><br />
w/ sem<br />
FPGA World 2004 11<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
<strong>RTOS</strong>1<br />
Application<br />
User<br />
Input<br />
S<strong>of</strong>tware<br />
<strong>RTOS</strong> library<br />
SW <strong>RTOS</strong><br />
+<br />
SoCLC<br />
SW <strong>RTOS</strong><br />
w/ deadlock<br />
avoid.<br />
Executable HW file<br />
<strong>for</strong> each<br />
<strong>Hardware</strong><br />
<strong>RTOS</strong> library<br />
GUI tool<br />
SW <strong>RTOS</strong><br />
+<br />
SoCDDU<br />
Compile Stage <strong>for</strong> each system<br />
Base<br />
Architecture<br />
library<br />
Executable SW file<br />
<strong>for</strong> each<br />
SW <strong>RTOS</strong><br />
+ SoCLC<br />
+<br />
SoCDDU<br />
<strong>RTOS</strong>2 <strong>RTOS</strong>3 <strong>RTOS</strong>4 <strong>RTOS</strong>5<br />
Simulation in Seamless<br />
VCS XRAY<br />
CVE<br />
RTU<br />
<strong>RTOS</strong>6
Outline<br />
• Trends<br />
• Vision: <strong>Hardware</strong>/S<strong>of</strong>tware Real-Time<br />
Operating System<br />
• Custom <strong>RTOS</strong> <strong>Hardware</strong> IP Components<br />
• System-on-a-Chip Lock Cache (SoCLC)<br />
• SoC Deadlock Detection Unit (SoCDDU) and SoC<br />
Deadlock Avoidance Unit (SoCDAU)<br />
• The δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong> Generation<br />
Framework<br />
• Comparison <strong>with</strong> the RTU <strong>Hardware</strong> <strong>RTOS</strong><br />
• Conclusion<br />
FPGA World 2004 12<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
SoC Lock Cache<br />
• A hardware mechanism that resolves the critical section<br />
(CS) interactions among PEs<br />
• Lock variables are moved into a separate “lock cache”<br />
outside <strong>of</strong> the memory<br />
• Improves the per<strong>for</strong>mance criteria in terms <strong>of</strong> lock latency,<br />
lock delay and bandwidth consumption<br />
FPGA World 2004 13<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
S<strong>of</strong>tware/<strong>Hardware</strong> Architecture<br />
Application S<strong>of</strong>tware<br />
(Tasks)<br />
Atalanta-<strong>RTOS</strong><br />
Memory<br />
S<strong>of</strong>tware<br />
<strong>Hardware</strong><br />
Extension<br />
MPC750A MPC750B MPC750C<br />
MPC750D<br />
SoCLC<br />
Arbitration<br />
Logic<br />
•Multiple application tasks<br />
•Atalanta‐<strong>RTOS</strong><br />
•Four MPC750s<br />
• SoCLC provides lock syn‐<br />
chronization among PEs<br />
SoC Lock Cache<br />
FPGA World 2004 14<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Experiment<br />
Example: Database transaction application [1]<br />
short_Req3<br />
Client<br />
transaction1<br />
O 2<br />
O 3<br />
transaction3<br />
O 4<br />
long_Req1<br />
Access <strong>of</strong><br />
Object O 2<br />
by transaction1<br />
long_Req3<br />
Access <strong>of</strong><br />
Object O 4<br />
by transaction3<br />
Server<br />
transaction2<br />
[1] M. A. Olson, “Selecting and implementing an embedded database system,” IEEE Computer,<br />
pp.27-34, September 2000.<br />
FPGA World 2004 15<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
O 4<br />
O 2<br />
transaction4<br />
short_Req4
Experimental Result<br />
Comparison <strong>with</strong> database application example [2]<br />
• <strong>RTOS</strong>1 <strong>with</strong> semaphores and spin-locks<br />
• <strong>RTOS</strong>2 <strong>with</strong> SoCLC, no SW semaphores or spin-locks<br />
(clock cycles) * Without SoCLC With SoCLC Speedup<br />
Lock Latency 1200 908 1.32x<br />
Lock Delay 47264 23590 2.00x<br />
Execution Time 36.9M 29M 1.27x<br />
* Semaphores <strong>for</strong> long critical sections (CSes) and<br />
spin-locks <strong>for</strong> short CSes are used instead <strong>of</strong> SoCLC.<br />
[2] B. S. Akgul, J. Lee and V. Mooney, “System-on-a-chip processor synchronization hardware<br />
unit <strong>with</strong> task preemption support,” CASES ‘01, pp.149-157, November 2001.<br />
FPGA World 2004 16<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Outline<br />
• Trends<br />
• Vision: <strong>Hardware</strong>/S<strong>of</strong>tware Real-Time<br />
Operating System<br />
• Custom <strong>RTOS</strong> <strong>Hardware</strong> IP Components<br />
• System-on-a-Chip Lock Cache (SoCLC)<br />
• SoC Deadlock Detection Unit (SoCDDU) and SoC<br />
Deadlock Avoidance Unit (SoCDAU)<br />
• The δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong> Generation<br />
Framework<br />
• Comparison <strong>with</strong> the RTU <strong>Hardware</strong> <strong>RTOS</strong><br />
• Conclusion<br />
FPGA World 2004 17<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
SoC Deadlock Detection Unit<br />
(SoCDDU SoCDDU)<br />
• Matrix based parallel processing approach<br />
• Removable edge reduction technique to reveal cycles (i.e.,<br />
deadlock)<br />
– Removable edge*: not related to deadlock<br />
• Simple bit-wise Boolean operations<br />
– Implementation easier<br />
– Operation faster, O(min(m,n))<br />
– 2~3 orders <strong>of</strong> magnitude faster than s<strong>of</strong>tware<br />
• Novelty from previous algorithms<br />
– Does NOT trace exact cycles<br />
– Does NOT require linked lists<br />
P. Shiu, Y. Tan and V. Mooney, "A Novel Parallel Deadlock Detection<br />
Algorithm and Architecture," 9th International Workshop on<br />
<strong>Hardware</strong>/S<strong>of</strong>tware Codesign (CODES'01), pp. 30-36, April 2001.<br />
P1 P2<br />
Q1<br />
Q2 Q3<br />
FPGA World 2004 18<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
*<br />
P3
Architecture <strong>of</strong> the SoC Deadlock<br />
DDU<br />
(matrix)<br />
Start<br />
Reset<br />
Done<br />
Deadlock<br />
Avoidance Unit (SoCDAU ( SoCDAU)<br />
Cell access<br />
DAA<br />
Logic<br />
w/ FSM<br />
Address<br />
decoder<br />
Command<br />
registers<br />
Status<br />
registers<br />
DDU: Deadlock Detection Unit<br />
FSM: finite state machine<br />
Address<br />
control<br />
Data<br />
FPGA World 2004 19<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
SoCDDU/SoCDAU Experiment<br />
• Four resources<br />
– Q1: Video Interface (VI)<br />
– Q2: MPEG<br />
– Q3: DSP<br />
– Q4: Wireless Interface (WI)<br />
- System and Application<br />
• Each process requires two resources except P4<br />
– P1: processing a video stream (needs Q1 + Q2)<br />
– P2: separating/enhancing frames (needs Q2 + Q3)<br />
– P3: extracting special images (needs Q3 + Q1)<br />
– P4: transferring images (needs Q4)<br />
• One active process <strong>for</strong> each processing element (PE)<br />
HW<br />
SW<br />
<strong>RTOS</strong><br />
FPGA World 2004 20<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Experimental Results <strong>of</strong> SoCDAU<br />
• Deadlock avoidance simulation: Two kinds<br />
– SoCDAU: Synopsys VCS runs compiled Verilog code<br />
– DAA in s<strong>of</strong>tware: MPC755 runs compiled C code in Seamless CVE<br />
– Example application invokes deadlock avoidance 12 times<br />
P1 P2<br />
t1<br />
Q1<br />
t3<br />
P3<br />
t2<br />
Q2 Q3<br />
P1 P2<br />
Q1<br />
Q2 Q3<br />
FPGA World 2004 21<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
t4<br />
t5<br />
P3
Experimental Results <strong>of</strong> SoCDAU<br />
(cont’d) (cont d)<br />
• Deadlock avoidance simulation result<br />
• Per<strong>for</strong>mance improvement<br />
– 99% algorithm execution time reduction<br />
– 37% reduction in an application execution time<br />
Method <strong>of</strong> Deadlock<br />
Avoidance<br />
SoCDAU <strong>Hardware</strong><br />
DAA in S<strong>of</strong>tware<br />
Algorithm Exe.<br />
Time (cycles)<br />
7 (average)<br />
2188 (average)<br />
Normalized<br />
Exe. Time<br />
312X<br />
FPGA World 2004 22<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
1X<br />
Application Exe.<br />
Time (cycles)<br />
34791<br />
47704
Synthesis Results <strong>of</strong> SoCDAU<br />
• SoCDAU <strong>for</strong> 5 processes and 5 resources<br />
• Qualcore Logic library <strong>with</strong> TSMC .25µm technology<br />
• 0.01% <strong>of</strong> the total SoC area <strong>with</strong> four PEs and memory<br />
Module Name<br />
SoCDDU 5x5<br />
(inside SoCDAU)<br />
DAA Logic<br />
Total Area<br />
in terms <strong>of</strong> twoinput<br />
NAND<br />
gates<br />
364<br />
1472<br />
Lines <strong>of</strong> Verilog HDL<br />
Code<br />
203<br />
344<br />
TSMC: Taiwan Semiconductor Manufacturing Company<br />
PE: Processing Element<br />
FPGA World 2004 23<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Total area in<br />
Semi-custom VLSI<br />
Xilinx<br />
XC4000E 4003EPC84<br />
<strong>Hardware</strong> Area<br />
SoCLC<br />
(For 64 short CS locks +<br />
64 long CS locks)<br />
7435 gates using TSMC 0.25µm<br />
standard cell library<br />
SoCDDU<br />
(For 5 <strong>Processors</strong> x<br />
5 Resources)<br />
364 gates using AMI<br />
0.3µm standard cell library<br />
# Seq. logic 532 10<br />
# Other gates 9036 559<br />
FPGA World 2004 24<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Outline<br />
• Trends<br />
• Vision: <strong>Hardware</strong>/S<strong>of</strong>tware Real-Time<br />
Operating System<br />
• Custom <strong>RTOS</strong> <strong>Hardware</strong> IP Components<br />
• System-on-a-Chip Lock Cache (SoCLC)<br />
• SoC Deadlock Detection Unit (SoCDDU) and SoC<br />
Deadlock Avoidance Unit (SoCDAU)<br />
• The δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong> Generation<br />
Framework<br />
• Comparison <strong>with</strong> the RTU <strong>Hardware</strong> <strong>RTOS</strong><br />
• Conclusion<br />
FPGA World 2004 25<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
User<br />
Input<br />
δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong><br />
<strong>Hardware</strong><br />
<strong>RTOS</strong><br />
Library<br />
Generation Framework<br />
and current simulation plat<strong>for</strong>m<br />
Base<br />
Architecture<br />
Library<br />
GUI Tool<br />
S<strong>of</strong>tware<br />
<strong>RTOS</strong><br />
Library<br />
Top.v<br />
HW <strong>RTOS</strong><br />
User.h<br />
SW <strong>RTOS</strong><br />
Makefile<br />
Application<br />
HW<br />
Compile<br />
SW<br />
Compile<br />
Executable<br />
HW<br />
Compiled<br />
<strong>Hardware</strong><br />
Description<br />
Executable<br />
SW<br />
Modelsim<br />
or VCS<br />
Simulation<br />
in<br />
Seamless<br />
CVE<br />
XRAY<br />
Result<br />
and<br />
Feedback<br />
FPGA World 2004 26<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
δ <strong>Hardware</strong>/S<strong>of</strong>tware <strong>RTOS</strong><br />
Generation Framework Goals<br />
To help the user examine which configuration is<br />
most suitable <strong>for</strong> the user’s specific applications<br />
To help the user explore the <strong>RTOS</strong> design space<br />
be<strong>for</strong>e chip fabrication as well as after chip<br />
fabrication (in which case reconfigurable logic must<br />
be available on the chip)<br />
To help the user examine different System-on-a-<br />
Chip (SoC) architectures subject to a custom <strong>RTOS</strong><br />
FPGA World 2004 27<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Motivation (1/3)<br />
HW/SW <strong>RTOS</strong> partitioning approach<br />
Previous innovations in HW/SW <strong>RTOS</strong> components<br />
• System-on-a-Chip Lock Cache (SoCLC)<br />
• System-on-a-Chip Dynamic Memory Management Unit<br />
(SoCDMMU)<br />
• System-on-a-Chip Deadlock Detection Unit (SoCDDU) and<br />
Deadlock Avoidance Unit (SoCDAU)<br />
• RTU <strong>Hardware</strong> <strong>RTOS</strong><br />
FPGA World 2004 28<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Motivation (2/3)<br />
SoCDMMU: System-on-a-Chip Dynamic Memory<br />
Management Unit<br />
• Provides fast, deterministic and yet dynamic<br />
memory management <strong>of</strong> a global on-chip memory<br />
• Achieves flexible, efficient memory utilization<br />
• Provides APIs <strong>for</strong> applications<br />
FPGA World 2004 29<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Motivation (3/3)<br />
Constraints about using the previous HW/SW <strong>RTOS</strong><br />
innovations<br />
• Perhaps not enough chip space <strong>for</strong> all three <strong>of</strong> them<br />
• All <strong>of</strong> them may not be necessary<br />
⇒ The δ framework<br />
• Enables automatic generation <strong>of</strong> different mixes <strong>of</strong> the<br />
previous innovations <strong>for</strong> different versions <strong>of</strong> a HW/SW<br />
<strong>RTOS</strong><br />
• Enables selection <strong>of</strong> the RTU hardware <strong>RTOS</strong><br />
• Can be generalized to instantiate additional HW or SW<br />
<strong>RTOS</strong> components<br />
FPGA World 2004 30<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Our <strong>RTOS</strong> in Post-Fabricaton<br />
Post Fabricaton Scenario<br />
Application(s) run on the SoC<br />
using standard <strong>RTOS</strong> APIs<br />
Atalanta s<strong>of</strong>tware <strong>RTOS</strong><br />
• A multiprocessor SoC <strong>RTOS</strong><br />
The <strong>RTOS</strong> and device drivers are<br />
loaded into the L2 cache memory<br />
• All Processing Elements (PEs)<br />
share the kernel code and data<br />
structures<br />
<strong>Hardware</strong> <strong>RTOS</strong> components are<br />
downloaded into the reconfigurable<br />
logic<br />
HW/<br />
SW<br />
<strong>RTOS</strong><br />
FPGA World 2004 31<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Experimental Setup<br />
Six custom <strong>RTOS</strong>es<br />
• With semaphores and spin-locks, no<br />
HW components in the <strong>RTOS</strong><br />
• With SoCLC, no SW IPCs<br />
• With deadlock avoidance s<strong>of</strong>tware, no<br />
HW <strong>RTOS</strong> components<br />
• With SoCDAU<br />
• With SoCLC and SoCDAU<br />
• With RTU<br />
Each <strong>with</strong> the Base architecture<br />
Each <strong>with</strong> application(s)<br />
Each executable in Seamless CVE<br />
4 MPC750 processors<br />
Reconfigurable logic<br />
Single bus<br />
SW <strong>RTOS</strong><br />
w/ sem<br />
FPGA World 2004 32<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
<strong>RTOS</strong>1<br />
Application<br />
User<br />
Input<br />
S<strong>of</strong>tware<br />
<strong>RTOS</strong> library<br />
SW <strong>RTOS</strong><br />
+<br />
SoCLC<br />
SW <strong>RTOS</strong><br />
w/ deadlock<br />
avoid.<br />
Executable HW file<br />
<strong>for</strong> each<br />
<strong>Hardware</strong><br />
<strong>RTOS</strong> library<br />
GUI tool<br />
SW <strong>RTOS</strong><br />
+<br />
SoCDAU<br />
Compile Stage <strong>for</strong> each system<br />
Base<br />
Architecture<br />
library<br />
Executable SW file<br />
<strong>for</strong> each<br />
SW <strong>RTOS</strong><br />
+ SoCLC<br />
+<br />
SoCDAU<br />
<strong>RTOS</strong>2 <strong>RTOS</strong>3 <strong>RTOS</strong>4 <strong>RTOS</strong>5<br />
Simulation in Seamless<br />
VCS XRAY<br />
CVE<br />
RTU<br />
<strong>RTOS</strong>6
An <strong>RTOS</strong> in hardware<br />
RTU <strong>Hardware</strong> <strong>RTOS</strong><br />
LOCAL<br />
BUS<br />
FPGA World 2004 33<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
TDBI<br />
GBI<br />
• Real-Time Unit (RTU)<br />
– scheduling<br />
– IPC<br />
– dynamic task creation<br />
– timers<br />
• Custom hw => upper bound on # tasks<br />
• Reconfigurable hw => can alter max. # tasks, max. # priorities<br />
• Pr<strong>of</strong>. Lennart Lindh, Mälardalens U., Västerås, Sweden<br />
• RealFast, www.realfast.se<br />
Accelerator<br />
Interface<br />
MsgQLib<br />
RTC<br />
Scheduler<br />
IRQ
Methodology<br />
An SoC architecture <strong>with</strong> the RTU <strong>Hardware</strong> <strong>RTOS</strong><br />
MPC755-1<br />
L1<br />
RTU in<br />
Reconfig.<br />
Logic<br />
MPC755-2<br />
L1<br />
Arbiter,<br />
Intr.<br />
Controller,<br />
Clock<br />
MPC755-3<br />
Memory<br />
Controller<br />
and<br />
Memory<br />
FPGA World 2004 34<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
L1
Methodology<br />
An SoC architecture <strong>with</strong> a hardware/s<strong>of</strong>tware <strong>RTOS</strong><br />
MPC755-1<br />
SoCLC in<br />
Reconfig.<br />
Logic<br />
L1<br />
MPC755-2<br />
L1<br />
Bus Arbiter,<br />
Intr.<br />
Controller,<br />
Clock<br />
MPC755-3<br />
Memory<br />
(Atalanta<br />
<strong>RTOS</strong>)<br />
FPGA World 2004 35<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
L1
δ Framework<br />
–GUI<br />
Specilized SW<br />
<strong>RTOS</strong> component<br />
Methodology<br />
HW <strong>RTOS</strong><br />
component<br />
Number <strong>of</strong><br />
CPUs in system<br />
IPC module<br />
linking method<br />
FPGA World 2004 36<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Implementation<br />
Verilog top file generation example<br />
• Start <strong>with</strong> RTU description IP Library<br />
(i)<br />
clock clock_gen (SYSCLK);<br />
• Generate instantiation code<br />
Generate<br />
code<br />
cpu_mpc750 cpu1 (…);<br />
\rtu.rtu(struct) rtu_comp (…);<br />
multiple instantiations <strong>of</strong><br />
same unit if needed (e.g.,<br />
Desc RTU<br />
~~~<br />
enddesc<br />
arbiter arb (br_bar, bg_bar…);<br />
PEs) (ii) Add wires<br />
and initial states<br />
• Add wires and initial<br />
statements<br />
PEs 1,2,3,…<br />
Memory<br />
1,2,…<br />
SoCLC<br />
Clock<br />
Arbiter<br />
(iii)<br />
After<br />
Instantiation<br />
wire ADDR;<br />
wire DATA;<br />
wire BR_BAR;<br />
wire BG_BAR;<br />
wire SYSCLK;<br />
…<br />
initial begin … end;<br />
FPGA World 2004 37<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Comparison<br />
Experimental Results (1/3)<br />
• A system <strong>with</strong> RTU hardware <strong>RTOS</strong><br />
• A system <strong>with</strong> SoCLC hardware and s<strong>of</strong>tware <strong>RTOS</strong><br />
• A system <strong>with</strong> pure s<strong>of</strong>tware <strong>RTOS</strong><br />
6 tasks<br />
30 tasks<br />
Total Execution Time<br />
(in cycles)<br />
Reduction<br />
(in cycles)<br />
Reduction<br />
Pure SW *<br />
100398<br />
0%<br />
379440<br />
0%<br />
With SoCLC<br />
71365<br />
29%<br />
317916<br />
16%<br />
* A semaphore is used in pure s<strong>of</strong>tware and a hardware<br />
mechanism is used in SoCLC and RTU.<br />
With RTU<br />
67038<br />
33%<br />
279480<br />
26%<br />
FPGA World 2004 38<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Experimental Results (2/3)<br />
The number <strong>of</strong> interactions<br />
Times<br />
Number <strong>of</strong> semaphore<br />
interactions<br />
Number <strong>of</strong> context switches<br />
Number <strong>of</strong> short locks<br />
6 tasks<br />
FPGA World 2004 39<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004<br />
12<br />
3<br />
10<br />
30 tasks<br />
60<br />
30<br />
58
Experimental Results (4/4)<br />
The average number <strong>of</strong> cycles spent on communication, context switch<br />
and computation (6 task case)<br />
cycles<br />
communication<br />
context switch<br />
computation<br />
Pure SW<br />
18944<br />
3218<br />
8523<br />
With SoCLC<br />
3730<br />
3231<br />
8577<br />
With RTU<br />
2075<br />
2835<br />
8421<br />
FPGA World 2004 40<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Total area<br />
TSMC 0.25μm<br />
library from<br />
LEDA<br />
<strong>Hardware</strong> Area<br />
SoCLC (64 short CS locks +<br />
64 long CS locks)<br />
7435 gates<br />
RTU <strong>for</strong> 3 processors<br />
About 250000 gates<br />
FPGA World 2004 41<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004
Conclusion<br />
A framework <strong>for</strong> automatic generation <strong>of</strong> a custom<br />
HW/SW <strong>RTOS</strong><br />
Experimental results showing<br />
• a multiprocessor FPGA that utilizes the SoCDAU/SoCDDU<br />
has a 37% overall speedup <strong>of</strong> the application execution time<br />
over deadlock avoidance in s<strong>of</strong>tware<br />
• speedups <strong>with</strong> the SoCLC, RTU<br />
• addition hw <strong>RTOS</strong> component in references: SoCDMMU<br />
Future work<br />
support <strong>for</strong> heterogeneous processors<br />
support <strong>for</strong> multiple bus systems/structures<br />
FPGA World 2004 42<br />
HW/SW <strong>RTOS</strong> Project<br />
©Vincent J. Mooney III, 2004