Optimization techniques for ARM 7 microcontroller based embedded ...
Optimization techniques for ARM7 microcontroller based embedded systems<br />
presented on gamma correction algorithm implementation<br />
Kosta Karpuzovski<br />
FEIT Skopje
We will discuss<br />
Motivation<br />
General C code optimization techniques<br />
Optimization of the algorithm<br />
Gamma correction<br />
Fast power calculation<br />
Results<br />
General discussion for improvement
Motivation<br />
Allow the ARM7 processor to do real-time processing<br />
◦ Make the algorithm simple enough for the ARM7TDMI processor to execute it for the whole picture in less than 1/25 of a second<br />
Green power<br />
◦ Faster execution leaves more time for sleep mode operation<br />
Lower costs<br />
◦ Use cheaper components in the design
General C code optimization techniques<br />
Optimization of loops by counting down from max to zero<br />
Use of lookup tables instead of calculation<br />
◦ Use<br />
for (ArrayIndex = ImageLength; ArrayIndex != 0;)<br />
{<br />
ArrayIndex--;<br />
m = ImageArray[ArrayIndex];<br />
ImageArray[ArrayIndex] = LookupTable[m];<br />
}<br />
◦ Instead of<br />
for (ArrayIndex = 0; ArrayIndex < ImageLength; ArrayIndex++)<br />
{<br />
/* ... floating-point calculation of q ... */<br />
if (q >= 254.5f)<br />
q = 255.0f;<br />
ImageArray[ArrayIndex] = (unsigned int)q;<br />
}
General C code optimization techniques (2)<br />
Loop unrolling to minimize branching in code<br />
Use of do-while instead of a for loop when we know that the loop will be executed at least once<br />
Use of the ternary operator<br />
◦ MinRange = (MinRange > 255) ? MinRange : MinValue; is better than<br />
◦ MinRange = (255 < MinRange) ? MinValue : MinRange;<br />
Use of local variables in loops<br />
◦ Use local variables in all calculations involving global ones and write the result back to the global variables once all calculations are done<br />
Use of break for premature loop exit
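The advice on local variables can be illustrated with a minimal sketch. The global `PixelSum` and the function `SumPixels` are hypothetical names chosen for this example and do not come from the presentation; the point is only that the global is read once before the loop and written once after it, so the running value can stay in a register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global accumulator, standing in for any global used in a loop. */
uint32_t PixelSum;

/* Adds Length bytes to PixelSum; the global is touched only twice. */
void SumPixels(const uint8_t *Data_p, size_t Length)
{
    uint32_t LocalSum = PixelSum;   /* local copy can live in a register */

    assert(Data_p != NULL || Length == 0);

    while (Length != 0) {           /* countdown loop, compared against zero */
        LocalSum += *Data_p++;
        Length--;
    }

    PixelSum = LocalSum;            /* single write-back to the global */
}
```

Whether this is faster than updating the global directly depends on the compiler and its optimization level, so it is worth checking the generated code on the actual target.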
General C code optimization techniques (3)<br />
LoopCounter = ImageLength;<br />
do {<br />
InputValue = *ImageArray_p;<br />
ImageArray_p++;<br />
MaxValue = (InputValue > MaxValue) ? InputValue : MaxValue;<br />
MinValue = (InputValue < MinValue) ? InputValue : MinValue;<br />
InputValue = *ImageArray_p;<br />
ImageArray_p++;<br />
MaxValue = (InputValue > MaxValue) ? InputValue : MaxValue;<br />
MinValue = (InputValue < MinValue) ? InputValue : MinValue;<br />
...<br />
InputValue = *ImageArray_p;<br />
ImageArray_p++;<br />
MaxValue = (InputValue > MaxValue) ? InputValue : MaxValue;<br />
MinValue = (InputValue < MinValue) ? InputValue : MinValue;<br />
LoopCounter -= 8;<br />
if ((0 == MinValue) && (0xff == MaxValue))<br />
break;<br />
} while (LoopCounter != 0);<br />
*MinValue_p = MinValue;<br />
*MaxValue_p = MaxValue;
General C code optimization techniques (4)<br />
Use of enumerated constants for switch() cases, to make the case values continuous from max to 0<br />
typedef enum {<br />
PARSER_EXECUTE_COMMAND = 0, /*< State to execute command */<br />
PARSER_WAIT_PAYLOAD, /*< State to receive payload */<br />
PARSER_WAIT_COMMAND, /*< State to receive command */<br />
PARSER_IDLE, /*< State for all initializations */<br />
} ParserState_e;<br />
static ParserState_e State = PARSER_IDLE;
General C code optimization techniques (5)<br />
switch(State) {<br />
case PARSER_IDLE:<br />
if (NULL == ReceiverContext_p)<br />
ReceiverContext_p=(ReceiverContext_t*)malloc(sizeof(ReceiverContext_t));<br />
assert(NULL != ReceiverContext_p);<br />
InitialiseReceiver(ReceiverContext_p, &State);<br />
State = PARSER_WAIT_COMMAND;<br />
/* fall through */<br />
case PARSER_WAIT_COMMAND:<br />
ReceiveCommand(ReceiverContext_p, &State, ReceivedChar);<br />
break;<br />
case PARSER_WAIT_PAYLOAD:<br />
ReceiveCommand(ReceiverContext_p, &State, ReceivedChar);<br />
ReceivePayload(ReceiverContext_p, &State, ReceivedChar);<br />
if (PARSER_WAIT_PAYLOAD == State)<br />
break;<br />
/* fall through once the payload is complete */<br />
case PARSER_EXECUTE_COMMAND:<br />
ExecuteCommand(ReceiverContext_p, &State);<br />
break;<br />
default:<br />
State = PARSER_IDLE;<br />
break;<br />
}
General C code optimization techniques (6)<br />
Keep the number of parameters passed in a function call at 4 or fewer<br />
◦ To allow the compiler to pass all parameters in registers<br />
Keep the number of local variables at 7 or fewer<br />
◦ To allow the compiler to use registers for local calculations<br />
Use compiler invocation settings with the optimization level set to maximum<br />
Optimization of the algorithm<br />
◦ Most important part of the optimization process
Optimization of the algorithm<br />
Use an open-minded approach when solving problems<br />
◦ Read about the subject and what has already been done in the field<br />
◦ Choose the algorithm that will potentially give the best results on the platform you use<br />
◦ Try to find alternative solutions for all computational parts of the algorithm<br />
Avoid division on ARM controllers<br />
Use solutions that are faster or less memory demanding than the ones offered by the generic libraries<br />
◦ Find the bottlenecks and then optimize<br />
Do not optimize code that is executed only once; the gain will probably not match the effort<br />
Specifics of the gamma correction algorithm we are implementing<br />
◦ First we make a range correction<br />
Input values in the range from InMin to InMax are mapped to the range from OutMin to OutMax<br />
◦ We limit the range of values<br />
Input and output values are in the range between 0 and 255<br />
◦ We saturate the signal<br />
There are no values bigger than 255<br />
◦ We define the optimization target<br />
Our main concern is speed of execution
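One possible way to combine the range correction, the saturation and the advice to avoid division is sketched below. It is an illustration under assumptions, not the method used in the presentation: the names `RangeScale` and `RangeCorrect` and the 16.16 fixed-point format are hypothetical choices. The single division happens once per image; each value then needs only a multiply and a shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 16.16 fixed-point scale factor, computed once per image.
   This is the only division; the per-value path uses multiply and shift. */
uint32_t RangeScale(uint8_t InMin, uint8_t InMax, uint8_t OutMin, uint8_t OutMax)
{
    assert(InMax > InMin && OutMax >= OutMin);
    return ((uint32_t)(OutMax - OutMin) << 16) / (uint32_t)(InMax - InMin);
}

/* Maps Value from [InMin, InMax] to [OutMin, OutMax] and saturates at 255. */
uint8_t RangeCorrect(uint8_t Value, uint8_t InMin, uint8_t OutMin, uint32_t Scale)
{
    uint32_t Result;

    if (Value < InMin)
        Value = InMin;                       /* clamp below the input range */

    /* multiply by the precomputed scale, round, drop the fractional bits */
    Result = (((uint32_t)(Value - InMin) * Scale + 0x8000u) >> 16) + OutMin;

    return (Result > 255u) ? 255u : (uint8_t)Result;   /* saturate the signal */
}
```

Both functions take four parameters, so they also stay within the register-passing limit mentioned earlier. The truncated scale factor can make a result differ by one from the exact quotient.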
<strong>Optimization</strong> of the algorithm (2)
Optimization of the algorithm (3)<br />
Flowchart: Gamma correct<br />
◦ Find minimum/maximum<br />
◦ Index = ImageLength<br />
◦ While Index != 0, repeat:<br />
Index--;<br />
Value = ImageArray[Index];<br />
Value = (Value - ImgMin) * ExtRange / ImgRange + ExtMinValue;<br />
Value = powf(Value, Gamma);<br />
If Value >= 254.5, then Value = 255;<br />
ImageArray[Index] = Value;<br />
◦ Exit
Optimization of the algorithm (4)<br />
Flowchart: Fast gamma correct<br />
◦ Find minimum/maximum<br />
◦ Create the corrected range power lookup table<br />
◦ Index = ImageLength<br />
◦ While Index != 0, repeat:<br />
Index--;<br />
Value = ImageArray[Index];<br />
Value = LookupTable[Value];<br />
ImageArray[Index] = Value;<br />
◦ Exit
Optimization of the algorithm (5)<br />
Flowchart: Find minimum/maximum<br />
◦ Check if all elements in the image are passed; if yes, exit<br />
◦ One of eight "Find min/max" units unrolled in this function:<br />
Value = *ImageArray_p;<br />
ImageArray_p++;<br />
If Value > MaxValue, then MaxValue = Value;<br />
If Value < MinValue, then MinValue = Value;<br />
◦ ...<br />
◦ Check if MinValue = 0 and MaxValue = 255; if yes, exit, otherwise repeat
Gamma correction<br />
◦ General definition<br />
$s = c\,(r - \varepsilon)^{\gamma}$<br />
◦ Simplified formula<br />
$s = r^{\gamma}$
Fast power calculation<br />
Basic idea for fast power calculation<br />
◦ Use of the IEEE-754 float representation format<br />
◦ Bit manipulations<br />
◦ Table lookup<br />
IEEE-754 floating point number representation<br />
sign exponent mantissa<br />
S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM<br />
bit 31 (sign), bits 30 to 23 (exponent), bits 22 to 0 (mantissa)
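The bit layout above can be read out in C as in the following sketch. The type `FloatFields_t` and the function `SplitFloat` are hypothetical names for this example, and the code assumes that `float` is a 32-bit IEEE-754 single-precision value, which the assertion checks only by size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t Sign;      /* bit 31 */
    uint32_t Exponent;  /* bits 30..23, biased by 127 */
    uint32_t Mantissa;  /* bits 22..0, the implicit leading 1 is not stored */
} FloatFields_t;

FloatFields_t SplitFloat(float Value)
{
    FloatFields_t Fields;
    uint32_t Bits;

    assert(sizeof(Bits) == sizeof(Value));   /* assumes a 32-bit float */
    memcpy(&Bits, &Value, sizeof(Bits));     /* well-defined way to read the bits */

    Fields.Sign     = Bits >> 31;
    Fields.Exponent = (Bits >> 23) & 0xFFu;
    Fields.Mantissa = Bits & 0x7FFFFFu;
    return Fields;
}
```

For example, 1.0f has sign 0, exponent 127 and mantissa 0, while -6.0f (which is -1.5 times 2 squared) has sign 1, exponent 129 and mantissa 0x400000.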
Fast power calculation (2)<br />
IEEE-754 floating point number representation<br />
◦ $(\pm)\,(1 + \mathrm{mantissa}/2^{23}) \cdot 2^{(\mathrm{exponent} - 127)}$<br />
◦ Number math<br />
$a^{(b+c)} = a^{b} \cdot a^{c}$<br />
$2^{(\mathrm{WholeNumber} + \mathrm{DecimalPart})} = 2^{\mathrm{WholeNumber}} \cdot 2^{\mathrm{DecimalPart}}$<br />
◦ $2^{\mathrm{DecimalPart}}$ is in the range $1 \le 2^{\mathrm{DecimalPart}} < 2$
Fast power calculation (3)<br />
Mantissa table precision<br />
◦ Average error of 0.01% for an 11-bit table (8 kB)<br />
◦ With a 9-bit table we see a shift of 1 in two values in the range 0 to 255
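The table sizes follow from the number of index bits. As a short worked calculation, assuming each table entry is stored as a 4-byte word (an assumption consistent with the 8 kB figure, but not stated on the slide):

```latex
\begin{align*}
\text{11-bit table:}\quad 2^{11} \times 4\,\mathrm{B} &= 8192\,\mathrm{B} = 8\,\mathrm{kB} \\
\text{9-bit table:}\quad 2^{9} \times 4\,\mathrm{B} &= 2048\,\mathrm{B} = 2\,\mathrm{kB}
\end{align*}
```

Under that assumption, the 9-bit table uses a quarter of the memory at the cost of the two shifted output values.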
Results<br />
Amount of data | No C code optimizations | C code optimized | Execution time gain (ratio)<br />
19200 Byte (forest), high compiler optimization | 14.6 s | 0.083 s | 176<br />
19200 Byte (tire), high compiler optimization | 14.67 s | 0.085 s | 173<br />
19200 Byte (forest), no compiler optimization | 14.6 s | 0.1157 s | 126<br />
19200 Byte (tire), no compiler optimization | 14.67 s | 0.1183 s | 124
General discussion for improvement<br />
Better results with a board with more RAM to keep both pictures<br />
Optimize the Max/Min function with an additional check whether we have reached 255 or 0<br />
Limit the gamma range from 0.3 to 3 and optimize further<br />
Time under 40 ms