
PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK)

USING TBB (THREADING BUILDING BLOCKS)

Parag Sheth

B.E., L. D. College of Engineering, 2006

PROJECT

Submitted in partial satisfaction of

the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL

2010


Approved by:

PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK)

USING TBB (THREADING BUILDING BLOCKS)

A Project

by

Parag Sheth

_______________________________, Committee Chair
Ted Krovetz, Ph.D.

_______________________________, Second Reader
Chung-E Wang, Ph.D.

____________________
Date


Student: Parag Sheth

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

__________________________, Graduate Coordinator    _____________________
Nikrouz Faroughi, Ph.D.                              Date

Department of Computer Science


Abstract

of

PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK)

USING TBB (THREADING BUILDING BLOCKS)

by

Parag Sheth

This project explores Intel's open source C++ library TBB (Threading Building Blocks) and uses it to analyze the performance gain obtainable for an implementation of OCB (Offset Codebook). The analysis begins by identifying the parallelizable portions of the OCB algorithm, followed by an implementation using TBB. The performance gain is then analyzed while varying several parameters of the OCB algorithm.

TBB (Threading Building Blocks) :

TBB is Intel's open source template library for C++. Its aim is to provide task-level parallelism as opposed to thread-level parallelism, which makes implementations more portable and easier to understand. Internally, the TBB library keeps a pool of worker threads. The application developer only needs to specify the parallel portions of the application; most of the remaining work is handled by the TBB library. The library determines the required number of threads for the task and schedules them on the available processor cores. TBB uses a work-stealing scheduler design to schedule its threads.

For developers, the benefits of TBB are :

1. It reduces the length of the code for a multithreaded application.

2. It relieves the programmer from handling thread management details.

3. It automatically identifies the underlying system and determines the optimal number of threads. It also automatically balances the workload between these threads and makes maximum use of all the available processor cores to achieve maximum performance.

4. Applications developed using TBB automatically become portable and scalable to machines with any number of cores.

OCB (Offset Codebook) :

OCB is a shared-key encryption and authentication scheme built from a block cipher. OCB achieves authenticated encryption in essentially the same amount of time that other modes, like CBC, need to achieve privacy alone. In other words, it takes about half the time of "conventional" modes, like CCM, to achieve privacy and authenticity combined. On top of this, OCB is a simple and highly parallelizable method which can be implemented easily in both hardware and software. It can also be proved that OCB is as secure as its underlying primitive algorithms.


Some of the key features of OCB are :

1. It can encrypt messages of any bit length; messages do not have to be a multiple of the block length.

2. Encryption and decryption depend on an n-bit nonce N, which must be a fresh value for each encryption. The nonce need not be random or secret.

3. It is an online algorithm, meaning one need not know the length of the header or message to proceed with encryption, and one need not know the length of the header or ciphertext to proceed with decryption.

4. OCB is parallelizable: the bulk of its block cipher calls may be performed simultaneously. Thus OCB is suitable for encrypting messages in hardware at the highest network speeds.

5. It needs very little memory to run.

6. It is nearly endian-neutral.

___________________________________, Committee Chair
Ted Krovetz, Ph.D.

_______________________
Date


TABLE OF CONTENTS

                                                                                        Page

List of Tables ..................................................................... ix

List of Figures .................................................................... x

Chapter

1. INTRODUCTION TO AUTHENTICATED ENCRYPTION ....................................... 1

   Security Model .................................................................. 1

      Notions of Security .......................................................... 2

      Notions of Attacks ........................................................... 3

      Security of Message Authentication Code (MAC) ................................ 4

   Authenticated Encryption ........................................................ 4

2. INTRODUCTION TO OFFSET CODEBOOK (OCB) ............................................ 7

   Overview ........................................................................ 7

   Notation and Basic Operation .................................................... 9

   OCB Parameters .................................................................. 9

   Header Authentication : PMAC ................................................... 10

   Encryption : OCB-ENCRYPT ....................................................... 11

   Decryption : OCB-DECRYPT ....................................................... 12

   Parallel Portion of the Encryption / Decryption Algorithm ...................... 13

   Security Consideration of OCB .................................................. 14

3. INTRODUCTION TO THREADING BUILDING BLOCKS (TBB) ................................. 15

   Overview ....................................................................... 15

   Task Scheduling ................................................................ 16

   TBB Provided Algorithms ........................................................ 19

   Containers ..................................................................... 20

   Scalable Memory Allocation ..................................................... 21

4. OCB IMPLEMENTATION USING TBB .................................................... 22

   Class Definition ............................................................... 22

      Class Definition : OCB_With_TBB ............................................. 23

      Class Definition : encryptBlockParallel ..................................... 24

      Class Definition : xorBlockParallel ......................................... 25

   Class Implementation ........................................................... 26

5. RESULTS ......................................................................... 34

   Experiments .................................................................... 34

   Conclusion ..................................................................... 40

References ......................................................................... 52


LIST OF TABLES

                                                                                        Page

1. Table 1 CPB Comparison at Different Processor Cores (Experiment A) .............. 41

2. Table 2 CPB Comparison at Different Block Lengths (Experiment B) ................ 49

3. Table 3 CPB Comparison at Different Chunk Sizes (Experiment C) .................. 51


LIST OF FIGURES

                                                                                        Page

1. Figure 1 Sample Task Graph ...................................................... 17

2. Figure 2 Sample Ready Pool ...................................................... 18

3. Figure 3 CPB Comparison at Different Processor Cores ............................ 35

4. Figure 4 CPB Comparison at Different Block Lengths .............................. 37

5. Figure 5 CPB Comparison at Different Chunk Sizes ................................ 39


Chapter 1

INTRODUCTION TO AUTHENTICATED ENCRYPTION

Authentication and encryption are two different objectives to be achieved while designing any secure communication system. Encryption refers to the privacy of the actual message, while authentication refers to the mechanism that can prove that the sender of the message is actually who he or she claims to be. Formally speaking, "Encryption is the process of transforming information (referred to as plaintext) using an algorithm (called a cipher) to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key." [1] The concept of authenticity is similar to the concept of a signature in the real world.

1.1 Security Model

There are various algorithms available for encryption and authentication purposes. One needs to make sure that the algorithm one is planning to use gives enough security against all the possible types of attack in a given scenario. To understand that, we need to formally define the security model. In other words, we need to formalize the different types of possible attacks on the system and the different levels of security that an algorithm can provide against those attacks. It is quite possible that some encryption algorithms are secure against a particular type of attack but are easily broken by another kind of attack.


1.1.1 Notions of Security

There are essentially three notions of security that need to be defined: perfect security, semantic security and polynomial security.

1. Perfect Security : An algorithm is said to have perfect security, or information-theoretic security, if an adversary with an infinite amount of computational power can learn nothing about the plaintext given the ciphertext. This is a very strong definition and no such algorithm is possible in the real world.

2. Semantic Security : This notion is similar to perfect security, but here an adversary is given only a polynomial amount of time. Polynomial time can be defined as t = f(|M|), where |M| is the length of the given message. In other words, an algorithm is said to have semantic security if an adversary can learn nothing about the plaintext given the ciphertext within a certain finite (polynomial) amount of time.

3. Polynomial Security : This is an extension of semantic security, and it is also provable. Here an adversary is allowed to select two messages M1 and M2 of the same length. The adversary is then given the ciphertext C_b of one of these messages, where b is a randomly chosen unknown bit. The algorithm is said to have polynomial security if the adversary cannot identify which message (M1 or M2) corresponds to the ciphertext C_b with probability significantly higher than 1/2. It can be proved that if an algorithm is polynomially secure then it is also semantically secure. Here the advantage of an adversary A is Adv(A) = | Pr( A(guess, C_b, y, M1, M2) = b ) - 1/2 |, where y is a secret key. The scheme is polynomially secure if Adv(A) <= 1 / p(k), for all adversaries A, all polynomials p, and sufficiently large k.

1.1.2 Notions of Attacks

There are mainly three different kinds of attack: the passive attack, the chosen ciphertext attack and the adaptive chosen ciphertext attack.

1. Passive Attack : This is the weakest form of attack, in which an adversary is allowed to observe only ciphertexts. The adversary also has access to an encryption black box to which he or she can submit plaintext blocks and observe the returned ciphertexts.

2. Chosen Ciphertext Attack (CCA) : Here an adversary is given access to a decryption box to which she can submit any number of ciphertexts and observe the returned plaintext messages. In the next stage she is given a challenge ciphertext and is asked to recover the plaintext, or at least some information about the plaintext. In this later stage she is not allowed to use the decryption box.

3. Adaptive Chosen Ciphertext Attack : This is a very strong type of attack. Here, in addition to all the access given in a CCA, an adversary is also allowed to use the decryption box during the challenge stage, on anything except the challenge ciphertext itself.

Based on the notions above we can say that "a public key encryption algorithm is said to be secure if it is polynomially secure against an adaptive chosen ciphertext attack." [2] A similar approach defines the security of a symmetric encryption algorithm. The actual difference between them is that a public key encryption scheme must be probabilistic, while a symmetric key encryption scheme may use a deterministic algorithm.

1.1.3 Security of Message Authentication Code (MAC)

The security of a MAC can be defined in various ways, but selective forgery is a widely used notion. In this notion an adversary is asked to choose a plaintext message M1. The MAC generator algorithm returns the MAC S1 computed under some random key K. The challenge for the adversary is then to generate another valid pair (M2, S2) where M1 ≠ M2. If the adversary succeeds in generating such a pair, this is known as a selective forgery.

1.2 Authenticated Encryption

Having discussed the security model, we are now in a position to discuss authenticated encryption. Various practices for providing privacy and authenticity have been used for years. In the traditional approach, encryption and authentication algorithms are applied one after the other to achieve data security and authenticity. These kinds of schemes are known as "generic compositions". Here the encryption and authentication algorithms can be applied in either order, and based on that the schemes are known as Encrypt-then-MAC (EtM), MAC-then-Encrypt (MtE) or Encrypt-and-MAC (E&M). One such EtM generic composition scheme can be described as follows. Suppose Bob wants to send a message to Alice, and they share a secret key K. Bob first encrypts the message using this key K and possibly a nonce N. The nonce here can be any random number or a counter value. After the ciphertext is generated, it is fed to the authentication algorithm along with the key, which produces a message authentication tag T. Bob then sends the triplet (C, N, T) to Alice. On the other end, Alice applies exactly the reverse process to retrieve the original plaintext and to verify the authenticity of the received message. There are various such schemes available, but the problem with them is that the encryption and authentication functions must be applied separately, which takes almost double the time and processing power. Sometimes designers make the mistake of using a regular hash instead of a keyed secure hash (a MAC); this approach is almost always broken. In conclusion, it is best for any generic composition scheme to use the Encrypt-then-MAC (EtM) approach with a provably secure encryption scheme and a provably secure MAC, each with independent keys.
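The EtM flow described above can be sketched in code. The sketch below is purely illustrative: `toyEncrypt` and `toyMac` are deliberately insecure stand-ins of my own (a XOR "keystream" and an additive checksum), not real primitives and not part of this project's implementation; only the ordering - encrypt, MAC the ciphertext, verify the tag before decrypting - is the point.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-ins: NOT secure, purely illustrative of the data flow.
std::string toyEncrypt(const std::string &key, uint8_t nonce, const std::string &m) {
    std::string c = m;
    for (size_t i = 0; i < c.size(); ++i)
        c[i] ^= key[i % key.size()] ^ nonce;      // "keystream" = key byte xor nonce
    return c;                                      // XOR cipher: decryption is the same call
}

uint32_t toyMac(const std::string &key, const std::string &c) {
    uint32_t tag = 0;
    for (size_t i = 0; i < c.size(); ++i)          // keyed checksum over the ciphertext
        tag = tag * 31 + (uint8_t)c[i] + (uint8_t)key[i % key.size()];
    return tag;
}

struct Triplet { std::string C; uint8_t N; uint32_t T; };

// Bob's side: encrypt first, then MAC the ciphertext (the EtM order).
Triplet sealEtM(const std::string &encKey, const std::string &macKey,
                uint8_t nonce, const std::string &msg) {
    std::string c = toyEncrypt(encKey, nonce, msg);
    return { c, nonce, toyMac(macKey, c) };
}

// Alice's side: verify the tag BEFORE decrypting; refuse on mismatch.
bool openEtM(const std::string &encKey, const std::string &macKey,
             const Triplet &t, std::string &out) {
    if (toyMac(macKey, t.C) != t.T) return false;  // reject unauthenticated ciphertext
    out = toyEncrypt(encKey, t.N, t.C);
    return true;
}
```

Note that the two algorithms use independent keys (`encKey`, `macKey`), matching the recommendation above, and that a tampered ciphertext is rejected without ever being decrypted.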

The concept of authenticated encryption (AE) is to provide a single method which can achieve data security and authenticity in a single pass, thus improving efficiency. Some researchers have pointed out that even when the individual elements of a scheme are secure, the combined scheme - if not designed properly - may lead to an insecure implementation. A properly designed AE scheme can also provide security against the chosen ciphertext attack. This is the kind of attack where an adversary can submit carefully chosen ciphertexts to the decryption oracle; by analyzing patterns in the returned plaintext, the adversary may be able to extract some information about the secret key being used. An ideal AE scheme will instead simply refuse to decrypt a message that is not properly authenticated, without giving out much information. Thus an adversary cannot submit just any ciphertext and expect its corresponding plaintext, which thwarts the CCA. Other important aspects in designing an AE scheme are efficiency, parallelizability, simplicity and portability. One such scheme, known as Offset Codebook (OCB), is described in the next chapter.

The security of any AE scheme depends on its primitive algorithms. It is very difficult to provide a proof of security for such primitives. But once it has been shown that no known attacks seem to work against a primitive, it is possible to show that schemes based on it are as secure as the underlying algorithms. In this vein, it can be proved that the OCB scheme is as secure as its underlying encryption algorithm.


Chapter 2

INTRODUCTION TO OFFSET CODEBOOK (OCB)

2.1 Overview

The Offset Codebook (OCB) mode of authenticated encryption was developed by Phillip Rogaway, with contributions credited to Mihir Bellare, John Black and Ted Krovetz. The mode is based on the IAPM scheme developed by Charanjit Jutla. The OCB scheme improves on the original IAPM scheme in several respects: 1) it minimizes the number of block cipher calls; 2) it specifies what to do when the length of the original message is not a multiple of the block length n; 3) it avoids multiple encryption keys; and 4) it uses a nonce which must be unique for each encryption but is not required to be secret or random. There are two versions of the OCB scheme: the initial version 1.0 and the current version 2.0, which improves on version 1.0. The key differences are that version 2.0 allows associated data to be included with the message and uses a new method for generating the sequence of offsets. The associated data travels in plaintext along with the ciphertext, but it needs to be authenticated. This is similar to the header requirement discussed in Chapter 1.

OCB uses a block cipher - typically AES. It allows a predefined header to be authenticated along with the message. OCB also requires a unique nonce N along with


each encryption. It typically requires h + m + 2 block cipher calls in total, where h is the block length of the header and m is the block length of the original message. Once a header is authenticated, there is virtually no cost in subsequent authentications of H, so OCB then uses m + 2 block cipher calls. OCB is also highly parallelizable: as will be discussed later, some parts of the OCB algorithm can be performed independently, which implies that the efficiency of the OCB operation can improve dramatically if the underlying hardware supports robust parallel processing. Another advantage of OCB is that it is an online scheme. In other words, it is not required to know the length of the complete message before starting encryption; similarly, it is not required to know the length of the complete ciphertext before starting decryption. OCB's output is the same length as the original message plus the length of the authentication tag. This is a significant advantage, as it minimizes the actual data being transferred, though it might be a cause for concern in cases where traffic analysis is possible; in such scenarios other measures, such as padding, need to be used. The following sections describe the actual algorithms, as pseudocode, for OCB and its constructs.


2.2 Notation and Basic Operation [3]

c^i            The integer c raised to the i-th power
ceil(x)        The smallest integer no smaller than x
bitlength(S)   The length of string S in bits
zeros(n)       The string made of n zero bits
S xor T        The string that is the bitwise exclusive-or of S and T. Strings S and T
               must have the same length
S[i]           The i-th bit of the string S (indices begin at 1)
S[i..j]        The substring of S consisting of bits i through j
S || T         The string S concatenated with string T (e.g., 000 || 111 = 000111)

S


2.4 Header Authentication : PMAC [3]

Function Name : PMAC
Input :  K, string of KEYLEN bits        // Key
         H, string of any length         // Header to co-authenticate
Output : Auth, string of BLOCKLEN bits   // Header authenticator

// Break H into blocks
m = max(1, ceil(bitlength(H) / BLOCKLEN))
Let H_1, H_2, ..., H_m be strings such that H = H_1 || H_2 || ... || H_m
and bitlength(H_i) = BLOCKLEN for all 0 < i < m.

// Initialize strings used for offsets and checksums
Offset = ENCIPHER(K, zeros(BLOCKLEN))
Offset = times3(Offset)
Offset = times3(Offset)
Checksum = zeros(BLOCKLEN)

// Accumulate the first m - 1 blocks
for i = 1 to m - 1 do                    // Skip if m < 2
    Offset = times2(Offset)
    Checksum = Checksum xor ENCIPHER(K, H_i xor Offset)
end for

// Accumulate the final block
Offset = times2(Offset)
if bitlength(H_m) = BLOCKLEN then
    Offset = times3(Offset)
    Checksum = Checksum xor H_m
else
    Offset = times3(Offset)
    Offset = times3(Offset)
    Tmp = H_m || 1 || zeros(BLOCKLEN - (bitlength(H_m) + 1))
    Checksum = Checksum xor Tmp
end if

// Compute result
Auth = ENCIPHER(K, Offset xor Checksum)

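PMAC, and the OCB algorithms below, derive successive offsets with the times2 and times3 operations, which are doubling and tripling in GF(2^128). A sketch of those helpers for BLOCKLEN = 128 follows; the reduction constant 0x87 is the value used in the OCB specification for 128-bit blocks, while the `Block` type and function shapes are my own illustration, not the project's code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Block = std::array<uint8_t, 16>;   // one 128-bit block; byte 0 is most significant

// times2: double the block in GF(2^128), i.e. shift left one bit and, if the
// top bit fell off, xor the reduction constant 0x87 into the low byte
// (reduction modulo x^128 + x^7 + x^2 + x + 1).
Block times2(const Block &s) {
    Block r{};
    uint8_t carry = 0;
    for (int i = 15; i >= 0; --i) {      // shift left, least significant byte first
        r[i] = (uint8_t)((s[i] << 1) | carry);
        carry = s[i] >> 7;
    }
    if (s[0] & 0x80) r[15] ^= 0x87;      // fold the overflow bit back in
    return r;
}

// times3: triple the block, defined as times2(S) xor S.
Block times3(const Block &s) {
    Block r = times2(s);
    for (int i = 0; i < 16; ++i) r[i] ^= s[i];
    return r;
}
```

These are cheap, constant-time bit operations, which is why each loop iteration of PMAC and OCB-ENCRYPT can afford a fresh `times2(Offset)` call.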


2.5 Encryption : OCB-ENCRYPT [3]

Function Name : OCB-ENCRYPT
Input :  K, string of KEYLEN bits           // Key
         N, string of BLOCKLEN bits         // Nonce
         H, string of any length            // Header
         M, string of any length            // Plaintext
Output : C, string of length equal to M     // Ciphertext core
         T, string of BLOCKLEN bits         // Authentication tag

// Break M into blocks
m = max(1, ceil(bitlength(M) / BLOCKLEN))
Let M_1, M_2, ..., M_m be strings such that M = M_1 || M_2 || ... || M_m
and bitlength(M_i) = BLOCKLEN for all 0 < i < m.

// Initialize strings used for offsets and checksums
Offset = ENCIPHER(K, N)
Checksum = zeros(BLOCKLEN)

// Encrypt and accumulate first m - 1 blocks
for i = 1 to m - 1 do                       // Skip if m < 2
    Offset = times2(Offset)
    Checksum = Checksum xor M_i
    C_i = Offset xor ENCIPHER(K, M_i xor Offset)
end for

// Encrypt and accumulate final block
Offset = times2(Offset)
b = bitlength(M_m)                          // Value in 0..BLOCKLEN
Pad = ENCIPHER(K, num2str(b, BLOCKLEN) xor Offset)
C_m = M_m xor Pad[1..b]                     // Encrypt M_m
Tmp = M_m || Pad[b+1..BLOCKLEN]
Checksum = Checksum xor Tmp

// Compute authentication tag
Offset = times3(Offset)
T = ENCIPHER(K, Checksum xor Offset)
if bitlength(H) > 0 then
    T = T xor PMAC(K, H)
end if

// Assemble the ciphertext
C = C_1 || C_2 || ... || C_m



2.6 Decryption : OCB-DECRYPT [3]

Function Name : OCB-DECRYPT
Input :  K, string of KEYLEN bits           // Key
         N, string of BLOCKLEN bits         // Nonce
         H, string of any length            // Header
         C, string of any length            // Ciphertext core
         T, string                          // Authentication tag to verify
Output : M, string                          // Plaintext
         V, boolean                         // Validity indicator

// Break C into blocks
m = max(1, ceil(bitlength(C) / BLOCKLEN))
Let C_1, C_2, ..., C_m be strings such that C = C_1 || C_2 || ... || C_m
and bitlength(C_i) = BLOCKLEN for all 0 < i < m.

// Initialize strings used for offsets and checksums
Offset = ENCIPHER(K, N)
Checksum = zeros(BLOCKLEN)

// Decrypt and accumulate first m - 1 blocks
for i = 1 to m - 1 do                       // Skip if m < 2
    Offset = times2(Offset)
    M_i = Offset xor DECIPHER(K, C_i xor Offset)
    Checksum = Checksum xor M_i
end for

// Decrypt and accumulate final block
Offset = times2(Offset)
b = bitlength(C_m)                          // Value in 0..BLOCKLEN
Pad = ENCIPHER(K, num2str(b, BLOCKLEN) xor Offset)
M_m = C_m xor Pad[1..b]
Tmp = M_m || Pad[b+1..BLOCKLEN]
Checksum = Checksum xor Tmp

// Compute valid authentication tag
Offset = times3(Offset)
FullValidTag = ENCIPHER(K, Offset xor Checksum)
if bitlength(H) > 0 then
    FullValidTag = FullValidTag xor PMAC(K, H)
end if

// Verify the received tag
if T = FullValidTag[1..bitlength(T)] then
    V = true
    M = M_1 || M_2 || ... || M_m
else
    V = false
    M = <empty string>
end if



2.7 Parallel Portion of the Encryption / Decryption Algorithm

If we look carefully at the encryption algorithm, it is evident that the main loop that encrypts the plaintext is highly parallelizable. Calculating the ciphertext for message block i requires knowledge of only the current offset value; it does not depend on any other message block or ciphertext block. The offsets can be calculated separately, and it is even possible to calculate the offset values in advance, as they do not depend on the actual plaintext values. Thus, until we reach the last message block, n processes (depending on the available independent processing units) can each encrypt independent blocks of the message while keeping track of their own partial checksum values. Since the checksum is calculated by XORing each message block into the running checksum value, and XOR is an associative and commutative operation, individual processing units can maintain their own checksum values, and at the end the final checksum can be calculated by XORing the individual checksums together. Similar parallelizability exists in the decryption algorithm as well. It is therefore quite possible to take advantage of this highly parallel scheme and implement it such that the implementation becomes scalable and efficient in using all the available processing power. Threading Building Blocks (TBB) is one such mechanism, provided by Intel, which allows the developer to use all the available processor cores without making the implementation overly complicated. Subsequent chapters discuss TBB in detail, along with the implementation of OCB using TBB.
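The per-worker partial-checksum idea can be sketched with standard C++ threads. This is my own standalone illustration rather than the project's TBB code, and it uses a toy byte-wise checksum in place of 128-bit blocks; the point is only that private accumulators combined by a final XOR give the same result as a sequential pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Each worker XORs its share of the blocks into a private partial checksum;
// the partials are XORed together at the end. Because XOR is associative and
// commutative, the result equals the sequential checksum regardless of how
// the blocks were split among workers.
uint8_t parallelChecksum(const std::vector<uint8_t> &blocks, unsigned nWorkers) {
    std::vector<uint8_t> partial(nWorkers, 0);
    std::vector<std::thread> pool;
    size_t chunk = (blocks.size() + nWorkers - 1) / nWorkers;
    for (unsigned w = 0; w < nWorkers; ++w) {
        pool.emplace_back([&, w] {
            size_t lo = w * chunk;
            size_t hi = std::min(blocks.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i)
                partial[w] ^= blocks[i];      // private accumulator: no locking needed
        });
    }
    for (auto &t : pool) t.join();
    uint8_t sum = 0;
    for (uint8_t p : partial) sum ^= p;       // combine the partial checksums
    return sum;
}
```

In the actual OCB loop, the same structure applies with 128-bit `Checksum` blocks and with each worker also computing its own precomputable offsets.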



2.8 Security Consideration of OCB

1. The OCB scheme is as secure as its underlying block cipher, so the designer should choose only a well trusted block cipher. The privacy and authenticity degrade as s^2 / 2^BLOCKLEN, where s is the total number of blocks that the adversary acquires. Thus BLOCKLEN should be selected carefully: choosing a smaller value for BLOCKLEN results in a higher probability of the adversary's success. Usually a BLOCKLEN of 128 bits is sufficient.

2. For secure operation, it is required that a nonce value is never repeated with the same encryption key. If multiple parties communicate with the same key, they should divide the nonce space so that their nonces do not overlap. The nonce is not required to be secret; a simple counter works fine.

3. The designer can also choose the length of the authentication tag, but choosing a small value for the tag length increases the chances of an adversary being capable of forgery.

4. The OCB scheme (or any other authenticated encryption scheme, for that matter) can provide security against the chosen ciphertext attack. But for that, the designer needs to make sure that when decryption or authentication fails, the system does not give any details to the adversary.
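One practical aspect of point 4 is how the tag comparison itself is coded: a byte-by-byte compare that exits at the first mismatch can leak, through timing, how many leading bytes of a forged tag were correct. A common mitigation is a constant-time comparison; the helper below is my own illustration, not part of the project code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compare two tags without branching on the data: accumulate all byte
// differences and test the accumulator once, so the running time does not
// depend on where the first mismatch occurs.
bool tagsEqual(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; ++i)
        diff |= (uint8_t)(a[i] ^ b[i]);   // nonzero iff any byte differs
    return diff == 0;
}
```

Combined with returning only a single "invalid" indication (as OCB-DECRYPT's V flag does), this keeps a failed decryption from handing useful information to the adversary.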



Chapter 3

INTRODUCTION TO THREADING BUILDING BLOCKS (TBB)

3.1 Overview

It is often quite challenging to develop a multi-threaded application that scales with the number of available processor cores. Ideally, if an application achieves X amount of performance on a dual-core machine, that performance should improve on a quad-core machine. If a developer tries to build this kind of scalable application using raw threads, such as POSIX or Windows threads, he or she needs to manage a lot of threading overhead, and in addition must take care of low-level concerns such as load balancing, memory contention and cache performance. Threading Building Blocks (TBB) is a template library for C++, developed by Intel, which helps developers in such cases. It is a high-level library in which the developer identifies the different tasks in the application rather than the threads. TBB automatically maps the tasks to an appropriate number of threads and runs them efficiently. TBB can identify the number of available processor cores and can load-balance the tasks to get maximum performance.

TBB offers a wide range of advantages over native threads, such as:

1. It is platform independent and processor independent.

2. It can be seamlessly integrated with other threading libraries in the same application.

15


3. TBB targets data-parallel programming. Instead of parallelizing independent tasks, it tries to divide one data-intensive task across multiple threads. This approach often works better in terms of efficiency.

4. Instead of relying on a global task queue, it uses a task-stealing mechanism, thus avoiding the main point of contention. Task stealing is described in more detail in the following subsection.

3.2 Task Scheduling

The task scheduler is the heart of TBB. It is the component that allocates tasks to the available worker threads and maintains the load balance. Whenever the task scheduler is initialized, it creates a task graph. Each node in the graph represents a task, and each arrow points to that task's parent. Each node keeps a count of its child tasks (a reference count) as well as a depth count, which is usually one more than its parent's. One such task graph is shown in figure 1. There are two ways to traverse this task graph: breadth first and depth first. Depth-first traversal is usually more efficient for sequential execution, for two main reasons.

1. The deepest task is the last one created, so its data is more likely to still be in cache. Executing such a task first obviously improves performance.

2. It minimizes memory usage. If the graph is unfolded in breadth-first fashion, many more new tasks are created simultaneously, all of which occupy memory.


On the other hand, a purely depth-first approach reduces parallelism, so the TBB task scheduler uses a fine mixture of both schemes.

[Figure: a task graph of five nodes, each annotated with its Depth and RefCount values; the root has Depth = 0, RefCount = 2, its children have Depth = 1, and the leaves have Depth = 2, RefCount = 0.]

Figure 1 : Sample Task Graph


The scheduler creates an appropriate number of worker threads based on the number of available processing units. Each task has an execute() method; once a thread begins executing a task, that task is bound to the thread and cannot migrate to another thread, although the thread may execute some other task while the current task is waiting. The task graph is searched in breadth-first fashion and the tasks are assigned to the worker threads. After that, each thread keeps its own ready pool. This pool is basically an array of lists: the array is indexed by the depth of the task node, and each list works as a stack (LIFO). A newly created task (in the ready state) is pushed onto the list at the level of its depth, and it always goes into the ready pool of the thread that created it.

[Figure: a ready pool indexed from shallow to deep, holding Task A at the shallowest level, Tasks B and C at the next level, and Task D at the deepest level.]

Figure 2 : Sample Ready Pool


When selecting the next task to execute, the following rules are applied in order:

1. Run the task returned by the execute() method of the previous task.

2. Otherwise, take the task at the deepest non-empty list in the thread's own pool.

3. If the thread's own pool is empty, steal a task from the shallowest list of another thread's pool.

In summary, the TBB task scheduler uses breadth-first stealing and a depth-first work strategy. Breadth first maximizes parallelism, while depth first ensures that the threads work efficiently once they have enough work to do.
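The ready-pool discipline described above can be sketched as a plain C++ structure. This is a simplified single-threaded model, not TBB's actual implementation (real TBB uses lock-free deques); it only shows the depth-indexed LIFO lists, with the owner popping from the deepest level and a thief stealing from the shallowest.

```cpp
#include <deque>
#include <optional>
#include <string>
#include <vector>

// Simplified model of a TBB-style per-thread ready pool:
// an array of LIFO lists indexed by task depth.
struct ReadyPool {
    std::vector<std::deque<std::string>> levels;  // index = task depth

    void push(std::size_t depth, const std::string& task) {
        if (depth >= levels.size()) levels.resize(depth + 1);
        levels[depth].push_back(task);            // LIFO: push on the back
    }
    // Owner thread: pop from the deepest non-empty list.
    std::optional<std::string> popDeepest() {
        for (auto it = levels.rbegin(); it != levels.rend(); ++it)
            if (!it->empty()) { auto t = it->back(); it->pop_back(); return t; }
        return std::nullopt;
    }
    // Thief thread: steal from the shallowest non-empty list.
    std::optional<std::string> stealShallowest() {
        for (auto& lvl : levels)
            if (!lvl.empty()) { auto t = lvl.front(); lvl.pop_front(); return t; }
        return std::nullopt;
    }
};
```
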

3.3 TBB-Provided Algorithms

TBB provides a number of constructs/algorithms that help parallelize the most common parallel structures in software. Some of them are listed below with a short description.

1. parallel_for and parallel_reduce : used when a fixed number of independent loop iterations needs to be parallelized.

2. parallel_scan : used to parallelize a loop in which each iteration depends on the previous iteration. For example, a running-sum calculation can be parallelized using this construct.

3. parallel_while : used to parallelize a continuous, unstructured stream of work. New work can be added on the fly.


4. pipeline : efficiently parallelizes segments of code with a typical pipeline structure.

5. parallel_sort : the complexity of this construct is no higher than O(n log n) on a single processor. As more processors become available, the complexity approaches O(n).

In each of these constructs it is possible to specify the chunk size manually. In that case TBB divides the work accordingly and each worker thread works on the specified chunk of data. TBB also provides an auto_partitioner, which determines the chunk size based on the available resources and the parallelism of the code. auto_partitioner works well in cases where the actual data size is not known in advance.
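To illustrate what parallel_scan automates, here is a hand-rolled two-pass parallel running sum written with plain std::thread (a sketch only; TBB would additionally handle grain sizing and load balancing): pass one sums each chunk independently, a short sequential step turns the chunk sums into starting offsets, and pass two writes the running sum inside each chunk.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Two-pass parallel prefix (running) sum over `numChunks` chunks.
std::vector<long> parallelPrefixSum(const std::vector<long>& in, int numChunks) {
    std::vector<long> out(in.size());
    if (in.empty()) return out;
    std::size_t chunk = (in.size() + numChunks - 1) / numChunks;
    std::vector<long> chunkSum(numChunks, 0);

    // Pass 1: each thread sums its own chunk independently.
    {
        std::vector<std::thread> ts;
        for (int c = 0; c < numChunks; ++c)
            ts.emplace_back([&, c] {
                std::size_t lo = c * chunk, hi = std::min(in.size(), lo + chunk);
                for (std::size_t i = lo; i < hi; ++i) chunkSum[c] += in[i];
            });
        for (auto& t : ts) t.join();
    }
    // Sequential step: turn chunk sums into starting offsets.
    std::vector<long> offset(numChunks, 0);
    for (int c = 1; c < numChunks; ++c) offset[c] = offset[c - 1] + chunkSum[c - 1];

    // Pass 2: each thread writes the running sum within its chunk.
    std::vector<std::thread> ts;
    for (int c = 0; c < numChunks; ++c)
        ts.emplace_back([&, c] {
            std::size_t lo = c * chunk, hi = std::min(in.size(), lo + chunk);
            long acc = offset[c];
            for (std::size_t i = lo; i < hi; ++i) { acc += in[i]; out[i] = acc; }
        });
    for (auto& t : ts) t.join();
    return out;
}
```
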

3.4 Containers

TBB provides highly concurrent containers. These containers are quite similar to the STL containers, but they are thread safe. STL containers are generally not thread safe, so the usual practice is to hold a lock while accessing them, which largely defeats the purpose of parallelism. TBB currently provides only three such containers: a concurrent queue, a concurrent vector and a concurrent hash map. TBB uses two different kinds of locking mechanisms to provide maximum parallelizability:

1. Fine-grained locking : only the portion of the container being used is locked, so as long as two threads access different portions of the same container, their accesses actually proceed in parallel.


2. Lock-free algorithms : concurrent access is allowed without locking the container, but any possible corruption is tracked and corrected.

These concurrent thread-safe containers are not as fast as their STL counterparts, but if used properly, the gain from parallelism can outweigh the slowness of the containers.
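For contrast, the usual coarse-grained way to make an STL container thread safe looks like the sketch below: one mutex guarding the whole std::queue, so every push and pop serializes. This is exactly the contention that tbb::concurrent_queue avoids with fine-grained and lock-free techniques.

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Coarse-grained wrapper: correct, but all access serializes on one lock.
template <typename T>
class LockedQueue {
    std::queue<T> q_;
    mutable std::mutex m_;
public:
    void push(const T& v) {
        std::lock_guard<std::mutex> lock(m_);  // whole container locked
        q_.push(v);
    }
    std::optional<T> tryPop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        T v = q_.front();
        q_.pop();
        return v;
    }
};
```
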

3.5 Scalable Memory Allocation

Memory allocation is a huge bottleneck on multiprocessor systems: all the parallel threads try to allocate memory from the same heap, and that reduces parallelism. There is also the danger of false sharing, which is caused by the way a processor accesses memory. Even if it needs to read only one byte, a processor has to read the entire cache line, so if two or more threads are using different bytes in the same cache line, the cache miss ratio will be very high and performance will suffer badly. To mitigate these problems TBB offers two different allocators:

1. scalable_allocator : allocating memory through this allocator ensures that each thread is given memory from a different pool.

2. cache_aligned_allocator : this allocator makes sure that, besides using separate pools for each thread, each memory allocation is aligned to a cache line. This is likely to increase memory wastage and hence should be used carefully.
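The false-sharing fix that cache_aligned_allocator applies at the allocator level can be shown directly in C++ with alignas (a sketch; 64 bytes is an assumed cache-line size, the real value is hardware dependent). Without the padding, the two counters below could land in the same cache line and every increment would invalidate the other core's copy of that line.

```cpp
#include <atomic>
#include <thread>

// Pad/align each thread's counter to its own (assumed 64-byte) cache line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

// Two threads each bump their own line-private counter.
long countInParallel(long perThread) {
    PaddedCounter c[2];
    std::thread t1([&] { for (long i = 0; i < perThread; ++i) c[0].value++; });
    std::thread t2([&] { for (long i = 0; i < perThread; ++i) c[1].value++; });
    t1.join();
    t2.join();
    return c[0].value + c[1].value;
}
```
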


Chapter 4

OCB IMPLEMENTATION USING TBB

This chapter presents the actual implementation of the OCB algorithm using the TBB template library. As discussed in chapter 1, this implementation tries to parallelize the main loop of the algorithm, which performs the actual encryption and authentication. The calculation of the offset values is not parallelized, as it can be computed in advance.

4.1 Class Definitions

The definition of class encryptBlockParallel in section 4.1.2 is the main place where the TBB operation is defined. Each worker thread created by TBB for the encryption process calls the function defined in this class. Looking at the function carefully, we can see that TBB actually passes a range of values described by the blocked_range template; the worker thread works on that specific range of values before taking the next task from the ready pool. Similarly, the class definition of xorBlockParallel in section 4.1.3 tries to parallelize the xor operation on the two blocks passed as its parameters. This function can really improve performance only if the block size is very large: TBB automatically weighs the cost of dividing the xor operation and does so only when it is worthwhile. So in our case, where the block size is 16 bytes, it is very unlikely that TBB will spread the operation across multiple worker threads.


4.1.1 Class Definition : OCB_With_TBB

#define BLOCK_LEN 16

typedef unsigned char byte;
typedef byte BLOCK[BLOCK_LEN];

using namespace tbb;
using namespace std;

class OCB_With_TBB
{
private:
    int nextBlockNum;
    HANDLE hThread1, hThread2, hSemaphore;
    DWORD dwThreadID1, dwThreadID2;
    BLOCK *pOffset;
    BLOCK key, nonce;
    BLOCK checksum;
    BLOCK currentOffset;
    CRijndael objAES;
    unsigned int lenHeader;
    unsigned int lenPlainText;

public:
    BLOCK *pPlainText, *pCipherText, *pHeader;
    BLOCK authTag;

    OCB_With_TBB(void);
    ~OCB_With_TBB(void);
    void Initialize(BLOCK* pKey, BLOCK* pNonce, BLOCK* pPlainText,
        unsigned int lenPlainText, BLOCK* pHeader, unsigned int lenHeader);
    void OCBEncrypt();
    // int getNextBlockNum();
    bool times2(BLOCK* input, BLOCK* output);
    bool times3(BLOCK* input, BLOCK* output);
    void xorBlock(BLOCK* retBlock, BLOCK* leftBlock, BLOCK* rightBlock);
    void xorBlockWithOffset(BLOCK* retBlock, BLOCK* input, int blockNum);
    void pmac(BLOCK* result);
    CRijndael* getAESObject();
};


4.1.2 Class Definition : encryptBlockParallel

class encryptBlockParallel
{
    OCB_With_TBB* const objOCB;

public:
    void operator()(const blocked_range<size_t>& range) const
    {
        BLOCK tempBlock1, tempBlock2;
        for(int i = 0; i < 100000; ++i)
        {
            for(size_t index = range.begin(); index < range.end(); ++index)
            {
                objOCB->xorBlockWithOffset(&tempBlock1,
                    &((objOCB->pPlainText)[index]), index);
                (objOCB->getAESObject())->EncryptBlock((const char*)
                    &tempBlock1, (char*)&tempBlock2);
                objOCB->xorBlockWithOffset(&((objOCB->pCipherText)
                    [index]), &tempBlock2, index);
            }
        }
    }

    encryptBlockParallel(OCB_With_TBB* objOCB) : objOCB(objOCB)
    {
    }
};


4.1.3 Class Definition : xorBlockParallel

class xorBlockParallel
{
    BLOCK *ret;
    BLOCK const *left;
    BLOCK const *right;

public:
    void operator()(const blocked_range<size_t>& range) const
    {
        for(size_t index = range.begin(); index < range.end(); ++index)
        {
            ((unsigned char*)ret)[index] = ((unsigned char*)left)[index] ^
                ((unsigned char*)right)[index];
        }
    }

    xorBlockParallel(BLOCK* retBlock, BLOCK* leftBlock, BLOCK* rightBlock)
    {
        ret = retBlock;
        left = leftBlock;
        right = rightBlock;
    }
};


4.2 Class Implementation

Below is the implementation of class OCB_With_TBB and its related functions. The main thing to notice here is the call to parallel_for() in the function OCBEncrypt. This TBB-provided function causes the loop to be divided into tasks, each of which is given to an available worker thread. The last parameter to parallel_for is an auto_partitioner object, which indicates that TBB will divide the work on its own. TBB also provides a way to supply our own partitioner, with which a programmer can specify how to divide the loop; in other words, we can specify the grain size.
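What parallel_for and the partitioner do can be approximated by hand: split the index range into grains and hand each grain to a thread. The sketch below uses plain std::thread instead of TBB, purely to illustrate the grain-size idea; TBB additionally load balances the grains via task stealing rather than spawning one thread per grain.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Apply `body` to every index in [0, n), chopping the range into grains
// of `grainSize` indices - a stand-in for parallel_for with a fixed
// grain size (TBB's auto_partitioner would pick grainSize itself).
void manualParallelFor(std::size_t n, std::size_t grainSize,
                       const std::function<void(std::size_t)>& body) {
    std::vector<std::thread> workers;
    for (std::size_t lo = 0; lo < n; lo += grainSize) {
        std::size_t hi = std::min(n, lo + grainSize);
        workers.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

Each index is handled by exactly one grain, so disjoint writes to an output array are safe without locks.
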

#define PROCESSOR_FREQUENCY 2680000000

void OCB_With_TBB::xorBlock(BLOCK* retBlock, BLOCK* leftBlock, BLOCK*
    rightBlock)
{
    int loopIndex = 0;
    /* parallel_for(blocked_range<size_t>(0, BLOCK_LEN), xorBlockParallel
        (retBlock, leftBlock, rightBlock), auto_partitioner()); */
    for(loopIndex = 0; loopIndex < BLOCK_LEN; ++loopIndex)
    {
        ((unsigned char*)retBlock)[loopIndex] = ((unsigned char*)leftBlock)
            [loopIndex] ^ ((unsigned char*)rightBlock)[loopIndex];
    }
}

void OCB_With_TBB::xorBlockWithOffset(BLOCK* retBlock, BLOCK* input, int
    blockNum)
{
    xorBlock(retBlock, input, &(pOffset[blockNum]));
}


bool OCB_With_TBB::times2(BLOCK* input, BLOCK* output)
{
    int loopIndex = 0;
    if(NULL != input && NULL != output)
    {
        unsigned char carry = 0;
        carry = ((unsigned char*)input)[0] >> 7;
        for(loopIndex = 0; loopIndex < BLOCK_LEN - 1; ++loopIndex)
        {
            ((unsigned char*)output)[loopIndex] = (((unsigned char*)input)
                [loopIndex] << 1) | (((unsigned char*)input)[loopIndex + 1] >> 7);
        }
        // doubling in GF(2^128): fold the carried-out bit back in by
        // xoring 0x87 into the last byte
        ((unsigned char*)output)[BLOCK_LEN - 1] = (((unsigned char*)input)
            [BLOCK_LEN - 1] << 1) ^ (carry * 0x87);
        return true;
    }
    return false;
}
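The doubling operation can be checked in isolation. Below is a self-contained version of the same GF(2^128) doubling, written against std::array rather than the project's BLOCK typedef, together with the derived tripling times3(a) = times2(a) xor a used by the pmac routine:

```cpp
#include <array>
#include <cstdint>

using Block = std::array<std::uint8_t, 16>;

// Doubling in GF(2^128): shift the 128-bit block left by one bit and,
// if a bit carried out of the top, xor 0x87 (the low byte of the field
// polynomial) into the last byte.
Block times2(const Block& in) {
    Block out{};
    std::uint8_t carry = in[0] >> 7;
    for (int i = 0; i < 15; ++i)
        out[i] = static_cast<std::uint8_t>((in[i] << 1) | (in[i + 1] >> 7));
    out[15] = static_cast<std::uint8_t>((in[15] << 1) ^ (carry * 0x87));
    return out;
}

// Tripling is defined in terms of doubling: 3*a = 2*a xor a.
Block times3(const Block& in) {
    Block d = times2(in);
    for (int i = 0; i < 16; ++i) d[i] ^= in[i];
    return d;
}
```
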


void OCB_With_TBB::Initialize(BLOCK* pKey, BLOCK* pNonce, BLOCK* pPlainText,
    unsigned int lenPlainText, BLOCK* pHeader, unsigned int lenHeader)
{
    BLOCK temp;
    unsigned int loopIndex = 0;
    unsigned int numPlainTextBlocks = ceil((double)lenPlainText /
        (double)BLOCK_LEN);
    unsigned int numHeaderBlocks = ceil((double)lenHeader /
        (double)BLOCK_LEN);
    nextBlockNum = 0;
    hSemaphore = CreateSemaphore(NULL, 1, 1, NULL);
    if(NULL != pKey && NULL != pNonce && NULL != pPlainText)
    {
        this->lenPlainText = lenPlainText;
        this->lenHeader = lenHeader;
        memcpy(&key, pKey, sizeof(key));
        memcpy(&nonce, pNonce, sizeof(nonce));
        this->pPlainText = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pCipherText = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pOffset = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pHeader = (BLOCK*)calloc(numHeaderBlocks, BLOCK_LEN);
        memcpy(this->pPlainText, pPlainText, lenPlainText);
        memcpy(this->pHeader, pHeader, lenHeader);
        memset(&authTag, 0, sizeof(authTag));
        // Initializing AES
        objAES.MakeKey((const char*)&key, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
            sizeof(key), BLOCK_LEN);
        // Initializing Offset


        memset(temp, 0, sizeof(temp));
        memset(currentOffset, 0, sizeof(currentOffset));
        objAES.EncryptBlock((const char*)&nonce, (char*)&currentOffset);
        memset(&checksum, 0, sizeof(checksum));
        // Now pre-calculating offset and checksum values
        if(1 < numPlainTextBlocks)
        {
            times2(&currentOffset, &temp);
            memcpy(pOffset, &temp, BLOCK_LEN);
        }
        xorBlock(&temp, &checksum, pPlainText);
        memcpy(&checksum, &temp, BLOCK_LEN);
        for(loopIndex = 1; loopIndex < numPlainTextBlocks - 1; ++loopIndex)
        {
            times2(&(pOffset[loopIndex - 1]), &temp);
            memcpy(&(pOffset[loopIndex]), &temp, BLOCK_LEN);
            xorBlock(&temp, &checksum, &(pPlainText[loopIndex]));
            memcpy(&checksum, &temp, sizeof(checksum));
        }
        memcpy(&currentOffset, &(pOffset[loopIndex - 1]), BLOCK_LEN);
    }
}


void OCB_With_TBB::OCBEncrypt()
{
    unsigned int bitLength = 0;
    unsigned int loopIndex = 0, numBlocks = 0, numPlainTextBlocks = 0;
    BLOCK tempBlock1, tempBlock2, pad;
    numPlainTextBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    long int startTime, endTime;
    double cpuCycles;
    startTime = clock();
    parallel_for(blocked_range<size_t>(0, numPlainTextBlocks),
        encryptBlockParallel(this), auto_partitioner());
    endTime = clock();
    cpuCycles = ((double)(endTime - startTime) / (double)CLOCKS_PER_SEC) *
        PROCESSOR_FREQUENCY;
    cpuCycles = cpuCycles / (double)100000; // loop count
    cpuCycles = (cpuCycles / (double)(numPlainTextBlocks * 16));
    // Now processing last block
    numBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    times2(&currentOffset, &tempBlock1);
    memcpy(&currentOffset, &tempBlock1, BLOCK_LEN);
    if(1 < numBlocks)
    {
        memcpy(&(pOffset[numBlocks - 1]), currentOffset, BLOCK_LEN);
    }
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    //numPlainTextBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    if(1 < numPlainTextBlocks)
    {
        bitLength = ((this->lenPlainText) - ((numPlainTextBlocks - 1) *
            BLOCK_LEN)) * 8;
    }


    else
    {
        bitLength = this->lenPlainText * 8;
    }
    for(loopIndex = 0; loopIndex < sizeof(bitLength); ++loopIndex)
    {
        // following line is specific to a little endian machine
        tempBlock1[sizeof(tempBlock1) - sizeof(bitLength) + loopIndex] |=
            ((unsigned char*)(&bitLength))[(sizeof(bitLength) - loopIndex) - 1];
    }
    xorBlock(&tempBlock2, &tempBlock1, &currentOffset);
    (this->getAESObject())->EncryptBlock((const char*)&tempBlock2,
        (char*)&pad);
    for(loopIndex = 0; loopIndex < ceil((double)bitLength / 8); ++loopIndex)
    {
        ((unsigned char*)(&(pCipherText[numBlocks - 1])))[loopIndex] =
            ((unsigned char*)(&(pPlainText[numBlocks - 1])))[loopIndex] ^
            ((unsigned char*)(&pad))[loopIndex];
    }
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memcpy(&tempBlock1, &(pPlainText[numBlocks - 1]), (bitLength / 8));
    memcpy(&((unsigned char*)&tempBlock1)[(bitLength / 8)], &((unsigned char*)
        &pad)[(bitLength / 8)], (BLOCK_LEN - (bitLength / 8)));
    xorBlock(&tempBlock2, &checksum, &tempBlock1);
    memcpy(&checksum, &tempBlock2, BLOCK_LEN);
    // Computing authentication tag
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memset(&tempBlock2, 0, sizeof(tempBlock2));
    times3(&currentOffset, &tempBlock1);
    xorBlock(&tempBlock2, &checksum, &tempBlock1);
    (getAESObject())->EncryptBlock((const char*)&tempBlock2, (char*)&authTag);


    if(lenHeader > 0)
    {
        pmac(&tempBlock1);
        xorBlock(&tempBlock2, &authTag, &tempBlock1);
        memcpy(&authTag, &tempBlock2, sizeof(BLOCK));
    }
}

void OCB_With_TBB::pmac(BLOCK* result)
{
    unsigned int numHeaderBlocks, loopIndex;
    BLOCK offset, checksum, tempBlock1, tempBlock2;
    numHeaderBlocks = ceil((double)lenHeader / (double)BLOCK_LEN);
    memset(&offset, 0, sizeof(offset));
    memset(&checksum, 0, sizeof(checksum));
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memset(&tempBlock2, 0, sizeof(tempBlock2));
    objAES.EncryptBlock((const char*)&tempBlock1, (char*)&offset);
    times3(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));
    times3(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));
    for(loopIndex = 0; loopIndex < numHeaderBlocks - 1; ++loopIndex)
    {
        times2(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, sizeof(offset));
        xorBlock(&tempBlock1, &(pHeader[loopIndex]), &offset);
        objAES.EncryptBlock((const char*)&tempBlock1, (char*)&tempBlock2);
        xorBlock(&tempBlock1, &checksum, &tempBlock2);
        memcpy(&checksum, &tempBlock1, sizeof(checksum));
    }


    // Now processing last block
    times2(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));
    if(0 == (lenHeader % BLOCK_LEN))
    {
        // final header block is full: xor it into the checksum as-is
        xorBlock(&tempBlock1, &checksum, &pHeader[numHeaderBlocks - 1]);
        memcpy(&checksum, &tempBlock1, BLOCK_LEN);
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, sizeof(offset));
    }
    else
    {
        // final header block is partial: pad it with 0x80 before xoring
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, BLOCK_LEN);
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, BLOCK_LEN);
        memset(&tempBlock1, 0, BLOCK_LEN);
        // assuming lenHeader in bytes and not in bits
        memcpy(&tempBlock1, &(pHeader[numHeaderBlocks - 1]), BLOCK_LEN);
        ((unsigned char*)&tempBlock1)[lenHeader % BLOCK_LEN] = 0x80;
        xorBlock(&tempBlock2, &checksum, &tempBlock1);
        memcpy(&checksum, &tempBlock2, BLOCK_LEN);
    }
    xorBlock(&tempBlock1, &offset, &checksum);
    objAES.EncryptBlock((const char*)&tempBlock1, (char*)result);
}


Chapter 5

RESULTS

5.1 Experiments

Experiment A : The first experiment was carried out on a machine with a 2.67 GHz Intel Xeon processor, 6 GB of RAM and 8 MB of cache memory. There was also a need to compare results between a machine with a 2-core processor and a machine with a 4-core processor. I used a setting in the Windows operating system to change the number of visible processor cores; using that, the OS can be configured to see 1, 2 or 4 cores. The advantage of this approach is that all the results are comparable, as they are taken on the same physical machine and only the number of visible cores changes. All the results are expressed in Cycles Per Byte (CPB). Below is a chart based on the numbers I collected from the experiment [Figure 3]. The results clearly indicate that performance really does improve when the number of visible cores is changed from 1 to 2 to 4. If we compare the performance for the same number of cores with and without TBB, there is not much difference in the numbers. But the major point to note here is that the "without TBB" implementation needs to change in order to run optimally on a 1-core versus a 2-core versus a 4-core machine: we need to change the number of worker threads and also divide the data range to be processed. The "with TBB" implementation needs absolutely no change. Once


compiled, the same code works on all three machine configurations and gives optimum performance.

[Figure: line chart of CPB (Cycles Per Byte), y axis 0-40, versus number of blocks (1-248), with one line each for Without TBB - 1 Thread, Without TBB - 2 Threads, Without TBB - 4 Threads, With TBB - 2 Cores and With TBB - 4 Cores.]

Figure 3 : CPB Comparison at Different Processor Cores

Note : The corresponding CPB numbers are listed in table 1 at the end of the chapter.


Experiment B : This experiment was carried out on a machine with an Intel Core 2 Quad 2.66 GHz processor, 2 GB of RAM and 6 MB of cache memory. Its purpose was to compare the performance of OCB execution at different block sizes; the block sizes compared were 16, 24 and 32 bytes. The chart in figure 4 is based on the results of this experiment. The results indicate that the performance for a 16-byte block length is nearly the same in both cases, and with a 32-byte block length TBB performs slightly better. But there is a striking difference in performance when the block length is kept at 24 bytes. A 24-byte block is not aligned to the word boundary in memory, so the obvious suspect is an unfortunate division of the data range by TBB: if the range is divided such that two adjacent blocks are given to two different worker threads, there will be many cache misses during the entire run, with data bouncing back and forth between the caches of different processor cores. To eliminate this suspicion, I changed the way parallel_for is called in the implementation and divided the range statically instead of relying on the auto_partitioner object provided by TBB. To my surprise, the results did not change, which indicates that the range division done by TBB is not the culprit. The only remaining possibility is that the task scheduler, which assigns tasks to the available worker threads, causes too much overhead and thus a significant drop in performance.


[Figure: line chart of CPB (Cycles per Byte), y axis 0-900, versus number of blocks (1-64), with one line for each combination of block length (16, 24 and 32 bytes) with and without TBB.]

Figure 4 : CPB Comparison at Different Block Lengths

Note : The corresponding CPB numbers are listed in table 2 at the end of the chapter.


Experiment C : The third experiment was carried out on a machine with an Intel Core 2 Quad 2.66 GHz processor, 2 GB of RAM and 6 MB of cache memory. Its purpose was to observe TBB performance at various chunk sizes on a multi-core versus a single-core machine. In this experiment, I changed the partitioner parameter in the parallel_for call: instead of using auto_partitioner, I statically divided the range into chunks of 1, 2, 4, 8, 16, 32, 64, 128 and 256 blocks (1 block = 16 bytes). For a chunk size of n blocks, each TBB worker thread therefore works on n blocks before asking for more work. The results of the experiment are displayed in figure 5. They indicate that increasing the chunk size on a multicore processor really does improve overall performance, but it does not improve performance on a single-core processor. The reason is quite clear: on a multicore machine there is more than one active thread running at the same time to take advantage of the increased chunk size, while on a single-core machine only one thread is active at a time. So even though the chunk size is increased, there is only one active thread available to take real advantage of it; hence the line for the single-core machine in the chart is almost horizontal. The results can also be read the opposite way: on a multicore machine, performance degrades rapidly as the chunk size decreases, mainly because more active threads can simultaneously ask for a new task, and simultaneous calls to the task scheduler inevitably impose some performance penalty.
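The trade-off can be made concrete by counting tasks: with n blocks and a chunk size of g blocks, the scheduler hands out ceil(n/g) tasks, so halving the chunk size doubles the number of trips to the task scheduler. A small hypothetical helper (not part of the project's code):

```cpp
#include <cstddef>

// Number of tasks the scheduler must hand out when an n-block range is
// split into fixed chunks of g blocks (the last chunk may be smaller).
// More tasks means more scheduler interactions - the overhead that
// dominates the small-chunk multicore runs in Experiment C.
std::size_t numTasks(std::size_t n, std::size_t g) {
    return (n + g - 1) / g;   // ceil(n / g)
}
```
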


[Figure: line chart of CPB (Cycles per Byte), y axis 0-80, versus chunk size in blocks (1-256), with one line for 1 CPU core with TBB and one for 4 CPU cores with TBB.]

Figure 5 : CPB Comparison at Different Chunk Sizes

Note : The corresponding CPB numbers are listed in table 3 at the end of the chapter.


5.2 Conclusion

Based on the above experiments, I can say that in most cases TBB performs optimally or nearly optimally. The major advantage of TBB is that we do not need to recompile the code for different platforms: once the code is compiled with TBB, it can be executed on any machine and we can expect better utilization of the available resources. But there are certain cases where TBB does not perform well, for example OCB execution with the block length set to 24 bytes; in this case TBB performs worse than the implementation with native threads. So if we know the size of the input data in advance, it is better to compare the performance of both methods, and if TBB is no worse than the alternative, we should deploy the application developed using TBB.


Table 1 : CPB Comparison at Different Processor Cores (Experiment A)

Num. of   Without TBB       Without TBB       Without TBB       With TBB    With TBB
Blocks    1 Thread          2 Threads         4 Threads         2-Core      4-Core
          (4-core system)   (4-core system)   (4-core system)   System      System

1 25.031250 26.700000 25.031250 26.800000 25.125000
2 25.865625 12.515625 13.350000 13.400000 13.400000
3 26.143750 17.800000 8.343750 17.308333 17.308333
4 32.540625 12.932813 6.675000 12.981250 6.700000
5 26.032500 20.692500 15.686250 21.105000 10.385000
6 26.143750 17.521875 13.071875 13.120833 13.120833
7 29.799107 14.780357 7.390179 18.664286 7.417857
8 25.865625 16.270312 9.803906 12.981250 9.840625
9 28.925000 17.429167 8.529167 20.286111 8.561111
10 26.032500 13.016250 10.513125 13.065000 7.872500
11 28.520455 16.535795 9.405682 16.597727 9.593182
12 28.090625 12.932812 6.535938 15.354167 8.654167
13 28.112019 16.045673 10.012500 15.976923 7.988462
14 25.984821 14.899554 9.297321 14.955357 9.332143
15 27.812500 13.906250 8.677500 15.745000 7.035000
16 27.638672 14.601563 6.466406 14.656250 8.165625
17 27.583456 15.313235 9.227206 15.370588 9.163235
18 28.925000 14.462500 8.714583 14.516667 8.747222
19 27.402632 15.106579 6.850658 15.163158 8.286842
20 27.284062 13.016250 7.759688 14.321250 6.532500
21 27.335714 14.939286 8.661607 14.915476 8.694048
22 28.368750 12.970739 8.343750 14.313636 8.298864
23 27.135326 15.889402 7.908424 14.710870 7.938043
24 27.117188 14.045313 7.578906 13.120833 6.560417
25 28.168500 14.618250 8.343750 15.678000 8.375000
26 27.983654 13.991827 8.022837 14.044231 7.988462
27 27.009028 15.451389 7.663889 14.516667 7.754630
28 27.891964 13.945982 7.449777 13.998214 7.477679
29 27.850862 14.328233 8.113578 14.439655 8.086207
30 27.756875 14.796250 7.787500 14.795833 7.872500
31 26.861492 14.265121 7.536290 14.318548 6.754032
32 26.856445 14.653711 7.352930 14.708594 7.328125


Num. of<br />

Blocks<br />

Without<br />

TBB - 1<br />

Thread - 4<br />

Core System<br />

Without<br />

TBB - 2<br />

Threads - 4<br />

Core System<br />

Without<br />

TBB - 4<br />

Threads - 4<br />

Core System<br />

With TBB -<br />

2 Core<br />

System<br />

With TBB -<br />

4 Core<br />

System<br />

33 28.368750 14.159091 7.888636 14.262879 7.918182<br />

34 27.583456 13.791728 7.656618 13.794118 7.685294<br />

35 27.510536 14.875714 7.437857 14.931429 8.231429<br />

36 27.488021 14.462500 7.231250 14.516667 7.258333<br />

37 27.421622 14.071622 7.712331 14.124324 7.741216<br />

38 28.105263 14.403947 8.255921 14.457895 7.581579<br />

39 27.341827 13.991827 7.316827 14.731410 7.344231<br />

40 27.367500 14.351250 7.175625 13.735000 7.202500<br />

41 27.921037 14.571037 7.611128 14.666463 7.639634<br />

42 27.891964 13.667857 7.429911 14.277381 7.457738<br />

43 27.243314 14.514244 7.257122 14.607558 7.907558<br />

44 27.799858 14.184375 7.699006 13.666477 7.118750<br />

45 27.775417 14.462500 7.527917 14.516667 7.556111<br />

46 27.171603 14.148098 7.364266 14.201087 6.809239<br />

47 27.694149 14.415160 7.740160 14.433511 6.664362<br />

48 27.638672 14.114844 7.057422 14.167708 7.642187<br />

49 27.653571 14.337628 7.969133 19.723980 7.452041<br />

50 27.567750 14.050875 7.275750 14.639500 7.303000<br />

51 28.074265 14.298897 7.656618 14.352451 7.192647<br />

52 27.534375 14.023918 7.028005 14.559615 7.022115<br />

53 27.518632 14.231604 7.839976 14.284906 7.900943<br />

54 27.967014 14.462500 7.725694 14.051389 7.258333<br />

55 27.458523 14.199545 7.099773 14.709545 7.126364<br />

56 27.415179 13.945982 6.972991 13.998214 7.447768<br />

57 27.871053 14.169737 7.758224 14.222807 7.346491<br />

58 27.361746 13.896659 7.192888 15.768103 7.653017<br />

59 27.803072 14.566208 7.495233 14.620763 7.523305<br />

60 27.756875 13.878437 6.953125 13.930417 6.979167<br />

61 27.329201 14.088627 7.249488 20.978689 7.715984<br />

62 27.722782 14.265121 7.563206 23.612097 7.159274<br />

63 27.680060 14.065179 7.019345 20.312698 7.471032<br />

64 27.664746 13.819336 6.909668 17.168750 7.354297<br />

65 27.624231 14.428269 7.599231 14.868846 7.215385<br />


66 27.610227 13.805114 7.104830 14.237500 7.537500<br />

67 27.596642 14.371175 7.397295 14.450000 7.025000<br />

68 27.927022 13.791728 6.871324 14.212868 7.291176<br />

69 27.546467 14.317391 7.545652 14.395290 7.185507<br />

70 27.534375 14.136696 7.080268 14.165714 7.465714<br />

71 27.851673 13.937588 7.333099 14.367254 7.360563<br />

72 27.488021 14.091667 7.231250 14.144444 7.258333<br />

73 27.820120 14.264384 7.132192 13.973630 7.158904<br />

74 27.421622 13.733361 7.374071 14.463851 7.424324<br />

75 27.768000 14.217750 7.298000 14.293333 7.325333<br />

76 27.753947 14.052632 6.850658 14.435855 6.876316<br />

77 27.718588 14.195211 7.433523 19.012338 7.461364<br />

78 27.363221 14.013221 7.338221 14.387821 7.365705<br />

79 27.692801 14.173813 6.928481 15.223418 7.272468<br />

80 27.659531 13.996641 7.154766 14.363125 7.181563<br />

81 27.627083 14.132870 7.396065 14.516667 7.113580<br />

82 27.941387 13.980869 7.305869 14.033232 7.312805<br />

83 27.604744 14.114006 7.217846 14.489759 7.244880<br />

84 27.574107 13.945982 7.112054 13.998214 7.158631<br />

85 27.563824 14.076397 7.362132 16.277059 7.370000<br />

86 27.553779 13.932122 7.257122 14.295930 6.992151<br />

87 27.831681 14.059698 7.192888 14.709195 7.508621<br />

88 27.515412 13.918892 6.788778 14.256534 8.013352<br />

89 27.787500 14.325000 7.312500 14.096348 7.641011<br />

90 27.478750 13.887708 7.527917 14.218889 7.556111<br />

91 27.745261 14.303571 7.151786 14.357143 7.178571<br />

92 27.733899 13.857880 6.783832 14.201087 7.100543<br />

93 27.704839 13.995968 7.572177 14.336559 7.294355<br />

94 27.427859 14.131117 7.189827 14.166223 7.234574<br />

95 27.666118 13.964803 7.131711 14.299211 7.422895<br />

96 27.656055 14.114844 6.779297 14.167708 7.083854<br />

97 27.646198 13.952126 7.242719 14.263402 7.269845<br />

98 27.636543 14.082207 7.441263 14.134949 7.195663<br />


99 27.610227 13.923106 7.096402 14.262879 7.122980<br />

100 27.851438 14.067562 7.025438 14.371500 7.051750<br />

101 27.575681 14.176114 7.220235 13.963861 7.512624<br />

102 27.567096 13.775368 7.149449 14.352451 7.176225<br />

103 27.801699 14.160073 7.080036 14.196845 7.090291<br />

104 27.534375 14.007873 6.995913 17.345913 7.038221<br />

105 27.526429 13.890357 7.199464 16.909524 7.226429<br />

106 27.754776 13.995460 7.115802 14.300708 7.395283<br />

107 27.495386 14.114194 7.064895 14.167056 7.326168<br />

108 27.719792 13.983507 6.999479 14.035880 7.010185<br />

109 27.695126 14.084862 7.164908 14.383486 7.191743<br />

110 27.701250 13.971989 7.099773 14.009091 7.126364<br />

111 27.662162 14.071622 7.035811 14.365766 7.062162<br />

112 27.668471 13.945982 6.972991 13.998214 6.999107<br />

113 27.645133 14.044082 7.369082 14.333850 7.159513<br />

114 27.622204 13.935526 7.070230 14.208114 7.331798<br />

115 27.628696 14.032011 7.023261 14.317609 7.049565<br />

116 27.591918 14.141218 6.962716 13.963147 7.205388<br />

117 27.598558 14.020353 7.331090 14.301923 7.143803<br />

118 27.576801 13.887394 7.070975 14.166525 7.310381<br />

119 27.569433 14.009086 6.997532 14.272689 8.783193<br />

120 27.770781 13.878437 7.161719 13.944375 7.844583<br />

121 27.541271 14.205062 7.102531 14.244421 8.859504<br />

122 27.739549 13.869775 7.030635 14.141393 7.702254<br />

123 27.717530 13.974085 6.987043 14.230691 9.137602<br />

124 27.507460 14.063256 6.930696 14.115927 7.159274<br />

125 27.701250 13.950750 7.289100 14.217400 7.316400<br />

126 27.680060 14.051935 7.019345 14.104563 7.471032<br />

127 27.672343 13.941289 7.174311 14.402362 7.201181<br />

128 27.664746 13.819336 6.909668 14.080469 7.144922<br />

129 27.644331 14.126163 7.063081 16.412403 7.089535<br />

130 27.637067 14.017500 7.214135 14.070000 7.241154<br />

131 27.617176 13.910496 6.955248 14.154389 7.173092<br />


132 27.610227 14.007386 7.104830 14.262879 7.131439<br />

133 27.603383 14.090273 7.038863 14.331955 7.468233<br />

134 27.783442 13.798321 7.185588 14.050000 7.400000<br />

135 27.565278 14.079306 6.946944 14.119630 6.972963<br />

136 27.571186 13.963511 7.079917 14.225184 7.303493<br />

137 27.747536 14.068659 7.223130 15.062774 7.054562<br />

138 27.534375 13.954620 7.170788 21.775000 7.197645<br />

139 27.720459 14.046313 6.927113 15.978777 7.145863<br />

140 27.701250 13.945982 7.068348 14.369107 7.082857<br />

141 27.516622 14.036436 7.195745 14.279078 7.234574<br />

142 27.675396 13.925836 7.156822 14.166725 7.171831<br />

143 27.680245 14.026836 6.908392 14.255070 7.859615<br />

144 27.650260 13.917839 7.057422 14.330556 7.258333<br />

145 27.655216 14.005991 7.181379 14.416552 7.208276<br />

146 27.637243 13.898630 7.132192 14.145719 7.342466<br />

147 27.619515 13.997066 7.083673 14.220408 7.281122<br />

148 27.624578 14.071622 7.035811 14.294088 7.243243<br />

149 27.595973 13.977181 7.156586 14.209396 7.183389<br />

150 27.779125 13.884000 7.120000 14.103500 8.196333<br />

151 27.584106 14.134644 7.072848 14.198675 7.088245<br />

152 27.567311 13.876974 7.015337 14.094243 7.217928<br />

153 27.736152 13.949877 7.143995 14.177288 7.181699<br />

154 27.556047 14.032670 7.108442 14.248377 7.124188<br />

155 27.712016 13.942137 7.051815 14.156452 7.251129<br />

156 27.534375 14.013221 7.006611 14.237500 7.698558<br />

157 27.688495 13.923965 7.132046 14.146815 7.158758<br />

158 27.682239 14.004826 7.086907 14.057278 7.113449<br />

159 27.665566 14.084670 7.031840 14.305975 7.226730<br />

160 27.659531 13.996641 6.998320 14.038594 7.024531<br />

161 27.653571 13.899340 7.120691 14.284317 7.136957<br />

162 27.637384 13.978356 7.231250 14.030710 7.423765<br />

163 27.631633 14.056403 7.023083 14.273466 7.213804<br />

164 27.778582 13.807889 6.980259 14.012805 7.169817<br />


165 27.610227 14.047841 7.099773 14.252727 7.126364<br />

166 27.755535 13.953163 7.057003 14.015512 7.244880<br />

167 27.589334 14.029491 7.014746 15.335778 7.191467<br />

168 27.742969 13.945982 6.972991 17.418006 7.308185<br />

169 27.568935 14.021450 7.089719 17.314941 7.274852<br />

170 27.720882 13.929154 7.048015 15.370588 7.833088<br />

171 27.549013 14.013596 6.997039 14.213012 7.033041<br />

172 27.699310 13.922420 6.966061 14.130378 7.138227<br />

173 27.693533 13.996279 7.070484 14.048699 7.251879<br />

174 27.678233 13.915841 7.029849 14.112356 7.056178<br />

175 27.663107 14.131929 6.989679 15.974714 7.915571<br />

176 27.667116 13.899929 6.949964 14.104261 7.128267<br />

177 27.793644 13.972246 7.061547 14.175989 7.087994<br />

178 27.637500 13.893750 7.171875 18.933146 7.189326<br />

179 27.632263 14.105133 6.973324 18.537291 7.879050<br />

180 27.627083 13.887708 6.943854 14.088611 7.407222<br />

181 27.760256 13.949275 7.191298 14.140331 7.218232<br />

182 27.607727 14.019334 7.014251 14.219093 7.325824<br />

183 27.739549 13.942725 7.112705 14.132240 7.139344<br />

184 27.588791 14.002989 6.928940 14.201087 6.954891<br />

185 27.719291 13.927297 7.180135 14.124324 8.049054<br />

186 27.570262 13.995968 6.997984 14.192473 7.024194<br />

187 27.708389 14.063904 7.094418 14.250936 7.129947<br />

188 27.694149 13.847074 7.065559 14.041489 7.225665<br />

189 27.680060 14.047520 7.019345 14.241931 7.329233<br />

190 27.674901 13.973586 7.131711 14.158158 7.563947<br />

191 27.538743 13.909162 6.945844 14.233115 7.936518<br />

192 27.795117 13.958398 7.048730 14.019401 7.075130<br />

193 27.651101 14.033063 7.020855 14.215803 7.177332<br />

194 27.637597 13.952126 7.105090 14.142526 7.131701<br />

195 27.641346 13.889135 7.077212 14.198846 7.103718<br />

196 27.619515 13.945982 6.904879 14.134949 7.067474<br />

197 27.623319 14.002253 7.140895 14.190736 7.159137<br />


198 27.736648 13.939962 6.969981 14.119066 7.258333<br />

199 27.605653 13.995697 7.060741 14.182789 7.095603<br />

200 27.592781 13.925719 7.025438 14.111875 7.051750<br />

201 27.588340 13.989272 6.998787 14.166667 7.150000<br />

202 27.707859 14.052197 7.088057 14.104827 7.371658<br />

203 27.694674 13.974754 7.053140 14.282882 7.079557<br />

204 27.567096 13.906250 7.018566 14.089706 7.044853<br />

205 27.684970 14.098902 6.984329 14.151707 7.141220<br />

206 27.672087 13.900850 7.071936 14.204976 7.098422<br />

207 27.667391 13.962681 7.045833 14.136353 7.072222<br />

208 27.662740 14.015895 6.883594 14.197236 7.038221<br />

209 27.650150 13.948834 7.098176 14.129306 7.244976<br />

210 27.645625 14.009554 7.072321 14.181667 7.218452<br />

211 27.759775 13.943158 7.030895 14.122393 7.930450<br />

212 27.628833 13.995460 6.997730 14.174292 7.268868<br />

213 27.624472 13.937588 7.090229 15.578286 7.360563<br />

214 27.729322 13.989428 7.057097 14.284463 7.083528<br />

215 27.608110 14.040785 7.016512 14.101163 7.050581<br />

216 27.603906 13.859896 6.991753 14.152199 7.134259<br />

217 27.707402 14.034418 7.082575 14.210484 7.101382<br />

218 27.702781 13.977695 7.042431 14.145298 7.191743<br />

219 27.583904 14.020548 7.010274 14.195434 7.044178<br />

220 27.686080 13.964403 6.985994 14.130909 7.126364<br />

221 27.681618 14.022031 7.067647 14.188235 7.094118<br />

222 27.677196 13.951351 7.035811 14.124324 8.005293<br />

223 27.665331 14.008520 7.116508 14.181166 7.030493<br />

224 27.661021 13.945982 6.972991 14.117857 7.111272<br />

225 27.649333 14.002667 7.409250 14.166778 7.198778<br />

226 27.645133 14.051466 7.022041 14.104093 7.055752<br />

227 27.751239 13.989565 7.116079 14.160022 7.135352<br />

228 27.636842 13.928207 6.960444 14.097917 7.339145<br />

229 27.618177 13.983979 7.163237 14.263100 7.073035<br />

230 27.737527 14.039266 7.016005 14.084565 7.158804<br />


231 27.610227 13.971266 7.101218 14.255628 7.120563<br />

232 27.714197 14.026131 6.955523 14.078664 7.097091<br />

233 27.709844 13.965933 7.154855 14.241094 7.181652<br />

234 27.698397 14.020353 7.117147 14.072863 7.143803<br />

235 27.580532 14.067207 6.980346 14.226809 7.227447<br />

236 27.689936 13.894465 7.056833 14.173623 7.196822<br />

237 27.678718 14.061155 7.034098 14.226899 7.053376<br />

238 27.674606 14.002075 7.109716 14.047479 7.136345<br />

239 27.670528 13.943488 7.079969 14.212971 7.113494<br />

240 27.659531 13.989688 7.050469 14.160729 7.076875<br />

241 27.648626 14.042427 7.125078 14.199274 7.151763<br />

242 27.755036 13.984401 7.102531 14.036777 7.129132<br />

243 27.640818 14.036728 7.073302 14.192695 7.092901<br />

244 27.630123 13.869775 7.037474 14.031557 7.180533<br />

245 27.626327 14.024311 7.117730 14.398163 7.144388<br />

246 27.731098 13.967302 6.987043 16.246138 7.224289<br />

247 27.612070 14.018851 7.060096 18.730162 7.086538<br />

248 27.709325 13.854662 7.031628 14.115927 7.057964<br />

249 27.604744 14.114006 7.110617 14.059237 7.238153<br />

250 27.694575 13.950750 7.082175 14.217400 7.316400<br />

251 27.690613 14.001544 7.153685 14.053984 7.080378<br />

252 27.587351 13.945982 7.025967 14.097917 7.052282<br />

253 27.676186 13.996393 7.097134 14.148123 7.123715<br />

254 27.777461 13.941289 7.174311 14.197933 7.306693<br />

255 27.661985 14.082941 7.048015 14.142255 7.172941<br />

256 27.658228 14.034448 7.320337 14.191699 7.144922<br />



Table 2: CPB Comparison at Different Block Lengths (Experiment B)<br />
Num. of Blocks | Block Length = 16: With TBB, Without TBB | Block Length = 24: With TBB, Without TBB | Block Length = 32: With TBB, Without TBB<br />

1 78.1375 51.5375 172.9000 172.9000 194.5125 194.5125<br />

2 39.0688 25.7688 346.3542 251.0375 370.3219 493.7625<br />

3 26.0458 26.0458 496.5333 282.9944 554.1667 640.6167<br />

4 19.5344 19.5344 593.2354 337.4875 714.2516 824.8078<br />

5 30.9225 30.9225 512.4933 270.2117 610.4700 563.5875<br />

6 26.0458 30.4792 513.7125 285.7653 614.8479 653.7781<br />

7 25.8875 22.0875 554.1667 314.1333 649.4438 749.6688<br />

8 19.5344 22.8594 597.5302 335.5479 714.3555 823.1453<br />

9 26.0458 25.8611 548.3787 275.2361 656.5951 650.8688<br />

10 23.2750 23.4413 550.7308 303.0183 662.4231 718.2831<br />

11 21.3102 18.8920 571.4970 325.8500 677.7710 774.5739<br />

12 21.6125 19.5344 588.8021 336.2868 706.8396 822.5911<br />

13 24.0423 25.9606 562.0955 299.6763 672.3534 700.3601<br />

14 22.2063 22.2063 570.3167 311.7583 667.9688 749.6094<br />

15 20.8367 20.8367 574.9294 326.7367 687.4992 787.9696<br />

16 21.0930 17.8719 589.8411 338.8036 716.7973 823.0934<br />

17 21.4169 22.8838 568.4446 308.6382 688.3728 731.1577<br />

18 20.2271 20.2271 581.1361 319.4463 676.1295 769.2295<br />

19 20.4750 19.1625 576.9750 332.6750 693.1750 797.7375<br />

20 19.5344 18.2044 592.2379 337.7092 713.0463 818.9059<br />

21 23.5125 21.0583 572.3222 311.7583 694.5688 748.3625<br />

22 21.2347 20.0256 580.1117 321.9205 683.0608 772.8358<br />

23 21.4679 19.2272 580.5257 329.0304 694.5997 800.2046<br />

24 19.4651 19.4651 593.1431 334.0701 715.4292 814.4518<br />

25 21.8120 20.8145 576.3333 316.5843 694.0938 761.6578<br />

26 20.9731 18.9269 580.8093 322.3971 688.3709 787.2897<br />

27 20.1963 18.2875 582.4086 329.7086 697.5419 805.7583<br />

28 19.5344 18.5250 590.6229 337.6854 714.3406 818.2766<br />

29 20.5806 19.7207 578.0532 317.6713 697.7914 764.5207<br />

30 20.7813 19.0633 585.9389 327.3278 692.7083 779.7125<br />

31 20.1109 19.3065 583.7699 332.3927 696.3462 746.6234<br />

32 19.4824 18.6512 592.5773 335.5133 713.5242 801.6107<br />


33 21.2598 19.6981 580.4308 323.2639 698.2248 738.7545<br />

34 20.6346 20.5857 584.7110 328.0341 696.0252 781.5950<br />

35 20.0450 18.5725 584.3450 331.4867 702.4775 802.2988<br />

36 19.4420 18.7493 592.1887 334.3472 715.4292 795.5293<br />

37 20.3993 19.6804 581.3059 322.9444 698.2051 757.8978<br />

38 20.4750 19.1188 566.9125 329.0292 691.4469 786.1219<br />

39 19.9926 19.3106 557.7190 327.6972 701.7029 781.9718<br />

40 18.8278 18.2044 553.3077 333.3867 692.2858 795.2153<br />

41 20.2744 19.6256 554.5992 320.1732 705.4880 772.0082<br />

42 19.7917 18.5646 554.1667 327.8028 702.6042 792.8938<br />

43 19.9113 18.7128 568.2657 331.4432 704.9773 804.3794<br />

44 20.1011 18.2875 587.2152 338.0920 715.2528 775.1595<br />

45 20.2086 19.0633 583.7961 334.0270 704.2535 774.6881<br />

46 19.7332 19.1910 569.5870 329.4159 701.6473 797.6386<br />

47 21.0112 18.8181 561.5477 327.9252 703.5735 806.1003<br />

48 18.9456 18.3914 567.1434 335.8943 713.5415 808.2521<br />

49 20.1196 19.0679 560.8845 327.6369 683.6098 773.7411<br />

50 19.7505 18.7198 559.3758 328.3327 700.0621 787.6094<br />

51 19.3632 18.8417 565.7064 334.1299 705.6987 749.9994<br />

52 18.9909 18.4793 580.4683 337.0399 712.1095 770.5528<br />

53 19.6050 19.1031 578.6755 327.4184 706.0292 758.2255<br />

54 19.2419 18.7801 579.8225 329.6676 703.5300 792.7662<br />

55 19.8593 17.9550 588.1824 332.1776 705.3836 781.6622<br />

56 19.0000 18.0797 594.6802 335.8448 713.1977 801.7852<br />

57 20.0667 19.1333 588.4861 329.0389 705.0167 785.2250<br />

58 19.6920 18.8321 590.5888 331.4299 704.2837 788.2543<br />

59 19.3864 18.0339 589.1073 333.4393 705.5481 794.4918<br />

60 19.4790 17.7610 593.9928 334.2364 714.5702 801.1588<br />

61 20.0045 19.1596 589.3790 327.0492 705.8403 779.5081<br />

62 19.7087 18.8506 588.5071 331.2665 705.3558 790.1970<br />

63 19.7917 18.5514 575.8759 335.6315 709.3993 761.9792<br />

64 19.4824 18.2615 595.0365 336.3445 713.5502 787.2067<br />



Table 3: CPB Comparison at Different Chunk Sizes (Experiment C)<br />
Each cell is cycles per byte (CPB); columns are chunk sizes in blocks; the final row of each section is the column average.<br />
1 CPU Core<br />
Chunk Size: 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256<br />
76.0983 75.6957 75.5009 75.5009 75.6957 75.6957 74.4748 73.8773 73.0591<br />
74.4878 74.4748 74.28 74.28 74.4878 74.6826 73.2669 72.4486 72.4486<br />
74.28 74.8904 74.28 74.28 74.267 74.4878 73.0591 72.2408 72.4486<br />
74.4748 74.4748 74.4748 74.267 74.28 74.6826 73.0591 72.4616 72.4486<br />
74.28 74.28 74.28 74.28 74.28 74.4748 73.0591 72.4486 72.4551<br />
74.6826 74.4748 74.28 74.28 74.4748 74.4748 72.8513 72.4486 72.4486<br />
74.4748 74.28 74.267 74.0722 74.28 74.4748 72.8513 72.4486 72.4486<br />
74.28 74.28 74.28 74.267 74.28 74.6826 73.0591 72.4486 72.4486<br />
74.28 74.4748 74.28 74.28 74.267 74.4878 73.0591 72.4486 72.4551<br />
74.28 74.28 74.28 74.28 74.28 74.6826 73.0591 72.2538 72.3447<br />
Avg: 74.5618 74.5605 74.4203 74.3787 74.4592 74.6826 73.1799 72.5525 72.5006<br />
4 CPU Cores (chunk sizes 128 and 256: NA)<br />
Chunk Size: 1 | 2 | 4 | 8 | 16 | 32 | 64<br />
29.6263 29.0288 28.4054 27.4053 25.3661 24.353 21.5151<br />
29.6263 29.0158 28.2105 27.6001 25.5739 24.5608 21.5086<br />
29.4314 29.0288 28.4054 27.3923 25.3661 24.5479 21.5151<br />
29.6263 28.808 28.4184 27.3923 25.5609 24.5608 21.5086<br />
29.6393 28.821 28.2105 27.6001 25.5739 24.7557 21.4112<br />
29.4185 29.0158 28.4054 27.3923 25.5739 24.7687 21.5086<br />
29.6263 28.821 28.2105 27.3923 25.3661 24.1452 21.4112<br />
29.6393 29.0158 28.2105 27.4053 25.5739 24.5608 21.5151<br />
29.4185 28.821 28.2105 27.4053 25.5739 24.5479 21.5086<br />
Avg: 29.5613 28.9307 28.2986 27.4428 25.5032 24.5334 21.4891<br />



REFERENCES<br />

[1] Definition of Encryption, www.wikipedia.org<br />
[2] Nigel Smart, "Cryptography: An Introduction."<br />
[3] The OCB Authenticated-Encryption Algorithm, http://www.cs.ucdavis.edu/~rogaway/papers/draft-krovetz-ocb-00.txt<br />
