
PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK)

USING TBB (THREADING BUILDING BLOCKS)

Parag Sheth

B.E., L. D. College of Engineering, 2006

PROJECT

Submitted in partial satisfaction of

the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL

2010


Approved by:

PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK)

USING TBB (THREADING BUILDING BLOCKS)

A Project

by

Parag Sheth

_______________________________, Committee Chair
Ted Krovetz, Ph.D.

_______________________________, Second Reader
Chung-E Wang, Ph.D.

____________________
Date


Student: Parag Sheth

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

__________________________, Graduate Coordinator    _____________________
Nikrouz Faroughi, Ph.D.                              Date

Department of Computer Science


Abstract

of

PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK)

USING TBB (THREADING BUILDING BLOCKS)

by

Parag Sheth

This project explores Intel's open source C++ library TBB (Threading Building Blocks) and uses it to analyze the performance gain obtainable for an implementation of OCB (Offset Codebook). The analysis begins by identifying the parallelizable portions of the OCB algorithm, followed by an implementation using TBB. The performance gain is then analyzed while varying several parameters of the OCB algorithm.

TBB (Threading Building Blocks) :

TBB is Intel's open source template library for C++. Its aim is to provide task-level parallelism as opposed to thread-level parallelism, which makes implementations more portable and easier to understand. Internally, the TBB library keeps a pool of worker threads. The application developer only needs to specify the parallel portions of the application; most of the remaining work is handled by the TBB library. The library determines the required number of threads for the task and schedules them on the available processor cores. TBB uses a work-stealing scheduler design to schedule its threads.

For developers, the benefits of TBB are :

1. It reduces the length of the code for a multithreaded application.

2. It relieves the programmer from handling thread management details.

3. It automatically identifies the underlying system and determines the optimal number of threads. It also automatically balances the workload between these threads and makes maximum use of all the available processor cores to achieve maximum performance.

4. Applications developed using TBB automatically become portable and scalable to machines with any number of cores.

OCB (Offset Codebook) :

OCB is a shared-key encryption and authentication scheme built from a block cipher. OCB achieves authenticated encryption in essentially the same amount of time that other modes, like CBC, need to achieve privacy alone. In other words, it takes about half the time of "conventional" modes, like CCM, to achieve privacy and authenticity combined. On top of this, OCB is a simple and highly parallelizable method which can be implemented easily in both hardware and software. It can also be proved that OCB is as secure as its underlying primitive algorithms.


Some of the key features of OCB are :

1. It can encrypt messages of any bit length; messages do not have to be a multiple of the block length.

2. Encryption and decryption depend on an n-bit nonce N, which must be a fresh value for each encryption. The nonce need not be random or secret.

3. It is an online algorithm, meaning one need not know the length of the header or message to proceed with encryption, and one need not know the length of the header or ciphertext to proceed with decryption.

4. OCB is parallelizable: the bulk of its block cipher calls may be performed simultaneously. Thus OCB is suitable for encrypting messages in hardware at the highest network speeds.

5. It needs very little memory to run.

6. It is nearly endian-neutral.

___________________________________, Committee Chair
Ted Krovetz, Ph.D.

_______________________
Date


TABLE OF CONTENTS

                                                                                        Page

List of Tables ..................................................................... ix

List of Figures .................................................................... x

Chapter

1. INTRODUCTION TO AUTHENTICATED ENCRYPTION ....................................... 1

   Security Model .................................................................. 1

      Notions of Security .......................................................... 2

      Notions of Attacks ........................................................... 3

      Security of Message Authentication Code (MAC) ................................ 4

   Authenticated Encryption ........................................................ 4

2. INTRODUCTION TO OFFSET CODEBOOK (OCB) ............................................ 7

   Overview ........................................................................ 7

   Notation and Basic Operation .................................................... 9

   OCB Parameters .................................................................. 9

   Header Authentication : PMAC ................................................... 10

   Encryption : OCB-ENCRYPT ....................................................... 11

   Decryption : OCB-DECRYPT ....................................................... 12

   Parallel Portion of the Encryption / Decryption Algorithm ...................... 13

   Security Consideration of OCB .................................................. 14

3. INTRODUCTION TO THREADING BUILDING BLOCKS (TBB) ................................. 15

   Overview ....................................................................... 15

   Task Scheduling ................................................................ 16

   TBB Provided Algorithms ........................................................ 19

   Containers ..................................................................... 20

   Scalable Memory Allocation ..................................................... 21

4. OCB IMPLEMENTATION USING TBB .................................................... 22

   Class Definition ............................................................... 22

      Class Definition : OCB_With_TBB ............................................. 23

      Class Definition : encryptBlockParallel ..................................... 24

      Class Definition : xorBlockParallel ......................................... 25

   Class Implementation ........................................................... 26

5. RESULTS ......................................................................... 34

   Experiments .................................................................... 34

   Conclusion ..................................................................... 40

References ......................................................................... 52


LIST OF TABLES

                                                                                        Page

1. Table 1 CPB Comparison at Different Processor Cores (Experiment A) .............. 41

2. Table 2 CPB Comparison at Different Block Lengths (Experiment B) ................ 49

3. Table 3 CPB Comparison at Different Chunk Sizes (Experiment C) .................. 51


LIST OF FIGURES

                                                                                        Page

1. Figure 1 Sample Task Graph ...................................................... 17

2. Figure 2 Sample Ready Pool ...................................................... 18

3. Figure 3 CPB Comparison at Different Processor Cores ............................ 35

4. Figure 4 CPB Comparison at Different Block Lengths .............................. 37

5. Figure 5 CPB Comparison at Different Chunk Sizes ................................ 39


Chapter 1

INTRODUCTION TO AUTHENTICATED ENCRYPTION

Authentication and encryption are two different objectives to be achieved while designing any secure communication system. Encryption refers to the privacy of the actual message, while authentication refers to the mechanism that can prove that the sender of the message is actually who he or she claims to be. Formally speaking, "Encryption is the process of transforming information (referred to as plaintext) using an algorithm (called a cipher) to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key." [1] The concept of authenticity is similar to the concept of a signature in the real world.

1.1 Security Model

There are various algorithms available for encryption and authentication purposes. One needs to make sure that the algorithm one is planning to use gives enough security against all the possible types of attack in a given scenario. To understand that, we need to formally define the security model. In other words, we need to formalize the different types of possible attacks on the system and the different levels of security that an algorithm can provide against those attacks. It is quite possible that some encryption algorithms are secure against a particular type of attack but are easily broken by another kind of attack.


1.1.1 Notions of Security

There are essentially three notions of security that need to be defined: perfect security, semantic security and polynomial security.

1. Perfect Security : An algorithm is said to have perfect security, or information-theoretic security, if an adversary with an infinite amount of computational power can learn nothing about the plaintext given the ciphertext. This is a very strong definition and no such algorithm is possible in the real world.

2. Semantic Security : This notion is similar to perfect security, but here an adversary is given only a polynomial amount of time. Polynomial time can be defined as t = f(|M|), where |M| is the length of the given message. In other words, an algorithm is said to have semantic security if an adversary can learn nothing about the plaintext given the ciphertext within a certain finite (polynomial) amount of time.

3. Polynomial Security : This is an extension of semantic security, and it is also provable. Here an adversary is allowed to select two messages M1 and M2 of the same length. The adversary is then given the ciphertext C_b of one of these messages, where b is a randomly chosen unknown bit. The algorithm is said to have polynomial security if the adversary cannot identify which message (M1 or M2) corresponds to the ciphertext C_b with probability significantly higher than 1/2. It can be proved that if an algorithm is polynomially secure then it is also semantically secure. Here the advantage of an adversary A is Adv(A) = | Pr( A(guess, C_b, y, M1, M2) = b ) - 1/2 |, where y is a secret key. The scheme is polynomially secure if Adv(A) <= 1 / p(k), for all adversaries A, all polynomials p, and sufficiently large k.

1.1.2 Notions of Attacks

There are mainly three different kinds of attack: the passive attack, the chosen ciphertext attack and the adaptive chosen ciphertext attack.

1. Passive Attack : This is the weakest form of attack, in which an adversary is allowed to observe only ciphertexts. The adversary also has access to an encryption black box to which he or she can submit plaintext blocks and observe the returned ciphertexts.

2. Chosen Ciphertext Attack (CCA) : Here an adversary is given access to a decryption box to which she can submit any number of ciphertexts and observe the returned plaintext messages. In the next stage she is given a challenge ciphertext and is asked to recover the plaintext, or at least some information about the plaintext. In this later stage she is not allowed to use the decryption box.

3. Adaptive Chosen Ciphertext Attack : This is a very strong type of attack. Here, in addition to all the access given in a CCA, an adversary is also allowed to use the decryption box during the challenge stage, on anything except the challenge ciphertext itself.

Based on the notions above we can say that "a public key encryption algorithm is said to be secure if it is polynomially secure against an adaptive chosen ciphertext attack." [2] A similar approach defines the security of a symmetric encryption algorithm. The actual difference between them is that a public key encryption scheme must be probabilistic, while a symmetric key encryption scheme may use a deterministic algorithm.

1.1.3 Security of Message Authentication Code (MAC)

The security of a MAC can be defined in various ways, but selective forgery is a widely used notion. In this notion an adversary is asked to choose a plaintext message M1. The MAC generator algorithm returns the MAC S1 computed under some random key K. The challenge for the adversary is then to generate another valid pair (M2, S2) where M1 ≠ M2. If the adversary succeeds in generating such a pair, this is known as a selective forgery.

1.2 Authenticated Encryption

Having discussed the security model, we are now in a position to discuss authenticated encryption. Various practices for providing privacy and authenticity have been used for years. In the traditional approach, encryption and authentication algorithms are applied one after the other to achieve data security and authenticity. These kinds of schemes are known as "generic compositions". Here the encryption and authentication algorithms can be applied in either order, and based on that the schemes are known as Encrypt-then-MAC (EtM), MAC-then-Encrypt (MtE) or Encrypt-and-MAC (E&M). One such EtM generic composition scheme can be described as follows. Suppose Bob wants to send a message to Alice, and they share a secret key K. Bob first encrypts the message using this key K and possibly a nonce N. The nonce here can be any random number or a counter value. After the ciphertext is generated, it is fed to the authentication algorithm along with the key, which produces a message authentication tag T. Bob then sends the triplet (C, N, T) to Alice. On the other end, Alice applies exactly the reverse process to retrieve the original plaintext and to verify the authenticity of the received message. There are various such schemes available, but the problem with them is that the encryption and authentication functions must be applied separately, which takes almost double the time and processing power. Sometimes designers make the mistake of using a regular hash instead of a keyed secure hash (a MAC); this approach is almost always broken. In conclusion, it is best for any generic composition scheme to use the Encrypt-then-MAC (EtM) approach with a provably secure encryption scheme and a provably secure MAC, each with independent keys.
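The EtM flow described above can be sketched in code. The sketch below is purely illustrative: `toyEncrypt` and `toyMac` are deliberately insecure stand-ins of my own (a XOR "keystream" and an additive checksum), not real primitives and not part of this project's implementation; only the ordering - encrypt, MAC the ciphertext, verify the tag before decrypting - is the point.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-ins: NOT secure, purely illustrative of the data flow.
std::string toyEncrypt(const std::string &key, uint8_t nonce, const std::string &m) {
    std::string c = m;
    for (size_t i = 0; i < c.size(); ++i)
        c[i] ^= key[i % key.size()] ^ nonce;      // "keystream" = key byte xor nonce
    return c;                                      // XOR cipher: decryption is the same call
}

uint32_t toyMac(const std::string &key, const std::string &c) {
    uint32_t tag = 0;
    for (size_t i = 0; i < c.size(); ++i)          // keyed checksum over the ciphertext
        tag = tag * 31 + (uint8_t)c[i] + (uint8_t)key[i % key.size()];
    return tag;
}

struct Triplet { std::string C; uint8_t N; uint32_t T; };

// Bob's side: encrypt first, then MAC the ciphertext (the EtM order).
Triplet sealEtM(const std::string &encKey, const std::string &macKey,
                uint8_t nonce, const std::string &msg) {
    std::string c = toyEncrypt(encKey, nonce, msg);
    return { c, nonce, toyMac(macKey, c) };
}

// Alice's side: verify the tag BEFORE decrypting; refuse on mismatch.
bool openEtM(const std::string &encKey, const std::string &macKey,
             const Triplet &t, std::string &out) {
    if (toyMac(macKey, t.C) != t.T) return false;  // reject unauthenticated ciphertext
    out = toyEncrypt(encKey, t.N, t.C);
    return true;
}
```

Note that the two algorithms use independent keys (`encKey`, `macKey`), matching the recommendation above, and that a tampered ciphertext is rejected without ever being decrypted.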

The concept of authenticated encryption (AE) is to provide a single method which can achieve data security and authenticity in a single pass, thus improving efficiency. Some researchers have pointed out that even when the individual elements of a scheme are secure, the combined scheme - if not designed properly - may lead to an insecure implementation. A properly designed AE scheme can also provide security against the chosen ciphertext attack. This is the kind of attack where an adversary can submit carefully chosen ciphertexts to the decryption oracle; by analyzing patterns in the returned plaintext, the adversary may be able to extract some information about the secret key being used. An ideal AE scheme will instead simply refuse to decrypt a message that is not properly authenticated, without giving out much information. Thus an adversary cannot submit just any ciphertext and expect its corresponding plaintext, which thwarts the CCA. Other important aspects in designing an AE scheme are efficiency, parallelizability, simplicity and portability. One such scheme, known as Offset Codebook (OCB), is described in the next chapter.

The security of any AE scheme depends on its primitive algorithms. It is very difficult to provide a proof of security for such primitives. But once it has been shown that no known attacks seem to work against a primitive, it is possible to show that schemes based on it are as secure as the underlying algorithms. In this vein, it can be proved that the OCB scheme is as secure as its underlying encryption algorithm.


Chapter 2

INTRODUCTION TO OFFSET CODEBOOK (OCB)

2.1 Overview

The Offset Codebook (OCB) mode of authenticated encryption was developed by Phillip Rogaway, with contributions credited to Mihir Bellare, John Black and Ted Krovetz. The mode is based on the IAPM scheme developed by Charanjit Jutla. The OCB scheme improves on the original IAPM scheme in several respects: 1) it minimizes the number of block cipher calls; 2) it specifies what to do when the length of the original message is not a multiple of the block length n; 3) it avoids multiple encryption keys; and 4) it uses a nonce which must be unique for each encryption but is not required to be secret or random. There are two versions of the OCB scheme: the initial version 1.0 and the current version 2.0, which improves on version 1.0. The key differences are that version 2.0 allows associated data to be included with the message and uses a new method for generating the sequence of offsets. The associated data travels in plaintext along with the ciphertext, but it needs to be authenticated. This is similar to the header requirement discussed in Chapter 1.

OCB uses a block cipher - typically AES. It allows a predefined header to be authenticated along with the message. OCB also requires a unique nonce N along with


each encryption. It typically requires h + m + 2 block cipher calls in total, where h is the block length of the header and m is the block length of the original message. Once a header is authenticated, there is virtually no cost in subsequent authentications of H, so OCB then uses m + 2 block cipher calls. OCB is also highly parallelizable: as will be discussed later, some parts of the OCB algorithm can be performed independently, which implies that the efficiency of the OCB operation can improve dramatically if the underlying hardware supports robust parallel processing. Another advantage of OCB is that it is an online scheme. In other words, it is not required to know the length of the complete message before starting encryption; similarly, it is not required to know the length of the complete ciphertext before starting decryption. OCB's output is the same length as the original message plus the length of the authentication tag. This is a significant advantage, as it minimizes the actual data being transferred, though it might be a cause for concern in cases where traffic analysis is possible; in such scenarios other measures, such as padding, need to be used. The following sections describe the actual algorithms, as pseudocode, for OCB and its constructs.


2.2 Notation and Basic Operation [3]

c^i            The integer c raised to the i-th power
ceil(x)        The smallest integer no smaller than x
bitlength(S)   The length of string S in bits
zeros(n)       The string made of n zero bits
S xor T        The string that is the bitwise exclusive-or of S and T. Strings S and T
               must have the same length
S[i]           The i-th bit of the string S (indices begin at 1)
S[i..j]        The substring of S consisting of bits i through j
S || T         The string S concatenated with string T (e.g., 000 || 111 = 000111)

S


2.4 Header Authentication : PMAC [3]

Function Name : PMAC
Input :  K, string of KEYLEN bits        // Key
         H, string of any length         // Header to co-authenticate
Output : Auth, string of BLOCKLEN bits   // Header authenticator

// Break H into blocks
m = max(1, ceil(bitlength(H) / BLOCKLEN))
Let H_1, H_2, ..., H_m be strings such that H = H_1 || H_2 || ... || H_m
and bitlength(H_i) = BLOCKLEN for all 0 < i < m.

// Initialize strings used for offsets and checksums
Offset = ENCIPHER(K, zeros(BLOCKLEN))
Offset = times3(Offset)
Offset = times3(Offset)
Checksum = zeros(BLOCKLEN)

// Accumulate the first m - 1 blocks
for i = 1 to m - 1 do                    // Skip if m < 2
    Offset = times2(Offset)
    Checksum = Checksum xor ENCIPHER(K, H_i xor Offset)
end for

// Accumulate the final block
Offset = times2(Offset)
if bitlength(H_m) = BLOCKLEN then
    Offset = times3(Offset)
    Checksum = Checksum xor H_m
else
    Offset = times3(Offset)
    Offset = times3(Offset)
    Tmp = H_m || 1 || zeros(BLOCKLEN - (bitlength(H_m) + 1))
    Checksum = Checksum xor Tmp
end if

// Compute result
Auth = ENCIPHER(K, Offset xor Checksum)

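PMAC, and the OCB algorithms below, derive successive offsets with the times2 and times3 operations, which are doubling and tripling in GF(2^128). A sketch of those helpers for BLOCKLEN = 128 follows; the reduction constant 0x87 is the value used in the OCB specification for 128-bit blocks, while the `Block` type and function shapes are my own illustration, not the project's code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Block = std::array<uint8_t, 16>;   // one 128-bit block; byte 0 is most significant

// times2: double the block in GF(2^128), i.e. shift left one bit and, if the
// top bit fell off, xor the reduction constant 0x87 into the low byte
// (reduction modulo x^128 + x^7 + x^2 + x + 1).
Block times2(const Block &s) {
    Block r{};
    uint8_t carry = 0;
    for (int i = 15; i >= 0; --i) {      // shift left, least significant byte first
        r[i] = (uint8_t)((s[i] << 1) | carry);
        carry = s[i] >> 7;
    }
    if (s[0] & 0x80) r[15] ^= 0x87;      // fold the overflow bit back in
    return r;
}

// times3: triple the block, defined as times2(S) xor S.
Block times3(const Block &s) {
    Block r = times2(s);
    for (int i = 0; i < 16; ++i) r[i] ^= s[i];
    return r;
}
```

These are cheap, constant-time bit operations, which is why each loop iteration of PMAC and OCB-ENCRYPT can afford a fresh `times2(Offset)` call.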


2.5 Encryption : OCB-ENCRYPT [3]

Function Name : OCB-ENCRYPT
Input :  K, string of KEYLEN bits           // Key
         N, string of BLOCKLEN bits         // Nonce
         H, string of any length            // Header
         M, string of any length            // Plaintext
Output : C, string of length equal to M     // Ciphertext core
         T, string of BLOCKLEN bits         // Authentication tag

// Break M into blocks
m = max(1, ceil(bitlength(M) / BLOCKLEN))
Let M_1, M_2, ..., M_m be strings such that M = M_1 || M_2 || ... || M_m
and bitlength(M_i) = BLOCKLEN for all 0 < i < m.

// Initialize strings used for offsets and checksums
Offset = ENCIPHER(K, N)
Checksum = zeros(BLOCKLEN)

// Encrypt and accumulate first m - 1 blocks
for i = 1 to m - 1 do                       // Skip if m < 2
    Offset = times2(Offset)
    Checksum = Checksum xor M_i
    C_i = Offset xor ENCIPHER(K, M_i xor Offset)
end for

// Encrypt and accumulate final block
Offset = times2(Offset)
b = bitlength(M_m)                          // Value in 0..BLOCKLEN
Pad = ENCIPHER(K, num2str(b, BLOCKLEN) xor Offset)
C_m = M_m xor Pad[1..b]                     // Encrypt M_m
Tmp = M_m || Pad[b+1..BLOCKLEN]
Checksum = Checksum xor Tmp

// Compute authentication tag
Offset = times3(Offset)
T = ENCIPHER(K, Checksum xor Offset)
if bitlength(H) > 0 then
    T = T xor PMAC(K, H)
end if

// Assemble the ciphertext
C = C_1 || C_2 || ... || C_m



2.6 Decryption : OCB-DECRYPT [3]

Function Name : OCB-DECRYPT
Input :  K, string of KEYLEN bits           // Key
         N, string of BLOCKLEN bits         // Nonce
         H, string of any length            // Header
         C, string of any length            // Ciphertext core
         T, string                          // Authentication tag to verify
Output : M, string                          // Plaintext
         V, boolean                         // Validity indicator

// Break C into blocks
m = max(1, ceil(bitlength(C) / BLOCKLEN))
Let C_1, C_2, ..., C_m be strings such that C = C_1 || C_2 || ... || C_m
and bitlength(C_i) = BLOCKLEN for all 0 < i < m.

// Initialize strings used for offsets and checksums
Offset = ENCIPHER(K, N)
Checksum = zeros(BLOCKLEN)

// Decrypt and accumulate first m - 1 blocks
for i = 1 to m - 1 do                       // Skip if m < 2
    Offset = times2(Offset)
    M_i = Offset xor DECIPHER(K, C_i xor Offset)
    Checksum = Checksum xor M_i
end for

// Decrypt and accumulate final block
Offset = times2(Offset)
b = bitlength(C_m)                          // Value in 0..BLOCKLEN
Pad = ENCIPHER(K, num2str(b, BLOCKLEN) xor Offset)
M_m = C_m xor Pad[1..b]
Tmp = M_m || Pad[b+1..BLOCKLEN]
Checksum = Checksum xor Tmp

// Compute valid authentication tag
Offset = times3(Offset)
FullValidTag = ENCIPHER(K, Offset xor Checksum)
if bitlength(H) > 0 then
    FullValidTag = FullValidTag xor PMAC(K, H)
end if

// Verify the received tag
if T = FullValidTag[1..bitlength(T)] then
    V = true
    M = M_1 || M_2 || ... || M_m
else
    V = false
    M = <empty string>
end if



2.7 Parallel Portion of the Encryption / Decryption Algorithm

If we look carefully at the encryption algorithm, it is evident that the main loop that encrypts the plaintext is highly parallelizable. Calculating the ciphertext for message block i requires knowledge of only the current offset value; it does not depend on any other message block or ciphertext block. The offsets can be calculated separately, and it is even possible to calculate the offset values in advance, as they do not depend on the actual plaintext values. Thus, until we reach the last message block, n processes (depending on the available independent processing units) can each encrypt independent blocks of the message while keeping track of their own partial checksum values. Since the checksum is calculated by XORing each message block into the running checksum value, and XOR is an associative and commutative operation, individual processing units can maintain their own checksum values, and at the end the final checksum can be calculated by XORing the individual checksums together. Similar parallelizability exists in the decryption algorithm as well. It is therefore quite possible to take advantage of this highly parallel scheme and implement it such that the implementation becomes scalable and efficient in using all the available processing power. Threading Building Blocks (TBB) is one such mechanism, provided by Intel, which allows the developer to use all the available processor cores without making the implementation overly complicated. Subsequent chapters discuss TBB in detail, along with the implementation of OCB using TBB.
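The per-worker partial-checksum idea can be sketched with standard C++ threads. This is my own standalone illustration rather than the project's TBB code, and it uses a toy byte-wise checksum in place of 128-bit blocks; the point is only that private accumulators combined by a final XOR give the same result as a sequential pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Each worker XORs its share of the blocks into a private partial checksum;
// the partials are XORed together at the end. Because XOR is associative and
// commutative, the result equals the sequential checksum regardless of how
// the blocks were split among workers.
uint8_t parallelChecksum(const std::vector<uint8_t> &blocks, unsigned nWorkers) {
    std::vector<uint8_t> partial(nWorkers, 0);
    std::vector<std::thread> pool;
    size_t chunk = (blocks.size() + nWorkers - 1) / nWorkers;
    for (unsigned w = 0; w < nWorkers; ++w) {
        pool.emplace_back([&, w] {
            size_t lo = w * chunk;
            size_t hi = std::min(blocks.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i)
                partial[w] ^= blocks[i];      // private accumulator: no locking needed
        });
    }
    for (auto &t : pool) t.join();
    uint8_t sum = 0;
    for (uint8_t p : partial) sum ^= p;       // combine the partial checksums
    return sum;
}
```

In the actual OCB loop, the same structure applies with 128-bit `Checksum` blocks and with each worker also computing its own precomputable offsets.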



2.8 Security Consideration of OCB

1. The OCB scheme is as secure as its underlying block cipher, so the designer should choose only a well trusted block cipher. The privacy and authenticity degrade as s^2 / 2^BLOCKLEN, where s is the total number of blocks that the adversary acquires. Thus BLOCKLEN should be selected carefully: choosing a smaller value for BLOCKLEN results in a higher probability of the adversary's success. Usually a BLOCKLEN of 128 bits is sufficient.

2. For secure operation, it is required that a nonce value is never repeated with the same encryption key. If multiple parties communicate with the same key, they should divide the nonce space so that their nonces do not overlap. The nonce is not required to be secret; a simple counter works fine.

3. The designer can also choose the length of the authentication tag, but choosing a small value for the tag length increases the chances of an adversary being capable of forgery.

4. The OCB scheme (or any other authenticated encryption scheme, for that matter) can provide security against the chosen ciphertext attack. But for that, the designer needs to make sure that when decryption or authentication fails, the system does not give any details to the adversary.
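One practical aspect of point 4 is how the tag comparison itself is coded: a byte-by-byte compare that exits at the first mismatch can leak, through timing, how many leading bytes of a forged tag were correct. A common mitigation is a constant-time comparison; the helper below is my own illustration, not part of the project code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compare two tags without branching on the data: accumulate all byte
// differences and test the accumulator once, so the running time does not
// depend on where the first mismatch occurs.
bool tagsEqual(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; ++i)
        diff |= (uint8_t)(a[i] ^ b[i]);   // nonzero iff any byte differs
    return diff == 0;
}
```

Combined with returning only a single "invalid" indication (as OCB-DECRYPT's V flag does), this keeps a failed decryption from handing useful information to the adversary.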



Chapter 3

INTRODUCTION TO THREADING BUILDING BLOCKS (TBB)

3.1 Overview

It is often quite challenging to develop a multi-threaded application that scales with the number of available processor cores. Ideally, if an application achieves X amount of performance on a dual-core machine, that performance should improve on a quad-core machine. If a developer tries to build this kind of scalable application using raw threads, such as POSIX or Windows threads, he or she needs to manage a lot of threading overhead, and in addition must take care of low-level concerns such as load balancing, memory contention and cache performance. Threading Building Blocks (TBB) is a template library for C++, developed by Intel, which helps developers in such cases. It is a high-level library in which the developer identifies the different tasks in the application rather than the threads. TBB automatically maps the tasks to an appropriate number of threads and runs them efficiently. TBB can identify the number of available processor cores and can load-balance the tasks to get maximum performance.

TBB offers a wide range of advantages over native threads, such as:

1. It is platform independent and processor independent.

2. It can be seamlessly integrated with other threading libraries in the same application.

15


3. TBB targets data-parallel programming. Instead of parallelizing independent tasks, it tries to divide one data-intensive task across multiple threads. This approach often works better in terms of efficiency.

4. Instead of relying on a global task queue, it uses a task-stealing mechanism, thus avoiding the main point of contention. Task stealing is described in more detail in the following subsection.

3.2 Task Scheduling

The task scheduler is the heart of TBB. It is the component that allocates tasks to the available worker threads and maintains the load balance. Whenever the task scheduler is initialized, it creates a task graph. Each node in the graph represents a task, and each arrow points to that task's parent. Each node keeps a count of its child tasks (a reference count) as well as a depth count, which is usually one more than its parent's. One such task graph is shown in figure 1. There are two ways to traverse this task graph: breadth first and depth first. Depth-first traversal is usually more efficient for sequential execution, for two main reasons.

1. The deepest task is the last one created, so its data is more likely to still be in cache. Executing such a task first obviously improves performance.

2. It minimizes memory usage. If the graph is unfolded in breadth-first fashion, many more new tasks are created simultaneously, all of which occupy memory.


On the other hand, a purely depth-first approach reduces parallelism, so the TBB task scheduler uses a fine mixture of both schemes.

[Figure: a task graph of five nodes, each annotated with its Depth and RefCount values; the root has Depth = 0, RefCount = 2, its children have Depth = 1, and the leaves have Depth = 2, RefCount = 0.]

Figure 1 : Sample Task Graph


The scheduler creates an appropriate number of worker threads based on the number of available processing units. Each task has an execute() method; once a thread begins executing a task, that task is bound to the thread and cannot migrate to another thread, although the thread may execute some other task while the current task is waiting. The task graph is searched in breadth-first fashion and the tasks are assigned to the worker threads. After that, each thread keeps its own ready pool. This pool is basically an array of lists: the array is indexed by the depth of the task node, and each list works as a stack (LIFO). A newly created task (in the ready state) is pushed onto the list at the level of its depth, and it always goes into the ready pool of the thread that created it.

[Figure: a ready pool indexed from shallow to deep, holding Task A at the shallowest level, Tasks B and C at the next level, and Task D at the deepest level.]

Figure 2 : Sample Ready Pool


When selecting the next task to execute, the following rules are applied in order:

1. Run the task returned by the execute() method of the previous task.

2. Otherwise, take the task at the deepest non-empty list in the thread's own pool.

3. If the thread's own pool is empty, steal a task from the shallowest list of another thread's pool.

In summary, the TBB task scheduler uses breadth-first stealing and a depth-first work strategy. Breadth first maximizes parallelism, while depth first ensures that the threads work efficiently once they have enough work to do.
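The ready-pool discipline described above can be sketched as a plain C++ structure. This is a simplified single-threaded model, not TBB's actual implementation (real TBB uses lock-free deques); it only shows the depth-indexed LIFO lists, with the owner popping from the deepest level and a thief stealing from the shallowest.

```cpp
#include <deque>
#include <optional>
#include <string>
#include <vector>

// Simplified model of a TBB-style per-thread ready pool:
// an array of LIFO lists indexed by task depth.
struct ReadyPool {
    std::vector<std::deque<std::string>> levels;  // index = task depth

    void push(std::size_t depth, const std::string& task) {
        if (depth >= levels.size()) levels.resize(depth + 1);
        levels[depth].push_back(task);            // LIFO: push on the back
    }
    // Owner thread: pop from the deepest non-empty list.
    std::optional<std::string> popDeepest() {
        for (auto it = levels.rbegin(); it != levels.rend(); ++it)
            if (!it->empty()) { auto t = it->back(); it->pop_back(); return t; }
        return std::nullopt;
    }
    // Thief thread: steal from the shallowest non-empty list.
    std::optional<std::string> stealShallowest() {
        for (auto& lvl : levels)
            if (!lvl.empty()) { auto t = lvl.front(); lvl.pop_front(); return t; }
        return std::nullopt;
    }
};
```
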

3.3 TBB-Provided Algorithms

TBB provides a number of constructs/algorithms that help parallelize the most common parallel structures in software. Some of them are listed below with a short description.

1. parallel_for and parallel_reduce : used when a fixed number of independent loop iterations needs to be parallelized.

2. parallel_scan : used to parallelize a loop in which each iteration depends on the previous iteration. For example, a running-sum calculation can be parallelized using this construct.

3. parallel_while : used to parallelize a continuous, unstructured stream of work. New work can be added on the fly.


4. pipeline : efficiently parallelizes segments of code with a typical pipeline structure.

5. parallel_sort : the complexity of this construct is no higher than O(n log n) on a single processor. As more processors become available, the complexity approaches O(n).

In each of these constructs it is possible to specify the chunk size manually. In that case TBB divides the work accordingly and each worker thread works on the specified chunk of data. TBB also provides an auto_partitioner, which determines the chunk size based on the available resources and the parallelism of the code. auto_partitioner works well in cases where the actual data size is not known in advance.
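To illustrate what parallel_scan automates, here is a hand-rolled two-pass parallel running sum written with plain std::thread (a sketch only; TBB would additionally handle grain sizing and load balancing): pass one sums each chunk independently, a short sequential step turns the chunk sums into starting offsets, and pass two writes the running sum inside each chunk.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Two-pass parallel prefix (running) sum over `numChunks` chunks.
std::vector<long> parallelPrefixSum(const std::vector<long>& in, int numChunks) {
    std::vector<long> out(in.size());
    if (in.empty()) return out;
    std::size_t chunk = (in.size() + numChunks - 1) / numChunks;
    std::vector<long> chunkSum(numChunks, 0);

    // Pass 1: each thread sums its own chunk independently.
    {
        std::vector<std::thread> ts;
        for (int c = 0; c < numChunks; ++c)
            ts.emplace_back([&, c] {
                std::size_t lo = c * chunk, hi = std::min(in.size(), lo + chunk);
                for (std::size_t i = lo; i < hi; ++i) chunkSum[c] += in[i];
            });
        for (auto& t : ts) t.join();
    }
    // Sequential step: turn chunk sums into starting offsets.
    std::vector<long> offset(numChunks, 0);
    for (int c = 1; c < numChunks; ++c) offset[c] = offset[c - 1] + chunkSum[c - 1];

    // Pass 2: each thread writes the running sum within its chunk.
    std::vector<std::thread> ts;
    for (int c = 0; c < numChunks; ++c)
        ts.emplace_back([&, c] {
            std::size_t lo = c * chunk, hi = std::min(in.size(), lo + chunk);
            long acc = offset[c];
            for (std::size_t i = lo; i < hi; ++i) { acc += in[i]; out[i] = acc; }
        });
    for (auto& t : ts) t.join();
    return out;
}
```
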

3.4 Containers

TBB provides highly concurrent containers. These containers are quite similar to the STL containers, but they are thread safe. STL containers are generally not thread safe, so the usual practice is to hold a lock while accessing them, which largely defeats the purpose of parallelism. TBB currently provides only three such containers: a concurrent queue, a concurrent vector and a concurrent hash map. TBB uses two different kinds of locking mechanisms to provide maximum parallelizability:

1. Fine-grained locking : only the portion of the container being used is locked, so as long as two threads access different portions of the same container, their accesses actually proceed in parallel.


2. Lock-free algorithms : concurrent access is allowed without locking the container, but any possible corruption is tracked and corrected.

These concurrent thread-safe containers are not as fast as their STL counterparts, but if used properly, the gain from parallelism can outweigh the slowness of the containers.
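For contrast, the usual coarse-grained way to make an STL container thread safe looks like the sketch below: one mutex guarding the whole std::queue, so every push and pop serializes. This is exactly the contention that tbb::concurrent_queue avoids with fine-grained and lock-free techniques.

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Coarse-grained wrapper: correct, but all access serializes on one lock.
template <typename T>
class LockedQueue {
    std::queue<T> q_;
    mutable std::mutex m_;
public:
    void push(const T& v) {
        std::lock_guard<std::mutex> lock(m_);  // whole container locked
        q_.push(v);
    }
    std::optional<T> tryPop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        T v = q_.front();
        q_.pop();
        return v;
    }
};
```
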

3.5 Scalable Memory Allocation

Memory allocation is a huge bottleneck on multiprocessor systems: all the parallel threads try to allocate memory from the same heap, and that reduces parallelism. There is also the danger of false sharing, which is caused by the way a processor accesses memory. Even if it needs to read only one byte, a processor has to read the entire cache line, so if two or more threads are using different bytes in the same cache line, the cache miss ratio will be very high and performance will suffer badly. To mitigate these problems TBB offers two different allocators:

1. scalable_allocator : allocating memory through this allocator ensures that each thread is given memory from a different pool.

2. cache_aligned_allocator : this allocator makes sure that, besides using separate pools for each thread, each memory allocation is aligned to a cache line. This is likely to increase memory wastage and hence should be used carefully.
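The false-sharing fix that cache_aligned_allocator applies at the allocator level can be shown directly in C++ with alignas (a sketch; 64 bytes is an assumed cache-line size, the real value is hardware dependent). Without the padding, the two counters below could land in the same cache line and every increment would invalidate the other core's copy of that line.

```cpp
#include <atomic>
#include <thread>

// Pad/align each thread's counter to its own (assumed 64-byte) cache line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

// Two threads each bump their own line-private counter.
long countInParallel(long perThread) {
    PaddedCounter c[2];
    std::thread t1([&] { for (long i = 0; i < perThread; ++i) c[0].value++; });
    std::thread t2([&] { for (long i = 0; i < perThread; ++i) c[1].value++; });
    t1.join();
    t2.join();
    return c[0].value + c[1].value;
}
```
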


Chapter 4

OCB IMPLEMENTATION USING TBB

This chapter presents the actual implementation of the OCB algorithm using the TBB template library. As discussed in chapter 1, this implementation tries to parallelize the main loop of the algorithm, which performs the actual encryption and authentication. The calculation of the offset values is not parallelized, as it can be computed in advance.

4.1 Class Definitions

The definition of class encryptBlockParallel in section 4.1.2 is the main place where the TBB operation is defined. Each worker thread created by TBB for the encryption process calls the function defined in this class. Looking at the function carefully, we can see that TBB actually passes a range of values described by the blocked_range template; the worker thread works on that specific range of values before taking the next task from the ready pool. Similarly, the class definition of xorBlockParallel in section 4.1.3 tries to parallelize the xor operation on the two blocks passed as its parameters. This function can really improve performance only if the block size is very large: TBB automatically weighs the cost of dividing the xor operation and does so only when it is worthwhile. So in our case, where the block size is 16 bytes, it is very unlikely that TBB will spread the operation across multiple worker threads.


4.1.1 Class Definition : OCB_With_TBB

#define BLOCK_LEN 16

typedef unsigned char byte;
typedef byte BLOCK[BLOCK_LEN];

using namespace tbb;
using namespace std;

class OCB_With_TBB
{
private:
    int nextBlockNum;
    HANDLE hThread1, hThread2, hSemaphore;
    DWORD dwThreadID1, dwThreadID2;
    BLOCK *pOffset;
    BLOCK key, nonce;
    BLOCK checksum;
    BLOCK currentOffset;
    CRijndael objAES;
    unsigned int lenHeader;
    unsigned int lenPlainText;

public:
    BLOCK *pPlainText, *pCipherText, *pHeader;
    BLOCK authTag;

    OCB_With_TBB(void);
    ~OCB_With_TBB(void);
    void Initialize(BLOCK* pKey, BLOCK* pNonce, BLOCK* pPlainText,
        unsigned int lenPlainText, BLOCK* pHeader, unsigned int lenHeader);
    void OCBEncrypt();
    // int getNextBlockNum();
    bool times2(BLOCK* input, BLOCK* output);
    bool times3(BLOCK* input, BLOCK* output);
    void xorBlock(BLOCK* retBlock, BLOCK* leftBlock, BLOCK* rightBlock);
    void xorBlockWithOffset(BLOCK* retBlock, BLOCK* input, int blockNum);
    void pmac(BLOCK* result);
    CRijndael* getAESObject();
};


4.1.2 Class Definition : encryptBlockParallel

class encryptBlockParallel
{
    OCB_With_TBB* const objOCB;

public:
    void operator()(const blocked_range<size_t>& range) const
    {
        BLOCK tempBlock1, tempBlock2;
        for(int i = 0; i < 100000; ++i)
        {
            for(size_t index = range.begin(); index < range.end(); ++index)
            {
                objOCB->xorBlockWithOffset(&tempBlock1,
                    &((objOCB->pPlainText)[index]), index);
                (objOCB->getAESObject())->EncryptBlock((const char*)
                    &tempBlock1, (char*)&tempBlock2);
                objOCB->xorBlockWithOffset(&((objOCB->pCipherText)
                    [index]), &tempBlock2, index);
            }
        }
    }

    encryptBlockParallel(OCB_With_TBB* objOCB) : objOCB(objOCB)
    {
    }
};


4.1.3 Class Definition : xorBlockParallel

class xorBlockParallel
{
    BLOCK *ret;
    BLOCK const *left;
    BLOCK const *right;

public:
    void operator()(const blocked_range<size_t>& range) const
    {
        for(size_t index = range.begin(); index < range.end(); ++index)
        {
            ((unsigned char*)ret)[index] = ((unsigned char*)left)[index] ^
                ((unsigned char*)right)[index];
        }
    }

    xorBlockParallel(BLOCK* retBlock, BLOCK* leftBlock, BLOCK* rightBlock)
    {
        ret = retBlock;
        left = leftBlock;
        right = rightBlock;
    }
};


4.2 Class Implementation

Below is the implementation of class OCB_With_TBB and its related functions. The main thing to notice here is the call to parallel_for() in the function OCBEncrypt. This TBB-provided function causes the loop to be divided into tasks, each of which is given to an available worker thread. The last parameter to parallel_for is an auto_partitioner object, which indicates that TBB will divide the work on its own. TBB also provides a way to supply our own partitioner, with which a programmer can specify how to divide the loop; in other words, we can specify the grain size.
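What parallel_for and the partitioner do can be approximated by hand: split the index range into grains and hand each grain to a thread. The sketch below uses plain std::thread instead of TBB, purely to illustrate the grain-size idea; TBB additionally load balances the grains via task stealing rather than spawning one thread per grain.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Apply `body` to every index in [0, n), chopping the range into grains
// of `grainSize` indices - a stand-in for parallel_for with a fixed
// grain size (TBB's auto_partitioner would pick grainSize itself).
void manualParallelFor(std::size_t n, std::size_t grainSize,
                       const std::function<void(std::size_t)>& body) {
    std::vector<std::thread> workers;
    for (std::size_t lo = 0; lo < n; lo += grainSize) {
        std::size_t hi = std::min(n, lo + grainSize);
        workers.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

Each index is handled by exactly one grain, so disjoint writes to an output array are safe without locks.
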

#define PROCESSOR_FREQUENCY 2680000000

void OCB_With_TBB::xorBlock(BLOCK* retBlock, BLOCK* leftBlock, BLOCK*
    rightBlock)
{
    int loopIndex = 0;
    /* parallel_for(blocked_range<size_t>(0, BLOCK_LEN), xorBlockParallel
        (retBlock, leftBlock, rightBlock), auto_partitioner()); */
    for(loopIndex = 0; loopIndex < BLOCK_LEN; ++loopIndex)
    {
        ((unsigned char*)retBlock)[loopIndex] = ((unsigned char*)leftBlock)
            [loopIndex] ^ ((unsigned char*)rightBlock)[loopIndex];
    }
}

void OCB_With_TBB::xorBlockWithOffset(BLOCK* retBlock, BLOCK* input, int
    blockNum)
{
    xorBlock(retBlock, input, &(pOffset[blockNum]));
}


bool OCB_With_TBB::times2(BLOCK* input, BLOCK* output)
{
    int loopIndex = 0;
    if(NULL != input && NULL != output)
    {
        unsigned char carry = 0;
        carry = ((unsigned char*)input)[0] >> 7;
        for(loopIndex = 0; loopIndex < BLOCK_LEN - 1; ++loopIndex)
        {
            ((unsigned char*)output)[loopIndex] = (((unsigned char*)input)
                [loopIndex] << 1) | (((unsigned char*)input)[loopIndex + 1] >> 7);
        }
        // doubling in GF(2^128): fold the carried-out bit back in by
        // xoring 0x87 into the last byte
        ((unsigned char*)output)[BLOCK_LEN - 1] = (((unsigned char*)input)
            [BLOCK_LEN - 1] << 1) ^ (carry * 0x87);
        return true;
    }
    return false;
}
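The doubling operation can be checked in isolation. Below is a self-contained version of the same GF(2^128) doubling, written against std::array rather than the project's BLOCK typedef, together with the derived tripling times3(a) = times2(a) xor a used by the pmac routine:

```cpp
#include <array>
#include <cstdint>

using Block = std::array<std::uint8_t, 16>;

// Doubling in GF(2^128): shift the 128-bit block left by one bit and,
// if a bit carried out of the top, xor 0x87 (the low byte of the field
// polynomial) into the last byte.
Block times2(const Block& in) {
    Block out{};
    std::uint8_t carry = in[0] >> 7;
    for (int i = 0; i < 15; ++i)
        out[i] = static_cast<std::uint8_t>((in[i] << 1) | (in[i + 1] >> 7));
    out[15] = static_cast<std::uint8_t>((in[15] << 1) ^ (carry * 0x87));
    return out;
}

// Tripling is defined in terms of doubling: 3*a = 2*a xor a.
Block times3(const Block& in) {
    Block d = times2(in);
    for (int i = 0; i < 16; ++i) d[i] ^= in[i];
    return d;
}
```
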


void OCB_With_TBB::Initialize(BLOCK* pKey, BLOCK* pNonce, BLOCK* pPlainText,
    unsigned int lenPlainText, BLOCK* pHeader, unsigned int lenHeader)
{
    BLOCK temp;
    unsigned int loopIndex = 0;
    unsigned int numPlainTextBlocks = ceil((double)lenPlainText /
        (double)BLOCK_LEN);
    unsigned int numHeaderBlocks = ceil((double)lenHeader /
        (double)BLOCK_LEN);
    nextBlockNum = 0;
    hSemaphore = CreateSemaphore(NULL, 1, 1, NULL);
    if(NULL != pKey && NULL != pNonce && NULL != pPlainText)
    {
        this->lenPlainText = lenPlainText;
        this->lenHeader = lenHeader;
        memcpy(&key, pKey, sizeof(key));
        memcpy(&nonce, pNonce, sizeof(nonce));
        this->pPlainText = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pCipherText = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pOffset = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pHeader = (BLOCK*)calloc(numHeaderBlocks, BLOCK_LEN);
        memcpy(this->pPlainText, pPlainText, lenPlainText);
        memcpy(this->pHeader, pHeader, lenHeader);
        memset(&authTag, 0, sizeof(authTag));
        // Initializing AES
        objAES.MakeKey((const char*)&key, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
            sizeof(key), BLOCK_LEN);
        // Initializing Offset


        memset(temp, 0, sizeof(temp));
        memset(currentOffset, 0, sizeof(currentOffset));
        objAES.EncryptBlock((const char*)&nonce, (char*)&currentOffset);
        memset(&checksum, 0, sizeof(checksum));
        // Now pre-calculating offset and checksum values
        if(1 < numPlainTextBlocks)
        {
            times2(&currentOffset, &temp);
            memcpy(pOffset, &temp, BLOCK_LEN);
        }
        xorBlock(&temp, &checksum, pPlainText);
        memcpy(&checksum, &temp, BLOCK_LEN);
        for(loopIndex = 1; loopIndex < numPlainTextBlocks - 1; ++loopIndex)
        {
            times2(&(pOffset[loopIndex - 1]), &temp);
            memcpy(&(pOffset[loopIndex]), &temp, BLOCK_LEN);
            xorBlock(&temp, &checksum, &(pPlainText[loopIndex]));
            memcpy(&checksum, &temp, sizeof(checksum));
        }
        memcpy(&currentOffset, &(pOffset[loopIndex - 1]), BLOCK_LEN);
    }
}


void OCB_With_TBB::OCBEncrypt()
{
    unsigned int bitLength = 0;
    unsigned int loopIndex = 0, numBlocks = 0, numPlainTextBlocks = 0;
    BLOCK tempBlock1, tempBlock2, pad;
    numPlainTextBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    long int startTime, endTime;
    double cpuCycles;
    startTime = clock();
    parallel_for(blocked_range<size_t>(0, numPlainTextBlocks),
        encryptBlockParallel(this), auto_partitioner());
    endTime = clock();
    cpuCycles = ((double)(endTime - startTime) / (double)CLOCKS_PER_SEC) *
        PROCESSOR_FREQUENCY;
    cpuCycles = cpuCycles / (double)100000; // loop count
    cpuCycles = (cpuCycles / (double)(numPlainTextBlocks * 16));
    // Now processing last block
    numBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    times2(&currentOffset, &tempBlock1);
    memcpy(&currentOffset, &tempBlock1, BLOCK_LEN);
    if(1 < numBlocks)
    {
        memcpy(&(pOffset[numBlocks - 1]), currentOffset, BLOCK_LEN);
    }
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    //numPlainTextBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    if(1 < numPlainTextBlocks)
    {
        bitLength = ((this->lenPlainText) - ((numPlainTextBlocks - 1) *
            BLOCK_LEN)) * 8;
    }


    else
    {
        bitLength = this->lenPlainText * 8;
    }
    for(loopIndex = 0; loopIndex < sizeof(bitLength); ++loopIndex)
    {
        // following line is specific to a little endian machine
        tempBlock1[sizeof(tempBlock1) - sizeof(bitLength) + loopIndex] |=
            ((unsigned char*)(&bitLength))[(sizeof(bitLength) - loopIndex) - 1];
    }
    xorBlock(&tempBlock2, &tempBlock1, &currentOffset);
    (this->getAESObject())->EncryptBlock((const char*)&tempBlock2,
        (char*)&pad);
    for(loopIndex = 0; loopIndex < ceil((double)bitLength / 8); ++loopIndex)
    {
        ((unsigned char*)(&(pCipherText[numBlocks - 1])))[loopIndex] =
            ((unsigned char*)(&(pPlainText[numBlocks - 1])))[loopIndex] ^
            ((unsigned char*)(&pad))[loopIndex];
    }
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memcpy(&tempBlock1, &(pPlainText[numBlocks - 1]), (bitLength / 8));
    memcpy(&((unsigned char*)&tempBlock1)[(bitLength / 8)], &((unsigned char*)
        &pad)[(bitLength / 8)], (BLOCK_LEN - (bitLength / 8)));
    xorBlock(&tempBlock2, &checksum, &tempBlock1);
    memcpy(&checksum, &tempBlock2, BLOCK_LEN);
    // Computing authentication tag
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memset(&tempBlock2, 0, sizeof(tempBlock2));
    times3(&currentOffset, &tempBlock1);
    xorBlock(&tempBlock2, &checksum, &tempBlock1);
    (getAESObject())->EncryptBlock((const char*)&tempBlock2, (char*)&authTag);


    if(lenHeader > 0)
    {
        pmac(&tempBlock1);
        xorBlock(&tempBlock2, &authTag, &tempBlock1);
        memcpy(&authTag, &tempBlock2, sizeof(BLOCK));
    }
}

void OCB_With_TBB::pmac(BLOCK* result)
{
    unsigned int numHeaderBlocks, loopIndex;
    BLOCK offset, checksum, tempBlock1, tempBlock2;
    numHeaderBlocks = ceil((double)lenHeader / (double)BLOCK_LEN);
    memset(&offset, 0, sizeof(offset));
    memset(&checksum, 0, sizeof(checksum));
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memset(&tempBlock2, 0, sizeof(tempBlock2));
    objAES.EncryptBlock((const char*)&tempBlock1, (char*)&offset);
    times3(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));
    times3(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));
    for(loopIndex = 0; loopIndex < numHeaderBlocks - 1; ++loopIndex)
    {
        times2(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, sizeof(offset));
        xorBlock(&tempBlock1, &(pHeader[loopIndex]), &offset);
        objAES.EncryptBlock((const char*)&tempBlock1, (char*)&tempBlock2);
        xorBlock(&tempBlock1, &checksum, &tempBlock2);
        memcpy(&checksum, &tempBlock1, sizeof(checksum));
    }


    // Now processing last block
    times2(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));
    if(0 == (lenHeader % BLOCK_LEN))
    {
        // final header block is full: xor it into the checksum as-is
        xorBlock(&tempBlock1, &checksum, &pHeader[numHeaderBlocks - 1]);
        memcpy(&checksum, &tempBlock1, BLOCK_LEN);
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, sizeof(offset));
    }
    else
    {
        // final header block is partial: pad it with 0x80 before xoring
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, BLOCK_LEN);
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, BLOCK_LEN);
        memset(&tempBlock1, 0, BLOCK_LEN);
        // assuming lenHeader in bytes and not in bits
        memcpy(&tempBlock1, &(pHeader[numHeaderBlocks - 1]), BLOCK_LEN);
        ((unsigned char*)&tempBlock1)[lenHeader % BLOCK_LEN] = 0x80;
        xorBlock(&tempBlock2, &checksum, &tempBlock1);
        memcpy(&checksum, &tempBlock2, BLOCK_LEN);
    }
    xorBlock(&tempBlock1, &offset, &checksum);
    objAES.EncryptBlock((const char*)&tempBlock1, (char*)result);
}


Chapter 5

RESULTS

5.1 Experiments

Experiment A : The first experiment was carried out on a machine with a 2.67 GHz Intel Xeon processor, 6 GB of RAM and 8 MB of cache memory. There was also a need to compare results between a machine with a 2-core processor and a machine with a 4-core processor. I used a setting in the Windows operating system to change the number of visible processor cores; using that, the OS can be configured to see 1, 2 or 4 cores. The advantage of this approach is that all the results are comparable, as they are taken on the same physical machine and only the number of visible cores changes. All the results are expressed in Cycles Per Byte (CPB). Below is a chart based on the numbers I collected from the experiment [Figure 3]. The results clearly indicate that performance really does improve when the number of visible cores is changed from 1 to 2 to 4. If we compare the performance for the same number of cores with and without TBB, there is not much difference in the numbers. But the major point to note here is that the "without TBB" implementation needs to change in order to run optimally on a 1-core versus a 2-core versus a 4-core machine: we need to change the number of worker threads and also divide the data range to be processed. The "with TBB" implementation needs absolutely no change. Once


compiled, the same code works on all three machine configurations and gives optimum performance.

[Figure: line chart of CPB (Cycles Per Byte), y axis 0-40, versus number of blocks (1-248), with one line each for Without TBB - 1 Thread, Without TBB - 2 Threads, Without TBB - 4 Threads, With TBB - 2 Cores and With TBB - 4 Cores.]

Figure 3 : CPB Comparison at Different Processor Cores

Note : The corresponding CPB numbers are listed in table 1 at the end of the chapter.


Experiment B : This experiment was carried out on a machine with an Intel Core 2 Quad 2.66 GHz processor, 2 GB of RAM and 6 MB of cache memory. Its purpose was to compare the performance of OCB execution at different block sizes; the block sizes compared were 16, 24 and 32 bytes. The chart in figure 4 is based on the results of this experiment. The results indicate that the performance for a 16-byte block length is nearly the same in both cases, and with a 32-byte block length TBB performs slightly better. But there is a striking difference in performance when the block length is kept at 24 bytes. A 24-byte block is not aligned to the word boundary in memory, so the obvious suspect is an unfortunate division of the data range by TBB: if the range is divided such that two adjacent blocks are given to two different worker threads, there will be many cache misses during the entire run, with data bouncing back and forth between the caches of different processor cores. To eliminate this suspicion, I changed the way parallel_for is called in the implementation and divided the range statically instead of relying on the auto_partitioner object provided by TBB. To my surprise, the results did not change, which indicates that the range division done by TBB is not the culprit. The only remaining possibility is that the task scheduler, which assigns tasks to the available worker threads, causes too much overhead and thus a significant drop in performance.


[Figure: line chart of CPB (Cycles per Byte), y axis 0-900, versus number of blocks (1-64), with one line for each combination of block length (16, 24 and 32 bytes) with and without TBB.]

Figure 4 : CPB Comparison at Different Block Lengths

Note : The corresponding CPB numbers are listed in table 2 at the end of the chapter.


Experiment C : The third experiment was carried out on a machine with an Intel Core 2 Quad 2.66 GHz processor, 2 GB of RAM and 6 MB of cache memory. Its purpose was to observe TBB performance at various chunk sizes on a multi-core versus a single-core machine. In this experiment, I changed the partitioner parameter in the parallel_for call: instead of using auto_partitioner, I statically divided the range into chunks of 1, 2, 4, 8, 16, 32, 64, 128 and 256 blocks (1 block = 16 bytes). For a chunk size of n blocks, each TBB worker thread therefore works on n blocks before asking for more work. The results of the experiment are displayed in figure 5. They indicate that increasing the chunk size on a multicore processor really does improve overall performance, but it does not improve performance on a single-core processor. The reason is quite clear: on a multicore machine there is more than one active thread running at the same time to take advantage of the increased chunk size, while on a single-core machine only one thread is active at a time. So even though the chunk size is increased, there is only one active thread available to take real advantage of it; hence the line for the single-core machine in the chart is almost horizontal. The results can also be read the opposite way: on a multicore machine, performance degrades rapidly as the chunk size decreases, mainly because more active threads can simultaneously ask for a new task, and simultaneous calls to the task scheduler inevitably impose some performance penalty.
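The trade-off can be made concrete by counting tasks: with n blocks and a chunk size of g blocks, the scheduler hands out ceil(n/g) tasks, so halving the chunk size doubles the number of trips to the task scheduler. A small hypothetical helper (not part of the project's code):

```cpp
#include <cstddef>

// Number of tasks the scheduler must hand out when an n-block range is
// split into fixed chunks of g blocks (the last chunk may be smaller).
// More tasks means more scheduler interactions - the overhead that
// dominates the small-chunk multicore runs in Experiment C.
std::size_t numTasks(std::size_t n, std::size_t g) {
    return (n + g - 1) / g;   // ceil(n / g)
}
```
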


[Figure: line chart of CPB (Cycles per Byte), y axis 0-80, versus chunk size in blocks (1-256), with one line for 1 CPU core with TBB and one for 4 CPU cores with TBB.]

Figure 5 : CPB Comparison at Different Chunk Sizes

Note : The corresponding CPB numbers are listed in table 3 at the end of the chapter.


5.2 Conclusion

Based on the above experiments, I can say that in most cases TBB performs optimally or nearly optimally. The major advantage of TBB is that we do not need to recompile the code for different platforms: once the code is compiled with TBB, it can be executed on any machine and we can expect better utilization of the available resources. But there are certain cases where TBB does not perform well, for example OCB execution with the block length set to 24 bytes; in this case TBB performs worse than the implementation with native threads. So if we know the size of the input data in advance, it is better to compare the performance of both methods, and if TBB is no worse than the alternative, we should deploy the application developed using TBB.


Table 1 : CPB Comparison at Different Processor Cores (Experiment A)

Num. of   Without TBB       Without TBB       Without TBB       With TBB    With TBB
Blocks    1 Thread          2 Threads         4 Threads         2-Core      4-Core
          (4-core system)   (4-core system)   (4-core system)   System      System

1 25.031250 26.700000 25.031250 26.800000 25.125000
2 25.865625 12.515625 13.350000 13.400000 13.400000
3 26.143750 17.800000 8.343750 17.308333 17.308333
4 32.540625 12.932813 6.675000 12.981250 6.700000
5 26.032500 20.692500 15.686250 21.105000 10.385000
6 26.143750 17.521875 13.071875 13.120833 13.120833
7 29.799107 14.780357 7.390179 18.664286 7.417857
8 25.865625 16.270312 9.803906 12.981250 9.840625
9 28.925000 17.429167 8.529167 20.286111 8.561111
10 26.032500 13.016250 10.513125 13.065000 7.872500
11 28.520455 16.535795 9.405682 16.597727 9.593182
12 28.090625 12.932812 6.535938 15.354167 8.654167
13 28.112019 16.045673 10.012500 15.976923 7.988462
14 25.984821 14.899554 9.297321 14.955357 9.332143
15 27.812500 13.906250 8.677500 15.745000 7.035000
16 27.638672 14.601563 6.466406 14.656250 8.165625
17 27.583456 15.313235 9.227206 15.370588 9.163235
18 28.925000 14.462500 8.714583 14.516667 8.747222
19 27.402632 15.106579 6.850658 15.163158 8.286842
20 27.284062 13.016250 7.759688 14.321250 6.532500
21 27.335714 14.939286 8.661607 14.915476 8.694048
22 28.368750 12.970739 8.343750 14.313636 8.298864
23 27.135326 15.889402 7.908424 14.710870 7.938043
24 27.117188 14.045313 7.578906 13.120833 6.560417
25 28.168500 14.618250 8.343750 15.678000 8.375000
26 27.983654 13.991827 8.022837 14.044231 7.988462
27 27.009028 15.451389 7.663889 14.516667 7.754630
28 27.891964 13.945982 7.449777 13.998214 7.477679
29 27.850862 14.328233 8.113578 14.439655 8.086207
30 27.756875 14.796250 7.787500 14.795833 7.872500
31 26.861492 14.265121 7.536290 14.318548 6.754032
32 26.856445 14.653711 7.352930 14.708594 7.328125


Num. of<br />

Blocks<br />

Without<br />

TBB - 1<br />

Thread - 4<br />

Core System<br />

Without<br />

TBB - 2<br />

Threads - 4<br />

Core System<br />

Without<br />

TBB - 4<br />

Threads - 4<br />

Core System<br />

With TBB -<br />

2 Core<br />

System<br />

With TBB -<br />

4 Core<br />

System<br />

33 28.368750 14.159091 7.888636 14.262879 7.918182<br />

34 27.583456 13.791728 7.656618 13.794118 7.685294<br />

35 27.510536 14.875714 7.437857 14.931429 8.231429<br />

36 27.488021 14.462500 7.231250 14.516667 7.258333<br />

37 27.421622 14.071622 7.712331 14.124324 7.741216<br />

38 28.105263 14.403947 8.255921 14.457895 7.581579<br />

39 27.341827 13.991827 7.316827 14.731410 7.344231<br />

40 27.367500 14.351250 7.175625 13.735000 7.202500<br />

41 27.921037 14.571037 7.611128 14.666463 7.639634<br />

42 27.891964 13.667857 7.429911 14.277381 7.457738<br />

43 27.243314 14.514244 7.257122 14.607558 7.907558<br />

44 27.799858 14.184375 7.699006 13.666477 7.118750<br />

45 27.775417 14.462500 7.527917 14.516667 7.556111<br />

46 27.171603 14.148098 7.364266 14.201087 6.809239<br />

47 27.694149 14.415160 7.740160 14.433511 6.664362<br />

48 27.638672 14.114844 7.057422 14.167708 7.642187<br />

49 27.653571 14.337628 7.969133 19.723980 7.452041<br />

50 27.567750 14.050875 7.275750 14.639500 7.303000<br />

51 28.074265 14.298897 7.656618 14.352451 7.192647<br />

52 27.534375 14.023918 7.028005 14.559615 7.022115<br />

53 27.518632 14.231604 7.839976 14.284906 7.900943<br />

54 27.967014 14.462500 7.725694 14.051389 7.258333<br />

55 27.458523 14.199545 7.099773 14.709545 7.126364<br />

56 27.415179 13.945982 6.972991 13.998214 7.447768<br />

57 27.871053 14.169737 7.758224 14.222807 7.346491<br />

58 27.361746 13.896659 7.192888 15.768103 7.653017<br />

59 27.803072 14.566208 7.495233 14.620763 7.523305<br />

60 27.756875 13.878437 6.953125 13.930417 6.979167<br />

61 27.329201 14.088627 7.249488 20.978689 7.715984<br />

62 27.722782 14.265121 7.563206 23.612097 7.159274<br />

63 27.680060 14.065179 7.019345 20.312698 7.471032<br />

64 27.664746 13.819336 6.909668 17.168750 7.354297<br />

65 27.624231 14.428269 7.599231 14.868846 7.215385<br />


66 27.610227 13.805114 7.104830 14.237500 7.537500<br />

67 27.596642 14.371175 7.397295 14.450000 7.025000<br />

68 27.927022 13.791728 6.871324 14.212868 7.291176<br />

69 27.546467 14.317391 7.545652 14.395290 7.185507<br />

70 27.534375 14.136696 7.080268 14.165714 7.465714<br />

71 27.851673 13.937588 7.333099 14.367254 7.360563<br />

72 27.488021 14.091667 7.231250 14.144444 7.258333<br />

73 27.820120 14.264384 7.132192 13.973630 7.158904<br />

74 27.421622 13.733361 7.374071 14.463851 7.424324<br />

75 27.768000 14.217750 7.298000 14.293333 7.325333<br />

76 27.753947 14.052632 6.850658 14.435855 6.876316<br />

77 27.718588 14.195211 7.433523 19.012338 7.461364<br />

78 27.363221 14.013221 7.338221 14.387821 7.365705<br />

79 27.692801 14.173813 6.928481 15.223418 7.272468<br />

80 27.659531 13.996641 7.154766 14.363125 7.181563<br />

81 27.627083 14.132870 7.396065 14.516667 7.113580<br />

82 27.941387 13.980869 7.305869 14.033232 7.312805<br />

83 27.604744 14.114006 7.217846 14.489759 7.244880<br />

84 27.574107 13.945982 7.112054 13.998214 7.158631<br />

85 27.563824 14.076397 7.362132 16.277059 7.370000<br />

86 27.553779 13.932122 7.257122 14.295930 6.992151<br />

87 27.831681 14.059698 7.192888 14.709195 7.508621<br />

88 27.515412 13.918892 6.788778 14.256534 8.013352<br />

89 27.787500 14.325000 7.312500 14.096348 7.641011<br />

90 27.478750 13.887708 7.527917 14.218889 7.556111<br />

91 27.745261 14.303571 7.151786 14.357143 7.178571<br />

92 27.733899 13.857880 6.783832 14.201087 7.100543<br />

93 27.704839 13.995968 7.572177 14.336559 7.294355<br />

94 27.427859 14.131117 7.189827 14.166223 7.234574<br />

95 27.666118 13.964803 7.131711 14.299211 7.422895<br />

96 27.656055 14.114844 6.779297 14.167708 7.083854<br />

97 27.646198 13.952126 7.242719 14.263402 7.269845<br />

98 27.636543 14.082207 7.441263 14.134949 7.195663<br />


99 27.610227 13.923106 7.096402 14.262879 7.122980<br />

100 27.851438 14.067562 7.025438 14.371500 7.051750<br />

101 27.575681 14.176114 7.220235 13.963861 7.512624<br />

102 27.567096 13.775368 7.149449 14.352451 7.176225<br />

103 27.801699 14.160073 7.080036 14.196845 7.090291<br />

104 27.534375 14.007873 6.995913 17.345913 7.038221<br />

105 27.526429 13.890357 7.199464 16.909524 7.226429<br />

106 27.754776 13.995460 7.115802 14.300708 7.395283<br />

107 27.495386 14.114194 7.064895 14.167056 7.326168<br />

108 27.719792 13.983507 6.999479 14.035880 7.010185<br />

109 27.695126 14.084862 7.164908 14.383486 7.191743<br />

110 27.701250 13.971989 7.099773 14.009091 7.126364<br />

111 27.662162 14.071622 7.035811 14.365766 7.062162<br />

112 27.668471 13.945982 6.972991 13.998214 6.999107<br />

113 27.645133 14.044082 7.369082 14.333850 7.159513<br />

114 27.622204 13.935526 7.070230 14.208114 7.331798<br />

115 27.628696 14.032011 7.023261 14.317609 7.049565<br />

116 27.591918 14.141218 6.962716 13.963147 7.205388<br />

117 27.598558 14.020353 7.331090 14.301923 7.143803<br />

118 27.576801 13.887394 7.070975 14.166525 7.310381<br />

119 27.569433 14.009086 6.997532 14.272689 8.783193<br />

120 27.770781 13.878437 7.161719 13.944375 7.844583<br />

121 27.541271 14.205062 7.102531 14.244421 8.859504<br />

122 27.739549 13.869775 7.030635 14.141393 7.702254<br />

123 27.717530 13.974085 6.987043 14.230691 9.137602<br />

124 27.507460 14.063256 6.930696 14.115927 7.159274<br />

125 27.701250 13.950750 7.289100 14.217400 7.316400<br />

126 27.680060 14.051935 7.019345 14.104563 7.471032<br />

127 27.672343 13.941289 7.174311 14.402362 7.201181<br />

128 27.664746 13.819336 6.909668 14.080469 7.144922<br />

129 27.644331 14.126163 7.063081 16.412403 7.089535<br />

130 27.637067 14.017500 7.214135 14.070000 7.241154<br />

131 27.617176 13.910496 6.955248 14.154389 7.173092<br />


132 27.610227 14.007386 7.104830 14.262879 7.131439<br />

133 27.603383 14.090273 7.038863 14.331955 7.468233<br />

134 27.783442 13.798321 7.185588 14.050000 7.400000<br />

135 27.565278 14.079306 6.946944 14.119630 6.972963<br />

136 27.571186 13.963511 7.079917 14.225184 7.303493<br />

137 27.747536 14.068659 7.223130 15.062774 7.054562<br />

138 27.534375 13.954620 7.170788 21.775000 7.197645<br />

139 27.720459 14.046313 6.927113 15.978777 7.145863<br />

140 27.701250 13.945982 7.068348 14.369107 7.082857<br />

141 27.516622 14.036436 7.195745 14.279078 7.234574<br />

142 27.675396 13.925836 7.156822 14.166725 7.171831<br />

143 27.680245 14.026836 6.908392 14.255070 7.859615<br />

144 27.650260 13.917839 7.057422 14.330556 7.258333<br />

145 27.655216 14.005991 7.181379 14.416552 7.208276<br />

146 27.637243 13.898630 7.132192 14.145719 7.342466<br />

147 27.619515 13.997066 7.083673 14.220408 7.281122<br />

148 27.624578 14.071622 7.035811 14.294088 7.243243<br />

149 27.595973 13.977181 7.156586 14.209396 7.183389<br />

150 27.779125 13.884000 7.120000 14.103500 8.196333<br />

151 27.584106 14.134644 7.072848 14.198675 7.088245<br />

152 27.567311 13.876974 7.015337 14.094243 7.217928<br />

153 27.736152 13.949877 7.143995 14.177288 7.181699<br />

154 27.556047 14.032670 7.108442 14.248377 7.124188<br />

155 27.712016 13.942137 7.051815 14.156452 7.251129<br />

156 27.534375 14.013221 7.006611 14.237500 7.698558<br />

157 27.688495 13.923965 7.132046 14.146815 7.158758<br />

158 27.682239 14.004826 7.086907 14.057278 7.113449<br />

159 27.665566 14.084670 7.031840 14.305975 7.226730<br />

160 27.659531 13.996641 6.998320 14.038594 7.024531<br />

161 27.653571 13.899340 7.120691 14.284317 7.136957<br />

162 27.637384 13.978356 7.231250 14.030710 7.423765<br />

163 27.631633 14.056403 7.023083 14.273466 7.213804<br />

164 27.778582 13.807889 6.980259 14.012805 7.169817<br />


165 27.610227 14.047841 7.099773 14.252727 7.126364<br />

166 27.755535 13.953163 7.057003 14.015512 7.244880<br />

167 27.589334 14.029491 7.014746 15.335778 7.191467<br />

168 27.742969 13.945982 6.972991 17.418006 7.308185<br />

169 27.568935 14.021450 7.089719 17.314941 7.274852<br />

170 27.720882 13.929154 7.048015 15.370588 7.833088<br />

171 27.549013 14.013596 6.997039 14.213012 7.033041<br />

172 27.699310 13.922420 6.966061 14.130378 7.138227<br />

173 27.693533 13.996279 7.070484 14.048699 7.251879<br />

174 27.678233 13.915841 7.029849 14.112356 7.056178<br />

175 27.663107 14.131929 6.989679 15.974714 7.915571<br />

176 27.667116 13.899929 6.949964 14.104261 7.128267<br />

177 27.793644 13.972246 7.061547 14.175989 7.087994<br />

178 27.637500 13.893750 7.171875 18.933146 7.189326<br />

179 27.632263 14.105133 6.973324 18.537291 7.879050<br />

180 27.627083 13.887708 6.943854 14.088611 7.407222<br />

181 27.760256 13.949275 7.191298 14.140331 7.218232<br />

182 27.607727 14.019334 7.014251 14.219093 7.325824<br />

183 27.739549 13.942725 7.112705 14.132240 7.139344<br />

184 27.588791 14.002989 6.928940 14.201087 6.954891<br />

185 27.719291 13.927297 7.180135 14.124324 8.049054<br />

186 27.570262 13.995968 6.997984 14.192473 7.024194<br />

187 27.708389 14.063904 7.094418 14.250936 7.129947<br />

188 27.694149 13.847074 7.065559 14.041489 7.225665<br />

189 27.680060 14.047520 7.019345 14.241931 7.329233<br />

190 27.674901 13.973586 7.131711 14.158158 7.563947<br />

191 27.538743 13.909162 6.945844 14.233115 7.936518<br />

192 27.795117 13.958398 7.048730 14.019401 7.075130<br />

193 27.651101 14.033063 7.020855 14.215803 7.177332<br />

194 27.637597 13.952126 7.105090 14.142526 7.131701<br />

195 27.641346 13.889135 7.077212 14.198846 7.103718<br />

196 27.619515 13.945982 6.904879 14.134949 7.067474<br />

197 27.623319 14.002253 7.140895 14.190736 7.159137<br />


198 27.736648 13.939962 6.969981 14.119066 7.258333<br />

199 27.605653 13.995697 7.060741 14.182789 7.095603<br />

200 27.592781 13.925719 7.025438 14.111875 7.051750<br />

201 27.588340 13.989272 6.998787 14.166667 7.150000<br />

202 27.707859 14.052197 7.088057 14.104827 7.371658<br />

203 27.694674 13.974754 7.053140 14.282882 7.079557<br />

204 27.567096 13.906250 7.018566 14.089706 7.044853<br />

205 27.684970 14.098902 6.984329 14.151707 7.141220<br />

206 27.672087 13.900850 7.071936 14.204976 7.098422<br />

207 27.667391 13.962681 7.045833 14.136353 7.072222<br />

208 27.662740 14.015895 6.883594 14.197236 7.038221<br />

209 27.650150 13.948834 7.098176 14.129306 7.244976<br />

210 27.645625 14.009554 7.072321 14.181667 7.218452<br />

211 27.759775 13.943158 7.030895 14.122393 7.930450<br />

212 27.628833 13.995460 6.997730 14.174292 7.268868<br />

213 27.624472 13.937588 7.090229 15.578286 7.360563<br />

214 27.729322 13.989428 7.057097 14.284463 7.083528<br />

215 27.608110 14.040785 7.016512 14.101163 7.050581<br />

216 27.603906 13.859896 6.991753 14.152199 7.134259<br />

217 27.707402 14.034418 7.082575 14.210484 7.101382<br />

218 27.702781 13.977695 7.042431 14.145298 7.191743<br />

219 27.583904 14.020548 7.010274 14.195434 7.044178<br />

220 27.686080 13.964403 6.985994 14.130909 7.126364<br />

221 27.681618 14.022031 7.067647 14.188235 7.094118<br />

222 27.677196 13.951351 7.035811 14.124324 8.005293<br />

223 27.665331 14.008520 7.116508 14.181166 7.030493<br />

224 27.661021 13.945982 6.972991 14.117857 7.111272<br />

225 27.649333 14.002667 7.409250 14.166778 7.198778<br />

226 27.645133 14.051466 7.022041 14.104093 7.055752<br />

227 27.751239 13.989565 7.116079 14.160022 7.135352<br />

228 27.636842 13.928207 6.960444 14.097917 7.339145<br />

229 27.618177 13.983979 7.163237 14.263100 7.073035<br />

230 27.737527 14.039266 7.016005 14.084565 7.158804<br />


231 27.610227 13.971266 7.101218 14.255628 7.120563<br />

232 27.714197 14.026131 6.955523 14.078664 7.097091<br />

233 27.709844 13.965933 7.154855 14.241094 7.181652<br />

234 27.698397 14.020353 7.117147 14.072863 7.143803<br />

235 27.580532 14.067207 6.980346 14.226809 7.227447<br />

236 27.689936 13.894465 7.056833 14.173623 7.196822<br />

237 27.678718 14.061155 7.034098 14.226899 7.053376<br />

238 27.674606 14.002075 7.109716 14.047479 7.136345<br />

239 27.670528 13.943488 7.079969 14.212971 7.113494<br />

240 27.659531 13.989688 7.050469 14.160729 7.076875<br />

241 27.648626 14.042427 7.125078 14.199274 7.151763<br />

242 27.755036 13.984401 7.102531 14.036777 7.129132<br />

243 27.640818 14.036728 7.073302 14.192695 7.092901<br />

244 27.630123 13.869775 7.037474 14.031557 7.180533<br />

245 27.626327 14.024311 7.117730 14.398163 7.144388<br />

246 27.731098 13.967302 6.987043 16.246138 7.224289<br />

247 27.612070 14.018851 7.060096 18.730162 7.086538<br />

248 27.709325 13.854662 7.031628 14.115927 7.057964<br />

249 27.604744 14.114006 7.110617 14.059237 7.238153<br />

250 27.694575 13.950750 7.082175 14.217400 7.316400<br />

251 27.690613 14.001544 7.153685 14.053984 7.080378<br />

252 27.587351 13.945982 7.025967 14.097917 7.052282<br />

253 27.676186 13.996393 7.097134 14.148123 7.123715<br />

254 27.777461 13.941289 7.174311 14.197933 7.306693<br />

255 27.661985 14.082941 7.048015 14.142255 7.172941<br />

256 27.658228 14.034448 7.320337 14.191699 7.144922<br />



Table 2: CPB Comparison at Different Block Lengths (Experiment B)<br />
Num. of Blocks | Block Length = 16: With TBB, Without TBB | Block Length = 24: With TBB, Without TBB | Block Length = 32: With TBB, Without TBB<br />

1 78.1375 51.5375 172.9000 172.9000 194.5125 194.5125<br />

2 39.0688 25.7688 346.3542 251.0375 370.3219 493.7625<br />

3 26.0458 26.0458 496.5333 282.9944 554.1667 640.6167<br />

4 19.5344 19.5344 593.2354 337.4875 714.2516 824.8078<br />

5 30.9225 30.9225 512.4933 270.2117 610.4700 563.5875<br />

6 26.0458 30.4792 513.7125 285.7653 614.8479 653.7781<br />

7 25.8875 22.0875 554.1667 314.1333 649.4438 749.6688<br />

8 19.5344 22.8594 597.5302 335.5479 714.3555 823.1453<br />

9 26.0458 25.8611 548.3787 275.2361 656.5951 650.8688<br />

10 23.2750 23.4413 550.7308 303.0183 662.4231 718.2831<br />

11 21.3102 18.8920 571.4970 325.8500 677.7710 774.5739<br />

12 21.6125 19.5344 588.8021 336.2868 706.8396 822.5911<br />

13 24.0423 25.9606 562.0955 299.6763 672.3534 700.3601<br />

14 22.2063 22.2063 570.3167 311.7583 667.9688 749.6094<br />

15 20.8367 20.8367 574.9294 326.7367 687.4992 787.9696<br />

16 21.0930 17.8719 589.8411 338.8036 716.7973 823.0934<br />

17 21.4169 22.8838 568.4446 308.6382 688.3728 731.1577<br />

18 20.2271 20.2271 581.1361 319.4463 676.1295 769.2295<br />

19 20.4750 19.1625 576.9750 332.6750 693.1750 797.7375<br />

20 19.5344 18.2044 592.2379 337.7092 713.0463 818.9059<br />

21 23.5125 21.0583 572.3222 311.7583 694.5688 748.3625<br />

22 21.2347 20.0256 580.1117 321.9205 683.0608 772.8358<br />

23 21.4679 19.2272 580.5257 329.0304 694.5997 800.2046<br />

24 19.4651 19.4651 593.1431 334.0701 715.4292 814.4518<br />

25 21.8120 20.8145 576.3333 316.5843 694.0938 761.6578<br />

26 20.9731 18.9269 580.8093 322.3971 688.3709 787.2897<br />

27 20.1963 18.2875 582.4086 329.7086 697.5419 805.7583<br />

28 19.5344 18.5250 590.6229 337.6854 714.3406 818.2766<br />

29 20.5806 19.7207 578.0532 317.6713 697.7914 764.5207<br />

30 20.7813 19.0633 585.9389 327.3278 692.7083 779.7125<br />

31 20.1109 19.3065 583.7699 332.3927 696.3462 746.6234<br />

32 19.4824 18.6512 592.5773 335.5133 713.5242 801.6107<br />


33 21.2598 19.6981 580.4308 323.2639 698.2248 738.7545<br />

34 20.6346 20.5857 584.7110 328.0341 696.0252 781.5950<br />

35 20.0450 18.5725 584.3450 331.4867 702.4775 802.2988<br />

36 19.4420 18.7493 592.1887 334.3472 715.4292 795.5293<br />

37 20.3993 19.6804 581.3059 322.9444 698.2051 757.8978<br />

38 20.4750 19.1188 566.9125 329.0292 691.4469 786.1219<br />

39 19.9926 19.3106 557.7190 327.6972 701.7029 781.9718<br />

40 18.8278 18.2044 553.3077 333.3867 692.2858 795.2153<br />

41 20.2744 19.6256 554.5992 320.1732 705.4880 772.0082<br />

42 19.7917 18.5646 554.1667 327.8028 702.6042 792.8938<br />

43 19.9113 18.7128 568.2657 331.4432 704.9773 804.3794<br />

44 20.1011 18.2875 587.2152 338.0920 715.2528 775.1595<br />

45 20.2086 19.0633 583.7961 334.0270 704.2535 774.6881<br />

46 19.7332 19.1910 569.5870 329.4159 701.6473 797.6386<br />

47 21.0112 18.8181 561.5477 327.9252 703.5735 806.1003<br />

48 18.9456 18.3914 567.1434 335.8943 713.5415 808.2521<br />

49 20.1196 19.0679 560.8845 327.6369 683.6098 773.7411<br />

50 19.7505 18.7198 559.3758 328.3327 700.0621 787.6094<br />

51 19.3632 18.8417 565.7064 334.1299 705.6987 749.9994<br />

52 18.9909 18.4793 580.4683 337.0399 712.1095 770.5528<br />

53 19.6050 19.1031 578.6755 327.4184 706.0292 758.2255<br />

54 19.2419 18.7801 579.8225 329.6676 703.5300 792.7662<br />

55 19.8593 17.9550 588.1824 332.1776 705.3836 781.6622<br />

56 19.0000 18.0797 594.6802 335.8448 713.1977 801.7852<br />

57 20.0667 19.1333 588.4861 329.0389 705.0167 785.2250<br />

58 19.6920 18.8321 590.5888 331.4299 704.2837 788.2543<br />

59 19.3864 18.0339 589.1073 333.4393 705.5481 794.4918<br />

60 19.4790 17.7610 593.9928 334.2364 714.5702 801.1588<br />

61 20.0045 19.1596 589.3790 327.0492 705.8403 779.5081<br />

62 19.7087 18.8506 588.5071 331.2665 705.3558 790.1970<br />

63 19.7917 18.5514 575.8759 335.6315 709.3993 761.9792<br />

64 19.4824 18.2615 595.0365 336.3445 713.5502 787.2067<br />



Table 3: CPB Comparison at Different Chunk Sizes (Experiment C)<br />
Each cell is cycles per byte (CPB); columns are chunk sizes in blocks; the final row of each section is the column average.<br />
1 CPU Core<br />
Chunk Size: 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256<br />
76.0983 75.6957 75.5009 75.5009 75.6957 75.6957 74.4748 73.8773 73.0591<br />
74.4878 74.4748 74.28 74.28 74.4878 74.6826 73.2669 72.4486 72.4486<br />
74.28 74.8904 74.28 74.28 74.267 74.4878 73.0591 72.2408 72.4486<br />
74.4748 74.4748 74.4748 74.267 74.28 74.6826 73.0591 72.4616 72.4486<br />
74.28 74.28 74.28 74.28 74.28 74.4748 73.0591 72.4486 72.4551<br />
74.6826 74.4748 74.28 74.28 74.4748 74.4748 72.8513 72.4486 72.4486<br />
74.4748 74.28 74.267 74.0722 74.28 74.4748 72.8513 72.4486 72.4486<br />
74.28 74.28 74.28 74.267 74.28 74.6826 73.0591 72.4486 72.4486<br />
74.28 74.4748 74.28 74.28 74.267 74.4878 73.0591 72.4486 72.4551<br />
74.28 74.28 74.28 74.28 74.28 74.6826 73.0591 72.2538 72.3447<br />
Avg: 74.5618 74.5605 74.4203 74.3787 74.4592 74.6826 73.1799 72.5525 72.5006<br />
4 CPU Cores (chunk sizes 128 and 256: NA)<br />
Chunk Size: 1 | 2 | 4 | 8 | 16 | 32 | 64<br />
29.6263 29.0288 28.4054 27.4053 25.3661 24.353 21.5151<br />
29.6263 29.0158 28.2105 27.6001 25.5739 24.5608 21.5086<br />
29.4314 29.0288 28.4054 27.3923 25.3661 24.5479 21.5151<br />
29.6263 28.808 28.4184 27.3923 25.5609 24.5608 21.5086<br />
29.6393 28.821 28.2105 27.6001 25.5739 24.7557 21.4112<br />
29.4185 29.0158 28.4054 27.3923 25.5739 24.7687 21.5086<br />
29.6263 28.821 28.2105 27.3923 25.3661 24.1452 21.4112<br />
29.6393 29.0158 28.2105 27.4053 25.5739 24.5608 21.5151<br />
29.4185 28.821 28.2105 27.4053 25.5739 24.5479 21.5086<br />
Avg: 29.5613 28.9307 28.2986 27.4428 25.5032 24.5334 21.4891<br />



REFERENCES<br />

[1] Definition of Encryption, www.wikipedia.org<br />
[2] Nigel Smart, "Cryptography: An Introduction."<br />
[3] The OCB Authenticated-Encryption Algorithm, http://www.cs.ucdavis.edu/~rogaway/papers/draft-krovetz-ocb-00.txt<br />
