Hadoop Development - CSC

©2011 CSC 

INNOVATE 

AND DELIVER 

AN INTRODUCTION TO HADOOP AND MAP-REDUCE

Les Klein 

Solution Architect, GBS EMEA 

November 10, 2011

Agenda 

• The Data Challenge 

• What Is Hadoop? 

• Hadoop Architectural Overview 

• What is Map-Reduce? 

• Real-World Example - A Cyber Problem 

• Extras: 

• How to get started 

• Resources 

TBSC 2009 

11/10/2011 12:53 PM 0725-23_TBSC 2009 2

The Data Challenge 

•3 dimensions: 

–Volume 

–Variety 

–Velocity 


Volume 

• In 2006, 161 exabytes of digital information were created, representing roughly 3 million times the information in all the books ever written!

– An exabyte is 1,000 petabytes, a petabyte is 1,000 terabytes, a terabyte is 1,000 gigabytes

• In March 2007, the research firm IDC released a report, sponsored by EMC¹, forecasting that as much as 988 exabytes of digital information would be created in 2010, a six-fold increase from 2006.

• IDC said it expected information to grow at a compound annual rate of 57 percent from 2007 until 2010 to hit the 988 exabyte mark. IDC now forecasts that the total volume of data stored electronically in 2011 will be 1.8 zettabytes²

– A zettabyte is 1,000 exabytes

– As an example, the Large Hadron Collider, at CERN in Switzerland, is expected to generate ~15 petabytes of data per year
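The growth figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes four compounding periods between the 2006 baseline and the 2010 forecast, which is an interpretation of the slide rather than something IDC states here.

```python
# Figures quoted on the slide (IDC forecasts).
created_2006_eb = 161.0    # exabytes created in 2006
forecast_2010_eb = 988.0   # exabytes forecast for 2010
years = 2010 - 2006        # assumption: four compounding periods

growth_factor = forecast_2010_eb / created_2006_eb
cagr = growth_factor ** (1.0 / years) - 1.0

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied compound annual growth rate: {cagr:.1%}")
```

Under that assumption the implied rate comes out close to the 57 percent quoted, and the growth factor is a little over six.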

1 The Expanding Digital Universe, A Forecast of Worldwide Information Growth Through 2010, John F Gantz et al, IDC, March 2007

2 The Diverse and Exploding Digital Universe, An Updated Forecast of Worldwide Information Growth Through 2011, John F Gantz et al, IDC, March 2008

Volume (cont’d) 

• IDC now predicts that the digital universe will be 44 times bigger in 2020 than it was in 2009, totalling a staggering 35 zettabytes.³

• EMC reports that the number of customers storing a petabyte or more of data will grow from 1,000 (reached in 2010) to 100,000 before the end of the decade.⁴

– By 2012 it expects that some customers will be storing exabytes (1,000 petabytes) of information.⁵

• In 2010 Gartner reported that enterprise data will grow by 650 percent over the next five years, and that 80 percent of it will be unstructured.⁶

3 “The Digital Universe Decade – Are You Ready?” IDC, sponsored by EMC Corporation, May 2010. See tab “The Digital Universe Decade,” http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm

4 “EMC’s Record Breaking Product Launch,” Chuck Hollis blog, 14 January 2011, http://chucksblog.emc.com/chucks_blog/2011/01/emcs-record-breaking-product-launch.html

5 Ibid.

6 “Technology Trends You Can’t Afford to Ignore,” Gartner Webinar, January 2010, slide 8, http://www.gartner.com/it/content/1258400/1258425/january_6_techtrends_rpaquet.pdf

Volume (cont’d) 

• If you have a few hundred terabytes of data then products like Teradata, Netezza, Oracle Database Machine, etc., can help you, at a price: $$$$$$$

– Note that these usually also require “structured” data

• If you have many (even tens of) petabytes of data that need to be stored and analyzed

– Products like those are cost-prohibitive for most of us (assuming that the product can scale that far)

• Complexity of analytics is also now becoming a problem

– Difficult to express within the constraints of the tools (e.g. SQL)

– Time taken to get results is unacceptable


Variety 

• Continuing increase in the types of data needing to be stored: 

– Video, voice/sound, images, RFID tags, SMS messages, “chat”, etc. 

• Many of these are not easy to store/process in relational databases for analysis

• Many sources of such data are “unstructured” and/or not easy to structure 

– Often need to know at design time what kind of “questions” will need to be asked 


Velocity 

• The rates at which volume and variety are increasing are themselves increasing!

• “Real time” analysis needs

– It is increasingly necessary to process new data as it streams in rather than overnight


What Is Hadoop? 

• Google famously encountered these problems when it wanted to index the World Wide Web. Its solution:

– Google File System (GFS) for storage

– Google Map-Reduce to rapidly process, in a highly parallel way, data stored in GFS

• Google’s solution is proprietary, and its success created demand for similar capabilities that other companies could use, including its competitors, most notably Yahoo!

• The result of that demand is an Apache open source project called Hadoop that provides:

– HDFS (Hadoop Distributed File System), equivalent in capability to GFS

– MapReduce to process the data stored in HDFS, equivalent in capability to Google Map-Reduce


What Is Hadoop? (cont’d) 

• Hadoop itself is “free” 

• There is already a large “ecosystem” around Hadoop 

– Many projects and products to make it easier to use Hadoop (open source and commercial)

– Commercial support is available in a “RedHat” style model from Cloudera, Hortonworks and some others

– Commercial support is also available from Greenplum (EMC) and IBM (BigInsights), who both have Hadoop-based “products” in their portfolio, as do some others (e.g. Platform Computing)

• Hadoop is widely used in industry today

– Yahoo!, Facebook, LinkedIn, eBay, Quantcast, and many others

• It is potentially “disruptive” technology

– One CSC customer re-implemented an existing OLAP application with Hadoop and got a 3x performance improvement for 10% of the infrastructure cost!


What Is Hadoop? (cont’d) 

• Hadoop can scale 

– Yahoo! has many clusters, the largest is 4000 nodes providing 16PB of HDFS (4 x 1TB HDDs/server) 

– Facebook has a 2000 node cluster providing 21PB of HDFS (12 x 1TB HDDs/server) 

• In July 2011, Facebook announced a 30PB Hadoop cluster in a new “bleeding edge” data centre

• What Hadoop/HDFS is not 

– A Database - it does not require structured data 

– A POSIX file system 

– Real-time – batch only 

• Hadoop is map-reduce only 

– Not all problems necessarily lend themselves to this type of solution


Hadoop Architectural Overview 

[Diagram: Name Node, Job Tracker, Secondary Name Node, Data Nodes]

• Data Nodes are commodity servers

– Provide both storage and processing

• Name Node is a SPoF (single point of failure)

– HA/Resilient server is a good choice

• Secondary Name Node is a “warm” standby (it keeps periodic checkpoints of the Name Node metadata; it does not take over automatically)

– Fail-over requires some downtime

– Ideally the same server choice as the Name Node

• Job Tracker

– Loss does not incur data loss, but in-flight jobs are lost

Hadoop Architectural Overview (cont’d) 

• Hardware failure is to be expected!

– The Facebook 2000 node cluster has 24,000 SATA HDDs

• A 3% failure rate per annum => 720 HDD failures per year = ~2 per day!

– Cheapest commodity servers rather than “enterprise” class devices

• Limited/no redundancy

• HDFS can accommodate disk/node loss without data loss

– Each data block is replicated 3 times (by default) and HDFS can be made rack-aware

• 1st replica on a different server in the same rack

• 2nd replica on a server in a different rack

– If a disk or a Data Node is lost then HDFS automatically creates new replicas in the background for all the lost data blocks (disk space permitting, of course)

• MapReduce will pre-emptively start the same processing tasks on different copies of data blocks when it detects that some nodes appear to be running slowly or have died (speculative execution).
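The disk-failure estimate can be reproduced directly from the figures given for the Facebook cluster (2000 nodes, 12 disks per server, 3% annual failure rate). A minimal sketch:

```python
# Figures from the slides: 2000 nodes, 12 x 1TB disks per server, 3% annual failure rate.
nodes = 2000
disks_per_node = 12
annual_failure_rate = 0.03

total_disks = nodes * disks_per_node
failures_per_year = total_disks * annual_failure_rate
failures_per_day = failures_per_year / 365

print(f"{total_disks:,} disks")
print(f"{failures_per_year:,.0f} expected failures per year")
print(f"{failures_per_day:.2f} expected failures per day")
```

At this scale a failed disk is a routine daily event, which is why HDFS handles re-replication automatically rather than relying on operator intervention.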



• HDFS is designed for Write Once/Read Many operations 

• HDFS block sizes are big 

– 64MB, 128MB and 256MB are common 

– To maximise disk read throughput 

• Hadoop runs one Map task for each HDFS block in the data to be processed, and it takes approximately one minute to start a Map task, so each task needs to run for at least one minute to make the start-up cost worthwhile

• Increase block size to increase task execution time 
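To see how block size drives the number of Map tasks, the sketch below counts blocks for a single input file. It assumes one Map task per block and binary units (1MB = 1024² bytes); the function name `map_task_count` is made up for illustration, and real jobs can split input differently depending on how the input is read.

```python
def map_task_count(input_bytes: int, block_bytes: int) -> int:
    """Number of HDFS blocks, and so Map tasks, for one input file."""
    # Integer ceiling division: a partly filled last block still needs a task.
    return -(-input_bytes // block_bytes)

MB = 1024 ** 2
TB = 1024 ** 4

for block_mb in (64, 128, 256):
    tasks = map_task_count(1 * TB, block_mb * MB)
    print(f"{block_mb:>3}MB blocks -> {tasks:,} Map tasks for 1TB of input")
```

Doubling the block size halves the number of tasks, so each task has twice as much data to read and runs correspondingly longer.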



• Hadoop scales linearly

– Suppose there is 100TB of data to process. With one server and a read rate of 100MB/s, it would take:

• (100 x 10^12) / (100 x 10^6) = 10^6 = 1,000,000 seconds (roughly 11.6 days) to read the files.

– With 100 servers, each with a 100MB/s read rate, and an equal distribution of files across all 100 servers (i.e. each server has 1TB to read), it will take:

• (1 x 10^12) / (100 x 10^6) = 10^4 = 10,000 seconds

• Since all 100 servers read the data in parallel, total read time is 10,000 seconds, i.e. 100 times faster.

– With 1000 servers and an equal distribution of the data across the cluster, all 100TB can be read in just 1,000 seconds, i.e. less than 20 minutes.
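The same calculation as a small function, as a sketch under the slide's assumptions: data spread perfectly evenly and every server sustaining the same read rate. The name `read_time_seconds` is illustrative only.

```python
def read_time_seconds(total_bytes: int, bytes_per_second: int, servers: int) -> float:
    """Time to read all the data when it is spread evenly over `servers` machines."""
    per_server_bytes = total_bytes / servers
    return per_server_bytes / bytes_per_second

TOTAL = 100 * 10**12   # 100TB
RATE = 100 * 10**6     # 100MB/s per server

for n in (1, 100, 1000):
    secs = read_time_seconds(TOTAL, RATE, n)
    print(f"{n:>5} servers: {secs:>9,.0f} s ({secs / 60:,.1f} min)")
```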



• Hadoop/HDFS is a paradigm shift for Parallel Processing (HPC) 

– moves the processing to the data as opposed to moving the data to the processing 

• Developers only have 2 pieces of code to write 

– The map algorithm 

– The reduce algorithm 

• The only other code needed is the Hadoop client code that gets Hadoop to run the map-reduce job

– mostly boiler-plate and can be auto-generated 

• Hadoop does the rest 

– Decides which nodes will run the code 

– Coordinates running all the mappers before it starts any reducers 

– Developers no longer need to worry about rendezvous, semaphores, deadlocks, etc.! 
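The division of labour described above can be imitated in memory. This is a toy sketch, not Hadoop's API: `run_job` stands in for the framework, and the two small functions (a made-up example that totals bytes served per host) stand in for the developer's map and reduce code.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Run map, then shuffle, then reduce, entirely in memory."""
    # Map phase: every record is mapped before any reducer is called.
    mapped = []
    for record in records:
        mapped.extend(map_fn(record))

    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The two pieces a developer writes (hypothetical example data).
def map_bytes(record):
    host, size = record.split()
    yield host, int(size)

def reduce_bytes(host, sizes):
    return sum(sizes)

totals = run_job(["web1 200", "web2 50", "web1 25"], map_bytes, reduce_bytes)
print(totals)
```

Nothing in `map_bytes` or `reduce_bytes` deals with scheduling, grouping or synchronisation; in a real cluster that work is done by the framework across many machines.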


What is Map-Reduce? 

• Map-Reduce is a different way of thinking about a problem.

• Suppose that you want to count the number of times each word is used in the complete works of Shakespeare, and that you have loaded the complete works of Shakespeare into HDFS.

• The first thing to do is to use the map phase to output a key-value pair for each word that you find in the blocks of data that you read. So if you process

“To be, or not to be, that is the question”

you would get the following result:

• [“to”, 1]

• [“be”, 1]

• [“or”, 1]

• [“not”, 1]

• [“to”, 1]

• [“be”, 1]

• [“that”, 1]

• [“is”, 1]

• [“the”, 1]

• [“question”, 1]

What is Map-Reduce? (cont’d) 

• Note that you could choose to optimise the code to aggregate locally before outputting the key-value pairs

– e.g. to get [“to”, 2] and [“be”, 2], but that is not something that has to be done (although in a non-trivial case doing so would minimise I/O, so it may well help).

• The “key” function can be anything you like that generates a unique key for each distinct value that you will encounter

– e.g. we could convert all the characters to upper case and use that string as the key ([“TO”, 1] for example).

• The Hadoop MapReduce framework uses the key to decide which Data Node to send that data to for the reduce phase.

– In this example, the Data Node chosen to get the key-value pairs for the “key for to” will get 2 items to process, as will the one for the “key for be”

– All the others will only get one pair to process.

• All the reduce code has to do is to aggregate all the input pairs for a key into a single key-value pair like [“TO”, 2] (if we are using upper case strings as the key).
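How a key decides where its pairs go can be sketched with a simple hash-and-modulo rule. Hadoop's default partitioner works along these lines using the key's Java hash code; the version below is only an illustration, with CRC32 as a stand-in hash and `choose_reducer` as a made-up name.

```python
import zlib

def choose_reducer(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index; the same key always gives the same index."""
    # CRC32 is a stand-in hash chosen because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

words = "to be or not to be that is the question".split()

assignment = {}
for word in words:
    assignment.setdefault(choose_reducer(word, 4), []).append(word)

for reducer, keys in sorted(assignment.items()):
    print(reducer, keys)
```

Because the rule depends only on the key, both occurrences of “to” land on the same reducer, which is what lets the reduce step see every count for a word in one place.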



Map code for word count example 
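A minimal sketch of what a word-count map function might look like, written in Python for brevity rather than against the Hadoop Java API. The name `map_words`, the word pattern and the choice to lower-case keys are all assumptions made for illustration.

```python
import re
from typing import Iterator, Tuple

# Assumption: a "word" is a run of letters and apostrophes.
WORD = re.compile(r"[A-Za-z']+")

def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) pair per word found in the input line."""
    for word in WORD.findall(line):
        yield (word.lower(), 1)

for pair in map_words("To be, or not to be, that is the question"):
    print(pair)
```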



Reduce code for word count example 
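A matching sketch of a word-count reduce function, again in Python rather than the Hadoop Java API. It assumes the framework has already grouped all the counts for one key; `reduce_counts` is an illustrative name.

```python
from typing import Iterable, Tuple

def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Collapse all the counts received for one key into a single pair."""
    return (key, sum(values))

print(reduce_counts("to", [1, 1]))
print(reduce_counts("question", [1]))
```

The same function also works when the map side has already aggregated locally, since it sums whatever partial counts it is given.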



• It is unlikely that you will have more nodes in your cluster than possible key values

– Hadoop will send pairs for more than a single key value to a given Data Node, and calls your reduce code separately for each key value on that Data Node

• your reduce code does not need to worry about that

• Similarly, if there are more data blocks than Data Nodes

– Hadoop starts a separate Map task for each block to be read by a Data Node

– Hadoop decides which blocks will be processed on which nodes

• Usually it has 3 choices of Data Node for each block

– your map code doesn’t need to worry about any of this

(Note that in general Hadoop will run multiple tasks concurrently on Data Nodes)


Real world example - a Cyber problem 

• We would like to be able to trace the use of any Linux system calls back to the user who was ultimately responsible for them.

• Assume malicious users will try to hide their identity by spawning layers of child processes that make it difficult to track back to the original process that is their "terminal" (login) session.

• What we would like to know is whether Hadoop can be used to solve this problem by doing the track back for any (or all) SYSCALL (system call) events.

• As for the scale, consider a datacentre with:

– servers generating audit events at the rate of ~20 per second each, i.e. ~72,000 per server per hour

– 1000 such servers, i.e. ~72 million events per hour, or ~1.7 billion events a day (!)
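The scale estimate is straightforward to reproduce from the two inputs given (20 events per second per server, 1000 servers). A short sketch:

```python
events_per_second_per_server = 20
servers = 1000

per_server_per_hour = events_per_second_per_server * 3600
per_hour = per_server_per_hour * servers
per_day = per_hour * 24

print(f"{per_server_per_hour:,} events per server per hour")
print(f"{per_hour:,} events per hour across the datacentre")
print(f"{per_day:,} events per day across the datacentre")
```

The daily total works out to about 1.7 billion events.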


Real world example - a Cyber problem (cont’d) 

• The problem can be addressed by processing the audit logs produced by the auditd daemon that is supplied with Linux kernels 2.6.x (or newer).

• In order to get auditd to log the data we are interested in, it is necessary to set up rules in /etc/audit/audit.rules on each server, as specified in the NSA document “Guide to the Secure Configuration of Red Hat Enterprise Linux 5, Revision 4, September 14, 2010”

• The data to process looks something like this:



node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11 success=yes exit=0 a0=9f20008 a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)

node=192.168.1.3 type=EXECVE msg=audit(1294649964.027:33673): argc=7 a0="iconv" a1="-f" a2="utf-8" a3="-t" a4="utf-8" a5="-o" a6="/dev/null"

node=192.168.1.3 type=CWD msg=audit(1294649964.027:33673): cwd="/usr/share/man/man0p"

node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=0 name="/usr/bin/iconv" inode=18564672 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:bin_t:s0

node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=1 name=(null) inode=9994257 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0

node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500 auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'

node=192.168.1.3 type=USER_ACCT msg=audit(1294649964.060:33675): user pid=9836 uid=500 auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: accounting acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'


• The events we are interested in are the SYSCALL events such as:

node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11 success=yes exit=0 a0=9f20008 a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)

• The process id making this system call is 10139, and this process has a parent process 3696

• What we need to do is find which process created 3696. If the event for that process is not one with a "terminal" entry, such as:

node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500 auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'

• then we need to trace the ppid of that process, and so on, until we get to an event that has a "terminal" entry. This is an interesting problem since we have no idea how deep the process creation hierarchy will be!
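Pulling pid, ppid and "terminal" out of a record is mostly a matter of splitting key=value fields. The sketch below is one possible approach, not the parser used in the original work; `parse_event` is a made-up name, the SYSCALL sample is abbreviated from the record above, and the regular expression only covers the field shapes seen in these samples.

```python
import re

# key=value, where the value is either a double-quoted string or a run of non-spaces.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_event(line: str) -> dict:
    """Split one auditd record into a dict of its key=value fields."""
    fields = {}
    for key, value in FIELD.findall(line):
        # Keep the first occurrence, so the outer msg=audit(...) wins over msg='PAM: ...'.
        fields.setdefault(key, value.strip('"'))
    return fields

# Abbreviated copy of the sample SYSCALL event.
SYSCALL = ('node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): '
           'syscall=11 success=yes exit=0 ppid=3696 pid=10139 uid=0 '
           'tty=(none) comm="iconv" exe="/usr/bin/iconv"')

# The sample USER_AUTH event.
USER_AUTH = ("node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): "
             "user pid=9836 uid=500 auid=4294967295 "
             "subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication "
             'acct="root" : exe="/bin/su" '
             "(hostname=?, addr=?, terminal=pts/1 res=success)'")

syscall = parse_event(SYSCALL)
auth = parse_event(USER_AUTH)
print(syscall["node"], syscall["pid"], syscall["ppid"], "terminal" in syscall)
print(auth["node"], auth["pid"], auth.get("terminal"))
```

A map function for the trace-back job could use a parser like this to emit the server address and ppid as the key for each event.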



• As always, there is more than one way to approach this, but one simple way would seem to be to have a map-reduce job that takes as input:

– The auditd logs, and

– A list of parent process ids that we want to trace back

• and which outputs a file that contains:

– A list of parent process-ids, each as the key to a list of auditd events that all have this process-id as a ppid entry.

– A list of parent process-ids found that also have "terminal" entries (i.e. for which we have now found the user).

• We can then repeatedly run this map-reduce job, using the output of one run together with the auditd logs as input to the next, until there is no difference in the output file between two consecutive runs, or until the output file is empty (whichever occurs first).

(Note that since we will be dealing with audit logs from many different servers, we will need to combine the IP address of the server with the process-id to form a unique key)
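The repeated-run idea can be imitated in memory, with each loop iteration standing in for one run of the map-reduce job. This is a sketch under stated assumptions: `trace_back` and the lookup tables are illustrative, and the link from process 3696 to 9836 is invented for the example (the sample logs do not show it).

```python
from typing import Dict, Optional, Set, Tuple

Key = Tuple[str, int]  # (server IP address, process id), unique across servers

def trace_back(start: Key,
               parent_of: Dict[Key, Key],
               has_terminal: Set[Key]) -> Tuple[Optional[Key], int]:
    """Follow ppid links until a process with a "terminal" entry is found.

    Returns (process key or None if the chain runs out, number of passes).
    """
    current = start
    visited = set()
    passes = 0
    while current not in has_terminal:
        if current in visited or current not in parent_of:
            return None, passes      # no further progress is possible
        visited.add(current)
        current = parent_of[current]
        passes += 1                  # one pass ~ one run of the job
    return current, passes

NODE = "192.168.1.3"
parent_of = {
    (NODE, 10139): (NODE, 3696),   # pid/ppid pair from the sample SYSCALL event
    (NODE, 3696): (NODE, 9836),    # hypothetical link, invented for this example
}
has_terminal = {(NODE, 9836)}      # pid 9836 appears in the sample USER_AUTH event

owner, passes = trace_back((NODE, 10139), parent_of, has_terminal)
print(owner, passes)
```

The early return when no parent is known corresponds to the stopping rule above: a run that produces nothing new means the trace cannot go any further.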



• A first question is where to get the initial list of process IDs from, and there are two obvious options:

– Wait for the SOC staff to spot SYSCALL events that they are interested in, or

– Make a first pass through the audit logs for a given day, extract all the SYSCALL events on that day, and then find the owner UIDs for all of them

(Note that since the dataset used for development was quite small, option 2 was practical)
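The first pass in option 2 amounts to a filter on event type and timestamp. A minimal sketch, assuming the number before the dot in `audit(...)` is seconds since the Unix epoch and that days are taken in UTC; `syscall_on_day` is an illustrative name and the sample lines are abbreviated.

```python
import re
from datetime import date, datetime, timezone

# Captures the event type and the whole-second part of the audit timestamp.
STAMP = re.compile(r"type=(\w+) msg=audit\((\d+)\.\d+:\d+\)")

def syscall_on_day(line: str, day: date) -> bool:
    """True if the record is a SYSCALL event logged on the given (UTC) day."""
    match = STAMP.search(line)
    if match is None or match.group(1) != "SYSCALL":
        return False
    when = datetime.fromtimestamp(int(match.group(2)), tz=timezone.utc)
    return when.date() == day

lines = [
    "node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): ppid=3696 pid=10139",
    'node=192.168.1.3 type=CWD msg=audit(1294649964.027:33673): cwd="/usr/share/man/man0p"',
]
selected = [l for l in lines if syscall_on_day(l, date(2011, 1, 10))]
print(len(selected))
```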



Sample auditd log data 

• node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11 success=yes exit=0 a0=9f20008 a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)

• node=192.168.1.3 type=EXECVE msg=audit(1294649964.027:33673): argc=7 a0="iconv" a1="-f" a2="utf-8" a3="-t" a4="utf-8" a5="-o" a6="/dev/null"

• node=192.168.1.3 type=CWD msg=audit(1294649964.027:33673): cwd="/usr/share/man/man0p"

• node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=0 name="/usr/bin/iconv" inode=18564672 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:bin_t:s0

• node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=1 name=(null) inode=9994257 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0

• node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500 auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'

• node=192.168.1.3 type=USER_ACCT msg=audit(1294649964.060:33675): user pid=9836 uid=500 auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: accounting acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'

• node=192.168.1.3 type=SYSCALL msg=audit(1294649964.063:33676): arch=40000003 syscall=11 success=yes exit=0 a0=9f12820 a1=9f00c48 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10141 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="gawk" exe="/bin/gawk" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)

• node=192.168.1.3 type=EXECVE msg=audit(1294649964.063:33676): argc=6 a0="/usr/bin/gawk"

• node=192.168.1.3 type=CWD msg=audit(1294649964.063:33676): cwd="/usr/share/man/man0p"


Sample output from SYSCALL map-reduce run 

• 10530:192.168.1.3 audit(1294650031.384:34011) 

• 10531:192.168.1.3 audit(1294650031.471:34015) audit(1294650031.481:34016) audit(1294650031.490:34017) 

• 10551:192.168.1.3 audit(1294650031.564:34022) 

• 10555:192.168.1.3 audit(1294650061.348:34028) audit(1294650061.341:34027) 

• 10556:192.168.1.3 audit(1294650061.355:34029) audit(1294650061.447:34034) audit(1294650061.408:34031) audit(1294650061.446:34035) 

• 10557:192.168.1.3 audit(1294650061.405:34030) 

• 10559:192.168.1.3 audit(1294650061.413:34033) audit(1294650061.408:34032) 

• 10561:192.168.1.3 audit(1294650061.470:34036) 

• 10563:192.168.1.3 audit(1294650061.473:34037) audit(1294650061.477:34038) 

• 10668:192.168.1.3 audit(1294653661.604:34047) audit(1294653661.610:34048) 


• 10670:192.168.1.3 audit(1294653661.682:34051) 

• 10672:192.168.1.3 audit(1294653661.686:34053) audit(1294653661.685:34052) 

• 10674:192.168.1.3 audit(1294653661.743:34056) 

• 10676:192.168.1.3 audit(1294653661.745:34057) audit(1294653661.748:34058) 

• 10702:192.168.1.3 audit(1294654578.626:34062) 

• 10787:192.168.1.3 audit(1294657261.887:34072) audit(1294657261.881:34071) 


• 10789:192.168.1.3 audit(1294657261.956:34075) 

• 10791:192.168.1.3 audit(1294657261.974:34077) audit(1294657261.971:34076) 

• 10793:192.168.1.3 audit(1294657262.022:34080) 

• 10795:192.168.1.3 audit(1294657262.028:34081) audit(1294657262.031:34082) 

• 10898:192.168.1.3 audit(1294660861.189:34090) audit(1294660861.197:34091) 

• 10899:192.168.1.3 audit(1294660861.257:34093) audit(1294660861.258:34092) 

• ... 

• ... 

• 9833:192.168.1.3 audit(1294649959.818:33419) 

• 9836:192.168.1.3 audit(1294649964.126:33683) audit(1294649964.108:33679) audit(1294649964.115:33680) 

• 9840:192.168.1.3 audit(1294649959.993:33425) 

• 9841:192.168.1.3 audit(1294649959.999:33426) 



Sample output from trackback map-reduce run 

• ... 

• ... 

• *10530:192.168.1.3 audit(1294650031.326:34005) 

• *10555:192.168.1.3 audit(1294650061.485:34040) audit(1294650061.485:34039) audit(1294650061.337:34026) audit(1294650061.330:34025) audit(1294650061.324:34024) audit(1294650061.324:34023) 






• *9836:192.168.1.3 audit(1294649964.119:33681) audit(1294649964.060:33675) audit(1294649964.123:33682) audit(1294649964.060:33674) 

• 10002:192.168.1.3:10003 audit(1294649962.248:33560) 

• 10008:192.168.1.3:10009 audit(1294649962.301:33565) 

• 10014:192.168.1.3:10015 audit(1294649962.368:33570) 

• 10020:192.168.1.3:10021 audit(1294649962.464:33575) 

• 10026:192.168.1.3:10027 audit(1294649962.540:33580) 

• 10032:192.168.1.3:10033 audit(1294649962.624:33585) 

• 10038:192.168.1.3:10039 audit(1294649962.693:33590) 

• 10044:192.168.1.3:10045 audit(1294649962.781:33595) 

• 10050:192.168.1.3:10051 audit(1294649962.857:33600) 

• 10056:192.168.1.3:10057 audit(1294649962.922:33605) 

• 10062:192.168.1.3:10063 audit(1294649963.007:33610) 

• 10068:192.168.1.3:10069 audit(1294649963.108:33615) 

• 10074:192.168.1.3:10075 audit(1294649963.188:33620) 

• 10080:192.168.1.3:10081 audit(1294649963.279:33625) 

• 10086:192.168.1.3:10087 audit(1294649963.343:33630) 

• 10092:192.168.1.3:10093 audit(1294649963.440:33635) 

• 10098:192.168.1.3:10099 audit(1294649963.500:33640) 

Here you can see that although the output is one file, it contains the two lists in that the keys that have an "*" in front of them are the auditd events that contain "terminal" entries. Any lines that start with "*" will be ignored on the next run.

How to get started... 

• Download and install a Java SDK (version 5+) 

• Download and install Eclipse 

• Download and install Karmasphere Studio Community Edition for Eclipse 

• Grab a copy of my 2011 LEF Report – Hadoop, a Practitioner’s Guide

• Get a copy of Tom White’s book – Hadoop: The Definitive Guide (O’Reilly)

• Windows users also need to:

– Download and install Cygwin, with X windows

• Open an xterm window in Cygwin and run Eclipse from there instead of Windows

• No need to download and install Hadoop!

– Karmasphere Studio gives you all you need to develop and test Hadoop MapReduce jobs


Resources 

• My LEF report: 

– http://assets1.csc.com/lef/downloads/CSC_Grant_2011_A_Practioner_s_Guide_for_Hadoop_Development.pdf 

• LEF Data rEvolution Report: 

– http://www.csc.com/lef/ds/22182-reports 

• Hadoop Wiki: 

– http://wiki.apache.org/hadoop/ 

• Hadoop home page: 

– http://hadoop.apache.org/#News 

• Karmasphere Studio: 

– http://www.karmasphere.com/Products-Information/karmasphere-studio.html 

• If you want to contact me: lklein2@csc.com 


Questions? 


Thank You
