Hadoop Development - CSC
©2011 <strong>CSC</strong><br />
INNOVATE<br />
AND DELIVER<br />
AN INTRODUCTION TO HADOOP<br />
AND MAP-REDUCE<br />
Les Klein<br />
Solution Architect, GBS EMEA<br />
November 10, 2011
Agenda<br />
• The Data Challenge<br />
• What Is <strong>Hadoop</strong>?<br />
• <strong>Hadoop</strong> Architectural Overview<br />
• What is Map-Reduce?<br />
• Real-World Example - A Cyber Problem<br />
• Extras:<br />
• How to get started<br />
• Resources<br />
TBSC 2009<br />
11/10/2011 12:53 PM 0725-23_TBSC 2009 2
The Data Challenge<br />
• 3 dimensions:<br />
– Volume<br />
– Variety<br />
– Velocity<br />
Volume<br />
• In 2006, 161 exabytes of digital information were created, representing roughly 3<br />
million times the information in all the books ever written!<br />
– an exabyte is 1,000 petabytes, a petabyte is 1,000 terabytes, a terabyte is 1,000 gigabytes<br />
• In March 2007, the research firm IDC released a report, sponsored by EMC [1], forecasting that<br />
as much as 988 exabytes of digital information would be created in 2010, a six-fold<br />
increase from 2006.<br />
• IDC said it expected information to grow at a compound annual rate of 57 percent<br />
from 2007 until 2010 to hit the 988 exabyte mark. IDC now forecasts the total<br />
volume of data stored electronically in 2011 will be 1.8 zettabytes [2]<br />
– a zettabyte is 1,000 exabytes<br />
– As an example, the Large Hadron Collider, at CERN in Switzerland, is expected to generate ~15<br />
petabytes of data per year<br />
[1] The Expanding Digital Universe, A Forecast of Worldwide Information Growth Through 2010, John F Gantz et al, IDC, March 2007<br />
[2] The Diverse and Exploding Digital Universe, An Updated Forecast of Worldwide Information Growth Through 2011, John F Gantz et al, IDC, March 2008<br />
Volume (cont’d)<br />
• IDC now predicts that the digital universe will be 44 times bigger in 2020<br />
than it was in 2009, totalling a staggering 35 zettabytes. [3]<br />
• EMC reports that the number of customers storing a petabyte or more of<br />
data will grow from 1,000 (reached in 2010) to 100,000 before the end of<br />
the decade. [4]<br />
– By 2012 it expects that some customers will be storing exabytes (1,000 petabytes) of<br />
information. [5]<br />
• In 2010 Gartner reported that enterprise data growth would be 650 percent<br />
over the next five years, and that 80 percent of that would be unstructured. [6]<br />
[3] “The Digital Universe Decade – Are You Ready?” IDC, sponsored by EMC Corporation, May 2010. See tab “The Digital Universe Decade,”<br />
http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm<br />
[4] “EMC’s Record Breaking Product Launch,” Chuck Hollis blog, 14 January 2011, http://chucksblog.emc.com/chucks_blog/2011/01/emcs-record-breaking-product-launch.html<br />
[5] Ibid.<br />
[6] “Technology Trends You Can’t Afford to Ignore,” Gartner Webinar, January 2010, slide 8,<br />
http://www.gartner.com/it/content/1258400/1258425/january_6_techtrends_rpaquet.pdf<br />
Volume (cont’d)<br />
• If you have a few hundred terabytes of data then products like Teradata,<br />
Netezza, Oracle Database Machine, etc., can help you, at considerable cost ($$$$$$$)<br />
– Note that these usually also require “structured” data<br />
• If you have petabytes (even tens of petabytes) of data that need to be stored<br />
and analyzed<br />
– Products like those are cost-prohibitive for most of us (assuming that the product can<br />
scale that far)<br />
• Complexity of analytics is also now becoming a problem<br />
– Difficult to express within the constraints of the tools (e.g. SQL)<br />
– Time taken to get results is unacceptable<br />
Variety<br />
• Continuing increase in the types of data needing to be stored:<br />
– Video, voice/sound, images, RFID tags, SMS messages, “chat”, etc.<br />
• Many of these are not easy to store/process in relational databases for<br />
analysis<br />
• Many sources of such data are “unstructured” and/or not easy to structure<br />
– Often need to know at design time what kind of “questions” will need to be asked<br />
Velocity<br />
• The rates at which volume and variety are growing are themselves<br />
increasing!<br />
• “Real-time” analysis needs<br />
– It is increasingly necessary to process new data as it streams in rather than<br />
overnight<br />
What Is <strong>Hadoop</strong>?<br />
• Google famously encountered these problems when it wanted to index the<br />
World Wide Web. Their solution:<br />
– Google File System (GFS) for storage<br />
– Google Map-Reduce to be able to rapidly process, in a highly parallel way, data<br />
stored in GFS<br />
• Google’s solution is proprietary, and its success created demand for<br />
similar capabilities that other companies could use, including its<br />
competitors, most notably Yahoo!<br />
• The result of that demand is an Apache Open Source project called<br />
<strong>Hadoop</strong> that provides:<br />
– HDFS (<strong>Hadoop</strong> Distributed File System), equivalent in capability to GFS<br />
– MapReduce to process the data stored in HDFS, equivalent in capability to Google<br />
Map-Reduce<br />
What Is <strong>Hadoop</strong>? (cont’d)<br />
• <strong>Hadoop</strong> itself is “free”<br />
• There is already a large “ecosystem” around <strong>Hadoop</strong><br />
– Many projects and products to make it easier to use <strong>Hadoop</strong> (open source and<br />
commercial)<br />
– Commercial support is available in a “RedHat” style model from Cloudera,<br />
Hortonworks and some others<br />
– Commercial support is also available from Greenplum (EMC) and IBM (BigInsights)<br />
who both have <strong>Hadoop</strong> based “products” in their portfolio, as do some others (e.g.<br />
Platform Computing)<br />
• <strong>Hadoop</strong> is widely used in industry today<br />
– Yahoo!, Facebook, LinkedIn, eBay, Quantcast, and many others<br />
• It is potentially “disruptive” technology<br />
– One <strong>CSC</strong> customer had an existing OLAP application that it re-implemented with<br />
<strong>Hadoop</strong> and got a 3x performance improvement for 10% of the infrastructure<br />
cost!<br />
What Is <strong>Hadoop</strong>? (cont’d)<br />
• <strong>Hadoop</strong> can scale<br />
– Yahoo! has many clusters, the largest is 4000 nodes providing 16PB of HDFS (4 x 1TB HDDs/server)<br />
– Facebook has a 2000 node cluster providing 21PB of HDFS (12 x 1TB HDDs/server)<br />
• In July 2011, Facebook announced a 30PB <strong>Hadoop</strong> cluster in a new “bleeding edge” data centre<br />
• What <strong>Hadoop</strong>/HDFS is not<br />
– A Database - it does not require structured data<br />
– A POSIX file system<br />
– Real-time – batch only<br />
• <strong>Hadoop</strong> is map-reduce only<br />
– Not all problems necessarily lend themselves to this type of solution<br />
<strong>Hadoop</strong> Architectural Overview<br />
[Diagram: Name Node, Job Tracker, Secondary Name Node and Data Nodes]<br />
• Data Nodes are commodity servers<br />
– Provide both storage and processing<br />
• Name Node is a SPoF (single point of failure)<br />
– HA/Resilient server is a good choice<br />
• Secondary Name Node is a “warm” standby<br />
– Fail-over requires some downtime<br />
– Ideally the same server choice as the Name Node<br />
• Job Tracker<br />
– Loss does not incur data loss, but in-flight jobs are lost<br />
<strong>Hadoop</strong> Architectural Overview (cont’d)<br />
• Hardware Failure is to be expected!<br />
– The Facebook 2000 node cluster has 24,000 SATA HDDs<br />
• A 3% failure rate per annum => 720 HDD failures per year, i.e. ~2 per day!<br />
– Cheapest commodity servers rather than “enterprise” class devices<br />
• Limited/No redundancy<br />
• HDFS can accommodate disk/node loss without data loss<br />
– Each data block is replicated 3 times (by default) and HDFS can be made rack-aware<br />
• 1st replica on a different server in the same rack<br />
• 2nd replica on a server in a different rack<br />
– If a disk or a Data Node is lost then HDFS automatically creates new replicas in background for all the<br />
lost data blocks (disk space permitting of course)<br />
• MapReduce will pre-emptively start the same processing tasks on different copies of<br />
data blocks when it detects that some nodes appear to be running slowly (or have<br />
died).<br />
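The failure-rate arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `expected_disk_failures` is made up for illustration, and the inputs are the slide's own figures (2,000 nodes with 12 disks each and a 3% annual failure rate).

```python
def expected_disk_failures(disk_count, annual_failure_rate):
    """Return (expected failures per year, expected failures per day)."""
    per_year = disk_count * annual_failure_rate
    return per_year, per_year / 365


# Figures from the slide: 2,000 nodes with 12 disks each, 3% annual failure rate.
per_year, per_day = expected_disk_failures(2000 * 12, 0.03)
print(f"{per_year:.0f} failures per year, about {per_day:.1f} per day")
```

At this scale a disk replacement is a daily routine rather than an incident, which is why HDFS replicates blocks instead of relying on redundant hardware.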
<strong>Hadoop</strong> Architectural Overview (cont’d)<br />
• HDFS is designed for Write Once/Read Many operations<br />
• HDFS block sizes are big<br />
– 64MB, 128MB and 256MB are common<br />
– To maximise disk read throughput<br />
• <strong>Hadoop</strong> runs one Map task for each HDFS block in the data to be processed and<br />
takes approximately one minute to start a map task, so each task needs to run for at least<br />
one minute to be worthwhile<br />
• Increase the block size to increase task execution time<br />
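To see how the block size drives the number of map tasks, here is a rough sketch that assumes, as the slide does, one map task per HDFS block. The helper `map_task_count` and the 1 TB input size are illustrative assumptions, not Hadoop APIs or figures from the deck.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def map_task_count(input_bytes, block_bytes):
    """Number of blocks (and so map tasks) needed to cover the input."""
    return (input_bytes + block_bytes - 1) // block_bytes  # ceiling division


# Hypothetical 1 TB input, tried against the common block sizes from the slide.
for block_mb in (64, 128, 256):
    tasks = map_task_count(1 * TB, block_mb * MB)
    print(f"{block_mb:>3} MB blocks -> {tasks:,} map tasks")
```

Doubling the block size halves the number of tasks, so each task has more data to read and its start-up cost is spread over a longer run.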
<strong>Hadoop</strong> Architectural Overview (cont’d)<br />
• <strong>Hadoop</strong> scales linearly<br />
– Suppose 100TB of data to process. With one server and a read rate of 100MB/s, it<br />
would take:<br />
• (100 x 10^12) / (100 x 10^6) = 10^6 = 1,000,000 seconds to read the files.<br />
– With 100 servers each with a 100MB/s read rate and the equal distribution of files<br />
across all 100 servers, i.e. each server has 1TB to read, it will take:<br />
• (1 x 10^12) / (100 x 10^6) = 10^4 = 10,000 seconds<br />
• Since all 100 read the data in parallel, total read time is 10,000 seconds, i.e. 100<br />
times faster.<br />
– With 1,000 servers and equal distribution of the data across the cluster, all 100TB<br />
can be read in just 1,000 seconds, i.e. less than 20 minutes.<br />
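The scaling arithmetic above can be reproduced with a short sketch. It assumes, as the slide does, that the data is spread evenly and that every server reads at the same rate; the function name `parallel_read_seconds` is illustrative only.

```python
def parallel_read_seconds(total_bytes, servers, read_rate_bytes_per_s):
    """Time to read the data when it is spread evenly over all servers."""
    bytes_per_server = total_bytes / servers
    return bytes_per_server / read_rate_bytes_per_s


TOTAL = 100 * 10**12  # 100 TB, as on the slide
RATE = 100 * 10**6    # 100 MB/s per server, as on the slide

for servers in (1, 100, 1000):
    seconds = parallel_read_seconds(TOTAL, servers, RATE)
    print(f"{servers:>5} servers: {seconds:>10,.0f} s ({seconds / 60:,.1f} min)")
```

The model ignores scheduling, network and skew effects, so real clusters will fall somewhat short of this ideal.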
<strong>Hadoop</strong> Architectural Overview (cont’d)<br />
• <strong>Hadoop</strong>/HDFS is a paradigm shift for Parallel Processing (HPC)<br />
– moves the processing to the data as opposed to moving the data to the processing<br />
• Developers only have 2 pieces of code to write<br />
– The map algorithm<br />
– The reduce algorithm<br />
• The only other code needed is the <strong>Hadoop</strong> client code needed to get<br />
<strong>Hadoop</strong> to run the map-reduce job<br />
– mostly boiler-plate and can be auto-generated<br />
• <strong>Hadoop</strong> does the rest<br />
– Decides which nodes will run the code<br />
– Coordinates running all the mappers before it starts any reducers<br />
– Developers no longer need to worry about rendezvous, semaphores, deadlocks, etc.!<br />
What is Map-Reduce?<br />
• Map-Reduce is a different way of thinking about a problem.<br />
• Suppose that you want to count the number of times each word is used in the complete works of Shakespeare, and that you have loaded the complete works of Shakespeare into HDFS.<br />
• The first thing to do is to use the map phase to output a key-value pair for each word that you find in the blocks of data that you read. So if you process<br />
“To be, or not to be, that is the question”<br />
you would get the following result:<br />
• [to, 1]<br />
• [be, 1]<br />
• [or, 1]<br />
• [not, 1]<br />
• [to, 1]<br />
• [be, 1]<br />
• [that, 1]<br />
• [is, 1]<br />
• [the, 1]<br />
• [question, 1]<br />
What is Map-Reduce? (cont’d)<br />
• Note that you could choose to optimise the code to aggregate locally before outputting<br />
the key values<br />
– e.g. to get [to, 2] and [be, 2], but that is not something that has to be done (although<br />
in a non-trivial case doing so would minimise I/O, so may well help).<br />
• The “key” function can be anything you like that generates a unique key for each value<br />
that you will encounter<br />
– e.g. we could convert all the characters to upper case and use that string as the key ([“TO”, 1] for<br />
example).<br />
• The <strong>Hadoop</strong> MapReduce framework uses the key to decide which Data Node to send<br />
that data to for the reduce phase.<br />
– In this example, the Data Node chosen to get the key value pairs for the “key for to” will get 2 items to<br />
process as will the one for the “key for be”<br />
– all the others will only get one pair to process.<br />
• All the reduce code has to do is to aggregate all the input pairs into a single key value<br />
pair like [“TO”, 2] (if we are using upper case strings as the key).<br />
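The two ideas on this slide, aggregating locally before output and using the key to pick the destination for the reduce phase, can be sketched in plain Python. This is not Hadoop code: `combine` and `partition` are illustrative names, three reducers is an arbitrary choice, and CRC32 stands in for whatever hash function the framework applies to the key.

```python
import zlib
from collections import Counter, defaultdict


def combine(pairs):
    """Locally aggregate (key, count) pairs before they leave the mapper."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return list(totals.items())


def partition(key, num_reducers):
    """Pick a reducer from the key alone, so equal keys always meet."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Upper-case strings as keys, as suggested on the slide.
pairs = [(word, 1) for word in "TO BE OR NOT TO BE THAT IS THE QUESTION".split()]
combined = combine(pairs)  # 8 pairs instead of 10: ("TO", 2), ("BE", 2), ...

buckets = defaultdict(list)
for key, count in combined:
    buckets[partition(key, 3)].append((key, count))

for reducer, items in sorted(buckets.items()):
    print(reducer, items)
```

Because the destination depends only on the key, every pair for “TO” ends up at the same reducer no matter which mapper produced it.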
What is Map-Reduce? (cont’d)<br />
Map code for word count example<br />
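A minimal sketch of what map code for this example might look like, written in plain Python rather than against the Hadoop API. The function name `map_words`, the regular expression used to split words, and the upper-casing of keys are assumptions for illustration, not the presenter's code.

```python
import re


def map_words(line):
    """Emit one (key, 1) pair for every word found in a line of input."""
    for word in re.findall(r"[A-Za-z']+", line):
        yield word.upper(), 1  # upper-case string used as the key


pairs = list(map_words("To be, or not to be, that is the question"))
for pair in pairs:
    print(pair)
```

The mapper keeps no state between words; all counting is left to the reduce phase.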
What is Map-Reduce? (cont’d)<br />
Reduce code for word count example<br />
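A matching sketch of reduce code, again in plain Python and not the presenter's code. `reduce_counts` plays the role of the reduce function, while `run_reduce` is a hypothetical stand-in for the framework, which sorts and groups the pairs by key before calling it.

```python
from itertools import groupby
from operator import itemgetter


def reduce_counts(key, counts):
    """Collapse all the counts seen for one key into a single pair."""
    return key, sum(counts)


def run_reduce(pairs):
    """Stand-in for the framework: sort and group by key, then reduce."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        reduce_counts(key, (count for _, count in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


mapped = [("TO", 1), ("BE", 1), ("OR", 1), ("NOT", 1), ("TO", 1), ("BE", 1)]
result = run_reduce(mapped)
print(result)
```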
What is Map-Reduce? (cont’d)<br />
• It is unlikely that you will have more nodes in your cluster than possible key values<br />
– <strong>Hadoop</strong> will send pairs for more than a single key value to a given Data Node, and<br />
invokes the reduce code separately for each key value on that Data Node<br />
• your reduce code does not need to worry about that<br />
• Similarly, if there are more data blocks than Data Nodes<br />
– <strong>Hadoop</strong> starts a separate Map task for each block to be read by a Data Node<br />
– <strong>Hadoop</strong> decides which blocks will be processed on which nodes<br />
• Usually it has 3 choices of Data Node for each block<br />
– your map code doesn’t need to worry about any of this<br />
(Note that in general <strong>Hadoop</strong> will run multiple tasks concurrently on Data Nodes)<br />
Real world example - a Cyber problem<br />
• We would like to be able to trace the use of any Linux system calls back<br />
to the user who was ultimately responsible for them.<br />
• Assume malicious users will try to hide their identity by spawning layers of<br />
child processes that make it difficult to track back to the original process<br />
that is their "terminal" (login) session.<br />
• What we would like to know is whether <strong>Hadoop</strong> can be used to solve this problem<br />
by doing the track back for any (or all) SYSCALL (system call) events.<br />
• As for the scale, consider a datacentre with:<br />
– servers generating audit events at the rate of ~20 per second, i.e. ~72,000 per hour each<br />
– 1,000 such servers, i.e. ~72 million events per hour, or ~1.7 billion events a day (!)<br />
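As a quick check of the scale, multiplying the slide's figures through gives 72,000 events per server per hour, 72 million per hour across the datacentre and roughly 1.7 billion per day. The constant names below are illustrative.

```python
EVENTS_PER_SECOND_PER_SERVER = 20  # from the slide
SERVERS = 1000                     # from the slide

per_server_hour = EVENTS_PER_SECOND_PER_SERVER * 3600
per_datacentre_hour = per_server_hour * SERVERS
per_datacentre_day = per_datacentre_hour * 24

print(f"{per_server_hour:,} events per server per hour")
print(f"{per_datacentre_hour:,} events per hour across the datacentre")
print(f"{per_datacentre_day:,} events per day across the datacentre")
```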
Real world example - a Cyber problem (cont’d)<br />
• The problem can be addressed by processing the audit logs produced by the<br />
auditd daemon that is supplied with Linux kernels 2.6.x (or newer).<br />
• In order to get auditd to log the data we are interested in, it is necessary to<br />
set up rules in /etc/audit/audit.rules on each server as specified in the<br />
NSA document<br />
“Guide to the Secure Configuration of Red Hat Enterprise Linux 5, Revision 4,<br />
September 14, 2010”<br />
• The data to process looks something like this:<br />
Real world example - a Cyber problem (cont’d)<br />
node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11<br />
success=yes exit=0 a0=9f20008 a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139<br />
auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295<br />
comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)<br />
node=192.168.1.3 type=EXECVE msg=audit(1294649964.027:33673): argc=7 a0="iconv" a1="-f" a2="utf-8"<br />
a3="-t" a4="utf-8" a5="-o" a6="/dev/null"<br />
node=192.168.1.3 type=CWD msg=audit(1294649964.027:33673): cwd="/usr/share/man/man0p"<br />
node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=0 name="/usr/bin/iconv"<br />
inode=18564672 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:bin_t:s0<br />
node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=1 name=(null) inode=9994257<br />
dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0<br />
node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500<br />
auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" :<br />
exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'<br />
node=192.168.1.3 type=USER_ACCT msg=audit(1294649964.060:33675): user pid=9836 uid=500<br />
auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: accounting acct="root" :<br />
exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'<br />
Real world example - a Cyber problem (cont’d)<br />
• The events we are interested in are the SYSCALL events such as:<br />
node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11<br />
success=yes exit=0 a0=9f20008 a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139<br />
auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295<br />
comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)<br />
• The process id making this system call is 10139 and this process has a parent process<br />
3696<br />
• What we need to do is find which process created 3696 and if that was not an event<br />
with a "terminal" entry, such as:<br />
node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500<br />
auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" :<br />
exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'<br />
• Then we need to trace the ppid of that process, and so on, until we get to an event that<br />
has a "terminal" entry. An interesting problem since we have no idea how deep the<br />
process creation hierarchy will be!<br />
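Before any trace-back can happen, each audit record has to be split into its fields. The following is a rough sketch of one way to do that in Python; the regular expression, the function name `parse_audit_record` and the abridged sample lines are assumptions for illustration, not the parsing used in the original job.

```python
import re

# name=value pairs, where the value is either double-quoted or runs to the next space.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_audit_record(line):
    """Split one auditd record into a dict; the first value seen for a name wins."""
    fields = {}
    for name, value in FIELD.findall(line):
        fields.setdefault(name, value.strip('"'))
    return fields


# Abridged, single-line versions of the records shown on the slides.
syscall_line = (
    "node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): "
    "arch=40000003 syscall=11 success=yes exit=0 ppid=3696 pid=10139 "
    'uid=0 tty=(none) comm="iconv" exe="/usr/bin/iconv"'
)
auth_line = (
    "node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): "
    "user pid=9836 uid=500 auid=4294967295 "
    "msg='PAM: authentication acct=\"root\" : exe=\"/bin/su\" "
    "(hostname=?, addr=?, terminal=pts/1 res=success)'"
)

syscall = parse_audit_record(syscall_line)
auth = parse_audit_record(auth_line)
print(syscall["pid"], syscall["ppid"], "terminal" in syscall)
print(auth["pid"], auth.get("terminal"))
```

The SYSCALL record yields the pid and ppid needed to walk up the process tree, and the presence of a `terminal` field marks a record where the walk can stop.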
Real world example - a Cyber problem (cont’d)<br />
• As always, there is more than one way to approach this, but one simple way would<br />
seem to be to have a map-reduce job that takes as input:<br />
– The auditd logs, and<br />
– A list of parent process ids that we want to trace back<br />
• and which outputs a file that contains:<br />
– A list of parent process-ids as the key to a list of auditd events that all have this process-id as a ppid entry.<br />
– A list of parent process-ids found that also have "terminal" entries (i.e. for which we have now found the user).<br />
• We can then repeatedly run this map-reduce job using the output of one run as input,<br />
with the auditd logs, to the next, until there is no difference in the output file between<br />
two consecutive runs, or until the output file is empty (whichever occurs first).<br />
(Note that since we will be dealing with audit logs from many different servers, we will need to use the IP<br />
address of the server with the process-id to form a unique key)<br />
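The repeated-run idea can be simulated in memory to show how the loop terminates. This is a simplified sketch, not the original job: events are assumed to be already parsed into dicts, `trace_pass` stands in for one map-reduce run, `max_passes` is an added safety limit, and the event linking pid 3696 to parent 9836 is hypothetical, included only so the example has a chain to follow.

```python
from collections import defaultdict


def trace_pass(events, pending):
    """One simulated run: group events by (node, pid), then reduce each group."""
    grouped = defaultdict(list)
    for event in events:  # "map": keep only events for keys being traced
        key = (event["node"], event["pid"])
        if key in pending:
            grouped[key].append(event)

    resolved, next_pending = {}, {}
    for key, group in grouped.items():  # "reduce": one call per key
        origins = pending[key]
        terminals = sorted(e["terminal"] for e in group if "terminal" in e)
        if terminals:  # user found, stop tracing this branch
            resolved[key] = (terminals[0], set(origins))
        else:  # otherwise carry on with the parent process
            for e in group:
                if "ppid" in e:
                    parent = (e["node"], e["ppid"])
                    next_pending.setdefault(parent, set()).update(origins)
    return resolved, next_pending


def trace_back(events, start_keys, max_passes=50):
    """Repeat the pass until nothing is pending or the output stops changing."""
    pending = {key: {key} for key in start_keys}
    found = {}
    for _ in range(max_passes):
        if not pending:
            break
        resolved, next_pending = trace_pass(events, pending)
        for key, (terminal, origins) in resolved.items():
            for origin in origins:
                found[origin] = (key, terminal)
        if next_pending == pending:
            break
        pending = next_pending
    return found


NODE = "192.168.1.3"
events = [
    {"node": NODE, "pid": "10139", "ppid": "3696"},        # SYSCALL from the slides
    {"node": NODE, "pid": "3696", "ppid": "9836"},         # hypothetical link
    {"node": NODE, "pid": "9836", "terminal": "pts/1"},    # USER_AUTH from the slides
]
print(trace_back(events, [(NODE, "10139")]))
```

Using the (node, pid) pair as the key mirrors the note above about combining the server's IP address with the process id.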
Real world example - a Cyber problem (cont’d)<br />
• A first question is where to get the initial list of process IDs from, and there<br />
are two obvious options:<br />
– Wait for the SOC staff to spot SYSCALL events that they are interested in, or<br />
– Make a first pass through the audit logs and for a given day, extract all the SYSCALL<br />
events on that day and then find the owner UIDs for all of them<br />
(Note that since the dataset used for development was quite small, option 2<br />
was practical)<br />
Real world example - a Cyber problem (cont’d)<br />
Sample auditd log data<br />
• node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11 success=yes exit=0 a0=9f20008<br />
a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0<br />
fsgid=0 tty=(none) ses=4294967295 comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)<br />
• node=192.168.1.3 type=EXECVE msg=audit(1294649964.027:33673): argc=7 a0="iconv" a1="-f" a2="utf-8" a3="-t" a4="utf-8" a5="-o"<br />
a6="/dev/null"<br />
• node=192.168.1.3 type=CWD msg=audit(1294649964.027:33673): cwd="/usr/share/man/man0p"<br />
• node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=0 name="/usr/bin/iconv" inode=18564672 dev=fd:00<br />
mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:bin_t:s0<br />
• node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=1 name=(null) inode=9994257 dev=fd:00 mode=0100755<br />
ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0<br />
• node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500 auid=4294967295<br />
subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1<br />
res=success)'<br />
• node=192.168.1.3 type=USER_ACCT msg=audit(1294649964.060:33675): user pid=9836 uid=500 auid=4294967295<br />
subj=system_u:system_r:unconfined_t:s0 msg='PAM: accounting acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1<br />
res=success)'<br />
• node=192.168.1.3 type=SYSCALL msg=audit(1294649964.063:33676): arch=40000003 syscall=11 success=yes exit=0 a0=9f12820<br />
a1=9f00c48 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10141 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0<br />
fsgid=0 tty=(none) ses=4294967295 comm="gawk" exe="/bin/gawk" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)<br />
• node=192.168.1.3 type=EXECVE msg=audit(1294649964.063:33676): argc=6 a0="/usr/bin/gawk"<br />
• node=192.168.1.3 type=CWD msg=audit(1294649964.063:33676): cwd="/usr/share/man/man0p"<br />
Real world example - a Cyber problem (cont’d)<br />
Sample output from SYSCALL map-reduce run<br />
• 10530:192.168.1.3 audit(1294650031.384:34011)<br />
• 10531:192.168.1.3 audit(1294650031.471:34015) audit(1294650031.481:34016) audit(1294650031.490:34017)<br />
• 10551:192.168.1.3 audit(1294650031.564:34022)<br />
• 10555:192.168.1.3 audit(1294650061.348:34028) audit(1294650061.341:34027)<br />
• 10556:192.168.1.3 audit(1294650061.355:34029) audit(1294650061.447:34034) audit(1294650061.408:34031) audit(1294650061.446:34035)<br />
• 10557:192.168.1.3 audit(1294650061.405:34030)<br />
• 10559:192.168.1.3 audit(1294650061.413:34033) audit(1294650061.408:34032)<br />
• 10561:192.168.1.3 audit(1294650061.470:34036)<br />
• 10563:192.168.1.3 audit(1294650061.473:34037) audit(1294650061.477:34038)<br />
• 10668:192.168.1.3 audit(1294653661.604:34047) audit(1294653661.610:34048)<br />
• 10669:192.168.1.3 audit(1294653661.626:34049) audit(1294653661.723:34054) audit(1294653661.726:34055) audit(1294653661.676:34050)<br />
• 10670:192.168.1.3 audit(1294653661.682:34051)<br />
• 10672:192.168.1.3 audit(1294653661.686:34053) audit(1294653661.685:34052)<br />
• 10674:192.168.1.3 audit(1294653661.743:34056)<br />
• 10676:192.168.1.3 audit(1294653661.745:34057) audit(1294653661.748:34058)<br />
• 10702:192.168.1.3 audit(1294654578.626:34062)<br />
• 10787:192.168.1.3 audit(1294657261.887:34072) audit(1294657261.881:34071)<br />
• 10788:192.168.1.3 audit(1294657261.951:34074) audit(1294657261.992:34078) audit(1294657261.892:34073) audit(1294657261.997:34079)<br />
• 10789:192.168.1.3 audit(1294657261.956:34075)<br />
• 10791:192.168.1.3 audit(1294657261.974:34077) audit(1294657261.971:34076)<br />
• 10793:192.168.1.3 audit(1294657262.022:34080)<br />
• 10795:192.168.1.3 audit(1294657262.028:34081) audit(1294657262.031:34082)<br />
• 10898:192.168.1.3 audit(1294660861.189:34090) audit(1294660861.197:34091)<br />
• 10899:192.168.1.3 audit(1294660861.257:34093) audit(1294660861.258:34092)<br />
• ...<br />
• ...<br />
• 9833:192.168.1.3 audit(1294649959.818:33419)<br />
• 9836:192.168.1.3 audit(1294649964.126:33683) audit(1294649964.108:33679) audit(1294649964.115:33680)<br />
• 9840:192.168.1.3 audit(1294649959.993:33425)<br />
• 9841:192.168.1.3 audit(1294649959.999:33426)<br />
Real world example - a Cyber problem (cont’d)<br />
Sample output from trackback map-reduce run<br />
• ...<br />
• ...<br />
• *10530:192.168.1.3 audit(1294650031.326:34005)<br />
• *10555:192.168.1.3 audit(1294650061.485:34040) audit(1294650061.485:34039) audit(1294650061.337:34026) audit(1294650061.330:34025) audit(1294650061.324:34024) audit(1294650061.324:34023)<br />
• *10668:192.168.1.3 audit(1294653661.595:34045) audit(1294653661.601:34046) audit(1294653661.591:34043) audit(1294653661.592:34044) audit(1294653661.753:34060) audit(1294653661.752:34059)<br />
• *10787:192.168.1.3 audit(1294657262.072:34084) audit(1294657261.870:34069) audit(1294657261.878:34070) audit(1294657261.867:34068) audit(1294657261.865:34067) audit(1294657262.070:34083)<br />
• *10898:192.168.1.3 audit(1294660861.181:34086) audit(1294660861.186:34089) audit(1294660861.183:34088) audit(1294660861.181:34087) audit(1294660861.295:34103) audit(1294660861.295:34102)<br />
• *11010:192.168.1.3 audit(1294664461.538:34121) audit(1294664461.410:34108) audit(1294664461.403:34107) audit(1294664461.401:34106) audit(1294664461.400:34105) audit(1294664461.538:34122)<br />
• *11122:192.168.1.3 audit(1294668061.647:34124) audit(1294668061.792:34140) audit(1294668061.792:34141) audit(1294668061.656:34127) audit(1294668061.651:34126) audit(1294668061.649:34125)<br />
• *9836:192.168.1.3 audit(1294649964.119:33681) audit(1294649964.060:33675) audit(1294649964.123:33682) audit(1294649964.060:33674)<br />
• 10002:192.168.1.3:10003 audit(1294649962.248:33560)<br />
• 10008:192.168.1.3:10009 audit(1294649962.301:33565)<br />
• 10014:192.168.1.3:10015 audit(1294649962.368:33570)<br />
• 10020:192.168.1.3:10021 audit(1294649962.464:33575)<br />
• 10026:192.168.1.3:10027 audit(1294649962.540:33580)<br />
• 10032:192.168.1.3:10033 audit(1294649962.624:33585)<br />
• 10038:192.168.1.3:10039 audit(1294649962.693:33590)<br />
• 10044:192.168.1.3:10045 audit(1294649962.781:33595)<br />
• 10050:192.168.1.3:10051 audit(1294649962.857:33600)<br />
• 10056:192.168.1.3:10057 audit(1294649962.922:33605)<br />
• 10062:192.168.1.3:10063 audit(1294649963.007:33610)<br />
• 10068:192.168.1.3:10069 audit(1294649963.108:33615)<br />
• 10074:192.168.1.3:10075 audit(1294649963.188:33620)<br />
• 10080:192.168.1.3:10081 audit(1294649963.279:33625)<br />
• 10086:192.168.1.3:10087 audit(1294649963.343:33630)<br />
• 10092:192.168.1.3:10093 audit(1294649963.440:33635)<br />
• 10098:192.168.1.3:10099 audit(1294649963.500:33640)<br />
Here you can see that although the output is one file, it contains the two lists, in that the keys that have an "*" in front of them are the auditd events that contain "terminal" entries. Any lines that start with "*" will be ignored on the next run.<br />
How to get started...<br />
• Download and install a Java SDK (version 5+)<br />
• Download and install Eclipse<br />
• Download and install Karmasphere Studio Community Edition for Eclipse<br />
• Grab a copy of my 2011 LEF Report – <strong>Hadoop</strong>, a Practitioner’s Guide<br />
• Get a copy of Tom White’s book – <strong>Hadoop</strong>, the Definitive Guide (O’Reilly)<br />
• Windows users also need to:<br />
– Download and install Cygwin, with X windows<br />
• Open an xterm window in Cygwin and run Eclipse from there instead of Windows<br />
• No need to download and install <strong>Hadoop</strong> !!<br />
– Karmasphere Studio gives you all you need to develop and test <strong>Hadoop</strong> MapReduce jobs<br />
Resources<br />
• My LEF report:<br />
– http://assets1.csc.com/lef/downloads/CSC_Grant_2011_A_Practioner_s_Guide_for_Hadoop_Development.pdf<br />
• LEF Data rEvolution Report:<br />
– http://www.csc.com/lef/ds/22182-reports<br />
• <strong>Hadoop</strong> Wiki:<br />
– http://wiki.apache.org/hadoop/<br />
• <strong>Hadoop</strong> home page:<br />
– http://hadoop.apache.org/#News<br />
• Karmasphere Studio:<br />
– http://www.karmasphere.com/Products-Information/karmasphere-studio.html<br />
• If you want to contact me: lklein2@csc.com<br />
Questions?<br />
Thank You