24.07.2016 Views

www.allitebooks.com

Learning Data Mining with Python


Working with Big Data

Big data

What makes big data different? Most big-data proponents talk about the four Vs of big data:

1. Volume: The amount of data that we generate and store is growing at an increasing rate, and predictions of the future generally only suggest further increases. Today's multi-gigabyte hard drives will turn into exabyte hard drives in a few years, and network traffic will increase as well. The signal-to-noise ratio can be quite poor, with important data being lost in a mountain of unimportant data.

2. Velocity: Related to volume, the velocity of data is also increasing. Modern cars have hundreds of sensors that stream data into their computers, and the information from these sensors needs to be analyzed at a sub-second level to operate the car. It isn't just a case of finding answers in the volume of data; those answers often need to come quickly.

3. Variety: Nice datasets with clearly defined columns are only a small part of the data that we have these days. Consider a social media post, which may have text, photos, user mentions, likes, comments, videos, geographic information, and other fields. Simply ignoring the parts of this data that don't fit your model will lead to a loss of information, but integrating that information can itself be very difficult.

4. Veracity: With the increase in the amount of data, it can be hard to determine whether the data is being collected correctly: whether it is outdated, noisy, contains outliers, or is useful at all. Trusting the data is hard when a human can't reliably verify it. External datasets are also increasingly being merged into internal ones, which raises further questions about the veracity of the data.

These four main Vs (others have proposed additional Vs) outline why big data is different from just lots of data. At these scales, the engineering problem of merely working with the data is often difficult in itself, let alone the analysis. While there are lots of snake oil salesmen who overstate what can be done with big data, it is hard to deny the engineering challenges and the potential of big-data analytics.

The algorithms we have used to date load the dataset into memory and then work on the in-memory version. This gives a large benefit in terms of speed of computation, as it is much faster to compute on in-memory data than to load each sample before we use it. In addition, in-memory data allows us to iterate over the data many times, which lets us improve our model.
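To make the contrast concrete, here is a minimal sketch of the two approaches, not code from any particular library workflow. The sample data `CSV_TEXT` and the function names `stats_in_memory` and `mean_streaming` are hypothetical, chosen only for illustration, and a small string stands in for a file on disk.

```python
import csv
import io

# Hypothetical sample data standing in for a (much larger) file on disk.
CSV_TEXT = "value\n1.0\n2.0\n3.0\n4.0\n"


def stats_in_memory(text):
    """Load every value into a list, then make two passes over that list."""
    values = [float(row["value"]) for row in csv.DictReader(io.StringIO(text))]
    # First pass over the in-memory copy: the mean.
    mean = sum(values) / len(values)
    # Second pass over the same copy: the (population) variance.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def mean_streaming(text):
    """Read one row at a time, keeping only a running total and a count."""
    total, count = 0.0, 0
    for row in csv.DictReader(io.StringIO(text)):
        total += float(row["value"])
        count += 1
    return total / count
```

The in-memory version can revisit the data as often as it likes, but it needs room for the whole dataset. The streaming version uses constant memory, but any second pass means reading the source again.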
