Big data is a very hot topic, and with the Splunk IPO last week seeing a 1999-style spike, the bandwagon is overflowing. We’re poised to see many businesses pivoting into the big data space or simply slapping a big data sticker on their products—accurate or not—just to ride the wave.
This post aims to help educate you with a few byte-sized big data concepts (not just trivia) so that you can distinguish the substance from the hype.
1. Big data is distributed data
Big data is a nebulous term with many different definitions. The key thing to remember is that in this day and age, big data is distributed data. This means the data is so massive it cannot be stored or processed by a single node.
The days of buying a single big iron server from IBM or Sun to handle all your business intelligence needs are long gone. It’s been proven by Google, Amazon, Facebook, and others that the way to scale fast and affordably is to use commodity hardware to distribute the storage and processing of our massive data streams across several nodes, adding and removing nodes as needed.
2. You’re going to hear the words “Hadoop” and “MapReduce”
What is Hadoop? It is an open-source platform for consolidating, combining, and understanding large-scale data in order to make better business decisions. Hadoop is the technology powering many (but not all) big data analytics infrastructures.
There are two key parts to Hadoop:
- HDFS (Hadoop Distributed File System), which lets you store data across multiple nodes.
- MapReduce, which lets you process data in parallel across multiple nodes.
Although Hadoop is one of the most popular solutions for crunching big data, there are plenty of others. Big data can’t be shoehorned into one flavor of technology. The important characteristic is that you’re able to draw insights from large quantities of data, independent of specific technologies.
3. You can understand MapReduce without a degree from Stanford
The best plain-English explanation of MapReduce I’ve encountered (paraphrasing):
We want to count all the books in the library. You count up shelf #1. I count up shelf #2. That’s map. Now we get together and add our individual counts. That’s reduce.
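The library analogy can be sketched in a few lines of Python, using the built-in `map` and `reduce` functions. This is a toy illustration of the concept, not actual Hadoop code, and the shelves of book titles are invented for the example:

```python
from functools import reduce

# Each "shelf" is a list of book titles.
shelves = [
    ["Moby Dick", "Dracula", "Emma"],          # shelf #1
    ["Hamlet", "Ulysses"],                     # shelf #2
    ["Beloved", "Middlemarch", "Persuasion"],  # shelf #3
]

# Map: each worker counts its own shelf independently
# (in a real cluster, these counts happen in parallel on different nodes).
per_shelf_counts = list(map(len, shelves))  # [3, 2, 3]

# Reduce: combine the individual counts into a single total.
total = reduce(lambda a, b: a + b, per_shelf_counts, 0)  # 8

print(per_shelf_counts, total)
```

The point is that no single worker ever needs to see the whole library; each handles its own shelf, and only the small per-shelf counts get combined at the end.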
For a deeper understanding, Wikipedia is a good place to start.
4. Distributed data generation is fueling big data growth
The reason we have data problems so big that they require large-scale distributed computing architectures to solve is that the creation of the data is itself large-scale and distributed. Most of us walk around carrying devices that are constantly pulsing all sorts of data into the cloud and beyond: our locations, our photos, our tweets, our status updates, our connections, even our heartbeats.
For every human-generated piece of data there’s likely associated machine-generated data. And then there’s the metadata. The data is abundant and it’s extremely valuable.
5. Machine learning is…awesome!
One of the key differentiators in big data analytics is the set of machine learning algorithms used to answer interesting questions and derive value from the 0s and 1s we’re furiously chewing up and spitting back out.
Some pretty cool examples:
- Nest – a beautiful thermostat that learns how hot or cold you like your house so you never have to adjust it again (not technically big data, but fun nonetheless)
- Gmail’s Bayesian spam filter – no more tempting emails from that pesky Nigerian prince!
- Varonis’ access control recommendations – ratchet down access based on highly accurate analytics.
If you’re interested in learning more about big data, join our webinar this Wednesday on Mastering Big Data.
photo credit: http://fav.me/d4vqn4w