Top 5 Myths About Big Data

Credit:

With the amount of hype around Big Data it’s easy to forget that we’re just in the first inning. More than three exabytes of new data are created each day, and market research firm IDC estimates that 1,200 exabytes of data will be generated this year alone.

The expansion of digital data has been underway for more than a decade and for those who’ve done a little homework, they understand that Big Data references more than just Google, eBay, or Amazon-sized data sets. The opportunity for a company of any size to gain advantages from Big Data stem from data aggregation, data exhaust, and metadata -- the fundamental building blocks to tomorrow’s business analytics. Combined, these data forces present an unparalleled opportunity.

Yet, despite how broadly Big Data is being discussed, it appears that it is still a very big mystery to many. In fact, outside of the experts who have a strong command of this topic, the misunderstandings around Big Data seem to have reached mythical proportions. Here are the top five myths.

1. Big Data is Only About Massive Data Volume

Volume is just one key element in defining Big Data, and it is arguably the least important of three elements. The other two are variety and velocity. Taken together, these three “Vs” of Big Data were originally posited by Gartner’s Doug Laney in a 2001 research report.

Generally speaking, experts consider petabytes of data volumes as the starting point for Big Data, although this volume indicator is a moving target. Therefore, while volume is important, the next two “Vs” are better individual indicators.

Variety refers to the many different data and file types that are important to manage and analyze more thoroughly, but for which traditional relational databases are poorly suited. Some examples of this variety include sound and movie files, images, documents, geo-location data, web logs, and text strings.

Velocity is about the rate of change in the data and how quickly it must be used to create real value. Traditional technologies are especially poorly suited to storing and using high-velocity data. So new approaches are needed. If the data in question is created and aggregates very quickly and must be used swiftly to uncover patterns and problems, the greater the velocity and the more likely that you have a Big Data opportunity.

2. Big Data Means Hadoop

Hadoop is the Apache open-source software framework for working with Big Data. It was derived from Google technology and put to practice by Yahoo and others. But, Big Data is too varied and complex for a one-size-fits-all solution. While Hadoop has surely captured the greatest name recognition, it is just one of three classes of technologies well suited to storing and managing Big Data. The other two classes are NoSQL and Massively Parallel Processing (MPP) data stores. (See myth number five below for more about NoSQL.) Examples of MPP data stores include EMC’s Greenplum, IBM’s Netezza, and HP’s Vertica.

Plus, Hadoop is a software framework, which means it includes a number of components that were specifically designed to solve large-scale distributed data storage, analysis and retrieval tasks. Not all of the Hadoop components are necessary for a Big Data solution, and some of these components can be replaced with other technologies that better complement a user's needs. One example is MapR’s Hadoop distribution, which includes NFS as an alternative to HDFS, and offers a full random-access, read/write file system.

3. Big Data Means Unstructured Data

The term “unstructured" is imprecise and doesn’t account for the many varying and subtle structures typically associated with Big Data types. Also, Big Data may well have different data types within the same set that do not contain the same structure.

Therefore, Big Data is probably better termed “multi-structured” as it could include text strings, documents of all types, audio and video files, metadata, web pages, email messages, social media feeds, form data, and so on. The consistent trait of these varied data types is that the data schema isn’t known or defined when the data is captured and stored. Rather, a data model is often applied at the time the data is used.

4. Big Data is for Social Media Feeds and Sentiment Analysis

Simply put, if your organization needs to broadly analyze web traffic, IT system logs, customer sentiment, or any other type of digital shadows being created in record volumes each day, Big Data offers a way to do this. Even though the early pioneers of Big Data have been the largest, web-based, social media companies -- Google, Yahoo, Facebook -- it was the volume, variety, and velocity of data generated by their services that required a radically new solution rather than the need to analyze social feeds or gauge audience sentiment.

Now, thanks to rapidly increasing computer power (often cloud-based), open source software (e.g., the Apache Hadoop distribution), and a modern onslaught of data that could generate economic value if properly utilized, there are an endless stream of Big Data uses and applications. A favorite and brief primer on Big Data, which contains some thought-provoking uses, was published as an article early this year in Forbes.

5. NoSQL means No SQL

NoSQL means “not only” SQL because these types of data stores offer domain-specific access and query techniques in addition to SQL or SQL-like interfaces. Technologies in this NoSQL category include key value stores, document-oriented databases, graph databases, big table structures, and caching data stores. The specific native access methods to stored data provide a rich, low-latency approach, typically through a proprietary interface. SQL access has the advantage of familiarity and compatibility with many existing tools. Although this is usually at some expense of latency driven by the interpretation of the query to the native “language” of the underlying system.

For example, Cassandra, the popular open source key value store offered in commercial form by DataStax, not only includes native APIs for direct access to Cassandra data, but CQL (it’s SQL-like interface) as its emerging preferred access mechanism. It’s important to choose the right NoSQL technology to fit both the business problem and data type and the many categories of NoSQL technologies offer plenty of choice.