Big data represents the newest and most comprehensive version of organisations’ long-term aspiration to establish and improve data-driven decision-making. Data in itself is not valuable. The value lies in how organisations use that data to become information-centric companies that rely on insights derived from data analysis for their decision-making.
Detecting the characteristics of Big Data early can give many organisations a cost-effective way to avoid unnecessary deployment of Big Data technologies. Analytics on some data may not require Big Data techniques and technologies; current, well-established techniques and technologies may be sufficient to handle the data storage and data processing. This brings us to the purpose of the characteristics of Big Data: to help identify whether a problem requires a Big Data solution.
According to Gartner, the definition of big data is:
“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Opinions differ on how many characteristics (“V dimensions”) are needed to identify a project as ‘Big Data’. The original three V’s (Volume, Velocity, and Variety) appeared in 2001, when Doug Laney, then an analyst at META Group (later acquired by Gartner), used them to help identify key dimensions of big data.
3-D Data Management
- Volume – The sheer volume of data is enormous. A very large contributor to the ever-expanding digital universe is the Internet of Things, with sensors in devices all over the world creating data every second. Add to that all the emails, Twitter messages, photos, video clips and sensor data we produce and share every second. Data is now generated by employees, partners, machines and customers. For example, hundreds of millions of smartphones send a variety of information to the network infrastructure, data that did not exist five years ago. More sources, each producing larger amounts of data, combine to increase the volume that has to be analysed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
- Velocity – The speed at which data is created, stored, analysed and visualized. Big data technology now allows us to analyse data while it is being generated, without ever putting it into databases. Initially, companies analysed data using a batch process: one takes a chunk of data, submits a job to the server and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With new sources of data such as social and mobile applications, the batch process breaks down. The data now streams into the server continuously, in real time, and the result is only useful if the delay is very short.
- Variety – Nowadays, an estimated 90% of the data generated by organisations is unstructured. Data has moved from Excel tables and databases to a loss of structure and hundreds of formats: pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash and so on. One no longer has control over the input data format. As new applications are introduced, new data formats come to life.
The three V’s are the driving dimensions of Big Data, but they are open-ended. There is no specific volume, velocity, or variety of data that constitutes big. These may be the most common but by no means the only descriptors that have been used.
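The Velocity point above can be made concrete with a small sketch. The first function encodes the two conditions under which batch processing still works (the data arrives more slowly than it can be processed, and the result is still useful after the delay). The second shows the streaming alternative, updating a result as each value arrives without storing the data. All names, parameters and numbers here are hypothetical and chosen only for illustration; this is a simplified model, not a sizing method.

```python
from typing import Iterable, Iterator


def batch_is_viable(arrival_rate: float, processing_rate: float,
                    batch_seconds: float, max_useful_delay: float) -> bool:
    """Check the two batch conditions described in the Velocity bullet.

    arrival_rate     -- records per second entering the system (assumed constant)
    processing_rate  -- records per second a batch job can process
    batch_seconds    -- how long records are collected before a job is submitted
    max_useful_delay -- seconds after which a result is no longer useful
    """
    # Condition 1: incoming data must be slower than batch processing,
    # otherwise the backlog grows without bound.
    if arrival_rate >= processing_rate:
        return False
    # Condition 2: the oldest record in a batch waits for the whole
    # collection window plus the processing time of that batch.
    processing_seconds = (arrival_rate * batch_seconds) / processing_rate
    worst_case_delay = batch_seconds + processing_seconds
    return worst_case_delay <= max_useful_delay


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one result per input value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, nothing is stored
        yield mean


# Nightly batch, result useful for two days: viable.
print(batch_is_viable(100, 1000, 86_400, 172_800))  # True
# Same batch, but the result is only useful within five seconds: not viable.
print(batch_is_viable(100, 1000, 86_400, 5))        # False
# Streaming: a result is available after every value.
print(list(running_mean([2.0, 4.0, 6.0])))          # [2.0, 3.0, 4.0]
```

The incremental update in `running_mean` is what lets a streaming system produce an answer per event without keeping the full history, which is the practical difference from the batch scheme.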
Quantifying ‘Big’ – How Many “V’s” in Big Data?
Data scientists agree on many different characteristics of Big Data, but none can be used by itself to say that this example is Big Data and that one is not. In fact, I was able to find another eleven characteristics claimed for Big Data. They were compiled from several sources, including IBM, Paxata, Datafloq, SAS, Data Science Central and the National Institute of Standards and Technology (NIST).
4. Value: the all-important V, characterizing the business value, ROI, and potential of big data to transform your organization from top to bottom. It is all well and good having access to big data, but unless we can turn it into value it is useless. It is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of costs and benefits.
5. Viability: Neil Biehn, writing in Wired, sees Viability and Value as the distinct missing Vs, numbers 4 and 5. According to Biehn, “we want to carefully select the attributes and factors that are most likely to predict outcomes that matter most to businesses; the secret is uncovering the latent, hidden relationships among these variables.”
6. Veracity: the accuracy and reliability of the data. Veracity affects how much confidence can be placed in the data.
7. Variability: the meaning of the data is changing, sometimes rapidly. This covers dynamic, evolving, spatiotemporal data, time series, seasonal effects, and any other type of non-static behaviour in your data sources, customers, objects of study, etc.
8. Visualization: making all that vast amount of data comprehensible in a manner that is easy to read and understand.
9. Validity: data quality, governance, master data management (MDM) on massive, diverse, distributed, heterogeneous, “unclean” data collections.
10. Venue: distributed, heterogeneous data from multiple platforms, from different owners’ systems, with different access and formatting requirements, private vs. public cloud.
11. Vocabulary: schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data’s structure, syntax, content, and provenance.
12. Vagueness: confusion over the meaning of big data (Is it Hadoop? Is it something that we’ve always had? What’s new about it? What are the tools? Which tools should I use? etc.) (Note: Venkat Krishnamurthy, Director of Product Management at YarcData, introduced this new “V” at the Big Data Innovation Summit in Santa Clara on June 9, 2014.)
13. Virality: Defined by some users as the rate at which the data spreads; how often it is picked up and repeated by other users or events.
14. Volatility: how long data remains valid and how long it should be stored. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.
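As an illustration of Volatility (item 14), the sketch below keeps only the records that are still recent enough for the current analysis. The record layout, the `drop_stale` helper and the 15-minute window are assumptions made for this example; the right retention period depends entirely on the analysis in question.

```python
from datetime import datetime, timedelta


def drop_stale(records, now, max_age):
    """Keep only records whose age does not exceed max_age.

    records -- list of dicts with a "timestamp" key (hypothetical layout)
    now     -- the reference time of the current analysis
    max_age -- timedelta after which a record is considered no longer relevant
    """
    return [r for r in records if now - r["timestamp"] <= max_age]


# Fixed reference time so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, 0)
readings = [
    {"sensor": "a", "timestamp": now - timedelta(seconds=30)},
    {"sensor": "b", "timestamp": now - timedelta(minutes=10)},
    {"sensor": "c", "timestamp": now - timedelta(days=2)},
]

fresh = drop_stale(readings, now, max_age=timedelta(minutes=15))
print([r["sensor"] for r in fresh])  # ['a', 'b']
```

Relevance to the analysis and the storage period are separate decisions: a record dropped from a real-time view may still need to be archived for other purposes.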
How many V’s are enough?
In recent years, revisionists have blown the count out to too many V’s, expanding the market space but also creating confusion. The additional characteristics all matter, particularly when designing and implementing processes that turn raw data into “ready to use” information streams. Still, reaching a common definition of Big Data is one of the first tasks to tackle.
Bill Vorhies, President and Chief Data Scientist at Data-Magnum, has been working since the summer of 2013 with the US Department of Commerce National Institute of Standards and Technology (NIST) working group developing a standardized “Big Data Roadmap”. The group elected to stick with Volume, Variety, and Velocity, and excluded the other dimensions from the Big Data definition because they apply broadly to all types of data.
Author and analytics strategy consultant Seth Grimes makes a similar point in his InformationWeek piece “Big Data: Avoid ‘Wanna V’ Confusion”. He differentiates the essence of Big Data, as defined by Doug Laney’s original-and-still-valid 3 Vs, from derived qualities proposed by various vendors. In his opinion, the wanna-V backers and the contrarians mistake interpretive, derived qualities for essential attributes. Conflating inherent aspects with important objectives leads to poor prioritization and planning.
So the consultants mentioned above believe that Variability, Veracity, Validity, Value, etc. aren’t intrinsic, definitional Big Data properties. They are not absolutes. Rather, they reflect the uses you intend for your data and relate to your particular business needs. You discover context-dependent Variability, Veracity, Validity, and Value in your data via analyses that assess and reduce data and present insights in forms that facilitate business decision-making. This function, Big Data Analytics, is the key to understanding Big Data.
I’ve explored many sources to bring you a complete listing of possible definitions of Big Data, with the goal of being able to determine what is a Big Data opportunity and what is not. Once you have a single view of your data, you can start to make intelligent decisions about the business, its performance and its future plans.
In conclusion, Volume, Variety, and Velocity still make the best definition, but none of them stands on its own in distinguishing Big Data from not-so-big data. Understanding these characteristics will help you analyse whether an opportunity calls for a Big Data solution, but the key is to understand that this is really about breakthrough changes in the technology of storing, retrieving, and analysing data, and then to find the opportunities that can best take advantage of them.
Bernard Marr, “Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance”. http://media.wiley.com/product_data/excerpt/33/11189658/1118965833-18.pdf
Andrew McAfee and Erik Brynjolfsson, “Big Data: The Management Revolution”, Harvard Business Review, October 2012. http://ai.arizona.edu/mis510/other/Big%20Data%20-%20The%20Management%20Revolution.pdf