A Big Data Primer

Written by Prof. Dr. Philippe Cudre-Mauroux, University of Fribourg

Far from today’s buzz, Big Data emerged as a new topic a decade ago as the result of two conflicting developments: the explosion of available data on the one hand, and new hurdles hampering the evolution of database management systems on the other. As data grew rapidly through the wider deployment of sensing and Web technologies, data management was facing unprecedented issues: limited progress on the hardware front (with stagnating CPU frequencies and inefficient storage media), as well as the appearance of new data types and query workloads (sensor data, social Web data, analytics).

Web giants were the first companies to be confronted with the data deluge and with these issues. At the time, they were struggling on a daily basis to maintain their data infrastructures. The CAP theorem, which states that a distributed system cannot simultaneously guarantee consistency, availability, and tolerance to network partitions, and which therefore implies that the guarantees of legacy database systems cannot be fully preserved in large-scale distributed settings, drew a lot of attention. Standard data infrastructures were failing. New solutions were desperately needed.

Mike Stonebraker, a pioneer in data management systems, was one of the early heralds of change during those troubled times. He advocated a radical approach: a total rewrite of data management infrastructures to cope with the new data and workloads that were appearing. The next few years witnessed a flurry of endeavors aiming to create more efficient, specialized data management systems, e.g., for aggregate queries, graphs, arrays, or semi-structured data. Hadoop MapReduce, with its simplistic programming model, emerged as the first general-purpose solution for handling large datasets on clusters of commodity machines. Big Data was born.
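
As an illustration of the MapReduce programming model mentioned above, the following is a minimal, self-contained sketch in plain Python rather than the actual Hadoop API (which is Java-based): it mimics the map, shuffle, and reduce phases on the canonical word-count example.

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: group all emitted values by key.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: aggregate the list of values for each key.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    documents = [
        "big data emerged a decade ago",
        "big data infrastructures power new applications",
    ]
    print(reduce_phase(shuffle(map_phase(documents))))
```

In a real Hadoop deployment, the map and reduce functions run in parallel across a cluster of commodity machines, with the framework taking care of data placement, shuffling, and fault tolerance.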

The data management market was previously in the hands of a few big players, all offering similar solutions. This changed dramatically with the emergence of specialized Big Data vendors, which fragmented the market by offering a wealth of new options to potential customers. The big players started to develop their own specialized solutions to manage Big Data. They also massively adopted Hadoop, which transitioned from an awkward programming framework into an open-source, general-purpose Big Data ecosystem.

Nowadays, virtually all large companies, both in Switzerland and abroad, leverage Big Data solutions internally. Big Data infrastructures are powering a wide range of new applications, from large-scale log analysis to real-time stream processing. Increasingly, Big Data is used to collect all kinds of information in companies, creating so-called data lakes where heterogeneous pieces of data are gathered. Parts of the data are then fed into sophisticated machine-learning algorithms, which use historical facts to make predictions about future events. Supporting such predictive analytics is, however, quite complex, as it requires groups of highly qualified (and highly sought-after) employees: DevOps engineers to configure and optimize the dedicated Big Data infrastructures, and data scientists to manually select the relevant pieces of data and the facets (i.e., features) that machine-learning algorithms should exploit.
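
To make this workflow more concrete, here is a minimal predictive-analytics sketch, assuming Python with NumPy and scikit-learn and purely synthetic data; the two features stand in for the facets a data scientist would hand-pick from a data lake, and none of the names or values refer to a real system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Historical facts: two illustrative features per record plus a known outcome.
X_history = rng.normal(size=(1000, 2))
y_history = (X_history[:, 0] + 0.5 * X_history[:, 1] > 0).astype(int)

# Fit a simple model on the historical data.
model = LogisticRegression().fit(X_history, y_history)

# Score new, unseen events.
X_new = rng.normal(size=(5, 2))
print(model.predict(X_new))        # predicted outcomes
print(model.predict_proba(X_new))  # predicted probabilities
```

In practice, the bulk of the effort lies not in fitting the model but in selecting, cleaning, and integrating the relevant data from the lake in the first place.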

As such, Big Data applications today run on sophisticated platforms that are very difficult to deploy and optimize in smaller entities such as SMEs or local administrations. This will however change in the near future, with the emergence of cloud-based, integrated Big Data solutions that will drastically simplify how data is ingested, integrated, and leveraged. Amazon AWS was the first cloud-computing platform to support a wide range of Big Data solutions. The newer online platforms built by Databricks and Microsoft are good examples of this continuing trend.

As Big Data solutions mature, more and more aspects of our daily lives will be affected by decisions taken by fully automated processes (prescriptive analytics). This raises obvious questions about data privacy (who has access to the data?) and transparency (what happens to the data?), but also about the prejudices and social biases that are often implicitly ingrained in data-driven processes. As a society, we need to understand the social and decision-making implications of Big Data technologies, and take active steps towards making them more transparent and equitable.

About the author

Philippe Cudre-Mauroux is currently a Visiting Professor in Big Data at MIT and the CTO of Scigility. He is also a Swiss-NSF Professor and the director of the eXascale Infolab at the University of Fribourg, Switzerland. He received his Ph.D. from the Swiss Federal Institute of Technology EPFL, where he won both the Doctorate Award and the EPFL Press Mention in 2007. Before joining the University of Fribourg, he worked on distributed information and media management for HP, IBM Watson Research (NY), and Microsoft Research Asia. He is the President of the GITI and a member of the Forum des 100. He recently won a Verisign Internet Infrastructures Award as well as a Google Faculty Award. His research interests are in next-generation, Big Data management infrastructures for non-relational data.
Webpage: exascale.info/phil
