The phrase “Big Data” has been around for a while, and it is a trend that shows no sign of slowing down: its impact grows every day. With that in mind, I am putting together a series of posts for those who might not be too familiar with the subject. As a companion to my guide, I’ve written this post explaining some of the jargon and buzzwords that have built up around the topic. So here goes. These definitions are for anyone who wants to know more.
Data-as-a-service, software-as-a-service and platform-as-a-service all refer to the idea that, rather than selling data, licences to use data, or platforms for running Big Data technology, these can be provided “as a service” rather than as a distinct product. This reduces the upfront capital investment customers need to make before putting their data, or platforms, to work, because the provider bears the cost of setting up and hosting the infrastructure. For customers, as-a-service infrastructure can greatly reduce the initial costs and setup time of getting Big Data initiatives up and running.
Data science is the professional field that deals with turning data into value, such as new insights or predictive models. It brings together expertise from fields including statistics, mathematics, computer science and communication, as well as domain expertise such as business knowledge. The role of data scientist was recently voted the number-one job in the U.S., based on current demand, salary and career opportunities.
Data mining is the process of discovering insights from data. Because Big Data is so large, this is generally done by automated, computational methods such as decision trees, clustering analysis and, most recently, machine learning. Think of this as using the brute mathematical power of computers to spot patterns in data that would not otherwise be visible, due to the complexity of the dataset.
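To give a toy flavour of what clustering means in practice, the sketch below groups one-dimensional readings by splitting at the largest gap between sorted values. This is a deliberately crude stand-in for real clustering algorithms such as k-means, and the numbers are invented for illustration.

```python
# Toy clustering: split 1-D data at the largest gap between sorted points.
# A crude stand-in for real algorithms such as k-means; data is invented.
values = [1.0, 1.2, 0.9, 8.1, 7.9, 8.3]
values.sort()

# Find the largest gap between consecutive points.
gaps = [(values[i + 1] - values[i], i) for i in range(len(values) - 1)]
_, split = max(gaps)

# Everything before the gap is one cluster, everything after is another.
clusters = [values[:split + 1], values[split + 1:]]
print(clusters)  # [[0.9, 1.0, 1.2], [7.9, 8.1, 8.3]]
```

Even this tiny example shows the pattern-spotting idea: the two groups emerge from the data itself, without anyone telling the program where the boundary lies.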
Hadoop is a framework for Big Data computing that has been released into the public domain as open-source software, so it can be freely used by anyone. It consists of several modules, each tailored to a different vital step of the Big Data process, from file storage (the Hadoop Distributed File System, HDFS) to database (HBase) to carrying out data operations (Hadoop MapReduce, see below). Due to its power and flexibility, it has become so popular that it has developed its own industry of retailers (selling tailored versions), support service providers and consultants.
Predictive modelling is, simply, predicting what will happen next based on data about what has happened previously. In the age of Big Data, with more data available than ever before, predictions are becoming more and more accurate. Predictive modelling is a core component of most Big Data initiatives, which are formulated to help us choose the course of action most likely to lead to the desired outcome. The speed of modern computers and the volume of available data mean that predictions can be based on a huge number of variables, leading to more successful results.
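A minimal sketch of the idea, using a hand-rolled least-squares line fit (real projects would use a statistics library and far more variables; the sales figures here are made up):

```python
# Predictive modelling in miniature: fit a straight line to past
# observations, then predict the next point. Data is invented.
xs = [1, 2, 3, 4, 5]       # e.g. month number
ys = [10, 12, 14, 16, 18]  # e.g. units sold that month

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# Predict month 6 from the fitted line.
prediction = slope * 6 + intercept
print(prediction)  # 20.0
```

The model has "learned" that sales grow by two units a month and extrapolates that trend forward, which is exactly the choose-the-best-course-of-action logic described above, just at toy scale.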
MapReduce is a computing procedure for working with large datasets. It was created in response to the difficulty of reading and analyzing really Big Data using conventional computing methodologies. As its name suggests, it consists of two procedures: mapping (sorting information into the format needed for analysis—for example, sorting a list of people according to their ages) and reducing (performing an operation, such as checking the age of everyone in the dataset to see who is over 21).
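The two-step structure can be sketched in a few lines of plain Python, using the article's own over-21 example (the names and ages are invented; a real MapReduce job would distribute these steps across many machines):

```python
from functools import reduce

# Invented sample records: (name, age) pairs.
people = [("Alice", 34), ("Bob", 19), ("Cara", 25), ("Dan", 17)]

# Map step: transform each record into the shape needed for analysis —
# here, a flag saying whether the person is over 21.
mapped = [(name, age > 21) for name, age in people]

# Reduce step: aggregate the mapped values into a single answer —
# here, a count of how many people are over 21.
over_21 = reduce(lambda total, pair: total + pair[1], mapped, 0)
print(over_21)  # 2
```

On a cluster, the map step would run in parallel on many chunks of the dataset and the reduce step would combine the partial results, but the logic is the same.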
NoSQL refers to a database format that is designed to hold more than data that is simply arranged into tables, rows and columns, as is the case in a conventional relational database. This database format has proven very popular in Big Data applications because Big Data is often messy, unstructured and does not easily fit into traditional database frameworks.
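The contrast with rows and columns is easiest to see with document-style records, which many NoSQL stores use. In the sketch below (plain Python dictionaries standing in for a real document database, with invented customer data), the two records do not share the same fields, which a relational table would not allow without schema changes:

```python
# Two "documents" in a document-oriented (NoSQL-style) store.
# Unlike rows in a relational table, they need not share a schema.
customers = [
    {"id": 1, "name": "Alice", "orders": [101, 102]},
    {"id": 2, "name": "Bob", "twitter": "@bob", "notes": "prefers email"},
]

# Query: find every customer with at least one recorded order.
with_orders = [c["name"] for c in customers if c.get("orders")]
print(with_orders)  # ['Alice']
```

Real document databases add indexing, replication and query languages on top, but the underlying flexibility is the point: messy, varied records can be stored as they are.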
Python is a programming language that has become very popular in the Big Data space due to its ability to work well with large, unstructured datasets. It is generally considered easier for a data-science beginner to learn than other languages such as R, and more flexible.
Structured data is data that can be arranged neatly into charts and tables consisting of rows, columns or multi-dimensional matrices. This is traditionally the way that computers have stored data, and information in this format can be easily and simply processed and mined for insights. Data gathered from machines is often a good example of structured data, where various data points—speed, temperature, rate of failure, RPM, etc.—can be neatly recorded and tabulated for analysis.
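Because structured data fits rows and columns, even the standard library can process it directly. The sketch below parses some invented machine readings from CSV, a classic structured format, and computes a summary statistic:

```python
import csv
import io

# Invented machine sensor readings in CSV form: every row has the
# same columns, which is what makes this data "structured".
raw = """speed,temperature,rpm
120,70.5,3000
130,72.1,3200
110,69.8,2900"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Mining the table for a simple insight: the average temperature.
avg_temp = sum(float(r["temperature"]) for r in rows) / len(rows)
print(round(avg_temp, 2))  # 70.8
```

Contrast this with the unstructured data described next, where there are no ready-made columns to iterate over.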
Unstructured data is any data that cannot be easily put into conventional charts or tables. This can include video data, pictures, recorded sounds, text written in human languages and a great deal more. This data has traditionally been far harder to draw insight from using computers, which were generally designed to read and analyze structured information. However, since it has become apparent that a huge amount of value can be locked away in this unstructured data, great efforts have been made to create applications that are capable of understanding it, for example through visual recognition and natural language processing.
R is another programming language commonly used in Big Data, and can be thought of as more specialized than Python, being geared towards statistics. Its strength lies in its powerful handling of structured data. Like Python, it has an active community of users who are constantly expanding and adding to its capabilities by creating new libraries and extensions.
A recommendation engine is basically an algorithm, or collection of algorithms, designed to match an entity (for example, a customer) with something they are looking for. The recommendation features of services such as Netflix or Amazon rely heavily on Big Data technology to gain an overview of their customers and, using predictive modelling, match them with products to buy or content to consume. The economic incentives offered by recommendation engines have been a driving force behind many commercial Big Data initiatives and developments over the last decade.
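A toy version of the matching idea can fit in a few lines: recommend to a user whatever the most similar other user bought. The users, items and the overlap-count similarity measure here are all invented simplifications; production systems use far richer models over vastly more data.

```python
# Toy recommendation engine: suggest items bought by the most
# similar other user. All data here is invented for illustration.
purchases = {
    "alice": {"book", "lamp", "kettle"},
    "bob":   {"book", "lamp", "mug"},
    "cara":  {"pen"},
}

def recommend(user):
    """Recommend items the most similar other user bought."""
    mine = purchases[user]
    # Similarity = number of items in common (a crude proxy for
    # the predictive models real engines use).
    best = max((u for u in purchases if u != user),
               key=lambda u: len(purchases[u] & mine))
    return sorted(purchases[best] - mine)

print(recommend("alice"))  # ['mug']
```

Alice and Bob share two purchases, so Bob is judged most similar and Alice is offered the item he bought that she has not.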
Real time means “as it happens” and, in Big Data, specifically refers to a system or process that gives data-driven insights based on what is happening now. Recently, there has been a big push for the development of systems that are capable of processing and offering insights in real time (or near-real time), and advances in computing power, as well as development of techniques such as machine learning, have made it a reality in many applications.
Reporting is the crucial last step of many Big Data initiatives: getting the right information to the people who need it to make decisions, at the right time. When this step is automated, analytics is applied to the insights themselves to ensure that they are communicated in a way that will be understood and easy to act on. This usually involves creating multiple reports based on the same data or insights, each intended for a different audience (for example, an in-depth technical analysis report for engineers and an overview of the impact on the bottom line for C-level executives).
Spark is another open-source framework like Hadoop, but more recently developed and more suited to handling cutting-edge Big Data tasks involving real-time analytics and machine learning. Unlike Hadoop, Spark does not include its own file system, though it is designed to work with Hadoop’s HDFS or a number of other options. However, for certain data-related processes Spark is able to calculate at over 100 times the speed of Hadoop, thanks to its in-memory processing capability. This means it is becoming an increasingly popular choice for projects involving deep learning, neural networks and other compute-intensive tasks.
Humans find it very hard to understand and draw insights from large amounts of text or numerical data. It can be done, but it takes time, and our concentration and attention are limited. For this reason, an effort is underway to develop computer applications capable of rendering information in a visual form: for example, charts and graphics that highlight the most important insights resulting from our Big Data projects. A subfield of reporting (see above), visualizing is now often an automated process, with visualizations customized by algorithm to be understandable to the people who need to act or make decisions based on them.
Bernard Marr is an internationally best-selling business author, keynote speaker and strategic advisor to companies and governments. He is one of the world’s most highly respected voices when it comes to data in business and has been recognized by LinkedIn as one of the world’s top 5 business influencers. In addition, he is a member of the Data Informed Board of Advisers (http://data-informed.com/data-informed-board-advisers). You can join Bernard’s network simply by following him on Twitter: @bernardmarr.