As part of my work in teaching Data Literacy skills, I help non-technical stakeholders understand key concepts related to data science. Their ability to reason about data-related issues is hampered by uncertainty about what key concepts actually mean. Example questions include “what is the definition of data science?” or “what is the difference between a data scientist and a data engineer?”. In today’s blog I provide definitions of key concepts related to data science, to help non-technical stakeholders gain enough understanding for their purposes. Typically, such stakeholders include managers of teams and departments that work with data teams, or rely on data products, in any area of an organization.
Defining Key Concepts
What is Analytics?
When people say “analytics”, they often refer to the systematic computational analysis of data or statistics. But analytics, as a domain of study, is much more than that. It includes many activities related to data acquisition, data management (e.g., data storage, data validation, data cleansing, data extraction) and data representation (e.g., data visualization). Yet the overall goal is data exploration in order to generate actionable insights. All the other activities serve this purpose. Often, analytics techniques are used in order to predict future behaviours, based on analysis and understanding of past events. For example, companies use analytics techniques to find prospects who are most likely to convert, i.e. to become clients (this is what personalized advertising is all about). Governments use analytics techniques to detect fraud, and so do banks.
What is the difference between Business Intelligence and Analytics?
Business Intelligence (BI) is defined as “the strategies and technologies used by enterprises for the data analysis and management of business information”. Often, BI analysts create dashboards showing, for example, the development of sales of one product or product group versus another, or how different market segments evolve. These dashboards often allow their users to select different filters to create different views of the data, e.g. per geography, per product etc.
The main difference between Business Intelligence and Analytics is that BI examines past and current events (e.g. sales per product group per region in the last 12 months), whereas analytics aims at predicting the future (e.g. how much of product X will we sell in region Y in the next month or year?).
Big Data Definition
People often use the term “big data” when they want to express that they are dealing with a lot of data. But data quantity (volume) is just one of the dimensions of big data.
Although not a formal definition, my own explanation of Big Data is: any analytics undertaking that exhibits complexities along one or more of four dimensions, which IBM coined the ‘Four V’s of Big Data’:
- Volume refers to the scale of data;
- Velocity refers to the analysis of real-time streaming data, for example in a stock exchange or in vehicles equipped with sensors;
- Variety refers to the analysis of different forms of data – structured and unstructured data, text, audio, video, sensor measurements, social media and more;
- Veracity refers to the need to deal with the uncertainty of data – uncertain data quality, uncertain availability, completeness and correctness thereof.
When a data problem manifests complexities on one or more of these dimensions, we call it “big data”. One may say that if the data at hand manifests complexities on several of these dimensions, it is more complex than if the complexity is only along one dimension. As such, “Big Data” can be seen as a continuum, not as a “black and white” definition of whether something is big data or not. Similarly, “big” is a relative term, and so is “complexity”. You cannot compare data complexity in a big bank or at a tax agency with that of a small business.
And therefore, I tell my clients: if you have a data analytics problem that is more complex than what you’re used to, it is a Big Data problem for you (even if another organization would consider it differently). The essence is not whether your challenge meets some quantifiable criteria; it’s about how complex you experience it, in terms of volume, velocity, variety and veracity of data.
Artificial Intelligence Definition
Brought to its essence, Artificial Intelligence (AI) is intelligence demonstrated by machines (as opposed to intelligence demonstrated by humans). Often, AI is used in the context of software that is able to reason, discover meanings and learn, for which humans typically use their intelligence. Some examples include identifying a person in a crowd, learning from past mistakes and correcting them, or diagnosing a disease.
Machine Learning Definition
Closely related to AI, Machine Learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. The essence of ML is to train software so that, instead of being explicitly programmed with the answer, it learns to find the answer by itself.
Machine learning is not just about software (or: machines) learning how to solve a predefined problem, but also about finding insights when there is no clear definition of the problem. Read more about ML further down this article, where we discuss supervised and unsupervised learning.
Data Science Definition
Data science is defined as an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. The second part of this definition is very important: data science is a means, it’s not an end. Applying all these fancy techniques serves a purpose. For example, data science undertakings should help solve a problem, create actionable insights, facilitate decision making or help an organization grow.
What is the Difference Between a Data Engineer and a Data Scientist?
Typically, a data science team consists of data engineers, data scientists, and a data science manager who manages the team.
Both data engineers and data scientists need a mix of hard (technical) skills and soft skills (e.g. communication), but on that spectrum, data engineers are the more technical of the two. Data engineers are tasked with constructing databases, querying and retrieving data from databases, and building machine learning algorithms and implementing them to run on large datasets. Data scientists, on the other hand, focus more on the analysis of data (though they would also know how to retrieve data from a database), on performing experiments with data, visualizing it, and communicating insights to the broader organization (the “consumers” of the data products), e.g. the marketing department, the finance department etc.
For a more elaborate discussion about the skills of a data scientist, see my earlier blog: What are the skills of a data expert in tomorrow’s job market? More than just number crunching.
Slightly More Technical Data Definitions
Structured vs. unstructured data
Typically, structured data are data that adhere to some predefined data model. For example, the data model will “tell you” that a car is an entity/class, and it has several attributes, such as color, weight, model name etc. You probably use structured data often. The typical example is tables in Excel files, where each row is a transaction, or an entity, or any form of record, and every column provides some attribute thereof. The key advantage of structured data is that it can easily be processed by software, and hence it supports fully-automated processes.
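The idea of a predefined data model can be sketched in a few lines of code. This is a minimal illustration only; the car records and their attributes are invented for the example:

```python
# Structured data: every record follows the same predefined model
# (the same attributes), so software can process it automatically.
cars = [
    {"model": "Hatchback A", "color": "red",  "weight_kg": 1100},
    {"model": "Sedan B",     "color": "blue", "weight_kg": 1450},
    {"model": "SUV C",       "color": "grey", "weight_kg": 1900},
]

# Because the structure is known in advance, aggregations are easy
# to automate:
average_weight = sum(car["weight_kg"] for car in cars) / len(cars)
print(round(average_weight, 2))  # 1483.33
```

This is exactly why structured data supports fully-automated processes: the software knows, before seeing any record, which attributes exist and what they mean.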
Unstructured data, on the other hand, do not adhere to some predefined data model. Typical examples are text (e.g. newspaper articles, emails, books), media files (e.g. images, videos, audio) and combinations thereof (e.g. PowerPoint presentations). A key disadvantage of unstructured data is the difficulty of extracting insights in an automated way. Therefore, the potential of unstructured data in organizations is often ignored, or forgotten (if you interpret these statements as “hence there is an opportunity here”, you’re right!). Bear in mind that the vast majority of information worldwide is unstructured. Making an effort to unlock the potential of unstructured data in your organization can therefore prove highly rewarding.
What is Natural Language Processing?
Natural language is ordinary language: what people speak, write and read. It is a type of unstructured data. Natural Language Processing (NLP) is the field of using software to process and analyze large amounts of natural language data. Artificial Intelligence techniques are used in NLP, for example to understand whether, in the context of a certain sentence, the word “right” is a direction, a statement that something is correct, a sarcastic comment or just a word people say when they try to “buy time to think”. In recent years, massive advancements have been made in NLP, as demonstrated by the impressive results of the Google Translate service.

Jargon and legal texts introduce an extra level of complexity in NLP. Language is another issue to consider: tools for processing English may be more advanced than tools for processing other languages, and if a text includes multiple languages (especially if the languages are not known a priori), NLP becomes more complex still. Finally, it’s worth mentioning that processing audio is more complex than processing written documents, because pronunciation and dialects hamper the software’s ability to understand the text.
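To make the idea concrete, here is a toy sketch of the very first step of any NLP pipeline: turning raw, unstructured text into countable tokens. The sentence is invented, and real NLP systems go far beyond this, but it shows why word counts alone cannot distinguish the different senses of a word like “right”:

```python
import re
from collections import Counter

# Unstructured text as input.
text = "Right. Turn right at the next junction, right after the bridge."

# Naive tokenization: lowercase the text and pull out word-like chunks.
tokens = re.findall(r"[a-z']+", text.lower())
counts = Counter(tokens)

print(counts["right"])  # 3
```

The counter tells us “right” appears three times, but not that the three occurrences mean different things; disambiguating them from context is where AI techniques enter the picture.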
What is Data Quality?
Data is generally considered high quality if it is “fit for its intended uses in operations, decision making and planning”. Read about myths, challenges, problems and solutions concerning data quality in my earlier blog “How to turn data into business: data quality vs. Data quantity”.
What is Correlation?
Correlation is a statistical relationship, whether causal or not, between two variables in data. It indicates the degree to which two variables are linearly related. For example, a correlation can exist between people’s height and their weight: the taller a person is, the higher their weight may be expected to be. The words “may be expected” show that correlations are often used when you want to predict Y, based on knowing X and the relationship between X and Y. In mathematical terms, correlation measures how well the relationship between two variables can be described by a linear function.
Correlation should not be confused with causation. Causation means that X causes Y. With correlation, there need not be a causal relationship between X and Y. It suffices that, statistically speaking, there is a relationship between the two, e.g. when X happens, Y happens, even if X does not cause Y.
What is the Difference Between Recall and Precision?
Often enough, analytics models predict whether an individual situation is likely to be “good” or “bad”. For example, a bank would have algorithms to determine whether a new customer is likely to be fraudulent (i.e. bad) or not (i.e. good). Companies of all sorts would have models to determine whether someone is likely to become a client (i.e. good) or not (i.e. bad). These classifications will trigger different actions. For example, if a prospect is classified as “not likely to convert”, the sales team may decide not to engage with them; and if a prospect is classified as “likely to convert”, a dedicated account manager would invest time and effort in interacting with this prospect. These decisions therefore have important consequences. Overall, in each such analytic decision, the universe comprises four categories:
- True positives: the analytic model rightfully classified these as positives. In our example, the model classified the prospect as likely to convert, and the prospect converted indeed.
- False negatives: the analytic model wrongly classified these as negatives. In our example, the model classified the prospect as not likely to convert, although in reality this prospect was eager to buy the service.
- True negatives: the analytic model correctly classified these as negatives. In our example, the model classified the prospect as not likely to convert, and indeed, the prospect had no intention or means to buy the service.
- False positives: the analytic model wrongly classified these as positives. In our example, the model classified the prospect as likely to convert, yet the prospect had no intention or means to buy the service.
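The four categories above can be counted directly by comparing a model’s classifications against what really happened. The lists below are invented for illustration (1 means “converted” / “classified as likely to convert”, 0 the opposite):

```python
# Ground truth: did each prospect actually convert?
actual    = [1, 1, 0, 0, 1, 0, 1, 0]
# The model's classification for each prospect.
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives

print(tp, fn, tn, fp)  # 3 1 3 1
```

Every prospect falls into exactly one of the four buckets, so the four counts always add up to the total number of cases.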
The goal of the analytic model is to correctly identify true positives and true negatives, and to have as few false positives and false negatives as possible. False positives cost the organization resources (in our example: sales personnel spend time and other resources in vain, as the prospect will never convert), and false negatives are missed opportunities (in our example: sales personnel will not contact the prospect, and therefore a sales opportunity is missed).
Precision is the fraction of relevant instances among the retrieved instances. Using the terminology above, precision is defined as: the number of true positive cases, divided by the sum of true positive cases and false positive cases. The fewer false positives you have, the more precise your model is. Using our earlier example, precision refers to the number of prospects that were correctly classified as likely to convert, divided by the total number of prospects that were classified as likely to convert.
Recall is the fraction of relevant instances that were retrieved. Using the terminology above, recall is defined as: the number of true positive cases, divided by the sum of true positive cases and false negative cases. Using our earlier example, recall refers to the number of prospects that were correctly classified as likely to convert, divided by all the prospects that (in reality) were likely to convert.
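The two definitions translate directly into code. The counts below are made up for illustration:

```python
# Illustrative counts of the four categories.
tp = 6   # true positives: classified "likely to convert" and converted
fp = 2   # false positives: classified "likely to convert" but did not convert
fn = 3   # false negatives: classified "not likely" but actually converted

precision = tp / (tp + fp)  # of those we classified as positive, how many were right
recall = tp / (tp + fn)     # of all real positives, how many did we find

print(precision)  # 0.75
```

Note the trade-off: a model that classifies every prospect as “likely to convert” achieves perfect recall but poor precision, while a model that only flags the most obvious cases can have high precision but low recall.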
What is the Difference Between Supervised Learning and Unsupervised Learning?
We described earlier that Machine Learning is about letting software learn. We distinguish between two types of learning: supervised learning and unsupervised learning.
Unsupervised learning is a type of algorithm that learns patterns from untagged data, i.e. trying to uncover unobserved factors in the data. It is called “unsupervised” as there is no known correct outcome to judge against: there are no training data that have already been tagged with the correct answer.
Supervised learning is an algorithm that maps an input to an output (i.e. it predicts the output) based on example input-output pairs. Supervised learning algorithms analyze training data, which have been tagged with the correct answers, to produce an algorithm that will be able to map new cases. For example, to teach software how to recognize frogs, you feed the software many photos in which all the frogs have been tagged (annotated). Based on its observations about all the photos where frogs are depicted, as well as the photos where frogs are not depicted, the software will define an algorithm to detect whether or not a frog is depicted in an image. You can subsequently use this algorithm to detect frogs in any image. The activity of tagging is referred to as “data annotation“.
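One of the simplest supervised-learning algorithms is the nearest-neighbour classifier, sketched below: to classify a new case, find the most similar tagged training example and reuse its label. The numeric “features” standing in for photos, and the labels, are invented for this illustration:

```python
# Tagged training data: (features, label) pairs. In a real system the
# features would be derived from the annotated photos themselves.
training = [
    ((1.0, 1.0), "frog"),
    ((1.2, 0.9), "frog"),
    ((5.0, 5.2), "not frog"),
    ((5.5, 4.8), "not frog"),
]

def classify(point):
    """Predict the label of the nearest training example (1-NN)."""
    def squared_distance(example):
        (x, y), _label = example
        return (x - point[0]) ** 2 + (y - point[1]) ** 2
    return min(training, key=squared_distance)[1]

print(classify((1.1, 1.1)))  # frog
```

The “learning” here is trivially stored in the tagged examples themselves, but the structure is the same as in any supervised method: tagged inputs in, a mapping from new inputs to predicted outputs.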
Typically, you would apply supervised learning in cases where you wish to predict something, and where the computational performance of the model is of importance.
Given that data skills were not part of the curricula that educated most of today’s working population, it should come as no surprise that many of today’s workers lack sufficient data literacy skills to succeed in the data economy. Nowadays, the ability to work with data is important in almost any profession: in finance, in marketing, in sales, in logistics, in manufacturing etc. Managers across the board require data literacy skills, yet often enough, executives lack sufficient understanding of data and of working with data products. The current blog is yet another one in a series of blog articles aimed at improving the data literacy skills of non-technical staff.
If you wish to provide data literacy training to your staff, contact me to enquire about a data literacy course dedicated to your team. Courses are tailored such that they comprise two parts: the generic skills, and how these apply in the context of the organization at hand. For details, contact me through my LinkedIn profile or by sending an email to firstname.lastname@example.org, where [x=ziv], [y=baida] and [z=nl].
- Big data for customs at the borders: start small, think big!
- How to turn data into business: data quality vs. Data quantity
- What are the skills of a data expert in tomorrow’s job market? More than just number crunching