Data bias is often understood as a situation in which the available data is not representative of the population or phenomenon of study. In my earlier blog about the skills of a data expert, I also discussed bias in data interpretation. And finally, bias can also be introduced by how the results of data analysis are communicated. In today's blog, I refer to bias in the broader sense, i.e. covering all of the above. My aim is to help readers become aware of bias in the data that they consume; the ultimate aim, I suppose, is removing bias from data altogether.
To demonstrate my ideas, I’ll use the example of health and death statistics of the Coronavirus pandemic. I do so with mixed feelings because while data analysis of sickness and death cases is of utmost importance for developing strategies to cope with the pandemic, such analysis fails to acknowledge that every single death behind the numbers is a personal tragedy for many people.
Coronavirus Statistics
Since the outbreak of the coronavirus, governments worldwide have been publishing daily statistics. Initial statistics counted the "number of infected people". However, because there are unknown unknowns (the government only knows about infected people who have been tested by the healthcare system), the statistics were later redefined as the "number of positively tested people". A second part of the statistics is the number of deaths. In this article, I refer to the statistics as published on https://www.worldometers.info/coronavirus/, a website that collects statistics from countries worldwide. I use the figures as of April 12, 2020, 14:08 GMT, and I'll use the example of The Netherlands to demonstrate bias in data interpretation and communication.
Adding Data to Data: Putting Things In Perspective
According to the statistics, the Netherlands has 25,587 cases of positively tested people. This puts The Netherlands in place #11 worldwide, which by itself is a high ranking for any country. The gap to #9 is very big (Turkey occupies that place with 52,167 cases), whereas #12 (Switzerland; 25,300 cases), #13 (Canada; 23,318 cases) and #14 (Brazil; 21,040 cases) are not far behind The Netherlands. Does that mean that the situation in The Netherlands is not good, but still reasonable?
If you look only at these numbers, you may think so. If you do, you have become subject to bias caused by having insufficient context to interpret the data. You accept the data as objective facts, yet you lack the information required to interpret them, and so you are forming an opinion based on incomplete data. You believe your opinion rests on facts, and hence you are confident in it, yet those facts are incomplete and somewhat misleading.
Let us now consider how adding data parameters to the analysis can remove this bias.
Adding Data: Population Size
The first extra parameter to add to the analysis is the size of the population. The same number of infected people has very different implications in a small country than in a large one.
To make country statistics more comparable, let us reorder the list by the total number of positively tested people per 1 million inhabitants. The Netherlands now ranks #23 worldwide. This list, however, has another flaw. Most of the countries in the top-20 positions (i.e. the 20 countries with the most cases per 1 million inhabitants) are very small countries, raising the question of how comparable their situation is to that of large countries. Many of them have fewer than 100,000 inhabitants; Luxembourg is the biggest, with a population of approximately 626,000. These countries include San Marino (#1), Vatican City (#2), Andorra (#3), Luxembourg (#4), Iceland (#5), Gibraltar (#6), Faroe Islands (#7), Isle of Man (#10), Monaco (#13), Channel Islands (#14), Liechtenstein (#15) and Montserrat (#18).
In these small countries, even a small absolute number of cases translates into a high rate per million inhabitants. In global statistics, however, these cases are not representative, i.e. they are not comparable to those of larger countries. Should these countries be omitted from the statistics, so that you can compare The Netherlands to its peers? But what is the peer group? Should countries that are much bigger also be removed from the statistics? For example, can the Netherlands (roughly 17 million inhabitants) be compared with China (almost 1.4 billion inhabitants), or the US (roughly 328 million inhabitants)? There is no straightforward answer. Your bias will determine whether you include these small countries in the analysis or not.
For the sake of continuing our discussion, let’s assume we leave all countries in the statistics. Thus the Netherlands ranks #11 worldwide in the number of positively tested people, yet only #23 worldwide in the number of positively tested people per 1 million inhabitants. That sounds more positive!
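To make the normalization concrete, here is a minimal Python sketch of the re-ranking described above. It is my illustration, not part of the original statistics: the case counts are the ones cited in this blog, while the population figures are rounded approximations that I add for the example.

```python
# A minimal sketch (mine, for illustration) of the per-capita re-ranking
# described above. Case counts are those cited in this blog; the population
# figures are rounded approximations.

countries = {
    # country: (total positively tested people, approximate population)
    "Turkey":      (52_167,  84_000_000),
    "Netherlands": (25_587,  17_000_000),
    "Switzerland": (25_300,   8_600_000),
    "Canada":      (23_318,  38_000_000),
    "Brazil":      (21_040, 212_000_000),
}

def per_million(cases: int, population: int) -> float:
    """Normalize an absolute count to a rate per 1 million inhabitants."""
    return cases / population * 1_000_000

by_total = sorted(countries, key=lambda c: countries[c][0], reverse=True)
by_rate  = sorted(countries, key=lambda c: per_million(*countries[c]), reverse=True)

print("Ranked by total cases:      ", by_total)
print("Ranked by cases per million:", by_rate)
```

Even in this tiny subset, the order changes once you normalize by population, which is exactly the effect that moves the Netherlands from #11 to #23 in the full list.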
Adding Data: Total Death Cases
Let's look at another parameter: the total number of deaths per country. With its 2,737 deaths, The Netherlands ranks #10 worldwide, higher than in any of the previous statistics. Compare the Netherlands to its neighbor Germany (#9 in total deaths). While the total number of deaths in the two countries is very similar, Germany has roughly 5 times more positively tested persons.
| Country | Total cases | Total deaths |
| --- | --- | --- |
| Germany | 125,452 | 2,871 |
| The Netherlands | 25,587 | 2,737 |
The German population is also roughly 5 times the Dutch population. What does this mean for The Netherlands? Let us look at another data parameter: the number of deaths per 1 million inhabitants. While Germany ranks #22 worldwide, The Netherlands occupies place #8 in the worldwide ranking of deaths relative to the size of the population. And if you remove from this list the small countries with fewer than 50 deaths in total, the Netherlands ranks #5 worldwide in its Coronavirus death rate.
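The per-capita arithmetic behind this comparison is simple. The sketch below (again mine, not the source's) uses the death counts cited above together with rounded population figures of roughly 83 million for Germany and 17 million for the Netherlands, and applies the same "fewer than 50 deaths" cut-off mentioned above.

```python
# A quick sketch of the deaths-per-million arithmetic behind this comparison.
# Death counts are the figures cited above; populations are rounded approximations.

data = {
    # country: (total deaths, approximate population)
    "Germany":     (2_871, 83_000_000),
    "Netherlands": (2_737, 17_000_000),
}

MIN_DEATHS = 50  # threshold used above to leave out very small countries

for country, (deaths, population) in data.items():
    if deaths < MIN_DEATHS:
        continue  # too few deaths to compare meaningfully
    rate = deaths / population * 1_000_000
    print(f"{country}: {rate:.0f} deaths per 1 million inhabitants")

# Germany comes out around 35 per million, the Netherlands around 160:
# similar absolute totals, a very different per-capita picture.
```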
Bias In Communicating Data Insights
Each of the previous statistics about The Netherlands is true, and they are all based on facts; on objective data:
- The Netherlands ranks #11 worldwide in the number of people who are known to have been infected with the Coronavirus
- The Netherlands ranks #23 worldwide in the number of people who are known to have been infected with the Coronavirus, per 1 million inhabitants
- The Netherlands ranks #10 worldwide in the number of people who are known to have died of the Coronavirus
- The Netherlands ranks #8 worldwide in the number of people who are known to have died of the Coronavirus per 1 million inhabitants
- The Netherlands ranks #5 worldwide in the number of people who are known to have died of the Coronavirus per 1 million inhabitants, when not considering countries with fewer than 50 death cases in total
But now imagine the impact of each of the above statements if made by the Prime Minister of the Netherlands in a press conference addressing the Dutch population. Especially the difference between the second statement (#23 worldwide) and the fifth statement (#5 worldwide) is substantial. And here we get into what I refer to as bias in communication.
When communicating insights from data analysis, the communicator may choose a specific message based on the desired impact. If the Prime Minister wants to convey the message that the situation is severe (for example, to justify drastic measures that impose substantial restrictions on individuals), he may select the fifth message. But if he wants to convey the message that things are under control (for example, to avoid panic, or to appear in control), he may choose the second message. By choosing to present certain data (and thus not to present other data!), the communicator may be hiding information from the audience, namely the impact of the data parameters that have been left out of the analysis.
The communicator has his reasons for this choice, yet he does not share them with his audience, nor even the fact that an explicit choice has been made and that more facts exist that may lead to different conclusions. In data terms, we would say that the communicator presents insights from data analysis without documenting the underlying assumptions or limitations of the data and the analysis. And thus the communicator imposes his bias (his choice) on his audience.
Note: I’m using the Netherlands as an example of communicating insights from data analysis, for the sake of discussion; I am not implying any deliberate choices to emphasize or hide certain facts from the public in The Netherlands.
Can Laymen Be Responsible For Detecting Bias?
Thus the choice of the desired message (by the person communicating the results of data analysis) creates a certain, often anticipated or even desired, response among the audience, and so influences the audience in favor of the communicator's bias.
Whenever we consume data (through reading, listening, watching, …), each one of us, as the audience, has the responsibility to ask in-depth questions in order to verify the validity of the data analysis and detect biases. But let's be realistic:
- Experience shows that many people do not ask such questions: the trend in society (fueled by social media) is to make short statements without verifying facts, rather than to ask in-depth questions and verify information; this is how fake news propagates
- Especially when the communicator is considered to be “an authority” (e.g. an academic in the area of discussion, or a Government official responsible for the domain at hand), people trust the communicator and are less likely to ask questions that may reveal a bias.
- Many people are not trained (educated) in how to ask critical questions, especially in areas in which they are not experts (the current article aims to contribute to exactly that education)
- Not everybody has the required intellectual capacity to ask in-depth questions and detect biases, especially in areas that they know little of.
Dealing With Data Requires Integrity
Therefore, I posit that you cannot assume that your audience has the responsibility and the ability to detect bias in the message that you convey to them. Of course, ideally, they should be able to do so. But in many cases, assuming that they can would be an invalid assumption.
In my earlier blog, I wrote: "data experts must have strong ethics and challenge themselves and their peers about potential biases in their data and in their analysis. The data expert is an investigator in pursuit of 'the truth' while avoiding bias". The case discussed here refers not only to bias in data analysis (each of the above statistics presents one valid perspective on the Coronavirus pandemic). I've argued that different choices in communicating the insights of Coronavirus statistics (does the Netherlands rank #5 or #23?) lead to different responses among the audience. It becomes important that the communicator does not abuse this power by making choices that mislead rather than reflect reality.
In the case of Governments, most citizens tend to trust their Government and its communication (although my feeling, not based on sound research, is that this trust has been decreasing in recent years). More broadly speaking, data is a tool that can be used positively or negatively (and who has the mandate to decide what is right and what is wrong?). Integrity in dealing with data, from data collection all the way up to communicating the results of data analysis, will become more and more critical in the data economy. Because if we cannot trust data, we cannot use it. Basing your actions on misleading data can be worse than making random decisions with no data at all.
Yet Another Bias
Do you think that I have now provided you with all the relevant parameters, and that you now have all the information required to form your own opinion on how good or bad things actually are in the Netherlands? You're wrong.
Who Is Included In The Statistics?
How complete is the data on which you base your insights?
Consider that the number of deaths in the statistics only includes… the people who have been included in the statistics, obviously. But many more people could have died of the coronavirus without ever being included in the statistics. And indeed, a report by the Dutch Central Bureau of Statistics shows that in the week between 30/03/2020 and 05/04/2020, approximately 2,000 more people died in the Netherlands than on average. Does the Coronavirus explain this statistical deviation? The Coronavirus statistics from the Netherlands show that in this week, somewhat more than 1,000 people died of the coronavirus. What about all the others? What did they die of? Maybe their deaths were indeed caused by other reasons (e.g. cancer, car accidents). Maybe they could be considered a "statistical deviation". But maybe some or many of them died of the coronavirus, yet they haven't been included in the statistics, for example because they were not tested.
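As a back-of-the-envelope illustration, the gap described above boils down to simple arithmetic; the rounded figures below are the approximate numbers cited from the CBS report and the Dutch statistics for that week.

```python
# Rough arithmetic behind the excess-mortality observation above, using the
# approximate figures cited for the week of 30/03/2020 - 05/04/2020.

excess_deaths_vs_average = 2_000  # deaths above the weekly average (CBS report)
reported_covid_deaths    = 1_000  # deaths attributed to the Coronavirus that week

unexplained = excess_deaths_vs_average - reported_covid_deaths
print(f"Excess deaths not covered by the Coronavirus statistics: ~{unexplained}")
# Some of these may have other causes, but some may be untested Coronavirus
# deaths that never entered the official statistics.
```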
Number of Tests
The statistics on the number of known cases of coronavirus depend greatly on how many tests a country performs. If you perform very few tests, you may not be aware of how bad the situation is. A layman may assume that countries that have been impacted strongly by the Coronavirus perform more tests because the situation demands such action. So let’s look at the statistics again.
While the Netherlands ranks #23 worldwide in the number of people who are known to have been infected with the Coronavirus per 1 million inhabitants, and while it ranks #8 worldwide in the number of people who are known to have died of the Coronavirus per 1 million inhabitants, it ranks only #47 in the number of tests that it performs per 1 million inhabitants. Indeed, the Dutch government's policy is to test primarily high-risk populations and employees of the healthcare system. Most people with Corona symptoms are not being tested in the Netherlands. This explains the relatively high death rate in the Dutch statistics (more than 10% of the known cases end in death).
The Swiss Government has taken a different approach. Some may say that the two countries are similar, because the number of known cases is almost equal (25,587 known cases in the Netherlands versus 25,300 in Switzerland), yet the Dutch population is roughly twice the Swiss population.
Switzerland ranks #15 worldwide in the number of tests per 1 million inhabitants (22,393 tests), while the Netherlands ranks #47 with only 5,926 tests per 1 million inhabitants. One may now ask: how many known cases of Coronavirus would the Netherlands have, and how many known deaths would the Netherlands have, if it performed as many tests (per 1 million inhabitants) as Switzerland? Given the higher rate of tests in Switzerland, Swiss statistics are likely to be more reliable than Dutch statistics. The less you test, the more you operate in the dark, and thus the less reliable your statistics are. And yet Swiss statistics and Dutch statistics are combined into global statistics as if they were comparable.
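For completeness, here is the same comparison as a small sketch of my own, using only the test rates and the Dutch case and death counts cited in this article.

```python
# Test-rate and case-fatality arithmetic, using the figures cited in this article.

nl_cases, nl_deaths  = 25_587, 2_737
nl_tests_per_million = 5_926
ch_tests_per_million = 22_393

# Share of *known* Dutch cases that ended in death; with few tests, many mild
# cases are never counted, which pushes this share up.
cfr = nl_deaths / nl_cases
print(f"Dutch deaths as a share of known cases: {cfr:.1%}")  # roughly 10.7%

# How much more intensively Switzerland tests, per 1 million inhabitants.
ratio = ch_tests_per_million / nl_tests_per_million
print(f"Switzerland tests roughly {ratio:.1f}x as much per million inhabitants")  # ~3.8x
```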
Concluding Remarks
The aim of this article is not to point out what the situation in the Netherlands is. Nor is it to present a preference for some statistics over others. I aim to provide neither support for nor criticism of the choices that any Government makes in communicating statistics about the Coronavirus.
My aim is to demonstrate to readers how working with incomplete data makes you subject to bias. Once you realize this, you are more likely to ask in-depth questions whenever you hear something. Do not blindly rely on information that is presented to you. Verify its quality, ask tough questions, and form your own opinion once you know all the facts. Or at least, all the facts that you deem relevant and are able to obtain. And if you are unable to obtain some data, it's still OK to form an opinion, as long as you're aware that this opinion is based on certain assumptions that impose limitations on the validity of your analysis.
The discussion in this article applies to any type of data; Coronavirus statistics are used here only as an example.
Suggested reading:
- Skills of the data expert in tomorrow’s job market: more than just number crunching
- Turning data into business: Data quality vs. Data quantity