German Bundesliga helps us learn about Corona statistics: We have 17-58 times more cases than in the official statistics
Thank you, Bundesliga, the German Soccer League! As supporters of applied statistics, how long have we been waiting for a representative sample to estimate the real infection status? To restart the show in the soccer business – with ghost matches in empty stadiums – comprehensive testing for Covid-19 took place. Great for soccer supporters and even greater for us data enthusiasts and friends of fact-based decisions.
At oConsulting we took the information from the Bundesliga tests and ran some statistics, with enlightening results. We used proven methods from Six Sigma – typically used in regulated production environments and quality management – to estimate the number of persons who are actually infected. This number is far higher than the official figures suggest.
The data situation:
Official data always present reported cases. A reported case is a person who has tested positive for Covid-19. In Germany this is usually done with a Polymerase Chain Reaction (PCR) test. A PCR test basically diagnoses whether the virus is currently present in your body.
Important to know: No representative study on Covid-19 has been conducted in Germany yet. Only persons who show strong symptoms and have been in high-risk areas or in close contact with confirmed cases are tested. The exception is special persons of interest, of whom quite a few were Corona positive, just as in the case of the Bundesliga. With this test method around 170,000 people in Germany have so far been diagnosed as infected with Covid-19. It is estimated that 144,000 of them have recovered. Factoring in the more than 7,000 fatalities, we estimate ~17,000 currently infected, or 10% of those 170,000. This 10% is an important number in what follows.
To restart the league later in the month 1,724 sportsmen and their teams have been tested and 12 of them were found Covid-19 positive (by the way: without showing strong symptoms).
Wonderful! For the first time, results have been published that are somewhat representative. We can apply statistics to these numbers to draw conclusions about the wider population because the tested persons come from all parts of Germany and from different age groups. However:
- The test group is mainly male. This is not representative of the population, but since Covid-19 cases are roughly equally distributed between women and men this is not critical.
- The test group has experienced the same contact restrictions as the rest of the German population. However, it may have been less exposed to crowded supermarkets and public transportation. Thus the actual population likely has an even higher percentage of undetected infections.
We work here on the basis of 1,724 persons with “yes/no” results. Of course this is not a whole lot of data for making precise estimates about a population of >80 million. We will see later how certain we can be.
Let’s start with the statistics
In a Six Sigma process improvement training we would run this exercise: “You have tested 1,724 persons. 12 of them tested positive. That makes a proportion of 0.70%. Assuming the sample is representative and you can infer from it to the entire population: What true infection rate can you expect?” Normally this type of approach is used in process validation or when working on process or quality deviations.
The correct answer: “The true infection rate is between 0.36% and 1.21% (with 95% confidence).”
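For readers who want to check an interval like this themselves, here is a minimal sketch. The method behind the figures above is not stated, so the exact (Clopper-Pearson) binomial interval used here is an assumption, and the function names are illustrative. With the 12 positives among 1,724 tested persons it should land close to the 0.36% to 1.21% quoted above.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term.

    Fine for small counts like the ones here; not meant for very large k.
    """
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, iters=100):
    """Bisection on [0, 1] for f(p) = target, where f decreases as p grows."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies at a larger p
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(positives, n, confidence=0.95):
    """Exact two-sided confidence interval for a binomial proportion."""
    alpha = 1 - confidence
    # Lower bound: the p at which P(X >= positives) equals alpha / 2.
    if positives == 0:
        lower = 0.0
    else:
        lower = solve_decreasing(
            lambda p: binom_cdf(positives - 1, n, p), 1 - alpha / 2
        )
    # Upper bound: the p at which P(X <= positives) equals alpha / 2.
    if positives == n:
        upper = 1.0
    else:
        upper = solve_decreasing(lambda p: binom_cdf(positives, n, p), alpha / 2)
    return lower, upper


lower, upper = clopper_pearson(12, 1724)
print(f"point estimate: {12 / 1724:.2%}")
print(f"95% interval:   {lower:.2%} to {upper:.2%}")
```

Other interval methods (for example a normal approximation) give somewhat different bounds, especially with so few positives, so small deviations from the quoted figures are to be expected.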
Now, let’s translate this into some astonishing findings:
We hear every day about the number of infected persons. Status for Germany on May 10, 2020: 170,000 people. The unexpressed underlying message we receive with it: Only a very small proportion of all inhabitants has had the virus.
In reality a lot more persons have been infected to date:
- The real number of people in Germany currently infected is between 300,000 and 1 million, not the 17,000 you can calculate from the official numbers.
- Even more interesting: The total number of people who have been infected at some point is at least 2.9 million (in the illustration the sum of the green and the red part). It may be up to 9.8 million (this is between 4% and 12% of the German population).
- So, the published numbers underestimate reality by a factor of at least 17 (!). And it could be a factor of 58 (!).
Why the big range from 2.9 million to 9.8 million? This is typical for statistical data: The uncertainty is due to the relatively small sample size of 1,724 Soccer League team members who have been tested. (In statistics you can say: the larger your sample size, the more precise your conclusion.)
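The step from the infection-rate interval to these headline numbers is plain arithmetic, sketched below. The population of 81 million is an assumption (the text only says >80 million), chosen because it reproduces the rounded figures; the variable and function names are illustrative. The sketch also assumes, as the text does, that the 10% ratio of active to total reported cases carries over to the real cases.

```python
POPULATION = 81_000_000    # assumption: the text only states ">80 million"
REPORTED_TOTAL = 170_000   # reported cases as of May 10, 2020 (rounded)
REPORTED_ACTIVE = 17_000   # reported cases that are currently infected (rounded)
ACTIVE_SHARE = REPORTED_ACTIVE / REPORTED_TOTAL  # the 10% used in the text

# 95% confidence bounds for the current infection rate, as quoted in the text
RATE_LOW, RATE_HIGH = 0.0036, 0.0121


def scale_to_population(rate):
    """Turn a current infection rate into the three headline numbers."""
    active = rate * POPULATION          # people currently infected
    ever = active / ACTIVE_SHARE        # people infected at some point
    factor = active / REPORTED_ACTIVE   # underestimation factor vs. reports
    return active, ever, factor


for label, rate in (("lower", RATE_LOW), ("upper", RATE_HIGH)):
    active, ever, factor = scale_to_population(rate)
    print(
        f"{label} bound: {active:,.0f} currently infected, "
        f"{ever:,.0f} ever infected, factor {factor:.0f}"
    )
```

Because the same 10% ratio is applied to both the reported and the estimated numbers, the underestimation factor is the same for active cases and for total cases.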
What can we learn from it?
The common understanding is that it won’t be possible to stop the pandemic but only to slow it down. The good news from this analysis is that way more people are already through with it (17-58 times more than reported).
Since the official numbers represent the real situation so poorly, we strongly recommend NOT basing decisions on them. For example: Intensifying restrictions as soon as there are more than 50 new cases per 100,000 inhabitants may be a completely random and over-anxious reaction. As we have seen, this number is wrong by a factor of at least 17.
The bad news: It will take a while and will hit the economy seriously. If the desired rate of ~900 new reported cases per day (with the existing measurement quality) continues, corresponding to at least 16,000 actual new cases per day, we will reach a maximum of 20% infection by September 2020. So, restrictions will continue.
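As a rough plausibility check of this projection, the sketch below simply adds a constant number of actual new cases per day to the estimated totals. It is a linear extrapolation under loud assumptions, not an epidemiological model: the end date of September 1, 2020, the population of 81 million, and all names are illustrative choices that do not come from the data sources.

```python
from datetime import date

POPULATION = 81_000_000        # assumption: the text only states ">80 million"
REPORTED_NEW_PER_DAY = 900     # desired rate of reported new cases per day
START = date(2020, 5, 10)      # date of the data used in this post
END = date(2020, 9, 1)         # assumption: "by September 2020" read as Sept 1

# (underestimation factor, people already infected at some point), from above
SCENARIOS = {
    "lower bound": (17, 2_900_000),
    "upper bound": (58, 9_800_000),
}


def projected_share(factor, ever_infected_now):
    """Share of the population infected by END with a constant daily rate."""
    days = (END - START).days
    actual_new_per_day = REPORTED_NEW_PER_DAY * factor
    total = ever_infected_now + actual_new_per_day * days
    return total / POPULATION


for name, (factor, ever_now) in SCENARIOS.items():
    share = projected_share(factor, ever_now)
    print(f"{name}: about {share:.0%} of the population infected by {END}")
```

Under these assumptions the upper-bound scenario ends up in the region of 20%, while the lower-bound scenario stays far below it.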
As consultants specialized on effectiveness and efficiency improvements we recommend:
The data situation should be improved with a planned representative study. Such a study needs a sound approach to representativeness and should combine PCR tests with antibody tests, which can diagnose who has been infected in the past and has produced antibodies in reaction. This could be used to validate the ratio of recovered vs. active cases.
We use Germany as a case in point here because of data availability. Surely there is comparable information out there for other countries/regions.
Data used in this blog come from the Robert Koch Institute as of May 10, 2020, and from published results of the Bundesliga and its clubs up to May 10, 2020. For ease of reading we have rounded a lot of the numbers quoted here.