
Statistics cannot prove that the Huanan Seafood Wholesale Market was the early epicenter of the COVID-19 pandemic

We criticize a statistical proof of the hypothesis that the Huanan seafood wholesale market was the epicenter of the COVID-19 pandemic. There are three points in the proof we consider critically: (1) The Huanan seafood wholesale market is not a data-driven location. (2) The assumption that a centroid of early case locations or another simply constructed point is the origin of an epidemic is not proved. (3) A Monte Carlo test used to prove that no other location than the seafood market can be the origin is not correct. Hence, the question of the origin of the pandemic is still open.


1 Introduction

On 31 December 2019, the Chinese government notified the World Health Organization (WHO) of an outbreak of severe pneumonia of unknown etiology in Wuhan, Hubei province. This may be considered the beginning of the COVID-19 pandemic. The question of its origin was soon raised and has been debated controversially ever since. Currently, in 2022, there are two main hypotheses: (a) the origin is zoonotic, i.e. the virus came from animals sold at the Huanan Seafood Wholesale Market in Wuhan (the zoonosis hypothesis); and (b) there was an accident at the Wuhan Institute of Virology through which the virus escaped human supervision (the ‘lab-leak’ hypothesis). However, each hypothesis is not necessarily the alternative to the other.

On 26 July 2022, Science published the paper by Worobey et al. (2022), whose title states clearly: “The Huanan Seafood Wholesale Market in Wuhan was the early epicenter of the COVID-19 pandemic”. This paper, to which we refer in the following as W, has attracted world-wide attention and media coverage; its preprint version, released in 2022, had already been downloaded more than a hundred thousand times before the peer-reviewed final version was published. To support the statement in its title, the paper uses two different arguments: a statistical one and a zoonotic one based on a coincidence, namely the fact that animals (mammals) were sold at the seafood market.

The statistics in W mainly use center-points, which are defined by the coordinate-wise medians of the latitudes and longitudes. Using the center-point to identify the “center” of a point cloud is analogous to using the median to measure the central tendency of a set of numerical data; W does so under the (unestablished) assumption that the “center” of a cloud of case locations spreading from an origin of the infection process is close to this origin. However, the coordinate-wise median is a questionable choice for defining the “center” of a point cloud, because it is not rotationally invariant (Walker, 2022a). Using the centroid, which is the coordinate-wise mean, may be geometrically more robust. Moreover, W considers only one location as a possible origin, namely the seafood market. Its authors carried out a Monte Carlo test, found that the median distance between the seafood market and the confirmed cases is significantly shorter than the median distance between the seafood market and independent points following the Wuhan population density, and concluded that the seafood market was the origin of the infection process. This choice of the origin is based on a non-statistical argument and is by no means data-driven. This has led us, as statisticians, to a critical view of the paper. Writing for statisticians as target readers, we discuss in technical detail only the above-mentioned statistical aspects of W and do not judge its zoonotic argument.
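The lack of rotational invariance mentioned above can be illustrated with a minimal sketch. The three toy points, the 45-degree rotation and the helper names `center_point`, `centroid` and `rotate` are illustrative assumptions, not data or code from W. An estimator that behaves consistently under rotation should give the same result whether the points are rotated before or after the "center" is computed; the centroid does, the coordinate-wise median in general does not.

```python
import math
from statistics import fmean, median


def center_point(points):
    """Coordinate-wise median of 2-D points (the 'center-point')."""
    return (median(x for x, _ in points), median(y for _, y in points))


def centroid(points):
    """Coordinate-wise mean of 2-D points."""
    return (fmean(x for x, _ in points), fmean(y for _, y in points))


def rotate(point, angle):
    """Rotate a 2-D point about the coordinate origin by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


# Toy data, chosen only for illustration.
points = [(0.0, 0.0), (4.0, 1.0), (1.0, 5.0)]
angle = math.radians(45)
rotated = [rotate(p, angle) for p in points]

# Gap between "center of the rotated points" and "rotated center of the points".
cp_gap = math.dist(center_point(rotated), rotate(center_point(points), angle))
ce_gap = math.dist(centroid(rotated), rotate(centroid(points), angle))

print(f"center-point gap: {cp_gap:.3f}")  # clearly non-zero
print(f"centroid gap:     {ce_gap:.3e}")  # zero up to rounding error
```

In this toy example the center-point moves by more than two units depending on the orientation of the axes, while the centroid is unaffected.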

We come to the conclusion that their statistical arguments are not convincing, and hence we cannot accept an important part of the proof in W. Moreover, even if the seafood market could be established as the “center”, causal inference would still be unjustified (see e.g. Dablander and Hinne, 2019); centrality does not imply causality (Walker, 2022b). This does not, of course, mean that we reject the zoonosis hypothesis; we just consider the question of which of the two hypotheses is true to be still open.

2 Materials and methods

2.1 General

The starting point of the statistics in W, as well as of the present paper, is the point pattern of the address locations of the people infected in December 2019 (“cases” for short). W recovered 155 cases from the 164 cases shown in Annex E2 Figure 4 of the WHO report (World Health Organization, 2021); the precise latitude and longitude coordinates of these data are not available, and W claimed that the extraction method introduced no more than 50 m of noise in each case. There is a cluster of seven cases with the same address; these duplicate locations are treated as different cases in W. Unfortunately, these data are of poor quality: imprecise locations, no clear onset date per case, etc. Figure 1 in Holmes et al. (2021) gives a vague impression of what would be possible if temporal data were also available. However, we still use W’s data here, without any further critique, so that we have the same point of departure. Nevertheless, instead of using latitude and longitude coordinates and the Haversine distance, we project these locations to UTM coordinates, as shown in Figure 1, and use the Euclidean distance. The data can be found in the supplementary materials of this paper.
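To give a feeling for why replacing the Haversine distance by a planar Euclidean distance is harmless at the scale of a city, here is a rough sketch. It does not use UTM: as a stand-in it uses a simple local equirectangular projection, and the two coordinate pairs and the reference point are hypothetical values in the Wuhan area, not case locations from W.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle (Haversine) distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def to_local_plane(lat, lon, lat0, lon0):
    """Local equirectangular projection around (lat0, lon0); NOT UTM."""
    x = EARTH_RADIUS_M * math.radians(lon - lon0) * math.cos(math.radians(lat0))
    y = EARTH_RADIUS_M * math.radians(lat - lat0)
    return (x, y)


# Hypothetical points roughly 10 km apart (illustrative only).
a = (30.60, 114.25)
b = (30.65, 114.33)
lat0, lon0 = 30.6, 114.3  # hypothetical reference point for the projection

d_sphere = haversine_m(*a, *b)
d_plane = math.dist(to_local_plane(*a, lat0, lon0),
                    to_local_plane(*b, lat0, lon0))
rel_diff = abs(d_sphere - d_plane) / d_sphere

print(f"Haversine: {d_sphere:.1f} m, planar: {d_plane:.1f} m, "
      f"relative difference: {rel_diff:.2e}")
```

Under these assumptions the two distances differ by far less than the 50 m extraction noise reported by W, so the choice between them should not affect the analysis.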

Figure 1: (a) The 155 address locations of the people infected in December 2019, and some landmarks that are possible “centers” of the point cloud formed by the address locations. (b) The region of detail marked in (a).

W considers the Huanan Seafood Wholesale Market (hereinafter the “Market”) the “epicenter” of the pandemic, but near the Market there are various landmarks, which we consider as possible “centers” of the point cloud of December cases and mark in Figure 1. These include the Wuhan Center for Disease Control and Prevention (CDC), the Hankou Railway Station, and the Wanda Plaza, which is a shopping area with hotels and restaurants near the Lingjiao Lake and its park; in addition, one of the hotels listed in W is also marked. Figure 2 is a map in the UTM coordinate system that shows the relative sizes of and distances between the Market, the Wuhan CDC, and the Hankou Railway Station. We follow W and use points to represent the Market as well as the landmarks. Because W’s location data of cases already contain noise, the physical sizes of these landmarks can also be regarded as noise in the locations.

Note that we do not hypothesise that any of these landmarks is the origin of the pandemic; we only mark them in the plots as possible “centers” of the point cloud of the 155 cases in our analysis to show that, in the context of statistics, the Market is not more likely to be the origin than the others are; none of them is identified by any data-driven approach. W excluded all landmarks because its authors claimed that no location other than the Huanan market was clearly epidemiologically linked to early COVID-19 cases. In other words, according to W’s approach, if epidemiologists can find epidemiological links between the cases and any of these landmarks, then these landmarks will be equally likely to be the origin of the pandemic. We should also note that the Market is not the only place where one can come into contact with live farmed or wild animals.

Figure 2: The Huanan Seafood Wholesale Market, the Wuhan Center for Disease Control and Prevention, and the Hankou Railway Station in the UTM coordinate system.

2.2 Centroids and center-points

The paper W assumes that an estimate of the “center” of a cloud of cases and the origin from which the infection process started are close, or in the words of its authors: “insofar as the center-point of early cases might reflect the starting point of the epidemic.” This may imply a natural, but perhaps overly simplified, model of radial diffusion in an isotropic medium. We consider modeling the spatial propagation of an epidemic in a city a difficult and interesting problem. Nevertheless, in the present paper we follow the authors of W and consider estimates of the “center” (which can be center-points or centroids) of clouds of cases as points close to the origin.

Since the locations of the cases are not exactly known and since the propagation of the infection is surely not deterministic, we state that simply determining the centroid or the center-point of the cloud of the 155 cases is not sufficient to determine the origin. If we assume that all these 155 cases came from one origin, then any subset of these 155 cases would come from the same origin. We further assume that the origin should be close to the “center” of the point cloud of cases. This opens up the possibility of using resampling to study the variation in the estimates of the location of the actual “center” of the underlying point cloud model from which the sample of these 155 cases came.

We take $N$ resamples, each of size $n$, sampled with or without replacement, from the original 155 cases and determine the centroid and the center-point of each resample. The results for $N = 999$ and $n = 155$, 150, 100, 80 and 50 are shown in Figures 3 and 4. It is evident that the smaller the size of the resamples, the bigger the clouds of centroids and center-points. The clouds of centroids can be understood as the distribution of the estimator of this “center”, but because the center-points are not rotationally invariant, the clouds of center-points can only be used to provide an idea of the size of the variation in the estimator.
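The resampling scheme described above can be sketched as follows. The 155 "cases" here are synthetic points drawn from an anisotropic Gaussian (an assumption made only so that the example runs; they are not W's data), and the helper names `resample_centers` and `spread` are hypothetical. The sketch reports how the cloud of resampled centroids widens as the resample size shrinks.

```python
import math
import random
from statistics import fmean, median


def centroid(points):
    """Coordinate-wise mean of 2-D points."""
    return (fmean(x for x, _ in points), fmean(y for _, y in points))


def center_point(points):
    """Coordinate-wise median of 2-D points."""
    return (median(x for x, _ in points), median(y for _, y in points))


def resample_centers(cases, size, n_resamples, replace, rng):
    """Centroids and center-points of `n_resamples` resamples of given size."""
    centroids, center_points = [], []
    for _ in range(n_resamples):
        if replace:
            sample = rng.choices(cases, k=size)  # sampling with replacement
        else:
            sample = rng.sample(cases, size)     # sampling without replacement
        centroids.append(centroid(sample))
        center_points.append(center_point(sample))
    return centroids, center_points


def spread(cloud):
    """Root-mean-square distance of a cloud of points from its own centroid."""
    cx, cy = centroid(cloud)
    return math.sqrt(fmean((x - cx) ** 2 + (y - cy) ** 2 for x, y in cloud))


rng = random.Random(0)
# Synthetic stand-in for the 155 case locations (metres); NOT the real data.
cases = [(rng.gauss(0.0, 3000.0), rng.gauss(0.0, 1500.0)) for _ in range(155)]

centroid_spread = {}
for size in (150, 100, 80, 50):
    centroids, _ = resample_centers(cases, size, 999, replace=False, rng=rng)
    centroid_spread[size] = spread(centroids)
    print(f"n = {size:3d}: RMS spread of centroids = {centroid_spread[size]:.0f} m")
```

Passing `replace=True` gives the with-replacement variant, which also allows the resample size 155.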

Figure 3 (panels a to f): The centroid (+) and the center-point of all 155 cases, and the centroids and center-points of 999 resamples without replacement of size $n$ in the region of detail.

An elliptical shape can be observed among the clouds of centroids for different resample sizes $n$. Whether the axes of the ellipses have any interpretation may be of epidemiological interest, but proposing epidemic models to explain this observation would exceed the scope of this paper.

A visual inspection of Figure 3 clearly suggests that the Market can hardly be considered a part of the clouds of centroids or center-points unless the resample size $n$ is 50 or less. In the next section we will provide a more quantitative justification for this statement. Even when we sample with replacement to introduce higher variability, as shown in Figure 4, the Market can still hardly be accepted as a part of the clouds of centroids or center-points for the larger resample sizes. In contrast, the Wanda Plaza is clearly located in the central part of the point clouds formed by the centroids, though it is still not a part of the clouds of center-points. If the origin of the pandemic is really close to the “center” of the point cloud of cases, then in the context of statistics, the Wanda Plaza may be more suspect than the Market, which is neither more nor less likely than the other landmarks shown in the figures.

Figure 4 (panels a to f): The centroid (+) and the center-point of all 155 cases, and the centroids and center-points of 999 resamples with replacement of size $n$ in the region of detail.

2.3 The Monte Carlo test

Modern statistics uses Monte Carlo tests to test distributional hypotheses when the classical tests cannot be applied. The authors of W followed this approach and wrote in the main paper: “We [...] investigated whether the December COVID-19 cases were closer to the market than expected based on an empirical null distribution of Wuhan’s population density [...], with its median distance to the Huanan market of 16.11km.”

That is all the main paper says; only the supplementary materials give a little more detail: “To test whether the December cases were closer to the Huanan market than expected, null distributions were generated from the population density data [...]. For each point in each pseudoreplicate the Haversine distance to Huanan was calculated, and the median [...] distance to Huanan was calculated for each pseudoreplicate. The median [...] distance between all the early December cases [...] [was] compared to these null distributions.” Note that W also considers different subsets of these 155 cases, e.g. a subset consisting of 35 cases epidemiologically linked to the Market and a subset consisting of 79 clinically diagnosed cases. Here we use only the point cloud of all 155 cases to illustrate the problems in the methodology of W.

In statistical terms, they generated artificial patterns of 155 cases by simulating an inhomogeneous Poisson process with intensity function proportional to the population density (weighted by age group) of Wuhan city. For each pattern they determined the 155 distances from the Market to the individual cases and computed the median of these numbers. The medians are denoted by $m_i$ with $i = 1, 2, \ldots, N$. The value of $N$ is missing from the peer-reviewed version of W, although it was stated in the preprint version. There is also no further explanation of how the reported $p$-value was obtained, so we naturally assume that the standard Monte Carlo test was adopted. That is, the simulated medians $m_1, \ldots, m_N$ together with the observed median distance $m_0$ for the 155 cases were sorted in increasing order, and the Monte Carlo $p$-value would be the rank of $m_0$ in this series divided by $N + 1$.

W also used this approach to test whether the distance between the center-point of the 155 cases and the Market was shorter than expected under their null model (age-weighted population density data). For this purpose, 1 million points were simulated independently according to the null model, and the distances between the Market and these simulated points were calculated.
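The rank-based Monte Carlo $p$-value assumed in the description above can be written down in a few lines. This is only a sketch of the standard procedure, not W's code; the function name `monte_carlo_p_value` and the toy numbers are hypothetical, and ties are counted against the observed value, which is a conservative convention chosen here.

```python
def monte_carlo_p_value(observed, simulated):
    """Lower-tail Monte Carlo p-value.

    Rank of the observed statistic among the observed and the N simulated
    statistics, sorted in increasing order, divided by N + 1. Simulated
    values tied with the observed one are ranked before it (conservative).
    """
    rank = 1 + sum(1 for value in simulated if value <= observed)
    return rank / (len(simulated) + 1)


# Toy example: an observed median distance of 4.0 (arbitrary units) that is
# smaller than all 999 simulated medians gets the smallest attainable p-value.
simulated_medians = [10.0 + i for i in range(999)]
p_small = monte_carlo_p_value(4.0, simulated_medians)

# An observed value in the middle of the simulated ones is unremarkable.
p_middle = monte_carlo_p_value(3.0, [1.0, 2.0, 4.0, 5.0])

print(p_small, p_middle)
```

With $N$ simulated patterns the smallest attainable $p$-value is $1/(N+1)$, which is why the number of simulations matters when a reported $p$-value is interpreted.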

We have two main objections to their use of the Monte Carlo test; the first is more technical and the second is fundamental.

  • The artificial patterns are hardly typical patterns of residential addresses of early infected people in an epidemic. The former tend to be patterns that cover rather large regions (scattered over the full area of Wuhan City) and hence they are comparatively more sparse than the observed pattern around the Huanan market. The null model does not have any special point that can be considered as the origin of an epidemic. It is not realistic to assume that a pattern of cases of a contagious disease is a pattern of independent points following some density rule, such as (age-weighted or unweighted) population density. The rejection of this null model does not suggest any special role played by the Market in the pandemic.

  • The result of the test can be predicted without any computer work: If one places clusters of 155 points (clusters following the rule of W or even clusters following any realistic stochastic model for an epidemic) randomly in the whole city region of Wuhan, then the probability that a cluster center falls close to the Market is very small, and yet the null hypothesis will always be rejected. If one replaces the Market in the test by some arbitrary landmark in Wuhan City and repeats what is done in W, the same will happen. This demonstrates the problem caused by the lack of a data-driven approach for the identification of the “center”.
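The second objection can be illustrated with a small simulation. Everything in it is an assumption made for illustration: the city is a 40 km square with uniform population density (W used age-weighted population density data), the cases form a Gaussian cluster with a standard deviation of 3 km, and the "landmark" is simply a point 2 km from the cluster center. The function `w_style_p_value` mimics the type of test described above; it is not W's implementation.

```python
import math
import random
from statistics import median

CITY_KM = 40.0  # side length of a hypothetical square city


def uniform_city_point(rng):
    """One point from the null model: uniform density over the square city."""
    return (rng.uniform(0.0, CITY_KM), rng.uniform(0.0, CITY_KM))


def median_distance(landmark, points):
    """Median Euclidean distance from the landmark to the points."""
    return median(math.dist(landmark, p) for p in points)


def w_style_p_value(landmark, cases, n_sim, rng):
    """Lower-tail Monte Carlo p-value for 'cases are closer than expected'."""
    observed = median_distance(landmark, cases)
    count = 0
    for _ in range(n_sim):
        pattern = [uniform_city_point(rng) for _ in cases]
        if median_distance(landmark, pattern) <= observed:
            count += 1
    return (count + 1) / (n_sim + 1)


rng = random.Random(1)
p_values = []
for _ in range(5):
    cx, cy = uniform_city_point(rng)  # arbitrary location of the cluster
    cases = [(rng.gauss(cx, 3.0), rng.gauss(cy, 3.0)) for _ in range(155)]
    landmark = (cx + 2.0, cy)         # any landmark near the cluster
    p_values.append(w_style_p_value(landmark, cases, 199, rng))

print(p_values)
```

Under these assumptions every arbitrarily placed landmark yields a very small $p$-value, so a rejection of this null model says nothing specific about the landmark that was tested.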

Thus, unfortunately the testing procedure in W must be considered as nonsense. It does not support the zoonosis hypothesis.

To give the Market a role in the hypotheses, we consider the null hypothesis that the Market is the “center” of the 155 cases; as discussed in the previous section, it would then also be the “center” of any subset of the 155 cases. Here, we present the testing procedure using the centroid to estimate the “center” of a point cloud. The null hypothesis is rejected if the distance $d_0$ from the centroid (represented by “+” in Figures 3 and 4) of the 155 cases to the Market is significantly longer than the distances $d_i$, where $i = 1, 2, \ldots, N$, from “+” to the centroids of $N$ replicates of random subsets of size $n$ resampled from these 155 cases. The same testing procedure can be applied to the center-points. The Monte Carlo $p$-values, defined as the rank of $d_0$ in the decreasing series formed by $d_0, d_1, \ldots, d_N$ divided by $N + 1$, where $N = 999$, corresponding to the simulations shown in Figures 3 and 4 are given in Table 1.

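| Sampling | “center” | $n = 155$ | $n = 150$ | $n = 100$ | $n = 80$ | $n = 50$ |
|---|---|---|---|---|---|---|
| without replacement | centroid | | 0.001 | 0.001 | 0.016 | 0.130 |
| without replacement | center-point | | 0.001 | 0.002 | 0.006 | 0.081 |
| with replacement | centroid | 0.030 | 0.024 | 0.060 | 0.122 | 0.223 |
| with replacement | center-point | 0.008 | 0.009 | 0.043 | 0.068 | 0.182 |

Table 1: Monte Carlo $p$-values for testing whether the Market is the “center”, for resample sizes $n$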

For $n \geq 80$, it is clear that under sampling without replacement, the null hypothesis has to be rejected in all scenarios considered. Even though the Monte Carlo $p$-values for $n = 50$ are larger than 0.05 (and so are some of the $p$-values under sampling with replacement), the elliptical shape of the point clouds formed by the centroids of subsets suggests that using the Euclidean distance as the test statistic, which implicitly assumes a spherical shape, may lower the power of the test, and the failure to reject the null hypothesis is likely a type II error. The small $p$-values for large $n$, together with a visual inspection of Figures 3 and 4, indicate that the Market cannot be accepted as the “center” of the 155 cases.
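The resampling test of this section can be sketched as follows, using the centroid as the estimate of the “center”. The synthetic cases, the candidate location 5 km away and the function name `centroid_test_p_value` are assumptions for illustration only; with the real data the candidate would be the Market. Ties are counted against the null hypothesis being rejected, a conservative convention chosen here.

```python
import math
import random
from statistics import fmean


def centroid(points):
    """Coordinate-wise mean of 2-D points."""
    return (fmean(x for x, _ in points), fmean(y for _, y in points))


def centroid_test_p_value(cases, candidate, size, n_resamples, replace, rng):
    """Monte Carlo p-value for H0: `candidate` is the 'center' of the cases.

    d0 is the distance from the centroid of all cases to the candidate;
    d_i are the distances from that centroid to the centroids of resamples.
    The p-value is the rank of d0 in the decreasing series d0, d1, ..., dN,
    divided by N + 1.
    """
    full_centroid = centroid(cases)
    d0 = math.dist(full_centroid, candidate)
    count = 0
    for _ in range(n_resamples):
        if replace:
            sample = rng.choices(cases, k=size)
        else:
            sample = rng.sample(cases, size)
        if math.dist(full_centroid, centroid(sample)) >= d0:
            count += 1
    return (count + 1) / (n_resamples + 1)


rng = random.Random(2)
# Synthetic stand-in for the 155 case locations (metres); NOT the real data.
cases = [(rng.gauss(0.0, 3000.0), rng.gauss(0.0, 1500.0)) for _ in range(155)]

far_candidate = (5000.0, 0.0)   # hypothetical landmark 5 km from the cluster
p_far = centroid_test_p_value(cases, far_candidate, 100, 999, True, rng)

at_centroid = centroid(cases)   # a candidate sitting exactly at the centroid
p_center = centroid_test_p_value(cases, at_centroid, 100, 999, True, rng)

print(p_far, p_center)
```

Replacing `centroid` by a coordinate-wise median would give the corresponding test for center-points.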

3 Conclusions

We come to a clear conclusion: The analysis in the paper W does not give a proof of the centrality of the Market in the 155 December cases. On the contrary, our analysis suggests that the Market is in fact unlikely to be the spatial “center” of the cases. However, neither W’s nor our statistical analysis can be used to support or reject the zoonosis hypothesis.

Epidemic models that relate centrality to causality and describe how the infection process evolves from an origin or from multiple origins have yet to be developed. Data-driven approaches to identify the number of cluster centers and their locations for a point cloud of cases of a contagious disease should be adopted. These models and approaches should be validated with the data of the outbreak not only in one particular city but also in other cities. These challenging research investigations are left for future endeavours.

We regret to say that the important question of the origin of the COVID-19 pandemic remains open.


References

  • Dablander, F., Hinne, M., 2019. Node centrality measures are a poor substitute for causal inference. Scientific Reports 9, 6846. DOI:10.1038/s41598-019-43033-9.
  • Holmes, E.C., Goldstein, S.A., Rasmussen, A.L., Robertson, D.L., Crits-Christoph, A., Wertheim, J.O., Anthony, S.J., Barclay, W.S., Boni, M.F., Doherty, P.C., Farrar, J., Geoghegan, J.L., Jiang, X., Leibowitz, J.L., Neil, S.J., Skern, T., Weiss, S.R., Worobey, M., Andersen, K.G., Garry, R.F., Rambaut, A., 2021. The origins of SARS-CoV-2: A critical review. Cell 184, 4848–4856. DOI:10.1016/j.cell.2021.08.017.
  • Walker, D., 2022a. [Twitter] 29 July. (Accessed: 14 August 2022).
  • Walker, D., 2022b. [Twitter] 29 July. (Accessed: 14 August 2022).
  • World Health Organization, 2021. WHO-convened global study of origins of SARS-CoV-2: China part.
  • Worobey, M., Levy, J.I., Serrano, L.M., Crits-Christoph, A., Pekar, J.E., Goldstein, S.A., Rasmussen, A.L., Kraemer, M.U.G., Newman, C., Koopmans, M.P.G., Suchard, M.A., Wertheim, J.O., Lemey, P., Robertson, D.L., Garry, R.F., Holmes, E.C., Rambaut, A., Andersen, K.G., 2022. The Huanan Seafood Wholesale Market in Wuhan was the early epicenter of the COVID-19 pandemic. Science, abp8715. DOI:10.1126/science.abp8715.