Extracting the main trend in a dataset: the Sequencer algorithm

06/24/2020
by   Dalya Baron, et al.
0

Scientists aim to extract simplicity from observations of the complex world. An important component of this process is the exploration of data in search of trends. In practice, however, this tends to be more of an art than a science. Among all trends existing in the natural world, one-dimensional trends, often called sequences, are of particular interest as they provide insights into simple phenomena. However, some are challenging to detect as they may be expressed in complex manners. We present the Sequencer, an algorithm designed to generically identify the main trend in a dataset. It does so by constructing graphs describing the similarities between pairs of observations, computed with a set of metrics and scales. Using the fact that continuous trends lead to more elongated graphs, the algorithm can identify which aspects of the data are relevant in establishing a global sequence. Such an approach can be used beyond the proposed algorithm and can optimize the parameters of any dimensionality reduction technique. We demonstrate the power of the Sequencer using real-world data from astronomy, geology as well as images from the natural world. We show that, in a number of cases, it outperforms the popular t-SNE and UMAP dimensionality reduction techniques. This approach to exploratory data analysis, which does not rely on training nor tuning of any parameter, has the potential to enable discoveries in a wide range of scientific domains. The source code is available on github and we provide an online interface at <http://sequencer.org>.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset