Toward a generic representation of random variables for machine learning

by Gautier Marti, et al.
Hellebore Capital Ltd

This paper presents a pre-processing step and a distance which improve the performance of machine learning algorithms working on independent and identically distributed stochastic processes. We introduce a novel non-parametric approach to represent random variables which splits apart dependency and distribution without losing any information. We also propose an associated metric leveraging this representation, together with its statistical estimate. Besides experiments on synthetic datasets, the benefits of our contribution are illustrated through the example of clustering financial time series, for instance prices from the credit default swaps market. Results are available on the website and an IPython Notebook tutorial is available for reproducible research.
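To make the idea of splitting dependency from distribution concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give formulas, so the rank transform used for the dependence part, the histogram used for the distribution part, the scaling factor 3, the Hellinger-type term, the mixing weight `theta`, and the function names `split_representation` and `combined_distance` are all choices made here for illustration.

```python
import numpy as np


def split_representation(x, bins=10, value_range=(-5.0, 5.0)):
    """Split a sample into a dependence part and a distribution part.

    Assumption: ranks stand in for the dependence information and a
    histogram on a shared grid stands in for the marginal distribution.
    Assumes at least one observation falls inside value_range.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Dependence part: normalized ranks in (0, 1], i.e. the empirical
    # CDF of the sample evaluated at each observation.
    ranks = (np.argsort(np.argsort(x)) + 1) / n
    # Distribution part: histogram normalized to sum to one.
    counts, _ = np.histogram(x, bins=bins, range=value_range)
    hist = counts / counts.sum()
    return ranks, hist


def combined_distance(x, y, theta=0.5, bins=10, value_range=(-5.0, 5.0)):
    """Mix a rank-based term and a histogram-based term (illustrative)."""
    rx, hx = split_representation(x, bins, value_range)
    ry, hy = split_representation(y, bins, value_range)
    # Dependence term: scaled mean squared difference of normalized ranks.
    d_dep2 = 3.0 * np.mean((rx - ry) ** 2)
    # Distribution term: squared Hellinger distance between histograms.
    d_dist2 = 0.5 * np.sum((np.sqrt(hx) - np.sqrt(hy)) ** 2)
    return np.sqrt(theta * d_dep2 + (1.0 - theta) * d_dist2)


rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.5, size=100)
print(combined_distance(x, y, theta=0.5))
```

With `theta=1.0` only the rank term is used, so any strictly increasing transformation of a sample leaves the distance unchanged; with `theta=0.0` only the histograms are compared, so the order of the observations is ignored.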


Optimal Transport vs. Fisher-Rao distance between Copulas for Clustering Multivariate Time Series

We present a methodology for clustering N objects which are described by...

Clustering Financial Time Series: How Long is Enough?

Researchers have used from 30 days to several years of daily returns as ...

Some stochastic comparison results for frailty and resilience models

Frailty and resilience models provide a way to introduce random effects ...

Independence clustering (without a matrix)

The independence clustering problem is considered in the following formu...

Time Series Clustering With Random Convolutional Kernels

Time series can describe a wide range of natural and social phenomena. A...

Robust Statistical Comparison of Random Variables with Locally Varying Scale of Measurement

Spaces with locally varying scale of measurement, like multidimensional ...
