Fitting Semiparametric Cumulative Probability Models for Big Data

07/13/2022
by Chun Li, et al.

Cumulative probability models (CPMs) are a robust alternative to linear models for continuous outcomes. However, they are not feasible for very large datasets because of their long running time and high memory usage, both of which depend on the sample size, the number of predictors, and the number of distinct outcomes. We describe three approaches to address this problem. In the divide-and-combine approach, we divide the data into subsets, fit a CPM to each subset, and then aggregate the information. In the binning and rounding approaches, the outcome variable is redefined to have a greatly reduced number of distinct values. We consider rounding to a decimal place and rounding to significant digits, both with a refinement step to help achieve the desired number of distinct outcomes. We show with simulations that these approaches perform well and that their parameter estimates are consistent. We investigate how running time and peak memory usage are influenced by the sample size, the number of distinct outcomes, and the number of predictors. As an illustration, we apply the approaches to a large publicly available dataset on matrix multiplication runtime, which contains nearly one million observations.
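
To make the rounding idea concrete, the following is a minimal Python sketch of how rounding to a decimal place or to significant digits reduces the number of distinct outcome values. It is an illustration only, not the authors' implementation: the function names (`round_decimal`, `round_signif`, `distinct_count`) and the synthetic `outcomes` data are assumptions introduced here, and the refinement step mentioned in the abstract is not shown.

```python
import math


def round_decimal(y, places):
    """Round y to a fixed number of decimal places."""
    return round(y, places)


def round_signif(y, digits):
    """Round y to a fixed number of significant digits."""
    if y == 0:
        return 0.0
    # Position of the leading digit, e.g. 5 for 123456.0 and -3 for 0.0012.
    exponent = math.floor(math.log10(abs(y)))
    return round(y, digits - 1 - exponent)


def distinct_count(values):
    """Number of distinct values in a collection of outcomes."""
    return len(set(values))


# Hypothetical outcome data spanning several orders of magnitude.
outcomes = [0.001 * k * 1.2345 for k in range(1, 10001)]

by_decimal = [round_decimal(y, 1) for y in outcomes]
by_signif = [round_signif(y, 2) for y in outcomes]

print("original distinct values:  ", distinct_count(outcomes))
print("1 decimal place:           ", distinct_count(by_decimal))
print("2 significant digits:      ", distinct_count(by_signif))
```

Rounding to a decimal place applies the same absolute resolution to every value, whereas rounding to significant digits keeps the same relative resolution, which may suit outcomes that span several orders of magnitude.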

research
06/29/2022

Asymptotic Properties for Cumulative Probability Models for Continuous Outcomes

Regression models for continuous outcomes often require a transformation...
research
04/30/2015

Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication

We describe a matrix multiplication recognition algorithm for a subset o...
research
02/06/2020

Solving Tall Dense Linear Programs in Nearly Linear Time

In this paper we provide an Õ(nd+d^3) time randomized algorithm for solv...
research
08/14/2020

How little data do we need for patient-level prediction?

Objective: Provide guidance on sample size considerations for developing...
research
09/20/2022

Effects of Influential Points and Sample Size on the Selection and Replicability of Multivariable Fractional Polynomial Models

The multivariable fractional polynomial (MFP) procedure combines variabl...
research
08/17/2018

Fitting Probabilistic Index Models on Large Datasets

Recently, Thas et al. (2012) introduced a new statistical model for the ...
research
05/17/2018

Covariance-Insured Screening

Modern bio-technologies have produced a vast amount of high-throughput d...
