Real-Time Regression Analysis of Streaming Clustered Data With Possible Abnormal Data Batches

06/30/2021
by Lan Luo, et al.

This paper develops an incremental learning algorithm based on the quadratic inference function (QIF) to analyze streaming datasets with correlated outcomes, such as longitudinal data and clustered data. We propose a renewable QIF (RenewQIF) method within a paradigm of renewable estimation and incremental inference, in which parameter estimates are recursively renewed using the current data batch and summary statistics of historical data, without any use of historical subject-level raw data. We compare our renewable estimation method with both the offline QIF and the offline generalized estimating equations (GEE) approaches, which process the entire cumulative subject-level data, and show theoretically and numerically that our renewable procedure enjoys statistical and computational efficiency. We also propose an approach to check the homogeneity assumption of the regression coefficients via a sequential goodness-of-fit test, which serves as a screening procedure for occurrences of abnormal data batches. We implement the proposed methodology by extending Spark's existing Lambda architecture to support statistical inference and data quality diagnosis. We illustrate the proposed methodology through extensive simulation studies and an analysis of streaming car crash datasets from the National Automotive Sampling System-Crashworthiness Data System (NASS CDS).
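The renewable idea in the abstract (updating an estimate from the current batch plus summary statistics, with no access to past raw data) can be sketched on a much simpler problem. The example below is not RenewQIF: it swaps in ordinary least squares with independent observations, where the summary statistics are just the running sums X'X and X'y. The class name `RenewableLeastSquares`, the helper `solve`, and the toy batches are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional

Matrix = List[List[float]]
Vector = List[float]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes `a` is square and non-singular (hypothetical helper).
    """
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


class RenewableLeastSquares:
    """Simplified stand-in for renewable estimation (not the paper's QIF).

    Only the p x p matrix X'X and the p-vector X'y are kept between
    batches; raw rows are discarded after each update.
    """

    def __init__(self, p: int) -> None:
        self.xtx: Matrix = [[0.0] * p for _ in range(p)]
        self.xty: Vector = [0.0] * p
        self.beta: Optional[Vector] = None

    def update(self, X: Matrix, y: Vector) -> Vector:
        p = len(self.xty)
        # Fold the current batch into the running summary statistics.
        for row, yi in zip(X, y):
            for j in range(p):
                self.xty[j] += row[j] * yi
                for k in range(p):
                    self.xtx[j][k] += row[j] * row[k]
        # Renew the estimate from the summaries alone.
        self.beta = solve(self.xtx, self.xty)
        return self.beta


# Toy data (made up): intercept column plus one covariate, two batches.
batches = [
    ([[1.0, 0.0], [1.0, 1.0]], [1.1, 2.9]),
    ([[1.0, 2.0], [1.0, 3.0]], [5.2, 6.8]),
]

stream = RenewableLeastSquares(p=2)
for X_b, y_b in batches:
    beta_stream = stream.update(X_b, y_b)

# Offline comparison: fit once on the pooled raw data.
all_X = [row for X_b, _ in batches for row in X_b]
all_y = [v for _, y_b in batches for v in y_b]
beta_offline = RenewableLeastSquares(p=2).update(all_X, all_y)
```

In this linear special case the streamed and offline estimates coincide exactly, because the summaries are sufficient. For the correlated, possibly nonlinear models treated in the paper, the update is more involved and the agreement with the offline estimator is the subject of the paper's theoretical and numerical results.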


research
08/10/2021

Statistical Inference in High-dimensional Generalized Linear Models with Streaming Data

In this paper we develop an online statistical inference approach for hi...
research
08/04/2022

Statistical Inference for Streamed Longitudinal Data

Modern longitudinal data, for example from wearable devices, measures bi...
research
03/15/2016

Bias Correction for Regularized Regression and its Application in Learning with Streaming Data

We propose an approach to reduce the bias of ridge regression and regula...
research
06/10/2021

Online Debiased Lasso

We propose an online debiased lasso (ODL) method for statistical inferen...
research
11/30/2020

Joint integrative analysis of multiple data sources with correlated vector outcomes

We propose a distributed quadratic inference function framework to joint...
research
10/03/2022

Inference on High-dimensional Single-index Models with Streaming Data

Traditional statistical methods are faced with new challenges due to str...
research
07/26/2022

Functional Regression with Intensively Measured Longitudinal Outcomes: A New Lens through Data Partitioning

Modern longitudinal data from wearable devices consist of biological sig...
