Controlling the False Split Rate in Tree-Based Aggregation

08/11/2021
by Simeng Shao, et al.
In many domains, data measurements can naturally be associated with the leaves of a tree, expressing the relationships among these measurements. For example, companies belong to industries, which in turn belong to ever coarser divisions such as sectors; microbes are commonly arranged in a taxonomic hierarchy from species to kingdoms; street blocks belong to neighborhoods, which in turn belong to larger-scale regions. The problem of tree-based aggregation that we consider in this paper asks which of these tree-defined subgroups of leaves should really be treated as a single entity and which of these entities should be distinguished from each other. We introduce the "false split rate", an error measure that describes the degree to which subgroups have been split when they should not have been. We then propose a multiple hypothesis testing algorithm for tree-based aggregation, which we prove controls this error measure. We focus on two main examples of tree-based aggregation, one which involves aggregating means and the other which involves aggregating regression coefficients. We apply this methodology to aggregate stocks based on their volatility and to aggregate neighborhoods of New York City based on taxi fares.
