Controlling the False Split Rate in Tree-Based Aggregation

by Simeng Shao, et al.

In many domains, data measurements can naturally be associated with the leaves of a tree that expresses the relationships among these measurements. For example, companies belong to industries, which in turn belong to ever coarser divisions such as sectors; microbes are commonly arranged in a taxonomic hierarchy from species to kingdoms; street blocks belong to neighborhoods, which in turn belong to larger-scale regions. The problem of tree-based aggregation that we consider in this paper asks which of these tree-defined subgroups of leaves should really be treated as a single entity and which of these entities should be distinguished from each other. We introduce the "false split rate", an error measure that describes the degree to which subgroups have been split when they should not have been. We then propose a multiple hypothesis testing algorithm for tree-based aggregation, which we prove controls this error measure. We focus on two main examples of tree-based aggregation, one that involves aggregating means and another that involves aggregating regression coefficients. We apply this methodology to aggregate stocks based on their volatility and to aggregate neighborhoods of New York City based on taxi fares.
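To make the setting concrete, the following is a minimal sketch of how a set of split decisions on a tree induces groups of leaves, together with a simple count of splits made inside a group that is truly homogeneous. Everything in it is an assumption for illustration: the toy hierarchy, the names `aggregate` and `false_split_fraction`, and the fraction it computes, which is only a rough proxy and not the formal false split rate defined in the paper.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy: internal node -> children. Leaves have no entry.
TREE: Dict[str, List[str]] = {
    "market": ["tech", "energy"],
    "tech": ["software", "hardware"],
    "software": ["A", "B"],
    "hardware": ["C", "D"],
    "energy": ["E", "F"],
}


def leaves(tree: Dict[str, List[str]], node: str) -> List[str]:
    """Return the leaves below `node`, in left-to-right order."""
    children = tree.get(node)
    if not children:
        return [node]
    out: List[str] = []
    for child in children:
        out.extend(leaves(tree, child))
    return out


def aggregate(tree: Dict[str, List[str]], node: str, split_nodes: Set[str]) -> List[List[str]]:
    """Descend from `node`; a node in `split_nodes` is divided into its
    children, any other node is kept as a single group of leaves."""
    if node in split_nodes and node in tree:
        groups: List[List[str]] = []
        for child in tree[node]:
            groups.extend(aggregate(tree, child, split_nodes))
        return groups
    return [leaves(tree, node)]


def false_split_fraction(tree: Dict[str, List[str]], root: str,
                         split_nodes: Set[str], true_group: Dict[str, str]) -> float:
    """Illustrative proxy (NOT the paper's definition): among the splits
    actually reached from the root, the fraction made at nodes whose
    leaves all belong to the same true group."""
    reached: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in split_nodes and node in tree:
            reached.append(node)
            stack.extend(tree[node])
    if not reached:
        return 0.0
    false = [n for n in reached
             if len({true_group[leaf] for leaf in leaves(tree, n)}) == 1]
    return len(false) / len(reached)


# Assumed ground truth: all tech companies form one entity; E and F differ.
TRUE_GROUP = {"A": "g1", "B": "g1", "C": "g1", "D": "g1", "E": "g2", "F": "g3"}
SPLITS = {"market", "tech", "energy"}

groups = aggregate(TREE, "market", SPLITS)
frac = false_split_fraction(TREE, "market", SPLITS, TRUE_GROUP)
print(groups)
print(frac)
```

In this toy run, the split at "tech" separates leaves that share one true group, so it counts as a false split under the proxy, while the splits at "market" and "energy" do not.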





