OutRank: Speeding up AutoML-based Model Search for Large Sparse Data sets with Cardinality-aware Feature Ranking

09/04/2023
by Blaž Škrlj, et al.

The design of modern recommender systems relies on understanding which parts of the feature space are relevant for solving a given recommendation task. However, real-world data sets in this domain are often large, sparse, and noisy, which makes it challenging to identify meaningful signals. Feature ranking is an efficient family of algorithms that can help address these challenges by identifying the most informative features and facilitating the automated search for more compact and better-performing models (AutoML). We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection. OutRank was built with categorical data in mind and uses a variant of mutual information that is normalized with regard to the noise produced by features of the same cardinality. We further extend this measure by incorporating information on feature similarity and combined relevance. The feasibility of the proposed approach is demonstrated by speeding up a state-of-the-art AutoML system on a synthetic data set with no performance loss. Furthermore, on a real-life click-through-rate prediction data set, OutRank outperformed strong baselines such as random forest-based approaches. The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches, enabling faster search for better models on off-the-shelf hardware.
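
To make the idea of cardinality-aware normalization concrete, here is a minimal sketch of one way it could look. It is not OutRank's implementation. It assumes that the "noise produced by features of the same cardinality" can be approximated by the average mutual information between the target and uniformly random features with the same number of distinct values, and that this baseline is subtracted from the raw score. The function names `mutual_information` and `noise_adjusted_mi`, and the parameters `n_noise` and `seed`, are hypothetical.

```python
import numpy as np


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete 1-D arrays."""
    # Map arbitrary category labels to contiguous integer codes.
    _, x_idx = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, y_idx = np.unique(np.asarray(y).ravel(), return_inverse=True)

    # Build the joint distribution from co-occurrence counts.
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()

    # Marginals and their outer product (the independence baseline).
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    expected = px @ py

    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / expected[nonzero])))


def noise_adjusted_mi(feature, target, n_noise=10, seed=0):
    """Raw MI minus the mean MI of random features with the same cardinality.

    Assumption for illustration only: the noise baseline is estimated from
    uniformly random categorical features; the paper may define it differently.
    """
    rng = np.random.default_rng(seed)
    feature = np.asarray(feature).ravel()
    cardinality = len(np.unique(feature))

    raw = mutual_information(feature, target)
    noise = np.mean([
        mutual_information(rng.integers(0, cardinality, size=len(feature)), target)
        for _ in range(n_noise)
    ])
    return raw - noise
```

Under these assumptions, a high-cardinality identifier-like column that gets a clearly positive raw mutual information score purely from finite-sample effects ends up near zero after the adjustment, while a genuinely informative low-cardinality feature keeps most of its score.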

research
06/25/2011

The All Relevant Feature Selection using Random Forest

In this paper we examine the application of the random forest classifier...
research
01/05/2017

Exploration of Proximity Heuristics in Length Normalization

Ranking functions used in information retrieval are primarily used in th...
research
09/04/2023

Drifter: Efficient Online Feature Monitoring for Improved Data Integrity in Large-Scale Recommendation Systems

Real-world production systems often grapple with maintaining data qualit...
research
01/30/2020

TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions

The identification of relevant features, i.e., the driving variables tha...
research
05/26/2023

Mitigating Exploitation Bias in Learning to Rank with an Uncertainty-aware Empirical Bayes Approach

Ranking is at the core of many artificial intelligence (AI) applications...
research
01/23/2021

ReliefE: Feature Ranking in High-dimensional Spaces via Manifold Embeddings

Feature ranking has been widely adopted in machine learning applications...
research
04/30/2021

Ranking the information content of distance measures

Real-world data typically contain a large number of features that are of...
