Correspondence Factor Analysis of Big Data Sets: A Case Study of 30 Million Words; and Contrasting Analytics using Apache Solr and Correspondence Analysis in R

07/06/2015
by Fionn Murtagh, et al.

We consider a large collection of text data sets, consisting of cooking recipes. We investigate term distributions and other distributional properties of the data. Our aim is to examine analytical approaches that allow information to be mined at both high and low levels of detail. Metric space embedding is fundamental to our interest in the semantic properties of these data. We consider the projection of all the data into analyses of aggregated versions of the data, and contrast this with the projection of aggregated versions of the data into analyses of all the data. Analogously for the term set, we look at analysis of selected terms. We also look at inherent term associations, such as those between singular and plural forms. In addition to using Correspondence Analysis in R for latent semantic space mapping, we use Apache Solr, and we describe setting up the Solr server and carrying out querying. A further novelty is that querying in Solr is supported on the basis of the principal factor plane mapping of all the data, using a bounding box query based on factor projections.
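The bounding box query on factor projections can be illustrated with a minimal sketch. It assumes that each indexed document stores its coordinates on the first two factors in two numeric Solr fields, and it expresses the rectangle as two standard Solr range filters. The field names `factor1` and `factor2` and the helper `bounding_box_params` are hypothetical, and the abstract does not state which Solr query mechanism the authors actually use, so this is one plausible formulation rather than their method.

```python
from urllib.parse import urlencode


def bounding_box_params(f1_range, f2_range,
                        f1_field="factor1", f2_field="factor2"):
    """Build Solr request parameters for a rectangle in the factor plane.

    f1_range and f2_range are (low, high) pairs of factor coordinates.
    The field names are hypothetical; they stand for whatever numeric
    fields hold each document's factor projections in the index.
    """
    (f1_lo, f1_hi), (f2_lo, f2_hi) = f1_range, f2_range
    if f1_lo > f1_hi or f2_lo > f2_hi:
        raise ValueError("each range must be given as (low, high)")
    return [
        ("q", "*:*"),                                # match all documents
        ("fq", f"{f1_field}:[{f1_lo} TO {f1_hi}]"),  # inclusive range on factor 1
        ("fq", f"{f2_field}:[{f2_lo} TO {f2_hi}]"),  # inclusive range on factor 2
    ]


# Example: documents projected into a box around part of the factor plane.
params = bounding_box_params((-0.25, 0.5), (0.0, 1.0))
query_string = urlencode(params)
print(query_string)
```

The resulting query string would be appended to the select handler URL of a Solr core. Using filter queries (`fq`) keeps the geometric restriction separate from any text query placed in `q`.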

research
12/13/2015

Big Data Scaling through Metric Mapping: Exploiting the Remarkable Simplicity of Very High Dimensional Spaces using Correspondence Analysis

We present new findings in regard to data analysis in very high dimensio...
research
02/23/2017

A Unified Parallel Algorithm for Regularized Group PLS Scalable to Big Data

Partial Least Squares (PLS) methods have been heavily exploited to analy...
research
01/03/2023

Notes on Correspondence Analysis of Power Transformed Data Sets

We prospect for a clear simple picture on CA of power transformed or the...
research
02/27/2019

Linear Time Visualization and Search in Big Data using Pixellated Factor Space Mapping

It is demonstrated how linear computational time and storage efficient a...
research
11/08/2021

Extension of Correspondence Analysis to multiway data-sets through High Order SVD: a geometric framework

This paper presents an extension of Correspondence Analysis (CA) to tens...
research
04/30/2021

Latent Factor Decomposition Model: Applications for Questionnaire Data

The analysis of clinical questionnaire data comes with many inherent cha...
research
07/13/2020

Assessing the behavior and performance of a supervised term-weighting technique for topic-based retrieval

This article analyses and evaluates FDDβ, a supervised term-weightin...
