A computational theoretical approach for mining data on transient events from databases of high energy astrophysics experiments

04/08/2020
by   Francesco Lazzarotto, et al.

Data on transient events, such as GRBs, are often contained in large databases of unstructured data from space experiments, merged with a potentially large amount of background or simply undesired information. We present a formal computational model for applying techniques of modern computer science, such as Data Mining (DM) and Knowledge Discovery in Databases (KDD), to a generic, large database derived from a high energy astrophysics experiment. The method aims to search for, identify and extract expected information, and possibly to discover unexpected information.

1 Introduction

Giant archives of data resources are available to researchers in astronomy, and several contain data not yet published on the web (Schade et al. 2000). Two tasks are now urgent to improve knowledge extraction from our astrophysical archives:

  1. making data from different archives available and organizing them in a common and efficient way;

  2. adopting the appropriate techniques to extract relations in data, to find new instances of known phenomena and to discover unknown phenomena.

These two goals can be supported by several studies in modern computer science disciplines that come under the names of KDD and DM.

2 Knowledge Discovery in Database (KDD) & Data Mining (DM)

Definition

The nontrivial extraction of implicit, previously unknown, and potentially useful information from data
Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. J. (1992)

Some authors also refer to the whole KDD process as Data Mining (DM), but more often DM denotes the analysis and representation of the data after they have been preprocessed, cleaned and organized. Our work explains how to apply KDD and DM techniques in order to improve astrophysical studies of GRBs and, more generally, of high energy astrophysical transient events. We show:

  • the method for developing an integrated and suitable high energy astrophysical archive, which stores both photon data and scientific results completely and efficiently in a highly cross-referenced database that supports ad-hoc querying by users;

  • how a formal model permits us to launch efficient explorations in order to divide the archive into event classes and to extract high level scientific results in an efficient and compact way.

Figure 1: The process of Knowledge Discovery

3 The implementation of the KDD system

3.1 Data Warehouse

Before Data Mining applications are run, data (preferably from different databases) must undergo preprocessing, cleaning and filtering operations, and be stored in a standard format in an integrated database system, also called a Data Repository.
The whole process is known as Data Warehousing.

Figure 2: Data Warehousing step

3.2 The data organization

To retrieve information efficiently from GRB data sets, we need tools that allow simple queries to be submitted in a formal or natural language; the minimum requirement is a Database Management System (DBMS). SQL (Structured Query Language) is suitable for this purpose, but for very large databases it would be better to use the OLAP (On-Line Analytical Processing) approach, based on an integrated hierarchical and multidimensional representation of data.
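As a minimal sketch of the kind of querying discussed here, the following Python example uses the standard `sqlite3` module as a stand-in DBMS. The `events` table, its columns and the sample rows (including the instrument labels) are hypothetical and serve only as illustration; the grouped aggregation at the end is a simple approximation of an OLAP-style summary over two dimensions, not a full OLAP engine.

```python
import sqlite3


def build_demo_repository():
    """Create an in-memory repository with a hypothetical 'events' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        " event_id INTEGER PRIMARY KEY,"
        " instrument TEXT,"
        " duration_s REAL,"
        " duration_class TEXT,"
        " peak_flux REAL)"
    )
    # Illustrative rows only; these are not real measurements.
    rows = [
        (1, "GRBM", 0.4, "short", 12.0),
        (2, "GRBM", 25.0, "long", 3.5),
        (3, "GRBM", 60.0, "long", 8.0),
        (4, "WFC", 1.1, "short", 2.0),
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def long_bright_events(conn, min_flux):
    """Ad-hoc query: identifiers of long events above a flux threshold."""
    cur = conn.execute(
        "SELECT event_id FROM events"
        " WHERE duration_class = 'long' AND peak_flux >= ?"
        " ORDER BY event_id",
        (min_flux,),
    )
    return [row[0] for row in cur.fetchall()]


def summary_by_instrument_and_class(conn):
    """Count events and average the flux along two dimensions."""
    cur = conn.execute(
        "SELECT instrument, duration_class, COUNT(*), AVG(peak_flux)"
        " FROM events"
        " GROUP BY instrument, duration_class"
        " ORDER BY instrument, duration_class"
    )
    return cur.fetchall()
```

A call such as `summary_by_instrument_and_class(build_demo_repository())` returns one row per (instrument, duration class) pair, which is the sort of compact view a multidimensional representation is meant to provide.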

4 Mining GRB among astrophysics databases

Preliminary studies (Feroci et al. 1999; Lazzarotto 2001) highlighted that the on-ground database of the SAX GRBM detector is segmented into different classes of events.

Figure 3: Classes of signals

4.1 Preprocessing and cleaning data

Noisy and corrupted records make up a large part of the data. During extensive analysis of the IASF/CNR GRBM archive since 1999, many causes of errors in the data were detected and corrected, such as:

  • space acquisition and transmission errors;

  • on-ground preprocessing errors;

  • local software errors.
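A cleaning pass for errors like those listed above could look like the following sketch. The record layout (dictionaries with `time` and `counts` keys) and the specific checks are assumptions chosen for illustration; they are not the actual procedure applied to the GRBM archive.

```python
import math


def clean_light_curve(records):
    """Drop records that fail basic sanity checks.

    A record is kept only if its time and counts are finite numbers,
    the counts are non-negative, and the time is strictly later than
    the time of the last accepted record.
    Returns (cleaned_records, number_of_rejected_records).
    """
    cleaned = []
    rejected = 0
    last_time = -math.inf
    for rec in records:
        t = rec.get("time")
        c = rec.get("counts")
        ok = (
            isinstance(t, (int, float))
            and isinstance(c, (int, float))
            and math.isfinite(t)
            and math.isfinite(c)
            and c >= 0
            and t > last_time
        )
        if ok:
            cleaned.append({"time": float(t), "counts": float(c)})
            last_time = t
        else:
            rejected += 1
    return cleaned, rejected
```

Counting the rejected records, rather than silently discarding them, makes it possible to monitor how much of an archive is affected by each cleaning rule.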

4.2 Fuzzy logic, Clustering and Pattern recognition

In the early analysis (Lazzarotto 2001) we used a self-implemented object-oriented database, which supported only the queries and global statistical operations we considered significant. We now intend to implement a standard KDD system based on a DW in order to apply more efficient DM techniques. In past work, we used fuzzy logic to perform pattern recognition and to discriminate between different kinds of presumably known signals.
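To illustrate how fuzzy logic can discriminate between kinds of signals, here is a minimal sketch with linear membership functions and the usual fuzzy AND (minimum). The two attributes, the breakpoints (1 s and 3 s for duration, 0.5 and 1.5 for hardness) and the two rule names are hypothetical choices made for this example; they do not reproduce the rules used in the earlier work.

```python
def ramp_down(x, lo, hi):
    """Membership that is 1 below lo, 0 above hi, and linear in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def ramp_up(x, lo, hi):
    """Complement of ramp_down: 0 below lo, 1 above hi."""
    return 1.0 - ramp_down(x, lo, hi)


def classify_signal(duration_s, hardness):
    """Return fuzzy degrees for two hypothetical signal classes."""
    short = ramp_down(duration_s, 1.0, 3.0)
    long_ = ramp_up(duration_s, 1.0, 3.0)
    soft = ramp_down(hardness, 0.5, 1.5)
    hard = ramp_up(hardness, 0.5, 1.5)
    # Fuzzy AND is taken as the minimum of the memberships.
    return {
        "short_hard": min(short, hard),
        "long_soft": min(long_, soft),
    }


def best_class(degrees):
    """Pick the class with the highest degree of membership."""
    return max(degrees, key=degrees.get)
```

Unlike a hard threshold, an event near the boundary receives intermediate degrees in both classes, so borderline signals can be flagged for closer inspection instead of being forced into one class.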

5 Steps for the model

Figure 4: Composition of the GRBM instrument archive

The following steps show how to improve the ability to filter events.

  1. Definition of quantitative data and attributes characterizing a transient event (event list, duration in seconds, location, spectral measures, flux, …)

  2. Definition of categorical attributes characterizing a transient event (hardness, class of duration, number of instruments that detected it, flux level, …)

  3. Application of 3 basic DM techniques:

In Feroci et al. (1999) and Lazzarotto (2001), we worked mainly on time series to partition the data set into known classes. A complete KDD system allows us to improve on past work and to apply the predictive techniques we want to show.
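The first two steps can be sketched as a record of quantitative attributes plus a function that derives categorical attributes from it. The field names and all thresholds below (2 s, 1.0, 5.0) are illustrative assumptions, not values taken from the GRBM analysis.

```python
from dataclasses import dataclass


@dataclass
class TransientEvent:
    """Quantitative attributes of one transient event (hypothetical fields)."""
    duration_s: float
    hardness: float
    peak_flux: float
    n_instruments: int


def categorical_attributes(ev):
    """Map quantitative attributes to a set of categorical attribute instances."""
    return frozenset({
        "duration=" + ("short" if ev.duration_s < 2.0 else "long"),
        "hardness=" + ("hard" if ev.hardness >= 1.0 else "soft"),
        "flux=" + ("high" if ev.peak_flux >= 5.0 else "low"),
        "multi_instrument=" + ("yes" if ev.n_instruments > 1 else "no"),
    })
```

Representing each event as a set of `name=value` strings keeps the categorical view directly usable as input for rule mining, while the numeric fields remain available for clustering.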

5.1 Associations among event attributes

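  • We define an event as a cube of observations ($O$) for that event, temporally divided into Peaks ($P$) with some Attributes ($A$)

Let an event $E$ be a set of categorical attribute instances. Let $D$ be a set of events.

Association rule: $X \Rightarrow Y$, with $X$ and $Y$ disjoint sets of categorical attribute instances ($X \cap Y = \emptyset$)

  • Support = $|\{E \in D : X \cup Y \subseteq E\}| \, / \, |D|$

  • Confidence = $|\{E \in D : X \cup Y \subseteq E\}| \, / \, |\{E \in D : X \subseteq E\}|$

Problem: find all the rules in a transient events dataset $D$ whose support and confidence are not lower than chosen minimum thresholds.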
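The following sketch computes support and confidence in their standard form: support is the fraction of events containing every attribute instance of the rule, and confidence is that count divided by the number of events containing the antecedent. The brute-force search is restricted to rules with a single attribute on each side and is only meant to make the definitions concrete; a real system would use an algorithm such as Apriori. The attribute names and the five sample events are hypothetical.

```python
from itertools import permutations


def count_containing(itemset, dataset):
    """Number of events that contain every attribute instance in itemset."""
    return sum(1 for event in dataset if itemset <= event)


def support(itemset, dataset):
    """Fraction of events in the dataset that contain itemset."""
    return count_containing(itemset, dataset) / len(dataset)


def confidence(antecedent, consequent, dataset):
    """Fraction of events containing the antecedent that also contain the consequent."""
    n_antecedent = count_containing(antecedent, dataset)
    if n_antecedent == 0:
        return 0.0
    return count_containing(antecedent | consequent, dataset) / n_antecedent


def find_simple_rules(dataset, min_support, min_confidence):
    """Brute-force search for rules {a} => {b} above both thresholds."""
    items = sorted(set().union(*dataset))
    rules = []
    for a, b in permutations(items, 2):
        x, y = frozenset({a}), frozenset({b})
        if (support(x | y, dataset) >= min_support
                and confidence(x, y, dataset) >= min_confidence):
            rules.append((a, b))
    return rules


# Hypothetical events, each given as a set of categorical attribute instances.
EVENTS = [
    frozenset({"short", "hard", "high"}),
    frozenset({"short", "hard", "low"}),
    frozenset({"long", "soft", "low"}),
    frozenset({"long", "soft", "high"}),
    frozenset({"short", "soft", "low"}),
]
```

On this toy dataset, `find_simple_rules(EVENTS, 0.4, 0.9)` keeps only the rules whose antecedent is always accompanied by the consequent, for instance "hard implies short".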

Figure 5: formal and visual model for a transient event

5.2 Density based clustering of events

Figure 6: Clustering algorithms to discover burst classes

Another approach to knowledge extraction from a large and noisy database of transient events is to adopt clustering algorithms. We chose to apply a density-based clustering algorithm. The idea is that clusters of events are dense regions of a multidimensional space, distinguished from the sparse regions that represent noise. These algorithms require each event to be defined as a set of attributes that satisfies the axioms of a metric space. The basic criterion is the following: an event belongs to a cluster if the density of events in its neighborhood exceeds a certain threshold. A fundamental work in this field is Density-Based Spatial Clustering of Applications with Noise (DBSCAN; Ester, Kriegel, Sander & Xu 1996).
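A compact, unoptimized sketch of the DBSCAN idea is given below, assuming each event is a tuple of numeric attributes compared with the Euclidean distance. The parameter names `eps` (neighborhood radius) and `min_pts` (density threshold) follow common usage; the neighborhood search is a plain linear scan, so this version is suitable only for small datasets and is not the implementation proposed for the archive.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not dense enough: provisionally noise (may become a border point).
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                # Border point: reachable from a core point but not dense itself.
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # j is a core point, so the cluster grows through its neighbors.
                seeds.extend(j_neighbors)
    return labels
```

In practice the attributes should be rescaled before clustering (for example, using the logarithm of the duration), since a single `eps` is applied along every dimension.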

6 Conclusions

  • We defined a basic criterion to select a transient event from a time series (i.e. T90, as in Feroci et al. 1999; Lazzarotto 2001).

  • By applying correctly defined event attributes (duration, position, hardness, shape (rising/falling front, FWHM)), we now have the correct methodological system and theoretical instruments to launch our next analysis.

  • We have to spend a startup time to implement the global system, engineering actual tools and algorithms, and to train the system on known results.

  • Then we can launch the system on large datasets, in order to find classifications among events and relations among event attributes, without assuming criteria that always prove to be neither general nor precise when we change or enlarge our datasets of hypothetical GRB events.

  • We can quickly change the point of interest, change the kind of data with minimal work, and have facilities to represent results.

This is very important for a mysterious problem like GRBs, where large amounts of instrumental data have not yet been globally analyzed.

7 Acronyms

DB

DataBase.

DBMS

Data Base Management System.

DM

Data Mining.

DW

Data Warehousing.

FWHM

Full Width at Half Maximum.

GRB

Gamma Ray Burst.

GRBM

Gamma Ray Burst Monitor.

KDD

Knowledge Discovery in Database.

OLAP

OnLine Analytical Processing.

SAX

Satellite for Astronomy in X rays.

SGR

Soft Gamma ray Repeater.

SQL

Structured Query Language.

References

Frawley, W. et al., 1992, "Knowledge Discovery in Databases: An Overview", AI Magazine, pgs 213-228.

Ester, M. et al., 1996, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" (DBSCAN), Proc. of KDD'96, pgs 226-231.

Agrawal, R. et al. (IBM), 1996, "The Quest Data Mining System", Proc. of KDD'96, pgs 244-249.

Atzeni, P. et al., 1999, "Database Systems: Concepts, Languages & Architectures" (book), McGraw-Hill, ISBN 007-709500-6.

Feroci, M. et al., 1999, "A Robust Filter for the BeppoSAX Gamma Ray Burst Monitor Triggers", astro-ph/9912488, Proceedings of the 5th Huntsville Gamma-Ray Bursts Symposium, pgs 711-715.

Schade, D. et al., 2000, "A Data Mining Model for Astronomy", in ASP Conf. Ser., Vol. 216, Astronomical Data Analysis Software and Systems IX, pgs 215-218.

Lazzarotto, F., 2001, Computer Science Master Thesis at La Sapienza University of Rome (190 pages, in Italian).

Borgelt, C. et al., 2002, "Induction of Association Rules: Apriori Implementation", Compstat, DOI:10.1007/978-3-642-57489-4_59.