
Similarity Search on Computational Notebooks

01/30/2022
by Misato Horiuchi, et al.
Osaka University

Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and are reusable; however, searching for them manually is tedious, and so far there are no tools that search for computational notebooks effectively and efficiently. In this paper, we propose similarity search on computational notebooks and develop a new framework for it. Given contents (i.e., source code, tabular data, libraries, and output formats) in computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarity. Set-based similarity handles each content independently, while graph-based similarity captures the relationships between contents. Our framework can effectively prune candidate notebooks that cannot be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular with graph-based similarity, can achieve high accuracy and high efficiency.
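
The set-based idea can be illustrated with a minimal sketch. The abstract does not give the paper's actual similarity formulas, so everything here is an assumption made for illustration: each notebook is modelled as a dictionary of sets keyed by content type, each content type is scored independently with Jaccard similarity, the per-type scores are averaged, and the top-k results are selected with a heap. The names `CONTENT_TYPES`, `jaccard`, `set_similarity`, and `top_k` are hypothetical, and the sketch scans every notebook rather than reproducing the pruning, caching, or indexing the abstract mentions.

```python
import heapq

# Hypothetical content types, mirroring the four kinds of contents named in the abstract.
CONTENT_TYPES = ("code", "data", "libraries", "outputs")


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def set_similarity(query, notebook):
    """Average the per-content Jaccard scores over the content types present in the query.

    Each content type is scored independently of the others (an assumed stand-in
    for the paper's set-based measure, not its actual definition).
    """
    types = [t for t in CONTENT_TYPES if t in query]
    if not types:
        return 0.0
    scores = [jaccard(query[t], notebook.get(t, set())) for t in types]
    return sum(scores) / len(scores)


def top_k(query, notebooks, k):
    """Return the k highest-scoring (score, name) pairs, best first, via a full scan."""
    scored = ((set_similarity(query, nb), name) for name, nb in notebooks.items())
    return heapq.nlargest(k, scored)


# Toy corpus: made-up notebooks described only by libraries and output formats.
notebooks = {
    "nb_a": {"libraries": {"pandas", "numpy", "sklearn"}, "outputs": {"table", "plot"}},
    "nb_b": {"libraries": {"pandas", "matplotlib"}, "outputs": {"plot"}},
    "nb_c": {"libraries": {"torch"}, "outputs": {"text"}},
}
query = {"libraries": {"pandas", "numpy"}, "outputs": {"plot"}}

result = top_k(query, notebooks, k=2)
for score, name in result:
    print(f"{name}: {score:.3f}")
```

On this toy corpus, `nb_b` ranks above `nb_a` because its output formats match the query exactly, even though `nb_a` shares more libraries. A graph-based measure, as described in the abstract, would additionally account for how the contents relate to one another, which this sketch deliberately ignores.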

