Similarity Search on Computational Notebooks

01/30/2022
by Misato Horiuchi, et al.

Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and are reusable; however, searching for them manually is tedious, and so far there are no tools to search for computational notebooks effectively and efficiently. In this paper, we propose a similarity search on computational notebooks and develop a new framework for it. Given contents (i.e., source code, tabular data, libraries, and output formats) in computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarity. Set-based similarity handles each content type independently, while graph-based similarity captures the relationships between contents. Our framework can effectively prune candidate computational notebooks that should not be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular graph-based similarity, can achieve high accuracy and high efficiency.
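
As a rough illustration of the set-based idea, the sketch below scores each notebook by comparing every content type independently and then returns the top-k matches. It is only a minimal sketch under stated assumptions: the abstract does not specify the similarity functions, so Jaccard similarity, the equal default weights, and the names `Notebook`, `CONTENT_TYPES`, `set_similarity`, and `top_k` are all hypothetical choices made here, and the pruning, caching, and indexing mentioned above are not modeled.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical representation: each notebook maps a content type to a set of items.
Notebook = Dict[str, Set[str]]

# The four content types named in the abstract (the keys are an assumption).
CONTENT_TYPES = ("code", "data", "libraries", "outputs")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity; two empty sets are treated as having similarity 0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def set_similarity(
    query: Notebook,
    notebook: Notebook,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of per-type Jaccard scores, each type scored independently."""
    weights = weights or {t: 1.0 for t in CONTENT_TYPES}
    total = sum(weights.values())
    score = sum(
        weights[t] * jaccard(query.get(t, set()), notebook.get(t, set()))
        for t in CONTENT_TYPES
    )
    return score / total


def top_k(query: Notebook, corpus: Dict[str, Notebook], k: int) -> List[Tuple[float, str]]:
    """Brute-force top-k: score every notebook and keep the k best (no pruning)."""
    scored = ((set_similarity(query, nb), name) for name, nb in corpus.items())
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    query = {"code": {"read_csv", "fit"}, "libraries": {"pandas", "sklearn"},
             "data": set(), "outputs": {"plot"}}
    corpus = {
        "a": {"code": {"read_csv", "fit"}, "libraries": {"pandas", "sklearn"},
              "data": set(), "outputs": {"plot"}},
        "b": {"code": {"read_csv"}, "libraries": {"pandas"},
              "data": set(), "outputs": {"table"}},
        "c": {"code": {"train"}, "libraries": {"torch"},
              "data": set(), "outputs": {"text"}},
    }
    for score, name in top_k(query, corpus, k=2):
        print(name, score)
```

A graph-based measure would instead need to compare how the contents relate to each other within a notebook, which this per-type comparison deliberately ignores.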


research
06/17/2017

An Efficient Probabilistic Approach for Graph Similarity Search

Graph similarity search is a common and fundamental operation in graph d...
research
07/22/2021

LES3: Learning-based Exact Set Similarity Search

Set similarity search is a problem of central interest to a wide variety...
research
04/05/2021

Predicting Mergers and Acquisitions using Graph-based Deep Learning

The graph data structure is a staple in mathematics, yet graph-based mac...
research
10/26/2022

Constrained Approximate Similarity Search on Proximity Graph

Search engines and recommendation systems are built to efficiently displ...
research
10/07/2018

Multi-reference Cosine: A New Approach to Text Similarity Measurement in Large Collections

The importance of an efficient and scalable document similarity detectio...
research
08/08/2019

VisJSClassificator – Manual Visual Collaborative Classification Graph-based Tool

Analysts need to classify, search and correlate numerous images. Automat...
research
12/05/2013

ABC-SG: A New Artificial Bee Colony Algorithm-Based Distance of Sequential Data Using Sigma Grams

The problem of similarity search is one of the main problems in computer...
