Similarity Search on Computational Notebooks

by Misato Horiuchi, et al.
Osaka University

Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and are reusable; however, searching for them manually is tedious, and so far there are no tools that search for computational notebooks effectively and efficiently. In this paper, we propose a similarity search on computational notebooks and develop a new framework for it. Given contents (i.e., source code, tabular data, libraries, and output formats) in computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarity. Set-based similarity handles each content type independently, while graph-based similarity captures the relationships between contents. Our framework can effectively prune candidate computational notebooks that should not appear in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular with graph-based similarity, can achieve high accuracy and high efficiency.
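To make the set-based, top-k formulation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each notebook is represented as a dictionary mapping a content type to a set of items, that per-type similarity is Jaccard similarity, and that the per-type scores are averaged; the abstract specifies none of these choices. The names `jaccard`, `set_based_similarity`, and `top_k` are hypothetical, and the sketch does a plain linear scan with no pruning, caching, or indexing.

```python
import heapq


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def set_based_similarity(query, notebook):
    """Average Jaccard score over the content types present in the query.

    Assumption: each content type is scored independently and the
    scores are averaged with equal weight.
    """
    types = [t for t, items in query.items() if items]
    if not types:
        return 0.0
    total = sum(jaccard(query[t], notebook.get(t, ())) for t in types)
    return total / len(types)


def top_k(query, notebooks, k):
    """Return the k best (score, name) pairs, highest score first."""
    scored = ((set_based_similarity(query, nb), name)
              for name, nb in notebooks.items())
    return heapq.nlargest(k, scored)


# Toy data, invented purely for illustration.
notebooks = {
    "a": {"libraries": {"pandas", "numpy"}, "code": {"read_csv", "fit"}},
    "b": {"libraries": {"pandas"}, "code": {"plot"}},
    "c": {"libraries": {"torch"}, "code": {"fit"}},
}
query = {"libraries": {"pandas", "numpy"}, "code": {"read_csv"}}

results = top_k(query, notebooks, k=2)
for score, name in results:
    print(f"{name}: {score:.2f}")
```

Because each content type is scored separately, this sketch ignores how contents relate to one another (for example, which library produced which output), which is the kind of relationship the graph-based measure is described as capturing.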

