Sherlock in OSS: A Novel Approach of Content-Based Searching in Object Storage System

01/24/2023
by   Jannatun Noor, et al.

Object Storage Systems (OSS) inside a cloud promise scalability, durability, availability, and concurrency. However, open-source OSS offers no dedicated way for users and administrators to search by the contents of the stored objects without involving the entire cloud infrastructure. Therefore, in this paper, we propose Sherlock, a novel Content-Based Searching (CoBS) architecture that extracts additional information from images and documents and stores it in an Elasticsearch-enabled database, which lets us search for the desired data based on its contents. The approach works in two sequential stages. First, each uploaded object is passed to a classifier that determines its data type and forwards it to the model for that type: images are sent to our trained object detection model, and documents are sent for keyword extraction. Next, the extracted information is sent to Elasticsearch, which enables searching based on the contents. Because the precision of the models is fundamental to the correctness of the search, we train our models on comprehensive datasets: the Microsoft COCO Dataset for multimedia data and the SemEval2017 Dataset for document data. Furthermore, we put our designed architecture to the test with a real-world implementation on an open-source OSS, OpenStack Swift. In addition, we upload images into the dataset in various segments to measure the efficacy of our proposed model in real-life Swift object storage.
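The two-stage flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the extension-based `classify` function, the stub extractors `detect_objects` and `extract_keywords`, and the in-memory `ContentIndex` (standing in for Elasticsearch) are all hypothetical names and simplifications chosen for illustration. A real deployment would call a trained object detection model, a keyword extraction model, and an actual Elasticsearch cluster instead.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: the data type is inferred from the file extension.
# The abstract only says "a classifier"; it does not say how it decides.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
DOC_EXTS = {".txt", ".pdf", ".docx"}


def classify(object_name):
    """Stage 1a: choose a data type for an uploaded object."""
    ext = PurePosixPath(object_name).suffix.lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in DOC_EXTS:
        return "document"
    return "unknown"


def detect_objects(labels):
    """Stub for the object detection model.

    Hypothetical: the caller passes the labels a detector would return,
    so this sketch runs without any ML dependency.
    """
    return [label.lower() for label in labels]


def extract_keywords(text):
    """Stub for keyword extraction: keeps distinct words longer than 4 characters."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return sorted({w for w in words if len(w) > 4})


class ContentIndex:
    """In-memory inverted index used here as a stand-in for Elasticsearch."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, object_name, terms):
        for term in terms:
            self._postings[term].add(object_name)

    def search(self, term):
        # dict.get avoids creating empty entries for unknown terms.
        return sorted(self._postings.get(term.lower(), set()))


def ingest(index, object_name, payload):
    """Stage 1b and stage 2: route to an extractor, then index the result."""
    kind = classify(object_name)
    if kind == "image":
        terms = detect_objects(payload)
    elif kind == "document":
        terms = extract_keywords(payload)
    else:
        terms = []  # unknown types are stored without content terms
    index.add(object_name, terms)
    return kind


if __name__ == "__main__":
    index = ContentIndex()
    ingest(index, "photos/beach.jpg", ["Dog", "Person"])
    ingest(index, "docs/report.txt", "Object storage scales horizontally.")
    print(index.search("dog"))
    print(index.search("storage"))
```

Keeping the classifier, the extractors, and the index as separate pieces mirrors the separation in the abstract: the routing logic does not need to know how terms are produced, and the index does not need to know where they came from.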

