Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes

12/11/2019
by Masaki Oguni, et al.

On websites that collect user-generated recipes, users often post recipes in which a major component, such as the cooking instructions, is very similar to that of other recipes. We refer to such recipes as "near-duplicate recipes". In this study, we propose a method that extends the Word Mover's Distance, which calculates the distance between texts based on word embeddings, to character 3-gram embeddings. Using a corpus of over 1.21 million recipes, we learned word embeddings and character 3-gram embeddings using a Skip-Gram model with negative sampling and fastText, and used them to extract candidate pairs of near-duplicate recipes. We then annotated these candidates and evaluated the proposed method against a comparison method. The results show that the proposed method detected near-duplicate recipes that the comparison method missed.
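To make the idea concrete, the following is a minimal sketch of a mover's-style distance over character 3-grams. It is not the authors' implementation, and it rests on two loudly stated assumptions: `toy_embedding` is a hypothetical stand-in that assigns a fixed pseudo-random vector to each 3-gram (the study instead learns embeddings from the recipe corpus), and the distance computed is the relaxed lower bound on the mover's distance (each 3-gram sends all of its weight to its nearest counterpart) rather than the exact optimal-transport solution. All function names are illustrative.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def nbow(grams):
    """Normalized bag of n-grams: map each n-gram to its relative frequency."""
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def toy_embedding(gram, dim=8):
    """Hypothetical stand-in for a trained embedding (assumption, not from the
    source): a fixed pseudo-random vector derived from the n-gram string."""
    rng = random.Random(gram)  # string seed, so the vector is reproducible
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_movers_distance(text_a, text_b, embed=toy_embedding, n=3):
    """Relaxed (lower-bound) mover's distance between two texts.

    Each n-gram moves all of its weight to the closest n-gram of the other
    text; the larger of the two one-sided costs is returned.
    """
    a = nbow(char_ngrams(text_a, n))
    b = nbow(char_ngrams(text_b, n))
    if not a or not b:
        raise ValueError("both texts must contain at least n characters")

    def one_sided(src, dst):
        dst_vecs = [embed(g) for g in dst]
        return sum(
            weight * min(euclidean(embed(g), v) for v in dst_vecs)
            for g, weight in src.items()
        )

    return max(one_sided(a, b), one_sided(b, a))
```

For example, `char_ngrams("玉ねぎを切る")` yields the four 3-grams `玉ねぎ`, `ねぎを`, `ぎを切`, and `を切る`, with no word segmentation step required. With trained embeddings passed in through the `embed` parameter, a smaller distance would indicate that two instruction texts are more likely to be near-duplicates; with the toy vectors here, only the zero distance between identical texts is meaningful.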
