Detect Camouflaged Spam Content via StoneSkipping: Graph and Text Joint Embedding for Chinese Character Variation Representation

by Zhuoren Jiang, et al.

Chinese text spam detection is challenging because Chinese characters admit both glyph and phonetic variations. This paper proposes a novel framework that jointly models Chinese variational, semantic, and contextualized representations for the Chinese text spam detection task. In particular, a Variation Family-enhanced Graph Embedding (VFGE) algorithm is designed based on a Chinese character variation graph. VFGE learns both the graph embeddings of the Chinese characters (local) and the latent variation families (global). Furthermore, an enhanced bidirectional language model, with a combination gate function and an aggregation learning function, is proposed to integrate the graph and text information while capturing sequential information. Extensive experiments on both SMS and review datasets show that the proposed method outperforms a series of state-of-the-art models for Chinese spam detection.
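
The abstract names a combination gate that merges graph and text information but does not give its form. As a rough illustration only, the sketch below shows one common way such a gate is built: an element-wise sigmoid gate that interpolates between a graph embedding and a text embedding of the same size. This is an assumption, not the paper's formulation; the function name `combination_gate`, the weight layout, and the random toy vectors are all hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def combination_gate(graph_vec, text_vec, weights, bias):
    """Blend two same-sized embeddings with an element-wise sigmoid gate.

    Hypothetical sketch: gate_i = sigmoid(weights[i] . [graph; text] + bias[i]),
    output_i = gate_i * graph_i + (1 - gate_i) * text_i.
    """
    concat = graph_vec + text_vec  # list concatenation: [graph; text]
    combined = []
    for i, (g_i, t_i) in enumerate(zip(graph_vec, text_vec)):
        score = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(score)
        combined.append(gate * g_i + (1.0 - gate) * t_i)
    return combined


# Toy inputs standing in for one character's graph and text embeddings.
random.seed(0)
dim = 4
graph_vec = [random.uniform(-1, 1) for _ in range(dim)]
text_vec = [random.uniform(-1, 1) for _ in range(dim)]
weights = [[random.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim

mixed = combination_gate(graph_vec, text_vec, weights, bias)
```

Because the gate lies strictly between 0 and 1, each output component falls between the corresponding graph and text components. In a trained model the weights would be learned rather than sampled at random.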




Component-Enhanced Chinese Character Embeddings

Distributed word representations are very useful for capturing semantic ...

Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking

Chinese Spell Checking (CSC) aims to detect and correct erroneous charac...

SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check

Chinese Spelling Check (CSC) is a task to detect and correct spelling er...

Contextual Similarity is More Valuable than Character Similarity: Curriculum Learning for Chinese Spell Checking

Chinese Spell Checking (CSC) task aims to detect and correct Chinese spe...

Optimizing the Learning Order of Chinese Characters Using a Novel Topological Sort Algorithm

We present a novel algorithm for optimizing the order in which Chinese c...

A BERT-based Dual Embedding Model for Chinese Idiom Prediction

Chinese idioms are special fixed phrases usually derived from ancient st...

Learning Joint Gaussian Representations for Movies, Actors, and Literary Characters

Understanding of narrative content has become an increasingly popular to...