Tree-based Focused Web Crawling with Reinforcement Learning

12/12/2021
by   Andreas Kontogiannis, et al.
0

A focused crawler aims at discovering as many web pages relevant to a target topic as possible, while avoiding irrelevant ones; i.e. maximizing the harvest rate. Reinforcement Learning (RL) has been utilized to optimize the crawling process, yet it deals with huge state and action spaces, which can constitute a serious challenge. In this paper, we propose TRES, an end-to-end RL-empowered framework for focused crawling. Unlike other approaches, we properly model a crawling environment as a Markov Decision Process, by representing the state as a subgraph of the Web and actions as its expansion edges. TRES adopts a keyword expansion strategy based on the cosine similarity of keyword embeddings. To learn a reward function, we propose a deep neural network, called KwBiLSTM, leveraging the discovered keywords. To reduce the time complexity of selecting a best action, we propose Tree-Frontier, a two-fold decision tree, which also speeds up training by discretizing the state and action spaces. Experimentally, we show that TRES outperforms state-of-the-art methods in terms of harvest rate by at least 58 Our implementation code can be found on https://github.com/ddaedalus/TRES.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/05/2020

State Action Separable Reinforcement Learning

Reinforcement Learning (RL) based methods have seen their paramount succ...
research
05/15/2018

Feedback-Based Tree Search for Reinforcement Learning

Inspired by recent successes of Monte-Carlo tree search (MCTS) in a numb...
research
10/12/2021

StARformer: Transformer with State-Action-Reward Representations

Reinforcement Learning (RL) can be considered as a sequence modeling tas...
research
06/01/2021

Exploring Dynamic Selection of Branch Expansion Orders for Code Generation

Due to the great potential in facilitating software development, code ge...
research
01/28/2022

Discovering Exfiltration Paths Using Reinforcement Learning with Attack Graphs

Reinforcement learning (RL), in conjunction with attack graphs and cyber...
research
10/07/2020

Improved POMDP Tree Search Planning with Prioritized Action Branching

Online solvers for partially observable Markov decision processes have d...

Please sign up or login with your details

Forgot password? Click here to reset