Augmenting Rule-based DNS Censorship Detection at Scale with Machine Learning

The global proliferation of censorship has led to the development of numerous measurement platforms that monitor and expose it. Censorship of the domain name system (DNS) is a key mechanism used across different countries. It is currently detected by applying heuristics to samples of DNS queries and responses (probes) for specific destinations. These heuristics, however, are platform-specific and have been found to be brittle when censors change their blocking behavior, necessitating a more reliable automated process for detecting censorship. In this paper, we explore how machine learning (ML) models can (1) help streamline the detection process, (2) improve the usability of large-scale datasets for censorship detection, and (3) discover new censorship instances and blocking signatures missed by existing heuristic methods. Our study shows that supervised models, trained using expert-derived labels on instances of known anomalies and possible censorship, can learn the detection heuristics employed by different measurement platforms. More crucially, we find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing heuristics. Moreover, both methods can uncover a substantial number of new DNS blocking signatures, i.e., injected fake IP addresses overlooked by existing heuristics. These results are underpinned by an important methodological finding: comparing the outputs of models trained on the same probes, but with labels arising from independent processes, allows us to detect cases of censorship more reliably in the absence of ground-truth censorship labels.

research
05/25/2020

Improving Web Content Blocking With Event-Loop-Turn Granularity JavaScript Signatures

Content blocking is an important part of a performant, user-serving, pri...
research
08/15/2022

Non-Blocking Batch A* (Technical Report)

Heuristic search has traditionally relied on hand-crafted or programmati...
research
12/03/2021

Bridging the gap between prostate radiology and pathology through machine learning

Prostate cancer is the second deadliest cancer for American men. While M...
research
09/06/2023

Detecting Manufacturing Defects in PCBs via Data-Centric Machine Learning on Solder Paste Inspection Features

Automated detection of defects in Printed Circuit Board (PCB) manufactur...
research
02/07/2022

Jury Learning: Integrating Dissenting Voices into Machine Learning Models

Whose labels should a machine learning (ML) algorithm learn to emulate? ...
research
10/07/2022

Scaling Directed Controller Synthesis via Reinforcement Learning

Directed Controller Synthesis technique finds solutions for the non-bloc...
research
05/14/2023

CERTainty: Detecting DNS Manipulation at Scale using TLS Certificates

DNS manipulation is an increasingly common technique used by censors and...
