Pythia: a Framework for the Automated Analysis of Web Hosting Environments

03/16/2019
by Srdjan Matic, et al.

A common approach when setting up a website is to rely on third-party Web hosting and content delivery networks. Any measurement study that inspects the deployment and operation of websites without taking this trend into account can be heavily skewed. Unfortunately, the research community lacks generalizable tools for identifying how and where a given website is hosted. Instead, a number of ad hoc techniques have emerged, e.g., using Autonomous System databases or domain prefixes for CNAME records. In this work we propose Pythia, a novel lightweight approach for identifying Web content hosted on third-party infrastructures, including both traditional Web hosts and content delivery networks. Our framework identifies the organization to which a given Web page belongs, and it detects which Web servers are self-hosted and which ones leverage third-party services to serve content. To test our framework we run it on 40,000 URLs and evaluate its accuracy by comparing the results both with similar services and with a manually validated ground truth. Our tool achieves an accuracy of 90% [...] are self-hosted. We publicly release our tool to allow other researchers to reproduce our findings and to apply it to their own studies.
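To make the ad hoc CNAME technique mentioned in the abstract concrete, the following is a minimal sketch of that kind of heuristic. It is not Pythia's method. The suffix table, the function name `provider_from_cname_chain`, and the example hostnames are assumptions introduced here for illustration, and the CNAME chain is assumed to have been resolved beforehand by some DNS client.

```python
from typing import Iterable, Optional

# Illustrative suffix table (an assumption, not taken from the paper).
# A real study would need a much larger, curated list.
KNOWN_PROVIDER_SUFFIXES = {
    "cloudfront.net": "Amazon CloudFront",
    "akamaiedge.net": "Akamai",
    "fastly.net": "Fastly",
}


def provider_from_cname_chain(cname_chain: Iterable[str]) -> Optional[str]:
    """Return the first provider whose suffix matches a name in the chain.

    The chain is the list of CNAME targets already resolved for a hostname.
    Returns None when no name matches, which only means the heuristic
    found nothing; it does not prove that the site is self-hosted.
    """
    for name in cname_chain:
        # Normalize: drop the trailing root dot and ignore case.
        host = name.rstrip(".").lower()
        for suffix, provider in KNOWN_PROVIDER_SUFFIXES.items():
            # Match the suffix itself or any subdomain of it.
            if host == suffix or host.endswith("." + suffix):
                return provider
    return None


if __name__ == "__main__":
    # Hypothetical CNAME chain for a site fronted by a CDN.
    chain = ["www.example.org.", "d1234abcd.cloudfront.net."]
    print(provider_from_cname_chain(chain))
```

A lookup of this kind depends entirely on the coverage of the suffix table, which is one reason the abstract describes such techniques as ad hoc rather than generalizable.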
