Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests

by Kritika Garg, et al.

Upon replay, JavaScript on archived web pages can generate recurring HTTP requests that lead to unnecessary traffic to the web archive. In one example, an archived page averaged more than 1000 requests per minute. These requests are not visible to the user, so a user who leaves such an archived page open in a browser tab would be unaware that their browser is continuing to generate traffic to the web archive. We found that web pages that require regular updates (e.g., radio playlists, sports score updates, image carousels) are more likely to make such recurring requests. If the resources requested by the web page are not archived, some web archives may attempt to patch the archive by requesting the resources from the live web. If the requested resources are also unavailable on the live web, they cannot be archived, and the responses remain HTTP 404. Some archived pages continue to poll the server as frequently as they did on the live web, while others poll the server even more frequently when their requests return HTTP 404 responses, creating a large amount of unnecessary traffic. At scale, such web pages are effectively a denial-of-service attack on the web archive. Significant computational, network, and storage resources are required for web archives to archive and then successfully replay pages as they were on the live web, and these resources should not be spent on unnecessary HTTP traffic. Our proposed solution is to optimize archival replay using Cache-Control HTTP response headers. We implemented this approach in a test environment, where caching HTTP 404 responses prevented the browser's requests from reaching the web archive server.
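The idea of letting the browser cache HTTP 404 responses can be illustrated with a minimal sketch. This is not the authors' implementation: the helper `cache_headers`, the `ReplayHandler` class, the `make_server` function, and the one-hour `max-age` value are all hypothetical, chosen only to show how a replay server might attach a Cache-Control header to a 404 so that repeated polls for the same missing resource can be answered from the browser cache.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical value: how long (in seconds) a browser may reuse a cached 404.
NOT_FOUND_MAX_AGE = 3600


def cache_headers(status: int, max_age: int = NOT_FOUND_MAX_AGE) -> dict:
    """Return extra response headers for a replayed resource.

    Only 404 responses get an explicit Cache-Control header in this sketch;
    other statuses are left to the server's default behavior.
    """
    if status == 404:
        return {"Cache-Control": f"public, max-age={max_age}"}
    return {}


class ReplayHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a replay server that holds no archived resources."""

    def do_GET(self):
        status = 404  # assumption: every requested resource is missing
        body = b"Resource not archived\n"
        self.send_response(status)
        for name, value in cache_headers(status).items():
            self.send_header(name, value)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging to keep the example quiet.
        pass


def make_server(port: int = 0) -> HTTPServer:
    """Bind the toy replay server to localhost (port 0 picks a free port)."""
    return HTTPServer(("127.0.0.1", port), ReplayHandler)
```

Calling `make_server(8080).serve_forever()` would start the toy server locally. With the header in place, a browser that honors it can reuse the stored 404 for later identical requests until `max-age` expires, instead of contacting the server each time. A suitable lifetime is a policy choice for the archive and is not specified here.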




Making Recommendations from Web Archives for "Lost" Web Pages

When a user requests a web page from a web archive, the user will typica...

Demystifying Mobile Web Browsing under Multiple Protocols

With the popularity of mobile devices, such as smartphones, tablets, use...

Impact of HTTP Cookie Violations in Web Archives

Certain HTTP Cookies on certain sites can be a source of content bias in...

Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering

Many web sites are transitioning how they construct their pages. The con...

DSL-driven Integration of HTTP Services in DIME

As the integration of web services into web applications becomes more an...

What Did It Look Like: A service for creating website timelapses using the Memento framework

Popular web pages are archived frequently, which makes it difficult to v...

On the Persistence of Persistent Identifiers of the Scholarly Web

Scholarly resources, just like any other resources on the web, are subje...
