Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering

05/01/2023
by Michele C. Weigle, et al.

Many web sites are changing how they construct their pages. In the conventional model, the content is embedded server-side in the HTML and returned to the client in an HTTP response. Increasingly, sites are moving to a model where the initial HTTP response contains only an HTML skeleton plus JavaScript that makes API calls to a variety of servers for the content (typically in JSON format) and then builds out the DOM client-side, which makes it easier to refresh the content of a page periodically and to modify it dynamically. This client-side rendering, now predominant in social media platforms such as Twitter and Instagram, is also being adopted by news outlets, such as CNN.com. When conventional web archiving techniques, such as crawling with Heritrix, are applied to pages that render their content client-side, the JSON responses can become out of sync with the HTML page in which they are to be embedded, resulting in temporal violations on replay. Because the violative JSON is not directly observable in the page (i.e., in the same manner a violative embedded image is), the temporal violations can be difficult to detect. We describe how the top-level CNN.com page has used client-side rendering since April 2015 and the impact this has had on web archives. Between April 24, 2015 and July 21, 2016, we found almost 15,000 mementos with a temporal violation of more than 2 days between the base CNN.com HTML and the JSON responses used to deliver the content under the main story. One way to mitigate this problem is to use browser-based crawling instead of conventional crawlers like Heritrix, but browser-based crawling is currently much slower than non-browser-based tools such as Heritrix.
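The notion of a temporal violation described above can be illustrated with a minimal Python sketch. It is not the authors' method. It assumes that the capture times of the base HTML and of the JSON response are available as 14-digit timestamps (YYYYMMDDhhmmss), the form used in Wayback-style archive URLs, and it borrows the 2-day threshold from the abstract. The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# 14-digit capture timestamp, as found in Wayback-style archive URLs.
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def parse_timestamp(ts: str) -> datetime:
    """Parse a 14-digit capture timestamp into a datetime."""
    return datetime.strptime(ts, TIMESTAMP_FORMAT)


def temporal_gap(html_ts: str, json_ts: str) -> timedelta:
    """Absolute time difference between the HTML and JSON captures."""
    return abs(parse_timestamp(json_ts) - parse_timestamp(html_ts))


def is_temporal_violation(
    html_ts: str,
    json_ts: str,
    threshold: timedelta = timedelta(days=2),  # assumed, taken from the abstract
) -> bool:
    """True when the JSON capture is further from the HTML capture than the threshold."""
    return temporal_gap(html_ts, json_ts) > threshold


if __name__ == "__main__":
    # Hypothetical capture times: HTML on 1 May 2015, JSON nine days later.
    print(temporal_gap("20150501120000", "20150510120000"))
    print(is_temporal_violation("20150501120000", "20150510120000"))
```

A check of this kind only flags the gap between capture times. Whether the replayed page actually shows mismatched content still depends on what the JSON contained at each capture.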


