Google Dataset Search by the Numbers

06/12/2020
by Omar Benjelloun, et al.

Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search extracts dataset metadata – expressed using schema.org and similar vocabularies – from Web pages in order to make datasets discoverable. Since we started work on Dataset Search in 2016, the number of datasets described in schema.org has grown from about 500K to almost 30M. This corpus has thus become a valuable snapshot of data on the Web and, to the best of our knowledge, is the largest and most diverse of its kind. We analyze the corpus and discuss where the datasets originate, what topics they cover, which form they take, and what people searching for datasets are interested in. Based on this analysis, we identify gaps and possible future work to help make data more discoverable.
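
To make the idea of "dataset metadata expressed using schema.org" concrete, the following is a minimal sketch of how such metadata could be read from a Web page. It assumes the page embeds its description as JSON-LD in a `<script type="application/ld+json">` element, which is one of several encodings schema.org permits. The names `JsonLdCollector`, `extract_datasets`, and `SAMPLE_PAGE` are hypothetical and introduced only for illustration; this is not the extraction pipeline used by Dataset Search.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return the JSON-LD objects on the page whose @type includes Dataset."""
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for raw in collector.blocks:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed markup rather than failing the page
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                datasets.append(item)
    return datasets


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example rainfall measurements",
 "description": "Hypothetical dataset used only for illustration."}
</script>
</head><body></body></html>
"""

found = extract_datasets(SAMPLE_PAGE)
print([d["name"] for d in found])
```

The sketch ignores microdata and RDFa encodings, as well as nested `@graph` structures, all of which a real crawler would likely need to handle.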


research
02/16/2018

Analysis of Schema.org Usage in the Tourism Domain

Schema.org is an initiative founded in 2011 by the four-big search engin...
research
01/27/2020

Leveraging Schema Labels to Enhance Dataset Search

A search engine's ability to retrieve desirable datasets is important fo...
research
07/27/2020

On using Product-Specific Schema.org from Web Data Commons: An Empirical Set of Best Practices

Schema.org has experienced high growth in recent years. Structured descr...
research
11/09/2017

Defining Tourism Domains for Semantic Annotation of Web Content

Schema.org is an initiative by Bing, Google, Yahoo! and Yandex that publ...
research
10/19/2012

Exploiting Locality in Searching the Web

Published experiments on spidering the Web suggest that, given training ...
research
04/06/2021

Large-scale Sustainable Search on Unconventional Computing Hardware

Since the advent of the Internet, quantifying the relative importance of...
research
03/01/2018

Inferring Missing Categorical Information in Noisy and Sparse Web Markup

Embedded markup of Web pages has seen widespread adoption throughout the...
