Open Data on GitHub: Unlocking the Potential of AI

06/09/2023
by Anthony Cintron Roman, et al.

GitHub is the world's largest platform for collaborative software development, with over 100 million users. It is also used extensively for open data collaboration, hosting more than 800 million open data files that total 142 terabytes of data. This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research. We analyze the existing landscape of open data on GitHub and the patterns in how users share datasets. Our findings show that GitHub is one of the largest hosts of open data in the world and has experienced accelerated growth in open data assets over the past four years. By examining this landscape, we aim to empower users and organizations to leverage existing open datasets and improve their discoverability, ultimately contributing to the ongoing AI revolution to help address complex societal issues. We release the three datasets collected to support this analysis as open datasets at https://github.com/github/open-data-on-github.

