PyTorrent: A Python Library Corpus for Large-scale Language Models

10/04/2021
by Mehdi Bahrami, et al.

A large-scale collection of both semantic and natural language resources is essential to leverage active Software Engineering research areas such as code reuse and code comprehensibility. Existing machine learning models ingest data from open source repositories (such as GitHub projects) and forum discussions (such as Stack Overflow). In this showcase, we instead took a step back to assemble a corpus titled PyTorrent that contains 218,814 Python package libraries from the PyPI and Anaconda environments. We did so because earlier studies have shown that much of that code is redundant, whereas Python packages from these environments are of higher quality and well documented. PyTorrent enables users (such as data scientists and students) to build off-the-shelf machine learning models directly, without spending months of effort on large infrastructure. The dataset, schema, and a pretrained language model are available at: https://github.com/fla-sil/PyTorrent
