
Return2Corp input set generator.


Input Set Generator

Welcome to Return2Corp's input set generator!

Installation

To install (pre-official release), download the project's requirements.txt into a folder and navigate to that folder in the terminal. Then:

pip install virtualenv
virtualenv venv
source venv/bin/activate  # or activate.fish, if you're using fish
pip install -r requirements.txt
pip install --index-url https://test.pypi.org/simple/ r2c-isg

Then, to run:

r2c-isg

Quick Start

Try the following command sequences:

  • Load the top 5,000 pypi projects by downloads in the last 365 days, sort by descending number of downloads, trim to the top 100 most downloaded, download project metadata and all versions, and generate an input set json.

      load pypi top5kyear
      sort "desc download_count"
      trim 100
      get -mv all
      set-meta -n test -v 1.0
      export inputset.json
    

  • Load all npm projects, sample 100, download the latest versions, and generate an input set json.

      load npm allbydependents
      sample 100
      get -v latest
      set-meta -n test -v 1.0
      export inputset.json
    

  • Load a csv containing github urls and commit hashes, get project metadata and the latest versions, generate an input set json of type GitRepoCommit, remove all versions, and generate an input set json of type GitRepo.

      load --columns "url v.commit" github list_of_github_urls_and_commits.csv
      get -mv latest
      set-meta -n test -v 1.0
      export inputset_1.json
      trim -v 0
      export inputset_2.json
    

Shell Usage

Input/Output

  • load (OPTIONS) [noreg | github | npm | pypi] [WEBLIST_NAME | FILEPATH.csv]
    Generates a dataset from a weblist or a local file. The following weblists are available:

    • Github: top1kstarred, top1kforked; the top 1,000 most starred or forked repos
    • NPM: allbydependents; all packages, sorted by number of dependents, most to fewest (caution: 1M+ projects... handle with care)
    • Pypi: top5kmonth and top5kyear; the top 5,000 most downloaded projects in the last 30/365 days

    Options:
    -c --columns "string of col names": A space-separated list of column names in a csv. Overrides the default columns (name and version), as well as any headers listed in the file (header rows in files begin with a '!'). The CSV reader recognizes the following column keywords: name, url, org, v.commit, v.version. All other columns are read in as project or version attributes.
    Example usage: --columns "name url downloads v.commit v.date".

  • backup (FILEPATH.p)
    Backs up the dataset to a pickle file (defaults to ./dataset_name.p).

  • restore FILEPATH.p
    Restores a dataset from a pickle file.

  • import [noreg | github | npm | pypi] FILEPATH.json
    Builds a dataset from an R2C input set.

  • export (FILEPATH.json)
    Exports a dataset to an R2C input set (defaults to ./dataset_name.json).
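As an illustration of the `--columns` behavior described under `load` above, here is a minimal sketch of how such a column spec could be split into project- and version-level attributes (hypothetical code, not the tool's actual parser):

```python
# Hypothetical sketch of "--columns" spec parsing; not r2c-isg's actual code.
# A "v." prefix marks a version-level column (e.g. v.commit, v.date);
# everything else is treated as a project-level column.

def classify_columns(spec):
    """Split a space-separated column spec into (project, version) columns."""
    project_cols, version_cols = [], []
    for col in spec.split():
        if col.startswith("v."):
            version_cols.append(col[2:])  # strip the "v." prefix
        else:
            project_cols.append(col)
    return project_cols, version_cols

print(classify_columns("name url downloads v.commit v.date"))
# (['name', 'url', 'downloads'], ['commit', 'date'])
```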

Data Acquisition

  • get (OPTIONS)
    Downloads project and version metadata from Github/NPM/Pypi.

    Options:
    -m --metadata: Gets metadata for all projects.
    -v --versions [all | latest]: Gets all historical versions, or only the latest, for each project.

Transformation

  • trim (OPTIONS) N
    Trims the dataset to n projects or n versions per project.

    Options:
    -v --versions: Binary flag; trims on versions instead of projects.

  • sample (OPTIONS) N
    Samples n projects or n versions per project.

    Options:
    -v --versions: Binary flag; sample versions instead of projects.

  • sort "[asc, desc] attributes [...]"
    Sorts the projects and versions based on a space-separated string of keywords. Valid keywords are:

    • Any project attributes
    • Any version attributes (prepend "v." to the attribute name)
    • Any uuids (prepend "uuids." to the uuid name)
    • Any meta values (prepend "meta." to the meta name)
    • The words "asc" and "desc"

    All values are sorted in ascending order by default. The first keyword in the string is the primary sort key, the next the secondary, and so on.

    Example: The string "uuids.name meta.url downloads desc v.version_str v.date" would sort the dataset by ascending project name, url, and download count; and descending version string and date (assuming those keys exist).
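The behavior above can be sketched as a stable multi-key sort in which "asc"/"desc" set the direction for the keywords that follow (an illustrative sketch, not the tool's actual implementation):

```python
# Hypothetical sketch of keyword-driven sorting; not r2c-isg's actual code.
def sort_projects(projects, spec):
    descending = False  # ascending by default
    keys = []           # (attribute, descending?) pairs, primary key first
    for word in spec.split():
        if word in ("asc", "desc"):
            descending = (word == "desc")
        else:
            keys.append((word, descending))
    # Apply the sorts in reverse order; Python's stable sort then makes the
    # first keyword the primary key, the second the tie-breaker, and so on.
    result = list(projects)
    for attr, desc in reversed(keys):
        result.sort(key=lambda p: p[attr], reverse=desc)
    return result

data = [{"name": "b", "downloads": 10},
        {"name": "a", "downloads": 10},
        {"name": "c", "downloads": 99}]
print([p["name"] for p in sort_projects(data, "desc downloads name")])
# ['c', 'b', 'a']
```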

Settings

  • set-meta (OPTIONS)
    Sets the dataset's metadata.

    Options:
    -n --name NAME: Input set name. Must be set before the dataset can be exported.
    -v --version VERSION: Input set version. Must be set before the dataset can be exported.
    -d --description DESCRIPTION: Description string.
    -r --readme README: Markdown-formatted readme string.
    -a --author AUTHOR: Author name; defaults to git user.name.
    -e --email EMAIL: Author email; defaults to git user.email.

  • set-api (OPTIONS)
    --cache_dir CACHE_DIR: The path to the requests cache; defaults to ./.requests_cache.
    --cache_timeout DAYS: The number of days before a cached request goes stale (defaults to 7).
    --nocache: Binary flag; disables request caching for this dataset.
    --github_pat GITHUB_PAT: A github personal access token, used to increase the max allowed hourly request rate from 60/hr to 5,000/hr. For instructions on how to obtain a token, see: https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line.
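As a rough illustration of the cache-timeout semantics above (a hypothetical sketch, not the tool's actual caching code), a cached request would be considered stale once it is older than the configured number of days:

```python
import time

# Hypothetical staleness check; not r2c-isg's actual caching code.
def is_stale(cached_at, timeout_days, now=None):
    """Return True if a cache entry written at `cached_at` (epoch seconds)
    is older than `timeout_days` days."""
    now = time.time() if now is None else now
    return (now - cached_at) > timeout_days * 86400  # 86400 seconds per day

print(is_stale(cached_at=0, timeout_days=7, now=8 * 86400))  # True
print(is_stale(cached_at=0, timeout_days=7, now=6 * 86400))  # False
```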

Visualization

  • show
    Converts the dataset to a json file and loads it in the system's native json viewer.

Python Project

You can also import the package into your own project. Just import the Dataset structure, initialize it, and you're good to go!

from r2c_isg.structures import Dataset

ds = Dataset.import_inputset(
    'file.csv',                     # or a weblist name, e.g. 'top5kyear'
    registry='github',              # or 'npm' or 'pypi'
    cache_dir='path/to/cache/dir',  # optional; overrides ./.requests_cache
    cache_timeout=7,                # optional; days before the cache goes stale (default: 1 week)
    nocache=True,                   # optional; disables caching
    github_pat='your_github_pat'    # optional; personal access token for the github api
)

ds.get_projects_meta()

ds.get_project_versions(historical='all')  # or historical='latest'

ds.trim(
    n,
    on_versions=True	# optional; defaults to False
)

ds.sample(
    n,
    on_versions=True	# optional; defaults to False
)

ds.sort('string of sort parameters')

ds.update(**{'name': 'your_dataset_name', 'version': 'your_dataset_version'})

ds.export_inputset('your_inputset.json')

Download files

Download the file for your platform.

Source Distribution

r2c-inputset-generator-0.2.0.tar.gz (26.1 kB)


Built Distribution

r2c_inputset_generator-0.2.0-py3-none-any.whl (38.8 kB)


File details

Details for the file r2c-inputset-generator-0.2.0.tar.gz.

File metadata

  • Download URL: r2c-inputset-generator-0.2.0.tar.gz
  • Size: 26.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.19.1 CPython/3.6.5

File hashes

Hashes for r2c-inputset-generator-0.2.0.tar.gz:

  • SHA256: ba2e783aa62d57cc38a6f5d1b67db49c50bc081d58f73a9933cb80e646166c59
  • MD5: 69e7b305d43ea0e9cca87829702f4f03
  • BLAKE2b-256: c22c20357ac422fdf9eb6c2e28f80c5c8395ecd52608c9cd3254c14b24f90483


File details

Details for the file r2c_inputset_generator-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: r2c_inputset_generator-0.2.0-py3-none-any.whl
  • Size: 38.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.19.1 CPython/3.6.5

File hashes

Hashes for r2c_inputset_generator-0.2.0-py3-none-any.whl:

  • SHA256: 10e6bfc2094c7f87131b1ee17f284be862025a188b38322eb537488f2393197f
  • MD5: 8d9e17ea5e3a7b70c2c22e4daf1ce074
  • BLAKE2b-256: fe4554e4b38948f6786069f3be6ac1756f3ced4131b9537727edb5df682aeb04

