
Package for extracting software repository metadata


# Scraper

Scraper is a tool for scraping and visualizing open source data from various
code hosting platforms, such as GitHub.com, GitHub Enterprise, GitLab.com,
hosted GitLab, and Bitbucket Server.

## Getting Started: Code.gov

[Code.gov](https://code.gov) is a newly launched website of the US Federal
Government that gives the public access to metadata about the government's
custom-developed software. The site requires metadata to function, and this
Python library can help generate it!

To get started, you will need a [GitHub personal access
token](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
to make requests to the GitHub API. Set it in your environment or in your
shell ``rc`` file under the name ``GITHUB_API_TOKEN``:

    $ export GITHUB_API_TOKEN=XYZ

    $ echo "export GITHUB_API_TOKEN=XYZ" >> ~/.bashrc
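
Before running a long scrape, it can help to confirm that the token is actually
visible to Python. The sketch below is not part of Scraper; `read_github_token`
is a hypothetical helper that only checks the ``GITHUB_API_TOKEN`` environment
variable described above.

```python
import os


def read_github_token(env=os.environ):
    """Return the value of GITHUB_API_TOKEN, or raise if it is unset or blank.

    `env` is any mapping; it defaults to the process environment and is
    a parameter only so the helper is easy to test. (Hypothetical helper,
    not part of Scraper.)
    """
    token = env.get("GITHUB_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "GITHUB_API_TOKEN is not set; export it before running scraper"
        )
    return token
```

Calling `read_github_token()` in a fresh shell raises `RuntimeError` if the
``export`` line has not taken effect yet, for example because ``~/.bashrc`` has
not been re-sourced.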

Additionally, to perform the labor-hours estimation, you will need to install
``cloc`` into your environment. This is typically done with a [package
manager](https://github.com/AlDanial/cloc#install-via-package-manager) such as
``npm`` or ``homebrew``.
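
A quick way to verify the ``cloc`` install is to look for it on ``PATH``. This
is a minimal sketch using only the standard library; `find_cloc` is a
hypothetical helper, and it does not reflect how Scraper itself locates or
invokes ``cloc``.

```python
import shutil


def find_cloc(which=shutil.which):
    """Return the full path to the cloc executable, or None if it is not on PATH.

    `which` defaults to shutil.which and is a parameter only so the lookup
    can be replaced in tests. (Hypothetical helper, not part of Scraper.)
    """
    return which("cloc")


def cloc_status(path):
    """Turn the result of find_cloc() into a short human-readable message."""
    if path is None:
        return "cloc not found on PATH; install it with a package manager"
    return f"cloc found at {path}"
```

For example, `print(cloc_status(find_cloc()))` reports whether the executable
is reachable from the current environment.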

Then to generate a ``code.json`` file for your agency, you will need a
``config.json`` file to coordinate the platforms you will connect to and scrape
data from. An example config file can be found in [demo.json](/demo.json). Once
you have your config file, you are ready to install and run the scraper!

    # Install Scraper
    $ pip install -e .

    # Run Scraper with your config file config.json
    $ scraper --config config.json

A full example of the resulting ``code.json`` file can be [found
here](https://gist.github.com/IanLee1521/b7d7c0c2d8c24b10dd04edd5e8cab6c4).

## Config File Options

The configuration file is a JSON file that specifies which repository platforms
to pull projects from, as well as settings that can be used to override
incomplete or inaccurate data returned by scraping.

The basic structure is:

```
{
    # REQUIRED
    "contact_email": "...",     # Used when the contact email cannot be found otherwise

    # OPTIONAL
    "agency": "...",            # Your agency abbreviation here
    "organization": "...",      # The organization within the agency
    "permissions": { ... },     # Object containing default values for usageType and exemptionText

    # Platform configurations, described in more detail below
    "GitHub": [ ... ],
    "GitLab": [ ... ],
    "Bitbucket": [ ... ]
}
```

```
"GitHub": [
    {
        "url": "https://github.com",    # GitHub.com or GitHub Enterprise URL to inventory
        "token": null,                  # Private token for accessing this GitHub instance
        "public_only": true,            # Only inventory public repositories

        "orgs": [ ... ],                # List of organizations to inventory
        "repos": [ ... ],               # List of single repositories to inventory
        "exclude": [ ... ]              # List of organizations / repositories to exclude from inventory
    }
],
```

```
"GitLab": [
    {
        "url": "https://gitlab.com",    # GitLab.com or hosted GitLab instance URL to inventory
        "token": null,                  # Private token for accessing this GitLab instance

        "orgs": [ ... ],                # List of organizations to inventory
        "repos": [ ... ],               # List of single repositories to inventory
        "exclude": [ ... ]              # List of groups / repositories to exclude from inventory
    }
]
```

```
"Bitbucket": [
    {
        "url": "https://bitbucket.internal",    # Base URL for a Bitbucket Server instance
        "username": "",                         # Username to authenticate with
        "password": "",                         # Password to authenticate with

        "exclude": [ ... ]                      # List of projects / repositories to exclude from inventory
    }
]
```
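
The ``#`` annotations in the snippets above are explanatory only and are not
valid JSON, so a real ``config.json`` should omit them. As a minimal sketch,
assuming the file is parsed as strict JSON, the following Python builds a
GitHub-only configuration from the keys documented above and writes it to
disk. `build_config` and `write_config` are hypothetical helpers, and the
email address and organization name in the usage note are placeholders.

```python
import json


def build_config(contact_email, github_orgs, agency=None, organization=None):
    """Assemble a config dict that uses only the keys documented above."""
    config = {"contact_email": contact_email}  # the one required key
    if agency is not None:
        config["agency"] = agency
    if organization is not None:
        config["organization"] = organization
    config["GitHub"] = [
        {
            "url": "https://github.com",
            "token": None,          # serialized as JSON null
            "public_only": True,    # only inventory public repositories
            "orgs": list(github_orgs),
            "repos": [],
            "exclude": [],
        }
    ]
    return config


def write_config(config, path="config.json"):
    """Write the config as comment-free, indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")
```

For instance, `write_config(build_config("opensource@example.gov", ["example-org"]))`
produces a ``config.json`` that can be passed to ``scraper --config config.json``.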

## License

Scraper is released under the MIT license. For more details, see the
[LICENSE](/LICENSE) file.

LLNL-CODE-705597

