Scrapinghub Command Line Client

shub is the Scrapinghub command line client. It allows you to deploy projects or dependencies, schedule spiders, and retrieve scraped data or logs without leaving the command line.

Requirements

  • Python 2.7

Installation

If you have pip installed on your system, you can install shub from the Python Package Index:

pip install shub
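
pip behaves as usual here; for example, to pin the release documented on this page or to upgrade an existing installation:

pip install shub==2.1.1
pip install --upgrade shub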

We also supply stand-alone binaries. You can find them in our latest GitHub release.

Usage

To see all available commands, run:

shub

For help on a specific command, run it with the --help flag, e.g.:

shub schedule --help

Quickstart

Start by logging in:

shub login

This will save your Scrapinghub API key to a file in your home directory (~/.scrapinghub.yml) and is necessary for access to projects associated with your Scrapinghub account.
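
After a successful login, the global configuration file contains an apikeys entry in the format described under Configuration below; with the placeholder key used throughout this document, it would look like:

# ~/.scrapinghub.yml
apikeys:
  default: 0bbf4f0f691e0d9378ae00ca7bcf7f0c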

Next, navigate to a Scrapy project that you wish to upload to Scrapinghub. You can deploy it to Scrapy Cloud via:

shub deploy

On the first call, this will guide you through a wizard to save your project ID into a YAML file named scrapinghub.yml, living next to your scrapy.cfg. For advanced uses, you can even define aliases for multiple project IDs there:

# project_directory/scrapinghub.yml
projects:
  default: 12345
  prod: 33333

From anywhere within the project directory tree, you can now deploy via shub deploy (to project 12345) or shub deploy prod (to project 33333).

You can also directly supply the Scrapinghub project ID, e.g.:

shub deploy 12345

Next, schedule one of your spiders to run on Scrapy Cloud:

shub schedule myspider

(or shub schedule prod/myspider to target a specific project). While the spider is running, you can watch its log or the scraped items by supplying the job ID:

shub log -f 2/34
shub items -f 2/34
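
The 2/34 above is a job ID relative to the default project. Judging from the shub requests example further down, the data commands should also accept the full project/spider/job form, e.g.:

shub log -f 12345/2/34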

Configuration

shub reads its configuration from two YAML files: a global one in your home directory (~/.scrapinghub.yml) and a local one (scrapinghub.yml). The local one is typically specific to a Scrapy project and should live in the same folder as scrapy.cfg; if it is not found there, shub searches all parent directories until it finds a scrapinghub.yml. Where the two configurations overlap, the local file always takes precedence over the global one.

Both files have the same format:

projects:
  default: 12345
  prod: 33333
apikeys:  # populated manually or via shub login
  default: 0bbf4f0f691e0d9378ae00ca7bcf7f0c
version: GIT  # Use git branch/commit as version when deploying. Other
              # possible values are AUTO (default) or HG

projects is a mapping from human-friendly names to project IDs. This allows you to run shub deploy prod instead of shub deploy 33333. version is a string to be used as project version information when deploying to Scrapy Cloud. By default, shub will auto-detect your version control system and use its branch/commit ID as version.

Typically, you will configure only your API keys and the version specifier in the global ~/.scrapinghub.yml, and keep project configuration in the local project_dir/scrapinghub.yml. However, sometimes it can be convenient to also configure projects in the global configuration file. This allows you to use the human-friendly project identifiers outside of the project directory.
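
For example, a global ~/.scrapinghub.yml that also carries the project aliases from the examples above might look like this:

# ~/.scrapinghub.yml
projects:
  default: 12345
  prod: 33333
apikeys:
  default: 0bbf4f0f691e0d9378ae00ca7bcf7f0c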

Deploying projects

To deploy a Scrapy project to Scrapy Cloud, navigate into the project’s folder and run:

shub deploy [TARGET]

where [TARGET] is either a project name defined in scrapinghub.yml or a numerical Scrapinghub project ID. If you have configured a default target in your scrapinghub.yml, you can leave out the parameter completely:

$ shub deploy
Packing version 3af023e-master
Deploying to Scrapy Cloud project "12345"
{"status": "ok", "project": 12345, "version": "3af023e-master", "spiders": 1}
Run your spiders at: https://dash.scrapinghub.com/p/12345/
$ shub deploy prod --version 1.0.0
Packing version 1.0.0
Deploying to Scrapy Cloud project "33333"
{"status": "ok", "project": 33333, "version": "1.0.0", "spiders": 1}
Run your spiders at: https://dash.scrapinghub.com/p/33333/

Run shub deploy -l to see a list of all available targets. You can also deploy your project from a Python egg, or build one without deploying:

$ shub deploy --egg egg_name --version 1.0.0
Using egg: egg_name
Deploying to Scrapy Cloud project "12345"
{"status": "ok", "project": 12345, "version": "1.0.0", "spiders": 1}
Run your spiders at: https://dash.scrapinghub.com/p/12345/
$ shub deploy --build-egg egg_name
Writing egg to egg_name
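
The file written by --build-egg is an ordinary Python egg, so you can deploy it later with the --egg option shown above, e.g.:

shub deploy --egg egg_name --version 1.0.0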

Deploying dependencies

Sometimes your project will depend on third-party libraries that are not available on Scrapy Cloud. You can easily upload these via shub deploy-egg by supplying a repository URL:

$ shub deploy-egg --from-url https://github.com/scrapinghub/dateparser.git
Cloning the repository to a tmp folder...
Building egg in: /tmp/egg-tmp-clone
Deploying dependency to Scrapy Cloud project "12345"
{"status": "ok", "egg": {"version": "v0.2.1-master", "name": "dateparser"}}
Deployed eggs list at: https://dash.scrapinghub.com/p/12345/eggs

Or, if the repository uses git, even a specific branch:

$ shub deploy-egg --from-url https://github.com/scrapinghub/dateparser.git --git-branch py3-port
Cloning the repository to a tmp folder...
py3-port branch was checked out
Building egg in: /tmp/egg-tmp-clone
Deploying dependency to Scrapy Cloud project "12345"
{"status": "ok", "egg": {"version": "v0.1.0-30-g48841f2-py3-port", "name": "dateparser"}}
Deployed eggs list at: https://dash.scrapinghub.com/p/12345/eggs

Or a package on PyPI:

$ shub deploy-egg --from-pypi loginform
Fetching loginform from pypi
Collecting loginform
  Downloading loginform-1.0.tar.gz
  Saved /tmp/shub/loginform-1.0.tar.gz
Successfully downloaded loginform
Package fetched successfully
Uncompressing: loginform-1.0.tar.gz
Building egg in: /tmp/shub/loginform-1.0
Deploying dependency to Scrapy Cloud project "12345"
{"status": "ok", "egg": {"version": "loginform-1.0", "name": "loginform"}}
Deployed eggs list at: https://dash.scrapinghub.com/p/12345/eggs

For projects that use Kumo, instead of uploading an egg for each of your requirements, you can specify a requirements file to be used:

# project_directory/scrapinghub.yml

projects:
  default: 12345
  prod: 33333

requirements_file: deploy_reqs.txt

Note that this requirements file is an extension of the Kumo stack, and therefore should not contain packages that are already part of the stack, such as scrapy.
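
The referenced file is a plain pip requirements list. A purely hypothetical deploy_reqs.txt, reusing packages mentioned elsewhere in this document, could look like:

# project_directory/deploy_reqs.txt (hypothetical example)
dateparser==0.2.1
loginform==1.0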

You can specify the Kumo stack to be used by extending the projects section of your configuration:

# project_directory/scrapinghub.yml

projects:
  default:
    id: 12345
    stack: kumo-stack-portia
  prod: 33333  # will use the original stack

Scheduling jobs and fetching job data

shub allows you to schedule a spider run from the command line:

shub schedule SPIDER

where SPIDER should match the spider’s name. By default, shub will schedule the spider in your default project (as defined in scrapinghub.yml). You may also explicitly specify the project to use:

shub schedule prod/SPIDER

You can supply spider arguments and job-specific settings through the -a and -s options:

$ shub schedule myspider -a ARG1=VALUE -a ARG2=VALUE
Spider myspider scheduled, job ID: 12345/2/15
Watch the log on the command line:
    shub log -f 2/15
or print items as they are being scraped:
    shub items -f 2/15
or watch it running in Scrapinghub's web interface:
    https://dash.scrapinghub.com/p/12345/job/2/15
$ shub schedule 33333/myspider -s LOG_LEVEL=DEBUG
Spider myspider scheduled, job ID: 33333/2/15
Watch the log on the command line:
    shub log -f 2/15
or print items as they are being scraped:
    shub items -f 2/15
or watch it running in Scrapinghub's web interface:
    https://dash.scrapinghub.com/p/33333/job/2/15
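
The -a and -s options can be combined, and both work together with the project aliases described above; the values below are purely illustrative:

shub schedule prod/myspider -a ARG1=VALUE -s LOG_LEVEL=DEBUG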

shub provides commands to retrieve log entries, scraped items, or requests from jobs. If the job is still running, you can provide the -f (follow) option to receive live updates:

$ shub log -f 2/15
2016-01-02 16:38:35 INFO Log opened.
2016-01-02 16:38:35 INFO [scrapy.log] Scrapy 1.0.3.post6+g2d688cd started
...
# shub will keep updating the log until the job finishes or you hit CTRL+C
$ shub items 2/15
{"name": "Example product", description": "Example description"}
{"name": "Another product", description": "Another description"}
$ shub requests 1/1/1
{"status": 200, "fp": "1ff11f1543809f1dbd714e3501d8f460b92a7a95", "rs": 138137, "_key": "1/1/1/0", "url": "http://blog.scrapinghub.com", "time": 1449834387621, "duration": 238, "method": "GET"}
{"status": 200, "fp": "418a0964a93e139166dbf9b33575f10f31f17a1", "rs": 138137, "_key": "1/1/1/0", "url": "http://blog.scrapinghub.com", "time": 1449834390881, "duration": 163, "method": "GET"}

Advanced use cases

It is possible to configure multiple API keys:

# scrapinghub.yml
projects:
  default: 123
  otheruser: someoneelse/123
apikeys:
  default: 0bbf4f0f691e0d9378ae00ca7bcf7f0c
  someoneelse: a1aeecc4cd52744730b1ea6cd3e8412a

as well as different API endpoints:

# scrapinghub.yml
projects:
  dev: vagrant/3
endpoints:
  vagrant: http://vagrant:3333/api/
apikeys:
  default: 0bbf4f0f691e0d9378ae00ca7bcf7f0c
  vagrant: a1aeecc4cd52744730b1ea6cd3e8412a
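
With a configuration like this, the project alias decides which endpoint and API key are used; assuming the dev alias above, a deploy to the local Vagrant instance would simply be:

shub deploy dev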
