
Small customizable multiprocessing multi-proxy crawler.

Project description


A highly customizable crawler that uses multiprocessing and proxies to download one or more websites, following given filter, parse, and save functions.

REMEMBER THAT DDOS IS ILLEGAL. DO NOT USE THIS SOFTWARE FOR ILLEGAL PURPOSES.

Installing TinyCrawler

pip install tinycrawler

Preview (Test case)

This is a preview of the console when running test_base.py.


Usage example

from tinycrawler import TinyCrawler
from bs4 import BeautifulSoup


def url_validator(url: str) -> bool:
    """Return whether the page at the given url should be downloaded."""
    return "http://www.example.com/my/path" in url

def file_parser(response: 'Response', logger: 'Log') -> str:
    """Parse a downloaded page into the document to be saved.

        response: 'Response', the response object from requests.models.Response
        logger: 'Log', a logger for eventual errors or infos

        Return None if the page should not be saved.
    """

    soup = BeautifulSoup(response.text, 'lxml')

    example = soup.find("div", {"class": "example"})
    if example is None:
        return None

    return example.get_text()


my_crawler = TinyCrawler(
    use_cli=True,  # True to use the command-line interface, False otherwise
    directory="my_path_for_website"  # Directory where the website is saved
)

my_crawler.load_proxies("path/to/my/proxies.json")
my_crawler.set_url_validator(url_validator)
my_crawler.set_file_parser(file_parser)

my_crawler.run("http://www.example.com/my/path/index.html")
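Since the callbacks are plain functions, it can help to sanity-check them in isolation before wiring them into the crawler. A minimal sketch (plain Python, no crawler or network needed) exercising the url_validator logic from the example above:

```python
def url_validator(url: str) -> bool:
    """Return whether the page at the given url should be downloaded."""
    return "http://www.example.com/my/path" in url


# URLs under the target path pass; everything else is skipped.
assert url_validator("http://www.example.com/my/path/index.html")
assert not url_validator("http://www.example.com/other/page.html")
```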

Proxies are expected to be in the following format:

[
  {
    "ip": "89.236.17.108",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  },
  {
    "ip": "128.199.141.151",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  }
]

License

The software is released under the MIT license.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tinycrawler-1.5.0.tar.gz (13.6 kB view details)

Uploaded Source

File details

Details for the file tinycrawler-1.5.0.tar.gz.

File metadata

  • Download URL: tinycrawler-1.5.0.tar.gz
  • Upload date:
  • Size: 13.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: Python-urllib/3.6

File hashes

Hashes for tinycrawler-1.5.0.tar.gz

  • SHA256: c102c651798ab85527aec0edafafbadf47f8e35aeccc5631f6bb923b61e37978
  • MD5: 56802ae50a24343b0cb0cf3b5c798f75
  • BLAKE2b-256: b50e191d1716f46ab1fd12f637a3ba712424f4f945e00cb9de3a6482420ef74f

