tinycrawler

Small customizable multiprocessing multi-proxy crawler.

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Intended Audience
- Information Technology
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Information Analysis

Project description

A highly customizable crawler that uses multiprocessing and proxies to download one or more websites, following user-supplied filter, search, and save functions.

REMEMBER THAT DDOS IS ILLEGAL. DO NOT USE THIS SOFTWARE FOR ILLEGAL PURPOSES.

Installing TinyCrawler

pip install tinycrawler

Preview (Test case)

This is a preview of the console output when running test_base.py.

[Image: console preview]

Usage example

from tinycrawler import TinyCrawler
from bs4 import BeautifulSoup


def url_validator(url: str) -> bool:
    """Return whether the page at the given url should be downloaded."""
    return "http://www.example.com/my/path" in url

def file_parser(response: 'Response', logger: 'Log') -> str:
    """Parse a downloaded page into the document to be saved.

    response: 'Response', the requests.models.Response object for the page
    logger: 'Log', a logger for reporting errors or information

    Return None if the page should not be saved.
    """

    soup = BeautifulSoup(response.text, 'lxml')

    example = soup.find("div", {"class": "example"})
    if example is None:
        return None

    return example.get_text()


my_crawler = TinyCrawler(
    use_cli=True,  # True to use the command line interface, False otherwise
    directory="my_path_for_website"  # Directory where the downloaded website is saved
)

my_crawler.load_proxies("path/to/my/proxies.json")
my_crawler.set_url_validator(url_validator)
my_crawler.set_file_parser(file_parser)

my_crawler.run("http://www.example.com/my/path/index.html")
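The substring test in url_validator also accepts any URL that merely contains the prefix somewhere, for example inside a query string. A stricter variant could compare the parsed host and path instead. The sketch below uses only the standard library; make_prefix_validator is a hypothetical helper, not part of tinycrawler, and unlike the original check it accepts both http and https.

```python
from urllib.parse import urlparse


def make_prefix_validator(prefix: str):
    """Build a validator (hypothetical helper) for URLs on the same host and under the same path as prefix."""
    target = urlparse(prefix)
    base_path = target.path.rstrip("/")

    def validator(url: str) -> bool:
        parsed = urlparse(url)
        # Assumption: only web pages are of interest, over http or https.
        if parsed.scheme not in ("http", "https"):
            return False
        if parsed.netloc.lower() != target.netloc.lower():
            return False
        # Accept the prefix itself or anything below it in the path hierarchy.
        return parsed.path == base_path or parsed.path.startswith(base_path + "/")

    return validator


url_validator = make_prefix_validator("http://www.example.com/my/path")
```

The returned function takes a URL string and returns a bool, so it has the same shape as the url_validator passed to set_url_validator in the usage example.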

Proxies are expected to be in the following format:

[
  {
    "ip": "89.236.17.108",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  },
  {
    "ip": "128.199.141.151",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  }
]
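A malformed proxies file is easier to diagnose before it is handed to load_proxies. The sketch below checks a parsed list against the shape shown above: an "ip" string, an integer "port", and a non-empty "type" list of strings. The function name check_proxies and the exact rules are assumptions drawn from the sample, not tinycrawler's own validation.

```python
import json


def check_proxies(entries):
    """Return a list of problems found in a parsed proxies list (empty if none)."""
    if not isinstance(entries, list):
        return ["top-level value must be a list"]
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: must be an object")
            continue
        if not isinstance(entry.get("ip"), str):
            problems.append(f"entry {index}: 'ip' must be a string")
        port = entry.get("port")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(port, bool) or not isinstance(port, int) or not 0 < port < 65536:
            problems.append(f"entry {index}: 'port' must be an integer from 1 to 65535")
        types = entry.get("type")
        if not isinstance(types, list) or not types or not all(isinstance(t, str) for t in types):
            problems.append(f"entry {index}: 'type' must be a non-empty list of strings")
    return problems


sample = json.loads('[{"ip": "89.236.17.108", "port": 3128, "type": ["https", "http"]}]')
print(check_proxies(sample))
```

For a file on disk, the same function could be applied to the result of json.load before calling load_proxies with the file path.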

License

The software is released under the MIT license.

Release history

- 1.7.5 (Nov 22, 2018)
- 1.7.1 (Nov 16, 2018)
- 1.7.0 (Nov 16, 2018)
- 1.6.1 (Nov 5, 2018)
- 1.6.0 (Nov 4, 2018), this version
- 1.5.0 (Oct 4, 2018)
- 1.2.0 (Jun 17, 2018)
- 1.0.1 (Jun 16, 2018)
- 1.0.0 (Jun 16, 2018)

Download files

Source Distribution

tinycrawler-1.6.0.tar.gz (13.9 kB)

Uploaded Nov 4, 2018 Source

Hashes for tinycrawler-1.6.0.tar.gz

Algorithm	Hash digest
SHA256	`1591900b5ed537cf21b964f4d4b27eacec3445aa3fe031907733a3e0e90ff9e8`
MD5	`3939029f96a0d4c63f3e0c429c5d213a`
BLAKE2b-256	`997edf0cc8146e4b7127ccf57ea3056d50301aba78bb2dbabb036e7eff0f8050`
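The digests above can be used to check a downloaded archive before installing it. A minimal sketch with the standard hashlib module follows; the helper names sha256_of and matches_expected are illustrative, and the local path passed in is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed above for tinycrawler-1.6.0.tar.gz.
EXPECTED_SHA256 = "1591900b5ed537cf21b964f4d4b27eacec3445aa3fe031907733a3e0e90ff9e8"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

Calling matches_expected("tinycrawler-1.6.0.tar.gz") on the downloaded file should return True only if the archive is byte-for-byte the one the digest was computed from.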