
CmonCrawl Banner

CommonCrawl Extractor with great versatility


Unlock the full potential of CommonCrawl data with CmonCrawl, the most versatile extractor that offers unparalleled modularity and ease of use.

Why Choose CmonCrawl?

CmonCrawl stands out from the crowd with its unique features:

  • High Modularity: Easily create custom extractors tailored to your specific needs.
  • Comprehensive Access: Supports all CommonCrawl access methods, including AWS Athena and the CommonCrawl Index API for querying, and S3 and the CommonCrawl API for downloading.
  • Flexible Utility: Accessible via a Command Line Interface (CLI) or as a Software Development Kit (SDK), catering to your preferred workflow.
  • Type Safety: Built with type safety in mind, ensuring that your code is robust and reliable.

Getting Started

Installation

Install from PyPI

$ pip install cmoncrawl

Install from source

$ git clone https://github.com/hynky1999/CmonCrawl
$ cd CmonCrawl
$ pip install -r requirements.txt
$ pip install .
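
If the installation succeeded, the cmon command should be available on your PATH; most builds expose the standard help output listing the download and extract subcommands used below (exact options may vary by version):

$ cmon --help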

Usage Guide

Step 1: Extractor preparation

Begin by preparing your custom extractor. Obtain sample HTML files from the CommonCrawl dataset using the command:

$ cmon download --match_type=domain --limit=100 html_output example.com html

This will download the first 100 HTML files from example.com and save them in html_output.

Step 2: Extractor creation

Create a new Python file for your extractor, such as my_extractor.py, and place it in the extractors directory. Implement your extraction logic as shown below:

from bs4 import BeautifulSoup
from cmoncrawl.common.types import PipeMetadata
from cmoncrawl.processor.pipeline.extractor import BaseExtractor

class MyExtractor(BaseExtractor):
    def __init__(self):
        # You can force a specific encoding if you know it
        super().__init__(encoding=None)

    def extract_soup(self, soup: BeautifulSoup, metadata: PipeMetadata):
        # Here you can extract the data you want from the soup
        # and return a dict with the data you want to save
        body = soup.select_one("body")
        if body is None:
            return None
        return {
            "body": body.get_text()
        }

    # You can also override the following methods to drop the files you don't want to extract
    # Return True to keep the file, False to drop it
    def filter_raw(self, response: str, metadata: PipeMetadata) -> bool:
        return True

    def filter_soup(self, soup: BeautifulSoup, metadata: PipeMetadata) -> bool:
        return True

# Make sure to instantiate your extractor into the `extractor` variable;
# the name must match so that the framework can find it
extractor = MyExtractor()
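
As an illustration of the filter hooks, here is a hypothetical, slightly more selective extractor. It uses only the BaseExtractor API shown above plus standard BeautifulSoup calls: it keeps only pages that contain an article element and stores the title alongside the text.

from bs4 import BeautifulSoup
from cmoncrawl.common.types import PipeMetadata
from cmoncrawl.processor.pipeline.extractor import BaseExtractor

class ArticleExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(encoding=None)

    def filter_soup(self, soup: BeautifulSoup, metadata: PipeMetadata) -> bool:
        # Drop pages that do not contain an <article> element
        return soup.find("article") is not None

    def extract_soup(self, soup: BeautifulSoup, metadata: PipeMetadata):
        article = soup.select_one("article")
        if article is None:
            return None
        title = soup.select_one("title")
        return {
            "title": title.get_text(strip=True) if title else None,
            "text": article.get_text(),
        }

# The framework again looks for a module-level variable named `extractor`
extractor = ArticleExtractor()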

Step 3: Config creation

Set up a configuration file, config.json, to specify the behavior of your extractor(s):

{
    "extractors_path": "./extractors",
    "routes": [
        {
            # Define which URLs the extractor should match, using regexes
            "regexes": [".*"],
            "extractors": [{
                "name": "my_extractor",
                # You can use "since" and "to" to choose the extractor
                # based on the date of the crawl
                # You can omit either of them
                "since": "2009-01-01",
                "to": "2025-01-01"
            }]
        },
        # More routes here
    ]
}
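
The # comments above are only for illustration; standard JSON does not allow comments, so a comment-free version of the same config (with example dates) would look like this:

{
    "extractors_path": "./extractors",
    "routes": [
        {
            "regexes": [".*"],
            "extractors": [{
                "name": "my_extractor",
                "since": "2009-01-01",
                "to": "2025-01-01"
            }]
        }
    ]
}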

Step 4: Run the extractor

Test your extractor with the following command:

$ cmon extract config.json extracted_output html_output/*.html html

Step 5: Full crawl and extraction

After testing, start the full crawl and extraction process:

1. Retrieve a list of records to extract.

$ cmon download --match_type=domain --limit=100 dr_output example.com record

This will download the first 100 records from example.com and save them in dr_output. By default, it saves 100,000 records per file; you can change this with the --max_crawls_per_file option.

2. Process the records using your custom extractor.

$ cmon extract --n_proc=4 config.json extracted_output dr_output/*.jsonl record

Note that you can use the --n_proc option to specify the number of processes to use for the extraction. Multiprocessing is done at the file level, so if you have only one file, it will not be used.
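
Assuming the extracted records end up as JSON Lines files under extracted_output (the exact file layout and extension may differ between versions), a short script like the following can be used to inspect the results; adjust the glob pattern to whatever your run actually produces:

import json
from pathlib import Path

# Hypothetical inspection helper: walks extracted_output and prints the
# keys of each extracted record (e.g. the "body" key from MyExtractor above).
for path in Path("extracted_output").rglob("*.jsonl"):
    with path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            print(path.name, list(record.keys()))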

Advanced Usage

CmonCrawl was designed with flexibility in mind, allowing you to tailor the framework to your needs. For distributed extraction and more advanced scenarios, refer to our documentation and the CZE-NEC project.

Examples and Support

For practical examples and further assistance, visit our examples directory.

Contribute

Join our community of contributors on GitHub. Your contributions are welcome!

License

CmonCrawl is open-source software licensed under the MIT license.
