

CommonCrawl Extractor with great versatility

Usage

Extractor preparation

Start by preparing your custom extractor. To create one, you need example HTML files of the kind you want to extract.

You can use the following command to get html files from the CommonCrawl dataset:

$ cmondownload --limit=100 --output_type=html yoursite.com output_dir

This will download the first 100 HTML files from yoursite.com and save them in output_dir.

Extractor creation

Once you have the files to extract, you can create your extractor. To do so, create a new Python file, e.g. my_extractor.py, in the extractors directory and add the following code:

from bs4 import BeautifulSoup

from cmoncrawl.common.types import PipeMetadata  # import path may differ in older versions
from cmoncrawl.processor.pipeline.extractor import BaseExtractor


class MyExtractor(BaseExtractor):
    def __init__(self):
        # You can force a specific encoding if you know it
        super().__init__(encoding=None)

    def extract_soup(self, soup: BeautifulSoup, metadata: PipeMetadata):
        # Extract the data you want from the soup
        # and return a dict with the data you want to save
        return {}

    # You can also override the following methods to drop the files you don't want to extract.
    # Return True to keep the file, False to drop it.
    def filter_raw(self, response: str, metadata: PipeMetadata) -> bool:
        return True

    def filter_soup(self, soup: BeautifulSoup, metadata: PipeMetadata) -> bool:
        return True

# Make sure to assign an instance of your extractor to a variable named extractor.
# The name must match so that the framework can find it.
extractor = MyExtractor()
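
As a concrete illustration, here is a minimal sketch of an extractor that keeps only pages with a <title> tag and saves the title and paragraph text. The output keys and the use of metadata.domain_record.url are illustrative assumptions, not a fixed schema:

from bs4 import BeautifulSoup

from cmoncrawl.common.types import PipeMetadata
from cmoncrawl.processor.pipeline.extractor import BaseExtractor


class TitleExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(encoding=None)

    def extract_soup(self, soup: BeautifulSoup, metadata: PipeMetadata):
        # The keys of this dict define the fields of the saved output
        title = soup.title.get_text(strip=True) if soup.title else ""
        paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
        return {
            "title": title,
            "text": "\n".join(paragraphs),
            # domain_record.url is assumed here; check PipeMetadata in your version
            "url": metadata.domain_record.url,
        }

    def filter_soup(self, soup: BeautifulSoup, metadata: PipeMetadata) -> bool:
        # Drop pages that have no <title> at all
        return soup.title is not None


extractor = TitleExtractor()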

Config creation

Once you have your extractor, you need to create a config file to run it. For our example:

{
    "extractors_path": "./extractors",
    "routes": [
        {
            # Define which URLs match the extractor, using regexes
            "regexes": [".*"],
            "extractors": [{
                "name": "my_extractor",
                # You can use since and to to pick the extractor based on the date of the crawl
                # You can omit either of them
                "since": "2009-01-01T00:00:00+00:00",
                "to": "2009-12-31T00:00:00+00:00"
            }]
        },
        # More routes here
    ]
}
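
The snippet above uses comments for readability, but strict JSON does not allow them, so the file you actually save (e.g. config.json; the name is an assumption carried into the commands below) should be plain JSON:

{
    "extractors_path": "./extractors",
    "routes": [
        {
            "regexes": [".*"],
            "extractors": [{
                "name": "my_extractor",
                "since": "2009-01-01T00:00:00+00:00",
                "to": "2009-12-31T00:00:00+00:00"
            }]
        }
    ]
}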

Run the extractor

To test the extraction, you can use the following command:

$ cmonextract --mode=html html_file1 html_file2 ... html_fileN extraction_output_dir config_file
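
For example, assuming the files from the earlier download step are in output_dir with an .html suffix and the config is saved as config.json (both names are assumptions from the steps above, and the shell expands the glob into the list of files):

$ cmonextract --mode=html output_dir/*.html extraction_output_dir config.json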

Get records from dataset

Once your extractor is tested, you can start crawling. This is done in two steps:

1. Get the list of records to extract

To do this, you can use the following command:

$ cmondownload --limit=100000 --output_type=record yoursite.com output_dir

This will download the first 100,000 records from yoursite.com and save them in output_dir. By default it saves 100,000 records per file; you can change this with the --max_crawl_per_file option.
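
For example, to cap each output file at 10,000 records instead of the default (flag combination assumed from the options described above):

$ cmondownload --limit=100000 --output_type=record --max_crawl_per_file=10000 yoursite.com output_dir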

2. Extract the records

Once you have the records, you can use the following command to extract them:

$ cmonextract --nproc=4 --mode=record record_file1 record_file2 ... record_fileN extraction_output_dir config_file

Note that you can use the --nproc option to specify the number of processes to use for extraction. Multiprocessing is done at the file level, so with a single record file it has no effect.

