
Extract the main article content (and optionally comments) from a web page

Project description

Dragnet

Dragnet isn’t interested in the shiny chrome or boilerplate dressing of a web page. It’s interested in… ‘just the facts.’ The machine learning models in Dragnet extract the main article content, and optionally the user-generated comments, from a web page. They provide state-of-the-art performance on a variety of test benchmarks.

For more information on our approach check out:

GETTING STARTED

Depending on your use case, we provide two separate models: one extracts just the main article content, and the other extracts the content together with any user-generated comments. Each model implements an analyze method that takes an HTML string and returns the extracted content as a string.

import requests
from dragnet import content_extractor, content_comments_extractor

# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)

# get main article without comments
content = content_extractor.analyze(r.content)

# get article and comments
content_comments = content_comments_extractor.analyze(r.content)
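
Because both models expose the same analyze method, calling code can be written once and handed either one. The following is a minimal sketch of that idea, not part of Dragnet itself: the helper extract_or_none and the stand-in class EchoExtractor are hypothetical names introduced for illustration, and the sketch assumes nothing about what analyze returns for a page with no content, so it treats both None and an empty string as "nothing extracted".

```python
class EchoExtractor(object):
    """Hypothetical stand-in exposing the same analyze(html) shape as the models above."""

    def analyze(self, html):
        # A real model would strip boilerplate; this stub returns its input unchanged.
        return html


def extract_or_none(extractor, html):
    """Return the extracted text with surrounding whitespace removed, or None if empty."""
    # Skip the model entirely when there is no markup to analyze.
    if not html or not html.strip():
        return None
    text = extractor.analyze(html)
    # Assumption: treat a missing or whitespace-only result as "nothing extracted".
    text = text.strip() if text else ''
    return text or None
```

With Dragnet installed, content_extractor or content_comments_extractor could be passed in place of the stub, for example extract_or_none(content_extractor, r.content).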

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dragnet-1.0.0.tar.gz (922.4 kB)

Uploaded Source

Built Distribution

dragnet-1.0.0-cp27-none-macosx_10_10_intel.whl (1.1 MB)

Uploaded CPython 2.7 macOS 10.10+ intel

File details

Details for the file dragnet-1.0.0.tar.gz.

File metadata

  • Download URL: dragnet-1.0.0.tar.gz
  • Upload date:
  • Size: 922.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for dragnet-1.0.0.tar.gz
Algorithm    Hash digest
SHA256       b9907843d1e7aea239907c748ab9313cbe3974dbc467f955e5b3c1f6270f5c55
MD5          1a71b6ad3ad87d98488e0dc4a2d848f6
BLAKE2b-256  7cba81ca8ac1d42248495a765de3f26fb5a3b1dc5d0f66eab739bd9e598a52bc

See more details on using hashes here.
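
A downloaded file can be checked against the SHA256 digest listed above before it is installed. The following is a minimal sketch using only the standard library's hashlib module. The function names sha256_of_file and matches_sha256 are hypothetical, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals expected_hex (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the archive was saved in the current directory):
# matches_sha256('dragnet-1.0.0.tar.gz',
#                'b9907843d1e7aea239907c748ab9313cbe3974dbc467f955e5b3c1f6270f5c55')
```

A result of False would suggest the file was corrupted or altered in transit and should be downloaded again.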

File details

Details for the file dragnet-1.0.0-cp27-none-macosx_10_10_intel.whl.

File metadata

File hashes

Hashes for dragnet-1.0.0-cp27-none-macosx_10_10_intel.whl
Algorithm    Hash digest
SHA256       3f8e47495322ca02c1540a29dbdf61add117c26044261501484f22d9b5994559
MD5          84fb8d211155099f6406c2ada3f63578
BLAKE2b-256  34688434e4cfdec13447b4455ce20099f63f4f74eb47f8bc9325c42ea6bf67ab

See more details on using hashes here.
