Extract the main article content (and optionally comments) from a web page

Dragnet

Dragnet isn’t interested in the shiny chrome or boilerplate dressing of a web page. It’s interested in… ‘just the facts.’ The machine learning models in Dragnet extract the main article content and, optionally, user-generated comments from a web page. They provide state-of-the-art performance on a variety of test benchmarks.

For more information on our approach check out:

Installing

The build requires numpy, lxml and a recent version of Cython, so first make sure they are installed, then install Dragnet:

pip install numpy
pip install --upgrade cython
pip install lxml
pip install dragnet
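
Before running the final install step, it can be useful to confirm that the build dependencies are importable. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, not part of Dragnet, and the import names `numpy`, `lxml` and `Cython` are the usual module names for those packages.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["numpy", "lxml", "Cython"])` returns an empty list when all three are available; any names it returns still need to be installed.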

Getting started

We provide two separate models: one extracts just the main article content, and the other extracts the content plus any user-generated comments. Choose whichever fits your use case. Each model implements an analyze method that takes an HTML string and returns the content string.

import requests
from dragnet import content_extractor, content_comments_extractor

# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)

# get main article without comments
content = content_extractor.analyze(r.content)

# get article and comments
content_comments = content_comments_extractor.analyze(r.content)
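
When extracting from several pages, the call above can be wrapped in a small batch helper. This is a sketch under stated assumptions, not part of Dragnet: `extract_many` and `StubExtractor` are hypothetical names, and the only thing assumed about a model is what is described above, namely an `analyze` method that takes HTML and returns the content string. The stub stands in for a real model so the example runs without Dragnet installed.

```python
def extract_many(extractor, pages):
    """Apply extractor.analyze to each page in a {url: html} mapping.

    A page whose extraction raises is recorded as None, so a single
    bad document does not stop the rest of the batch.
    """
    results = {}
    for url, html in pages.items():
        try:
            results[url] = extractor.analyze(html)
        except Exception:
            # Broad catch is deliberate in this sketch: which exceptions
            # a real model raises is not specified here.
            results[url] = None
    return results


class StubExtractor:
    """Hypothetical stand-in with the same analyze(html) -> str shape."""

    def analyze(self, html):
        if not html:
            raise ValueError("empty document")
        return html.strip()
```

With Dragnet installed, the same helper could be called as `extract_many(content_extractor, {url: r.content})`, reusing the names from the example above.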

Download files

Source Distribution

dragnet-1.1.0.tar.gz (1.8 MB)

Uploaded Source

Built Distribution

dragnet-1.1.0-py2.7-linux-x86_64.egg (2.7 MB)

Uploaded Source

File details

Details for the file dragnet-1.1.0.tar.gz.

File metadata

  • Download URL: dragnet-1.1.0.tar.gz
  • Upload date:
  • Size: 1.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for dragnet-1.1.0.tar.gz
Algorithm     Hash digest
SHA256        9f1f7a5daf846642fe97d4e48ea4d7d064da8c8eaed4fdf3e33f007a82ec5b1d
MD5           369523fcf40ecfacee8d99e454c35ed1
BLAKE2b-256   35455a06f688d8bc94404b2062e74a1bce57bff7618060a26a3b76992ed0ea69

File details

Details for the file dragnet-1.1.0-py2.7-linux-x86_64.egg.

File hashes

Hashes for dragnet-1.1.0-py2.7-linux-x86_64.egg
Algorithm     Hash digest
SHA256        6216f7e7f539b307d5872a6094d7258bd64a3df8125e5d71afc0683a7fb0a8bc
MD5           7d3247009cf458997250657f53c43bc4
BLAKE2b-256   cf9df8fe02440406f697d08edfde779d7738dee92ff1157e960d47b935541b69
