Extract the main article content (and optionally comments) from a web page
Dragnet
Dragnet isn’t interested in the shiny chrome or boilerplate dressing of a web page. It’s interested in… ‘just the facts.’ The machine learning models in Dragnet extract the main article content and, optionally, user-generated comments from a web page. They provide state-of-the-art performance on a variety of test benchmarks.
For more information on our approach check out:
The Dragnet homepage
Our paper Content Extraction Using Diverse Feature Sets, published at WWW in 2013, gives an overview of the machine learning approach.
A comparison of Dragnet and alternate content extraction packages.
This blog post explains the intuition behind the algorithms.
Installing
The build requires numpy, lxml and a recent version of Cython, so first make sure they are installed, then install Dragnet:
pip install numpy
pip install --upgrade cython
pip install lxml
pip install dragnet
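Before running the final install step, it can be useful to confirm that the build requirements are importable. The following is a minimal sketch using only the standard library; the helper name missing_build_deps is hypothetical, and the import names (numpy, lxml, Cython) are assumed to match the packages listed above.

```python
import importlib.util


def missing_build_deps(names=("numpy", "lxml", "Cython")):
    """Return the subset of `names` that cannot be found as importable modules."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_build_deps()
    if missing:
        print("Install these before building Dragnet: " + ", ".join(missing))
    else:
        print("All build requirements found.")
```

This only checks that the modules can be located; it does not check whether the installed Cython version is recent enough.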
GETTING STARTED
We provide two separate models, depending on your use case: one extracts just the main article content, and the other extracts the content plus any user-generated comments. Each model implements an analyze method that takes an HTML string and returns the content as a string.
import requests
from dragnet import content_extractor, content_comments_extractor
# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)
# get main article without comments
content = content_extractor.analyze(r.content)
# get article and comments
content_comments = content_comments_extractor.analyze(r.content)
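When processing many pages, it may help to wrap the call so that one bad page does not abort the whole run. The sketch below relies only on what is described above: an extractor object with an analyze method that takes HTML and returns a string. The names extract_many and StubExtractor are hypothetical; the stub merely strips tags so the example runs without Dragnet installed, and it is not how Dragnet's models work.

```python
import re


class StubExtractor:
    """Hypothetical stand-in for a Dragnet model: strips tags, nothing more."""

    def analyze(self, html):
        if isinstance(html, bytes):
            html = html.decode("utf-8", errors="replace")
        if not html.strip():
            raise ValueError("empty document")
        text = re.sub(r"<[^>]+>", " ", html)
        return " ".join(text.split())


def extract_many(pages, extractor):
    """Map each URL to its extracted content, or to None if extraction failed.

    `pages` is a dict of URL -> HTML; `extractor` is any object exposing
    analyze(html), such as content_extractor in the example above.
    """
    results = {}
    for url, html in pages.items():
        try:
            results[url] = extractor.analyze(html)
        except Exception:
            # Assumption: a failed page is recorded as None rather than re-raised.
            results[url] = None
    return results


pages = {
    "https://example.com/a": "<html><body><p>Just the facts.</p></body></html>",
    "https://example.com/b": "",
}
results = extract_many(pages, StubExtractor())
```

With Dragnet installed, passing content_extractor or content_comments_extractor in place of the stub should work the same way, since the helper depends only on the analyze method.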
File details
Details for the file dragnet-1.1.0.tar.gz.
File metadata
- Download URL: dragnet-1.1.0.tar.gz
- Upload date:
- Size: 1.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9f1f7a5daf846642fe97d4e48ea4d7d064da8c8eaed4fdf3e33f007a82ec5b1d
MD5 | 369523fcf40ecfacee8d99e454c35ed1
BLAKE2b-256 | 35455a06f688d8bc94404b2062e74a1bce57bff7618060a26a3b76992ed0ea69
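A published digest is only useful if the downloaded file is checked against it. The following is a minimal sketch of that check using Python's standard hashlib module; the function names and the local file path are hypothetical, and the file is assumed to have been downloaded already.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """True when the file's SHA256 equals the published digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling matches_published_digest with the path to a local copy of dragnet-1.1.0.tar.gz and the SHA256 value listed for that file should return True for an intact download.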
File details
Details for the file dragnet-1.1.0-py2.7-linux-x86_64.egg.
File metadata
- Download URL: dragnet-1.1.0-py2.7-linux-x86_64.egg
- Upload date:
- Size: 2.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6216f7e7f539b307d5872a6094d7258bd64a3df8125e5d71afc0683a7fb0a8bc
MD5 | 7d3247009cf458997250657f53c43bc4
BLAKE2b-256 | cf9df8fe02440406f697d08edfde779d7738dee92ff1157e960d47b935541b69