A classifier for detecting soft 404 pages
Project description
A “soft” 404 page is a page that is served with a 200 status code but actually tells the user that the requested content is not available.
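The idea can be illustrated with a naive keyword heuristic (purely illustrative: the library's classifier is a trained model, and the function below is a hypothetical sketch of the kind of signal it picks up on, not its actual method):

```python
# Illustrative only: a naive phrase-matching heuristic for soft-404 pages.
# The real soft404 package uses a trained classifier instead.

NOT_FOUND_PHRASES = [
    "page not found",
    "does not exist",
    "no longer available",
]

def looks_like_soft_404(html_text: str) -> bool:
    """Return True if the page text contains typical 'not found' phrases."""
    text = html_text.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)
```

A trained classifier generalizes far beyond such fixed phrases (different wording, languages, and page structure), which is the point of the package.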
Installation
pip install git+https://github.com/scrapinghub/webstruct.git
pip install soft404
Usage
>>> from soft404 import Soft404Classifier
>>> clf = Soft404Classifier()
>>> clf.predict('<h1>Page not found</h1>')
0.9736860086882132
Development
Getting data for training
Install dev requirements:
pip install -r requirements_dev.txt
Run the crawler for a while (results will appear in pages.jl.gz file):
cd crawler
scrapy crawl spider -o gzip:pages.jl -s JOBDIR=job
Training
First, extract text and structure from html:
./soft404/convert_to_text.py pages.jl.gz items
This will produce two files, items.meta.jl.gz and items.items.jl.gz. Next, train the classifier:
./soft404/train.py items
The vectorizer takes a while to run, but its result is cached (the filename where it is cached will be printed on the next run). If you are happy with the results, save the classifier:
./soft404/train.py items --save soft404/clf.joblib