Commandline tools for training Fathom rulesets
Project description
This package provides the commandline trainer for Fathom, a supervised-learning system for recognizing parts of web pages. It also includes other commandline tools for ruleset development, such as fathom-unzip, fathom-pick, and fathom-list. See the Fathom documentation for details on the trainer.
Version History
- 3.2
Add fathom-test tool for computing test-corpus accuracies.
Add fathom-extract to break down frozen pages into small enough pieces to check into GitHub.
Add fathom-serve to dodge the CORS errors that otherwise happen when loading extracted pages.
Add a test harness for the Python code.
Add confidence intervals for false positives and false negatives in trainer.
Add precision and recall numbers to trainer.
Add optional positive-sample weighting in trainer, for trading off between precision and recall.
Add experimental support for deeper neural networks in trainer.
Add recognition-time speed metrics to trainer.
- 3.1
Add fathom-list tool.
Further optimize trainer: about 17x faster for a 60-sample corpus, with superlinear improvements for larger ones.
- 3.0
Move to Fathom repo.
Add fathom-unzip and fathom-pick.
Switch to the Adam optimizer, which is significantly more turn-key, to the point where it doesn’t need its learning-rate decay set manually.
Tolerate pages for which no candidate nodes were collected.
Add 95% CI for per-page training accuracy.
Add validation-guided early stopping.
Revise per-page accuracy calculation and display.
Shuffle training samples before training.
Add false-positive and false-negative numbers to per-tag metrics.
- 3.0a1
First release, intended for use with Fathom 3.0 or later.
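Several entries above mention precision, recall, and confidence intervals for error rates. As a rough illustration of what those metrics mean, here is a minimal Python sketch. It is not the trainer's code: the function names are hypothetical, and the normal-approximation (Wald) interval is an assumption chosen for simplicity, since the changelog does not say which interval the trainer uses.

```python
import math


def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts. Hypothetical helper for illustration."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def wald_interval(successes, total, z=1.96):
    """Approximate 95% confidence interval for a proportion.

    Uses the normal approximation; this is an assumed method, not
    necessarily the one the Fathom trainer implements.
    """
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to the valid range for a proportion.
    return (max(0.0, p - half_width), min(1.0, p + half_width))
```

For example, `precision_recall(8, 2, 2)` gives a precision and recall of 0.8 each, and `wald_interval(fp, total)` would bound a false-positive rate measured on `total` samples.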
Hashes for fathom_web-3.3-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 11083fa1ffdadb968c43c7b97a1ba1aec7b7d7d066015d0d8d9c3db33d77b0a7
MD5 | 6428c4eadbaf539322dbe114f9022cad
BLAKE2b-256 | 38ee8a3f8773c65ddaf8eecc6903d95c54d26df347039a7b1ecf7a479abca967
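The SHA256 digest listed above can be used to check a downloaded wheel before installing it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names and the local file path are hypothetical, and only the expected digest comes from this page.

```python
import hashlib

# SHA256 digest published above for fathom_web-3.3-py2.py3-none-any.whl.
EXPECTED_SHA256 = "11083fa1ffdadb968c43c7b97a1ba1aec7b7d7d066015d0d8d9c3db33d77b0a7"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

Assuming the wheel was saved in the current directory, `matches_expected("fathom_web-3.3-py2.py3-none-any.whl")` should return `True` for an unmodified download.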