A webmining CLI tool & library for python.
Minet features:
- Multithreaded HTML fetching
- Multiprocessing text content extraction
- Facebook's share count fetching
- Custom scraping script?
Installation
minet can be installed using pip:
pip install minet
You can also create a Minet executable.
Commands
fetch
Fetches the HTML content of every url listed in a given csv file, however large. It works in a multithreaded & lazy way (the csv is never fully loaded into memory) and resumes where the previous execution stopped.
minet fetch COLUMN FILE
Additional options:
- -s STORAGE_LOCATION: specifies the location where the (temporary) HTML files are stored. Is ./data by default.
- -id COLUMN_NAME: name of the url ID column, if present in the csv FILE. Used for the name of the HTML files. If not specified, UUIDs are generated.
- --monitoring_file FILE_NAME: location of the monitoring file used to save progress. Is ./data/monitoring.csv by default.
Example
Imagine you have a urls.csv file containing, in a column called 'url', the urls you want to extract data from. Just use this command:
minet fetch url urls.csv
That's it, your HTML files are stored in ./data/htmlfiles, ready for text content extraction for instance.
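The lazy, multithreaded, resumable behaviour described above can be sketched in plain Python. This is only an illustration of the general technique, not Minet's actual implementation: the function names (`iter_urls`, `load_done`, `fetch_all`), the one-url-per-line monitoring format, and the batching strategy are all assumptions made for the example. The `fetch` argument is any callable that downloads and stores one url.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def iter_urls(csv_path, column, done=frozenset()):
    """Lazily yield urls from `column`, skipping those already processed."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = row[column]
            if url not in done:
                yield url


def load_done(monitoring_path):
    """Read the set of urls recorded by a previous run (hypothetical format)."""
    try:
        with open(monitoring_path, newline="", encoding="utf-8") as f:
            return {row[0] for row in csv.reader(f) if row}
    except FileNotFoundError:
        return set()


def fetch_all(csv_path, column, fetch, monitoring_path, threads=4, batch_size=100):
    """Call `fetch(url)` on every pending url and return how many were processed."""
    done = load_done(monitoring_path)
    urls = iter_urls(csv_path, column, done)
    processed = 0
    with ThreadPoolExecutor(max_workers=threads) as pool, \
            open(monitoring_path, "a", newline="", encoding="utf-8") as monitor:
        writer = csv.writer(monitor)
        while True:
            # Only `batch_size` urls are held in memory at any time.
            batch = list(islice(urls, batch_size))
            if not batch:
                break
            # pool.map runs `fetch` concurrently and yields results in order.
            for url, _ in zip(batch, pool.map(fetch, batch)):
                writer.writerow([url])
                processed += 1
            # Persist progress so an interrupted run can resume later.
            monitor.flush()
    return processed
```

Calling `fetch_all("urls.csv", "url", my_fetch, "monitoring.csv")` twice in a row would process every url on the first call and nothing on the second, since all urls are then listed in the monitoring file. Note that an exception raised by `fetch` propagates and stops the run; a real tool would likely record the failure and continue.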
facebook
Quickly fetches the (rounded) Facebook share count of each url in a given csv, without requiring an API or an access token (and thus with no rate limitation). Works in a multithreaded & lazy way (the csv is not loaded into memory).
The share count of a url is the sum of:
- the number of likes of the url
- the number of shares of the url
- the number of likes & comments on stories about this url
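As a tiny worked example of the sum above, with made-up numbers and hypothetical parameter names (Minet itself reports only the final total):

```python
def share_count(likes, shares, story_interactions):
    """Sum the three components listed above into a single share count.

    The parameter names are illustrative only; `story_interactions` stands
    for the likes & comments on stories about the url.
    """
    return likes + shares + story_interactions


# A url with 10 likes, 5 shares and 3 likes/comments on related stories.
total = share_count(likes=10, shares=5, story_interactions=3)
print(total)  # 18
```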
minet facebook COLUMN FILE
Additional options:
- -o OUTPUT: specifies the location of the output csv (being the source csv FILE with an additional facebook_share_count column). Is stdout by default.
Example
Let's say you have a urls.csv file with, in a 'url' column, the urls you want the share count of. Just use this command:
minet facebook url urls.csv -o urls_with_fb_data.csv
As a result, you get a urls_with_fb_data.csv file with a facebook_share_count column.
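Once the output file exists, it can be post-processed with the standard library. The sketch below ranks urls by share count; it assumes the `url` and `facebook_share_count` columns from the example above and that counts are written as plain integers, which may not hold for every row in practice. The helper name `top_shared` is hypothetical.

```python
import csv


def top_shared(csv_path, n=10, url_column="url", count_column="facebook_share_count"):
    """Return the `n` (url, share_count) pairs with the highest share counts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Skip rows whose count is missing or empty (e.g. a failed lookup).
        rows = [r for r in csv.DictReader(f) if (r.get(count_column) or "").strip()]
    # Assumption: counts are integers; adjust the conversion if they are not.
    rows.sort(key=lambda r: int(r[count_column]), reverse=True)
    return [(r[url_column], int(r[count_column])) for r in rows[:n]]
```

For instance, `top_shared("urls_with_fb_data.csv", n=5)` would return the five most shared urls of the file produced above.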