A webmining CLI tool & library for Python.
Minet features:
- Multithreaded HTML fetching
- Multiprocessing text content extraction
- Facebook's share count fetching
- Custom scraping script?
Installation
minet can be installed using pip:
pip install minet
You can also create a Minet executable.
Commands
fetch
Handy command to fetch the HTML content of every url listed in a given csv, however large. It works in a multithreaded & lazy way (the csv is never loaded fully into memory) and resumes where the previous run stopped.
minet fetch COLUMN FILE
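The lazy, multithreaded behaviour described above can be sketched roughly as follows. This is not Minet's actual implementation: the `fetch_lazily` name, the `fetch_one` callback and the `batch_size` parameter are assumptions made for illustration. The csv is streamed row by row, and only one small batch of urls is handed to the thread pool at a time.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def fetch_lazily(csv_file, column, fetch_one, threads=4, batch_size=100):
    """Yield (url, result) pairs without loading the whole csv into memory.

    `fetch_one` is a hypothetical callable taking a url and returning its
    HTML; in a real run it would perform the HTTP request.
    """
    reader = csv.DictReader(csv_file)  # streams rows one at a time
    urls = (row[column] for row in reader)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            # Hold at most batch_size urls in memory at once.
            batch = list(islice(urls, batch_size))
            if not batch:
                break
            # pool.map runs fetch_one concurrently and preserves input order.
            yield from zip(batch, pool.map(fetch_one, batch))


# Usage with a fake fetcher, so the sketch runs without network access.
sample = io.StringIO("url\nhttp://a.example\nhttp://b.example\nhttp://c.example\n")
results = list(
    fetch_lazily(sample, "url", lambda u: "<html>%s</html>" % u, batch_size=2)
)
```

Batching is one simple way to bound memory use; other designs (a bounded queue feeding worker threads, for instance) would serve the same purpose.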
Additional options:
- -s STORAGE_LOCATION: specifies the location where the (temporary) HTML files are stored. Is ./data by default.
- -id COLUMN_NAME: name of the url ID column, if present in the csv FILE. Used for the name of the HTML files. If not specified, UUIDs are generated.
- --monitoring_file FILE_NAME: location of the monitoring file used to save progress. Is ./data/monitoring.csv by default.
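To make the -id and --monitoring_file options more concrete, here is a minimal sketch of how file naming and resuming could work. It is an illustration under assumptions, not Minet's code: the helper names, the .html extension and the one-line-per-url layout of the monitoring file are all hypothetical.

```python
import csv
import os
import tempfile
import uuid


def html_file_name(row, id_column=None):
    """Name the HTML file after the id column if given, else a fresh UUID."""
    if id_column is not None:
        return "%s.html" % row[id_column]
    return "%s.html" % uuid.uuid4()


def load_done(monitoring_path):
    """Return the set of urls already recorded in the monitoring file."""
    if not os.path.exists(monitoring_path):
        return set()
    with open(monitoring_path, newline="") as f:
        return {line[0] for line in csv.reader(f) if line}


def mark_done(monitoring_path, url, file_name):
    """Append one (url, file name) line so a later run can skip this url."""
    with open(monitoring_path, "a", newline="") as f:
        csv.writer(f).writerow([url, file_name])


# Usage: the second pass skips what the first pass recorded.
monitoring = os.path.join(tempfile.mkdtemp(), "monitoring.csv")
rows = [
    {"id": "1", "url": "http://a.example"},
    {"id": "2", "url": "http://b.example"},
]

mark_done(monitoring, rows[0]["url"], html_file_name(rows[0], "id"))
remaining = [r for r in rows if r["url"] not in load_done(monitoring)]
```

Appending to the monitoring file after each url means an interrupted run loses at most the work that was in flight.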
Example
Imagine you have a urls.csv file containing, in a column called 'url', the urls you want to extract data from. Just use this command:
minet fetch url urls.csv
That's it: your HTML files are stored in ./data/htmlfiles, ready for text content extraction, for instance.
facebook
Quickly fetches the (rounded*) Facebook share count of each url in a given csv, without needing an API or an access token (and thus without rate limiting). Works in a multithreaded & lazy way (the csv is not loaded into memory).
The share count of a url is the sum of:
- the number of likes of the url
- the number of shares of the url
- the number of likes & comments on stories about this url
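As a worked illustration of that sum, with made-up numbers and hypothetical parameter names:

```python
def share_count(url_likes, url_shares, story_engagement):
    """Sum the three components listed above.

    `story_engagement` stands for the likes and comments on stories
    about the url.
    """
    return url_likes + url_shares + story_engagement


# Made-up figures, for illustration only.
total = share_count(url_likes=120, url_shares=45, story_engagement=35)
print(total)  # 200
```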
minet facebook COLUMN FILE
Additional options:
- -o OUTPUT: specifies the location of the output csv (being the source csv FILE with an additional facebook_share_count column). Is stdout by default.
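The shape of the output can be sketched as follows. This is an illustration under assumptions rather than Minet's implementation: `add_share_counts` and the `count_for` callback are hypothetical names, the callback stands in for the real lookup, and multithreading is left out to keep the sketch short.

```python
import csv
import io
import sys


def add_share_counts(source, column, count_for, output=sys.stdout):
    """Copy the csv row by row, appending a facebook_share_count column."""
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["facebook_share_count"]
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:  # one row in memory at a time
        row["facebook_share_count"] = count_for(row[column])
        writer.writerow(row)


# Usage with a fake lookup and in-memory files.
source = io.StringIO("url,title\nhttp://a.example,A\nhttp://b.example,B\n")
out = io.StringIO()
add_share_counts(source, "url", lambda url: 0, output=out)
```

Every original column is preserved and the new column comes last, which matches the description of the output as the source csv plus one column.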
Example
Let's say you have a urls.csv file with, in a 'url' column, the urls you want the share count of. Just use this command:
minet facebook url urls.csv -o urls_with_fb_data.csv
As a result, you get a urls_with_fb_data.csv file with a facebook_share_count column.
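The resulting file can then be processed with the standard library. A small sketch, assuming the column holds integers; the sample rows are made up and an in-memory file stands in for urls_with_fb_data.csv:

```python
import csv
import io

# Made-up sample standing in for the real output file.
sample = io.StringIO(
    "url,facebook_share_count\n"
    "http://a.example,12\n"
    "http://b.example,340\n"
)

rows = list(csv.DictReader(sample))
# Rank urls from most to least shared.
rows.sort(key=lambda r: int(r["facebook_share_count"]), reverse=True)
most_shared = rows[0]["url"]
```

With a real file, `open("urls_with_fb_data.csv", newline="")` would replace the `io.StringIO` object.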
File details
Details for the file minet-0.1.0.tar.gz.
File metadata
- Download URL: minet-0.1.0.tar.gz
- Upload date:
- Size: 11.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3939b02b90aca2c2671298722fee3c94a148a5a53fe5f04efc682343da91d0eb
MD5 | f0453c90d19b2ca898682dbb2b150e8e
BLAKE2b-256 | 929de8fb6facfbe6ab17d5143662c1929616860ea3156298e437f1d13ce52439
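A downloaded archive can be checked against the digests listed above with Python's hashlib. This is a generic sketch, not part of Minet; the `digests` helper name is hypothetical, and BLAKE2b-256 is computed here as BLAKE2b with a 32-byte digest.

```python
import hashlib


def digests(data):
    """Return the three digests shown in the table for the given bytes."""
    return {
        "SHA256": hashlib.sha256(data).hexdigest(),
        "MD5": hashlib.md5(data).hexdigest(),
        "BLAKE2b-256": hashlib.blake2b(data, digest_size=32).hexdigest(),
    }


# Demonstration on empty input, so the sketch runs without the archive.
empty = digests(b"")

# With the real file, one would compare against the published values:
# with open("minet-0.1.0.tar.gz", "rb") as f:
#     ok = digests(f.read())["MD5"] == "f0453c90d19b2ca898682dbb2b150e8e"
```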
File details
Details for the file minet-0.1.0-py3-none-any.whl.
File metadata
- Download URL: minet-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | ba39871535d8fa30d1e4f533e017158b0663d4e1249c9e1fee204c7c6ec8fe3c
MD5 | 5571ef6e2a02b3b018332f0afd7255a0
BLAKE2b-256 | 985295db37eb913d6582230f60e7e75a383d1d84298f4cf589ca1270395253e8