Minet

A webmining CLI tool & library for python.

Minet features:

  • Multithreaded HTML fetching
  • Multiprocessing text content extraction
  • Facebook's share count fetching
  • Custom scraping script?

Installation

minet can be installed using pip:

pip install minet

You can also create a Minet executable.

Commands

fetch

Fetches the HTML content of every url listed in a given csv file, however large. It works in a multithreaded & lazy way (the csv is never loaded into memory) and resumes where the previous execution stopped.

minet fetch COLUMN FILE

Additional options:

  • -s STORAGE_LOCATION : location where the (temporary) HTML files are stored. Defaults to ./data.
  • -id COLUMN_NAME : name of the url ID column, if present in the csv FILE. Used to name the HTML files. If not specified, UUIDs are generated.
  • --monitoring_file FILE_NAME : location of the monitoring file used to save progress. Defaults to ./data/monitoring.csv.
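
The naming and resume behaviour described by these options could look roughly like the sketch below. It is an illustration only, not minet's actual code: the function names and the layout of the monitoring file (a csv with `url` and `filename` columns) are assumptions.

```python
import csv
import os
import uuid


def load_done(monitoring_file):
    """Return the set of urls already recorded in the monitoring file."""
    if not os.path.exists(monitoring_file):
        return set()
    with open(monitoring_file, newline="", encoding="utf-8") as f:
        # Assumed layout: a csv with a "url" column.
        return {row["url"] for row in csv.DictReader(f)}


def html_filename(row, id_column=None):
    """Name the HTML file after the row's ID, or after a fresh UUID."""
    name = row[id_column] if id_column else str(uuid.uuid4())
    return name + ".html"


def pending_rows(csv_file, url_column, monitoring_file):
    """Lazily yield the rows whose url has not been fetched yet."""
    done = load_done(monitoring_file)
    with open(csv_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row[url_column] not in done:
                yield row


def mark_done(monitoring_file, url, filename):
    """Append one line of progress to the monitoring file."""
    is_new = not os.path.exists(monitoring_file)
    with open(monitoring_file, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["url", "filename"])
        writer.writerow([url, filename])
```

Because progress is appended after each url, an interrupted run loses at most the work in flight, and the next run skips everything already listed.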

Example

Imagine you have a urls.csv file containing, in a column called 'url', the urls you want to extract data from. Just use this command:

minet fetch url urls.csv

That's it: your HTML files are stored in ./data/htmlfiles, ready for text content extraction, for instance.
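
For readers curious about what "multithreaded & lazy" means in practice, here is a minimal sketch of the idea using only the standard library. It is not minet's implementation: `read_urls`, `download`, `fetch_all`, and the `threads` and `batch_size` parameters are hypothetical names chosen for illustration.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from urllib.request import urlopen


def read_urls(csv_path, column):
    """Yield one url at a time so the csv is never fully loaded into memory."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row[column]


def download(url):
    """Hypothetical fetcher: return the raw body served at `url`."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_all(csv_path, column, fetch=download, threads=4, batch_size=100):
    """Yield (url, body) pairs, fetching each batch of urls in parallel."""
    urls = read_urls(csv_path, column)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            # Only `batch_size` urls are held in memory at any time.
            batch = list(islice(urls, batch_size))
            if not batch:
                break
            # pool.map preserves input order, so results line up with urls.
            yield from zip(batch, pool.map(fetch, batch))
```

Reading in batches matters here: handing the whole generator to `pool.map` would consume the entire csv up front, which defeats the lazy reading.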


facebook

Quickly fetches the (rounded*) Facebook share count of each url in a given csv, without needing an API or an access token (and thus without rate limitation). Works in a multithreaded & lazy way (the csv is not loaded into memory).

The share count of a url is the sum of:

  • the number of likes of the url
  • the number of shares of the url
  • the number of likes & comments on stories about this url

minet facebook COLUMN FILE

Additional options:

  • -o OUTPUT : location of the output csv (the source csv FILE with an additional facebook_share_count column). Defaults to stdout.

Example

Let's say you have a urls.csv file whose 'url' column holds the urls you want the share count of.

Just use this command:

minet facebook url urls.csv -o urls_with_fb_data.csv

As a result, you get a urls_with_fb_data.csv file with a facebook_share_count column.
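
The shape of that output (the source csv plus one extra column, written to stdout unless an output is given) can be sketched as follows. This is an illustration under stated assumptions rather than minet's code: `add_share_counts` is a hypothetical name, and `get_share_count` stands in for however the count is actually retrieved, which this document does not describe. The sketch is single-threaded for brevity.

```python
import csv
import sys


def add_share_counts(source, column, get_share_count, output=None):
    """Copy `source` to `output` row by row, appending a share count column.

    `source` and `output` are open text files; `get_share_count` is a
    hypothetical callable mapping a url to its share count.
    """
    output = sys.stdout if output is None else output
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames or []) + ["facebook_share_count"]
    writer = csv.DictWriter(output, fieldnames=fieldnames)
    writer.writeheader()
    # Rows are streamed one at a time, so the csv is never held in memory.
    for row in reader:
        row["facebook_share_count"] = get_share_count(row[column])
        writer.writerow(row)
```

Passing the lookup in as a callable keeps the csv handling separate from the network code, which also makes the function easy to try with a stub.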


