Minet

A webmining CLI tool & library for Python.

Minet features:

  • Multithreaded HTML fetching
  • Multiprocessing text content extraction
  • Facebook share count fetching
  • Custom scraping script?

Installation

minet can be installed using pip:

pip install minet

You can also create a Minet executable.

Commands

fetch

Fetches the HTML content of every URL listed in a CSV file, however large. It works in a multithreaded and lazy way (the CSV is never loaded into memory in full) and resumes where the previous execution stopped.

minet fetch COLUMN FILE

Additional options:

  • -s STORAGE_LOCATION : location where the (temporary) HTML files are stored. Defaults to ./data.
  • -id COLUMN_NAME : name of the URL ID column, if present in the CSV FILE. Its values are used to name the HTML files. If not specified, UUIDs are generated.
  • --monitoring_file FILE_NAME : location of the monitoring file used to save progress. Defaults to ./data/monitoring.csv.

Example

Imagine you have a urls.csv file whose 'url' column contains the URLs you want to extract data from. Just use this command:

minet fetch url urls.csv

That's it: your HTML files are stored in ./data/htmlfiles, ready for text content extraction, for instance.
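To make the "multithreaded, lazy and resumable" behaviour more concrete, here is a minimal Python sketch of the general idea. It is not Minet's actual implementation: the function names (iter_pending, load_done, fetch_all), the download callable and the two-column layout of the monitoring file are all assumptions made for illustration. Rows are read from the CSV in small batches, each batch is fetched by a thread pool, and every finished URL is appended to a monitoring file so that a later run can skip it.

```python
import csv
import uuid
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from pathlib import Path


def load_done(monitoring_path):
    """Return the set of URLs already recorded in the monitoring file."""
    path = Path(monitoring_path)
    if not path.exists():
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0] for row in csv.reader(f) if row}


def iter_pending(csv_path, column, done_urls):
    """Yield URLs one row at a time, skipping those fetched by a previous run."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row[column] not in done_urls:
                yield row[column]


def fetch_all(csv_path, column, download, storage="./data", threads=4):
    """Fetch pending URLs; `download` is a hypothetical url -> HTML string callable."""
    storage = Path(storage)
    html_dir = storage / "htmlfiles"
    html_dir.mkdir(parents=True, exist_ok=True)
    monitoring = storage / "monitoring.csv"
    pending = iter_pending(csv_path, column, load_done(monitoring))

    with ThreadPoolExecutor(max_workers=threads) as pool, \
            open(monitoring, "a", newline="", encoding="utf-8") as log:
        writer = csv.writer(log)
        while True:
            # Only one small batch of URLs is held in memory at a time.
            batch = list(islice(pending, threads * 8))
            if not batch:
                break
            for url, html in zip(batch, pool.map(download, batch)):
                name = f"{uuid.uuid4()}.html"
                (html_dir / name).write_text(html, encoding="utf-8")
                # Record progress right away so an interrupted run can resume.
                writer.writerow([url, name])
                log.flush()
```

Passing the download function in as a parameter keeps the sketch independent of any particular HTTP library, and makes it easy to try out with a stub that returns canned HTML.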


facebook

Quickly fetches the (rounded) Facebook share count of each URL in a given CSV, without needing an API key or an access token (and thus without rate limiting). Works in a multithreaded and lazy way (the CSV is not loaded into memory).

The share count of a URL is the sum of:

  • the number of likes of the URL
  • the number of shares of the URL
  • the number of likes & comments on stories about this URL

minet facebook COLUMN FILE

Additional options:

  • -o OUTPUT : location of the output CSV (the source CSV FILE with an additional facebook_share_count column). Defaults to stdout.

Example

Let's say you have a urls.csv file whose 'url' column lists the URLs you want the share count of.

Just use this command:

minet facebook url urls.csv -o urls_with_fb_data.csv

As a result, you get a urls_with_fb_data.csv file with a facebook_share_count column.
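The shape of that output can be illustrated with a short Python sketch. It does not reproduce how Minet queries Facebook: get_count is a hypothetical callable standing in for the real lookup, and total_share_count and add_share_counts are names invented here. The sketch only shows the sum described above and how a facebook_share_count column can be appended while the source CSV is streamed in small batches.

```python
import csv
import sys
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def total_share_count(likes, shares, story_interactions):
    """Sum the three components that make up a URL's share count."""
    return likes + shares + story_interactions


def add_share_counts(source, column, get_count, output=sys.stdout, threads=4):
    """Copy the CSV from `source` to `output`, adding a facebook_share_count column.

    `get_count` is a hypothetical url -> int callable (an assumption of this sketch).
    """
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["facebook_share_count"]
    writer = csv.DictWriter(output, fieldnames=fieldnames)
    writer.writeheader()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            # Read a small batch of rows so the whole file never sits in memory.
            rows = list(islice(reader, threads * 8))
            if not rows:
                break
            # pool.map preserves input order, so rows and counts stay aligned.
            counts = pool.map(get_count, [row[column] for row in rows])
            for row, count in zip(rows, counts):
                row["facebook_share_count"] = count
                writer.writerow(row)
```

Because the function takes open file objects, it can write to stdout by default and to a file when one is passed, mirroring the behaviour of the -o option.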
