A webmining CLI tool & library for Python.

Minet

Minet features:

  • Multithreaded HTML fetching
  • Multiprocessing text content extraction
  • Facebook share count fetching
  • Custom scraping script?

Installation

minet can be installed using pip:

pip install minet

You can also create a Minet executable.

Commands

fetch

Fetches the HTML content of every URL listed in a given CSV file, which may be very large. It works in a multithreaded, lazy way (the CSV is not loaded into memory) and resumes where the previous run stopped.

minet fetch COLUMN FILE

Additional options:

  • -s STORAGE_LOCATION : location where the (temporary) HTML files are stored. Defaults to ./data.
  • -id COLUMN_NAME : name of the URL ID column, if the CSV FILE has one. Its values are used to name the HTML files. If not specified, UUIDs are generated.
  • --monitoring_file FILE_NAME : location of the monitoring file used to save progress. Defaults to ./data/monitoring.csv.
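
The behaviour described above (lazy CSV reading, a pool of threads, and a monitoring file that lets a run resume) can be sketched in plain Python. This is only an illustration under stated assumptions, not minet's actual implementation: the names fetch_all, download and batch_size are hypothetical, and so is the monitoring file layout (one "url,name" row per completed download).

```python
import csv
import uuid
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from pathlib import Path
from urllib.request import urlopen


def download(url):
    """Default fetcher (hypothetical): return the raw body of `url` as bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_all(csv_path, column, storage="./data", id_column=None,
              fetcher=download, threads=4, batch_size=100):
    """Fetch every URL in `column`, skipping those a previous run completed."""
    html_dir = Path(storage) / "htmlfiles"
    html_dir.mkdir(parents=True, exist_ok=True)
    monitoring = Path(storage) / "monitoring.csv"  # assumed layout: url,name

    # URLs already recorded in the monitoring file are not fetched again.
    done = set()
    if monitoring.exists():
        with monitoring.open(newline="") as f:
            done = {row[0] for row in csv.reader(f) if row}

    def job(row):
        # Use the ID column for the file name if given, else a fresh UUID.
        name = row[id_column] if id_column else str(uuid.uuid4())
        (html_dir / f"{name}.html").write_bytes(fetcher(row[column]))
        return row[column], name

    with open(csv_path, newline="") as src, \
            monitoring.open("a", newline="") as log, \
            ThreadPoolExecutor(max_workers=threads) as pool:
        writer = csv.writer(log)
        # DictReader yields one row at a time; only `batch_size` rows are
        # held in memory at once, so the source CSV is never fully loaded.
        rows = (r for r in csv.DictReader(src) if r[column] not in done)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            for url, name in pool.map(job, batch):
                writer.writerow([url, name])
                log.flush()  # progress survives an interruption
```

Error handling is left out for brevity: a failed download raises and stops the run, and the next call picks up after the last URL written to the monitoring file.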

Example

Imagine you have a urls.csv file whose 'url' column contains the URLs you want to extract data from. Just use this command:

minet fetch url urls.csv

That's it: your HTML files are stored in ./data/htmlfiles, ready for text content extraction, for instance.
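
As a rough idea of what that next step could look like, here is a minimal sketch that pulls the visible text out of each stored HTML file across several processes. It uses only the standard library as a stand-in: this page does not describe how minet extracts text content, and every name below (html_to_text, extract_file, extract_all) is hypothetical.

```python
from html.parser import HTMLParser
from multiprocessing import Pool
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of `html`, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_file(path):
    """Read one HTML file and return (path, extracted text)."""
    html = Path(path).read_text(encoding="utf-8", errors="replace")
    return str(path), html_to_text(html)


def extract_all(directory="./data/htmlfiles", processes=2):
    """Extract text from every .html file in `directory` using a process pool."""
    paths = sorted(Path(directory).glob("*.html"))
    with Pool(processes) as pool:
        return dict(pool.imap_unordered(extract_file, paths))
```

On platforms that spawn worker processes (Windows, macOS), call extract_all from under an `if __name__ == "__main__":` guard.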


facebook

Quickly fetches the (rounded*) Facebook share count of each URL in a given CSV file, without needing an API or an access token (and thus without rate limiting). Works in a multithreaded, lazy way (the CSV is not loaded into memory).

The share count of a URL is the sum of:

  • the number of likes of the url
  • the number of shares of the url
  • the number of likes & comments on stories about this url

minet facebook COLUMN FILE

Additional options:

  • -o OUTPUT : location of the output CSV, which is the source CSV FILE with an additional facebook_share_count column. Defaults to stdout.
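
The output format described above (the source CSV plus one extra column, written to stdout unless an output path is given) can be sketched as a row-by-row copy. This is an assumption-laden illustration rather than minet's code: add_share_counts and its get_count argument are hypothetical names, get_count stands in for whatever actually looks up the share count, and the loop is sequential for brevity where the real command is multithreaded.

```python
import csv
import sys


def add_share_counts(source, column, get_count, output=None):
    """Copy the CSV at `source`, appending a facebook_share_count column.

    `get_count` is a caller-supplied function (hypothetical) that maps a
    URL to its share count. Rows are written to `output`, or to stdout
    when no output path is given.
    """
    out = open(output, "w", newline="") if output else sys.stdout
    try:
        with open(source, newline="") as src:
            reader = csv.DictReader(src)
            fields = list(reader.fieldnames or []) + ["facebook_share_count"]
            writer = csv.DictWriter(out, fieldnames=fields)
            writer.writeheader()
            # One row is read, enriched and written at a time, so the
            # source CSV is never held in memory as a whole.
            for row in reader:
                row["facebook_share_count"] = get_count(row[column])
                writer.writerow(row)
    finally:
        if output:
            out.close()
```

Keeping the lookup as a parameter makes the streaming logic easy to try with a stub function before plugging in a real one.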

Example

Let's say you have a urls.csv file whose 'url' column contains the URLs you want the share count of.

Just use this command:

minet facebook url urls.csv -o urls_with_fb_data.csv

As a result, you get a urls_with_fb_data.csv file with a facebook_share_count column.
