Skip to main content

A webmining CLI tool & library for python.

Project description

Build Status

Minet

minet is webmining CLI tool & library for python. It adopts a lo-fi approach to various webmining problems by letting you perform a variety of actions from the comfort of your command line. No database needed: raw data files will get you going.

Features

  • Multithreaded, memory-efficient fetching from the web.
  • Multiprocessed raw text content extraction from HTML pages.
  • Multiprocessed scraping from HTML pages using a comfy JSON DSL.
  • Data collection from various APIs such as CrowdTangle.

Installation

minet can be installed using pip:

pip install minet

Commands

Basic commands

API-related commands

fetch

Minet Fetch Command
===================

Use multiple threads to fetch batches of urls from a CSV file. The
command outputs a CSV report with additional metadata about the
HTTP calls and will generally write the retrieved files in a folder
given by the user.

positional arguments:
  column                                  Column of the CSV file containing urls to fetch.
  file                                    CSV file containing the urls to fetch.

optional arguments:
  -h, --help                              show this help message and exit
  --contents-in-report                    Whether to include retrieved contents, e.g. html, directly in the report
                                          and avoid writing them in a separate folder. This requires to standardize
                                          encoding and won't work on binary formats.
  -d OUTPUT_DIR, --output-dir OUTPUT_DIR  Directory where the fetched files will be written. Defaults to "content".
  -f FILENAME, --filename FILENAME        Name of the column used to build retrieved file names. Defaults to an uuid v4 with correct extension.
  --filename-template FILENAME_TEMPLATE   A template for the name of the fetched files.
  -g, --grab-cookies                      Whether to attempt to grab cookies from your computer's chrome browser.
  --standardize-encoding                  Whether to systematically convert retrieved text to UTF-8.
  -o OUTPUT, --output OUTPUT              Path to the output report file. By default, the report will be printed to stdout.
  -s SELECT, --select SELECT              Columns to include in report (separated by `,`).
  -t THREADS, --threads THREADS           Number of threads to use. Defaults to 25.
  --throttle THROTTLE                     Time to wait - in seconds - between 2 calls to the same domain. Defaults to 0.2.
  --total TOTAL                           Total number of lines in CSV file. Necessary if you want to display a finite progress indicator.
  --url-template URL_TEMPLATE             A template for the urls to fetch. Handy e.g. if you need to build urls from ids etc.

examples:

. Fetching a batch of url from existing CSV file:
    `minet fetch url_column file.csv > report.csv`

. CSV input from stdin:
    `xsv select url_column file.csv | minet fetch url_column > report.csv`

. Fetching a single url, useful to pipe into `minet scrape`:
    `minet fetch http://google.com | minet scrape ./scrape.json > scraped.csv`

extract

If you want to be able to use the extract command, you will need to install the dragnet library. Because it is a bit cumbersome to install, it's not included in minet's dependencies yet.

Just run the following & in the same order (dragnet needs to have specific deps installed before it can be able to compile its native files):

pip install lxml numpy Cython
pip install dragnet
Minet Extract Command
=====================

Use multiple processes to extract raw text from a batch of HTML files.
This command can either work on a `minet fetch` report or on a bunch
of files. It will output an augmented report with the extracted text.

positional arguments:
  report                                          Input CSV fetch action report file.

optional arguments:
  -h, --help                                      show this help message and exit
  -e {dragnet,html2text}, --extractor {dragnet,html2text}
                                                  Extraction engine to use. Defaults to `dragnet`.
  -i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
                                                  Directory where the HTML files are stored. Defaults to "content".
  -o OUTPUT, --output OUTPUT                      Path to the output report file. By default, the report will be printed to stdout.
  -p PROCESSES, --processes PROCESSES             Number of processes to use. Defaults to 4.
  -s SELECT, --select SELECT                      Columns to include in report (separated by `,`).
  --total TOTAL                                   Total number of HTML documents. Necessary if you want to display a finite progress indicator.

examples:

. Extracting raw text from a `minet fetch` report:
    `minet extract report.csv > extracted.csv`

. Working on a report from stdin:
    `minet fetch url_column file.csv | minet extract > extracted.csv`

. Extracting raw text from a bunch of files:
    `minet extract --glob "./content/*.html" > extracted.csv`

scrape

TODO: document the scraping DSL

Minet Scrape Command
====================

Use multiple processes to scrape data from a batch of HTML files.
This command can either work on a `minet fetch` report or on a bunch
of files. It will output the scraped items.

positional arguments:
  scraper                                         Path to a scraper definition file.
  report                                          Input CSV fetch action report file.

optional arguments:
  -h, --help                                      show this help message and exit
  -i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
                                                  Directory where the HTML files are stored. Defaults to "content".
  -o OUTPUT, --output OUTPUT                      Path to the output report file. By default, the report will be printed to stdout.
  -p PROCESSES, --processes PROCESSES             Number of processes to use. Defaults to 4.
  --total TOTAL                                   Total number of HTML documents. Necessary if you want to display a finite progress indicator.

examples:

. Scraping item from a `minet fetch` report:
    `minet scrape scraper.json report.csv > scraped.csv`

. Working on a report from stdin:
    `minet fetch url_column file.csv | minet fetch scraper.json > scraped.csv`

. Scraping items from a bunch of files:
    `minet scrape scraper.json --glob "./content/*.html" > scraped.csv`

CrowdTangle

leaderboard

Minet CrowdTangle Leaderboard Command
=====================================

Gather information and aggregated stats about pages and groups of
the designated dashboard (indicated by a given token).

optional arguments:
  -h, --help                            show this help message and exit
  -o OUTPUT, --output OUTPUT            Path to the output file. By default, everything will be printed to stdout.
  -t TOKEN, --token TOKEN               CrowdTangle dashboard API token.
  --no-breakdown                        Whether to skip statistics breakdown by post type in the CSV output.
  -f {csv,jsonl}, --format {csv,jsonl}  Output format. Defaults to `csv`.
  -l LIMIT, --limit LIMIT               Maximum number of posts to retrieve. Will fetch every post by default.
  --list-id LIST_ID                     Optional list id from which to retrieve accounts.

examples:

. Fetching accounts statistics for every account in your dashboard:
    `minet ct leaderboard --token YOUR_TOKEN > accounts-stats.csv`

lists

Minet CrowdTangle Lists Command
===============================

Retrieve the lists from a CrowdTangle dashboard (indicated by a
given token).

optional arguments:
  -h, --help                  show this help message and exit
  -o OUTPUT, --output OUTPUT  Path to the output file. By default, everything will be printed to stdout.
  -t TOKEN, --token TOKEN     CrowdTangle dashboard API token.

examples:

. Fetching a dashboard's lists:
    `minet ct lists --token YOUR_TOKEN > lists.csv`

posts

Minet CrowdTangle Posts Command
===============================

Gather post data from the designated dashboard (indicated by
a given token).

optional arguments:
  -h, --help                                      show this help message and exit
  -o OUTPUT, --output OUTPUT                      Path to the output file. By default, everything will be printed to stdout.
  -t TOKEN, --token TOKEN                         CrowdTangle dashboard API token.
  --end-date END_DATE                             The latest date at which a post could be posted (UTC!).
  -f {csv,jsonl}, --format {csv,jsonl}            Output format. Defaults to `csv`.
  --language LANGUAGE                             Language of posts to retrieve.
  -l LIMIT, --limit LIMIT                         Maximum number of posts to retrieve. Will fetch every post by default.
  --list-ids LIST_IDS                             Ids of the lists from which to retrieve posts, separated by commas.
  --sort-by {date,interaction_rate,overperforming,total_interactions,underperforming}
                                                  The order in which to retrieve posts. Defaults to `date`.
  --start-date START_DATE                         The earliest date at which a post could be posted (UTC!).
  --url-report URL_REPORT                         Path to an optional report file to write about urls found in posts.

examples:

. Fetching the 500 most latest posts from a dashboard:
    `minet ct posts --token YOUR_TOKEN --limit 500 > latest-posts.csv`

search

Minet CrowdTangle Search Command
================================

Search posts on the whole CrowdTangle platform.

positional arguments:
  terms                                           The search query term or terms.

optional arguments:
  -h, --help                                      show this help message and exit
  -o OUTPUT, --output OUTPUT                      Path to the output file. By default, everything will be printed to stdout.
  -t TOKEN, --token TOKEN                         CrowdTangle dashboard API token.
  --end-date END_DATE                             The latest date at which a post could be posted (UTC!).
  -f {csv,jsonl}, --format {csv,jsonl}            Output format. Defaults to `csv`.
  -l LIMIT, --limit LIMIT                         Maximum number of posts to retrieve. Will fetch every post by default.
  --offset OFFSET                                 Count offset.
  --partition-strategy {day}                      Query partition strategy to use to overcome the API search result limits.
  -p PLATFORMS, --platforms PLATFORMS             The platforms, separated by comma from which to retrieve posts.
  --sort-by {date,interaction_rate,overperforming,total_interactions,underperforming}
                                                  The order in which to retrieve posts. Defaults to `date`.
  --start-date START_DATE                         The earliest date at which a post could be posted (UTC!).
  --types TYPES                                   Types of post to include, separated by comma.
  --url-report URL_REPORT                         Path to an optional report file to write about urls found in posts.

examples:

. Fetching a dashboard's lists:
    `minet ct search --token YOUR_TOKEN > posts.csv`

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

minet-0.10.0.tar.gz (26.9 kB view details)

Uploaded Source

Built Distribution

minet-0.10.0-py3-none-any.whl (33.6 kB view details)

Uploaded Python 3

File details

Details for the file minet-0.10.0.tar.gz.

File metadata

  • Download URL: minet-0.10.0.tar.gz
  • Upload date:
  • Size: 26.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.4

File hashes

Hashes for minet-0.10.0.tar.gz
Algorithm Hash digest
SHA256 cbf8a355b9e11a28ccae3ffb6f103cc136aba5a7d8f63621e15ed6bbe013611f
MD5 a183704ae41d443514dceb2ae7358f64
BLAKE2b-256 d03dabffc0d43e976382a186154a5bc0bcd8b347d70d4a971909de95c738a3f3

See more details on using hashes here.

Provenance

File details

Details for the file minet-0.10.0-py3-none-any.whl.

File metadata

  • Download URL: minet-0.10.0-py3-none-any.whl
  • Upload date:
  • Size: 33.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.4

File hashes

Hashes for minet-0.10.0-py3-none-any.whl
Algorithm Hash digest
SHA256 966f504749e9651ca8d42351b25b087d3d22607d38282fb0996b2ac10ef05715
MD5 98ef59fdb008a75f0dba228236aa4ac8
BLAKE2b-256 dc36b27479869d5b55ef8046a17999db3f12a3c2737f5b1e70b62179590c062f

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page