minet

A webmining CLI tool & library for Python.

Project description

Minet

minet is a webmining CLI tool & library for Python. It takes a lo-fi approach to various webmining problems by letting you perform a variety of actions from the comfort of your command line. No database needed: raw data files will get you going.

Features

Multithreaded, memory-efficient fetching from the web.
Multiprocessed raw text content extraction from HTML pages.
Multiprocessed scraping from HTML pages using a comfy JSON DSL.
Data collection from various APIs such as CrowdTangle.

Installation

minet can be installed using pip:

pip install minet

Commands

Basic commands

fetch
extract
scrape

API-related commands

CrowdTangle (ct)
- leaderboard
- lists
- posts
- search

fetch

Minet Fetch Command
===================

Use multiple threads to fetch batches of urls from a CSV file. The
command outputs a CSV report with additional metadata about the
HTTP calls and will generally write the retrieved files in a folder
given by the user.

positional arguments:
  column                                  Column of the CSV file containing urls to fetch.
  file                                    CSV file containing the urls to fetch.

optional arguments:
  -h, --help                              show this help message and exit
  --contents-in-report                    Whether to include retrieved contents, e.g. html, directly in the report
                                          and avoid writing them in a separate folder. This requires to standardize
                                          encoding and won't work on binary formats.
  -d OUTPUT_DIR, --output-dir OUTPUT_DIR  Directory where the fetched files will be written. Defaults to "content".
  -f FILENAME, --filename FILENAME        Name of the column used to build retrieved file names. Defaults to an uuid v4 with correct extension.
  --filename-template FILENAME_TEMPLATE   A template for the name of the fetched files.
  -g, --grab-cookies                      Whether to attempt to grab cookies from your computer's chrome browser.
  --standardize-encoding                  Whether to systematically convert retrieved text to UTF-8.
  -o OUTPUT, --output OUTPUT              Path to the output report file. By default, the report will be printed to stdout.
  -s SELECT, --select SELECT              Columns to include in report (separated by `,`).
  -t THREADS, --threads THREADS           Number of threads to use. Defaults to 25.
  --throttle THROTTLE                     Time to wait - in seconds - between 2 calls to the same domain. Defaults to 0.2.
  --total TOTAL                           Total number of lines in CSV file. Necessary if you want to display a finite progress indicator.
  --url-template URL_TEMPLATE             A template for the urls to fetch. Handy e.g. if you need to build urls from ids etc.

examples:

. Fetching a batch of urls from an existing CSV file:
    `minet fetch url_column file.csv > report.csv`

. CSV input from stdin:
    `xsv select url_column file.csv | minet fetch url_column > report.csv`

. Fetching a single url, useful to pipe into `minet scrape`:
    `minet fetch http://google.com | minet scrape ./scrape.json > scraped.csv`
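
As a minimal sketch of preparing input for the fetch command, the Python snippet below writes a one-column CSV file of urls and counts its data rows so that the count can be passed to `--total`. The helper names, the `url` column name and the file names are illustrative assumptions and are not part of minet; only the `fetch` arguments and the `--total` flag come from the help text above.

```python
import csv
import shlex


def write_url_csv(path, urls, column="url"):
    """Write urls to a one-column CSV file and return the number of data rows."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([column])  # header row naming the url column
        for url in urls:
            writer.writerow([url])
            count += 1
    return count


def count_rows(path):
    """Count the data rows (header excluded) of an existing CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def fetch_command(column, path, total):
    """Build a shell command line using only arguments listed in the help above."""
    args = ["minet", "fetch", column, path, "--total", str(total)]
    return " ".join(shlex.quote(arg) for arg in args)
```

For instance, after `write_url_csv("urls.csv", my_urls)`, the string returned by `fetch_command("url", "urls.csv", count_rows("urls.csv"))` can be run in a shell with `> report.csv` appended to save the report.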

extract

If you want to use the extract command, you will need to install the dragnet library. Because it is a bit cumbersome to install, it is not included in minet's dependencies yet.

Run the following commands in this order (dragnet needs specific dependencies installed before it can compile its native files):

pip install lxml numpy Cython
pip install dragnet
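
To check whether these packages are already available before running `minet extract`, a small Python sketch such as the following may help. The helper name `missing_modules` is an assumption made for illustration; the module names are simply the import names of the packages listed above.

```python
import importlib.util

# Import names of the packages listed above, in installation order.
DRAGNET_REQUIREMENTS = ["lxml", "numpy", "Cython", "dragnet"]


def missing_modules(names=DRAGNET_REQUIREMENTS):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

If `missing_modules()` returns a non-empty list, running the two pip commands above in the given order should install what is missing.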

Minet Extract Command
=====================

Use multiple processes to extract raw text from a batch of HTML files.
This command can either work on a `minet fetch` report or on a bunch
of files. It will output an augmented report with the extracted text.

positional arguments:
  report                                          Input CSV fetch action report file.

optional arguments:
  -h, --help                                      show this help message and exit
  -e {dragnet,html2text}, --extractor {dragnet,html2text}
                                                  Extraction engine to use. Defaults to `dragnet`.
  -i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
                                                  Directory where the HTML files are stored. Defaults to "content".
  -o OUTPUT, --output OUTPUT                      Path to the output report file. By default, the report will be printed to stdout.
  -p PROCESSES, --processes PROCESSES             Number of processes to use. Defaults to 4.
  -s SELECT, --select SELECT                      Columns to include in report (separated by `,`).
  --total TOTAL                                   Total number of HTML documents. Necessary if you want to display a finite progress indicator.

examples:

. Extracting raw text from a `minet fetch` report:
    `minet extract report.csv > extracted.csv`

. Working on a report from stdin:
    `minet fetch url_column file.csv | minet extract > extracted.csv`

. Extracting raw text from a bunch of files:
    `minet extract --glob "./content/*.html" > extracted.csv`

scrape

TODO: document the scraping DSL

Minet Scrape Command
====================

Use multiple processes to scrape data from a batch of HTML files.
This command can either work on a `minet fetch` report or on a bunch
of files. It will output the scraped items.

positional arguments:
  scraper                                         Path to a scraper definition file.
  report                                          Input CSV fetch action report file.

optional arguments:
  -h, --help                                      show this help message and exit
  -i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
                                                  Directory where the HTML files are stored. Defaults to "content".
  -o OUTPUT, --output OUTPUT                      Path to the output report file. By default, the report will be printed to stdout.
  -p PROCESSES, --processes PROCESSES             Number of processes to use. Defaults to 4.
  --total TOTAL                                   Total number of HTML documents. Necessary if you want to display a finite progress indicator.

examples:

. Scraping items from a `minet fetch` report:
    `minet scrape scraper.json report.csv > scraped.csv`

. Working on a report from stdin:
    `minet fetch url_column file.csv | minet scrape scraper.json > scraped.csv`

. Scraping items from a bunch of files:
    `minet scrape scraper.json --glob "./content/*.html" > scraped.csv`

CrowdTangle

leaderboard

Minet CrowdTangle Leaderboard Command
=====================================

Gather information and aggregated stats about pages and groups of
the designated dashboard (indicated by a given token).

optional arguments:
  -h, --help                            show this help message and exit
  -o OUTPUT, --output OUTPUT            Path to the output file. By default, everything will be printed to stdout.
  -t TOKEN, --token TOKEN               CrowdTangle dashboard API token.
  --no-breakdown                        Whether to skip statistics breakdown by post type in the CSV output.
  -f {csv,jsonl}, --format {csv,jsonl}  Output format. Defaults to `csv`.
  -l LIMIT, --limit LIMIT               Maximum number of posts to retrieve. Will fetch every post by default.
  --list-id LIST_ID                     Optional list id from which to retrieve accounts.

examples:

. Fetching accounts statistics for every account in your dashboard:
    `minet ct leaderboard --token YOUR_TOKEN > accounts-stats.csv`

lists

Minet CrowdTangle Lists Command
===============================

Retrieve the lists from a CrowdTangle dashboard (indicated by a
given token).

optional arguments:
  -h, --help                  show this help message and exit
  -o OUTPUT, --output OUTPUT  Path to the output file. By default, everything will be printed to stdout.
  -t TOKEN, --token TOKEN     CrowdTangle dashboard API token.

examples:

. Fetching a dashboard's lists:
    `minet ct lists --token YOUR_TOKEN > lists.csv`

posts

Minet CrowdTangle Posts Command
===============================

Gather post data from the designated dashboard (indicated by
a given token).

optional arguments:
  -h, --help                                      show this help message and exit
  -o OUTPUT, --output OUTPUT                      Path to the output file. By default, everything will be printed to stdout.
  -t TOKEN, --token TOKEN                         CrowdTangle dashboard API token.
  --end-date END_DATE                             The latest date at which a post could be posted (UTC!).
  -f {csv,jsonl}, --format {csv,jsonl}            Output format. Defaults to `csv`.
  --language LANGUAGE                             Language of posts to retrieve.
  -l LIMIT, --limit LIMIT                         Maximum number of posts to retrieve. Will fetch every post by default.
  --list-ids LIST_IDS                             Ids of the lists from which to retrieve posts, separated by commas.
  --sort-by {date,interaction_rate,overperforming,total_interactions,underperforming}
                                                  The order in which to retrieve posts. Defaults to `date`.
  --start-date START_DATE                         The earliest date at which a post could be posted (UTC!).
  --url-report URL_REPORT                         Path to an optional report file to write about urls found in posts.

examples:

. Fetching the 500 latest posts from a dashboard:
    `minet ct posts --token YOUR_TOKEN --limit 500 > latest-posts.csv`
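
When the command is driven from a script, the options above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `ct_posts_args`; it only uses flags and choices listed in the help text above and does not call minet itself.

```python
# Choices copied from the --sort-by and --format options above.
SORT_CHOICES = ("date", "interaction_rate", "overperforming",
                "total_interactions", "underperforming")
FORMAT_CHOICES = ("csv", "jsonl")


def ct_posts_args(token, list_ids=(), limit=None, sort_by=None, fmt=None):
    """Build an argument list for `minet ct posts` (hypothetical helper)."""
    args = ["minet", "ct", "posts", "--token", token]
    if list_ids:
        # --list-ids expects the ids separated by commas
        args += ["--list-ids", ",".join(str(i) for i in list_ids)]
    if limit is not None:
        args += ["--limit", str(limit)]
    if sort_by is not None:
        if sort_by not in SORT_CHOICES:
            raise ValueError("unknown sort order: %r" % (sort_by,))
        args += ["--sort-by", sort_by]
    if fmt is not None:
        if fmt not in FORMAT_CHOICES:
            raise ValueError("unknown format: %r" % (fmt,))
        args += ["--format", fmt]
    return args
```

The returned list can then be handed to `subprocess.run`, with `stdout` pointing at an open file, to reproduce the shell redirection used in the example above.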

search

Minet CrowdTangle Search Command
================================

Search posts on the whole CrowdTangle platform.

positional arguments:
  terms                                           The search query term or terms.

optional arguments:
  -h, --help                                      show this help message and exit
  -o OUTPUT, --output OUTPUT                      Path to the output file. By default, everything will be printed to stdout.
  -t TOKEN, --token TOKEN                         CrowdTangle dashboard API token.
  --end-date END_DATE                             The latest date at which a post could be posted (UTC!).
  -f {csv,jsonl}, --format {csv,jsonl}            Output format. Defaults to `csv`.
  -l LIMIT, --limit LIMIT                         Maximum number of posts to retrieve. Will fetch every post by default.
  --offset OFFSET                                 Count offset.
  -p PLATFORMS, --platforms PLATFORMS             The platforms, separated by comma from which to retrieve posts.
  --sort-by {date,interaction_rate,overperforming,total_interactions,underperforming}
                                                  The order in which to retrieve posts. Defaults to `date`.
  --start-date START_DATE                         The earliest date at which a post could be posted (UTC!).
  --types TYPES                                   Types of post to include, separated by comma.
  --url-report URL_REPORT                         Path to an optional report file to write about urls found in posts.

examples:

. Searching posts on the whole platform:
    `minet ct search --token YOUR_TOKEN > posts.csv`
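
When `--format jsonl` is used, the output can be consumed line by line. The sketch below assumes that `jsonl` means JSON Lines, that is one JSON document per line; the helper name `read_jsonl` is an assumption, and the actual fields of each post are not documented here, so none are relied upon.

```python
import json


def read_jsonl(lines):
    """Yield one decoded JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)
```

A file produced with `--output posts.jsonl` could be read with `with open("posts.jsonl", encoding="utf-8") as f: posts = list(read_jsonl(f))`, which keeps memory usage low when iterated lazily instead of collected into a list.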

Download files

Source Distribution

minet-0.7.0.tar.gz (19.8 kB)

Uploaded Jul 23, 2019 Source

Built Distribution

minet-0.7.0-py3-none-any.whl (26.2 kB)

Uploaded Jul 23, 2019 Python 3

File details

Details for the file minet-0.7.0.tar.gz.

File metadata

Download URL: minet-0.7.0.tar.gz
Upload date: Jul 23, 2019
Size: 19.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.5

File hashes

Hashes for minet-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`3604d3a37141226b3c8fe432dc30defe6d8e078770a5919ecd8a3c0f9e6263a3`
MD5	`aa01884889716e86f42459220d0aa0f4`
BLAKE2b-256	`f801ecdd1ed0fe2dd67c5c8e8bf3bcd1e20bd8eb964ee9660897a45b027afcc7`

File details

Details for the file minet-0.7.0-py3-none-any.whl.

File metadata

Download URL: minet-0.7.0-py3-none-any.whl
Upload date: Jul 23, 2019
Size: 26.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.5

File hashes

Hashes for minet-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c3b34b456eba20af4aaf9f2105c84917f30258bfe0211299ef629cbee107daf9`
MD5	`a81f17cd8fefea54482411a6ebc9999e`
BLAKE2b-256	`3212042c7a0d841acb488c380073b0ccd3323988e34ac4ca871ea85c47dbfa68`