A webmining CLI tool & library for python.
minet is a webmining CLI tool & library for python. It adopts a lo-fi approach to various webmining problems by letting you perform a variety of actions from the comfort of your command line. No database needed: raw data files will get you going.
In addition, minet also exposes its high-level programmatic interface as a library so you can tweak its behavior at will.
Features
- Multithreaded, memory-efficient fetching from the web.
- Multithreaded, scalable crawling using a comfy DSL.
- Multiprocessed raw text content extraction from HTML pages.
- Multiprocessed scraping from HTML pages using a comfy DSL.
- URL-related heuristics utilities such as normalization and matching.
- Data collection from various APIs such as CrowdTangle.
Installation
minet can be installed using pip:
pip install minet
Cookbook
To learn how to use minet and understand how it may fit your use cases, you should definitely check out our Cookbook.
Usage
CLI
Global utilities
Basic commands
Platform-related commands
API
CLI
-h/--help
If you need help about a command, don't hesitate to use the -h/--help flag or the help command:
minet ct posts -h
# or:
minet ct posts --help
# or
minet help ct posts
To check the installed version of minet, you can use the --version flag:
minet --version
>>> minet x.x.x
crawl
usage: minet crawl [-h] [-d OUTPUT_DIR] [--resume] [--throttle THROTTLE] crawler
Minet Crawl Command
===================
Use multiple threads to crawl the web using minet crawling and
scraping DSL.
positional arguments:
crawler Path to the crawler definition file.
optional arguments:
-h, --help show this help message and exit
-d OUTPUT_DIR, --output-dir OUTPUT_DIR Output directory.
--resume Whether to resume an interrupted crawl.
--throttle THROTTLE Time to wait - in seconds - between 2 calls to the same domain. Defaults to 0.2.
examples:
. TODO:
`minet crawl`
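The --throttle option above spaces out calls to the same domain. As a rough illustration of the idea (this is not minet's actual implementation), per-domain throttling can be sketched as follows. The DomainThrottle class and its injectable clock and sleep parameters are hypothetical names chosen for this example.

```python
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Tracks, per domain, the earliest time the next call is allowed."""

    def __init__(self, delay=0.2, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}

    def wait(self, url):
        """Block until the url's domain may be hit again; return the wait applied."""
        domain = urlsplit(url).netloc
        now = self.clock()
        ready_at = self.next_allowed.get(domain, now)
        pause = max(0.0, ready_at - now)
        if pause > 0:
            self.sleep(pause)
        # The next call to this domain must wait at least `delay` more seconds.
        self.next_allowed[domain] = max(now, ready_at) + self.delay
        return pause
```

Calling wait() before each request delays only the calls hitting a recently used domain, so requests to other domains are not slowed down.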
fetch
usage: minet fetch [-h] [--compress] [--contents-in-report] [-d OUTPUT_DIR]
[-f FILENAME] [--filename-template FILENAME_TEMPLATE]
[-g {firefox,chrome}] [-H HEADERS] [--resume]
[--standardize-encoding] [-o OUTPUT] [-s SELECT] [-t THREADS]
[--throttle THROTTLE] [--total TOTAL]
[--url-template URL_TEMPLATE] [-X METHOD]
column [file]
Minet Fetch Command
===================
Use multiple threads to fetch batches of urls from a CSV file. The
command outputs a CSV report with additional metadata about the
HTTP calls and will generally write the retrieved files in a folder
given by the user.
positional arguments:
column Column of the CSV file containing urls to fetch.
file CSV file containing the urls to fetch.
optional arguments:
-h, --help show this help message and exit
--compress Whether to compress the contents.
--contents-in-report, --no-contents-in-report Whether to include retrieved contents, e.g. html, directly in the report
and avoid writing them in a separate folder. This requires standardizing the
encoding and won't work on binary formats.
-d OUTPUT_DIR, --output-dir OUTPUT_DIR Directory where the fetched files will be written. Defaults to "content".
-f FILENAME, --filename FILENAME Name of the column used to build retrieved file names. Defaults to an uuid v4 with correct extension.
--filename-template FILENAME_TEMPLATE A template for the name of the fetched files.
-g {firefox,chrome}, --grab-cookies {firefox,chrome}
Whether to attempt to grab cookies from your computer's browser.
-H HEADERS, --header HEADERS Custom headers used with every request.
--resume Whether to resume from an aborted report.
--standardize-encoding Whether to systematically convert retrieved text to UTF-8.
-o OUTPUT, --output OUTPUT Path to the output report file. By default, the report will be printed to stdout.
-s SELECT, --select SELECT Columns to include in report (separated by `,`).
-t THREADS, --threads THREADS Number of threads to use. Defaults to 25.
--throttle THROTTLE Time to wait - in seconds - between 2 calls to the same domain. Defaults to 0.2.
--total TOTAL Total number of lines in CSV file. Necessary if you want to display a finite progress indicator.
--url-template URL_TEMPLATE A template for the urls to fetch. Handy e.g. if you need to build urls from ids etc.
-X METHOD, --request METHOD The http method to use. Will default to GET.
examples:
. Fetching a batch of urls from an existing CSV file:
`minet fetch url_column file.csv > report.csv`
. CSV input from stdin:
`xsv select url_column file.csv | minet fetch url_column > report.csv`
. Fetching a single url, useful to pipe into `minet scrape`:
`minet fetch http://google.com | minet scrape ./scrape.json > scraped.csv`
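If your urls live in a Python program rather than in a file, a CSV suitable as input for `minet fetch` can be produced with the standard library. The sketch below is only an illustration: the helper names and the `url` column name are assumptions, and whether --total should include the header row is not stated above, so count_rows simply excludes it.

```python
import csv
import io


def write_url_csv(urls, handle, column='url'):
    """Write one url per row under a single header column."""
    writer = csv.writer(handle)
    writer.writerow([column])
    for url in urls:
        writer.writerow([url])


def count_rows(handle):
    """Count data rows (header excluded), e.g. as a candidate value for --total."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)


# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_url_csv(['https://google.com', 'https://twitter.com'], buffer)
buffer.seek(0)
total = count_rows(buffer)
print(total)
```

With a real file, the resulting CSV could then be passed to the command as in the first example above, using `url` as the column argument.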
extract
If you want to be able to use the extract command, you will need to install the dragnet library. Because it is a bit cumbersome to install, it's not included in minet's dependencies yet.
Just run the following commands, in this order (dragnet needs specific dependencies installed before it can compile its native files):
pip install lxml numpy Cython
pip install dragnet
usage: minet extract [-h] [-e {dragnet,html2text}] [-i INPUT_DIRECTORY]
[-o OUTPUT] [-p PROCESSES] [-s SELECT] [--total TOTAL]
[report]
Minet Extract Command
=====================
Use multiple processes to extract raw text from a batch of HTML files.
This command can either work on a `minet fetch` report or on a bunch
of files. It will output an augmented report with the extracted text.
positional arguments:
report Input CSV fetch action report file.
optional arguments:
-h, --help show this help message and exit
-e {dragnet,html2text}, --extractor {dragnet,html2text}
Extraction engine to use. Defaults to `dragnet`.
-i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
Directory where the HTML files are stored. Defaults to "content".
-o OUTPUT, --output OUTPUT Path to the output report file. By default, the report will be printed to stdout.
-p PROCESSES, --processes PROCESSES Number of processes to use. Defaults to 4.
-s SELECT, --select SELECT Columns to include in report (separated by `,`).
--total TOTAL Total number of HTML documents. Necessary if you want to display a finite progress indicator.
examples:
. Extracting raw text from a `minet fetch` report:
`minet extract report.csv > extracted.csv`
. Working on a report from stdin:
`minet fetch url_column file.csv | minet extract > extracted.csv`
. Extracting raw text from a bunch of files:
`minet extract --glob "./content/*.html" > extracted.csv`
scrape
TODO: document the scraping DSL
usage: minet scrape [-h] [-f {csv,jsonl}] [-g GLOB] [-i INPUT_DIRECTORY]
[-o OUTPUT] [-p PROCESSES] [--total TOTAL]
scraper [report]
Minet Scrape Command
====================
Use multiple processes to scrape data from a batch of HTML files.
This command can either work on a `minet fetch` report or on a bunch
of files. It will output the scraped items.
positional arguments:
scraper Path to a scraper definition file.
report Input CSV fetch action report file.
optional arguments:
-h, --help show this help message and exit
-f {csv,jsonl}, --format {csv,jsonl} Output format.
-g GLOB, --glob GLOB Whether to scrape a bunch of html files on disk matched by a glob pattern rather than sourcing them from a CSV report.
-i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
Directory where the HTML files are stored. Defaults to "content".
-o OUTPUT, --output OUTPUT Path to the output report file. By default, the report will be printed to stdout.
-p PROCESSES, --processes PROCESSES Number of processes to use. Defaults to 4.
--total TOTAL Total number of HTML documents. Necessary if you want to display a finite progress indicator.
examples:
. Scraping items from a `minet fetch` report:
`minet scrape scraper.json report.csv > scraped.csv`
. Working on a report from stdin:
`minet fetch url_column file.csv | minet scrape scraper.json > scraped.csv`
. Scraping a single page from the web:
`minet fetch https://news.ycombinator.com/ | minet scrape scraper.json > scraped.csv`
. Scraping items from a bunch of files:
`minet scrape scraper.json --glob "./content/*.html" > scraped.csv`
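When the jsonl output format is selected, each line of the output should hold one JSON document. A minimal sketch for reading such a file back in Python could look like the following. The read_jsonl helper and the `title` field are hypothetical: actual field names depend on your scraper definition.

```python
import io
import json


def read_jsonl(handle):
    """Yield one decoded object per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical scraped items; an in-memory buffer stands in for a real file.
sample = io.StringIO('{"title": "First"}\n\n{"title": "Second"}\n')
items = list(read_jsonl(sample))
print(len(items))
```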
url-join
usage: minet url-join [-h] [-o OUTPUT] [-s SELECT] column1 file1 column2 file2
Minet Url Join Command
======================
Join two CSV files by matching them on columns containing urls. In
fact, the command will index the first file's urls into a
hierarchical trie before attempting to match the second file's ones.
positional arguments:
column1 Name of the url column in the first file.
file1 Path to the first file.
column2 Name of the url column in the second file.
file2 Path to the second file.
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT Path to the output joined file. By default, the join will be printed to stdout.
-s SELECT, --select SELECT Columns from the first file to keep, separated by comma.
examples:
. Joining two files:
`minet url-join url webentities.csv post_url posts.csv > joined.csv`
. Keeping only some columns from first file:
`minet url-join url webentities.csv post_url posts.csv -s url,id > joined.csv`
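To give an intuition of how indexing urls into a hierarchical trie allows prefix matching, here is a simplified sketch. It is not minet's implementation: the UrlTrie class, the url_tokens helper and the tokenization rules (reversed host labels followed by path parts, `www.` dropped) are assumptions made for the example.

```python
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a url into hierarchical tokens: reversed host labels, then path parts."""
    parts = urlsplit(url if '://' in url else '//' + url)
    host = parts.netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    tokens = list(reversed(host.split('.')))
    tokens.extend(p for p in parts.path.split('/') if p)
    return tokens


class UrlTrie:
    """Maps url prefixes to values and finds the longest matching prefix."""

    _VALUE = object()  # sentinel key marking the end of an indexed url

    def __init__(self):
        self.root = {}

    def set(self, url, value):
        node = self.root
        for token in url_tokens(url):
            node = node.setdefault(token, {})
        node[self._VALUE] = value

    def match(self, url):
        """Return the value of the longest indexed prefix of url, or None."""
        node = self.root
        found = node.get(self._VALUE)
        for token in url_tokens(url):
            node = node.get(token)
            if node is None:
                break
            found = node.get(self._VALUE, found)
        return found


trie = UrlTrie()
trie.set('https://www.lemonde.fr', 'Le Monde')
trie.set('https://www.lemonde.fr/politique', 'Le Monde Politique')
print(trie.match('http://lemonde.fr/politique/article.html'))
```

In this sketch, the urls of the first file play the role of the indexed prefixes, and each url of the second file is matched against the most specific prefix found.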
url-parse
usage: minet url-parse [-h] [-o OUTPUT] [-s SELECT] [--separator SEPARATOR]
[--total TOTAL]
column [file]
Minet Url Parse Command
=======================
Overload a CSV file containing urls with a selection of additional
metadata such as their normalized version, domain name etc.
positional arguments:
column Name of the column containing urls.
file Target CSV file.
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT Path to the output file. By default, the result will be printed to stdout.
-s SELECT, --select SELECT Columns to keep in output, separated by comma.
--separator SEPARATOR Split url column by a separator?
--total TOTAL Total number of lines in CSV file. Necessary if you want to display a finite progress indicator.
examples:
. Creating a report about a file's urls:
`minet url-parse url posts.csv > report.csv`
. Keeping only selected columns from the input file:
`minet url-parse url posts.csv -s id,url,title > report.csv`
. Multiple urls joined by separator:
`minet url-parse urls posts.csv --separator "|" > report.csv`
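For comparison, the kind of work involved in handling a cell holding several urls can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper names are hypothetical, and the fields returned here are not the columns the command actually adds.

```python
from urllib.parse import urlsplit


def split_urls(cell, separator='|'):
    """Split a cell holding several urls and drop empty fragments."""
    return [u.strip() for u in cell.split(separator) if u.strip()]


def basic_metadata(url):
    """Return a few url-derived fields using only the standard library."""
    parts = urlsplit(url)
    return {
        'url': url,
        'scheme': parts.scheme,
        'hostname': parts.hostname,
        'path': parts.path,
    }


cell = 'https://www.lemonde.fr/politique|http://example.com/a?b=1'
rows = [basic_metadata(u) for u in split_urls(cell)]
for row in rows:
    print(row['hostname'], row['path'])
```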
CrowdTangle
usage: minet crowdtangle [-h] [--rate-limit RATE_LIMIT] [-o OUTPUT] [-t TOKEN]
{leaderboard,lists,posts,search,summary} ...
Minet Crowdtangle Command
=========================
Gather data from the CrowdTangle APIs easily and efficiently.
optional arguments:
-h, --help show this help message and exit
--rate-limit RATE_LIMIT Authorized number of hits per minute. Defaults to 6.
-o OUTPUT, --output OUTPUT Path to the output file. By default, everything will be printed to stdout.
-t TOKEN, --token TOKEN CrowdTangle dashboard API token.
actions:
{leaderboard,lists,posts,search,summary} Action to perform using the CrowdTangle API.
leaderboard
usage: minet crowdtangle leaderboard [-h] [--rate-limit RATE_LIMIT] [-o OUTPUT]
[-t TOKEN] [--no-breakdown]
[-f {csv,jsonl}] [-l LIMIT]
[--list-id LIST_ID]
Minet CrowdTangle Leaderboard Command
=====================================
Gather information and aggregated stats about pages and groups of
the designated dashboard (indicated by a given token).
optional arguments:
-h, --help show this help message and exit
--rate-limit RATE_LIMIT Authorized number of hits per minute. Defaults to 6.
-o OUTPUT, --output OUTPUT Path to the output file. By default, everything will be printed to stdout.
-t TOKEN, --token TOKEN CrowdTangle dashboard API token.
--no-breakdown Whether to skip statistics breakdown by post type in the CSV output.
-f {csv,jsonl}, --format {csv,jsonl} Output format. Defaults to `csv`.
-l LIMIT, --limit LIMIT Maximum number of posts to retrieve. Will fetch every post by default.
--list-id LIST_ID Optional list id from which to retrieve accounts.
examples:
. Fetching accounts statistics for every account in your dashboard:
`minet ct leaderboard --token YOUR_TOKEN > accounts-stats.csv`
lists
usage: minet crowdtangle lists [-h] [--rate-limit RATE_LIMIT] [-o OUTPUT]
[-t TOKEN]
Minet CrowdTangle Lists Command
===============================
Retrieve the lists from a CrowdTangle dashboard (indicated by a
given token).
optional arguments:
-h, --help show this help message and exit
--rate-limit RATE_LIMIT Authorized number of hits per minute. Defaults to 6.
-o OUTPUT, --output OUTPUT Path to the output file. By default, everything will be printed to stdout.
-t TOKEN, --token TOKEN CrowdTangle dashboard API token.
examples:
. Fetching a dashboard's lists:
`minet ct lists --token YOUR_TOKEN > lists.csv`
posts
usage: minet crowdtangle posts [-h] [--rate-limit RATE_LIMIT] [-o OUTPUT]
[-t TOKEN] [--end-date END_DATE] [-f {csv,jsonl}]
[--language LANGUAGE] [-l LIMIT]
[--list-ids LIST_IDS]
[--partition-strategy PARTITION_STRATEGY]
[--resume]
[--sort-by {date,interaction_rate,overperforming,total_interactions,underperforming}]
[--start-date START_DATE]
[--url-report URL_REPORT]
Minet CrowdTangle Posts Command
===============================
Gather post data from the designated dashboard (indicated by
a given token).
optional arguments:
-h, --help show this help message and exit
--rate-limit RATE_LIMIT Authorized number of hits per minute. Defaults to 6.
-o OUTPUT, --output OUTPUT Path to the output file. By default, everything will be printed to stdout.
-t TOKEN, --token TOKEN CrowdTangle dashboard API token.
--end-date END_DATE The latest date at which a post could be posted (UTC!).
-f {csv,jsonl}, --format {csv,jsonl} Output format. Defaults to `csv`.
--language LANGUAGE Language of posts to retrieve.
-l LIMIT, --limit LIMIT Maximum number of posts to retrieve. Will fetch every post by default.
--list-ids LIST_IDS Ids of the lists from which to retrieve posts, separated by commas.
--partition-strategy PARTITION_STRATEGY Query partition strategy to use to overcome the API search result limits. Should either be `day` or a number of posts.
--resume Whether to resume an interrupted collection. Requires -o/--output & --sort-by date
--sort-by {date,interaction_rate,overperforming,total_interactions,underperforming}
The order in which to retrieve posts. Defaults to `date`.
--start-date START_DATE The earliest date at which a post could be posted (UTC!).
--url-report URL_REPORT Path to an optional report file to write about urls found in posts.
examples:
. Fetching the 500 latest posts from a dashboard:
`minet ct posts --token YOUR_TOKEN --limit 500 > latest-posts.csv`
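The `day` partition strategy mentioned above suggests splitting a date range into one query window per day. How minet does this internally is not documented here, so the following is only a sketch of the general idea. The day_partitions function and the newest-first ordering are assumptions made for the example.

```python
from datetime import date, timedelta


def day_partitions(start, end):
    """Yield (window_start, window_end) date pairs, one per day, newest first."""
    one_day = timedelta(days=1)
    window_end = end
    while window_end > start:
        window_start = max(start, window_end - one_day)
        yield window_start, window_end
        window_end = window_start


windows = list(day_partitions(date(2019, 1, 1), date(2019, 1, 4)))
for window_start, window_end in windows:
    print(window_start.isoformat(), '->', window_end.isoformat())
```

Each window could then back a separate query, which keeps the number of results per query small enough to stay under the API search result limits.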
search
usage: minet crowdtangle posts [-h] [--rate-limit RATE_LIMIT] [-o OUTPUT]
[-t TOKEN] [--end-date END_DATE] [-f {csv,jsonl}]
[--language LANGUAGE] [-l LIMIT]
[--list-ids LIST_IDS]
[--partition-strategy PARTITION_STRATEGY]
[--resume]
[--sort-by {date,interaction_rate,overperforming,total_interactions,underperforming}]
[--start-date START_DATE]
[--url-report URL_REPORT]
Minet CrowdTangle Posts Command
===============================
Gather post data from the designated dashboard (indicated by
a given token).
optional arguments:
-h, --help show this help message and exit
--rate-limit RATE_LIMIT Authorized number of hits per minute. Defaults to 6.
-o OUTPUT, --output OUTPUT Path to the output file. By default, everything will be printed to stdout.
-t TOKEN, --token TOKEN CrowdTangle dashboard API token.
--end-date END_DATE The latest date at which a post could be posted (UTC!).
-f {csv,jsonl}, --format {csv,jsonl} Output format. Defaults to `csv`.
--language LANGUAGE Language of posts to retrieve.
-l LIMIT, --limit LIMIT Maximum number of posts to retrieve. Will fetch every post by default.
--list-ids LIST_IDS Ids of the lists from which to retrieve posts, separated by commas.
--partition-strategy PARTITION_STRATEGY Query partition strategy to use to overcome the API search result limits. Should either be `day` or a number of posts.
--resume Whether to resume an interrupted collection. Requires -o/--output & --sort-by date
--sort-by {date,interaction_rate,overperforming,total_interactions,underperforming}
The order in which to retrieve posts. Defaults to `date`.
--start-date START_DATE The earliest date at which a post could be posted (UTC!).
--url-report URL_REPORT Path to an optional report file to write about urls found in posts.
examples:
. Fetching the 500 latest posts from a dashboard:
`minet ct posts --token YOUR_TOKEN --limit 500 > latest-posts.csv`
summary
usage: minet crowdtangle summary [-h] [--rate-limit RATE_LIMIT] [-o OUTPUT]
[-t TOKEN] [--start-date START_DATE]
[--total TOTAL]
column [file]
Minet CrowdTangle Link Summary Command
======================================
Retrieve aggregated statistics about link sharing
on the Crowdtangle API and by platform.
positional arguments:
column Name of the column containing the URL in the CSV file.
file CSV file containing the inquired URLs.
optional arguments:
-h, --help show this help message and exit
--rate-limit RATE_LIMIT Authorized number of hits per minute. Defaults to 6.
-o OUTPUT, --output OUTPUT Path to the output file. By default, everything will be printed to stdout.
-t TOKEN, --token TOKEN CrowdTangle dashboard API token.
--start-date START_DATE The earliest date at which a post could be posted (UTC!).
--total TOTAL Total number of HTML documents. Necessary if you want to display a finite progress indicator.
examples:
. Computing a summary of aggregated stats for urls contained in a CSV row:
`minet ct summary url urls.csv --token YOUR_TOKEN --start-date 2019-01-01 > summary.csv`
Facebook
usage: minet facebook [-h] {comments} ...
Minet Facebook Command
======================
Collects data from Facebook.
optional arguments:
-h, --help show this help message and exit
actions:
{comments} Action to perform to collect data on Facebook
comments
usage: minet facebook comments [-h] [-c COOKIE] [-o OUTPUT] url
Minet Facebook Comments Command
===============================
Scrape series of comments on Facebook.
positional arguments:
url Url of the post from which to scrape comments.
optional arguments:
-h, --help show this help message and exit
-c COOKIE, --cookie COOKIE Authenticated cookie to use or browser from which to extract it (support "firefox" and "chrome").
-o OUTPUT, --output OUTPUT Path to the output report file. By default, the report will be printed to stdout.
examples:
. Fetching a dashboard's lists:
`minet fb comments`
Hyphe
dump
usage: minet hyphe dump [-h] [-d OUTPUT_DIR] [--body] url corpus
Minet Hyphe Dump Command
========================
Command dumping page-level information from a given
Hyphe corpus.
positional arguments:
url Url of the Hyphe API.
corpus Id of the corpus.
optional arguments:
-h, --help show this help message and exit
-d OUTPUT_DIR, --output-dir OUTPUT_DIR Output directory for dumped files. Will default to some name based on corpus name.
--body Whether to download pages body.
examples:
. Dumping a corpus into the ./corpus directory:
`minet hyphe dump http://myhyphe.com/api/ corpus-name -d corpus`
Mediacloud
topic
stories
usage: minet mediacloud topic stories [-h] [-t TOKEN] [-o OUTPUT] topic_id
Minet Mediacloud Topic Stories Command
======================================
Retrieves the list of stories from a mediacloud topic.
positional arguments:
topic_id Id of the topic
optional arguments:
-h, --help show this help message and exit
-t TOKEN, --token TOKEN Mediacloud API token (also called key).
-o OUTPUT, --output OUTPUT Path to the output report file. By default, the report will be printed to stdout.
API
multithreaded_fetch
Function fetching urls in a multithreaded fashion.
from minet import multithreaded_fetch

# Most basic usage
urls = ['https://google.com', 'https://twitter.com']

for result in multithreaded_fetch(urls):
    print(result.url, result.response.status)

# Using a list of dicts
urls = [
    {
        'url': 'https://google.com',
        'label': 'Google'
    },
    {
        'url': 'https://twitter.com',
        'label': 'Twitter'
    }
]

for result in multithreaded_fetch(urls, key=lambda x: x['url']):
    print(result.item['label'], result.response.status)
Arguments:
- iterator iterable: An iterator over urls or arbitrary items, if you provide a key argument along with it.
- key ?callable: A function extracting the url to fetch from the items yielded by the provided iterator.
- request_args ?callable: A function returning arguments to pass to the internal request helper for a call.
- threads ?int [25]: Number of threads to use.
- throttle ?float|callable [0.2]: Per-domain throttle in seconds. Or a function taking the domain and current item and returning the throttle to apply.
- guess_extension ?bool [True]: Whether to attempt to guess the resource's extension.
- guess_encoding ?bool [True]: Whether to attempt to guess the resource's encoding.
- buffer_size ?int [25]: Max number of items per domain to enqueue into memory in hope of finding a new domain that can be processed immediately.
- insecure ?bool [False]: Whether to ignore SSL certification errors when performing requests.
- timeout ?float|urllib3.Timeout: Custom timeout for every request.
Yields:
A FetchWorkerResult having the following attributes:
- url ?string: the fetched url.
- item any: original item from the iterator.
- error ?Exception: an error.
- response ?urllib3.HTTPResponse: the http response.
- meta ?dict: additional metadata:
  - mime ?string: resource's mimetype.
  - ext ?string: resource's extension.
  - encoding ?string: resource's encoding.
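Since the error, response and meta attributes listed above are optional, it is probably wise to check for an error before using the rest of a result. The sketch below shows one possible way to do so. It runs on stand-in objects exposing the documented attribute names instead of real results, so no network access is needed. The summarize helper and the `.html` extension value are assumptions made for the example.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(results):
    """Split results into failures and a count of guessed extensions."""
    failures = []
    extensions = Counter()
    for result in results:
        if result.error is not None:
            failures.append((result.url, type(result.error).__name__))
            continue
        meta = result.meta or {}
        extensions[meta.get('ext') or 'unknown'] += 1
    return failures, extensions


# Stand-in objects mimicking the documented attributes; real results
# would come from multithreaded_fetch.
fake_results = [
    SimpleNamespace(url='https://google.com', error=None, meta={'ext': '.html'}),
    SimpleNamespace(url='https://twitter.com', error=TimeoutError('timed out'), meta=None),
]
failures, extensions = summarize(fake_results)
print(failures, dict(extensions))
```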
multithreaded_resolve
Function resolving url redirections in a multithreaded fashion.
from minet import multithreaded_resolve

# Most basic usage
urls = ['https://bit.ly/whatever', 'https://t.co/whatever']

for result in multithreaded_resolve(urls):
    print(result.stack)

# Using a list of dicts
urls = [
    {
        'url': 'https://bit.ly/whatever',
        'label': 'Bit.ly'
    },
    {
        'url': 'https://t.co/whatever',
        'label': 'Twitter'
    }
]

for result in multithreaded_resolve(urls, key=lambda x: x['url']):
    print(result.stack)
Arguments:
- iterator iterable: An iterator over urls or arbitrary items, if you provide a key argument along with it.
- key ?callable: A function extracting the url to fetch from the items yielded by the provided iterator.
- resolve_args ?callable: A function returning arguments to pass to the internal resolve helper for a call.
- threads ?int [25]: Number of threads to use.
- throttle ?float|callable [0.2]: Per-domain throttle in seconds. Or a function taking the domain and current item and returning the throttle to apply.
- max_redirects ?int [5]: Max number of redirections to follow.
- follow_refresh_header ?bool [False]: Whether to follow Refresh headers or not.
- follow_meta_refresh ?bool [False]: Whether to follow meta refresh tags. It's more costly because we need to stream the start of the response's body and cannot rely on headers alone.
- buffer_size ?int [25]: Max number of items per domain to enqueue into memory in hope of finding a new domain that can be processed immediately.
- insecure ?bool [False]: Whether to ignore SSL certification errors when performing requests.
- timeout ?float|urllib3.Timeout: Custom timeout for every request.
Yields:
A ResolveWorkerResult having the following attributes:
- url ?string: the fetched url.
- item any: original item from the iterator.
- error ?Exception: an error.
- stack ?list: the redirection stack.
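The throttle argument is described above as accepting a function taking the domain and current item. A minimal sketch of such a function might look like this. The argument order is inferred from that description only, and the domain names and delays are hypothetical values.

```python
# Hypothetical per-domain delays, in seconds.
SLOW_DOMAINS = {'bit.ly': 1.0, 't.co': 0.5}
DEFAULT_THROTTLE = 0.2


def throttle_for(domain, item):
    """Return the delay to apply for the domain; item is unused in this sketch."""
    return SLOW_DOMAINS.get(domain, DEFAULT_THROTTLE)


print(throttle_for('bit.ly', None), throttle_for('example.com', None))
```

Assuming the signature matches, such a function could be passed as the throttle argument in place of a float to slow down requests only for the domains that need it.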