
Twitter stream & search API grabber

Project description



A command line tool for long-term tweets collection. Gazouilloire combines two methods to collect tweets from the Twitter API ("search" and "filter") in order to maximize the number of collected tweets, and automatically fills the gaps in the collection in case of connection errors or reboots. It handles various config options such as language or geolocation filtering, conversation retrieval, and media download (see the Advanced parameters section below).

Python >= 3.7 compatible.

Installation

  • Install gazouilloire

    pip install gazouilloire
    
  • Install ElasticSearch, version 7.X (you can also use Docker for this)
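
    For instance, a minimal sketch of running a single-node ElasticSearch 7 with Docker (the image tag is just an example, any 7.X version should work):

    docker run -d --name gazouilloire-es -p 9200:9200 \
        -e "discovery.type=single-node" \
        docker.elastic.co/elasticsearch/elasticsearch:7.17.9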

  • Init gazouilloire collection in a specific directory...

    gazou init path/to/collection/directory
    
  • ...or in the current directory

    gazou init
    

A config.json file is created. Open it to configure the collection parameters.

Quick start

  • Set your Twitter API key and generate the related Access Token

    "twitter": {
        "key": "<Consumer Key (API Key)>xxxxxxxxxxxxxxxxxxxxx",
        "secret": "<Consumer Secret (API Secret)>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "oauth_token": "<Access Token>xxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "oauth_secret": "<Access Token Secret>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }
    
  • Set your ElasticSearch connection (host & port) within the database section and choose a database name that will host your corpus' index:

    "database": {
        "host": "localhost",
        "port": 9200,
        "db_name": "medialab-tweets"
    }
    

Note that ElasticSearch database names must be lowercase and must not contain spaces or accented characters.
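
To check that ElasticSearch is up and reachable at the configured host and port, a quick sanity check (assuming the default localhost:9200 used above):

    curl http://localhost:9200

This should return a small JSON document containing the cluster name and the ElasticSearch version.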

  • Write down the list of desired keywords and @users and/or the list of desired url_pieces as json arrays:

    "keywords": [
        "amour",
        "\"mots successifs\"",
        "@medialab_scpo"
    ],
    "url_pieces": [
        "medialab.sciencespo.fr/fr"
    ],
    

    Read the advanced parameters section below to set up more filters and options, or for details on how to properly write your queries within keywords.

  • Start the collection by typing the following command in your terminal:

    gazou run
    

    or, if the config file is located in another directory than the current one:

    gazou run path/to/collection/directory
    

    Read the daemon mode section below to learn how to keep gazouilloire running continuously on a server and how to set up automatic restarts.

Disk space

Before starting the collection, you should make sure that you will have enough disk space. It takes about 1GB per million tweets collected (without images and other media contents).

You should also consider starting gazouilloire in multi-index mode if the collection is planned to exceed 100 million tweets, or simply restart your collection in a new folder with a new db_name (i.e. open another ElasticSearch index) if the current collection exceeds 150 million tweets.

As a point of comparison, here is the number of tweets sent during the whole year 2021 containing certain keywords (the values were obtained with the API V2 tweets count endpoint):

Query                 Number of tweets in 2021
lemondefr lang:fr     3 million
macron lang:fr        21 million
vaccine               176 million
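
As an illustration, here is a sketch of how such counts can be obtained with Twitter's API v2 counts endpoint (assuming a bearer token with Academic Research access; the query and dates are examples):

    curl "https://api.twitter.com/2/tweets/counts/all?query=macron%20lang%3Afr&granularity=day&start_time=2021-01-01T00:00:00Z&end_time=2022-01-01T00:00:00Z" \
        -H "Authorization: Bearer $BEARER_TOKEN"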

Export the tweets in CSV format

  • Data is stored in your ElasticSearch, which you can directly query. But you can also export it easily in CSV format:

    # Export all fields from all tweets:
    gazou export
    
  • By default, the export command writes to stdout. You can also use the -o option to write into a file:

    gazou export > my_tweets_file.csv
    # is equivalent to
    gazou export -o my_tweets_file.csv
    

    However, if you interrupt the export and need to resume it in several passes, the --resume option only works in combination with the -o option.
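
    For instance, a resumed export could look like this (a sketch combining the two options above):

    gazou export -o my_tweets_file.csv --resume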

  • Other available options:

    # Get documentation for all options of gazou export (-h or --help)
    gazou export -h
    
    # By default, the export will show a progressbar, which you can disable like this:
    gazou export --quiet
    
    # Export a csv of all tweets between 2 dates or datetimes (--since is inclusive and --until exclusive):
    gazou export --since 2021-03-24 --until 2021-03-25
    # or
    gazou export --since 2021-03-24T12:00:00 --until 2021-03-24T13:00:00
    
    # Export a csv of all tweets having a specific word in their text:
    gazou export medialab
    
    # Export a csv of all tweets having one of many specific words in their text:
    gazou export medialab digitalhumanities datajournalism '#python'
    
    # List all available fields for each tweet:
    gazou export --list-fields
    
    # Export only a selection of fields (-c / --columns or -s / --select the xsv way):
    gazou export -c id,user_screen_name,local_time,links
    # or for example to export only the text of the tweets:
    gazou export --select text
    
    # Exclude tweets collected via conversations or quotes (i.e. which do not match the keywords defined in config.json)
    gazou export --exclude-threads
    
    # Exclude retweets from the export
    gazou export --exclude-retweets
    
    # Export all tweets matching a specific ElasticSearch term query, for instance by user name:
    gazou export "{'user_screen_name': 'medialab_ScPo'}"
    
    # Take a csv file with an "id" column and export only the tweets whose ids are included in this file:
    gazou export --export-tweets-from-file list_of_ids.csv
    
    # You can of course combine all of these options, for instance:
    gazou export medialab --since "2021-03-24" --until "2021-03-25" -c text --exclude-threads --exclude-retweets -o medialab_tweets_210324_no_threads_no_rts.csv
    

Advanced parameters

Many advanced settings can be used to better filter the tweets collected and complete the corpus. They can all be modified within the config.json file.

- keywords

Some advanced filters can be used in combination with the keywords, such as -undesiredkeyword, filter:links, -filter:media, -filter:retweets, etc. See Twitter API's documentation for more details.

When adding a Twitter user as a keyword, such as "@medialab_ScPo", Gazouilloire will actually query "from:medialab_ScPo OR to:medialab_ScPo OR @medialab_ScPo" so that all tweets mentioning the user are also collected.

Keywords are case-insensitive: using upper or lower case characters won't change the results.

Avoid using accented characters: Twitter automatically returns tweets both with and without accents (for instance, searching "heros" will match tweets containing either "heros" or "héros").

Regarding hashtags, note that querying a word without the # character will return both tweets with the regular word and tweets with the hashtag, whereas adding the keyword with its # character will only collect tweets containing the hashtag.
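
As an illustration, a keywords list combining these syntaxes might look like this (a sketch; the keywords themselves are just placeholders):

"keywords": [
    "climat -filter:retweets",
    "\"énergie nucléaire\"",
    "#cop26",
    "@medialab_ScPo"
]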

Note that there are three possibilities to filter further:

- language

In order to collect only tweets written in a specific language, just add "language": "fr" to the config (the language should be written as an ISO 639-1 code).

- geolocation

Just add "geolocation": "Paris, France" field to the config with the desired geographical boundaries or give in coordinates of the desired box (for instance [48.70908786918211, 2.1533203125, 49.00274483644453, 2.610626220703125])

- time_limited_keywords

In order to filter on specific keywords during planned time periods, for instance:

"time_limited_keywords": {
      "#fosdem": [
          ["2021-01-27 04:30", "2021-01-28 23:30"]
      ]
  }

- url_pieces

To search for specific parts of websites, one can input pieces of urls as keywords in this field. For instance:

"url_pieces": [
    "medialab.sciencespo.fr",
    "github.com/medialab"
]

- resolve_redirected_links

Set to true or false to enable or disable automatic resolution of all links found in tweets (t.co links are always handled, but this also enables resolution of all other shorteners such as bit.ly).

The resolving_delay (set to 30 by default) defines for how many days URLs returning errors will be retried before being left as they are.
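
For reference, a corresponding config sketch (assuming these options sit at the top level of config.json like the other parameters):

"resolve_redirected_links": true,
"resolving_delay": 30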

- grab_conversations

Set to true to activate automatic recursive retrieval within the corpus of all tweets to which the collected tweets reply (warning: one should account for the presence of these tweets when processing the data, as this often results in collecting tweets that do not contain the queried keywords and/or fall way outside the collection time period).

- catchup_past_week

Twitter's free API allows collecting tweets up to 7 days in the past, which gazouilloire does by default when starting a new corpus. Set this option to false to disable this behaviour and only collect tweets posted after the collection was started.
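
For instance, a config sketch enabling conversation retrieval while keeping the default backfill of the past week (option names as documented in this section):

"grab_conversations": true,
"catchup_past_week": true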

- download_media

Configure this option to activate automatic downloading, within media_directory, of photos and/or videos posted by users within the collected tweets (this does not include images from social cards). For instance, the following configuration will only collect pictures, without videos or gifs:

"download_media": {
    "photo": true,
    "video": false,
    "animated_gif": false,
    "media_directory": "path/to/media/directory"
}

All fields can also be set to true to download everything. media_directory is the folder where Gazouilloire stores the images & videos. It should either be an absolute path ("/home/user/gazouilloire/my_collection/my_images"), or a path relative to the directory where config.json is located ("my_images").

- timezone

Adjust the timezone within which tweets' timestamps should be computed. Allowed values are suggested at Gazouilloire's startup when an invalid one is set.
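
For example (a sketch, assuming standard tz database names such as Europe/Paris are accepted):

"timezone": "Europe/Paris"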

- verbose

When set to true, logs will be way more explicit regarding Gazouilloire's interactions with Twitter's API.

Daemon mode

For production use and long term data collection, Gazouilloire can run as a daemon (which means that it executes in the background, and you can safely close the window within which you started it).

  • Start the collection in daemon mode with:

    gazou start
    
  • Stop the daemon with:

    gazou stop
    
  • Restart the daemon with:

    gazou restart
    
  • Access the current collection status (running/not running, number of collected tweets, disk usage, etc.) with:

    gazou status
    
  • Gazouilloire should normally restart on its own in case of temporary internet access outages, but it might occasionally fail for various reasons, such as ElasticSearch having crashed. In order to ensure a long term collection remains up and running without constantly checking it, we recommend programming automatic restarts of Gazouilloire at least once every week using cronjobs (missing tweets will be completed up to 7 days after a crash). To do so, a restart.sh script is provided that handles restarting ElasticSearch whenever necessary: just copy paste it within your corpus directory. Use cases and cronjob examples are proposed as comments at the top of the script.
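
    For instance, a hypothetical crontab entry restarting the collection every Monday at 5:00 (the schedule and path are illustrative, adapt them to your corpus directory):

    0 5 * * 1 bash /path/to/collection/directory/restart.sh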

  • An example script daily_mail_export.sh is also proposed to perform daily tweets exports and get them by e-mail. Feel free to reuse and tailor it to your own needs.

Reset

  • Gazouilloire stores its current search state in the collection directory. This means that if you restart Gazouilloire in the same directory, it will not search again for tweets that were already collected. If you want a fresh start, you can reset the search state, as well as everything that was saved on disk, using:

    gazou reset
    
  • You can also choose to delete only some elements, e.g. the tweets stored in ElasticSearch and the media files:

    gazou reset --only tweets,media
    

    Possible values for the --only argument: tweets,links,logs,piles,search_state,media

Development

To install Gazouilloire's latest development version or to help develop it, clone the repository and install your local version using the setup.py file:

git clone https://github.com/medialab/gazouilloire
cd gazouilloire
python setup.py install

Gazouilloire's main code lies in gazouilloire/run.py, in which the whole multiprocess architecture is orchestrated. Below is a diagram of all processes and queues.

  • The searcher collects tweets by querying Twitter's search API v1.1 for all keywords sequentially, as much as the API rate limits allow
  • The streamer collects realtime tweets using Twitter's streaming API v1.1, as well as info on deleted tweets from users explicitly followed as keywords
  • The depiler processes and reformats tweets and deleted tweets using twitwi before indexing them into ElasticSearch. It also extracts media urls and parent tweets to feed the downloader and the catchupper
  • The downloader requests all media urls and stores them on the filesystem (if the download_media option is enabled)
  • The catchupper recursively collects, via Twitter's lookup API v1.1, the parent tweets of all collected tweets that are part of a thread, and feeds them back to the depiler (if the grab_conversations option is enabled)
  • The resolver runs multithreaded queries on all urls found as links within the collected tweets and tries to resolve them, thanks to minet, in order to get unshortened and harmonized urls (if the resolve_redirected_links option is enabled)

All three queues are backed up on the filesystem in pile_***.json files, so that they can be reloaded at the next restart whenever Gazouilloire is shut down.

[Diagram: Gazouilloire's multiprocess architecture]

Troubleshooting

ElasticSearch

  • Remember to set the heap size (at 1GB by default) when moving to production. 1GB is fine for indices under 15-20 million tweets, but be sure to set a higher value for heavier corpora.

    Set these values in /etc/elasticsearch/jvm.options (if you use ElasticSearch as a service) or in your_installation_folder/config/jvm.options (if you have a custom installation folder):

    -Xms2g
    -Xmx2g
    

    Here the heap size is set to 2GB (set the values to -Xms5g -Xmx5g if you need 5GB, etc.).

  • If you encounter this ElasticSearch error message: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]:

    → Increase the max_map_count value:

    sudo sysctl -w vm.max_map_count=262144
    

    (source)
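
    To make this setting persist across reboots, the same value can also be added to /etc/sysctl.conf (a common follow-up step, not part of the original instructions):

    vm.max_map_count=262144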

  • If you get a ClusterBlockException [SERVICE_UNAVAILABLE/1/state not recovered / initialized] when starting ElasticSearch:

    → Check the value of gateway.recover_after_nodes in /etc/elasticsearch/elasticsearch.yml:

    sudo [YOUR TEXT EDITOR] /etc/elasticsearch/elasticsearch.yml
    

    Edit the value of gateway.recover_after_nodes to match your number of nodes (usually 1, easily checked at http://host:port/_nodes).
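
    For a single-node setup, the corresponding line in elasticsearch.yml would typically read (a sketch):

    gateway.recover_after_nodes: 1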

Publications

Gazouilloire presentations

Publications using Gazouilloire

Publications talking about Gazouilloire

Credits & License

Benjamin Ooghe-Tabanou, Jules Farjas, Béatrice Mazoyer & al @ Sciences Po médialab

Read more about Gazouilloire's migration from Python2 & Mongo to Python3 & ElasticSearch in Jules' report.

Discover more of our projects at médialab tools.

This work has been supported by DIME-Web, part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Gazouilloire is a free open source software released under GPL 3.0 license.

