Twitter stream & search API grabber

Gazouilloire

A command line tool for long-term tweet collection. Gazouilloire combines two methods to collect tweets from the Twitter API ("search" and "filter") in order to maximize the number of collected tweets, and automatically fills the gaps in the collection after connection errors or reboots. It handles various config options such as:

- collecting only during specific time periods
- limiting the collection to some locations
- resolving redirected urls
- downloading only certain types of media contents (only photos and no videos, for example)
- unfolding Twitter conversations

Python >= 3.7 compatible.

HowTo

Install gazouilloire
```
pip install gazouilloire
```
Install Elasticsearch, version 7.X (you can also use Docker for this)
Init gazouilloire collection in a specific directory...
```
gazouilloire init path/to/collection/directory
```
...or in the current directory
```
gazouilloire init
```

A config.json file is created. Open it to configure the collection parameters.

Set your Twitter API key and generate the related Access Token:
```
"twitter": {
   "key": "<Consumer Key (API Key)>xxxxxxxxxxxxxxxxxxxxx",
   "secret": "<Consumer Secret (API Secret)>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
   "oauth_token": "<Access Token>xxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
   "oauth_secret": "<Access Token Secret>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
```
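Before starting a collection, it can help to check that the four credential fields were actually filled in. The sketch below is not part of Gazouilloire: the helper name `missing_twitter_credentials` is hypothetical, and it assumes that the `twitter` block sits at the top level of config.json and that an untouched placeholder still starts with `<`.

```python
# Keys of the "twitter" block, as shown in the config excerpt above.
REQUIRED_KEYS = ("key", "secret", "oauth_token", "oauth_secret")


def missing_twitter_credentials(config):
    """Return the credential keys that are absent, empty or still a placeholder."""
    twitter = config.get("twitter") or {}
    missing = []
    for name in REQUIRED_KEYS:
        value = str(twitter.get(name) or "").strip()
        # Assumption: a value starting with "<" is an unedited template entry.
        if not value or value.startswith("<"):
            missing.append(name)
    return missing


# Made-up example values, for illustration only.
example = {
    "twitter": {
        "key": "abc",
        "secret": "def",
        "oauth_token": "",
        "oauth_secret": "<Access Token Secret>",
    }
}
print(missing_twitter_credentials(example))  # ['oauth_token', 'oauth_secret']
```

With a real collection, the dictionary would come from `json.load` applied to the config.json file created by `gazouilloire init`.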

Write down the list of desired keywords and @users and/or the list of desired url_pieces as json arrays:
```
"keywords": [
    "amour",
    "\"mots successifs\"",
    "@medialab_scpo"
],
"url_pieces": [
    "medialab.sciencespo.fr/fr"
],
```
Some advanced filters can be used in combination with the keywords, such as -undesiredkeyword, filter:links, -filter:media, -filter:retweets, etc. See Twitter API's documentation for more details.

Avoid using accented characters (Twitter will automatically return both tweets with and without accents, for instance searching "heros" will find both tweets with "heros" and "héros").

Note that there are three possibilities to filter further:
- language: to collect only tweets written in a specific language, add "language": "fr" to the config (the language must be given as an ISO 639-1 code)
- geolocation: add a "geolocation": "Paris, France" field to the config with the desired geographical boundaries, or give the coordinates of the desired box (for instance [48.70908786918211, 2.1533203125, 49.00274483644453, 2.610626220703125])
- time_limited_keywords: in order to filter on specific keywords during planned time periods, for instance:
```
"time_limited_keywords": {
      "#fosdem": [
          ["2021-01-27 04:30", "2021-01-28 23:30"]
      ]
  },
```
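To make the shape of `time_limited_keywords` concrete, here is a minimal sketch that tells which keywords are active at a given moment. It is not Gazouilloire's own implementation: the function name `active_keywords` is hypothetical, both period bounds are assumed to be inclusive, and timezones are ignored.

```python
from datetime import datetime

# Timestamp format used in the config excerpt above.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def active_keywords(time_limited_keywords, now):
    """Return the keywords having at least one period that contains `now`."""
    active = []
    for keyword, periods in time_limited_keywords.items():
        for start, end in periods:
            begin = datetime.strptime(start, TIME_FORMAT)
            finish = datetime.strptime(end, TIME_FORMAT)
            # Assumption: both bounds are treated as inclusive.
            if begin <= now <= finish:
                active.append(keyword)
                break  # one matching period is enough for this keyword
    return active


periods = {"#fosdem": [["2021-01-27 04:30", "2021-01-28 23:30"]]}
print(active_keywords(periods, datetime(2021, 1, 28, 12, 0)))  # ['#fosdem']
print(active_keywords(periods, datetime(2021, 2, 1, 0, 0)))    # []
```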
Set up extra options:
- resolve_redirected_links: set to true or false to enable or disable automatic resolution of all links found in tweets (t.co links are always handled, but this also resolves all other shorteners like bit.ly).
- grab_conversations: set to true to activate automatic iterative collection of all tweets to which collected tweets are answering (warning: account for the presence of these tweets when processing data, as this often results in collecting tweets well outside the collection time period).
- catchup_past_week: Twitter's free API only allows collecting tweets up to 7 days in the past, which Gazouilloire does by default. Set this option to false to disable this and only collect tweets posted after the collection was started.
- download_media: set "download_media": {"photo": true, "video": false, "animated_gif": false} to activate automatic downloading of photos posted by users, without videos or gifs (this does not include images from social cards). All fields can also be set to true to download everything. Additionally, set the media_directory field to the absolute path where Gazouilloire should store the images and videos on the machine.
- timezone: adjust the timezone within which tweet timestamps should be computed. If an invalid value is set, Gazouilloire lists the allowed values on startup.
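As an illustration, the options above could be assembled as follows before being written back to config.json. This is only a sketch: the option names and value shapes come from the list above, but their position at the top level of the config file is an assumption, and `merge_options` is a hypothetical helper. The `media_directory` and `timezone` fields are left out because their exact placement and allowed values are not specified here.

```python
import json

# Option names and value shapes are taken from the list above.
# Assumption: they live at the top level of config.json.
extra_options = {
    "resolve_redirected_links": True,
    "grab_conversations": False,
    "catchup_past_week": True,
    "download_media": {"photo": True, "video": False, "animated_gif": False},
}


def merge_options(config, options):
    """Return a copy of `config` with `options` applied on top."""
    merged = dict(config)
    merged.update(options)
    return merged


config = merge_options({"keywords": ["amour"]}, extra_options)
print(json.dumps(config, indent=4))
```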

Starting the collection:

Before starting the collection, you should make sure that you will have enough disk space. It takes about 1 GB per million tweets collected (without images and other media contents).

You should also plan to restart your collection in a new folder (i.e. open another elasticsearch index) if the current collection exceeds 150 million tweets.
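The two rules of thumb above translate into simple arithmetic. The sketch below only restates them in code; the function names are hypothetical and the figures are the approximate ones quoted above (media files excluded).

```python
GB_PER_MILLION_TWEETS = 1.0          # about 1 GB per million tweets, media excluded
MAX_TWEETS_PER_INDEX = 150_000_000   # suggested limit before starting a new folder


def estimate_disk_gb(n_tweets):
    """Rough disk usage, in GB, for `n_tweets` tweets."""
    return n_tweets / 1_000_000 * GB_PER_MILLION_TWEETS


def indices_needed(n_tweets):
    """Number of collection folders (indices) needed to stay under the limit."""
    # Ceiling division, with a minimum of one index.
    return max(1, -(-n_tweets // MAX_TWEETS_PER_INDEX))


print(estimate_disk_gb(40_000_000))  # 40.0
print(indices_needed(400_000_000))   # 3
```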

Run the collection with:
```
gazouilloire run path/to/collection/directory
```
or, to run it in the current directory:
```
gazouilloire run
```

The tool can also run as daemon with:
```
gazouilloire start
```
Stop the daemon with:
```
gazouilloire stop
```
Access the current collection status (running/not running, number of collected docs, disk usage, etc.) with:
```
gazouilloire status
```
Gazouilloire stores its current search state in the collection directory. This means that if you restart Gazouilloire in the same directory, it will not search again for tweets that were already collected. If you want a fresh start, you can reset the search state, as well as everything that was saved on disk, with:
```
gazouilloire reset
```
You can also choose to delete only some elements, e.g. the tweets stored in elasticsearch and the media files:
```
gazouilloire reset --only tweets,media
```
Possible values for the --only argument: tweets,links,logs,piles,search_state,media

Data is stored in your Elasticsearch instance, which you can query directly. You can also export it easily in CSV format:
```
# Export all fields from all tweets:
gazouilloire export
# or
gazou export
```
By default, the export command writes to stdout. You can also use the -o option to write into a file:
```
gazou export > my_tweets_file.csv
# is equivalent to
gazou export -o my_tweets_file.csv
```
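Once exported, the CSV can be processed with any standard tool. As a minimal sketch, the snippet below counts tweets per user with Python's `csv` module. The helper name `tweets_per_user` and the sample rows are made up for illustration; only the column names `id`, `user_screen_name` and `text` come from this document.

```python
import csv
import io
from collections import Counter


def tweets_per_user(csv_file):
    """Count rows per user_screen_name in an exported CSV (any open text file)."""
    reader = csv.DictReader(csv_file)
    return Counter(row["user_screen_name"] for row in reader)


# Made-up sample standing in for an exported file.
sample = io.StringIO(
    "id,user_screen_name,text\n"
    "1,medialab_ScPo,hello\n"
    "2,medialab_ScPo,again\n"
    "3,someone_else,hi\n"
)
print(tweets_per_user(sample).most_common())
# [('medialab_ScPo', 2), ('someone_else', 1)]
```

For a real export, pass a file opened with `open("my_tweets_file.csv", newline="", encoding="utf-8")` instead of the in-memory sample.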

Other available options:
```
# Export a csv of all tweets having a specific word in their text:
gazou export medialab

# Export a csv of all tweets between 2 dates (the last date is excluded):
gazou export --since "2021-03-24T12:00" --until "2021-03-24T13:00"
# or
gazou export --since "2021-03-24" --until "2021-03-25"

# Export a csv of all tweets having one of many specific words in their text:
gazou export medialab digitalhumanities datajournalism '#python'

# Export only a selection of columns:
gazouilloire export --columns/-c id,user_screen_name,local_time,links
# or
gazou export --select/-s id,user_screen_name,local_time,links
# Other example: export only the text of the tweets:
gazou export -s text

# Exclude tweets from conversations or from quotes (i.e. that do not match the keywords defined in config.json)
gazou export --exclude-threads

# Exclude retweets from the export
gazou export --exclude-retweets

# Export all tweets matching a specific Elasticsearch term query, for instance by user name:
gazou export "{'user_screen_name': 'medialab_ScPo'}"

# Take a csv file with an "id" column and return all tweets matching these ids:
gazou export --export-tweets-from-file yourfile.csv
```
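The date options describe a half-open interval, since the last date is excluded. The sketch below illustrates that semantics only; it is not how Gazouilloire filters tweets internally, the function name `in_export_window` is hypothetical, and treating the `--since` bound as inclusive is an assumption.

```python
from datetime import datetime


def in_export_window(timestamp, since, until):
    """Half-open interval: `since` is included (assumed), `until` is excluded."""
    return since <= timestamp < until


since = datetime.fromisoformat("2021-03-24")
until = datetime.fromisoformat("2021-03-25")
print(in_export_window(datetime(2021, 3, 24, 23, 59), since, until))  # True
print(in_export_window(datetime(2021, 3, 25, 0, 0), since, until))    # False
```

In other words, `--since "2021-03-24" --until "2021-03-25"` covers the whole of March 24 and nothing from March 25.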

Troubleshooting

Elasticsearch
- Remember to set the heap size (at 1GB by default) when moving to production. 1GB is fine for indices under 15-20 million tweets, but be sure to set a higher value for heavier corpora.
  
  Set these values in /etc/elasticsearch/jvm.options (if you use Elasticsearch as a service) or in your_installation_folder/config/jvm.options (if you have a custom installation folder):
```
-Xms2g
-Xmx2g
```
  Here the heap size is set at 2GB (set the values at -Xms5g -Xmx5g if you need 5GB, etc).
- If you encounter this Elasticsearch error message: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]:
  
  → Increase the max_map_count value:
```
sudo sysctl -w vm.max_map_count=262144
```
- If you get a ClusterBlockException [SERVICE_UNAVAILABLE/1/state not recovered / initialized] when starting Elasticsearch:
  
  → Check the value of gateway.recover_after_nodes in /etc/elasticsearch/elasticsearch.yml:
```
sudo [YOUR TEXT EDITOR] /etc/elasticsearch/elasticsearch.yml
```
  Edit the value of gateway.recover_after_nodes to match your number of nodes (usually 1; you can check it at http://host:port/_nodes).
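For the max_map_count issue listed above, the current kernel setting can be checked before starting Elasticsearch. The following is a minimal, Linux-only sketch; the helper names are hypothetical, and the threshold is the one quoted in the error message.

```python
REQUIRED_MAX_MAP_COUNT = 262144  # minimum quoted in the Elasticsearch error message


def needs_increase(current, required=REQUIRED_MAX_MAP_COUNT):
    """True when the current vm.max_map_count is below the required minimum."""
    return current < required


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current kernel setting (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


print(needs_increase(65530))   # True
print(needs_increase(262144))  # False
```

On a Linux host, `needs_increase(read_max_map_count())` tells whether the `sysctl` command above is still needed.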

Publications using Gazouilloire

CASTALDO Maria, VENTURINI Tommaso, FRASCA Paolo, GARGIULO Floriana, "The Rhythms of the Night: increase in online night activity and emotional resilience during the Covid-19 lockdown" (2020). arXiv preprint arXiv:2007.09353.
WARD Jeremy K, GUILLE-ESCURET Paul, ALAPETITE Clément, "Les « antivaccins », figure de l’anti-Science" (2019), in Déviance et Société, 2019/2 (Vol. 43), p. 221-251. DOI: 10.3917/ds.432.0221
RICCI, Donato, COLOMBO, Gabriele, MEUNIER, Axel, et al. Designing Digital Methods to monitor and inform Urban Policy. The case of Paris and its Urban Nature initiative. In: 3rd International Conference on Public Policy (ICPP3)-Panel T10P6 Session 1 Digital Methods for Public Policy. SGP, 2017. p. 1-37.
DOUAY, Nicolas, REYS, Aurélien, ROBIN, Sabrina. L’usage de Twitter par les maires d’Île-de-France. NETCOM, 29-3/4 | 2015 : Visualisation des réseaux, de l’information et de l’espace, p. 275-296.
ANTOLINOS-BASSO Diégo, PADDEU Flaminia, DOUAY Nicolas, BLANC Nathalie. Pourquoi le débat #EuropaCity n’a pas pris sur Twitter ?. RESET, 7 | 2018. DOI : 10.4000/reset.1070

Publications talking about Gazouilloire

JULLIARD, Virginie. #Theoriedugenre: comment débat-on du genre sur Twitter ?. Questions de communication, 2016, no 2, p. 135-157.
BOTTINI, Thomas et JULLIARD, Virginie. Entre informatique et sémiotique. Réseaux, 2017, no 4, p. 35-69.

Credits & License

Benjamin Ooghe-Tabanou, Jules Farjas, Béatrice Mazoyer & al @ Sciences Po médialab

Read more about Gazouilloire's migration from Python2 & Mongo to Python3 & ElasticSearch in Jules' report.

Discover more of our projects at médialab tools.

This work is supported by DIME-Web, part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Gazouilloire is free open source software released under the GPL 3.0 license.
