Twitter stream & search API grabber

Gazouilloire

Twitter stream + search API grabber handling various config options such as collecting only during specific time periods, or limiting the collection to some locations. Automatically goes back to "fill in the gaps" when there are cuts in the tweet collection.

Python >= 3.7 compatible.

HowTo

  • Install gazouilloire

    pip install gazouilloire
    
  • Install Elasticsearch (version 7.X)

  • Init gazouilloire collection in a specific directory...

    gazouilloire init path/to/collection/directory
    
  • ...or in the current directory

    gazouilloire init
    

A config.json file is created in that directory. Open it to configure the collection parameters.

  • Set your Twitter API key and generate the related Access Token

    "twitter": {
       "key": "<Consumer Key (API Key)>xxxxxxxxxxxxxxxxxxxxx",
       "secret": "<Consumer Secret (API Secret)>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
       "oauth_token": "<Access Token>xxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
       "oauth_secret": "<Access Token Secret>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }
    
  • List the desired keywords and @users and/or the desired url_pieces as JSON arrays:

    "keywords": [
        "amour",
        "@medialab_scpo"
    ],
    "url_pieces": [
        "medialab.sciencespo.fr/fr"
    ],
    

    Avoid accented characters in keywords: Twitter automatically returns tweets both with and without accents, so searching "heros" finds tweets containing either "heros" or "héros".
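
    As a minimal sketch of the advice above, keywords can be stripped of accents before they are pasted into config.json. The helper name strip_accents is hypothetical and is not part of Gazouilloire; it only relies on Python's standard unicodedata module.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


keywords = ["héros", "amour", "@medialab_scpo"]
print([strip_accents(keyword) for keyword in keywords])
# → ['heros', 'amour', '@medialab_scpo']
```

    Letters that are not built from a base letter plus an accent (for instance "ø") are left untouched by this approach.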

    Note that there are three possibilities to filter further:

    • language: to collect only tweets written in a specific language, add "language": "fr" to the config (the language must be given as an ISO 639-1 code)

    • geolocation: there are two ways to specify a geographical filter, either with a place name and its type:

      "geolocation": "Paris, France",
      "geolocation_type": "city", #possible values: "country"/"admin"/"city"
      

      or using a bounding box:

      "geolocation": [48.70908786918211, 2.1533203125, 49.00274483644453, 2.610626220703125],
      
    • time_limited_keywords: to track specific keywords only during planned time periods:

      "time_limited_keywords": {
          "#m6": [
              ["2014-05-01 16:00", "2014-05-01 16:05"],
              ["2014-05-08 16:00", "2014-05-08 16:05"],
              ["2014-05-15 16:00", "2014-05-15 16:05"],
              ["2014-05-22 16:00", "2014-05-22 16:05"]
          ],
          "bieber": [
              ["2014-05-08 16:00", "2014-05-08 16:05"]
          ]
      },
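
    Since each time window is a [start, end] pair, a window whose end precedes its start is easy to write by mistake. The following is a minimal sketch of a sanity check to run on the time_limited_keywords block before starting a collection. The function name invalid_windows is hypothetical, and the "%Y-%m-%d %H:%M" format is an assumption inferred from the example values rather than a documented Gazouilloire format.

```python
from datetime import datetime

# Assumed timestamp format, inferred from the example values above.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def invalid_windows(time_limited_keywords):
    """Return (keyword, start, end) triples whose end is not after their start."""
    problems = []
    for keyword, windows in time_limited_keywords.items():
        for start, end in windows:
            start_dt = datetime.strptime(start, TIME_FORMAT)
            end_dt = datetime.strptime(end, TIME_FORMAT)
            if end_dt <= start_dt:
                problems.append((keyword, start, end))
    return problems


example = {
    "#m6": [
        ["2014-05-01 16:00", "2014-05-01 16:05"],
        ["2014-05-15 16:00", "2014-05-08 16:05"],  # end precedes start
    ],
    "bieber": [["2014-05-08 16:00", "2014-05-08 16:05"]],
}
print(invalid_windows(example))
# → [('#m6', '2014-05-15 16:00', '2014-05-08 16:05')]
```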
      
  • Run with:

    gazouilloire run path/to/collection/directory
    

    or, to run the script in the current directory:

    gazouilloire run
    
  • The tool can also run as daemon with:

    gazouilloire start
    
  • Stop the daemon with:

    gazouilloire stop
    
  • Access the current collection status (running/not running, number of collected docs, disk usage, etc.) with:

    gazouilloire status
    
  • Gazouilloire stores its current search state in the collection directory. This means that if you restart Gazouilloire, it will not search again for tweets that were already found. If you want a fresh start (e.g. if you modify the query terms in config.json), you can reset the search state with:

    gazouilloire reset -i none
    

    The --es_index/-i option allows you to also remove the links or tweets Elasticsearch indices. To remove only links and search state:

    gazouilloire reset -i links
    

    To remove only tweets and search state:

    gazouilloire reset -i tweets
    

    To remove links, tweets and search state:

    gazouilloire reset
    
  • Data is stored in your Elasticsearch, which you can query directly. You can also export it easily in CSV format:

    # Export all fields from all tweets:
    gazouilloire export
    # or
    gazou export
    
  • By default, the export command writes to stdout. You can also use the -o option to write to a file:

    gazou export > my_tweets_file.csv
    # is equivalent to
    gazou export -o my_tweets_file.csv
    
  • Other available options:

    # Export a csv of all tweets having a specific word in their text:
    gazou export medialab
    
    # Export a csv of all tweets having one of many specific words in their text:
    gazou export medialab digitalhumanities datajournalism '#python'
    
    # Export only a selection of columns:
    gazouilloire export --columns/-c id,user_screen_name,local_time,links
    # or
    gazou export --select/-s id,user_screen_name,local_time,links
    # Other example: export only the text of the tweets:
    gazou export -s text
    
    # Exclude tweets from conversations or from quotes (i.e. that do not match the keywords defined in config.json)
    gazou export --exclude_threads
    
    # Export all tweets matching a specific Elasticsearch term query, for instance by user name:
    gazou export "{'user_screen_name': 'medialab_ScPo'}"
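
    Once exported, the CSV can be processed with any standard tool. As a rough sketch, the snippet below counts tweets per user with Python's csv module. The function name tweets_per_user is hypothetical, and the sample rows (including the timestamp format) are invented for illustration; only the column names are taken from the export examples above.

```python
import csv
import io
from collections import Counter


def tweets_per_user(csv_file):
    """Count rows per user_screen_name in an exported CSV (any open text file)."""
    reader = csv.DictReader(csv_file)
    return Counter(row["user_screen_name"] for row in reader)


# Invented sample rows standing in for a real export; only the column
# names come from the export examples.
sample = io.StringIO(
    "id,user_screen_name,local_time,links\n"
    "1,medialab_ScPo,2014-05-08T16:00:00,\n"
    "2,some_user,2014-05-08T16:01:00,\n"
    "3,medialab_ScPo,2014-05-08T16:02:00,\n"
)
counts = tweets_per_user(sample)
print(counts.most_common())
# → [('medialab_ScPo', 2), ('some_user', 1)]
```

    With a real export, the same function can be given a file opened with open("my_tweets_file.csv", newline="", encoding="utf-8").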
    

Troubleshooting

  • Elasticsearch

    • Remember to set the heap size (1GB by default) when moving to production. 1GB is fine for indices under 15-20 million tweets, but be sure to set a higher value for heavier corpora.

      Set these values in /etc/elasticsearch/jvm.options (if you run Elasticsearch as a service) or in your_installation_folder/config/jvm.options (if you have a custom installation folder):

      -Xms2g
      -Xmx2g
      

      Here the heap size is set to 2GB (use -Xms5g -Xmx5g if you need 5GB, etc.).

    • If you encounter this Elasticsearch error message: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]:

      → Increase the max_map_count value:

      sudo sysctl -w vm.max_map_count=262144
      


    • If you get a ClusterBlockException [SERVICE_UNAVAILABLE/1/state not recovered / initialized] when starting Elasticsearch:

      → Check the value of gateway.recover_after_nodes in /etc/elasticsearch/elasticsearch.yml:

      sudo [YOUR TEXT EDITOR] /etc/elasticsearch/elasticsearch.yml
      

      Edit the value of gateway.recover_after_nodes to match your number of nodes (usually 1, which you can check at http://host:port/_nodes).
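
      The node count can also be read programmatically. The sketch below assumes the /_nodes response is a JSON object with a "nodes" mapping keyed by node ID. The function names are hypothetical, the sample payload is invented and heavily trimmed, and http://localhost:9200 is only the usual default address, which may differ on your installation.

```python
import json
from urllib.request import urlopen


def count_nodes(nodes_info):
    """Return the number of nodes listed in a parsed /_nodes response."""
    return len(nodes_info.get("nodes", {}))


def fetch_nodes_info(base_url):
    """Fetch and parse the /_nodes endpoint, e.g. base_url='http://localhost:9200'."""
    with urlopen(base_url.rstrip("/") + "/_nodes") as response:
        return json.load(response)


# Trimmed, invented payload shaped like a single-node response.
sample = {"cluster_name": "elasticsearch", "nodes": {"node-id-1": {"name": "node-1"}}}
print(count_nodes(sample))
# → 1
```

      Against a running instance, count_nodes(fetch_nodes_info("http://localhost:9200")) gives the value to compare with gateway.recover_after_nodes.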

Publications using Gazouilloire

Publications talking about Gazouilloire

Credits & License

Benjamin Ooghe-Tabanou, Jules Farjas, Béatrice Mazoyer & al @ Sciences Po médialab

Read more about Gazouilloire's migration from Python2 & Mongo to Python3 & ElasticSearch in Jules' report.

Discover more of our projects at médialab tools.

This work is supported by DIME-Web, part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Gazouilloire is free open-source software released under the GPL 3.0 license.
