Flexible, high-scale API to elasticsearch
pyelasticsearch is a clean, future-proof, high-scale API to elasticsearch. It provides…
Transparent conversion of Python data types to and from JSON, including datetimes and the arbitrary-precision Decimal type
Translation of HTTP failure status codes into exceptions
Connection pooling
HTTP authentication
Load balancing across nodes in a cluster
Failed-node marking to avoid downed nodes for a period
Optional automatic retrying of failed requests
Thread safety
Loosely coupled design, letting you customize things like JSON encoding and bulk indexing
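To make the JSON conversion point concrete, here is a minimal sketch of an encoder that handles datetimes and Decimal using only the standard library. It is not the encoder pyelasticsearch ships; the class name SketchEncoder and the choice to emit Decimal as a string are assumptions made for illustration.

```python
import json
from datetime import date, datetime
from decimal import Decimal


class SketchEncoder(json.JSONEncoder):
    """Hypothetical encoder for illustration; not pyelasticsearch's own."""

    def default(self, o):
        # Called by json only for objects it cannot serialize natively.
        if isinstance(o, (datetime, date)):
            return o.isoformat()
        if isinstance(o, Decimal):
            # Assumption: emit a string so no precision is lost.
            return str(o)
        return super().default(o)


doc = {'joined': datetime(2015, 4, 17, 12, 30), 'balance': Decimal('10.50')}
encoded = json.dumps(doc, cls=SketchEncoder, sort_keys=True)
print(encoded)
```

According to the v0.5 changelog entry, a custom encoder class of this kind can be plugged into ElasticSearch.json_encoder.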
For more on our philosophy and history, see Comparison with elasticsearch-py, the “Official Client”.
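The load balancing and failed-node marking listed among the features can be pictured as a round-robin pool that skips nodes for a while after they fail. The sketch below is only an illustration of that idea: the NodePool class, its method names, and the 60-second default are hypothetical, and since v1.0 the library relies on elasticsearch-py's transport and downtime-pooling machinery rather than anything like this.

```python
import itertools
import time


class NodePool:
    """Round-robin over node URLs, skipping ones recently marked dead.

    Hypothetical sketch; not how pyelasticsearch is implemented.
    """

    def __init__(self, urls, dead_seconds=60, clock=time.monotonic):
        self._urls = list(urls)
        self._cycle = itertools.cycle(self._urls)
        self._dead_until = {}          # url -> time at which it may be retried
        self._dead_seconds = dead_seconds
        self._clock = clock            # injectable so tests can fake time

    def mark_dead(self, url):
        """Avoid this node until dead_seconds have passed."""
        self._dead_until[url] = self._clock() + self._dead_seconds

    def next_node(self):
        """Return the next live node, trying each node at most once."""
        now = self._clock()
        for _ in range(len(self._urls)):
            url = next(self._cycle)
            if url not in self._dead_until or self._dead_until[url] <= now:
                return url
        raise RuntimeError('all nodes are marked dead')


# Example node URLs are placeholders.
pool = NodePool(['http://node1:9200/', 'http://node2:9200/'])
first = pool.next_node()
pool.mark_dead(first)
second = pool.next_node()
```

Passing the clock in as a parameter keeps the timeout logic easy to test without sleeping.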
A Taste of the API
Make a pooling, balancing, all-singing, all-dancing connection object:
>>> from pyelasticsearch import ElasticSearch
>>> es = ElasticSearch('http://localhost:9200/')
Index a document:
>>> es.index('contacts',
...          'person',
...          {'name': 'Joe Tester', 'age': 25, 'title': 'QA Master'},
...          id=1)
{u'_type': u'person', u'_id': u'1', u'ok': True, u'_version': 1, u'_index': u'contacts'}
Index a couple more documents, this time in a single request using the bulk-indexing API:
>>> docs = [{'id': 2, 'name': 'Jessica Coder', 'age': 32, 'title': 'Programmer'},
...         {'id': 3, 'name': 'Freddy Tester', 'age': 29, 'title': 'Office Assistant'}]
>>> es.bulk((es.index_op(doc, id=doc.pop('id')) for doc in docs),
...         index='contacts',
...         doc_type='person')
If we had many documents and wanted to chunk them for performance, bulk_chunks() would easily rise to the task, dividing either at a certain number of documents per batch or, for curated platforms like Google App Engine, at a certain number of bytes. Thanks to the decoupled design, you can even substitute your own batching function if you have unusual needs. Bulk indexing is the most demanding ES task in most applications, so we provide very thorough tools for representing operations, optimizing wire traffic, and dealing with errors. See bulk() for more.
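The idea behind chunking can be sketched in a few lines of plain Python. This is not the library's bulk_chunks() implementation: the function name chunk_actions and its parameters are hypothetical, and it assumes each action has already been serialized to a string so that its size in bytes can be measured.

```python
def chunk_actions(actions, docs_per_chunk=500, bytes_per_chunk=None):
    """Yield lists of serialized actions, capped by count and optionally bytes.

    Hypothetical helper for illustration; not pyelasticsearch's bulk_chunks().
    """
    chunk, chunk_bytes = [], 0
    for action in actions:
        size = len(action.encode('utf-8')) + 1  # +1 for the newline separator
        too_many = len(chunk) >= docs_per_chunk
        too_big = (bytes_per_chunk is not None
                   and chunk_bytes + size > bytes_per_chunk)
        # Never yield an empty chunk: an oversized action travels alone.
        if chunk and (too_many or too_big):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(action)
        chunk_bytes += size
    if chunk:
        yield chunk


actions = ['{"n": %d}' % n for n in range(5)]
by_count = list(chunk_actions(actions, docs_per_chunk=2))
by_bytes = list(chunk_actions(actions, bytes_per_chunk=20))
```

A chunk exceeds the byte limit only when a single action is larger than the limit by itself, which mirrors the behavior the v1.2 changelog entry describes for bulk_chunks().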
Refresh the index to pick up the latest:
>>> es.refresh('contacts')
{u'ok': True, u'_shards': {u'successful': 5, u'failed': 0, u'total': 10}}
Get just Jessica’s document:
>>> es.get('contacts', 'person', 2)
{u'_id': u'2',
 u'_index': u'contacts',
 u'_source': {u'age': 32, u'name': u'Jessica Coder', u'title': u'Programmer'},
 u'_type': u'person',
 u'_version': 1,
 u'exists': True}
Perform a simple search:
>>> es.search('name:joe OR name:freddy', index='contacts')
{u'_shards': {u'failed': 0, u'successful': 42, u'total': 42},
 u'hits': {u'hits': [{u'_id': u'1',
                      u'_index': u'contacts',
                      u'_score': 0.028130024999999999,
                      u'_source': {u'age': 25,
                                   u'name': u'Joe Tester',
                                   u'title': u'QA Master'},
                      u'_type': u'person'},
                     {u'_id': u'3',
                      u'_index': u'contacts',
                      u'_score': 0.028130024999999999,
                      u'_source': {u'age': 29,
                                   u'name': u'Freddy Tester',
                                   u'title': u'Office Assistant'},
                      u'_type': u'person'}],
           u'max_score': 0.028130024999999999,
           u'total': 2},
 u'timed_out': False,
 u'took': 4}
Perform a search using the elasticsearch query DSL:
>>> query = {
...     'query': {
...         'filtered': {
...             'query': {
...                 'query_string': {'query': 'name:tester'}
...             },
...             'filter': {
...                 'range': {
...                     'age': {
...                         'from': 27,
...                         'to': 37,
...                     },
...                 },
...             },
...         },
...     },
... }
>>> es.search(query, index='contacts')
{u'_shards': {u'failed': 0, u'successful': 42, u'total': 42},
 u'hits': {u'hits': [{u'_id': u'3',
                      u'_index': u'contacts',
                      u'_score': 0.19178301,
                      u'_source': {u'age': 29,
                                   u'name': u'Freddy Tester',
                                   u'title': u'Office Assistant'},
                      u'_type': u'person'}],
           u'max_score': 0.19178301,
           u'total': 1},
 u'timed_out': False,
 u'took': 2}
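Because the query is an ordinary dict, it can also be assembled by a small function when the field or bounds vary. The helper below is a hypothetical convenience, not part of pyelasticsearch; it produces the same request body as the search above.

```python
def filtered_range_query(query_string, field, low, high):
    """Build a filtered query: a query_string match narrowed by a range filter.

    Hypothetical helper for illustration; not part of pyelasticsearch.
    """
    return {
        'query': {
            'filtered': {
                'query': {'query_string': {'query': query_string}},
                'filter': {'range': {field: {'from': low, 'to': high}}},
            },
        },
    }


query = filtered_range_query('name:tester', 'age', 27, 37)
# The result could then be passed along as es.search(query, index='contacts').
```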
Delete the index:
>>> es.delete_index('contacts')
{u'acknowledged': True, u'ok': True}
For more, see the full API Documentation.
Changelog
v1.2.3 (2015-04-17)
Make delete_all_indexes() work.
Fix a bug in which specifying _all as an index name sometimes caused doctype names to be treated as index names.
v1.2.2 (2015-04-10)
Correct a typo in the bulk() docs.
v1.2.1 (2015-04-09)
Update ES doc links, now that Elastic has changed domains and reorganized its docs.
Require elasticsearch lib 1.3 or greater, as that’s when it started exposing ConnectionTimeout.
v1.2 (2015-03-06)
Make sure the Content-Length header gets set when calling create_index() with no explicit settings arg. This solves 411s when using nginx as a proxy.
Add doc_as_upsert arg to update().
Make bulk_chunks() compute perfectly optimal results, no longer ever exceeding the byte limit unless a single document is over the limit on its own.
v1.1 (2015-02-12)
Introduce new bulk API, supporting all types of bulk operations (index, update, create, and delete), providing chunking via bulk_chunks(), and introducing per-action error-handling. All errors raise exceptions, even individual failed operations, and the exceptions expose enough data to identify operations for retrying or reporting. The design is decoupled in case you want to create your own chunkers or operation builders.
Deprecate bulk_index() in favor of the more capable bulk().
Make one last update to bulk_index(). It now catches individual operation failures, raising BulkError. Also add the index_field and type_field args, allowing you to index across different indices and doc types within one request.
ElasticSearch object now defaults to http://localhost:9200/ if you don’t provide any node URLs.
Improve docs: give a better overview on the front page, and document how to customize JSON encoding.
v1.0 (2015-01-23)
Switch to elasticsearch-py’s transport and downtime-pooling machinery, much of which was borrowed from us anyway.
Make bulk indexing (and likely other network things) 15 times faster.
Add a comparison with the official client to the docs.
Fix delete_by_query() to work with ES 1.0 and later.
Bring percolate() es_kwargs up to date.
Fix all tests that were failing on modern versions of ES.
Tolerate errors that are non-strings and create exceptions for them properly.
v0.7.1 (2014-08-12)
Bring tests up to date with update_aliases() API change.
v0.7 (2014-08-12)
When an id_field is specified for bulk_index(), don’t index it under its original name as well; use it only as the _id.
Rename aliases() to get_aliases() for consistency with other methods. Original name still works but is deprecated. Add an alias kwarg to the method so you can fetch specific aliases.
v0.6.1 (2013-11-01)
Update package requirements to allow requests 2.0, which is in fact compatible. (Natim)
Properly raise IndexAlreadyExistsError even if the error is reported by a node other than the one to which the client is directly connected. (Jannis Leidel)
v0.6 (2013-07-23)
bulk_index() now overwrites any existing doc of the same ID and doctype. Before, in certain versions of ES (like 0.90RC2), it did nothing at all if a document already existed, probably much to your surprise. (We removed the 'op_type': 'create' pair, whose intentions were always mysterious.) (Gavin Carothers)
Rename the force_insert kwarg of index() to overwrite_existing. The old name implied the opposite of what it actually did. (Gavin Carothers)
v0.5 (2013-04-20)
Support multiple indices and doctypes in delete_by_query(). Accept both string and JSON queries in the query arg, just as search() does. Passing the q arg explicitly is now deprecated.
Add multi_get.
Add percolate. Thanks, Adam Georgiou and Joseph Rose!
Add ability to specify the parent document in bulk_index(). Thanks, Gavin Carothers!
Remove the internal, undocumented from_python method. django-haystack users will need to upgrade to a newer version that avoids using it.
Refactor JSON encoding machinery. Now it’s clearer how to customize it: just plug your custom JSON encoder class into ElasticSearch.json_encoder.
Don’t crash under python -OO.
Support non-ASCII URL path components (like Unicode document IDs) and query string param values.
Switch to the nose testrunner.
v0.4.1 (2013-03-25)
Fix a bug introduced in 0.4 wherein “None” was accidentally sent to ES when an ID wasn’t passed to index().
v0.4 (2013-03-19)
Support Python 3.
Support more APIs:
cluster_state
get_settings
update_aliases and aliases
update (existed but didn’t work before)
Support the size param of the search method. (You can now change es_size to size in your code if you like.)
Support the fields param on index and update methods, new since ES 0.20.
Maintain better precision of floats when passed to ES.
Change endpoint of bulk indexing so it works on ES < 0.18.
Support documents whose ID is 0.
URL-escape path components, so doc IDs containing funny chars work.
Add a dedicated IndexAlreadyExistsError exception for when you try to create an index that already exists. This helps you trap this situation unambiguously.
Add docs about upgrading from pyes.
Remove the undocumented and unused to_python method.
v0.3 (2013-01-10)
Correct the requests requirement to require a version that has everything we need. In fact, require requests 1.x, which has a stable API.
Add update() method.
Make send_request method public so you can use ES APIs we don’t yet explicitly support.
Handle JSON translation of Decimal class and sets.
Make more_like_this() take an arbitrary request body so you can filter the returned docs.
Replace the fields arg of more_like_this with mlt_fields. This makes it actually work, as it’s the param name ES expects.
Make explicit our undeclared dependency on simplejson.
v0.2 (2012-10-06)
Many thanks to Erik Rose for almost completely rewriting the API to follow best practices, improve the API user experience, and make pyelasticsearch future-proof.
Backward-incompatible changes:
Simplify search() and count() calling conventions. Each now supports either a textual or a dict-based query as its first argument. There’s no longer a need to, for example, pass an empty string as the first arg in order to use a JSON query (a common case).
Standardize on the singular for the names of the index and doc_type kwargs. It’s not always obvious whether an ES API allows for multiple indexes. This was leading me to have to look aside to the docs to determine whether the kwarg was called index or indexes. Using the singular everywhere will result in fewer doc lookups, especially for the common case of a single index.
Rename morelikethis to more_like_this for consistency with other methods.
index() now takes (index, doc_type, doc) rather than (doc, index, doc_type), for consistency with bulk_index() and other methods.
Similarly, put_mapping() now takes (index, doc_type, mapping) rather than (doc_type, mapping, index).
To prevent callers from accidentally destroying large amounts of data…
delete() no longer deletes all documents of a doctype when no ID is specified; use delete_all() instead.
delete_index() no longer deletes all indexes when none are given; use delete_all_indexes() instead.
update_settings() no longer updates the settings of all indexes when none are specified; use update_all_settings() instead.
setup_logging() is gone. If you want to configure logging, use the logging module’s usual facilities. We still log to the “pyelasticsearch” named logger.
Rethink error handling:
Raise a more specific exception for HTTP error codes so callers can catch it without examining a string.
Catch non-JSON responses properly, and raise the more specific NonJsonResponseError instead of the generic ElasticSearchError.
Remove mentions of nonexistent exception types that would cause crashes in their except clauses.
Crash harder if JSON encoding fails: that always indicates a bug in pyelasticsearch.
Remove the ill-defined ElasticSearchError.
Raise ConnectionError rather than ElasticSearchError if we can’t connect to a node (and we’re out of auto-retries).
Raise ValueError rather than ElasticSearchError if no documents are passed to bulk_index.
All exceptions are now more introspectable, because they don’t immediately mash all the context down into a string. For example, you can recover the unmolested response object from ElasticHttpError.
Removed quiet kwarg, meaning we always expose errors.
Other changes:
Add Sphinx documentation.
Add load-balancing across multiple nodes.
Add failover in the case where a node doesn’t respond.
Add close_index, open_index, update_settings, health.
Support passing arbitrary kwargs through to the ES query string. Known ones are taken verbatim; unanticipated ones need an “es_” prefix to guarantee forward compatibility.
Automatically convert datetime objects when encoding JSON.
Recognize and convert datetimes and dates in pass-through kwargs. This is useful for timeout.
In routines that can take either one or many indexes, don’t require the caller to wrap a single index name in a list.
Many other internal improvements
v0.1 (2012-08-30)
Initial release based on the work of Robert Eanes and other authors