Pure-Python object database.
Overview
Dobbin is a transactional object database for Python. It’s a fast and convenient way to persist Python objects on disk.
Key features:
Multi-thread, multi-process with no configuration
Persistent objects carry no overhead in the general case
Threads share most object data
Does not attempt to manage memory
Implemented all in Python
Efficient storing and serving of binary streams
Source
The source code is kept in version control. Use this command to anonymously check out the latest project source code:
svn co http://svn.repoze.org/dobbin/trunk dobbin
User’s guide
This is the primary documentation for the database. It uses an interactive narrative which doubles as a doctest.
You can run the tests by issuing the following command at the command-line prompt:
$ python setup.py test
Setup
The first step is to connect the database to storage. The database storage layer is abstracted; included with the database is an implementation which logs transactions to a file, optimized for long-running processes, e.g. application servers.
To configure the transaction log, we simply provide a path. It needn’t point to an existing file; upon the first commit to the database, the file will be created.
>>> from dobbin.storage import TransactionLog
>>> storage = TransactionLog(database_path)
We pass the storage to the database constructor for initialization.
>>> from dobbin.database import Database
>>> db = Database(storage)
The database is empty to begin with; we can verify this by using the built-in len function to determine the number of objects stored.
>>> len(db)
0
This object database uses an object graph persistence model; that is, all persisted objects must be connected to the same graph. Connected in this context means that another connected object holds a Python reference to it.
The empty database has no elected root object; if we ask for it, we simply get None as the answer.
>>> db.get_root() is None
True
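The reachability rule can be illustrated with a small sketch that does not use Dobbin at all. The names Node and reachable are hypothetical, and the traversal assumes that references are discovered by looking at instance attributes only.

```python
class Node(object):
    """Hypothetical stand-in for a persistent object."""

    def __init__(self, name):
        self.name = name


def reachable(root):
    """Return the set of Node instances connected to ``root``."""
    if root is None:
        return set()
    seen = {id(root): root}
    stack = [root]
    while stack:
        current = stack.pop()
        # Assumption: references live in instance attributes.
        for value in vars(current).values():
            if isinstance(value, Node) and id(value) not in seen:
                seen[id(value)] = value
                stack.append(value)
    return set(seen.values())


root = Node("root")
child = Node("child")
orphan = Node("orphan")   # never referenced from the graph
root.child = child
child.parent = root       # a circular reference does not loop forever

connected = reachable(root)
```

Here connected holds root and child; orphan is left out because nothing in the graph refers to it.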
Setting the root
Any persistent object can be elected as the database root object. Persistent objects must inherit from the Persistent class. These objects form the basis of the concurrency model; overlapping transactions may only write to disjoint sets of objects (conflict resolution mechanisms are available to ease this requirement).
>>> from dobbin.persistent import Persistent
>>> obj = Persistent()
Persistent objects are read-only by default; the state (dict) is shared between threads. It is not difficult to use or abuse this in general, but we do prevent setting attributes on objects in shared state to make the point explicit.
>>> obj.name = "John"
Traceback (most recent call last):
  ...
RuntimeError: Can't set attribute in read-only mode.
If we call the checkout function on the object, its state changes from read-only to thread-local.
>>> from dobbin.persistent import checkout
>>> checkout(obj)
The object identity is never changed, but the object state is masked by a thread-local dictionary.
>>> obj.name = 'John'
>>> obj.name
'John'
When an object is first checked out by some thread, a counter is set to keep track of how many threads have checked out the object. When the counter falls to zero (always on a transaction boundary), the object is retracted to its previous shared state.
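The masking idea can be sketched with the standard library alone. This is not Dobbin's implementation: Shared, checkout and release are hypothetical names, and the sketch simply discards local changes on release instead of committing them.

```python
import threading


class Shared(object):
    """Hypothetical object whose shared state is masked per thread."""

    def __init__(self, **state):
        # Bypass our own __setattr__ while setting up internals.
        object.__setattr__(self, "_shared", dict(state))
        object.__setattr__(self, "_local", threading.local())
        object.__setattr__(self, "_count", 0)
        object.__setattr__(self, "_lock", threading.Lock())

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for state attributes.
        overlay = getattr(self._local, "state", None)
        if overlay is not None and name in overlay:
            return overlay[name]
        try:
            return self._shared[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        overlay = getattr(self._local, "state", None)
        if overlay is None:
            raise RuntimeError("Can't set attribute in read-only mode.")
        overlay[name] = value


def checkout(obj):
    """Give the calling thread a private overlay (no-op if it has one)."""
    if getattr(obj._local, "state", None) is None:
        obj._local.state = {}
        with obj._lock:
            object.__setattr__(obj, "_count", obj._count + 1)


def release(obj):
    """Drop the calling thread's overlay, as on a transaction boundary."""
    if getattr(obj._local, "state", None) is not None:
        obj._local.state = None
        with obj._lock:
            object.__setattr__(obj, "_count", obj._count - 1)


obj = Shared(name="John")
checkout(obj)
obj.name = "James"          # visible only to the current thread

seen_elsewhere = []
worker = threading.Thread(target=lambda: seen_elsewhere.append(obj.name))
worker.start()
worker.join()
```

The worker thread still reads the shared value, while the main thread sees its own change. Calling release(obj) brings the counter back to zero and makes the object read-only again.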
Electing a database root
We can elect this object as the root of the database.
>>> db.set_root(obj)
>>> obj._p_oid
0
The object is now the root of the object graph. To persist changes on disk, we commit the transaction.
>>> import transaction
>>> transaction.commit()
As expected, the database contains one object.
>>> len(db)
1
The storage layer should report that a single transaction has been logged.
>>> len(storage)
1
Transactions
The transaction log always appends data; it will grow with every transaction.
>>> checkout(obj)
>>> obj.name = 'James'
>>> transaction.commit()
Verify transaction count.
>>> len(storage)
2
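To make the append-only behaviour concrete, here is a minimal sketch of a log that only ever grows. It is not Dobbin's on-disk format: the AppendOnlyLog class and its record layout (a 4-byte length followed by a pickle) are assumptions made for illustration.

```python
import os
import pickle
import struct
import tempfile


class AppendOnlyLog(object):
    """Hypothetical log: each record is a 4-byte length plus a pickle."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        payload = pickle.dumps(record)
        # Mode "ab" creates the file on first use and writes at the end.
        with open(self.path, "ab") as f:
            f.write(struct.pack(">I", len(payload)))
            f.write(payload)

    def __iter__(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                (size,) = struct.unpack(">I", header)
                yield pickle.loads(f.read(size))

    def __len__(self):
        # The number of records equals the number of logged transactions.
        return sum(1 for _ in self)


path = os.path.join(tempfile.mkdtemp(), "example.log")
log = AppendOnlyLog(path)
empty = len(log)                 # the file does not exist yet
log.append({"name": "John"})
log.append({"name": "James"})
```

As in the narrative, the file is only created by the first append, and every later append adds one record without touching earlier ones.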
Conflicts
When two threads try to make changes to the same objects, we have a write conflict. One thread is guaranteed to win; with conflict resolution, both may.
In a new thread, we check out an object, make changes to it, then wait for a semaphore before we commit.
>>> from threading import Semaphore
>>> flag = Semaphore()
>>> def run():
...     checkout(obj)
...     obj.name = 'Bob'
...     flag.acquire()
...     transaction.commit()
...     flag.release()
>>> from threading import Thread
>>> thread = Thread(target=run)
>>> flag.acquire()
True
>>> thread.start()
We do the same in the main thread.
>>> checkout(obj)
>>> obj.name = 'Bill'
Releasing the semaphore, the thread will attempt to commit the transaction.
>>> flag.release()
>>> thread.join()
The transaction was committed.
>>> len(storage)
4
Trying to commit the transaction in the main thread, we get a write conflict.
>>> transaction.commit()
Traceback (most recent call last):
  ...
WriteConflictError...
The commit failed; this has implications beyond the exception being raised. A transaction record was written to disk.
>>> len(storage)
5
Checked out objects have been reverted to the state of the most recent transaction.
>>> obj.name
'Bob'
We must abort the failed transaction explicitly.
>>> transaction.abort()
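The first-committer-wins outcome shown above can be reproduced with a generic optimistic-concurrency sketch. This is a textbook technique, not a description of how Dobbin detects conflicts; VersionedStore and WriteConflict are hypothetical names.

```python
import threading


class WriteConflict(Exception):
    """Raised when another transaction changed the same object first."""


class VersionedStore(object):
    """Hypothetical store that tracks one version number per object id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}      # oid -> committed state
        self._version = {}   # oid -> number of committed writes

    def begin(self, oid):
        """Return (private copy of the state, version seen)."""
        with self._lock:
            return dict(self._data.get(oid, {})), self._version.get(oid, 0)

    def commit(self, oid, state, seen_version):
        with self._lock:
            # Someone else committed since we started: refuse to overwrite.
            if self._version.get(oid, 0) != seen_version:
                raise WriteConflict(oid)
            self._data[oid] = dict(state)
            self._version[oid] = seen_version + 1


store = VersionedStore()
first, first_seen = store.begin(0)
second, second_seen = store.begin(0)
first["name"] = "Bob"
second["name"] = "Bill"

store.commit(0, first, first_seen)        # the first committer wins
try:
    store.commit(0, second, second_seen)
    outcome = "committed"
except WriteConflict:
    outcome = "conflict"
```

The second commit is rejected and the stored name stays 'Bob', which mirrors the result in the narrative.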
When all threads are done with an object they’ve previously checked out, its state is retracted to shared. To verify this, we try to set an attribute on it.
>>> obj.name = "John"
Traceback (most recent call last):
  ...
RuntimeError: Can't set attribute in read-only mode.
Threads belonging to different processes can conflict too. We can simulate two processes by again opening a new thread, but this time using a second database instance (new_storage is its storage, and new_obj is its copy of the root object).
We begin a new transaction such that both database instances are up-to-date.
>>> tx = transaction.begin()
Confirm that the storages are indeed up-to-date (and have registered the same number of transactions).
>>> len(storage) == len(new_storage)
True
>>> def run():
...     checkout(new_obj)
...     new_obj.name = 'Ian'
...     flag.acquire()
...     transaction.commit()
...     flag.release()
>>> thread = Thread(target=run)
>>> flag.acquire()
True
>>> thread.start()
We do the same in the main thread.
>>> checkout(obj)
>>> obj.name = 'Ilya'
Releasing the semaphore, the thread will attempt to commit the transaction.
>>> flag.release()
>>> thread.join()
The transaction was committed.
>>> len(new_storage)
6
If we try to commit the transaction in the main thread, we get a read conflict. It is not a write conflict because the storage first catches up on new transactions, and catching up is what triggers the read conflict.
>>> transaction.commit()
Traceback (most recent call last):
  ...
ReadConflictError...
Again, the failed transaction is recorded.
>>> len(storage)
7
The state of the object reflects the transaction which was committed in the thread.
>>> obj.name
'Ian'
We clean up from the failed transaction.
>>> transaction.abort()
More objects
When objects are added to the object graph, they are automatically persisted.
>>> another = Persistent()
>>> checkout(another)
>>> another.name = 'Karla'
>>> checkout(obj)
>>> obj.another = another
We commit the transaction and observe that the object count has grown. The new object has been assigned an oid as well (these are not in general predictable; they are assigned by the storage).
>>> transaction.commit()
>>> len(db)
2
>>> another._p_oid is not None
True
As we check out the object that carries the reference and access any attribute, a deep-copy of the shared state is made behind the scenes. Persistent objects are never copied, however, which a simple identity check will confirm.
>>> checkout(obj)
>>> obj.another is another
True
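One way to get "deep-copy everything except persistent objects" with the standard library is to have those objects return themselves from __deepcopy__. The sketch below shows only that mechanism; the Ref class is hypothetical, and whether Dobbin uses this particular hook is an assumption.

```python
import copy


class Ref(object):
    """Hypothetical persistent object: deep copies keep its identity."""

    def __deepcopy__(self, memo):
        return self


another = Ref()
shared_state = {"tags": ["a", "b"], "another": another}

# Plain containers are duplicated; the Ref instance is passed through.
local_state = copy.deepcopy(shared_state)
local_state["tags"].append("c")
```

The local list can be changed without affecting the shared one, while local_state["another"] is still the very same object as another.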
Circular references are permitted.
>>> checkout(another)
>>> another.another = obj
>>> transaction.commit()
Again, we can verify the identity.
>>> another.another is obj
True
Storing files
We can persist open files (or any stream object) by enclosing them in a persistent file wrapper. The wrapper is immutable; it’s for single use only.
>>> from tempfile import TemporaryFile
>>> file = TemporaryFile()
>>> file.write('abc')
>>> file.seek(0)
Note that the file is read from the current position to the end of the file.
>>> from dobbin.persistent import PersistentFile
>>> pfile = PersistentFile(file)
Let’s store this persistent file as an attribute on our object.
>>> checkout(obj)
>>> obj.file = pfile
>>> transaction.commit()
Note that the persistent file has been given a new class by the storage layer. It’s the same object (in terms of object identity), but since it’s now stored in the database and is only available as a file stream, we call it a persistent stream.
>>> obj.file
<dobbin.storage.PersistentStream object at ...>
We must manually close the file we provided to the persistent wrapper (or let it fall out of scope).
>>> file.close()
Using persistent streams
There are two ways to use a persistent stream: by iterating through it, in which case it automatically gets a file handle (implicitly closed when the iterator is garbage-collected), or through a file-like API.
We use the open method to open the stream; this is always required when using the stream as a file.
>>> obj.file.open()
>>> obj.file.read()
'abc'
The seek and tell methods work as expected.
>>> obj.file.tell()
3L
We can seek to the beginning and repeat the exercise.
>>> obj.file.seek(0)
>>> obj.file.read()
'abc'
As with any file, we have to close it after use.
>>> obj.file.close()
In addition we can use iteration to read the file; in this case, we needn’t bother opening or closing the file. This is automatically done for us. Note that this makes persistent streams suitable as return values for WSGI applications.
>>> "".join(obj.file)
'abc'
Iteration is strictly independent from the other methods. We can observe that the file remains closed.
>>> obj.file.closed
True
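The remark about WSGI can be illustrated without Dobbin. In the sketch below, ChunkedStream is a hypothetical stand-in for a persistent stream: each iteration opens its own handle and closes it when done, so the application can simply return the object as the response body.

```python
import io


class ChunkedStream(object):
    """Hypothetical stream: each iteration opens its own handle."""

    def __init__(self, opener, chunk_size=8192):
        self._opener = opener
        self._chunk_size = chunk_size

    def __iter__(self):
        handle = self._opener()
        try:
            while True:
                chunk = handle.read(self._chunk_size)
                if not chunk:
                    break
                yield chunk
        finally:
            # Runs when the iterator is exhausted or discarded.
            handle.close()


def make_app(stream):
    def app(environ, start_response):
        start_response("200 OK",
                       [("Content-Type", "application/octet-stream")])
        return stream   # WSGI servers iterate over the return value
    return app


stream = ChunkedStream(lambda: io.BytesIO(b"abc"), chunk_size=2)
app = make_app(stream)

# Call the application the way a server would, with a stub start_response.
statuses = []
body = b"".join(app({}, lambda status, headers: statuses.append(status)))
```

The application never opens or closes anything itself; the stream takes care of that during iteration.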
Cleanup
>>> transaction.commit()
This concludes the narrative.
Notes
Most users of the database will want to get acquainted with the information in this section, especially before deployment.
Configuration
The default storage option (the transaction log) keeps data in a single file. Multiple processes may connect to the same file and share the same database. No further configuration is required; the storage uses native file-locking to ensure exclusive write-access.
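As a rough idea of what exclusive write access through file locking looks like, here is a POSIX-only sketch using fcntl.flock. The text above does not say which locking primitive Dobbin uses, so the choice of flock and the helper name append_exclusively are assumptions.

```python
import fcntl
import os
import tempfile


def append_exclusively(path, data):
    """Append ``data`` while holding an exclusive advisory lock."""
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)     # blocks until the lock is free
        try:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # push the bytes to disk
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


path = os.path.join(tempfile.mkdtemp(), "shared.db")
append_exclusively(path, b"first\n")
append_exclusively(path, b"second\n")
```

Any process that follows the same protocol on the same file waits its turn, so writes never interleave. The lock is advisory: it only protects against writers that also take it.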
You may want to compile Python with the --without-pymalloc flag to use native memory allocation. This may improve performance in applications that connect to large databases due to better paging.
Motivation
There are other object databases available for Python, most importantly the ZODB from Zope Corporation (available under the BSD-like ZPL license).
Notable differences:
Dobbin is pure Python
1/20 the codebase
Less overhead
The assumptions that Dobbin makes lead to a simple design; the case of the ZODB is the exact opposite. Which is more reasonable comes down to these assumptions.
Architecture
Dobbin does not try to limit its memory usage in any way. The assumption that led to this decision is that it’s faster to page in CPython objects from swap than to read pickles from the database file and restore the objects, which adds allocation overhead on top of the expensive unpickle operation.
Persistent objects are kept in a shared state when possible, meaning that data is shared between threads. The exception is when threads want to change the state as part of a transaction. Objects are then checked out (an explicit function call) which puts the object in a local state; objects in this state have a local deep-copy of the shared state, which they are free to change.
Another objective was to get rid of the requirement of a master node in order for several processes to share a single database. Instead we use native file-system locking and pull-based transaction propagation. There is no inherent network-support; it may be possible to use a virtualized file system (this is on a strictly theoretical basis; it has not been attempted).
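Pull-based propagation can be sketched as a reader that remembers how far it has read and, when asked, pulls whatever was appended since. This illustrates the idea only: LogReader, catch_up and the one-JSON-record-per-line format are hypothetical and say nothing about Dobbin's actual file layout.

```python
import json
import os
import tempfile


class LogReader(object):
    """Hypothetical reader that pulls records it has not seen yet."""

    def __init__(self, path):
        self.path = path
        self.offset = 0     # byte position of the first unread record

    def catch_up(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            lines = f.readlines()
            self.offset = f.tell()
        return [json.loads(line.decode("utf-8")) for line in lines]


def append(path, record):
    """Stand-in for another process committing a transaction."""
    with open(path, "ab") as f:
        f.write(json.dumps(record).encode("utf-8") + b"\n")


path = os.path.join(tempfile.mkdtemp(), "tx.log")
reader = LogReader(path)

append(path, {"tid": 1, "name": "Bob"})
first_pull = reader.catch_up()
append(path, {"tid": 2, "name": "Ian"})
second_pull = reader.catch_up()
third_pull = reader.catch_up()   # nothing new since the last pull
```

No master node pushes changes; each reader decides when to catch up, for example at the start of a transaction.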
The database relies on the transaction package to support two-phase commits.
Changes
0.1 (2009-09-26)
Initial public release.