Helpers to bridge different Python envs and OpenOffice.org.
ulif.openoffice
Bridging Python and OpenOffice.org.
This package provides tools like daemons and converters to ease access to OpenOffice.org installations for Python programmers.
The main purpose of the whole package is to provide support for converting office documents from Python using OpenOffice.org but without the need to have PyUNO support with the Python binary that actually runs your Python application (like Plone, for instance).
The complete documentation of the most recent release can be found at
What does ulif.openoffice provide?
oooctl
A commandline script to start/stop OpenOffice.org as a daemon (without X). While OOo brings this functionality out-of-the-box, the daemon also monitors the status of the OOo server and restarts it if necessary.
pyunoctl
A commandline script to start/stop a converter daemon, that listens for requests to convert office documents from .odt, .doc, .docx, etc. to HTML or PDF using OpenOffice.org. Includes a caching mechanism that holds docs already converted.
An API to access any pyunoctl daemon programmatically using Python.
Introduction
What ulif.openoffice is
ulif.openoffice is a Python package to support document transformations using OpenOffice.org (OOo).
It provides components to ask a running OOo-server for document conversions from office-type documents like .doc or .odt to HTML or PDF. Using ulif.openoffice you can trigger such conversions via commandline or via a Python-API that also works with Python versions without any PyUNO support.
Furthermore, it provides a caching server that caches every document once it has been converted and delivers the cached result when the same document is requested again. Depending on your needs this can speed things up by a factor of 10 or more.
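The caching idea can be illustrated with a small sketch in Python 3. This is not the package's cache manager: the class name ConversionCache, its methods and the flat directory layout are hypothetical. It only shows the principle of keying converted results by a digest of the source contents (the changelog mentions MD5 digests).

```python
import hashlib
import os
import shutil


class ConversionCache(object):
    """Hypothetical cache: stores converted files under the MD5 of the source."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir

    def _key(self, source_bytes):
        # The digest of the source contents identifies a cache entry.
        return hashlib.md5(source_bytes).hexdigest()

    def get(self, source_bytes, suffix):
        """Return the path of a cached result, or None if nothing is cached."""
        path = os.path.join(self.cache_dir, self._key(source_bytes) + suffix)
        return path if os.path.isfile(path) else None

    def put(self, source_bytes, result_path):
        """Copy a freshly converted file into the cache and return its path."""
        suffix = os.path.splitext(result_path)[1]
        target = os.path.join(self.cache_dir, self._key(source_bytes) + suffix)
        shutil.copyfile(result_path, target)
        return target
```

A caller would try get() first and only run a conversion, followed by put(), on a miss. A robust cache also has to compare sources whose digests are equal, which is the kind of check the 0.3 changelog entry describes.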
Sources
ulif.openoffice is hosted on:
where you can get latest released versions.
The subversion repository of the sources is:
Requirements
ulif.openoffice requires some PyUNO-capable Python somewhere to do the actual conversions. It also provides a client API for Python installations that lack that support. Current Debian-based distributions normally offer a package for PyUNO support.
ulif.openoffice is tested on Debian-based systems, most notably Ubuntu, and won’t work on Windows.
The package is designed for server-based deployments. While the OOo-server is running, you cannot use the office suite on your desktop (at least at the time of writing). This is a limitation of OOo itself.
Overview
ulif.openoffice mainly provides three different components:
An oooctl server that runs in background, starts a local OOo-server and monitors its status. If the OOo server process dies, it is restarted by oooctl.
A pyunoserver, which is a TCP-server that implements its own protocol to listen for conversion requests. When a valid request arrives, it tries to contact a local OOo server to do the conversion.
The pyunoserver also runs a cache manager that caches already converted documents and delivers them if a converted version already exists.
This component needs access to the PyUNO library.
A client library to talk to the PyUNO server. This component does not require PyUNO.
The three components play together roughly as shown in the following figure:
The blue lines show the way from a source document (in .doc format) to the OpenOffice.org server while the red lines show the way back of the converted document (PDF).
Use of client-API, oooctl server and cache is optional.
All this currently happens on the same machine. There are plans for support of multi-machine scenarios with distributed servers and load-balancing features.
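The monitoring role of oooctl described in this overview (restart the OOo server when its process dies) can be sketched as follows in Python 3. This is a minimal illustration, not the package's implementation: the function name keep_alive and the max_restarts limit are assumptions added so that the sketch terminates, and the default interval of one second follows the changelog note that the check happens every second.

```python
import subprocess
import sys
import time


def keep_alive(command, check_interval=1.0, max_restarts=3):
    """Run command and restart it whenever it exits.

    Returns the number of restarts performed once max_restarts is
    reached and the process has exited again.
    """
    restarts = 0
    process = subprocess.Popen(command)
    while True:
        time.sleep(check_interval)
        if process.poll() is None:
            continue  # process is still running, nothing to do
        if restarts >= max_restarts:
            return restarts
        # The process died: start a fresh one.
        process = subprocess.Popen(command)
        restarts += 1


if __name__ == "__main__":
    # Demo with a child process that exits immediately.
    print(keep_alive([sys.executable, "-c", "pass"],
                     check_interval=0.2, max_restarts=2))
```

A real watchdog would normally run without a restart limit and would also log each restart.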
Prerequisites
There are, unfortunately, zillions of possible reasons why OpenOffice.org cannot be started in the background on a system.
The scripts in here were tested with Ubuntu and work.
It is mandatory that the system user running oooctl is a regular user with at least a home directory. OpenOffice.org relies on that directory to store information even in headless mode.
Recent OpenOffice.org versions require no X-server for running.
If you want to use an Ubuntu (or Debian) packaged install of OOo, you must make sure that you have installed the following packages with apt-get:
openoffice.org-headless (for Ubuntu < 9.04, not needed for newer)
openoffice.org-java-common
in addition to the usual OOo packages, i.e.:
openoffice.org (at least for Ubuntu >= 9.04)
msttcorefonts
The latter is optional but needed to have the most common fonts used with OpenOffice.org documents available. Without the correct fonts installed, results of document transforms might be poor.
Then you need at least one Python version that can run:
$ python -c "import uno"
without raising any exceptions.
On newer Ubuntu versions you can install:
python-uno (if available)
The clients and other software apart from the oooctl-server and the pyuno-server can be run with a different Python version.
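To find out which of several installed interpreters supports PyUNO, the import check above can be automated. The following Python 3 sketch is only an illustration: the helper name supports_uno and the candidate paths in the demo are assumptions, not part of ulif.openoffice.

```python
import subprocess
import sys


def supports_uno(python_executable):
    """Return True if the given interpreter can run 'import uno' without error."""
    try:
        returncode = subprocess.call(
            [python_executable, "-c", "import uno"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The interpreter does not exist or is not executable.
        return False
    return returncode == 0


if __name__ == "__main__":
    # Candidate paths are examples only; adjust them to your system.
    for candidate in (sys.executable, "/usr/bin/python", "/usr/bin/python3"):
        print(candidate, supports_uno(candidate))
```

The first interpreter that reports True is the one to use for the oooctl-server and the pyuno-server.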
If you successfully installed this package on a different system, we’d be glad to hear from you, especially if you could tell us which system packages you used.
Building
Using zc.buildout and SVN checkout
To use buildout with an SVN-checkout of the package from somewhere below
First make sure that you entered your UNO-supporting Python version in buildout.cfg. By default it will assume that this is /usr/bin/python.
If:
$ /usr/bin/python -c "import uno"
gives an exception on your system, you must edit buildout.cfg, section [unopython], to tell where the supporting Python can be found.
Then run:
$ python bootstrap/bootstrap.py
with the Python version your client should later run with. This can be the UNO-supporting Python, but it does not have to be.
This way you can, for example, use the client components with Python 2.4 while the ooo-server and pyuno-server will run with Python 2.6.
After running bootstrap.py, do:
$ ./bin/buildout
which will create all scripts in bin/.
Using easy_install
Instead of using zc.buildout you can also use easy_install.
If using easy_install, you might have to install the package twice: once with a Python binary that supports PyUNO and once with the Python binary that will be used by your application.
Make sure you have at least one Python version that supports PyUNO. See Prerequisites above.
For this Python version, install easy_install (only needed if it is not already present, of course):
$ wget http://peak.telecommunity.com/dist/ez_setup.py
$ path/to/pyuno/supporting-python ez_setup.py
Install ulif.openoffice for this Python-version:
$ path/to/pyuno/supporting-easy_install ulif.openoffice
Do the same for the Python-version used by your application:
$ path/to/myapp/supporting-python ez_setup.py
$ path/to/myapp/supporting-easy_install ulif.openoffice
It is generally useful to do this in virtualenv environments.
Using the scripts
There are four main components that come with ulif.openoffice:
an oooctl-server that starts OpenOffice.org in background.
a pyuno-server that listens for requests to convert docs. This server depends on a running oooctl-server.
a client component that can be accessed via API and can talk to the pyuno-server. This way you can convert docs from Python, and the Python version does not have to provide the uno lib.
a converter script (also in ./bin) that you can use on the commandline. It depends on a running oooctl server and can convert docs to .txt, .html and .pdf format. It is merely a little test programme that was used during development, but you might have some use for it.
You can start the oooctl-server with:
$ ./bin/oooctl start
Do:
$ ./bin/oooctl --help
to see all options.
You can stop the daemon with:
$ ./bin/oooctl stop
The same applies to the pyuno-server:
$ ./bin/pyunoctl start
$ ./bin/pyunoctl --help
$ ./bin/pyunoctl stop
do what you think they do.
The converter script can be called like this:
$ ./bin/convert sourcefile.doc
to create a sourcefile.txt conversion.
Do:
$ ./bin/convert --pdf sourcefile.doc
to create a PDF of sourcefile.doc, and:
$ ./bin/convert --html sourcefile.doc
to create an HTML version of sourcefile.doc.
For the client API see the .txt files in the source.
Examples
oooctl – the OOo daemon
We can start an OpenOffice.org daemon using the oooctl script. This daemon starts an already installed OpenOffice.org instance as server (without GUI, so it is usable on servers).
The oooctl script is defined in setup.py to be installed as a console script, so if you install ulif.openoffice with easy_install or setup.py, an executable script will be installed in your local bin/ directory.
Here we ‘fake’ this install by using buildout, which will install the script in our test environment.
To do so we create a buildout.cfg file:
>>> write('buildout.cfg',
... '''
... [buildout]
... parts = openoffice-ctl
... offline = true
...
... [unopython]
... executable = /usr/bin/python
...
... [openoffice-ctl]
... recipe = zc.recipe.egg
... eggs = ulif.openoffice
... python = unopython
... ''')
Now we can run buildout to install our script (and other scripts, described below, as well):
>>> print system(join('bin', 'buildout'))
Installing openoffice-ctl.
Generated script '.../bin/pyunoctl'.
Generated script '.../bin/convert'.
Generated script '.../bin/oooctl'.
<BLANKLINE>
The script provides help with the -h switch:
>>> print system(join('bin', 'oooctl') + ' -h')
Usage: oooctl [options] start|fg|stop|restart|status
...
The main actions are to call the script with one of the:
start|fg|stop|restart|status
commands as argument, where fg means: start in foreground. This can be handy, if you want the process to be monitored by third-party tools like some supervisor daemon or similar. In that case the process will not detach from the invoking shell on startup.
We set the oooctl path as a var for further use:
>>> oooctl_bin = join('bin', 'oooctl')
-b – Setting the OpenOffice.org installation path
oooctl needs to know which OOo install should be used and where it lives. We can set the path to the binary using the -b or --binarypath switch of oooctl.
By default this path is set to:
>>> from ulif.openoffice.oooctl import OOO_BINARY
>>> OOO_BINARY
'/usr/lib/openoffice/program/soffice'
which might not be true for your local system.
For our local test we create an executable script which will fake a real OpenOffice.org binary:
>>> import sys
>>> write('fake_soffice',
... '''#!%s
... import sys
... import pprint
... sys.stdout.write("Fake soffice started with these options/args:\\n")
... pprint.pprint(sys.argv)
... sys.stderr.flush()
... sys.stdout.flush()
... while 1:
...     pass
... ''' % sys.executable)
This script will simply loop forever (well, sort of). We determine the exact absolute path of our ‘binary’:
>>> import os
>>> soffice_path = os.path.join(os.getcwd(), 'fake_soffice')
We must make this script executable:
>>> os.chmod('fake_soffice', 0700)
Now we can call the daemon and tell it to start our faked office server:
>>> print system("%s -b %s start" % (join('bin', 'oooctl'), soffice_path))
starting OpenOffice.org server, going into background...
started with pid ...
<BLANKLINE>
We can get the daemon status:
>>> print system(join('bin', 'oooctl') + ' status')
Status: Running (PID ...)
<BLANKLINE>
We can stop the server:
>>> print system(join('bin', 'oooctl') + ' stop')
stopping pid ... done.
<BLANKLINE>
(Re-)Directing the daemon input and output
By default the daemonized programme’s output will be redirected to /dev/null. You can, however, use the --stdout, --stderr and --stdin options to set appropriate log files.
We create a temporary log file:
>>> import tempfile
>>> (tmp_fd, tmp_path) = tempfile.mkstemp()
Now we start the OOo server with the tempfile as logger:
>>> print system(join('bin', 'oooctl') + ' -b %s start' % (
...     soffice_path, )
...     + ' --stdout="%s"' % tmp_path)
starting OpenOffice.org server, going into background...
started with pid ...
<BLANKLINE>
>>> print system(join('bin', 'oooctl') + ' stop')
stopping pid ... done.
<BLANKLINE>
In the logfile we can see what arguments and options the daemon used:
>>> cat(tmp_path)
Fake soffice started with these options/args:
['/sample-buildout/fake_soffice',
 '-accept=socket,host=localhost,port=2002;urp;',
 '-headless',
 '-nologo',
 '-nofirststartwizard',
 '-norestore']
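The logged argument list shows how the office binary is invoked. As a rough Python 3 sketch, such a command line could be assembled as below. The helper build_soffice_command is hypothetical and not taken from the package; the option names, host and port are copied from the log output above.

```python
def build_soffice_command(binary, host="localhost", port=2002):
    """Build an argument list like the one logged by the fake soffice above."""
    accept = "-accept=socket,host=%s,port=%d;urp;" % (host, port)
    return [
        binary,
        accept,                 # listen for UNO connections on host:port
        "-headless",            # run without a GUI
        "-nologo",
        "-nofirststartwizard",
        "-norestore",
    ]
```

The resulting list can be passed directly to subprocess.Popen, which avoids shell quoting problems with the semicolons in the accept string.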
pyunoctl – a conversion daemon
This script starts a server in background that allows conversion of documents using the pyUNO API. It requires a running OO.org server in background (see above).
Currently conversion from all OOo readable formats (.doc, .odt, .txt, …) to HTML and PDF-A is supported. This means, if you can load a document with OpenOffice.org, then this daemon can convert it to HTML or PDF-A.
The conversion daemon starts a server in background (unless you specify fg as startmode, which will keep the server attached to the invoking shell) which listens for conversion requests on a TCP port. It then calls OpenOffice.org via the pyUNO-API to perform the conversion and responds with the path of the generated doc (or an error message).
The conversion server is a multithreaded asynchronous TCP daemon. So, several requests can be served at the same time.
The script provides help with the -h switch:
>>> print system(join('bin', 'pyunoctl') + ' -h')
Usage: pyunoctl [options] start|fg|stop|restart|status
...
>>> import os
Before we can really use the daemon, we have to fire up the OOo daemon:
>>> print system(join('bin', 'oooctl') + ' --stdout=/tmp/output start')
starting OpenOffice.org server, going into background...
started with pid ...
<BLANKLINE>
Now, we start the pyuno daemon:
>>> print system(join('bin', 'pyunoctl') + ' --stdout=/tmp/out start')
starting pyUNO conversion server, going into background...
started with pid ...
<BLANKLINE>
Testing the conversion daemon
Once the daemon has started, we can send requests. One of the commands we can send tests the environment, the connection and so on. For this, we need a client that sends commands and parses the responses for us. It is not difficult to write your own client (a few lines of socket code will do), but if you’re writing third-party software you might prefer the ready-to-use client from ulif.openoffice.client, which should give you a more consistent API over time and can hide changes in the protocol etc.
Using the client in simple form can be done like this:
>>> from ulif.openoffice.client import PyUNOServerClient
>>> def send_request(ip, port, message):
...     client = PyUNOServerClient(ip, port)
...     result = client.sendRequest(message)
...     ok = result.ok and 'OK' or 'ERR'
...     return '%s %s %s' % (ok, result.status, result.message)
The client returns response objects, which always contain:
- ok
a boolean flag indicating whether the request succeeded
- status
a number indicating the response status. Status codes are generally modelled on HTTP status codes, so 200 means ‘okay’ while other numbers usually indicate some problem in processing the request (the TEST command shown below is an exception: it reports 0 on success).
- message
Any readable output returned by the server. This includes paths or more verbose error messages in case of errors.
Commands sent always have to be closed by newlines:
>>> command = 'TEST\n'
As the default port is 2009, we can call the client like this:
>>> print send_request('127.0.0.1', 2009, command)
OK 0 ...
The response tells us that
the request could be handled (‘OK’),
the status is zero (=no problems),
the version number of the server (‘0.2.1dev’ or similar).
If we send garbage, we get an error:
>>> command = 'Blah\n'
>>> print send_request('127.0.0.1', 2009, command)
ERR 550 unknown command. Use CONVERT_HTML, CONVERT_PDF or TEST.
Here the server tells us, that
the request could not be handled (‘ERR’)
the status is 550
a hint, what commands we can use to talk to it.
As we can see, the server normally uses HTTP status codes. This should also make a later switch to HTTP simple.
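The text above mentions that a few lines of socket code are enough for a custom client. A minimal Python 3 sketch of such a client follows. It rests on assumptions that this document does not confirm: that the server closes the connection after replying, and that commands are plain text. The function name send_raw_command is hypothetical, and the reply is returned as raw bytes because the wire format of responses is not described here.

```python
import socket


def send_raw_command(host, port, command, timeout=10.0):
    """Send one newline-terminated command and return the raw reply bytes.

    Assumption: the server closes the connection after replying, so we
    can simply read until the end of the stream.
    """
    if not command.endswith("\n"):
        command += "\n"  # commands always have to end with a newline
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("utf-8"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break  # server closed the connection
            chunks.append(chunk)
    return b"".join(chunks)
```

With a running pyuno daemon, send_raw_command('127.0.0.1', 2009, 'TEST') would return whatever the server sends back. For anything beyond experiments, the ready-made PyUNOServerClient remains the safer choice.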
Before we go on, we have to give the server time to start up:
>>> import time
>>> time.sleep(3)
Convert to PDF via the conversion daemon
Finally let’s start a real conversion. We have a simple .doc document we’d like to have as PDF. The document is located here:
>>> import os
>>> import shutil
>>> import ulif.openoffice
>>> src_path = os.path.dirname(ulif.openoffice.__file__)
>>> src_path = os.path.join(
...     src_path, 'tests', 'input', 'simpledoc1.doc')
>>> dst_path = os.path.join('home', 'simpledoc1.doc')
>>> shutil.copyfile(src_path, dst_path)
>>> testdoc_path = os.path.abspath(dst_path)
We tell the machinery to convert to PDF/A by sending the following lines:
CONVERT_PDF
PATH=<path-to-source-document>
We start the conversion:
>>> command = ('CONVERT_PDF\nPATH=%s\n' % testdoc_path)
>>> result = send_request('127.0.0.1', 2009, command)
>>> print result
OK 200 /tmp/.../simpledoc1.pdf
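The command strings and the summary strings returned by send_request() follow a simple pattern, which the following Python 3 sketch makes explicit. Both helper names (build_command, parse_summary) are hypothetical. The command names come from the server's error message quoted earlier, and the summary format is the one produced by the send_request() helper defined in this document, not necessarily the wire format.

```python
KNOWN_COMMANDS = ("CONVERT_HTML", "CONVERT_PDF", "TEST")


def build_command(name, path=None):
    """Return a newline-terminated command, optionally with a PATH line."""
    if name not in KNOWN_COMMANDS:
        raise ValueError("unknown command: %r" % name)
    lines = [name]
    if path is not None:
        lines.append("PATH=%s" % path)
    return "\n".join(lines) + "\n"


def parse_summary(summary):
    """Split a string like 'OK 200 /some/path' into (ok, status, message)."""
    flag, status, message = summary.split(" ", 2)
    return flag == "OK", int(status), message
```

For example, build_command('CONVERT_PDF', '/tmp/a.doc') yields 'CONVERT_PDF\nPATH=/tmp/a.doc\n', the same shape as the command built by hand above.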
We can also use the client component to convert to PDF:
>>> from ulif.openoffice.client import PyUNOServerClient
>>> client = PyUNOServerClient()
>>> response = client.convertFileToPDF(testdoc_path)
The response will contain a status (HTTP equivalent number), a boolean flag indicating whether conversion was performed successfully and a message, which in case of success contains the path of the generated document:
>>> response.status
200
>>> response.ok
True
>>> response.message
'/tmp/.../simpledoc1.pdf'
Result directories returned by the client are always temporary directories which can be used by the caller.
Instead of giving a path, we can also use the client with a filename parameter and the contents of the file to be converted. For this, we use the client’s convertToPDF method. This consumes slightly more time than the method above:
>>> contents = open(testdoc_path, 'rb').read()
>>> response = client.convertToPDF(
...     os.path.basename(testdoc_path), contents)
Again, the message attribute of the response tells us where the generated doc can be found:
>>> response.message
'/.../simpledoc1.pdf'
This time the document was created inside a temporary directory, created only for this request. You should not make assumptions about this location.
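Since the result lives in a per-request temporary directory, a caller will usually copy it somewhere permanent and then remove the temporary directory. A minimal Python 3 sketch of that step is shown below. The helper name keep_result and the target directory are assumptions for illustration, and the code relies on the result file sitting in a directory created only for this request, as described above.

```python
import os
import shutil
import tempfile


def keep_result(result_path, target_dir):
    """Copy a converted file to target_dir and remove its temporary directory.

    Only use this for results that live in a directory created just for
    one request, because that whole directory is deleted afterwards.
    """
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(result_path))
    shutil.copyfile(result_path, target)
    shutil.rmtree(os.path.dirname(result_path))
    return target


if __name__ == "__main__":
    # Simulate a conversion result in a per-request temporary directory.
    request_dir = tempfile.mkdtemp()
    result = os.path.join(request_dir, "simpledoc1.pdf")
    with open(result, "wb") as handle:
        handle.write(b"%PDF-demo")
    print(keep_result(result, os.path.join(tempfile.mkdtemp(), "archive")))
```

In real use, result_path would be the value of response.message.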
Convert to HTML via the conversion daemon
Next we convert the same simple .doc document to HTML.
We tell the machinery to convert to HTML by sending the following lines:
CONVERT_HTML
PATH=<path-to-source-document>
We start the conversion:
>>> command = ('CONVERT_HTML\nPATH=%s\n' % testdoc_path)
>>> result = send_request('127.0.0.1', 2009, command)
>>> print result
OK 200 /tmp/.../simpledoc1.html
>>> result_dir = result.split(' ')[-1]
>>> shutil.rmtree(os.path.dirname(result_dir))
We can also use the client component to convert to HTML:
>>> from ulif.openoffice.client import PyUNOServerClient
>>> client = PyUNOServerClient()
>>> filecontent = open(testdoc_path, 'rb').read()
>>> response = client.convertFileToHTML(testdoc_path)
The response will contain a status (HTTP equivalent number), a boolean flag indicating whether conversion was performed successfully and a message, which in case of success contains the path of the generated HTML document. All embedded files that belong to that document are stored in the same directory as the HTML file:
>>> response.status
200
>>> response.ok
True
>>> response.message
'/tmp/.../simpledoc1.html'
Instead of giving a path, we can also use the client with a filename parameter and the contents of the file to be converted. For this, we use the client’s convertToHTML method. This consumes slightly more time than the method above:
>>> contents = open(testdoc_path, 'rb').read()
>>> response = client.convertToHTML(
...     os.path.basename(testdoc_path), contents)
Again, the message attribute of the response tells us where the generated doc can be found:
>>> response.message
'/.../simpledoc1.html'
This time the document was created inside a temporary directory, created only for this request. You should not make assumptions about this location. All accompanied documents like images, etc. are stored in the same directory.
We must remove the result directory ourselves:
>>> import shutil
>>> result_dir = os.path.dirname(response.message)
>>> if os.path.isdir(result_dir):
...     shutil.rmtree(result_dir)
When we send the same file with a different name, we will get a cached copy but with the name of the new source applied:
>>> new_testdoc_path = os.path.join(
...     os.path.dirname(testdoc_path), 'newdoc1.doc')
>>> shutil.copyfile(testdoc_path, new_testdoc_path)
>>> response = client.convertToHTML(
...     os.path.basename(new_testdoc_path), contents)
>>> response.message
'/tmp/.../newdoc1.html'
>>> ls(os.path.dirname(response.message))
- newdoc1.html
We must remove the result directory ourselves:
>>> import shutil
>>> result_dir = os.path.dirname(response.message)
>>> if os.path.isdir(result_dir):
...     shutil.rmtree(result_dir)
Note that the user that runs the OO.org server needs a valid home directory where OOo stores its data. In the test setup we create such a home in the home directory:
>>> print "HOMEDIR>\n", ls('home')
HOMEDIR...
d .openoffice.or...
d .pyunocache
- newdoc1.doc
- simpledoc1.doc
...
The home also contains the cache dir for the PyUNOServer.
Shut down the pyuno daemon:
>>> print system(join('bin', 'pyunoctl') + ' stop')
stopping pid ... done.
<BLANKLINE>
pyunoctl – RESTful mode
This script starts a server in background that allows conversion of documents using the pyUNO API. It requires a running OO.org server in background (see above).
Apart from usage in standard raw mode, pyunoctl can also be started as a RESTful HTTP daemon. This enables usage from remote, as all communication is done using the HTTP protocol (including sending and receiving files).
The RESTful HTTP mode can be enabled by setting the:
--mode=rest
option of pyunoctl.
We start pyunoctl in RESTful mode. The OOo daemon was already started before.
>>> print system(join('bin', 'pyunoctl') + ' --stdout=/tmp/out '
...              + '--mode=rest start')
startung RESTful HTTP server, going into background...
started with pid ...
<BLANKLINE>
We send a simple test request, that should give us a status:
>>> import httplib
>>> conn = httplib.HTTPConnection('localhost', 2009)
>>> conn.request('GET', '/TEST')
>>> r1 = conn.getresponse()
>>> r1.status, r1.reason
(200, 'OK')
>>> print r1.read()
Server: ulif.openoffice.RESTfulHTTPServer/<VERSION> Python/2.5.2
Date: ...
Content-Length:: ...
<BLANKLINE>
ulif.openoffice.RESTful.HTTPServer <VERSION>
We GET documents from the server by asking for an existing MD5sum. The MD5 sum of a document is also its resource name on the server. If a document does not exist, we get a 404 error:
>>> conn.request('GET', '/non-existing-url')
>>> r1 = conn.getresponse()
>>> r1.status, r1.reason
(404, 'Not Found')
>>> conn.close()
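Since the MD5 sum of a document doubles as its resource name, a client can compute the name locally before asking the server. The Python 3 sketch below shows the idea. It assumes that the resource name is the hexadecimal digest placed directly after the leading slash, which this document does not spell out, and the helper name resource_path is hypothetical.

```python
import hashlib


def resource_path(document_bytes):
    """Return '/<md5-hexdigest>' for the given document contents.

    Assumption: the server uses the hex digest as the resource name.
    """
    return "/" + hashlib.md5(document_bytes).hexdigest()
```

The returned string could then be used as the URL in a GET request like the ones shown above.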
We ask for a conversion (creating a resource) simply by POSTing a document. As creating a POST request is a bit more complex, we use utility functions from the util module:
>>> from ulif.openoffice.util import encode_multipart_formdata
>>> fields = []
>>> files = [('document', 'simpledoc1.doc',
...           open(testdoc_path, 'rb').read())]
>>> content_type, body = encode_multipart_formdata(fields, files)
>>> headers = {
...     'User-Agent': 'Test-Agent',
...     'Content-Type': content_type
...     }
The content type we use has to be multipart/form-data (instead of application/x-www-form-urlencoded):
>>> content_type
'multipart/form-data; boundary=---...'
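To show what such a multipart/form-data body looks like, here is a small stand-alone encoder in Python 3. It is a sketch of the general technique, not the code of ulif.openoffice.util.encode_multipart_formdata: the function name encode_multipart, the boundary format and the fixed application/octet-stream part type are assumptions.

```python
import uuid


def encode_multipart(fields, files):
    """Encode form fields and files as multipart/form-data.

    fields: sequence of (name, value) text pairs.
    files: sequence of (name, filename, content_bytes) triples.
    Returns (content_type, body_bytes).
    """
    boundary = "----------" + uuid.uuid4().hex
    delimiter = ("--" + boundary).encode("ascii")
    lines = []
    for name, value in fields:
        lines.append(delimiter)
        lines.append(
            ('Content-Disposition: form-data; name="%s"' % name).encode("utf-8"))
        lines.append(b"")  # empty line separates part headers from content
        lines.append(value.encode("utf-8"))
    for name, filename, content in files:
        lines.append(delimiter)
        lines.append(
            ('Content-Disposition: form-data; name="%s"; filename="%s"'
             % (name, filename)).encode("utf-8"))
        lines.append(b"Content-Type: application/octet-stream")
        lines.append(b"")
        lines.append(content)
    lines.append(delimiter + b"--")  # closing delimiter
    lines.append(b"")
    body = b"\r\n".join(lines)
    return "multipart/form-data; boundary=" + boundary, body
```

The returned content type carries the boundary string, which is why it must be passed on unchanged in the Content-Type header of the request.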
Now we can trigger the POST request:
>>> conn = httplib.HTTPConnection('localhost', 2009)
>>> conn.request('POST', '/', body, headers)
>>> r1 = conn.getresponse()
>>> r1.status, r1.reason
(200, 'OK')
>>> print r1.read()
Client: ('127.0.0.1', ...)
Path: /
Form data:
Uploaded document (name=simpledoc1.doc; 64512 bytes)
Shut down the pyuno daemon:
>>> print system(join('bin', 'pyunoctl') + ' stop')
stopping pid ... done.
<BLANKLINE>
Shut down the oooctl daemon:
>>> print system(join('bin', 'oooctl') + ' stop')
stopping pid ... done.
<BLANKLINE>
Clean up:
>>> os.close(tmp_fd)
>>> os.unlink(tmp_path)
CHANGES
0.3 (2010-11-17)
Added option to disable caching completely: set --cache-dir to empty string to disable caching [Thanks to Adam Groszer for patches!]
Removed unwanted output when running in foreground mode.
Cachemanager now supports listing all sources contained in cache dir.
Fixed bug in cachemanager: under rare circumstances (two different input files with the same MD5 hash digest and identical file stats) two files were considered identical by the cachemanager, which led to inconsistencies in the cache. We now check thoroughly whether two such files differ.
Lots of test fixes [Thanks to Adam Groszer for patches!]
0.2.1 (2010-06-13)
Fixed fix to cope with pyuno monkey-patching standard __import__ function. More recent pyuno versions do not do that kind of stuff any more (which is an improvement).
Fixed foreground start of oooctl server. It didn’t work correctly with more recent OpenOffice.org/pyuno installs. You now don’t have to press CTRL-C two times anymore when trying to stop an oooctl server running in foreground.
0.2 (2010-05-20)
Added license and copyright file to comply with policy of major Linux distributors.
Added sphinx docs.
Fixed wrong result path when returning cached HTML results.
Added mode fg for oooctl. Using oooctl fg one can start oooctl in foreground now.
Added mode fg for pyunoctl. Using pyunoctl fg one can start pyunoctl in foreground now.
Added state check for oooctl: when OpenOffice.org server is down during runtime it is restarted automatically. The check happens every second.
Use standard lib doctest instead of zope.testing.doctest.
Changed PDF creation: by default now normal PDF (and not PDF/A) is created when converting to PDF. This is due to an endianness bug in many recent OpenOffice.org binaries running on 64-bit platforms.
0.1 (2010-03-02)
Initial implementation.