
Helpers to bridge different Python envs and OpenOffice.org.

ulif.openoffice

Bridging Python and OpenOffice.org.

This package provides tools like daemons and converters to ease access to OpenOffice.org installations for Python programmers.

The main purpose of the package is to support converting office documents from Python using OpenOffice.org, without requiring PyUNO support in the Python binary that actually runs your application (Plone, for instance).

The complete documentation of the most recent release can be found at

http://packages.python.org/ulif.openoffice/

What does ulif.openoffice provide?

  • oooctl

    A commandline script to start/stop OpenOffice.org as a daemon (without X). While OOo brings this functionality out-of-the-box, the daemon also monitors the status of the OOo server and restarts it if necessary.

  • pyunoctl

    A commandline script to start/stop a converter daemon that listens for requests to convert office documents from .odt, .doc, .docx, etc. to HTML or PDF using OpenOffice.org. Includes a caching mechanism that holds docs already converted.

  • An API to access any pyunoctl daemon programmatically using Python.

Introduction

What ulif.openoffice is

ulif.openoffice is a Python package to support document transformations using OpenOffice.org (OOo).

It provides components to ask a running OOo-server for document conversions from office-type documents like .doc or .odt to HTML or PDF. Using ulif.openoffice you can trigger such conversions via commandline or via a Python-API that also works with Python versions without any PyUNO support.

Furthermore, it provides a caching server that caches every document once it is converted and delivers the cached result when the same document is requested again. Depending on your needs this can speed things up by a factor of 10 or more.
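To picture the idea behind such a cache, here is a minimal in-memory sketch in Python 3. It is not the package's cache manager: the class name `ConversionCache` and the `convert` callable are assumptions made for illustration only. Entries are keyed by the MD5 digest of the source, and the full source is compared on a hit so that two different inputs with the same digest are not confused.

```python
import hashlib


class ConversionCache:
    """Hypothetical in-memory cache for converted documents."""

    def __init__(self, convert):
        self._convert = convert   # callable: source bytes -> result bytes
        self._entries = {}        # MD5 hex digest -> (source, result)
        self.hits = 0             # number of requests served from the cache

    def get(self, source):
        key = hashlib.md5(source).hexdigest()
        entry = self._entries.get(key)
        # Compare the complete source as well, so that a digest collision
        # is never mistaken for a cache hit.
        if entry is not None and entry[0] == source:
            self.hits += 1
            return entry[1]
        result = self._convert(source)
        self._entries[key] = (source, result)
        return result
```

The expensive conversion then runs once per distinct document; every repeated request is answered from the dictionary.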

Sources

ulif.openoffice is hosted on:

http://pypi.python.org/pypi/ulif.openoffice

where you can get the latest released versions.

The subversion repository of the sources is:

https://svn.gnufix.de/repos/main/ulif.openoffice/

Requirements

ulif.openoffice requires some PyUNO-capable Python somewhere to do the actual conversions. It also provides a client-API for Python interpreters that lack that support. Current Debian-based distributions normally offer a package for PyUNO support.

ulif.openoffice is tested on Debian-based systems, most notably Ubuntu, and won’t work on Windows.

The package is designed for server-based deployments. While the OOo-server is running, you cannot use the office suite on your desktop (at least at the time of writing). This is a limitation of OOo itself.

Overview

ulif.openoffice mainly provides three different components:

  • An oooctl server that runs in background, starts a local OOo-server and monitors its status. If the OOo server process dies, it is restarted by oooctl.

  • A pyunoserver, which is a TCP-server that implements its own protocol to listen for conversion requests. When a valid request arrives, it tries to contact a local OOo server to do the conversion.

    The pyunoserver also runs a cache manager that caches already converted documents and delivers them in case the converted version already exists.

    This component needs access to the PyUNO library.

  • A client library to talk to the PyUNO server. This component does not require PyUNO.
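The monitoring done by the first component can be sketched as a small restart loop. This is an illustration in Python 3 under our own assumptions, not the code of oooctl: the function name `supervise` and the `max_restarts` cap are hypothetical (the cap only exists so that the sketch terminates).

```python
import subprocess
import time


def supervise(command, max_restarts=3, poll_interval=1.0):
    """Run command and restart it whenever it exits.

    Returns the total number of starts once the process has died more
    often than max_restarts allows.
    """
    process = subprocess.Popen(command)
    starts = 1
    while True:
        time.sleep(poll_interval)
        if process.poll() is None:
            continue                  # still running, check again later
        if starts > max_restarts:
            return starts             # give up instead of restarting again
        process = subprocess.Popen(command)
        starts += 1
```

A real monitor would normally not cap the number of restarts and would also handle signals and logging.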

The three components play together roughly as shown in the following figure:

overview.png

Fig. 1: Overview of ulif.openoffice components

The blue lines show the way from a source document (in .doc format) to the OpenOffice.org server while the red lines show the way back of the converted document (PDF).

Use of client-API, oooctl server and cache is optional.

All this currently happens on the same machine. There are plans for support of multi-machine scenarios with distributed servers and load-balancing features.

Prerequisites

There are, unfortunately, zillions of reasons why you might not be able to start OpenOffice.org in the background on a system.

The scripts in here were tested with Ubuntu and work.

It is mandatory that the system user running oooctl is a regular user with at least a home directory. OpenOffice.org relies on that directory to store information even in headless mode.

Recent OpenOffice.org versions require no X-server for running.

If you want to use a Ubuntu (or Debian) prepared install of OOo, you must make sure that you have installed the following packages with apt-get:

  • openoffice.org-headless (for Ubuntu < 9.04, not needed for newer)

  • openoffice.org-java-common

additionally to the usual OOo packages, i.e.:

  • openoffice.org (at least for Ubuntu >= 9.04)

  • msttcorefonts

The latter is optional but needed to have the most common fonts used with OpenOffice.org documents available. Without the correct fonts installed, results of document transforms might be poor.

Then you need at least one Python version that supports:

$ python -c "import uno"

without raising any exceptions.
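If several interpreters are installed, the same check can be scripted. The following Python 3 sketch runs the import in a subprocess; the helper name `can_import` is our own and not part of ulif.openoffice.

```python
import subprocess
import sys


def can_import(python_executable, module="uno"):
    """Return True if the given interpreter can import the module."""
    try:
        result = subprocess.run(
            [python_executable, "-c", "import " + module],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False    # interpreter not found or not executable
    return result.returncode == 0


if __name__ == "__main__":
    # Check the interpreter running this script.
    print(can_import(sys.executable))
```

Calling `can_import('/usr/bin/python')` would then tell whether that interpreter is a candidate for running the oooctl and pyuno servers.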

On newer Ubuntu versions you can install:

  • python-uno (if available)

The clients and other software apart from the oooctl-server and the pyuno-server can be run with a different Python version.

If you successfully installed this package on a different system, we’d be glad to hear from you, especially if you could tell us which system packages you used.

Building

Using zc.buildout and SVN checkout

To use buildout with an SVN checkout of the package from somewhere below

https://svn.gnufix.de/repos/main/ulif.openoffice/

first make sure that you entered your UNO-supporting Python version in buildout.cfg. By default it will assume that this is /usr/bin/python.

If:

$ /usr/bin/python -c "import uno"

gives an exception on your system, you must edit buildout.cfg, section [unopython], to tell where the supporting Python can be found.

Then run:

$ python bootstrap/bootstrap.py

with the Python version your client should later run with. This can be the UNO-supporting Python, but it does not have to be.

This way you can, for example, use the client components with Python 2.4 while the ooo-server and pyuno-server will run with Python 2.6.

After running bootstrap.py, do:

$ ./bin/buildout

which will create all scripts in bin/.

Using easy_install

Instead of using zc.buildout you can also use easy_install.

If using easy_install, you might have to install the package twice: once with a Python binary that supports PyUNO and once with the Python binary that will be used by your application.

  1. Make sure you have at least one Python version that supports PyUNO. See Prerequisites above.

  2. For this Python version, install easy_install (only needed if it is not already present, of course):

    $ wget http://peak.telecommunity.com/dist/ez_setup.py
    $ path/to/pyuno/supporting-python ez_setup.py

  3. Install ulif.openoffice for this Python version:

    $ path/to/pyuno/supporting-easy_install ulif.openoffice

  4. Do the same for the Python version used by your application:

    $ path/to/myapp/supporting-python ez_setup.py
    $ path/to/myapp/supporting-easy_install ulif.openoffice

It is generally useful to do this in virtualenv environments.

Using the scripts

There are four main components that come with ulif.openoffice:

  • an oooctl-server that starts OpenOffice.org in background.

  • a pyuno-server that listens for requests to convert docs. This server depends on a running oooctl-server.

  • a client component that can be accessed via API and can talk to the pyuno-server. This way you can convert docs from Python, and the Python version does not have to provide the uno lib.

  • a converter script (also in ./bin) that you can use on the commandline. It depends on a running oooctl server and can convert docs to .txt, .html and .pdf format. It is merely a little test programme that was used during development, but you might have some use for it.

You can start the oooctl-server with:

$ ./bin/oooctl start

Do:

$ ./bin/oooctl --help

to see all options.

You can stop the daemon with:

$ ./bin/oooctl stop

The same applies to the pyuno-server:

$ ./bin/pyunoctl start
$ ./bin/pyunoctl --help
$ ./bin/pyunoctl stop

These commands do what you would expect: they start the server, show all options, and stop the server.

The converter script can be called like this:

$ ./bin/convert sourcefile.doc

to create a sourcefile.txt conversion.

Do:

$ ./bin/convert --pdf sourcefile.doc

to create a PDF of sourcefile.doc, and:

$ ./bin/convert --html sourcefile.doc

to create an HTML version of sourcefile.doc.

For the client API see the .txt files in the source.

Examples

oooctl – the OOo daemon

We can start an OpenOffice.org daemon using the oooctl script. This daemon starts an already installed OpenOffice.org instance as server (without GUI, so it is usable on servers).

The oooctl script is defined in setup.py to be installed as a console script, so if you install ulif.openoffice with easy_install or setup.py, an executable script will be installed in your local bin/ directory.

Here we ‘fake’ this install by using buildout, which will install the script in our test environment.

To do so we create a buildout.cfg file:

>>> write('buildout.cfg',
... '''
... [buildout]
... parts = openoffice-ctl
... offline = true
...
... [unopython]
... executable = /usr/bin/python
...
... [openoffice-ctl]
... recipe = zc.recipe.egg
... eggs = ulif.openoffice
... python = unopython
... ''')

Now we can run buildout to install our script (and other scripts, described below, as well):

>>> print system(join('bin', 'buildout'))
Installing openoffice-ctl.
Generated script '.../bin/pyunoctl'.
Generated script '.../bin/convert'.
Generated script '.../bin/oooctl'.
<BLANKLINE>

The script provides help with the -h switch:

>>> print system(join('bin', 'oooctl') + ' -h')
Usage: oooctl [options] start|fg|stop|restart|status
...

The main actions are to call the script with one of the:

start|fg|stop|restart|status

commands as argument, where fg means: start in foreground. This can be handy, if you want the process to be monitored by third-party tools like some supervisor daemon or similar. In that case the process will not detach from the invoking shell on startup.

We set the oooctl path as a var for further use:

>>> oooctl_bin = join('bin', 'oooctl')

-b – Setting the OpenOffice.org installation path

oooctl needs to know which OOo install should be used and where it lives. We can set the path to the binary using the -b or --binarypath switch of oooctl.

By default this path is set to:

>>> from ulif.openoffice.oooctl import OOO_BINARY
>>> OOO_BINARY
'/usr/lib/openoffice/program/soffice'

which might not be true for your local system.

For our local test we create an executable script which will fake a real OpenOffice.org binary:

>>> import sys
>>> write('fake_soffice',
... '''#!%s
... import sys
... import pprint
... sys.stdout.write("Fake soffice started with these options/args:\\n")
... pprint.pprint(sys.argv)
... sys.stderr.flush()
... sys.stdout.flush()
... while 1:
...     pass
... ''' % sys.executable)

This script will simply loop forever (well, sort of). We determine the exact absolute path of our ‘binary’:

>>> import os
>>> soffice_path = os.path.join(os.getcwd(), 'fake_soffice')

We must make this script executable:

>>> os.chmod('fake_soffice', 0700)

Now we can call the daemon and tell it to start our faked office server:

>>> print system("%s -b %s start" % (join('bin', 'oooctl'), soffice_path))
starting OpenOffice.org server, going into background...
started with pid ...
<BLANKLINE>

We can get the daemon status:

>>> print system(join('bin', 'oooctl') + ' status')
Status: Running (PID ...)
<BLANKLINE>

We can stop the server:

>>> print system(join('bin', 'oooctl') + ' stop')
stopping pid ... done.
<BLANKLINE>

(Re-)Directing the daemon input and output

By default the daemonized programme’s output will be redirected to /dev/null. You can, however, use the --stdout, --stderr and --stdin options to set appropriate log files.

We create a temporary log file:

>>> import tempfile
>>> (tmp_fd, tmp_path) = tempfile.mkstemp()

Now we start the OOo server with the tempfile as logger:

>>> print system(join('bin', 'oooctl') + ' -b %s start' % (
...                                                       soffice_path, )
...                                    + ' --stdout="%s"' % tmp_path)
starting OpenOffice.org server, going into background...
started with pid ...
<BLANKLINE>
>>> print system(join('bin', 'oooctl') + ' stop')
stopping pid ... done.
<BLANKLINE>

In the logfile we can see what arguments and options the daemon used:

>>> cat(tmp_path)
Fake soffice started with these options/args:
['/sample-buildout/fake_soffice',
 '-accept=socket,host=localhost,port=2002;urp;',
 '-headless',
 '-nologo',
 '-nofirststartwizard',
 '-norestore']
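The argument list seen in the log can be reproduced with a small helper. This is only a sketch that mirrors the output above; the function name `build_soffice_command` and its parameters are hypothetical and not taken from the oooctl sources.

```python
def build_soffice_command(binary, host="localhost", port=2002):
    """Return an argument list like the one logged by the fake soffice."""
    accept = "-accept=socket,host=%s,port=%d;urp;" % (host, port)
    return [
        binary,
        accept,                  # listen for UNO connections on host:port
        "-headless",             # run without a GUI
        "-nologo",
        "-nofirststartwizard",
        "-norestore",
    ]
```

Passing such a list to a process launcher avoids shell quoting problems with the semicolons in the accept string.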

pyunoctl – a conversion daemon

This script starts a server in background that allows conversion of documents using the pyUNO API. It requires a running OO.org server in background (see above).

Currently conversion from all OOo readable formats (.doc, .odt, .txt, …) to HTML and PDF/A is supported. This means, if you can load a document with OpenOffice.org, then this daemon can convert it to HTML or PDF/A.

The conversion daemon starts a server in background (unless you specify fg as startmode, which will keep the server attached to the invoking shell) which listens for conversion requests on a TCP port. It then calls OpenOffice.org via the pyUNO-API to perform the conversion and responds with the path of the generated doc (or an error message).

The conversion server is a multithreaded asynchronous TCP daemon. So, several requests can be served at the same time.

The script provides help with the -h switch:

>>> print system(join('bin', 'pyunoctl') + ' -h')
Usage: pyunoctl [options] start|fg|stop|restart|status
...
>>> import os

Before we can really use the daemon, we have to fire up the OOo daemon:

>>> print system(join('bin', 'oooctl') + ' --stdout=/tmp/output start')
starting OpenOffice.org server, going into background...
started with pid ...
<BLANKLINE>

Now, we start the pyuno daemon:

>>> print system(join('bin', 'pyunoctl') + ' --stdout=/tmp/out start')
starting pyUNO conversion server, going into background...
started with pid ...
<BLANKLINE>

Testing the conversion daemon

Once the daemon has started, we can send requests. One of the commands we can send tests the environment, the connection and so on. For this, we need a client that sends commands and parses the responses for us. It is not difficult to write your own client (a few lines of socket code will do), but if you’re writing third-party software you might use the ready-to-use client from ulif.openoffice.client, which should give you a more consistent API over time and can hide changes in the protocol etc.
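For the curious, a hand-written client could look roughly like the following Python 3 sketch. It rests on assumptions that this document does not confirm: that the server accepts a half-closed connection and that it closes the connection after answering. The function names are our own, and the raw reply is returned unparsed because the wire format of responses is not described here.

```python
import socket


def exchange(sock, command):
    """Send one newline-terminated command and read until the peer closes."""
    if not command.endswith("\n"):
        command += "\n"
    sock.sendall(command.encode("utf-8"))
    sock.shutdown(socket.SHUT_WR)    # assumption: signals end of request
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:                # peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", "replace")


def send_raw_command(host, port, command, timeout=10.0):
    """Connect to host:port, send the command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, command)
```

The rest of this section uses the packaged client instead, which parses the reply into a response object for us.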

Using the client in simple form can be done like this:

>>> from ulif.openoffice.client import PyUNOServerClient
>>> def send_request(ip, port, message):
...   client = PyUNOServerClient(ip, port)
...   result = client.sendRequest(message)
...   ok = result.ok and 'OK' or 'ERR'
...   return '%s %s %s' % (ok, result.status, result.message)

The client returns response objects, which always contain:

  • ok

    a boolean flag indicating whether the request succeeded

  • status

    a number indicating the response status. Status values are generally modelled on HTTP status codes, so 200 means ‘okay’ and most other numbers indicate a problem in processing the request. The TEST command is an exception: it reports 0 on success.

  • message

    Any readable output returned by the server. This includes paths or more verbose error messages in case of errors.
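To experiment with code that consumes such responses without a running server, a stand-in object with the same three attributes is enough. The `Response` type below is hypothetical and not the class used by ulif.openoffice.client; the formatter does the same job as the `send_request` helper above but uses a conditional expression (Python 2.5 and later) instead of the `and/or` idiom.

```python
from collections import namedtuple

# Hypothetical stand-in for the response objects returned by the client.
Response = namedtuple("Response", ["ok", "status", "message"])


def format_response(response):
    """Render a response as '<OK|ERR> <status> <message>'."""
    flag = "OK" if response.ok else "ERR"
    return "%s %s %s" % (flag, response.status, response.message)
```

The conditional expression avoids the pitfall of `x and a or b`, which silently yields `b` whenever `a` is falsy.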

Commands sent always have to be closed by newlines:

>>> command = 'TEST\n'

As the default port is 2009, we can call the client like this:

>>> print send_request('127.0.0.1', 2009, command)
OK 0 ...

The response tells us that

  • the request could be handled (‘OK’),

  • the status is zero (=no problems),

  • the message contains the version number of the server (‘0.2.1dev’ or similar).

If we send garbage, we get an error:

>>> command = 'Blah\n'
>>> print send_request('127.0.0.1', 2009, command)
ERR 550 unknown command. Use CONVERT_HTML, CONVERT_PDF or TEST.

Here the server tells us that

  • the request could not be handled (‘ERR’),

  • the status is 550,

  • the message gives a hint about which commands we can use to talk to it.

As we can see, we are normally using HTTP status codes. This is also meant to allow a simple switch to HTTP at some point in the future.

Before we go on, we have to give the server time to start up:

>>> import time
>>> time.sleep(3)

Convert to PDF via the conversion daemon

Finally let’s start a real conversion. We have a simple .doc document we’d like to have as PDF. The document is located here:

>>> import os
>>> import shutil
>>> import ulif.openoffice
>>> src_path = os.path.dirname(ulif.openoffice.__file__)
>>> src_path = os.path.join( src_path,
...                  'tests', 'input', 'simpledoc1.doc')
>>> dst_path = os.path.join('home', 'simpledoc1.doc')
>>> shutil.copyfile(src_path, dst_path)
>>> testdoc_path = os.path.abspath(dst_path)

We tell the machinery to convert to PDF/A by sending the following lines:

CONVERT_PDF
PATH=<path-to-source-document>

We start the conversion:

>>> command = ('CONVERT_PDF\nPATH=%s\n' % testdoc_path)
>>> result = send_request('127.0.0.1', 2009, command)
>>> print result
OK 200 /tmp/.../simpledoc1.pdf
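The request text used here follows a simple pattern: a command line followed by KEY=value lines, each closed by a newline. A small helper can build such requests. The name `build_request` is our own, and PATH is the only parameter this document shows; any other key would be an assumption.

```python
def build_request(command, **params):
    """Build a newline-terminated request such as 'CONVERT_PDF\\nPATH=...\\n'."""
    lines = [command]
    for key, value in params.items():
        value = str(value)
        if "\n" in value:
            # A newline inside a value would start a new protocol line.
            raise ValueError("newline in value for %s" % key)
        lines.append("%s=%s" % (key, value))
    return "\n".join(lines) + "\n"
```

With this helper, `build_request('CONVERT_PDF', PATH=testdoc_path)` yields the same string as the `command` assembled by hand above.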

We can also use the client component to convert to PDF:

>>> from ulif.openoffice.client import PyUNOServerClient
>>> client = PyUNOServerClient()
>>> response = client.convertFileToPDF(testdoc_path)

The response will contain a status (HTTP equivalent number), a boolean flag indicating whether conversion was performed successfully and a message, which in case of success contains the path of the generated document:

>>> response.status
200
>>> response.ok
True
>>> response.message
'/tmp/.../simpledoc1.pdf'

Result directories returned by the client are always temporary directories which can be used by the caller.

Instead of giving a path, we can also use the client with a filename parameter and the contents of the file to be converted. For this, we use the client’s convertToPDF method. This consumes slightly more time than the method above:

>>> contents = open(testdoc_path, 'rb').read()
>>> response = client.convertToPDF(
...              os.path.basename(testdoc_path), contents)

Again, the message attribute of the response tells us, where the generated doc can be found:

>>> response.message
'/.../simpledoc1.pdf'

This time the document was created inside a temporary directory, created only for this request. You should not make assumptions about this location.

Convert to HTML via the conversion daemon

Next, let’s convert to HTML. We reuse the simple .doc document from above, which we’d now like to have as HTML.

We tell the machinery to convert to HTML by sending the following lines:

CONVERT_HTML
PATH=<path-to-source-document>

We start the conversion:

>>> command = ('CONVERT_HTML\nPATH=%s\n' % testdoc_path)
>>> result = send_request('127.0.0.1', 2009, command)
>>> print result
OK 200 /tmp/.../simpledoc1.html
>>> result_dir = result.split(' ')[-1]
>>> shutil.rmtree(os.path.dirname(result_dir))

We can also use the client component to convert to HTML:

>>> from ulif.openoffice.client import PyUNOServerClient
>>> client = PyUNOServerClient()
>>> filecontent = open(testdoc_path, 'rb').read()
>>> response = client.convertFileToHTML(testdoc_path)

The response will contain a status (HTTP equivalent number), a boolean flag indicating whether conversion was performed successfully and a message, which in case of success contains the path of the generated HTML document. All embedded files that belong to that document are stored in the same directory as the HTML file:

>>> response.status
200
>>> response.ok
True
>>> response.message
'/tmp/.../simpledoc1.html'

Instead of giving a path, we can also use the client with a filename parameter and the contents of the file to be converted. For this, we use the client’s convertToHTML method. This consumes slightly more time than the method above:

>>> contents = open(testdoc_path, 'rb').read()
>>> response = client.convertToHTML(
...              os.path.basename(testdoc_path), contents)

Again, the message attribute of the response tells us, where the generated doc can be found:

>>> response.message
'/.../simpledoc1.html'

This time the document was created inside a temporary directory, created only for this request. You should not make assumptions about this location. All accompanying documents like images, etc. are stored in the same directory.

We must remove the result directory ourselves:

>>> import shutil
>>> result_dir = os.path.dirname(response.message)
>>> if os.path.isdir(result_dir):
...   shutil.rmtree(result_dir)
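Since this cleanup is easy to forget, it can be wrapped in a context manager. The following Python 3 sketch uses a helper name of our own (`cleaned_up`) and assumes, as in the example above, that the result file lives in a temporary directory created only for this request. Do not use it for results stored directly in a shared directory, because the whole parent directory is removed.

```python
import os
import shutil
from contextlib import contextmanager


@contextmanager
def cleaned_up(result_path):
    """Yield result_path and remove its parent directory afterwards."""
    try:
        yield result_path
    finally:
        result_dir = os.path.dirname(result_path)
        if os.path.isdir(result_dir):
            shutil.rmtree(result_dir)
```

A caller would wrap the processing of `response.message` in `with cleaned_up(response.message) as path:` and read or copy the result inside the block.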

When we send the same file with a different name, we will get a cached copy but with the name of the new source applied:

>>> new_testdoc_path = os.path.join(
...   os.path.dirname(testdoc_path), 'newdoc1.doc')
>>> shutil.copyfile(testdoc_path, new_testdoc_path)
>>> response = client.convertToHTML(
...              os.path.basename(new_testdoc_path), contents)
>>> response.message
'/tmp/.../newdoc1.html'
>>> ls(os.path.dirname(response.message))
-  newdoc1.html

We must remove the result directory ourselves:

>>> import shutil
>>> result_dir = os.path.dirname(response.message)
>>> if os.path.isdir(result_dir):
...   shutil.rmtree(result_dir)

Note that the user that runs the OO.org server needs a valid home directory where OOo stores data. In the test setup we create such a home in the home directory:

>>> print "HOMEDIR>\n", ls('home')
HOMEDIR...
d  .openoffice.or...
d  .pyunocache
-  newdoc1.doc
-  simpledoc1.doc
...

The home also contains the cache dir for the PyUNOServer.

Shut down the pyuno daemon:

>>> print system(join('bin', 'pyunoctl') + ' stop')
stopping pid ... done.
<BLANKLINE>

pyunoctl – RESTful mode

This script starts a server in background that allows conversion of documents using the pyUNO API. It requires a running OO.org server in background (see above).

Apart from usage in standard raw mode, pyunoctl can also be started as a RESTful HTTP daemon. This enables usage from remote, as all communication is done using the HTTP protocol (including sending and receiving files).

The RESTful HTTP mode can be enabled by setting the:

--mode=rest

option of pyunoctl.

We start pyunoctl in RESTful mode. The OOo daemon was already started before.

>>> print system(join('bin', 'pyunoctl') + ' --stdout=/tmp/out '
...              + '--mode=rest start')
startung RESTful HTTP server, going into background...
started with pid ...
<BLANKLINE>

We send a simple test request, that should give us a status:

>>> import httplib
>>> conn = httplib.HTTPConnection('localhost', 2009)
>>> conn.request('GET', '/TEST')
>>> r1 = conn.getresponse()
>>> r1.status, r1.reason
(200, 'OK')
>>> print r1.read()
Server: ulif.openoffice.RESTfulHTTPServer/<VERSION> Python/2.5.2
Date: ...
Content-Length:: ...
<BLANKLINE>
ulif.openoffice.RESTful.HTTPServer <VERSION>

We GET documents from the server by asking for an existing MD5sum. The MD5 sum of a document is also its resource name on the server. If a document does not exist, we get a 404 error:

>>> conn.request('GET', '/non-existing-url')
>>> r1 = conn.getresponse()
>>> r1.status, r1.reason
(404, 'Not Found')
>>> conn.close()
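To find out under which name a local document would be addressed, we can compute its MD5 sum ourselves. The sketch below (Python 3) only computes the digest; the helper name `md5_of_file` is our own, and how exactly the digest appears in the request URL is not shown in this document.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large documents need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```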

We ask for conversion (creating a resource) simply by POSTing a document. As creating a POST request is a bit more complex, we use utility functions from the util module:

>>> from ulif.openoffice.util import encode_multipart_formdata
>>> fields = []
>>> files = [('document', 'simpledoc1.doc',
...           open(testdoc_path, 'rb').read())]
>>> content_type, body = encode_multipart_formdata(fields, files)
>>> headers = {
...    'User-Agent': 'Test-Agent',
...    'Content-Type': content_type
... }

The content type we use has to be multipart/form-data (instead of application/x-www-form-urlencoded):

>>> content_type
'multipart/form-data; boundary=---...'
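Where ulif.openoffice.util is not available, a helper along the following lines produces a comparable request body. This is a stand-in written for Python 3, with a name of our own (`encode_multipart`); the package's `encode_multipart_formdata` may differ in detail, for instance in how it chooses the boundary or the content type of file parts.

```python
import uuid


def encode_multipart(fields, files):
    """Encode form data as multipart/form-data.

    fields: list of (name, value) string pairs
    files:  list of (name, filename, content_bytes) triples
    Returns (content_type, body_bytes).
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields:
        lines += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"' % name).encode("utf-8"),
            b"",
            value.encode("utf-8"),
        ]
    for name, filename, content in files:
        disposition = 'Content-Disposition: form-data; name="%s"; filename="%s"' % (
            name, filename)
        lines += [
            delimiter,
            disposition.encode("utf-8"),
            b"Content-Type: application/octet-stream",
            b"",
            content,
        ]
    lines += [delimiter + b"--", b""]    # closing delimiter and final CRLF
    body = b"\r\n".join(lines)
    return "multipart/form-data; boundary=%s" % boundary, body
```

The returned content type goes into the Content-Type header and the body is sent as the POST payload.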

Now we can trigger the POST request:

>>> conn = httplib.HTTPConnection('localhost', 2009)
>>> conn.request('POST', '/', body, headers)
>>> r1 = conn.getresponse()
>>> r1.status, r1.reason
(200, 'OK')
>>> print r1.read()
Client: ('127.0.0.1', ...)
Path: /
Form data:
    Uploaded document (name=simpledoc1.doc; 64512 bytes)

Shut down the pyuno daemon:

>>> print system(join('bin', 'pyunoctl') + ' stop')
stopping pid ... done.
<BLANKLINE>

Shut down the oooctl daemon:

>>> print system(join('bin', 'oooctl') + ' stop')
stopping pid ... done.
<BLANKLINE>

Clean up:

>>> os.close(tmp_fd)
>>> os.unlink(tmp_path)

CHANGES

0.3 (2010-11-17)

  • Added option to disable caching completely: set --cache-dir to empty string to disable caching [Thanks to Adam Groszer for patches!]

  • Removed unwanted output when running in foreground mode.

  • Cachemanager now supports listing all sources contained in cache dir.

  • Fixed bug in cachemanager: under rare circumstances (two different input files with the same MD5 hash digest and identical file stats) the files were considered to be identical by the cachemanager, which led to inconsistencies in the cache. We now check thoroughly whether two such files differ.

  • Lots of test fixes [Thanks to Adam Groszer for patches!]

0.2.1 (2010-06-13)

  • Fixed fix to cope with pyuno monkey-patching standard __import__ function. More recent pyuno versions do not do that kind of stuff any more (which is an improvement).

  • Fixed foreground start of oooctl server. It didn’t work correctly with more recent OpenOffice.org/pyuno installs. You now don’t have to press CTRL-C two times anymore when trying to stop an oooctl server running in foreground.

0.2 (2010-05-20)

  • Added license and copyright file to comply with policy of major Linux distributors.

  • Added sphinx docs.

  • Fixed wrong result path when returning cached HTML results.

  • Added mode fg for oooctl. Using oooctl fg one can start oooctl in foreground now.

  • Added mode fg for pyunoctl. Using pyunoctl fg one can start pyunoctl in foreground now.

  • Added state check for oooctl: when OpenOffice.org server is down during runtime it is restarted automatically. The check happens every second.

  • Use standard lib doctest instead of zope.testing.doctest.

  • Changed PDF creation: by default now normal PDF (and not PDF/A) is created when converting to PDF. This is due to an endianness bug in many recent OpenOffice.org binaries running on 64-bit platforms.

0.1 (2010-03-02)

  • Initial implementation.
