docx (OOXML) to html converter
pydocx is a parser that breaks down the elements of a docx file and converts them into different markup languages. Right now, HTML is supported; Markdown and LaTeX will be available soon. You can extend any of the available parsers to customize it to your needs. You can also create your own class that inherits from DocxParser to support a markup language that is not yet available.
Currently Supported
- tables
  - nested tables
  - rowspans
  - colspans
  - lists in tables
- lists
  - list styles
  - nested lists
  - list of tables
  - list of paragraphs
- justification
- images
- styles
  - bold
  - italics
  - underline
  - hyperlinks
- headings
Usage
DocxParser includes abstract methods that each parser overrides to satisfy its own needs. The abstract methods are as follows:

    from abc import abstractmethod


    class DocxParser:

        @property
        def parsed(self):
            return self._parsed

        # escape takes an argument, so it is a regular abstract method
        # rather than a property
        @abstractmethod
        def escape(self, text):
            return text

        @abstractmethod
        def linebreak(self):
            return ''

        @abstractmethod
        def paragraph(self, text):
            return text

        @abstractmethod
        def heading(self, text, heading_level):
            return text

        @abstractmethod
        def insertion(self, text, author, date):
            return text

        @abstractmethod
        def hyperlink(self, text, href):
            return text

        @abstractmethod
        def image_handler(self, path):
            return path

        @abstractmethod
        def image(self, path, x, y):
            return self.image_handler(path)

        @abstractmethod
        def deletion(self, text, author, date):
            return text

        @abstractmethod
        def bold(self, text):
            return text

        @abstractmethod
        def italics(self, text):
            return text

        @abstractmethod
        def underline(self, text):
            return text

        @abstractmethod
        def superscript(self, text):
            return text

        @abstractmethod
        def subscript(self, text):
            return text

        @abstractmethod
        def tab(self):
            return True

        @abstractmethod
        def ordered_list(self, text):
            return text

        @abstractmethod
        def unordered_list(self, text):
            return text

        @abstractmethod
        def list_element(self, text):
            return text

        @abstractmethod
        def table(self, text):
            return text

        @abstractmethod
        def table_row(self, text):
            return text

        @abstractmethod
        def table_cell(self, text):
            return text

        @abstractmethod
        def page_break(self):
            return True

        @abstractmethod
        def indent(self, text, left='', right='', firstLine=''):
            return text
Docx2Html inherits from DocxParser and implements basic HTML handling. For example:

    import xml.sax.saxutils


    class Docx2Html(DocxParser):

        # Escape '&', '<', and '>' so we render the HTML correctly
        def escape(self, text):
            return xml.sax.saxutils.quoteattr(text)[1:-1]

        # return a line break
        def linebreak(self, pre=None):
            return '<br />'

        # add paragraph tags
        def paragraph(self, text, pre=None):
            return '<p>' + text + '</p>'
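To see what the `quoteattr(text)[1:-1]` idiom in `escape` does, here is a small standalone sketch that uses only the standard library. The function name `escape_text` is a placeholder for illustration and is not part of pydocx.

```python
import xml.sax.saxutils


def escape_text(text):
    # quoteattr() escapes '&', '<' and '>' and wraps the result in quote
    # characters; the [1:-1] slice strips that surrounding pair again.
    return xml.sax.saxutils.quoteattr(text)[1:-1]


escaped = escape_text('<b>Tom & Jerry</b>')
print(escaped)

# quoteattr() also turns tabs, newlines and carriage returns into
# character references, which plain xml.sax.saxutils.escape() does not.
escaped_tab = escape_text('a\tb')
print(escaped_tab)
```

Note that `quoteattr` additionally rewrites whitespace control characters as character references, so the output can differ from a plain `escape` call when the input contains tabs or newlines.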
However, let’s say you want to add a specific style to your HTML document. To do this, you want to give each paragraph the CSS class my_implementation. Simply extend Docx2Html and add what you need.

    class My_Implementation_of_Docx2Html(Docx2Html):

        def paragraph(self, text, pre=None):
            return '<p class="my_implementation">' + text + '</p>'
Or, let’s say FOO is your new favorite markup language. Simply write your own parser, overriding the abstract methods of DocxParser:

    class Docx2Foo(DocxParser):

        # because linebreaks are denoted by '!!!!!!!!!!!!' in the FOO markup language :)
        def linebreak(self):
            return '!!!!!!!!!!!!'
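The override pattern can be tried without a .docx file. The following is a minimal sketch in Python 3, assuming a simplified stand-in base class: `MiniDocxParser` and `MiniDocx2Foo` are hypothetical names, take a list of strings instead of a document, and do not reflect how pydocx itself reads files.

```python
from abc import ABC, abstractmethod


class MiniDocxParser(ABC):
    """Hypothetical stand-in for DocxParser; not the real pydocx class."""

    def __init__(self, paragraphs):
        # Assumption: the sketch receives plain strings rather than a file.
        self._parsed = ''.join(
            self.paragraph(self.escape(text)) + self.linebreak()
            for text in paragraphs
        )

    @property
    def parsed(self):
        return self._parsed

    def escape(self, text):
        return text

    @abstractmethod
    def linebreak(self):
        """Return the markup for a line break."""

    @abstractmethod
    def paragraph(self, text):
        """Wrap text in paragraph markup."""


class MiniDocx2Foo(MiniDocxParser):
    # FOO is the made-up markup language from the text above.
    def linebreak(self):
        return '!!!!!!!!!!!!'

    def paragraph(self, text):
        # The '[[...]]' paragraph syntax is invented for this sketch.
        return '[[' + text + ']]'


foo_output = MiniDocx2Foo(['Hello', 'World']).parsed
print(foo_output)
```

Because the stand-in derives from `ABC`, a subclass that leaves one of the abstract methods unimplemented cannot be instantiated, which catches incomplete parsers early.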
Custom Pre-Processor
When creating your own parser (as described above), you can now add your own custom pre-processor. To do so, set the pre_processor_class attribute on the custom parser, like so:

    class Docx2Foo(DocxParser):
        pre_processor_class = FooPrePorcessor
The FooPrePorcessor will need a few things to get you going:
    class FooPrePorcessor(PydocxPrePorcessor):

        def perform_pre_processing(self, root, *args, **kwargs):
            super(FooPrePorcessor, self).perform_pre_processing(root, *args, **kwargs)
            self._set_foo(root)

        def _set_foo(self, root):
            pass

If you want _set_foo to be called, you must call it from perform_pre_processing, which the pydocx base parser invokes.
Everything done during pre-processing is executed prior to parse being called for the first time.
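The ordering described here (base pre-processing, then the custom step, then parsing) can be illustrated with a self-contained Python 3 sketch. All class names below are hypothetical stand-ins, and the way the parser instantiates its pre-processor is an assumption made for the example, not the pydocx implementation.

```python
import xml.etree.ElementTree as ET


class SketchPreProcessor:
    """Hypothetical stand-in for the pydocx base pre-processor."""

    def perform_pre_processing(self, root, *args, **kwargs):
        # Base step (invented for this sketch): mark every element.
        for element in root.iter():
            element.set('visited', 'true')


class SketchFooPreProcessor(SketchPreProcessor):
    def perform_pre_processing(self, root, *args, **kwargs):
        # Run the base step first, then the custom one.
        super().perform_pre_processing(root, *args, **kwargs)
        self._set_foo(root)

    def _set_foo(self, root):
        # Custom step: flag paragraph elements only.
        for element in root.iter('p'):
            element.set('foo', 'yes')


class SketchParser:
    """Hypothetical stand-in for the base parser."""

    pre_processor_class = SketchPreProcessor

    def __init__(self, xml_text):
        root = ET.fromstring(xml_text)
        # Pre-processing finishes before parse() is called for the first time.
        self.pre_processor_class().perform_pre_processing(root)
        self.parsed = self.parse(root)

    def parse(self, root):
        return [(element.tag, dict(element.attrib)) for element in root.iter()]


class SketchDocx2Foo(SketchParser):
    pre_processor_class = SketchFooPreProcessor


result = SketchDocx2Foo('<body><p>a</p><r>b</r></body>').parsed
print(result)
```

By the time `parse` runs, every element already carries the attributes set during pre-processing, so the parser can rely on them.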
Styles
The base parser Docx2Html relies on certain CSS classes being defined for certain behaviour to occur. Currently these include:
class pydocx-insert -> Turns the text green.
class pydocx-delete -> Turns the text red and draws a line through the text.
class pydocx-center -> Aligns the text to the center.
class pydocx-right -> Aligns the text to the right.
class pydocx-left -> Aligns the text to the left.
class pydocx-comment -> Turns the text blue.
class pydocx-underline -> Underlines the text.
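One way to supply these classes is to generate a small stylesheet and include it alongside the converted HTML. The sketch below is an assumption based only on the descriptions in the list above; the exact declarations (for example the specific colour values) and the helper name `build_stylesheet` are illustrative and not shipped with pydocx.

```python
# Declarations inferred from the class descriptions above (assumed values).
PYDOCX_STYLES = {
    'pydocx-insert': {'color': 'green'},
    'pydocx-delete': {'color': 'red', 'text-decoration': 'line-through'},
    'pydocx-center': {'text-align': 'center'},
    'pydocx-right': {'text-align': 'right'},
    'pydocx-left': {'text-align': 'left'},
    'pydocx-comment': {'color': 'blue'},
    'pydocx-underline': {'text-decoration': 'underline'},
}


def build_stylesheet(styles):
    """Render a mapping of class name -> declarations as CSS rules."""
    rules = []
    for class_name, declarations in styles.items():
        body = ' '.join(
            f'{prop}: {value};' for prop, value in declarations.items()
        )
        rules.append(f'.{class_name} {{ {body} }}')
    return '\n'.join(rules)


stylesheet = build_stylesheet(PYDOCX_STYLES)
print(stylesheet)
```

The resulting text can be placed in a `<style>` element or a separate .css file, and the values adjusted to match your own design.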
Optional Arguments
You can pass convert_root_level_upper_roman=True to the parser, and it will convert all root-level upper-roman lists to headings instead.
Changelog
- 0.3.0
  - We switched from using stock xml.etree.ElementTree to using xml.etree.cElementTree. This has resulted in a fairly significant speed increase for Python 2.6.
  - It is now possible to create your own pre-processor to do additional pre-processing.
  - Superscripts and subscripts are now extracted correctly.
- 0.2.1
  - Added a changelog.
  - Added the version in pydocx.__init__.
  - Fixed an issue with duplicating content if there was indentation or justification on a p element that had multiple t tags.