A fast parser for reStructuredText
rst_fast_parse
A fast, spec compliant*, concrete syntax parser for reStructuredText.
In development, use at your own risk
Features:
- Fault tolerant parsing; designed to never raise an exception
- Concrete syntax tokens with full source mapping
- Diagnostics for common issues
- No required dependencies
- Functional parsing design, with no modifiable global state (thread safe)
- Fully typed with "strict" mypy settings
This parser is NOT intended to be a full replacement for the docutils/sphinx rST parser. The initial goal is to parse the "outline" of a reStructuredText document, without necessarily knowing the full details of every role and directive, into a structure that can serve as a foundation for tools such as linters, formatters and Language Servers, without having to wait for a full sphinx build.
Incremental parsing and formatting are also planned.
* spec compliant for all rST syntaxes (tested extensively against docutils), but no spec exists for all directive/role content, due to their highly dynamic nature.
Usage
To parse a string, use the parse_string function.
from rst_fast_parse import parse_string
nodes, diagnostics = parse_string("""
Title
-----
hallo
there *world!*
""",
inline_sourcemaps=True
)
assert nodes.debug_repr() == """\
<title style='-'> 1-2
<inline> 1-1
<text> 1:0-1:5
<paragraph> 3-4
<inline> 3-4
<text> 3:0-4:6
<emphasis> 4:6-4:14
"""
Improving performance
If only block-level parsing is required, pass parse_inlines=False for a reasonable speed-up.
from rst_fast_parse import parse_string
nodes, diagnostics = parse_string("""
Hello
-----
*world!*
""",
parse_inlines=False)
assert nodes.debug_repr() == """\
<title style='-'> 1-2
<inline> 1-1
<paragraph> 3-3
<inline> 3-3
"""
The inline_sourcemaps option, which computes and adds source mappings to inline nodes, is also disabled by default, since it has a performance impact of its own.
For comparison, parsing the reStructuredText specification file (>3000 lines) currently takes:
- 25ms with parse_inlines=False
- 35ms with parse_inlines=True
- 44ms with parse_inlines=True, inline_sourcemaps=True
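To reproduce this kind of comparison on your own documents, a small timing helper is usually enough. The following is a minimal sketch using only the standard library; best_time_ms is a hypothetical helper name, not part of rst_fast_parse, and the workload shown is a stand-in so that the snippet runs on its own.

```python
import time
from typing import Callable


def best_time_ms(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time of `func` over `repeats` runs, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best * 1000.0


# Stand-in workload so the sketch runs without any input file.
sample_ms = best_time_ms(lambda: sum(range(10_000)))
print(f"{sample_ms:.3f} ms")
```

To time the parser itself, the stand-in could be swapped for something like lambda: parse_string(text, parse_inlines=False), where text holds the document to parse. Taking the best of several runs reduces noise from other processes.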
Nesting sections
The parser does not automatically nest sections, based on title underline/overline styles, like docutils, since this is not generally needed for linting or formatting tools, and will allow for incremental parsing.
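For intuition about what nesting by title style involves, here is a minimal, self-contained sketch. It is not the library's implementation: Section and nest_by_style are hypothetical names, and the input is a plain list of (title, style) pairs rather than parsed nodes. It follows the docutils convention that the first style encountered is the top level, the second style the next level down, and so on.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    title: str
    style: str
    children: List["Section"] = field(default_factory=list)


def nest_by_style(titles: List[Tuple[str, str]]) -> List[Section]:
    """Nest a flat list of (title, style) pairs by order of first style appearance."""
    levels: List[str] = []     # styles, in the order they were first seen
    roots: List[Section] = []  # top-level sections
    stack: List[Section] = []  # currently open sections, outermost first
    for title, style in titles:
        if style not in levels:
            levels.append(style)
        depth = levels.index(style)  # 0-based nesting depth for this style
        del stack[depth:]            # close sections at this depth or deeper
        node = Section(title, style)
        # Attach to the innermost open section, or to the root list if none is open.
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


tree = nest_by_style([("Header 1", "="), ("Header 1.1", "-"), ("Header 2", "=")])
print([s.title for s in tree])              # ['Header 1', 'Header 2']
print([c.title for c in tree[0].children])  # ['Header 1.1']
```

Note that this sketch silently attaches a title that skips a level to the nearest open section, whereas a real tool would likely report it.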
If you wish to nest sections, you can use the nest_sections function:
from rst_fast_parse import parse_string, nest_sections
nodes, diagnostics = parse_string("""
Header 1
========
Header 1.1
----------
""")
nodes = nest_sections(nodes)
assert nodes.debug_repr() == """\
<section> 1-4
<title style='='> 1-2
<inline> 1-1
<text>
<section> 3-4
<title style='-'> 3-4
<inline> 3-3
<text>
"""
Directive parsing
Due to the highly dynamic nature of directives, and their tight coupling to docutils/sphinx, the parser does not attempt to parse all directives.
Instead, there is a default mapping from standard directive names to simple declarative definitions. These definitions can be modified and passed to the parser as needed:
from rst_fast_parse import parse_string, get_default_directives
print(get_default_directives())
nodes, diagnostics = parse_string("""
.. note:: This is a note
   :class: my-note
""",
    directives={
        'note': {
            "argument": False,  # can have an argument
            "options": True,  # can have an options block
            "content": True,  # can have a content block
            "parse_content": True,  # parse content as rST
        }
    })
assert nodes.debug_repr() == """\
<directive name='note'> 1-2
<options>
<option name='class'> 2-2
<body>
<paragraph> 1-1
<inline> 1-1
<text>
"""
Diagnostics
Diagnostics are returned for any known issues found during parsing.
from rst_fast_parse import parse_string
nodes, diagnostics = parse_string("""
- list `no role name`
no blank line
""")
assert nodes.debug_repr() == """\
<bullet_list symbol='-'> 1-1
<list_item> 1-1
<paragraph> 1-1
<inline> 1-1
<text>
<role>
<paragraph> 2-2
<inline> 2-2
<text>
"""
assert [d.as_dict() for d in diagnostics] == [
{
'code': 'block.blank_line',
'message': 'Blank line expected after Bullet list',
'line_start': 1,
'character_end': 21
},
{
'code': 'inline.role_no_name',
'message': 'Inline role without name.',
'line_start': 1,
'character_start': 7,
'character_end': 21
}
]
Available diagnostic codes:
source.tab_in_line
: Warns on tabs in a line, which can degrade performance of source mapping.
block.blank_line
: Warns on missing blank lines between syntax blocks.
block.title_line
: Warns on issues with title under/over lines.
block.title_disallowed
: Warns on unexpected titles in a context where they are not allowed.
block.paragraph_indentation
: Warns on unexpected indentation of a paragraph line.
block.literal_no_content
: Warns on literal blocks with no content.
block.target_malformed
: Warns on malformed hyperlink targets.
block.substitution_malformed
: Warns on malformed substitution definitions.
block.table_malformed
: Warns on malformed tables.
block.inconsistent_title_level
: Warns on inconsistent title levels, e.g. a level 1 title style followed by a level 3 style.
block.directive_indented_options
: Warns if the second line of a directive starts with an indented ":".
block.directive_malformed
: Warns on malformed directives.
inline.no_closing_marker
: Warns on inline markup with no closing marker.
inline.role_malformed
: Warns on malformed inline roles.
inline.role_no_name
: Warns on inline roles with no name.
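Since every code is prefixed with a category (source, block or inline), diagnostics can be grouped or filtered by that prefix. The following is a minimal sketch that works on plain dictionaries shaped like the as_dict() output shown above; group_by_category is a hypothetical helper, not part of the library, and the sample data is trimmed to the fields the helper needs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_category(diagnostics: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group diagnostic dictionaries by the part of 'code' before the first dot."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for diag in diagnostics:
        category = diag["code"].split(".", 1)[0]
        grouped[category].append(diag)
    return dict(grouped)


# Sample data modelled on the example output above (only some fields kept).
sample = [
    {"code": "block.blank_line", "message": "Blank line expected after Bullet list", "line_start": 1},
    {"code": "inline.role_no_name", "message": "Inline role without name.", "line_start": 1},
]
grouped = group_by_category(sample)
print(sorted(grouped))  # ['block', 'inline']
```

A linter could use the same idea to let users silence a whole category, for example ignoring everything under inline while only block structure is being checked.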
Walking the node tree
Use the walk_children function to walk a node's (block) children.
A built-in use of this is the walk_line_inside function, which yields all nodes that contain a given line number.
from rst_fast_parse import parse_string
from rst_fast_parse.nodes import walk_line_inside
nodes, diagnostics = parse_string("""
- a
1. content
- b
""")
assert [e.tagname for e in walk_line_inside(nodes, 3)] == [
'bullet_list', 'list_item', 'enum_list', 'list_item', 'paragraph', 'inline'
]
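The idea behind such a walk can be illustrated without the library. The following is a minimal sketch, assuming each node carries an inclusive line range and a list of children; Node and walk_containing are hypothetical names used only for illustration and do not reflect the library's internal classes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    tagname: str
    line_start: int
    line_end: int  # inclusive
    children: List["Node"] = field(default_factory=list)


def walk_containing(nodes: List[Node], line: int) -> Iterator[Node]:
    """Yield, outermost first, every node whose line range contains `line`."""
    for node in nodes:
        if node.line_start <= line <= node.line_end:
            yield node
            # Only descend into nodes that contain the line; others cannot match.
            yield from walk_containing(node.children, line)


tree = [
    Node("bullet_list", 1, 3, [
        Node("list_item", 1, 2, [Node("paragraph", 1, 1), Node("paragraph", 2, 2)]),
        Node("list_item", 3, 3, [Node("paragraph", 3, 3)]),
    ]),
]
print([n.tagname for n in walk_containing(tree, 3)])
# ['bullet_list', 'list_item', 'paragraph']
```

This kind of lookup is what a Language Server typically needs to map a cursor position to the syntax nodes around it.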
Command line usage
There is also a simple CLI for linting reStructuredText stdin/files:
$ echo "- a\n1. *b" | python -m rst_fast_parse.cli.lint --print-ast --ast-maps -
<bullet_list symbol='-'> 0-0
<list_item> 0-0
<paragraph> 0-0
<inline> 0-0
<text> 0:2-0:3
<enum_list ptype='period' etype='arabic'> 1-1
<list_item> 1-1
<paragraph> 1-1
<inline> 1-1
<problematic> 1:3-1:4
<text> 1:4-1:5
<stdin>:1:1: Blank line expected after Bullet list [block.blank_line]
<stdin>:2:4: Inline emphasis no closing marker. [inline.no_closing_marker]
Found 2 error.
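In the output above, the AST positions are 0-based (the problematic node sits at 1:3) while the reported location is 1-based (2:4). The following is a minimal sketch of that conversion for a diagnostic dictionary; format_diagnostic is a hypothetical helper, and the assumption that a missing character_start means column 0 is mine rather than documented behaviour.

```python
def format_diagnostic(path: str, diag: dict) -> str:
    """Render a diagnostic as 'path:line:column: message [code]' with 1-based positions."""
    line = diag["line_start"] + 1
    # Assumption: treat a missing character_start as the start of the line.
    column = diag.get("character_start", 0) + 1
    return f"{path}:{line}:{column}: {diag['message']} [{diag['code']}]"


diag = {
    "code": "inline.no_closing_marker",
    "message": "Inline emphasis no closing marker.",
    "line_start": 1,
    "character_start": 3,
}
print(format_diagnostic("<stdin>", diag))
# <stdin>:2:4: Inline emphasis no closing marker. [inline.no_closing_marker]
```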
Design decisions
The parser does not automatically nest sections, based on title underline styles, like docutils. This allows for incremental parsing, as well as a simpler design.
We want to try to avoid any user-defined "dynamic" code execution, e.g. for parsing directive content, since this limits the future ability to convert the codebase to a different language, to configure using a declarative format, or to run in a sandboxed environment.
Licensing
For now the project is under a fairly strict license, and the distributed code is relatively obscured.
This is to mitigate "bad faith" copying of the codebase, especially whilst in development, which unfortunately has happened to me in the past 😒
Changelog
0.0.16
- 🎉 Add inline parsing
- 🎉 Add character-level source mappings for diagnostics
- Refactor elements to nodes
0.0.15
- 🎉 Add directive parsing
- Replace ElementProtocol.line_inside with walk_line_inside function.
- Replace ElementList with RootElement
- Add InlineElement, ParagraphElement, BulletListElement, EnumListElement, FieldListElement, FieldItemElement, DefinitionListElement, DefinitionItemElement