Age | Commit message (Collapse) | Author | Files | Lines |
|
* [Docx Parser] Move style-parsing-specific code to a new module
* [Docx Writer] Re-use Readers.Docx.Parse.Styles for StyleMap
* [Docx Writer] Move Readers.Docx.StyleMap to Writers.Docx.StyleMap
It's never used outside of writer code, so it makes more sense to scope it under writers really.
|
|
Motivating issues: #5523, #5052, #5074
Style name comparisons are case-insensitive, since those are
case-insensitive in Word.
w:styleId will be used as style name if w:name is missing (this should
only happen for malformed docx and is kept as a fallback to avoid
failing altogether on malformed documents)
Block quote detection code moved from Docx.Parser to Readers.Docx
Code styles, i.e. "Source Code" and "Verbatim Char" now honor style
inheritance
Docx Reader now honours "Compact" style (used in Pandoc-generated docx).
The side-effect is that "Compact" style no longer shows up in
docx+styles output. Styles inherited from "Compact" will still
show up.
Removed obsolete list-item style from divsToKeep. That didn't
really do anything for a while now.
Add newtypes to differentiate between style names, ids, and
different style types (that is, paragraph and character styles)
Since docx style names can have spaces in them, and pandoc-markdown
classes can't, anywhere when style name is used as a class name,
spaces are replaced with ASCII dashes `-`.
Get rid of extraneous intermediate types, carrying styleId information.
Instead, styleId is saved with other style data.
Use RunStyle for inline style definitions only (lacking styleId and styleName);
for Character Styles use CharStyle type (which is basicaly RunStyle with styleId
and StyleName bolted onto it).
|
|
Reduce code duplication, remove redundant brackets, use newtype instead of data where appropriate
|
|
Closes #5545.
|
|
The haddock module header contains essentially the
same information, so the boilerplate is redundant and
just one more thing to get out of sync.
|
|
We had previously walked the document to unwrap sdt/sdtContent and
smartTag tags in `word/document.xml`, but not in the
`word/{foot/end}note.xml` and `word/comments.xml`.
Closes #5302
|
|
Some paths in archives are absolute (have an opening slash) which, for
reasons unknown, produces a failure in the test suite on MS
Windows. This fixes that by removing the leading slash if it exists.
Closes #5277 (previously closed with 4cce0ef but reopened due to this bug).
|
|
This reverts commit 2142bbe572cea00b7bb5ad3e10a3afb26845a1f7.
|
|
Try fixing a parsing error on windows by insisting that the parser use
a Posix filepath library for splitting doc paths in a zipfile. (It
might default on Windows to using a backslash as a separator, while
it's always a forward-slash in zip archives.)
|
|
* clarify function name. We had previously used `getDocumentPath`,
but `Document` is an overdetermined term here. Use
`getDocumentXmlPath` to make clear what we're doing.
* Use field notation for setting ReaderEnv. As we've added (and
continue to add) fields, the assignment by position has gotten
harder to read.
* figure out document.xml path once at the beginning of parsing, and
add it to the environment, so we can avoid repeated lookups.
|
|
Getting the location used to depend on a hard-coded .rels file based
on "word/document.xml". We now dynamically detect that file based on
the document.xml file specified in "_rels/.rels"
|
|
The desktop Word program places the main document file in
"word/document.xml", but the online word places it in
"word/document2.xml". This file path is actually stated in the root
"_rels/.rels" file, in the "Relationship" element with an
"http://../officedocument" type.
Closes #5277
|
|
For some reason, Word in Office 365 Online uses `document2.xml`
for the content, instead of `document.xml`. This causes pandoc
not to be able to parse docx.
This quick fix has the parser check for both `document.xml`
and `document2.xml`.
Addresses #5277, but a more robust solution would be to
get the name of the main document dynamically (who knows
whether it might change again?).
|
|
Quite a few modules were missing copyright notices.
This commit adds copyright notices everywhere via haddock module
headers. The old license boilerplate comment is redundant with this and has
been removed.
Update copyright years to 2019.
Closes #4592.
|
|
There can be overrides for the definitions of certain levels in
numbering definitions. This implements that behavior.
Closes: #5134
|
|
|
|
It had previously been an alias for a tuple.
|
|
These are variants for "complex scripts" like Arabic and
are now treated just like b, i (bold, italic).
Colses #4947.
|
|
|
|
|
|
This seems to be necessary if we are to use our custom Prelude
with ghci.
Closes #4464.
|
|
|
|
Make unwrapSDT into a general `unwrap` function that can unwrap both
nested SDT tags and smartTags. This makes the SmartTags constructor in
the Docx type unnecessary, so we remove it.
Closes #4446
|
|
Previously we had only unwrapped one level of sdt tags. Now we recurse
if we find them.
Closes: #4415
|
|
|
|
We introduce a new module, Text.Pandoc.Readers.Docx.Fields which
contains a simple parsec parser. At the moment, only simple hyperlink
fields are accepted, but that can be extended in the future.
|
|
This will allow us to parse instrTxt inside fldChar tags.
|
|
|
|
This will tell us whether a paragraph break was inserted or
deleted. We add a generalized track-changes parsing function, and use
it in `elemToParPart` as well.
|
|
We're going to want to use it elsewhere as well, in upcoming tracking
of paragraph insertion/deletion.
|
|
Previously we had only read the first child of an sdtContents tag. Now
we replace sdt with all children of the sdtContents tag.
This changes the expected test result of our nested_anchors test,
since now we read docx's generated TOCs.
|
|
We walk through the document (using the zipper in
Text.XML.Light.Cursor) to unwrap the sdt tags before doing the rest of
the parsing of the document. Note that the function is generically
named `walkDocument` in case we need to do any further preprocessing
in the future.
Closes #4190
|
|
|
|
|
|
|
|
We used to parse paragraphs styled with "HeadingN" as "nth-level
header." But if a document has a custom style named "Heading0", this
will produce a 0-level header, which shouldn't exist. We only parse
this style if N>0. Otherwise we treat it as a normal style name, and
follow its dependencies, if any.
Closes #3830.
|
|
This gives 20-30% speedup and reduction of memory
usage in most of the writers.
|
|
This follows the suggestions given by the FSF for GPL licensed software.
<https://www.gnu.org/prep/maintain/html_node/Copyright-Notices.html>
|
|
|
|
Previously we didn't recognize math, for example, when
the xmlns declaration occured on the element and not the root.
Now we recognize either.
Closes #3365.
This patch defines findChildByName, findChildrenByName,
and findAttrByName in Util, and uses these in Parse.
|
|
This just parses inside smartTags and yields their contents,
ignoring the attributes of the smartTag. @jkr, you may want
to adjust this, but I wanted to get a fix in as fast as possible
for the dropped content.
Closes #2242; see also #3412.
|
|
|
|
+ Removed Text.Pandoc.Readers.Docx.Fonts
+ Moved its code to texmath; we now use (from texmath 0.9)
Text.TeXMath.Unicode.Fonts
+ Use texmath 0.9 (currently from git).
+ Updated epub tests because texmath now handles more mathml.
|
|
We wrap `[CHART]` in a `<span class="chart">`. Note that it maps to
inlines because, in docx, anything in a drawing tag can be part of a
larger paragraph.
|
|
We not only want "w:drawing", because that could also include
charts. Now we specify "w:drawing"//"pic:pic". This shouldn't change
behavior at all, but it's a first step toward allowing other sorts of
drawing data as well.
|
|
|
|
We use the "description" field as alt text and the "title" field as
title. These can be accessed through the "Format Picture" dialog in
Word.
|
|
Highly influenced by the docx support, refactored
some code to avoid DRY.
|
|
|
|
|