aboutsummaryrefslogtreecommitdiff
path: root/src/Text/Pandoc/Readers
AgeCommit message (Collapse)AuthorFilesLines
2021-07-11Improved parsing of raw LaTeX from Text streams (rawLaTeXParser).John MacFarlane2-11/+37
We now use source positions from the token stream to tell us how much of the text stream to consume. Getting this to work required a few other changes to make token source positions accurate. Closes #7434.
2021-07-09RST reader: fix regression with code includes.John MacFarlane1-1/+5
With the recent changes to include infrastructure, included code blocks were getting an extra newline. Closes #7436. Added regression test.
2021-07-06Recognize data-external when reading HTML img tags (#7429)Michael Hoffmann1-8/+3
Preserve all attributes in img tags. If attributes have a `data-` prefix, it will be stripped. In particular, this preserves a `data-external` attribute as an `external` attribute in the pandoc AST.
2021-07-06Markdown reader: don't try to read contents in self-closing HTML tag.John MacFarlane1-1/+4
Previously we had problems parsing raw HTML with self-closing tags like `<col/>`. The problem was that pandoc would look for a closing tag to close the markdown contents, but the closing tag had, in effect, already been parsed by `htmlTag`. This fixes the issue described in <https://groups.google.com/d/msgid/pandoc-discuss/297bc662-7841-4423-bcbb-534e99bbba09n%40googlegroups.com>.
2021-07-06HTML reader: add col, colgroup to 'closes' definitionsJohn MacFarlane1-1/+3
2021-06-22Fix regression with comment-only YAML metadata blocks.John MacFarlane1-0/+3
Closes #7400.
2021-06-21Improve emailAddress in Text.Pandoc.Parsing.John MacFarlane1-1/+21
Previously the parser would accept characters in domains that are illegal in domains, and this sometimes caused it to gobble bits of the following text. Closes #7398. Note that this change, by itself, caused some txt2tag reader tests to fail. txt2tags allows bare email addresses with a following form query. So, in addition to the change to emailAddress, we modify the txt2tags parser so it can still handle these cases.
2021-06-12Docx reader: handle absolute URIs in Relationship Target.John MacFarlane1-5/+11
Closes #7374.
2021-06-05DocBook reader: Add support for danger elementJan Tojnar1-1/+2
Added in DocBook 5.2: - https://github.com/docbook/docbook/pull/64 - https://tdg.docbook.org/tdg/5.2/danger.html
2021-06-01Markdown reader: fix pipe table regression in 2.11.4.John MacFarlane1-1/+1
Previously pipe tables with empty headers (that is, a header line with all empty cells) would be rendered as headerless tables. This broke in 2.11.4. The fix here is to produce an AST with an empty table head when a pipe table has all empty header cells. Closes #7343.
2021-06-01LaTeX reader: don't allow optional * on symbol control sequences.John MacFarlane1-2/+4
Generally we allow optional starred variants of LaTeX commands (since many allow them, and if we don't accept these explicitly, ignoring the star usually gives acceptable results). But we don't want to do this for `\(*\)` and similar cases. Closes #7340.
2021-05-31Fix regression with commonmark/gfm yaml metdata block parsing.John MacFarlane1-5/+5
A regression in 2.14 led to the document body being omitted after YAML metadata in some cases. This is now fixed. Closes #7339.
2021-05-30HTML reader: fix column width regression.John MacFarlane1-1/+1
Column widths specified with a style attribute were off by a factor of 100 in 2.14. Closes #7334.
2021-05-29Markdown reader: in rebasePaths, check for both Windows and PosixJohn MacFarlane1-4/+5
absolute paths. Previously Windows pandoc was treating `/foo/bar.jpg` as non-absolute.
2021-05-29In rebasePath, check for absolute paths two ways.John MacFarlane1-1/+4
isAbsolute from FilePath doesn't return True on Windows for paths beginning with `/`, so we check that separately.
2021-05-28Support `rebase_relative_paths` for commonmark based formats.John MacFarlane1-1/+3
(Including `gfm`.)
2021-05-28Docx reader: Support new table features.Emily Bourke3-49/+163
* Column spans * Row spans - The spec says that if the `val` attribute is ommitted, its value should be assumed to be `continue`, and that its values are restricted to {`restart`, `continue`}. If the value has any other value, I think it seems reasonable to default it to `continue`. It might cause problems if the spec is extended in the future by adding a third possible value, in which case this would probably give incorrect behaviour, and wouldn't error. * Allow multiple header rows * Include table description in simple caption - The table description element is like alt text for a table (along with the table caption element). It seems like we should include this somewhere, but I’m not 100% sure how – I’m pairing it with the simple caption for the moment. (Should it maybe go in the block caption instead?) * Detect table captions - Check for caption paragraph style /and/ either the simple or complex table field. This means the caption detection fails for captions which don’t contain a field, as in an example doc I added as a test. However, I think it’s better to be too conservative: a missed table caption will still show up as a paragraph next to the table, whereas if I incorrectly classify something else as a table caption it could cause havoc by pairing it up with a table it’s not at all related to, or dropping it entirely. * Update tests and add new ones Partially fixes: #6316
2021-05-28Docx reader: Read table column widths.Emily Bourke2-3/+4
2021-05-27rebase_relative_paths: leave empty paths unchanged.John MacFarlane1-1/+1
2021-05-27rebase_relative_paths extension: don't change fragment paths.John MacFarlane1-1/+2
We don't want a pure fragment path to be rewritten, since these are used for cross-referencing.
2021-05-27Modify rebase_reference_links treatment of reference links/images.John MacFarlane1-5/+4
The directory is based on the file containing the link reference, not the file containing the link, if these differ.
2021-05-27Add `rebase_relative_paths` extension.John MacFarlane1-7/+29
- Add manual entry for (non-default) extension `rebase_relative_paths`. - Add constructor `Ext_rebase_relative_paths` to `Extensions` in Text.Pandoc.Extensions [API change]. When enabled, this extension rewrites relative image and link paths by prepending the (relative) directory of the containing file. - Make Markdown reader sensitive to the new extension. - Add tests for #3752. Closes #3752. NB. currently the extension applies to markdown and associated readers but not commonmark/gfm.
2021-05-27LaTeX reader: improve `\def` and implement `\newif`.John MacFarlane2-15/+63
- Improve parsing of `\def` macros. We previously set "verbatim mode" even for parsing the initial `\def`; this caused problems for things like ``` \def\foo{\def\bar{BAR}} \foo \bar ``` - Implement `\newif`. - Add tests.
2021-05-25Allow compilation with base 4.15Albert Krewinkel2-9/+8
2021-05-25Use haddock-library-1.10.0Albert Krewinkel1-1/+2
2021-05-25Jira: add support for "smart" linksAlbert Krewinkel1-0/+2
Support has been added for the new `[alias|https://example.com|smart-card]` syntax.
2021-05-22Handle relative lengths (e.g. `2*`) in HTML column widths.John MacFarlane1-14/+33
See <https://www.w3.org/TR/html4/types.html#h-6.6>. "A relative length has the form "i*", where "i" is an integer. When allotting space among elements competing for that space, user agents allot pixel and percentage lengths first, then divide up remaining available space among relative lengths. Each relative length receives a portion of the available space that is proportional to the integer preceding the "*". The value "*" is equivalent to "1*". Thus, if 60 pixels of space are available after the user agent allots pixel and percentage space, and the competing relative lengths are 1*, 2*, and 3*, the 1* will be alloted 10 pixels, the 2* will be alloted 20 pixels, and the 3* will be alloted 30 pixels." Closes #4063.
2021-05-22Revert "HTML reader: simplify col width parsing"John MacFarlane1-9/+13
This reverts commit f76fe2ab56606528d4710cc6c40bceb5788c3906.
2021-05-22HTML reader: simplify col width parsingAlbert Krewinkel1-13/+9
2021-05-20DocBook reader: ensure that first and last names are separated.John MacFarlane1-6/+14
Closes #6541.
2021-05-20LaTeX reader: More siunitx improvements. Closes #6658.John MacFarlane2-46/+95
There's still one slight divergence from the siunitx behavior: we get 'kg m/A/s' instead of 'kg m/(A s)'. At the moment I'm not going to worry about that.
2021-05-20LaTeX/siunitx: fix parsing of `\cubic` etc. See #6658.John MacFarlane1-35/+50
2021-05-20LaTeX reader sinuitx: fix + sign on ang.John MacFarlane1-3/+6
2021-05-20LaTeX reader siunitx: add leading 0 to numbers starting with .John MacFarlane1-2/+5
2021-05-20LaTeX reader: Fix parsing of `+-` in siunitx numbers.John MacFarlane1-4/+7
See #6658.
2021-05-20LaTeX reader: support `\pm` in `SI{..}`.John MacFarlane1-1/+3
Closes #6620.
2021-05-19LaTeX reader: better support for `\xspace`.John MacFarlane2-14/+19
Previously we only supported it in inline contexts; now we support it in all contexts, including math. Partially addresses #7299.
2021-05-17HTML writer: keep attributes from code nested below pre tag.Albert Krewinkel1-1/+12
If a code block is defined with `<pre><code class="language-x">…</code></pre>`, where the `<pre>` element has no attributes, then the attributes from the `<code>` element are used instead. Any leading `language-` prefix is dropped in the code's *class* attribute are dropped to improve syntax highlighting. Closes: #7221
2021-05-15HTML writer: parse `<header>` as a DivAlbert Krewinkel1-0/+2
HTML5 `<header>` elements are treated like `<div>` elements.
2021-05-14HTML reader: keep h1 tags as normal headers (#7274)Albert Krewinkel1-5/+1
The tags `<title>` and `<h1 class="title">` often contain the same information, so the latter was dropped from the document. However, as this can lead to loss of information, the heading is now always retained. Use `--shift-heading-level-by=-1` to turn the `<h1>` into the document title, or a filter to restore the previous behavior. Closes: #2293
2021-05-14HTML reader: don't fail on unmatched closing "script" tag.Albert Krewinkel1-7/+9
Prevent the reader from crashing if the HTML input contains an unmatched closing `</script>` tag. Fixes: #7282
2021-05-13Implement curly-brace syntax for Markdown citation keys.John MacFarlane2-7/+7
The change provides a way to use citation keys that contain special characters not usable with the standard citation key syntax. Example: `@{foo_bar{x}'}` for the key `foo_bar{x}`. Closes #6026. The change requires adding a new parameter to the `citeKey` parser from Text.Pandoc.Parsing [API change]. Markdown reader: recognize @{..} syntax for citatinos. Markdown writer: use @{..} syntax for citations when needed. Update manual with curly-brace syntax for citations. Closes #6026.
2021-05-12Fix source position reporting for YAML bibliographies.John MacFarlane2-4/+6
Closes #7273.
2021-05-09RST reader: seek include files in the directory...John MacFarlane1-1/+3
...of the file containing the include directive, as RST requires. Closes #6632.
2021-05-09Org reader: Resolve org includes relative to ...John MacFarlane2-2/+5
...the directory containing the file containing the INCLUDE directive. Closes #5501.
2021-05-09RST reader: use `insertIncludedFile` from T.P.Parsing...John MacFarlane1-58/+36
instead of reproducing much of its code.
2021-05-09T.P.Parsing: improve include file functions.John MacFarlane2-3/+3
Remove old `insertIncludedFileF`. [API change] Give `insertIncludedFile` a more general type, allowing it to be used where `insertIncludedFileF` was.
2021-05-09Change reader types, allowing better tracking of source positions.John MacFarlane34-355/+443
Previously, when multiple file arguments were provided, pandoc simply concatenated them and passed the contents to the readers, which took a Text argument. As a result, the readers had no way of knowing which file was the source of any particular bit of text. This meant that we couldn't report accurate source positions on errors or include accurate source positions as attributes in the AST. More seriously, it meant that we couldn't resolve resource paths relative to the files containing them (see e.g. #5501, #6632, #6384, #3752). Add Text.Pandoc.Sources (exported module), with a `Sources` type and a `ToSources` class. A `Sources` wraps a list of `(SourcePos, Text)` pairs. [API change] A parsec `Stream` instance is provided for `Sources`. The module also exports versions of parsec's `satisfy` and other Char parsers that track source positions accurately from a `Sources` stream (or any instance of the new `UpdateSourcePos` class). Text.Pandoc.Parsing now exports these modified Char parsers instead of the ones parsec provides. Modified parsers to use a `Sources` as stream [API change]. The readers that previously took a `Text` argument have been modified to take any instance of `ToSources`. So, they may still be used with a `Text`, but they can also be used with a `Sources` object. In Text.Pandoc.Error, modified the constructor PandocParsecError to take a `Sources` rather than a `Text` as first argument, so parse error locations can be accurately reported. T.P.Error: showPos, do not print "-" as source name.
2021-04-29Docx reader: add handling of vml image objects (jgm#4735) (#7257)mbrackeantidot1-2/+9
They represent images, the same way as other images in vml format.
2021-04-28Smarter smart quotes.John MacFarlane3-47/+12
Treat a leading " with no closing " as a left curly quote. This supports the practice, in fiction, of continuing paragraphs quoting the same speaker without an end quote. It also helps with quotes that break over lines in line blocks. Closes #7216.