aboutsummaryrefslogtreecommitdiff
path: root/src/Text/Pandoc/Readers
AgeCommit message (Collapse)AuthorFilesLines
2021-02-20HTML reader: efficiency improvements.John MacFarlane1-81/+129
Do a lookahead to find the right parser to use. Benchmarks from 34ms to 23ms, with less allocation. Also speeds up the epub reader.
2021-02-18DocBook, JATS, OPML readers: performance optimization.John MacFarlane3-64/+8
With the new XML parser, we can avoid the expensive tree normalization step we used to do. This gives a significant speed boost in docbook and JATS parsing (e.g. 9.7 to 6 ms).
2021-02-18Org reader: fix bug in org-ref citation parsing.Albert Krewinkel1-1/+1
The org-ref syntax allows to list multiple citations separated by comma. This fixes a bug that accepted commas as part of the citation id, so all citation lists were parsed as one single citation. Fixes: #7101
2021-02-17Docx reader: use Map instead of list for Namespaces.John MacFarlane2-20/+20
This gives a speedup of about 5-10%. The reader is now approximately twice as fast as in the last release.
2021-02-16Rename Text.Pandoc.XMLParser -> Text.Pandoc.XML.Light...John MacFarlane15-344/+309
..and add new definitions isomorphic to xml-light's, but with Text instead of String. This allows us to keep most of the code in existing readers that use xml-light, but avoid lots of unnecessary allocation. We also add versions of the functions from xml-light's Text.XML.Light.Output and Text.XML.Light.Proc that operate on our modified XML types, and functions that convert xml-light types to our types (since some of our dependencies, like texmath, use xml-light). Update golden tests for docx and pptx. OOXML test: Use `showContent` instead of `ppContent` in `displayDiff`. Docx: Do a manual traversal to unwrap sdt and smartTag. This is faster, and needed to pass the tests. Benchmarks: A = prior to 8ca191604dcd13af27c11d2da225da646ebce6fc (Feb 8) B = as of 8ca191604dcd13af27c11d2da225da646ebce6fc (Feb 8) C = this commit | Reader | A | B | C | | ------- | ----- | ------ | ----- | | docbook | 18 ms | 12 ms | 10 ms | | opml | 65 ms | 62 ms | 35 ms | | jats | 15 ms | 11 ms | 9 ms | | docx | 72 ms | 69 ms | 44 ms | | odt | 78 ms | 41 ms | 28 ms | | epub | 64 ms | 61 ms | 56 ms | | fb2 | 14 ms | 5 ms | 4 ms |
2021-02-13HTML reader: fix bad handling of empty src attribute in iframe.John MacFarlane1-6/+12
- If src is empty, we simply skip the iframe. - If src is invalid or cannot be fetched, we issue a warning and skip instead of failing with an error. - Closes #7099.
2021-02-13Org: support task_lists extensionAlbert Krewinkel1-2/+39
The tasks lists extension is now supported by the org reader and writer; the extension is turned on by default. Closes: #6336
2021-02-13LaTeX reader: remove unnecessary lineJohn MacFarlane1-1/+0
2021-02-12Avoid an unnecessary withRaw.John MacFarlane1-1/+4
2021-02-12LaTeX reader improvements.John MacFarlane2-22/+68
* Rewrote `withRaw` so it doesn't rely on fragile assumptions about token positions (which break when macros are expanded). This requires the addition of `sEnableWithRaw` and `sRawTokens` in `LaTeXState`, and a new combinator `disablingWithRaw` to disable collecting of raw tokens in certain contexts. * Add `parseFromToks` to T.P.Readers.LaTeX.Parsing. * Fix parsing of single character tokens so it doesn't mess up the new raw token collecting. * These changes slightly increase allocations and have a small performance impact, but it's minor. Closes #7092.
2021-02-11Use getTimestamp instead of getCurrentTime in writers.John MacFarlane1-2/+2
Setting SOURCE_DATE_EPOCH will allow reproducible builds. Partially addresses #7093. This does not suffice to fully enable reproducible in EPUB, since a unique id is being generated for each build.
2021-02-10Add new unexported module T.P.XMLParser.John MacFarlane8-62/+109
This exports functions that uses xml-conduit's parser to produce an xml-light Element or [Content]. This allows existing pandoc code to use a better parser without much modification. The new parser is used in all places where xml-light's parser was previously used. Benchmarks show a significant performance improvement in parsing XML-based formats (especially ODT and FB2). Note that the xml-light types use String, so the conversion from xml-conduit types involves a lot of extra allocation. It would be desirable to avoid that in the future by gradually switching to using xml-conduit directly. This can be done module by module. The new parser also reports errors, which we report when possible. A new constructor PandocXMLError has been added to PandocError in T.P.Error [API change]. Closes #7091, which was the main stimulus. These changes revealed the need for some changes in the tests. The docbook-reader.docbook test lacked definitions for the entities it used; these have been added. And the docx golden tests have been updated, because the new parser does not preserve the order of attributes. Add entity defs to docbook-reader.docbook. Update golden tests for docx.
2021-02-08ODT reader: finer-grained errors on parse failure.John MacFarlane1-21/+18
See #7091.
2021-02-08ODT reader: give more information if zip can't be unpacked.John MacFarlane1-1/+4
2021-02-08DocBook reader: Support informalfigure (#7079)Nils Carlson1-1/+3
Add support for informalfigure.
2021-02-06Markdown reader: improved handling of mmd link attributes in references.John MacFarlane1-0/+2
Previously they only worked for links that had titles. Closes #7080.
2021-01-31RST reader: fix handling of header in CSV tables.John MacFarlane1-4/+5
The interpretation of this line is not affected by the delim option. Closes #7064.
2021-01-26Clean up BibTeX parsing.John MacFarlane1-0/+18
Previously there was a messy code path that gave strange results in some cases, not passing through raw tex but trying to extract a string content. This was an artefact of trying to handle some special bibtex-specific commands in the BibTeX reader. Now we just handle these in the LaTeX reader and simplify parsing in the BibTeX reader. This does mean that more raw tex will be passed through (and currently this is not sensitive to the `raw_tex` extension; this should be fixed). Closes #7049.
2021-01-16Revert "Markdown reader: support GitHub wiki's internal links (#2923) (#6458)"John MacFarlane1-25/+0
This reverts commit 6efd3460a776620fdb93812daa4f6831e6c332ce. Since this extension is designed to be used with GitHub markdown (gfm), we need to implement the parser as a commonmark extension (commonmark-extensions), rather than in pandoc's markdown reader. When that is done, we can add it here.
2021-01-16Markdown reader: support GitHub wiki's internal links (#2923) (#6458)Gautier DI FOLCO1-0/+25
Canges overview: * Add a `Ext_markdown_github_wikilink` constructor to `Extension` [API change]. * Add the parser `githubWikiLink` in `Text.Pandoc.Readers.Markdown` * Add tests.
2021-01-09Org reader: allow multiple pipe chars in todo sequencesAlbert Krewinkel1-4/+10
Additional pipe chars, used to separate "action" state from "no further action" states, are ignored. E.g., for the following sequence, both `DONE` and `FINISHED` are states with no further action required. #+TODO: UNFINISHED | DONE | FINISHED Previously, parsing of the todo sequence failed if multiple pipe chars were included. Closes: #7014
2021-01-08Update copyright notices for 2021 (#7012)Albert Krewinkel35-36/+36
2021-01-04LaTeX reader: handle filecontents environment.John MacFarlane2-6/+28
Closes #7003.
2021-01-03Org reader: mark verbatim code with class "verbatim". (#6998)Dimitri Sabadie1-1/+1
* Replace org-mode’s verbatim from code to codeWith. This adds the `"verbatim"` class so that exporters can apply a specific style on it. For instance, it will be possible for HTML to add a CSS rule for code + verbatim class. * Alter test for org-mode’s verbatim change. See previous commit for further detail on the new implementation.
2021-01-02LaTeX reader: put contents of unknown environments in a Div...John MacFarlane1-1/+1
when `raw_tex` is not enabled. (When `raw_tex` is enabled, the whole environment is parsed as a raw block.) The class name is the name of the environment. Previously, we just included the contents without the surrounding Div, but having a record of the environment's boundaries and name can be useful. Closes #6997.
2021-01-01Org reader: restructure output of captioned code blocksAlbert Krewinkel1-14/+12
The Div wrapper of code blocks with captions now has the class "captioned-content". The caption itself is added as a Plain block inside a Div of class "caption". This makes it easier to write filters which match on captioned code blocks. Existing filters will need to be updated. Closes: #6977
2020-12-30Mediawiki reader: allow space around storng/emph delimiters.John MacFarlane1-6/+4
Closes #6993.
2020-12-30Hlint fixesJohn MacFarlane1-1/+1
2020-12-28HTML reader: use renderTags' from Text.Pandoc.Shared.Albert Krewinkel1-25/+3
The `renderTags'` function was duplicated when the reader used `Text` as its string type. The duplication is no longer necessary. A side effect of this change is that empty `<col>` elements are written as self-closing tags in raw HTML blocks.
2020-12-10HTML reader: pay attention to lang attributes on body.John MacFarlane1-3/+6
These (as well as lang attributes on html) should update lang in metadata. See #6938.
2020-12-10HTML reader: retain attribute prefixes and avoid duplicates.John MacFarlane2-24/+24
Previously we stripped attribute prefixes, reading `xml:lang` as `lang` for example. This resulted in two duplicate `lang` attributes when `xml:lang` and `lang` were both used. This commit causes the prefixes to be retained, and also avoids invald duplicate attributes. Closes #6938.
2020-12-10Add sourcepos extension for commonmarkeJohn MacFarlane1-5/+9
* Add `Ext_sourcepos` constructor for `Extension`. * Add `sourcepos` extension (only for commonmark). * Bump to 2.11.3 With the `sourcepos` extension set set, `data-pos` attributes are added to the AST by the commonmark reader. No other readers are affected. The `data-pos` attributes are put on elements that accept attributes; for other elements, an enlosing Div or Span is added to hold the attributes. Closes #4565.
2020-12-10Commonmark reader: refactor specFor, set input name to "".John MacFarlane1-2/+8
2020-12-07Dokuwiki reader: handle unknown interwiki links better.John MacFarlane1-1/+1
DokuWiki lets the user define his own Interwiki links. Previously pandoc reacted to these by emitting a google search link, which is not helpful. Instead, we now just emit the full URL including the wikilink prefix, e.g. `faquk>FAQ-mathml`. This at least gives users the ability to modify the links using filters. Closes #6932.
2020-12-05Org reader: preserve targets of spurious linksAlbert Krewinkel1-5/+4
Links with (internal) targets that the reader doesn't know about are converted into emphasized text. Information on the link target is now preserved by wrapping the text in a Span of class `spurious-link`, with an attribute `target` set to the link's original target. This allows to recover and fix broken or unknown links with filters. See: #6916
2020-12-05LaTeX reader: don't apply theorem default styling to a figure inside.John MacFarlane1-0/+1
If we put an image in italics, then when rendering to Markdown we no longer get an implicit figure. Closes #6925.
2020-11-29LaTeX reader: don't parse `\rule` with width 0 as horizontal rule.John MacFarlane1-1/+11
2020-11-28Fix a tiny Typo in the CSV reader moduleTassos Manganaris1-1/+1
Header comment in the CSV reader module says "RST" instead of "CSV".
2020-11-27HTML reader tests: improve test coverage of new featuresAlbert Krewinkel1-1/+2
2020-11-27HTML reader: support body headers, row head columnsAlbert Krewinkel1-41/+61
Closes: #6312
2020-11-26LaTeX reader: preserve center environment (#6852)Igor Pashev1-1/+1
The contents of the `center` environment are put in a `Div` with class `center`.
2020-11-26HTML reader: improve support for table headers, footer, attributesAlbert Krewinkel3-102/+223
- `<tfoot>` elements are no longer added to the table body but used as table footer. - Separate `<tbody>` elements are no longer combined into one. - Attributes on `<thead>`, `<tbody>`, `<th>`/`<td>`, and `<tfoot>` elements are preserved.
2020-11-26HTML reader: allow finer grained options for tag omissionAlbert Krewinkel3-13/+26
2020-11-25HTML reader: simplify list attribute handlingAlbert Krewinkel1-18/+9
This removes the `foldOrElse` function from the internal Text.Pandoc.CSS module.
2020-11-24HTML reader: support row or column-spanning table cellsAlbert Krewinkel2-28/+26
2020-11-24HTML reader: support blocks in captionAlbert Krewinkel2-6/+6
2020-11-24HTML reader: extract table parsing into separate moduleAlbert Krewinkel3-95/+140
2020-11-23HTML reader: extract submodulesAlbert Krewinkel4-239/+342
Reducing module size should reduce memory use during compilation. This is preparatory work to tackle support for more table features.
2020-11-22Org reader: parse `#+LANGUAGE` into `lang` metadata fieldAlbert Krewinkel1-0/+2
Fixes: #6845
2020-11-21LaTeX reader: more robust parsing of bracketed options.John MacFarlane1-3/+8
Improves on 9a40976. Closes #6873.