path: root/src/Text/Pandoc/Readers/HTML.hs
Age, Commit message, Author, Files, Lines
2020-03-13Update copyright year (#6186)Albert Krewinkel1-1/+1
* Update copyright year
* Copyright: add notes for Lua and Jira modules
2020-02-13A bit more cleanup (#6141)Joseph C. Sible1-5/+4
* Remove unnecessary fmaps and only do toMilliseconds once
* Share the input tuple instead of making a new one
* Lift return out of if
* Simplify case statements
* Lift DottedNum out of the case statements
* Use st instead of mbs
* Use setState instead of updateState now that we have the whole state around
2020-02-12HTML reader: don't parse `data-id` as `id` attribute.John MacFarlane1-1/+9
And similarly don't parse any `data-X` as `X` when `X` is a valid HTML attribute. Reported in comment on #5415.
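The rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in HTML.hs: the names `knownHtmlAttributes` and `normalizeAttrName` are hypothetical, the attribute list is deliberately abbreviated, and `String` is used instead of `Text` so the sketch needs only `base`.

```haskell
import Data.List (stripPrefix)

-- Hypothetical, abbreviated list of standard HTML attribute names.
-- The real list is much longer.
knownHtmlAttributes :: [String]
knownHtmlAttributes = ["id", "class", "href", "src", "title", "lang"]

-- Strip a leading "data-" from an attribute name, unless what remains
-- is empty or is itself a standard HTML attribute (so "data-id" is
-- never read as "id").
normalizeAttrName :: String -> String
normalizeAttrName name =
  case stripPrefix "data-" name of
    Just rest
      | not (null rest), rest `notElem` knownHtmlAttributes -> rest
    _ -> name
```

Under these assumptions `data-foo` is read as `foo`, while `data-id` keeps its prefix and cannot collide with a real `id` attribute.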
2020-02-07Apply linter suggestions. Add fix_spacing to lint target in Makefile.John MacFarlane1-3/+3
2020-02-03Swap suboptimal uses of maybe and fromMaybe (#6111)Joseph C. Sible1-2/+2
Anywhere "maybe" is used with "id" as its second argument, using "fromMaybe" instead will simplify the code. Conversely, anywhere "fromMaybe" is used with the result of "fmap" or "<$>" as its second argument, using "maybe" instead will simplify the code.
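A small illustration of the two rewrites described above. The function names are hypothetical and are not taken from the pandoc source.

```haskell
import Data.Maybe (fromMaybe)

-- "maybe" with "id" as its second argument ...
titleOrDefault :: Maybe String -> String
titleOrDefault = maybe "untitled" id

-- ... is equivalent to the simpler "fromMaybe".
titleOrDefault' :: Maybe String -> String
titleOrDefault' = fromMaybe "untitled"

-- "fromMaybe" applied to the result of "<$>" ...
lengthOrZero :: Maybe String -> Int
lengthOrZero m = fromMaybe 0 (length <$> m)

-- ... is equivalent to the simpler "maybe".
lengthOrZero' :: Maybe String -> Int
lengthOrZero' = maybe 0 length
```

Each pair behaves the same on both `Nothing` and `Just` inputs; only the shape of the expression changes.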
2019-12-17HTML reader: Add "nav" to list of block-level tags.John MacFarlane1-1/+2
2019-11-12Switch to new pandoc-types and use Text instead of String [API change].despresc1-119/+112
PR #5884.
+ Use pandoc-types 1.20 and texmath 0.12.
+ Text is now used instead of String, with a few exceptions.
+ In the MediaBag module, some of the types using Strings were switched to use FilePath instead (not Text).
+ In the Parsing module, new parsers `manyChar`, `many1Char`, `manyTillChar`, `many1TillChar`, `many1Till`, `manyUntil`, `manyUntilChar` have been added: these are like their unsuffixed counterparts but pack some or all of their output.
+ `glob` in Text.Pandoc.Class still takes String since it seems to be intended as an interface to Glob, which uses strings. It seems to be used only once in the package, in the EPUB writer, so that is not hard to change.
2019-11-11Change the implementation of `htmlSpanLikeElements` and implement `<dfn>` (#5882)Florian Beeres1-4/+11
* Add HTML Reader support for `<dfn>`, parsing this as a Span with class `dfn`.
* Change `htmlSpanLikeElements` implementation to retain classes, attributes and inline content.
2019-11-04Removed an unnecessary unpack.John MacFarlane1-1/+1
2019-11-04HTML Reader/Writer - Add support for <var> and <samp> (#5861)Amogh Rathore1-5/+7
Closes #5799
2019-10-24HTML reader/writer: Better handling of <q> with cite attribute (#5837)Ole Martin Ruud1-23/+34
* HTML reader: Handle cite attribute for quotes. If a `<q>` tag has a `cite` attribute, we interpret it as a Quoted element with an inner Span. Closes #5798
* Refactor url canonicalization into a helper function
* Modify HTML writer to handle quote with cite.
[0]: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/q
2019-10-23Add Reader support for HTML <samp> element (#5843)Amogh Rathore1-0/+9
The `<samp>` element is parsed as a Span with class `sample`. Closes #5792.
2019-10-15Add support for reading and writing <kbd> elementsDaniele D'Orazio1-1/+9
* Text.Pandoc.Shared: export `htmlSpanLikeElements` [API change]
This commit also introduces a mapping of HTML span-like elements that are internally represented as a Span with a single class, but that are converted back to the original element by the HTML writer. As of now, only the kbd element is handled this way. Ideally these elements should be handled as plain AST values, but since that would be a breaking change with a large impact, we revert to this stop-gap solution. Fixes https://github.com/jgm/pandoc/issues/5796.
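The round trip described in this commit might look roughly like the sketch below. It is only an illustration under stated assumptions: the `Inline` type is a simplified stand-in for the pandoc AST, `spanLikeElements`, `readSpanLike` and `writeSpanLike` are hypothetical names, and the element list contains only `kbd`, as the commit message says.

```haskell
-- Simplified stand-in for the pandoc inline AST (assumption):
-- a Span carries a list of classes and its inline contents.
data Inline
  = Str String
  | Span [String] [Inline]
  deriving (Eq, Show)

-- Hypothetical list of span-like element names.
spanLikeElements :: [String]
spanLikeElements = ["kbd"]

-- Reader side: a span-like tag becomes a Span whose single class is
-- the tag name; any other tag is not handled here.
readSpanLike :: String -> [Inline] -> Maybe Inline
readSpanLike tag contents
  | tag `elem` spanLikeElements = Just (Span [tag] contents)
  | otherwise = Nothing

-- Writer side: a Span with exactly one class that names a span-like
-- element is written back as that element; the result is the tag name.
writeSpanLike :: Inline -> Maybe String
writeSpanLike (Span [cls] _)
  | cls `elem` spanLikeElements = Just cls
writeSpanLike _ = Nothing
```

Keeping the element name as a class avoids adding a new AST constructor, which is the trade-off the commit message calls a stop-gap.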
2019-09-28Use Prelude.fail to avoid ambiguity with fail from GHC.Base.John MacFarlane1-2/+2
2019-07-02Fix redundant constraint warnings. (#5625)Pete Ryland1-2/+2
2019-05-29HTML reader: misc. epub related fixes.John MacFarlane1-30/+41
- With epub extensions, check for epub:type in addition to type.
- Fix problem with noteref parsing which caused block-level content to be eaten with the noteref.
- Rename pAnyTag to pAny.
- Refactor note resolution.
2019-05-27consolidate simple-table detection (#5524)Mauro Bieg1-7/+2
add `onlySimpleTableCells` to `Text.Pandoc.Shared` [API change]
This fixes an inconsistency in the HTML reader, which did not treat tables with `<p>` inside cells as simple.
2019-05-25HTML reader: trim definition list termsAlexander Krotov1-1/+1
2019-03-25HTML reader: read `data-foo` attribute into `foo`.John MacFarlane1-1/+2
The HTML writer adds the `data-` prefix for HTML5 for nonstandard attributes. But the attributes are represented in the AST without the `data-` prefix, so we should strip this when reading HTML. Closes #5392.
2019-03-01Remove license boilerplate.John MacFarlane1-18/+0
The haddock module header contains essentially the same information, so the boilerplate is redundant and just one more thing to get out of sync.
2019-02-04Add missing copyright notices and remove license boilerplate (#5112)Albert Krewinkel1-2/+2
Quite a few modules were missing copyright notices. This commit adds copyright notices everywhere via haddock module headers. The old license boilerplate comment is redundant with this and has been removed. Update copyright years to 2019. Closes #4592.
2019-01-21HTML and markdown: treat textarea as a verbatim environment.John MacFarlane1-1/+3
We don't want to parse its contents as Markdown or HTML. Closes #5241.
2018-12-31Remove unused HasHeaderMap (#5175)Alexander1-6/+1
It is updated by some readers, but never actually used.
2018-12-17HTML reader: handle empty start attribute.John MacFarlane1-4/+2
See #5162.
2018-11-16HTML reader: allow tfoot before body rows.John MacFarlane1-2/+3
Closes #5079.
2018-11-15HTML reader: parse `<small>` as a Span with class "small".John MacFarlane1-0/+4
Closes #5080.
2018-11-13HTML reader: allow thead containing a row with td rather than th.John MacFarlane1-11/+11
See #5014. Note that this doesn't address the original issue in #5014, only an unrelated side-issue.
2018-10-11HTML reader: fix htmlTag and isInlineTag to accept processing instructions.John MacFarlane1-8/+10
Fixes regression #3123 (since 2.0). Added regression test.
2018-09-07HTML reader: parse `<script type="math/tex` tags as math.John MacFarlane1-0/+12
These are used by MathJax. Closes #4877.
2018-08-24HTML reader: allow enabling `raw_tex` extension.John MacFarlane1-3/+28
This now allows raw LaTeX environments, `\ref`, and `\eqref` to be parsed (which is helpful for translating HTML documents that use MathJax). Closes #1126.
2018-08-22HTML reader: extract spaces inside links instead of trimming themAlexander Krotov1-3/+3
Fixes #4845
2018-07-02Spellcheck commentsAlexander Krotov1-2/+2
2018-04-05Changes to tests to accommodate changes in pandoc-types.John MacFarlane1-2/+4
In https://github.com/jgm/pandoc-types/pull/36 we changed the table builder to pad cells. This commit changes tests (and two readers) to accord with this behavior.
2018-03-18Use NoImplicitPrelude and explicitly import Prelude.John MacFarlane1-0/+2
This seems to be necessary if we are to use our custom Prelude with ghci. Closes #4464.
2018-03-16Monoid/Semigroup cleanup relying on custom Prelude.John MacFarlane1-1/+1
2018-01-19hlint code improvements.John MacFarlane1-4/+4
2018-01-15HTML reader: Fix col width parsing for percentages < 10% (#4262)n3fariox1-3/+6
Rather than take user input, and place a "0." in front, actually calculate the percentage to catch cases where small column sizes (e.g. `2%`) are needed.
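The difference between the two approaches can be sketched as follows. This is an illustration only: `naiveWidth` and `percentWidth` are hypothetical names, and both take just the digits of the percentage (the `%` sign is assumed to have been stripped already).

```haskell
import Text.Read (readMaybe)

-- Old approach as described above: put "0." in front of the digits.
-- This happens to work for two-digit values ("25" gives 0.25) but is
-- wrong for values below 10 ("2" gives 0.2, i.e. 20%).
naiveWidth :: String -> Maybe Double
naiveWidth digits = readMaybe ("0." ++ digits)

-- Fixed approach: parse the number and actually divide by 100.
percentWidth :: String -> Maybe Double
percentWidth digits = (/ 100) <$> readMaybe digits
```

For the `2%` case mentioned in the commit message, the naive version yields a width of 0.2 while the computed version yields 0.02.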
2018-01-05Update copyright notices to include 2018Albert Krewinkel1-2/+2
2017-12-27Fix warning.John MacFarlane1-2/+1
2017-12-27Small improvement to figcaption parsing. #4184.John MacFarlane1-2/+0
2017-12-27Merge pull request #4184 from mb21/html-reader-figcaptionJohn MacFarlane1-4/+7
HTML Reader: be more forgiving about figcaption
2017-12-27HTML reader: parse div with class `line-block` as LineBlock.John MacFarlane1-1/+13
See #4162.
2017-12-23HTML Reader: be more forgiving about figcaptionmb211-4/+7
fixes #4183
2017-12-06Markdown reader: accept processing instructions as raw HTML.John MacFarlane1-2/+3
Closes #4125.
2017-12-04Add `empty_paragraphs` extension.John MacFarlane1-4/+9
* Deprecate `--strip-empty-paragraphs` option. Instead we now use an `empty_paragraphs` extension that can be enabled on the reader or writer. By default, disabled.
* Add `Ext_empty_paragraphs` constructor to `Extension`.
* Revert "Docx reader: don't strip out empty paragraphs." This reverts commit d6c58eb836f033a48955796de4d9ffb3b30e297b.
* Implement `empty_paragraphs` extension in docx reader and writer, opendocument writer, html reader and writer.
* Add tests for `empty_paragraphs` extension.
2017-11-25Fix comment typo: s/elemnet/element/Alexander Krotov1-1/+1
2017-11-18HTML reader: ensure we don't produce level 0 headers,John MacFarlane1-5/+5
even for chapter sections in epubs. This causes problems because writers aren't set up to expect these. This fixes the most immediate problem in #4076. It would be good to think more about how to propagate the information that top-level headers are chapters from the reader to the writer.
2017-11-10HTML reader: hlintAlexander Krotov1-31/+30
2017-11-01Really fix #3989.John MacFarlane1-5/+12
The previous fix only worked in certain cases. Other cases with `>` in an HTML attribute broke.
2017-11-01hlintAlexander Krotov1-5/+5