path: root/src/Text/Pandoc/Readers/HTML.hs
Age | Commit message | Author | Files | Lines
2017-10-24HTML reader: td or th implicitly closes blocks within last td/th.John MacFarlane1-1/+5
2017-10-23HTML reader: `htmlTag` improvements.John MacFarlane1-8/+19
We previously failed on cases where an attribute contained a `>` character. This patch fixes the bug. Closes #3989.
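The fix can be pictured with a small scanner that tracks whether it is inside a quoted attribute value, so that a `>` there does not end the tag. This is only a sketch of the idea: `scanTag` is a hypothetical helper, not pandoc's `htmlTag`, and it ignores comments and other special cases.

```haskell
-- Hypothetical helper (not pandoc's htmlTag): returns the tag that starts the
-- input together with the remaining input, or Nothing if no complete tag is found.
scanTag :: String -> Maybe (String, String)
scanTag ('<':rest) = go Nothing "<" rest
  where
    -- First argument: the quote character we are currently inside, if any.
    -- Second argument: the characters consumed so far, in reverse order.
    go _ _ [] = Nothing
    go Nothing acc ('>':cs) = Just (reverse ('>' : acc), cs)
    go Nothing acc (c:cs)
      | c == '"' || c == '\'' = go (Just c) (c : acc) cs
      | otherwise             = go Nothing (c : acc) cs
    go (Just q) acc (c:cs)
      | c == q    = go Nothing (c : acc) cs   -- closing quote
      | otherwise = go (Just q) (c : acc) cs  -- '>' is ordinary text here
scanTag _ = Nothing
```

With this sketch, `scanTag "<a title=\"x > y\">text"` yields the whole opening tag and leaves `text` as the remainder.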
2017-09-17Added `--strip-comments` option, `readerStripComments` in `ReaderOptions`.John MacFarlane1-6/+10
* Options: Added readerStripComments to ReaderOptions.
* Added `--strip-comments` command-line option.
* Made `htmlTag` from the HTML reader sensitive to this feature. This affects Markdown and Textile input.
Closes #2552.
2017-09-04HTML reader: Fix pattern match.John MacFarlane1-1/+1
2017-08-30HTML reader: improved handling of figure.John MacFarlane1-17/+17
Previously we had a parse failure if the figure contained anything besides an image and caption.
2017-08-17HTML reader: support column alignments.John MacFarlane1-13/+30
These can be set either with an `align` attribute or with `text-align` in a `style` attribute. Closes #1881.
2017-08-09HTML reader: parse <main> like <div role=main>. (#3791)bucklereed1-7/+11
* HTML reader: parse <main> like <div role=main>.
* <main> closes <p> and behaves like a block element generally.
2017-07-22HTML Reader: parse figure and figcaption (#3813)Mauro Bieg1-0/+20
2017-07-11HTML reader: Ensure that paragraphs are closed properly...John MacFarlane1-0/+2
when the parent block element closes, even without `</p>`. Closes #3794.
2017-06-27HTML reader: Use the lang value of <html> to set the lang meta value. (#3765)bucklereed1-0/+9
* HTML reader: Use the lang value of <html> to set the lang meta value.
* Fix for pre-AMP environments.
2017-06-20Move CR filtering from tabFilter to the readers.John MacFarlane1-2/+2
The readers previously assumed that CRs had been filtered from the input. Now we strip the CRs in the readers themselves, before parsing. (The point of this is just to simplify the parsers.) Shared now exports a new function `crFilter`. [API change] And `tabFilter` no longer filters CRs.
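As a rough illustration of what stripping CRs before parsing amounts to, here is a sketch on `String`. The name `crFilterSketch` is made up for this example; the exported `crFilter` operates on the readers' actual input type, and its exact definition is not shown in this log.

```haskell
-- Sketch only: drop every carriage return so the parsers see LF line endings.
crFilterSketch :: String -> String
crFilterSketch = filter (/= '\r')
```

A reader would apply such a filter once to its input, after which no parser needs a special case for `\r\n`.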
2017-06-19Separated tracing from logging.John MacFarlane1-3/+2
Formerly tracing was just log messages with a DEBUG log level. We now make these things independent. Tracing can be turned on or off in PandocMonad using `setTrace`; it is independent of logging.
* Removed `DEBUG` from `Verbosity`.
* Removed `ParserTrace` from `LogMessage`.
* Added `trace`, `setTrace` to `PandocMonad`.
2017-06-11Rewrote HTML reader to use Text throughout.John MacFarlane1-137/+194
- Export new NamedTag class from HTML reader.
- Effect on memory usage is modest (< 10%).
2017-06-10Changed all readers to take Text instead of String.John MacFarlane1-2/+4
Readers: Renamed StringReader -> TextReader. Updated tests. API change.
2017-06-02Fixed HTML reader.John MacFarlane1-2/+3
2017-06-01HTML reader: Use sets instead of lists for block tag lookup.John MacFarlane1-50/+43
2017-06-01HTML reader: Removed "button" from block tag list.John MacFarlane1-1/+1
It is already in the eitherBlockOrInlineTag list, and should not be in both places. Closes #3717. Note: the result of this change is that there will be p tags around the whole paragraph. That is the right result, because the `button` tags are treated as inline HTML here, and the whole chunk of text is a Markdown paragraph.
2017-05-24HTML reader: Add `details` tag to list of block tags.John MacFarlane1-1/+2
Closes #3694.
2017-05-13Update dates in copyright noticesAlbert Krewinkel1-2/+2
This follows the suggestions given by the FSF for GPL licensed software. <https://www.gnu.org/prep/maintain/html_node/Copyright-Notices.html>
2017-04-23HTML reader: Revise treatment of li with id attribute.John MacFarlane1-2/+6
Previously we always added an empty div before the list item, but this created problems with spacing in tight lists. Now we do this: If the list item contents begin with a Plain block, we modify the Plain block by adding a Span around its contents. Otherwise, we add a Div around the contents of the list item (instead of adding an empty Div to the beginning, as before). Closes #3596.
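The rule described above can be sketched with cut-down stand-ins for the pandoc AST. The types and the function `attachItemId` below are assumptions made for illustration; the real `Span` and `Div` carry a full attribute triple rather than a bare identifier.

```haskell
-- Minimal stand-ins for the pandoc AST (assumed, simplified).
data Inline = Str String | Span String [Inline]
  deriving (Show, Eq)

data Block = Plain [Inline] | Para [Inline] | Div String [Block]
  deriving (Show, Eq)

-- Attach a list item's id to its contents.
attachItemId :: String -> [Block] -> [Block]
attachItemId "" blocks = blocks                       -- no id: nothing to do
attachItemId ident (Plain inlines : rest) =
  Plain [Span ident inlines] : rest                   -- wrap the leading Plain in a Span
attachItemId ident blocks = [Div ident blocks]        -- otherwise wrap everything in a Div
```

Because the first case keeps the leading `Plain` block in place, a tight list stays tight, which is the spacing problem the empty `Div` used to cause.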
2017-03-18HTML reader: Better sanity checks on raw HTML.John MacFarlane1-6/+17
This also affects the Markdown reader. Closes #3257.
2017-03-12Issue warning for duplicate header identifiers.John MacFarlane1-5/+11
As noted in the previous commit, an autogenerated identifier may still coincide with an explicit identifier that is given for a header later in the document, or with an identifier on a div, span, link, or image. This commit adds a warning in this case, so users can supply an explicit identifier.
* Added `DuplicateIdentifier` to LogMessage.
* Modified HTML, Org, MediaWiki readers so their custom state type is an instance of HasLogMessages. This is necessary for `registerHeader` to issue warnings.
See #1745.
2017-03-04Fixed some loose ends in #1592.John MacFarlane1-1/+3
Added test cases. Fixed HTML reader to parse a span with class "smallcaps" as SmallCaps. Fixed Markdown writer to render SmallCaps as a native span when native spans are enabled.
2017-02-20Tighten up HasQuoteContext instance in HTML reader.John MacFarlane1-1/+1
We constrain it to the state used in the HTML reader. Otherwise we can get overlap with the general instance for ParserState m.
2017-02-11Use new warnings throughout the code base.John MacFarlane1-6/+4
2017-02-10Added Text.Pandoc.Logging (exported module).John MacFarlane1-1/+2
This now contains the Verbosity definition previously in Options, as well as a new LogMessage datatype that will eventually be used instead of raw strings for warnings. This will enable us, among other things, to provide machine-readable warnings if desired. See #3392.
2017-02-10HTML reader: Added warnings for ignored material.John MacFarlane1-5/+14
See #3392.
2017-02-06Removed --parse-raw and readerParseRaw.John MacFarlane1-7/+7
These were confusing. Now we rely on the +raw_tex or +raw_html extension with latex or html input. Thus, instead of `--parse-raw -f latex` we use `-f latex+raw_tex`, and instead of `--parse-raw -f html` we use `-f html+raw_html`.
2017-01-25More logging-related changes.John MacFarlane1-9/+5
Class:
* Removed getWarnings, withWarningsToStderr
* Added report
* Added logOutput to PandocMonad
* Make logOutput streaming in PandocIO monad
* Properly reverse getLog output
Readers:
* Replaced use of trace with report DEBUG.
TWiki Reader: Put everything inside PandocMonad m.
API changes.
2017-01-25Changes to verbosity in writer and reader options.John MacFarlane1-3/+3
API changes:
Text.Pandoc.Options:
* Added Verbosity.
* Added writerVerbosity.
* Added readerVerbosity.
* Removed writerVerbose.
* Removed readerTrace.
pandoc CLI: The `--trace` option sets verbosity to DEBUG; the `--quiet` option sets it to ERROR, and the `--verbose` option sets it to INFO. The default is WARNING.
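The flag-to-verbosity mapping in this commit can be sketched as follows. The constructor names come from the message; the functions `verbosityFromFlags` and `shouldReport`, the "last flag wins" policy, and the ordering-based threshold check are assumptions made for the example, not pandoc's actual option handling.

```haskell
-- Constructors as named in the commit message, ordered from least to most verbose.
data Verbosity = ERROR | WARNING | INFO | DEBUG
  deriving (Show, Eq, Ord)

-- Hypothetical: fold over command-line flags; WARNING is the default and the
-- last recognised flag wins (an assumption of this sketch).
verbosityFromFlags :: [String] -> Verbosity
verbosityFromFlags = foldl step WARNING
  where
    step _ "--trace"   = DEBUG
    step _ "--quiet"   = ERROR
    step _ "--verbose" = INFO
    step v _           = v

-- Hypothetical: a message is shown when its level does not exceed the threshold.
shouldReport :: Verbosity -> Verbosity -> Bool
shouldReport threshold level = level <= threshold
```

Under these assumptions, an INFO message is hidden at the default WARNING threshold and shown once `--verbose` is given.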
2017-01-25Unify Errors.Jesse Rosenthal1-1/+2
2017-01-25Working on readers.Jesse Rosenthal1-98/+115
2016-12-08Removed debug trace from HTML reader.John MacFarlane1-2/+1
2016-12-07HTML reader: Understand `style=width:` as well as `width` in `col`.John MacFarlane1-2/+7
Closes #3286.
2016-12-06Fixed some bad regressions in HTML table parser.John MacFarlane1-3/+3
The regression led to the introduction of empty rows in some circumstances. Closes #3280.
2016-11-26HTML reader: improved table parsing.John MacFarlane1-11/+24
We now check explicitly for non-1 rowspan or colspan attributes, and fail when we encounter them. Previously we checked that each row had the same number of cells, but that could be true even with rowspans/colspans. And there are cases where it isn't true in tables that we can handle fine -- e.g. when a tr element is empty. So now we just pad rows with empty cells when needed. Closes #3027.
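A sketch of the two checks described here, using plain lists and attribute pairs. `unsupportedSpan` and `padRows` are hypothetical names for this illustration; the reader's own cell and attribute types are not shown in this log.

```haskell
-- True when a cell's attributes carry a rowspan or colspan other than "1",
-- the case in which the table parser gives up.
unsupportedSpan :: [(String, String)] -> Bool
unsupportedSpan attrs = any bad ["rowspan", "colspan"]
  where
    bad name = maybe False (/= "1") (lookup name attrs)

-- Pad every row with the given empty cell up to the width of the longest row.
padRows :: a -> [[a]] -> [[a]]
padRows emptyCell rows = map pad rows
  where
    width   = maximum (0 : map length rows)
    pad row = row ++ replicate (width - length row) emptyCell
```

An empty `tr` then becomes a row of empty cells instead of making the whole table fail to parse.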
2016-11-13HTML reader: only treat "a" element as link if it has href.John MacFarlane1-7/+19
Otherwise treat as span. Closes #3226.
2016-11-02HTML reader: treat `<math>` as MathML by default...John MacFarlane1-8/+11
unless something else is explicitly specified in xmlns. Provided it parses as MathML, of course. Also fixed the default, which should be inline math if no display attribute is used.
2016-09-02Remove Compat.MonoidJesse Rosenthal1-1/+1
This was only necessary for GHC versions with base below 4.5 (i.e., ghc < 7.4).
2016-05-21HTML reader: fixed bug in pClose.John MacFarlane1-1/+1
This caused exponential parsing behavior in documents with unclosed tags in dl, dd, dt.
2016-04-10Markdown + HTML readers: be more forgiving about unescaped &.John MacFarlane1-10/+15
We are now more forgiving about parsing invalid HTML with unescaped `&` as raw HTML. (Previously any unescaped `&` would cause pandoc not to recognize the string as raw HTML.) Closes #2410.
2016-03-22Fixed bug in Markdown raw HTML parsing.John MacFarlane1-1/+1
This was a regression, with the rewrite of `htmlInBalanced` (from `Text.Pandoc.Readers.HTML`) in 1.17. It caused newlines to be omitted in raw HTML blocks. Closes #2804.
2016-03-10Fixed behavior of base tag.John MacFarlane1-17/+11
+ If the base path does not end with slash, the last component will be replaced. E.g. base = `http://example.com/foo` combines with `bar.html` to give `http://example.com/bar.html`.
+ If the href begins with a slash, the whole path of the base is replaced. E.g. base = `http://example.com/foo/` combines with `/bar.html` to give `http://example.com/bar.html`.
Closes #2777.
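The two rules can be sketched on plain strings as below. All three functions are hypothetical and written only for this illustration; the sketch ignores hrefs that carry their own scheme, queries, and fragments, which a real implementation would more likely delegate to a URI library.

```haskell
-- Split "http://example.com/foo/bar" into ("http://example.com", "/foo/bar").
splitOrigin :: String -> (String, String)
splitOrigin = go ""
  where
    go acc (':':'/':'/':rest) =
      let (host, path) = break (== '/') rest
      in (reverse acc ++ "://" ++ host, path)
    go acc (c:cs) = go (c : acc) cs
    go acc []     = (reverse acc, "")

-- Keep the path up to and including its last slash ("/foo" becomes "/").
dropLastComponent :: String -> String
dropLastComponent path =
  case reverse (dropWhile (/= '/') (reverse path)) of
    ""  -> "/"
    dir -> dir

-- Combine a base URL with a relative href following the two rules above.
resolveHref :: String -> String -> String
resolveHref base href =
  case href of
    ('/':_) -> origin ++ href                          -- replace the whole base path
    _       -> origin ++ dropLastComponent path ++ href -- replace the last component
  where
    (origin, path) = splitOrigin base
```

Both examples from the commit message come out as `http://example.com/bar.html` with this sketch, while a base of `http://example.com/foo/` combined with `bar.html` keeps the `foo/` directory.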
2016-02-20Fixed some linter warnings.John MacFarlane1-3/+3
2016-02-20HTML reader: rewrote htmlInBalanced.John MacFarlane1-10/+39
This version avoids an exponential performance problem with `<script>` tags, and it should be faster in general. Closes #2730.
2016-02-16HTML reader: properly handle an empty cell in a simple table.John MacFarlane1-0/+1
Closes #2718.
2016-01-29HTML reader: handle multiple meta tags with same name.John MacFarlane1-2/+6
Put them in a list in the metadata so they are all preserved, rather than (as before) throwing out all but one.
2016-01-22Changed type of Shared.uniqueIdent argument from [String] to Set String.John MacFarlane1-3/+3
This avoids performance problems in documents with many identically named headers. Closes #2671.
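To see why the argument type matters, here is a sketch of an identifier picker backed by `Data.Set`, where each membership test is logarithmic rather than a linear scan of a list. `uniqueIdentSketch` and its numbering scheme (`-1`, `-2`, ...) are assumptions for illustration; the real `uniqueIdent` derives its base identifier from header content, which is not modelled here.

```haskell
import qualified Data.Set as Set

-- Hypothetical: return the base identifier if it is free, otherwise the first
-- "base-N" (N = 1, 2, ...) that is not already in the set of used identifiers.
uniqueIdentSketch :: String -> Set.Set String -> String
uniqueIdentSketch base used
  | base `Set.notMember` used = base
  | otherwise =
      head [ candidate
           | n <- [1 :: Int ..]
           , let candidate = base ++ "-" ++ show n
           , candidate `Set.notMember` used
           ]
```

A caller would insert each chosen identifier into the set before processing the next header, so repeated headers with the same text get distinct identifiers.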
2015-12-12Modified readers to emit SoftBreak when appropriate.John MacFarlane1-1/+4
2015-11-19Merge branch 'new-image-attributes' of https://github.com/mb21/pandoc into mb21-new-image-attributesJohn MacFarlane1-15/+11
* Bumped version to 1.16.
* Added Attr field to Link and Image.
* Added `common_link_attributes` extension.
* Updated readers for link attributes.
* Updated writers for link attributes.
* Updated tests.
* Updated stack.yaml to build against unreleased versions of pandoc-types and texmath.
* Fixed various compiler warnings.
Closes #261.
TODO:
* Relative (percentage) image widths in docx writer.
* ODT/OpenDocument writer (untested, same issue about percentage widths).
* Update pandoc-citeproc.