path: root/src/Text/Pandoc/Readers/HTML.hs
Age Commit message Author Files Lines
2016-12-08Removed debug trace from HTML reader.John MacFarlane1-2/+1
2016-12-07HTML reader: Understand `style=width:` as well as `width` in `col`.John MacFarlane1-2/+7
Closes #3286.
2016-12-06Fixed some bad regressions in HTML table parser.John MacFarlane1-3/+3
This regression leads to the introduction of empty rows in some circumstances. Closes #3280.
2016-11-26HTML reader: improved table parsing.John MacFarlane1-11/+24
We now check explicitly for non-1 rowspan or colspan attributes, and fail when we encounter them. Previously we checked that each row had the same number of cells, but that could be true even with rowspans/colspans. And there are cases where it isn't true in tables that we can handle fine -- e.g. when a tr element is empty. So now we just pad rows with empty cells when needed. Closes #3027.
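The row handling described above might be sketched roughly as follows. This is only an illustration, not the code from the commit: the `Cell` type and the names `hasSpan` and `normalizeRows` are hypothetical, and real cells hold block content rather than a string.

```haskell
import Data.Maybe (fromMaybe)

-- Hypothetical, simplified cell: its attributes plus its text content.
data Cell = Cell { cellAttrs :: [(String, String)], cellText :: String }
  deriving (Eq, Show)

-- True if the cell carries a rowspan or colspan other than "1".
hasSpan :: Cell -> Bool
hasSpan c = any spans ["rowspan", "colspan"]
  where spans k = fromMaybe "1" (lookup k (cellAttrs c)) /= "1"

-- Fail (Nothing) on any spanning cell; otherwise pad short rows with
-- empty cells so that every row is as wide as the widest one.
normalizeRows :: [[Cell]] -> Maybe [[Cell]]
normalizeRows rows
  | any (any hasSpan) rows = Nothing
  | otherwise              = Just (map pad rows)
  where
    width   = maximum (0 : map length rows)
    pad row = row ++ replicate (width - length row) (Cell [] "")
```

Under these assumptions an empty `tr` simply becomes a row of empty cells, while a table containing `colspan="2"` is rejected outright.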
2016-11-13HTML reader: only treat "a" element as link if it has href.John MacFarlane1-7/+19
Otherwise treat as span. Closes #3226.
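As a rough sketch of the rule above (the `Inline` type here is a hypothetical, cut-down stand-in for pandoc's own, and `anchorToInline` is an illustrative name, not a function from the reader):

```haskell
-- Hypothetical, simplified inline type (not pandoc's real Inline).
data Inline
  = Link String String   -- link text, target URL
  | Span String          -- plain span holding the text
  deriving (Eq, Show)

-- An <a> element becomes a Link only when it has an href attribute;
-- otherwise (e.g. <a name="top">) it is kept as a Span.
anchorToInline :: [(String, String)] -> String -> Inline
anchorToInline attrs txt =
  case lookup "href" attrs of
    Just url -> Link txt url
    Nothing  -> Span txt
```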
2016-11-02HTML reader: treat `<math>` as MathML by default...John MacFarlane1-8/+11
unless something else is explicitly specified in xmlns. Provided it parses as MathML, of course. Also fixed default which should be to inline math if no display attribute is used.
2016-09-02Remove Compat.MonoidJesse Rosenthal1-1/+1
This was only necessary for GHC versions with base below 4.5 (i.e., ghc < 7.4).
2016-05-21HTML reader: fixed bug in pClose.John MacFarlane1-1/+1
This caused exponential parsing behavior in documents with unclosed tags in dl, dd, dt.
2016-04-10Markdown + HTML readers: be more forgiving about unescaped &.John MacFarlane1-10/+15
We are now more forgiving about parsing invalid HTML with unescaped `&` as raw HTML. (Previously any unescaped `&` would cause pandoc not to recognize the string as raw HTML.) Closes #2410.
2016-03-22Fixed bug in Markdown raw HTML parsing.John MacFarlane1-1/+1
This was a regression, with the rewrite of `htmlInBalanced` (from `Text.Pandoc.Readers.HTML`) in 1.17. It caused newlines to be omitted in raw HTML blocks. Closes #2804.
2016-03-10Fixed behavior of base tag.John MacFarlane1-17/+11
+ If the base path does not end with slash, the last component will be replaced. E.g. base = `http://example.com/foo` combines with `bar.html` to give `http://example.com/bar.html`.
+ If the href begins with a slash, the whole path of the base is replaced. E.g. base = `http://example.com/foo/` combines with `/bar.html` to give `http://example.com/bar.html`.
Closes #2777.
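The two rules above can be illustrated with a small string-based sketch. It is not the reader's implementation (which would more plausibly rely on a URI library); `splitScheme` and `resolveHref` are hypothetical names, and the sketch assumes an absolute base URL and ignores query strings, fragments and `..` segments.

```haskell
import Data.List (dropWhileEnd, isInfixOf, isPrefixOf)

-- Split "scheme://rest" into ("scheme://", "rest"); no scheme gives ("", s).
splitScheme :: String -> (String, String)
splitScheme s = case break (== ':') s of
  (sch, ':' : '/' : '/' : rest) -> (sch ++ "://", rest)
  _                             -> ("", s)

-- Combine a base URL with an href following the two rules above.
resolveHref :: String -> String -> String
resolveHref base href
  | "://" `isInfixOf` href = href                   -- already absolute
  | "/" `isPrefixOf` href  = origin ++ href         -- whole base path replaced
  | otherwise              = origin ++ dir ++ href  -- last component replaced
  where
    (scheme, rest) = splitScheme base
    (host, path)   = break (== '/') rest
    origin         = scheme ++ host
    -- Base path up to and including its last slash ("/" if there is none).
    dir = case dropWhileEnd (/= '/') path of
            "" -> "/"
            d  -> d
```

With these definitions, `resolveHref "http://example.com/foo" "bar.html"` and `resolveHref "http://example.com/foo/" "/bar.html"` both evaluate to `"http://example.com/bar.html"`, matching the examples in the commit message.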
2016-02-20Fixed some linter warnings.John MacFarlane1-3/+3
2016-02-20HTML reader: rewrote htmlInBalanced.John MacFarlane1-10/+39
This version avoids an exponential performance problem with `<script>` tags, and it should be faster in general. Closes #2730.
2016-02-16HTML reader: properly handle an empty cell in a simple table.John MacFarlane1-0/+1
Closes #2718.
2016-01-29HTML reader: handle multiple meta tags with same name.John MacFarlane1-2/+6
Put them in a list in the metadata so they are all preserved, rather than (as before) throwing out all but one.
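A minimal sketch of that accumulation, assuming a hypothetical `MetaVal` type and an association list in place of pandoc's real metadata structures (`addMeta` and `collectMeta` are illustrative names only):

```haskell
-- Hypothetical metadata value: a single string or a list of strings.
data MetaVal = MetaStr String | MetaList [String]
  deriving (Eq, Show)

-- Add one (name, content) pair from a <meta> tag, keeping earlier
-- values for the same name instead of overwriting them.
addMeta :: (String, String) -> [(String, MetaVal)] -> [(String, MetaVal)]
addMeta (k, v) [] = [(k, MetaStr v)]
addMeta (k, v) ((k', old) : rest)
  | k == k'   = (k, merge old) : rest
  | otherwise = (k', old) : addMeta (k, v) rest
  where
    merge (MetaStr s)   = MetaList [s, v]
    merge (MetaList ss) = MetaList (ss ++ [v])

-- Fold all meta tags, in document order, into the metadata.
collectMeta :: [(String, String)] -> [(String, MetaVal)]
collectMeta = foldl (flip addMeta) []
```

A name that appears once stays a plain string; a second occurrence promotes it to a list, so no value is dropped.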
2016-01-22Changed type of Shared.uniqueIdent argument from [String] to Set String.John MacFarlane1-3/+3
This avoids performance problems in documents with many identically named headers. Closes #2671.
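To show why the container type matters, here is a hedged sketch of an identifier-uniquifying function over a `Set`. It is not `Shared.uniqueIdent` itself: the name `uniqueIdentSketch` and the `-1`, `-2`, ... suffix scheme are assumptions made for the example.

```haskell
import qualified Data.Set as Set

-- Return a variant of `base` that is not already in `used`, appending
-- "-1", "-2", ... on a clash. Each membership test is O(log n) with a
-- Set, whereas `elem` on a list is O(n).
uniqueIdentSketch :: String -> Set.Set String -> String
uniqueIdentSketch base used
  | base `Set.notMember` used = base
  | otherwise =
      head [ c | n <- [1 :: Int ..]
               , let c = base ++ "-" ++ show n
               , c `Set.notMember` used ]
```

A caller would insert each returned identifier into the set before processing the next header.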
2015-12-12Modified readers to emit SoftBreak when appropriate.John MacFarlane1-1/+4
2015-11-19Merge branch 'new-image-attributes' of https://github.com/mb21/pandoc into mb21-new-image-attributesJohn MacFarlane1-15/+11
* Bumped version to 1.16.
* Added Attr field to Link and Image.
* Added `common_link_attributes` extension.
* Updated readers for link attributes.
* Updated writers for link attributes.
* Updated tests
* Updated stack.yaml to build against unreleased versions of pandoc-types and texmath.
* Fixed various compiler warnings.
Closes #261.
TODO:
* Relative (percentage) image widths in docx writer.
* ODT/OpenDocument writer (untested, same issue about percentage widths).
* Update pandoc-citeproc.
2015-11-09Restored Text.Pandoc.Compat.Monoid.John MacFarlane1-1/+1
Don't use custom prelude for latest ghc. This is a better approach to making 'stack ghci' and 'cabal repl' work. Instead of using NoImplicitPrelude, we only use the custom prelude for older ghc versions. The custom prelude presents a uniform API that matches the current base version's prelude. So, when developing (presumably with latest ghc), we don't use a custom prelude at all and hence have no trouble with ghci. The custom prelude no longer exports (<>): we now want to match the base 4.8 prelude behavior.
2015-11-09Revert "Use -XNoImplicitPrelude and 'import Prelude' explicitly."John MacFarlane1-1/+0
This reverts commit c423dbb5a34c2d1195020e0f0ca3aae883d0749b.
2015-11-08Use -XNoImplicitPrelude and 'import Prelude' explicitly.John MacFarlane1-0/+1
This is needed for ghci to work with pandoc, given that we now use a custom prelude. Closes #2503.
2015-10-22Fixed over-eager raw HTML inline parsing.John MacFarlane1-0/+1
Tightened up the inline HTML parser so it disallows TagWarnings. This only affects the markdown reader when the `markdown_in_html_blocks` option is disabled. Closes #2469.
2015-10-14Use custom Prelude to avoid compiler warnings.John MacFarlane1-2/+2
- The (non-exported) prelude is in prelude/Prelude.hs.
- It exports Monoid and Applicative, like base 4.8 prelude, but works with older base versions.
- It exports (<>) for mappend.
- It hides 'catch' on older base versions.
This allows us to remove many imports of Data.Monoid and Control.Applicative, and remove Text.Pandoc.Compat.Monoid. It should allow us to use -Wall again for ghc 7.10.
2015-10-11HTML reader/writer: better handling of "section" elements.John MacFarlane1-3/+10
Previously `<section>` tags were just parsed as raw HTML blocks. With this change, section elements are parsed as Div elements with the class "section". The HTML writer will use `<section>` tags to render these Divs in HTML5; otherwise they will be rendered as `<div class="section">`. Closes #2438.
2015-08-08HTML reader: add auto identifiers if not present on headers.John MacFarlane1-7/+17
This makes TOC linking work properly. The same thing needs to be done to the org reader to fix #2354; in addition, `Ext_auto_identifiers` should be added to the list of default extensions for org in Text.Pandoc.
2015-08-07Updated readers, writers and README for link attributemb211-14/+4
2015-08-07Updated readers and writers for new image attribute parameter.John MacFarlane1-1/+7
(mb21)
2015-07-27HTML Reader: Detect font-variant with pickStyleAttrPropsOphir Lifshitz1-6/+5
2015-07-24HTML Reader: Parse <ol> type, class, and inline list-style(-type) CSSOphir Lifshitz1-17/+30
2015-07-21Fix regression: allow HTML comments containing `--`.John MacFarlane1-4/+4
Technically this isn't allowed in an HTML comment, but we've always allowed it, and so do most other implementations. It is handy if e.g. you want to put command line arguments in HTML comments.
2015-07-21HTML reader: handle type attribute on ol.John MacFarlane1-1/+8
E.g. `<ol type="i">`. Closes #2313.
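The mapping from the `type` attribute to a numbering style could look something like this. The `NumberStyle` type is a local, simplified stand-in defined for the example, and `olTypeToStyle` is a hypothetical name; the five attribute values are the ones HTML defines for `ol`.

```haskell
-- Simplified numbering styles, defined locally for this sketch.
data NumberStyle
  = DefaultStyle | Decimal | LowerAlpha | UpperAlpha | LowerRoman | UpperRoman
  deriving (Eq, Show)

-- Map the value of <ol type="..."> (if any) to a numbering style.
-- The attribute is case-sensitive: "i" and "I" mean different styles.
olTypeToStyle :: Maybe String -> NumberStyle
olTypeToStyle t = case t of
  Just "1" -> Decimal
  Just "a" -> LowerAlpha
  Just "A" -> UpperAlpha
  Just "i" -> LowerRoman
  Just "I" -> UpperRoman
  _        -> DefaultStyle
```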
2015-07-10Avoid parsing partial URLs as HTML tags.John MacFarlane1-1/+8
Closes #2277.
2015-06-04HTML reader: allow `<body>` to close `<head>`.John MacFarlane1-0/+1
2015-05-13HTML reader: Support base tag.John MacFarlane1-7/+28
We only support the href attribute, as there's no place for "target" in the Pandoc document model for links. Added HTML reader test module, with tests for this feature. Closes #1751.
2015-05-11HTML reader: Fixed detection of self-closing tags.John MacFarlane1-2/+2
Earlier versions had a bug and would wrongly think opening tags containing attributes with slashes in them were self-closing. Closes #2146.
2015-04-29HTML reader: Allow multiple colgroups in table.John MacFarlane1-1/+1
Closes #2122.
2015-04-26Updated copyright notices to -2015. Closes #2111.John MacFarlane1-2/+2
2015-04-17More principled fix for #1820.John MacFarlane1-5/+7
If the tag parses as a comment, we check to see if the input starts with `<!--`. If not, it's bogus comment mode and we fail htmlTag. Includes test case. Closes #1820.
2015-04-17Fixed `htmlTag` in HTML reader.John MacFarlane1-1/+1
Require that `<!` or `<?` be followed by nonspace. This prevents `</ div>` from being parsed as a comment. Closes #1820.
2015-02-18Move utility error functions to Text.Pandoc.SharedMatthew Pickering1-1/+1
2015-02-18Change return type of HTML readerMatthew Pickering1-5/+12
2015-01-25fixes #1859 HTML Reader table parsingmb211-11/+22
2014-11-16Make `embed` tag either block or inline.John MacFarlane1-2/+2
Closes #1756.
2014-09-25HTML Reader: Recognise <br> tags inside <pre> blocksmpickering1-1/+6
Closes #1620
2014-08-18HTML reader: improved handling of tags that can be block or inline.John MacFarlane1-5/+13
Previously a section like this would be enclosed in a paragraph, with RawInline for the video tags (since video is a tag that can be either block or inline):

    <video controls="controls">
      <source src="../videos/test.mp4" type="video/mp4" />
      <source src="../videos/test.webm" type="video/webm" />
      <p>
        The videos can not be played back on your system.<br/>
        Try viewing on Youtube (requires Internet connection):
        <a href="http://youtu.be/etE5urBps_w">Relative Velocity on Youtube</a>.
      </p>
    </video>

This change will cause the video and source tags to be parsed as RawBlock instead, giving better output. The general change is this: when we're parsing a "plain" sequence of inlines, we don't parse anything that COULD be a block-level tag.
2014-08-16HTML reader: Parse appropriately styled span as SmallCaps.John MacFarlane1-1/+6
2014-08-12EPUB Reader: Ignore title pagesMatthew Pickering1-4/+10
2014-08-08Added `native_divs` and `native_spans` extensions.John MacFarlane1-1/+4
This allows users to turn off the default pandoc behavior of parsing contents of div and span tags in markdown and HTML as native pandoc Div blocks and Span inlines. Setting of default epub extensions has been moved from the EPUB reader to Text.Pandoc.
2014-08-08HTML EPUB exts: switch element can now be in either the inline or block positionMatthew Pickering1-9/+10
2014-08-07HTML reader: Really ignore DOCTYPE and xml declarations.John MacFarlane1-2/+2
This actually does what d71b013841f3c9c8c595591e312a31df16a728cb said it did. Revised epub tests to remove the repeated DOCTYPE and xml tags.