path: root/src/Text/Pandoc/Readers/HTML.hs
Age | Commit message | Author | Files | Lines
2013-11-07 | recognize svg tag in HTML Reader | MinRK | 1 | -1/+1
Avoids adding lots of `<p>` tags in embedded SVG content, for instance when converting markdown to HTML.
2013-11-03 | HTML reader: Use pandoc Div and Span for raw "<div>", "<span>". | John MacFarlane | 1 | -10/+25
Only if --parse-raw.
2013-08-10 | Adjustments for new Format newtype. | John MacFarlane | 1 | -2/+2
2013-07-16 | HTML reader: read widths from col tags if present. | John MacFarlane | 1 | -6/+23
Closes #893.
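The width handling described in this commit and the one before it (use `col` widths when present, otherwise divide equally) might be sketched as follows. This is a Python illustration, not pandoc's Haskell code; the function name `column_widths` and the assumption that widths arrive as percentage strings such as `"25%"` are hypothetical.

```python
from typing import List, Optional


def column_widths(col_width_attrs: List[Optional[str]]) -> List[float]:
    """Return relative column widths as fractions of the table width.

    ASSUMPTION: each entry is the raw `width` attribute of a <col> tag,
    given as a percentage string, or None when the attribute is missing.
    """
    parsed: List[Optional[float]] = []
    for attr in col_width_attrs:
        value = None
        if attr is not None and attr.strip().endswith("%"):
            try:
                value = float(attr.strip()[:-1]) / 100.0
            except ValueError:
                value = None  # unparseable width: treat as missing
        parsed.append(value)

    # If any width is missing, fall back to dividing the width equally.
    if any(w is None for w in parsed):
        n = len(parsed)
        return [1.0 / n] * n if n else []
    return [w for w in parsed if w is not None]
```

Falling back for the whole table as soon as one width is missing is a design choice of this sketch; the commit message does not say how partial information is treated.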
2013-07-16 | HTML reader: Handle non-simple tables (#893). | John MacFarlane | 1 | -3/+9
Column widths are divided equally. TODO: Get column widths from col tags if present.
2013-07-16 | HTML reader: Generalized table parser. | John MacFarlane | 1 | -4/+9
This commit doesn't change the present behavior at all, but it will make it easier to support non-simple tables in the future.
2013-06-24 | Use new flexible metadata type. | John MacFarlane | 1 | -23/+20
* Depend on pandoc 1.12.
* Added yaml dependency.
* `Text.Pandoc.XML`: Removed `stripTags`. (API change.)
* `Text.Pandoc.Shared`: Added `metaToJSON`. This will be used in writers to create a JSON object for use in the templates from the pandoc metadata.
* Revised readers and writers to use the new Meta type.
* `Text.Pandoc.Options`: Added `Ext_yaml_title_block`.
* Markdown reader: Added support for YAML metadata block. Note that it must come at the beginning of the document.
* `Text.Pandoc.Parsing.ParserState`: Replace `stateTitle`, `stateAuthors`, `stateDate` with `stateMeta`.
* RST reader: Improved metadata. Treat initial field list as metadata when standalone specified. Previously ALL fields "title", "author", "date" in field lists were treated as metadata, even if not at the beginning. Use `subtitle` metadata field for subtitle.
* `Text.Pandoc.Templates`: Export `renderTemplate'` that takes a string instead of a compiled template.
* OPML template: Use 'for' loop for authors.
* Org template: '#+TITLE:' is inserted before the title. Previously the writer did this.
2013-03-28 | Parsing: Better error reporting in readWith. | John MacFarlane | 1 | -1/+4
- Specialize readWith to String input.
- On error have it print the line in which the error occurred, with a caret pointing to the column.
- This should help diagnose parsing problems in LaTeX especially.
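The caret-style report described above could look roughly like this. It is a Python sketch of the idea only; `format_parse_error` and its message layout are hypothetical and do not reproduce the actual output of pandoc's `readWith`.

```python
def format_parse_error(source: str, line: int, column: int, message: str) -> str:
    """Render a parse error with the offending line and a caret under the column.

    `line` and `column` are 1-based, as parser positions usually are
    (an assumption of this sketch).
    """
    lines = source.splitlines()
    # Guard against positions past the end of the input.
    offending = lines[line - 1] if 1 <= line <= len(lines) else ""
    caret = " " * max(column - 1, 0) + "^"
    return f"Error at line {line}, column {column}: {message}\n{offending}\n{caret}"
```

Printing the source line next to the position saves the reader from counting lines in a large input, which is presumably why the commit singles out LaTeX documents.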
2013-02-16 | HTML reader: Preserve all header attributes. | John MacFarlane | 1 | -2/+4
2013-01-30 | HTML reader: Handle colgroup tag. | John MacFarlane | 1 | -1/+2
2013-01-12 | HTML reader: Added html5 tags to list of block-level tags. | John MacFarlane | 1 | -5/+8
2013-01-09 | Added Attr field to Header. | John MacFarlane | 1 | -2/+4
Previously header ids were autogenerated by the writers. Now they are generated (unless supplied explicitly) in the markdown parser, if the `header_identifiers` extension is selected. In addition, the textile reader now supports id attributes on headers.
2012-09-15 | HTML reader: Modified htmlTag for fewer false positives. | John MacFarlane | 1 | -1/+1
A tag must start with `<` followed by `!`, `?`, `/`, or a letter. This makes it more useful in the MediaWiki and markdown parsers.
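The rule above can be expressed as a small predicate. This is a Python sketch, not the Haskell `htmlTag` parser; the name `looks_like_tag_start` is hypothetical, and restricting "a letter" to ASCII letters is an assumption.

```python
import string


def looks_like_tag_start(text: str, pos: int = 0) -> bool:
    """True if text[pos:] starts with `<` followed by `!`, `?`, `/`, or a letter."""
    if pos + 1 >= len(text) or text[pos] != "<":
        return False
    nxt = text[pos + 1]
    # ASSUMPTION: "a letter" means an ASCII letter.
    return nxt in "!?/" or nxt in string.ascii_letters
```

With this check, text such as `x < 3` or `<3` is rejected before any tag parsing starts, which is the kind of false positive the commit aims to avoid.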
2012-09-13 | MediaWiki reader: Use MWState instead of ParserState. | John MacFarlane | 1 | -1/+1
2012-09-09 | HTML reader: Handle nested `<q>` tags properly. | John MacFarlane | 1 | -1/+9
2012-09-09 | HTML reader: Parse <q> as Quoted DoubleQuote. | John MacFarlane | 1 | -0/+4
2012-08-15 | Moved renderTags' from HTML reader & SelfContained to Shared. | John MacFarlane | 1 | -13/+1
Improved removal of markdown="1" attribute in Markdown reader.
2012-07-26 | Fixed whitespace errors. | John MacFarlane | 1 | -5/+5
2012-07-26 | Use readerExtensions instead of readerStrict in readers. | John MacFarlane | 1 | -26/+19
Test individually for the extensions.
2012-07-25 | Changed reader parameters from ParserState to ReaderOptions. | John MacFarlane | 1 | -3/+3
2012-07-25 | Moved ParseRaw from ParserState to ReaderOptions. | John MacFarlane | 1 | -4/+4
2012-07-25 | Options -> ReaderOptions. | John MacFarlane | 1 | -2/+2
Better to keep reader and writer options separate.
2012-07-25 | Put smart, strict in separate options field in state. | John MacFarlane | 1 | -2/+3
This is the beginning of a larger transition that will make Options, not ParserState, the parameter of the read functions. (Options will also be used in writers, in place of WriterOptions.) Next step is to remove strict, replacing it with granular tests for different extensions.
2012-07-24 | HTML reader: Fixed bug in htmlBalanced. | John MacFarlane | 1 | -2/+1
This caused hangs in parsing certain markdown input using --strict.
2012-07-20 | Use Parser as type synonym for Parsec. | John MacFarlane | 1 | -8/+8
2012-07-20 | Text.Pandoc.Parsing: Export all Parsec functions used in pandoc code. | John MacFarlane | 1 | -2/+0
No other module directly imports Parsec. This will make it easier to change the parsing backend in the future, if we want to.
2012-07-20 | Use Text.Parsec instead of Text.ParserCombinators.Parsec. | John MacFarlane | 1 | -12/+12
2012-04-29 | HTML reader: Support `<col>` and `<caption>` in tables. | John MacFarlane | 1 | -1/+3
Closes #486.
2012-04-28 | HTML reader: Don't skip nonbreaking spaces. | John MacFarlane | 1 | -1/+7
Previously a paragraph containing just `&nbsp;` would be rendered as an empty paragraph. Thanks to Paul Vorbach for pointing out the bug.
2012-02-17 | Don't escape `<` in `<style>` tags with `--self-contained`. | John MacFarlane | 1 | -2/+10
Closes #422: highlighting lost using `--self-contained`.
2012-01-12 | Added "title" to list of docbook block-level tags. | John MacFarlane | 1 | -1/+1
2011-12-29 | Better smart quote parsing. | John MacFarlane | 1 | -2/+6
* Added stateLastStrPos to ParserState. This lets us keep track of whether we're parsing the position immediately after a 'str'. If we encounter a ' in such a location, it must be an apostrophe, and can't be a single quote start.
* Set this in the markdown, textile, html, and rst str parsers.
* Closes #360.
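The idea behind `stateLastStrPos` can be illustrated with a simplified scanner. This is a Python sketch under stated assumptions: a "str" is approximated as a run of alphanumeric characters, and `single_quote_candidates` is a hypothetical name, not part of pandoc.

```python
from typing import List, Tuple


def single_quote_candidates(text: str) -> List[Tuple[int, bool]]:
    """For each ' in text, report (index, may_start_quote).

    A ' that sits immediately after a str cannot open a single quote,
    so it is reported with may_start_quote == False.
    """
    results = []
    last_str_end = None  # position just past the most recent str character
    for i, ch in enumerate(text):
        if ch.isalnum():  # ASSUMPTION: alphanumerics stand in for 'str' tokens
            last_str_end = i + 1
        elif ch == "'":
            results.append((i, last_str_end != i))
    return results
```

For `don't` the only quote follows a str and is ruled out as a quote start, while in `say 'hi'` the first quote follows a space and remains a candidate.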
2011-10-25 | HTML reader now recognizes DocBook block and inline tags. | John MacFarlane | 1 | -5/+24
It was always possible to include raw DocBook tags in a markdown document, but now pandoc will be able to distinguish block from inline tags and behave accordingly. Thus, for example, <sidebar> hello </sidebar> will not be wrapped in `<para>` tags.
2011-08-01 | HTML reader: Fixed bug parsing tables with both thead and tbody. | John MacFarlane | 1 | -0/+1
See bug #274, which was not completely fixed by the last patch.
2011-07-23 | Properly handle characters in the 128..159 range. | John MacFarlane | 1 | -2/+41
These aren't valid in HTML, but many HTML files produced by Windows tools contain them. We substitute correct Unicode characters.
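One way to picture the substitution is to reinterpret each code point in 128..159 as a Windows-1252 byte. The commit message does not name the mapping, so treating it as Windows-1252 is an assumption of this Python sketch, and `fix_c1_chars` is a hypothetical name.

```python
def fix_c1_chars(text: str) -> str:
    """Replace code points 128..159 with their Windows-1252 equivalents."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 128 <= cp <= 159:
            try:
                # ASSUMPTION: the intended character is the cp1252 one.
                ch = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few bytes (e.g. 0x81) are undefined; keep as-is
        out.append(ch)
    return "".join(out)
```

Characters outside the range pass through untouched, so the function is safe to run over text that is already correct.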
2011-07-16 | HTML reader: treat Plain as Para when needed. | John MacFarlane | 1 | -9/+12
For example, in

    Just a few glitches remaining.
    <ul><li>
    In this situation, one loses the list.
    </ul>
    And in this, the preformatting.
    <pre>Preformatted text not starting with its own blank line.
    </pre>

Thanks to Dirk Laurie for noticing the issue.
2011-07-15 | HTML reader: Handle tbody, thead in simple tables. | John MacFarlane | 1 | -7/+17
Closes #274.
2011-07-10 | Make HTML reader more forgiving of bad HTML. | John MacFarlane | 1 | -4/+16
* Skip spaces after <b>, <emph>, etc.
* Convert Plain elements into Para when they're in a list item with Para, Pre, BlockQuote, CodeBlock.

An example of HTML that pandoc handles better now:

~~~~
<h4> Testing html to markdown </h4>
<ul>
<li>
<b> An item in a list </b>
<p> An introductory sentence.
<pre>
Some preformatted text at this stage comes next. But alas! much havoc is wrought by Pandoc.
</pre>
</ul>
~~~~

Thanks to Dirk Laurie for reporting the issues.
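The Plain-to-Para rule from the second bullet can be sketched on a toy block representation. This is Python, not pandoc's Haskell AST: blocks are modelled as hypothetical `(kind, content)` tuples, and the kind names are taken from the commit message.

```python
from typing import List, Tuple

Block = Tuple[str, str]  # (kind, content): a toy stand-in for pandoc blocks

# Kinds named in the commit message that force Plain to become Para.
PARAGRAPH_LIKE = {"Para", "Pre", "BlockQuote", "CodeBlock"}


def fix_list_item(blocks: List[Block]) -> List[Block]:
    """Convert Plain to Para when the item also holds paragraph-like blocks."""
    if any(kind in PARAGRAPH_LIKE for kind, _ in blocks):
        return [("Para", c) if kind == "Plain" else (kind, c) for kind, c in blocks]
    return list(blocks)
```

An item that contains only Plain content is left alone, so tight lists keep rendering tightly.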
2011-01-26 | Add support for attributes in inline Code. | John MacFarlane | 1 | -2/+6
Additional related changes:
* URLs in Code in autolinks now use class "url".
* Require highlighting-kate 0.2.8.2, which omits the final <br/> tag, essential for inline code.
2011-01-26 | Bumped version to 1.8; depend on pandoc-types 1.8. | John MacFarlane | 1 | -2/+2
The old TeX, HtmlInline and RawHtml elements have been removed and replaced by generic RawInline and RawBlock elements. All modules updated to use the new raw elements.
2011-01-14 | HTML reader: parse simple tables. | John MacFarlane | 1 | -2/+22
Resolves Issue #106. Thanks to Rodja Trappe for the idea and some sample code.
2011-01-14 | HTML reader: parse location tags in pSatisfy. | John MacFarlane | 1 | -13/+17
This avoids the need for manual parsing all over the place.
2011-01-06 | HTML reader: Fixed bug in htmlTag for comments. | John MacFarlane | 1 | -2/+9
2010-12-30 | HTML reader: Fixed some parsing bugs. | John MacFarlane | 1 | -22/+28
2010-12-30 | New HTML reader using tagsoup as a lexer. | John MacFarlane | 1 | -582/+379
* The new reader is faster and more accurate.
* API changes for Text.Pandoc.Readers.HTML:
  - removed rawHtmlBlock, anyHtmlBlockTag, anyHtmlInlineTag, anyHtmlTag, anyHtmlEndTag, htmlEndTag, extractTagType, htmlBlockElement, htmlComment
  - added htmlTag, htmlInBalanced, isInlineTag, isBlockTag, isTextTag
* tagsoup is a new dependency.
* Text.Pandoc.Parsing: Generalized type on readWith.
* Benchmark.hs: Added length calculation to force full evaluation.
* Updated HTML reader tests.
* Updated markdown and textile readers to use the functions from the HTML reader.
* Note: The markdown reader now correctly handles some cases it did not before. For example: `<hr/>` is reproduced without adding a space. `<script> a = '<b>'; </script>` is parsed correctly.
2010-12-22 | HTML reader: Simplified parsing of <script> sections. | John MacFarlane | 1 | -24/+1
I had previously assumed that we needed to ignore </script> occurring in a string literal or javascript comment. It turns out, though, that browsers aren't that smart.
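The simplified behaviour amounts to ending the script at the first closing tag, whatever JavaScript context it appears in. The following Python sketch shows that rule; `script_body` is a hypothetical helper and the case-insensitive match is an assumption, not a statement about pandoc's code.

```python
import re


def script_body(text_after_open_tag: str) -> str:
    """Return the script content up to the first `</script`, as browsers do.

    The input is assumed to start right after the opening <script> tag.
    """
    match = re.search(r"</script", text_after_open_tag, re.IGNORECASE)
    if match is None:
        return text_after_open_tag  # unterminated script: take everything
    return text_after_open_tag[: match.start()]
```

Note that a `</script>` inside a string literal ends the script early, which matches the browser behaviour the commit describes.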
2010-12-22 | Made --smart work with HTML reader. | John MacFarlane | 1 | -4/+13
It did not work before, because - and quotes were gobbled up by the str parser.
2010-12-15 | HTML reader: allow : in tags. | John MacFarlane | 1 | -2/+6
Resolves Issue #274.
2010-12-10 | Removed HTML sanitization. | John MacFarlane | 1 | -90/+5
This is better done on the resulting HTML; use the xss-sanitize library for this. xss-sanitize is based on pandoc's sanitization, but improves it.
- Removed stateSanitize from ParserState.
- Removed --sanitize-html option.
2010-12-07 | Make --smart work in HTML reader. | John MacFarlane | 1 | -2/+3