diff options
-rw-r--r-- | README | 14 | ||||
-rw-r--r-- | changelog | 165 | ||||
-rw-r--r-- | src/Text/Pandoc.hs | 12 | ||||
-rw-r--r-- | src/Text/Pandoc/Options.hs | 4 | ||||
-rw-r--r-- | src/Text/Pandoc/Readers/EPUB.hs | 16 | ||||
-rw-r--r-- | src/Text/Pandoc/Readers/HTML.hs | 5 | ||||
-rw-r--r-- | src/Text/Pandoc/Readers/Markdown.hs | 4 | ||||
-rw-r--r-- | src/Text/Pandoc/Writers/Markdown.hs | 5 | ||||
-rw-r--r-- | tests/writer.plain | 30 |
9 files changed, 216 insertions, 39 deletions
@@ -2238,6 +2238,20 @@ markdown with HTML block elements. For example, one can surround a block of markdown text with `<div>` tags without preventing it from being interpreted as markdown. +#### Extension: `native_divs` #### + +Use native pandoc `Div` blocks for content inside `<div>` tags. +For the most part this should give the same output as +`markdown_in_html_blocks`, but it makes it easier to write pandoc +filters to manipulate groups of blocks. + +#### Extension: `native_spans` #### + +Use native pandoc `Span` blocks for content inside `<span>` tags. +For the most part this should give the same output as `raw_html`, +but it makes it easier to write pandoc filters to manipulate groups +of inlines. + Raw TeX ------- @@ -1,9 +1,38 @@ pandoc (1.13) - * Added Docx reader (API change) (Jesse Rosenthal). + * Added Docx reader (API change) (Jesse Rosenthal). Includes conversion + of Word equations to latex in pandoc `Math` elements. Note that metadata + is taken from paragraphs at the beginning styled with `Author`, + `Title`, `Subtitle`, `Date`, and `Abstract`. + + * Added EPUB reader (API change) (Matthew Pickering). Includes conversion + of MathML math to latex in pandoc `Math` elements. + + * Added txt2tags reader (Matthew Pickering). Txt2tags is a lightweight + markup format described at <http://txt2tags.org/>. + + * Added `--extract-media` option to extract media contained in a zip + container (docx or epub) while adjusting image paths to point to the + extracted images. + + * Added module `Text.Pandoc.MediaBag`, exporting `MediaBag`, `lookupMedia`, + `insertMedia`, `mediaDirectory`, `extractMediaBag`. The docx and epub + readers return a pair of a `Pandoc` document and a `MediaBag` with + the media resources they contain. This can be extracted using + `--extract-media`. Writers that incorporate media (PDF, Docx, + ODT, EPUB, RTF, or HTML formats with `--self-contained`) will look + for resources in the `MediaBag` generated by the reader, in addition to + the file system or web. * Markdown reader: + + Changed behavior of the `markdown_attribute` extension, to bring + it in line with PHP markdown extra and multimarkdown. Setting + `markdown="1"` on an outer tag affects all contained tags, + recursively, until it is reversed with `markdown="0"` (#1378). + + Fixed small bug in HTML parsing with `markdown_attribute`, which + caused incorrect tag nesting for input like + `<aside markdown="1">*hi*</aside>`. + Fixed regression with intraword underscores (#1121). + Slight rewrite of `enclosure`/`emphOrStrong` code. + Revamped raw HTML block parsing in markdown (#1330). @@ -58,6 +87,12 @@ pandoc (1.13) * HTML reader: adjust `blockTags` and `eitherBlockOrInline`. + + Parse `div` and `span` elements even without `--parse-raw`. + Motivation: these now generate native pandoc Div and Span + elements, not raw HTML. + + Parse EPUB-specific elements if the `epub_html_exts` + extension is enabled. These include `switch`, `footnote`, + `rearnote`, `noteref`. + Added `audio` and `source` in `eitherBlockOrInline`. + Moved `video`, `svg`, `progress`, `script`, `noscript`, `svg` from `blockTags` to `eitherBlockOrInline`. @@ -69,6 +104,7 @@ pandoc (1.13) We let a `</table>` tag close an open `<tr>` or `<td>` (#1167). + Support `--trace`. + Allow space between `<col>` and `</col>`. + + Ignore `DOCTYPE` and `xml` declarations. * Rewrote Haddock reader to use `haddock-library` (#1346). @@ -79,13 +115,18 @@ pandoc (1.13) * MediaWiki reader: + + Don't parse backslash escapes inside `<source>` (#1445). + Tightened up template parsing. The opening `{{` must be followed by an alphanumeric or `:`. This prevents the exponential slowdown in #1033. + Support `--trace`. + + Support "Bild" for images. * DocBook reader: + + Better handle elements inside code environments. Pandoc's document + model does not allow structure inside code blocks, but at least this way + we preserve the text (#1449). + Support `<?asciidoc-br?>` (#1236). Note, this is a bit of a kludge, to work around the fact that xml-light doesn't parse `<?asciidoc-br?>` correctly. We preprocess the input, @@ -121,6 +162,8 @@ pandoc (1.13) * EPUB writer: + + Avoid excess whitespace in `nav.xhtml`. This should improve + TOC view in iBooks (#1392). + Fixed regression on cover image. In 1.12.4 and 1.12.4.2, the cover image would not appear properly, because the metadata id was not correct. Now we derive the id from the @@ -138,6 +181,7 @@ pandoc (1.13) * LaTeX writer: + + Use `\(..\)` instead of `$..$` for inline math (#1464). + Use `\nolinkurl` in email autolinks. This allows them to be styled using `\urlstyle{tt}`. Thanks to Ulrike Fischer for the solution. + Use `\textquotesingle` for `'` in inline code. Otherwise we get @@ -149,7 +193,11 @@ pandoc (1.13) + Put table captions above tables, to match the conventional standard. (Previously they appeared below tables.) - * RTF writer: Avoid extra paragraph tags in metadata (#1421). + * RTF writer: + + + Improved image embedding: `fetchItem'` is now used to get the + images, and calculated image sizes are indicated in the RTF. + + Avoid extra paragraph tags in metadata (#1421). * HTML writer: @@ -166,15 +214,38 @@ pandoc (1.13) * Markdown writer: + + Updated definition lists, so they match the current markdown reader's + behavior. (Thus, they match PHP markdown extra, and a blank line is + always used between items.) The old behavior + can be activated with the `compact_definition_lists` extension. + + Horizontal rules are now a line across the whole page. + + Avoid wrapping that might start a list, blockquote, or header (#1013). + Use span with style for `SmallCaps` (#1360). + Use Span instead of (hackish) `SmallCaps` in `plainify`. + Don't use braced attributes for fenced code (#1416). If `Ext_fenced_code_attributes` is not set, the first class attribute will be printed after the opening fence as a bare word. + Prettier pipe tables. Columns are now aligned (#1323). + + Respect the `raw_html` extension. `pandoc -t markdown-raw_html` + no longer emits any raw HTML, including span and div tags + generated by Span and Div elements. + + Separate adjacent lists of the same kind with an HTML comment (#1458). + + * Plain writer: Revised output, largely following the style of Project + Gutenberg. + + + Emphasis is rendered with `_underscores_`, strong emphasis + with ALL CAPS. + + Headings are rendered differently, with space to set them off, + not with setext style underlines. Level 1 headers are ALL CAPS. + + Math is rendered using unicode when possible, but without the + distracting emphasis markers around variables. + + Footnotes use a regular `[n]` style. + + Added tests. * PDF writer: + + Moved `withTempDir` to `Text.Pandoc.Shared`, which now exports it. + Fixed treatment of data uris for images (#1062). * Docx writer: @@ -197,6 +268,14 @@ pandoc (1.13) + Section numbering carries over from reference.docx (#1305). + Simplified `abstractNumId` numbering. Instead of sequential numbering, we assign numbers based on the list marker styles. + + Include abstract (if present) with `Abstract` style (#1451). + + Include subtitle (if present) with `Subtitle` style, rather + than tacking it on to the title (#1451). + + * Org writer: + + + Write empty span elements with an id attribute as org anchors. + For example `Span ("uid",[],[]) []` becomes `<<uid>>`. * Custom lua writers: @@ -215,14 +294,30 @@ pandoc (1.13) * `Text.Pandoc.Highlighting`: Let `.numberLines` work even if no language is given (#1287, jgm/highlighting-kate#40). + * `Text.Pandoc.Pretty`: Added `blanklines`, which guarantees a certain + number of blank lines (and no more). * `Text.Pandoc.Shared`: + + Added `fetchItem'`, which works like `fetchItem` but searches + a `MediaBag` before looking on the net or file system. + + Added `withTempDir` (API change). + `fetchItem`: unescape URI encoding before reading local file (#1427). + + `fetchItem`: strip a fragment like `?#iefix` from the extension before + doing mime lookup, to improve mime type guessing. + + Improved log of `fetchItem`: absolute URIs are fetched from the net; + other things are treated as relative URIs if `sourceURL` is `Just _`, + otherwise as file paths on the local file system. + + `fetchItem` now properly handles links without a protocol (#1477). + + `fetchItem` now escapes characters not allowed in URIs before trying + to parse the URIs. + Added `removeFormatting` (API change). + Added `extractSpaces` (from HTML reader) and generalized its type so that it can be used by the docx reader (Matthew Pickering). + Added `ordNub` (API change). + + Fixed runtime error with `compactify'DL` on certain lists (#1452). + + Added `capitalize` (Artyom Kazak), and replaced uses of + `map toUpper` (which give bad results for many languages). * `Text.Pandoc.Templates`: @@ -235,18 +330,41 @@ pandoc (1.13) * `Text.Pandoc.Options`: + + Added `writerMediaBag` to `WriterOptions` (API change). + + Removed deprecated and no longer used `readerStrict` in + `ReaderOptions`. This is handled by `readerExtensions` now + (API change). + Removed `Ext_fenced_code_attributes` from `markdown_github` - extensions. - + Added `Ext_compact_definition_lists`. + extensions (API change)`. + + Added `Ext_compact_definition_lists` (API change). + + Added `Ext_epub_html_exts` (API change). * `Text.Pandoc.Parsing` + + Generalized `readWith` to `readWithM` (Matthew Pickering) (API change). + + Export `runParserT` and `Stream` (Matthew Pickering). + + Added `HasQuoteContext` type class (Matthew Pickering). + + Generalized types of `mathInline`, `smartPunctuation`, `quoted`, + `singleQuoted`, `doubleQuoted`, `failIfInQuoteContext`, + `applyMacros` (Matthew Pickering). + + Added custom `token` (Matthew Pickering). + Simplified `dash` and `ellipsis` (#1419). + Removed `(>>~)` in favor of the equivalent `(<*)` (Matthew Pickering). + Generalized functions to use `ParsecT` (Matthew Pickering). + Added `stateInHtmlBlock` to `ParserState`. This is used to keep track of the ending tag we're waiting for when we're parsing inside HTML block tags. (API change.) + + Added `stateMarkdownAttribute` to `ParserState`. This is used + to keep track of whether the markdown attribute has been set in + an enclosing tag. (API change.) + + Added `isbn` and `pmid` to list of recognized schemes (Matthew + Pickering). + + * Added `Text.Pandoc.Compat.Directory` to allow building against + different versions of the `directory` library. + + + Added `Text.Pandoc.Compat.Except` to allow building against + different verions of `mtl`. * Templates: @@ -255,6 +373,10 @@ pandoc (1.13) be "advisory" in HTML5, but kindlegen seems to require them. + LaTeX: Made `\subtitle` work properly (#1327). + LaTeX/Beamer: remove conditional around date (#1321). + + LaTeX: Added `lot` and `lof` variables, which can be set to + get `\listoftables` and `\listoffigures` (#1407). Note that + these variables can be set at the command line with `-Vlot -Vlof` + or in YAML metadata. * Code cleanup in some writers, using Reader monad to avoid passing options parameter around (Matej Kollar). @@ -280,8 +402,13 @@ pandoc (1.13) * Avoid `import Prelude hiding (catch)` (#1309, thanks to Michael Thompson). + * Don't strip path off of `writerSourceURL`: the path is needed to + resolve relative URLs when we fetch resources (#750). + * README: + + Made headers for all extensions so they have IDs and can be linked + to (Beni Cherniavsky-Paskin). + Fixed typos (Phillip Alday). + Fixed documentation of attributes (#1315). + Clarified documentation on small caps (#1360). @@ -300,10 +427,24 @@ pandoc (1.13) ocaml and fsharp. * Require latest `texmath`. This fixes `\tilde{E}` and allows - `\left` to be used with `]`, `)` etc. (#1319). + `\left` to be used with `]`, `)` etc. (#1319), among many other + improvements. + + * Improved readability in `pandoc.hs`. + + * Miscellaneous code cleanups (Artyom Kazak). * Require latest `zip-archive`. This has fixes for unicode path names. + * Benchmarks: + + + Made benchmarks compile again (Artyom Kazak). + + Fixed so that the failure of one benchmark does not prevent others + from running (Artyom Kazak). + + Use `nfIO` instead of the `getLength` trick to force full evaluation. + + Changed benchmark to use only the test suite, so that benchmarks + run more quickly. + * Windows build script: + Add `-windows` to file name. @@ -425,6 +566,10 @@ pandoc (1.12.4) + Implemented correct parsing rules for inline markup (#1175, Matthew Pickering). + Use Builder (Matthew Pickering). + + Fixed list parsing bug (#1500). + + Don't allow inline formatting to extend over newlines. + This matches the behavior of RedCarpet, avoids some ugly bugs, + and improves performance. * DocBook reader: @@ -570,8 +715,14 @@ pandoc (1.12.4) in your template, the word `true` will appear, which may be unexpected. (Previously nothing would appear.) - * `Text.Pandoc.SelfContained`: Handle `poster` attribute in `video` - tags (#1188). + * `Text.Pandoc.SelfContained`: + + + `mkSelfContained` now takes just two arguments, `WriterOptions` and + the string. + * It no longer looks in data files. This only made sense when we + had copies of slidy and S5 code there. + * `fetchItem'` is used instead of the nearly duplicate `getItem`. + + Handle `poster` attribute in `video` tags (#1188). * `Text.Pandoc.Parsing`: diff --git a/src/Text/Pandoc.hs b/src/Text/Pandoc.hs index 589a6af98..c7c64f0fc 100644 --- a/src/Text/Pandoc.hs +++ b/src/Text/Pandoc.hs @@ -304,7 +304,17 @@ getDefaultExtensions "markdown_github" = githubMarkdownExtensions getDefaultExtensions "markdown" = pandocExtensions getDefaultExtensions "plain" = pandocExtensions getDefaultExtensions "org" = Set.fromList [Ext_citations] -getDefaultExtensions "textile" = Set.fromList [Ext_auto_identifiers, Ext_raw_tex] +getDefaultExtensions "textile" = Set.fromList [Ext_auto_identifiers, + Ext_raw_tex] +getDefaultExtensions "html" = Set.fromList [Ext_auto_identifiers, + Ext_native_divs, + Ext_native_spans] +getDefaultExtensions "html5" = getDefaultExtensions "html" +getDefaultExtensions "epub" = Set.fromList [Ext_auto_identifiers, + Ext_raw_html, + Ext_native_divs, + Ext_native_spans, + Ext_epub_html_exts] getDefaultExtensions _ = Set.fromList [Ext_auto_identifiers] -- | Retrieve reader based on formatSpec (format+extensions). diff --git a/src/Text/Pandoc/Options.hs b/src/Text/Pandoc/Options.hs index bb213bac0..84ccbbdc9 100644 --- a/src/Text/Pandoc/Options.hs +++ b/src/Text/Pandoc/Options.hs @@ -77,6 +77,8 @@ data Extension = | Ext_backtick_code_blocks -- ^ Github style ``` code blocks | Ext_inline_code_attributes -- ^ Allow attributes on inline code | Ext_markdown_in_html_blocks -- ^ Interpret as markdown inside HTML blocks + | Ext_native_divs -- ^ Use Div blocks for contents of <div> tags + | Ext_native_spans -- ^ Use Span inlines for contents of <span> | Ext_markdown_attribute -- ^ Interpret text inside HTML as markdown -- iff container has attribute 'markdown' | Ext_escaped_line_breaks -- ^ Treat a backslash at EOL as linebreak @@ -131,6 +133,8 @@ pandocExtensions = Set.fromList , Ext_backtick_code_blocks , Ext_inline_code_attributes , Ext_markdown_in_html_blocks + , Ext_native_divs + , Ext_native_spans , Ext_escaped_line_breaks , Ext_fancy_lists , Ext_startnum diff --git a/src/Text/Pandoc/Readers/EPUB.hs b/src/Text/Pandoc/Readers/EPUB.hs index 8073f9ad2..b6b271488 100644 --- a/src/Text/Pandoc/Readers/EPUB.hs +++ b/src/Text/Pandoc/Readers/EPUB.hs @@ -13,8 +13,7 @@ import Text.Pandoc.Definition hiding (Attr) import Text.Pandoc.Walk (walk, query) import Text.Pandoc.Generic(bottomUp) import Text.Pandoc.Readers.HTML (readHtml) -import Text.Pandoc.Options ( ReaderOptions(..), readerExtensions, Extension(..) - , readerTrace) +import Text.Pandoc.Options ( ReaderOptions(..), readerTrace) import Text.Pandoc.Shared (escapeURI, collapseFilePath) import Text.Pandoc.MediaBag (MediaBag, insertMedia) import Text.Pandoc.Compat.Except (MonadError, throwError, runExcept, Except) @@ -32,7 +31,6 @@ import Data.Monoid (mempty, (<>)) import Data.List (isPrefixOf, isInfixOf) import Data.Maybe (mapMaybe, fromMaybe) import qualified Data.Map as M (Map, lookup, fromList, elems) -import qualified Data.Set as S (insert) import Control.DeepSeq.Generics (deepseq, NFData) import Debug.Trace (trace) @@ -51,7 +49,7 @@ runEPUB = either error id . runExcept -- are of the form "filename#id" -- archiveToEPUB :: (MonadError String m) => ReaderOptions -> Archive -> m (Pandoc, MediaBag) -archiveToEPUB (setEPUBOptions -> os) archive = do +archiveToEPUB os archive = do -- root is path to folder with manifest file in (root, content) <- getManifest archive meta <- parseMeta content @@ -67,6 +65,7 @@ archiveToEPUB (setEPUBOptions -> os) archive = do let mediaBag = fetchImages (M.elems items) archive ast return $ (ast, mediaBag) where + os' = os {readerParseRaw = True} parseSpineElem :: MonadError String m => FilePath -> (FilePath, MIME) -> m Pandoc parseSpineElem (normalise -> r) (normalise -> path, mime) = do when (readerTrace os) (traceM path) @@ -77,20 +76,13 @@ archiveToEPUB (setEPUBOptions -> os) archive = do mimeToReader "application/xhtml+xml" (normalise -> path) = do fname <- findEntryByPathE path archive return $ fixInternalReferences path . - readHtml os . + readHtml os' . UTF8.toStringLazy $ fromEntry fname mimeToReader s path | s `elem` imageMimes = return $ imageToPandoc path | otherwise = return $ mempty -setEPUBOptions :: ReaderOptions -> ReaderOptions -setEPUBOptions os = os'' - where - rs = readerExtensions os - os' = os {readerExtensions = foldr S.insert rs [Ext_epub_html_exts]} - os'' = os' {readerParseRaw = True} - -- paths should be absolute when this function is called -- renameImages should do this fetchImages :: [(FilePath, MIME)] diff --git a/src/Text/Pandoc/Readers/HTML.hs b/src/Text/Pandoc/Readers/HTML.hs index 3d988cd80..1789b865f 100644 --- a/src/Text/Pandoc/Readers/HTML.hs +++ b/src/Text/Pandoc/Readers/HTML.hs @@ -45,7 +45,8 @@ import Text.Pandoc.Builder (Blocks, Inlines, trimInlines, HasMeta(..)) import Text.Pandoc.Shared ( extractSpaces, renderTags' , escapeURI, safeRead ) import Text.Pandoc.Options (ReaderOptions(readerParseRaw, readerTrace) - , Extension (Ext_epub_html_exts)) + , Extension (Ext_epub_html_exts, + Ext_native_divs, Ext_native_spans)) import Text.Pandoc.Parsing hiding ((<|>)) import Text.Pandoc.Walk import Data.Maybe ( fromMaybe, isJust) @@ -296,6 +297,7 @@ pRawTag = do pDiv :: TagParser Blocks pDiv = try $ do + guardEnabled Ext_native_divs TagOpen _ attr <- lookAhead $ pSatisfy $ tagOpen (=="div") (const True) contents <- pInTags "div" block return $ B.divWith (mkAttr attr) contents @@ -560,6 +562,7 @@ pCode = try $ do pSpan :: TagParser Inlines pSpan = try $ do + guardEnabled Ext_native_spans TagOpen _ attr <- lookAhead $ pSatisfy $ tagOpen (=="span") (const True) contents <- pInTags "span" inline return $ B.spanWith (mkAttr attr) contents diff --git a/src/Text/Pandoc/Readers/Markdown.hs b/src/Text/Pandoc/Readers/Markdown.hs index 04b3fa684..861f81b23 100644 --- a/src/Text/Pandoc/Readers/Markdown.hs +++ b/src/Text/Pandoc/Readers/Markdown.hs @@ -1764,7 +1764,7 @@ inBrackets parser = do spanHtml :: MarkdownParser (F Inlines) spanHtml = try $ do - guardEnabled Ext_markdown_in_html_blocks + guardEnabled Ext_native_spans (TagOpen _ attrs, _) <- htmlTag (~== TagOpen "span" []) contents <- mconcat <$> manyTill inline (htmlTag (~== TagClose "span")) let ident = fromMaybe "" $ lookup "id" attrs @@ -1779,7 +1779,7 @@ spanHtml = try $ do divHtml :: MarkdownParser (F Blocks) divHtml = try $ do - guardEnabled Ext_markdown_in_html_blocks + guardEnabled Ext_native_divs (TagOpen _ attrs, rawtag) <- htmlTag (~== TagOpen "div" []) -- we set stateInHtmlBlock so that closing tags that can be either block or -- inline will not be parsed as inline tags diff --git a/src/Text/Pandoc/Writers/Markdown.hs b/src/Text/Pandoc/Writers/Markdown.hs index 5c0476b7d..95d4db29b 100644 --- a/src/Text/Pandoc/Writers/Markdown.hs +++ b/src/Text/Pandoc/Writers/Markdown.hs @@ -357,7 +357,10 @@ blockToMarkdown opts (Header level attr inlines) = do _ | isEnabled Ext_header_attributes opts -> space <> attrsToMarkdown attr | otherwise -> empty - contents <- inlineListToMarkdown opts inlines + contents <- inlineListToMarkdown opts $ + if level == 1 && plain + then capitalize inlines + else inlines let setext = writerSetextHeaders opts return $ nowrap $ case level of diff --git a/tests/writer.plain b/tests/writer.plain index 68f563b4a..fab0489ac 100644 --- a/tests/writer.plain +++ b/tests/writer.plain @@ -9,7 +9,7 @@ markdown test suite. -Headers +HEADERS Level 2 with an embedded link @@ -22,7 +22,7 @@ Level 5 -Level 1 +LEVEL 1 Level 2 with _emphasis_ @@ -40,7 +40,7 @@ with no blank line -Paragraphs +PARAGRAPHS Here’s a regular paragraph. @@ -58,7 +58,7 @@ here. -Block Quotes +BLOCK QUOTES E-mail style: @@ -90,7 +90,7 @@ And a following paragraph. -Code Blocks +CODE BLOCKS Code: @@ -113,7 +113,7 @@ And: -Lists +LISTS Unordered @@ -277,7 +277,7 @@ B. Williams -Definition Lists +DEFINITION LISTS Tight using spaces: @@ -373,7 +373,7 @@ orange -HTML Blocks +HTML BLOCKS Simple block on one line: @@ -428,7 +428,7 @@ Hr’s: -Inline Markup +INLINE MARKUP This is _emphasized_, and so _is this_. @@ -460,7 +460,7 @@ spaces: a^b c^d, a~b c~d. -Smart quotes, ellipses, dashes +SMART QUOTES, ELLIPSES, DASHES “Hello,” said the spider. “‘Shelob’ is my name.” @@ -483,7 +483,7 @@ Ellipses…and…and…. -LaTeX +LATEX - @@ -510,7 +510,7 @@ Here’s a LaTeX table: -Special Characters +SPECIAL CHARACTERS Here is some unicode: @@ -567,7 +567,7 @@ Minus: - -Links +LINKS Explicit @@ -649,7 +649,7 @@ Auto-links should not occur here: <http://example.com/> -Images +IMAGES From “Voyage dans la Lune” by Georges Melies (1902): @@ -662,7 +662,7 @@ Here is a movie [movie] icon. -Footnotes +FOOTNOTES Here is a footnote reference,[1] and another.[2] This should _not_ be a |