| author    | John MacFarlane <jgm@berkeley.edu>                     | 2017-09-16 23:00:20 -0700 |
|-----------|--------------------------------------------------------|---------------------------|
| committer | John MacFarlane <jgm@berkeley.edu>                     | 2017-09-16 23:00:20 -0700 |
| commit    | 91ab987a524a34b56b2763f639318bf6b800c09a (patch)       |                           |
| tree      | d2f6adbf59259f60e283c9f802fdeb78ca447c67               |                           |
| parent    | 9add71365489cf21c07221e86f5705c6494c1efb (diff)        |                           |
| download  | pandoc-91ab987a524a34b56b2763f639318bf6b800c09a.tar.gz |                           |

Removed customizing-pandoc.md from doc/, added filters.md.
filters.md is essentially the scripting tutorial from the
website.

| -rw-r--r-- | doc/customizing-pandoc.md   | 18  |
| -rw-r--r-- | doc/filters.md              | 469 |
| -rw-r--r-- | doc/using-the-pandoc-api.md | 21  |

3 files changed, 480 insertions, 28 deletions

diff --git a/doc/customizing-pandoc.md b/doc/customizing-pandoc.md
deleted file mode 100644
index 37b77cf1f..000000000
--- a/doc/customizing-pandoc.md
+++ /dev/null
@@ -1,18 +0,0 @@

# Customizing pandoc

## Templates

## Reference docx/odt

## Custom lua writers

## Custom syntax highlighting

syntax definitions, styles

## Filters

including documentation of the JSON serialization format and
AST definition

diff --git a/doc/filters.md b/doc/filters.md
new file mode 100644
index 000000000..0c9b77328
--- /dev/null
+++ b/doc/filters.md
@@ -0,0 +1,469 @@

% Pandoc filters
% John MacFarlane

# Summary

Pandoc provides an interface for users to write programs (known
as filters) which act on pandoc's AST.

Pandoc consists of a set of readers and writers. When converting
a document from one format to another, text is parsed by a
reader into pandoc's intermediate representation of the
document---an "abstract syntax tree" or AST---which is then
converted by the writer into the target format.
The pandoc AST format is defined in the module
`Text.Pandoc.Definition` in
[pandoc-types](https://hackage.haskell.org/package/pandoc-types).

A "filter" is a program that modifies the AST, between the
reader and the writer:

    INPUT --reader--> AST --filter--> AST --writer--> OUTPUT

Filters are "pipes" that read from standard input and write to
standard output. They consume and produce a JSON representation
of the pandoc AST. (In recent versions, this representation
includes a `pandoc-api-version` field which refers to a version
of `pandoc-types`.) Filters may be written in any programming
language. To use a filter, you need only specify it on the
command line using `--filter`, e.g.

    pandoc -s input.txt --filter pandoc-citeproc -o output.html

For a gentle introduction to writing your own filters, read on.
There's also a [list of third party filters on the
wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).

# A simple example

Suppose you wanted to replace all level 2+ headers in a markdown
document with regular paragraphs, with their text in italics.
How would you go about doing this?

A first thought would be to use regular expressions. Something
like this:

    perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt

This should work most of the time. But don't forget that
ATX-style headers can end with a sequence of `#`s that is not
part of the header text:

    ## My header ##

And what if your document contains a line starting with `##` in
an HTML comment or delimited code block?

    <!--
    ## This is just a comment
    -->

    ~~~~
    ### A third level header in standard markdown
    ~~~~

We don't want to touch *these* lines. Moreover, what about
setext-style second-level headers?

    A header
    --------

We need to handle those too. Finally, can we be sure that adding
asterisks to each side of our string will put it in italics?
What if the string already contains asterisks around it? Then
we'll end up with bold text, which is not what we want. And what
if it contains a regular unescaped asterisk?

How would you modify your regular expression to handle these
cases? It would be hairy, to say the least. What we need is a
real parser.

Well, pandoc has a real markdown parser, the library function
`readMarkdown`. This transforms markdown text to an abstract
syntax tree (AST) that represents the document structure.
Why not manipulate the AST directly in a short Haskell script,
then convert the result back to markdown using `writeMarkdown`?

First, let's see what this AST looks like. We can use pandoc's
`native` output format:

~~~~
% cat test.txt
## my header

text with *italics*
% pandoc -s -t native test.txt
Pandoc (Meta {unMeta = fromList []})
[Header 2 ("my-header",[],[]) [Str "my",Space,Str "header"]
,Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]]]
~~~~

A `Pandoc` document consists of a `Meta` block (containing
metadata like title, authors, and date) and a list of `Block`
elements. In this case, we have two `Block`s, a `Header` and a
`Para`. Each has as its content a list of `Inline` elements.
For more details on the pandoc AST, see the [haddock
documentation for `Text.Pandoc.Definition`].

[haddock documentation for `Text.Pandoc.Definition`]: http://hackage.haskell.org/package/pandoc-types

Here's a short Haskell script that reads markdown, changes level
2+ headers to regular paragraphs, and writes the result as
markdown. If you save it as `behead.hs`, you can run it using
`runhaskell behead.hs`. It will act like a unix pipe, reading
from `stdin` and writing to `stdout`. Or, if you want, you can
compile it, using `ghc --make behead`, then run the resulting
executable `behead`.

~~~~ {.haskell}
-- behead.hs
import Text.Pandoc
import Text.Pandoc.Walk (walk)

behead :: Block -> Block
behead (Header n _ xs) | n >= 2 = Para [Emph xs]
behead x = x

readDoc :: String -> Pandoc
readDoc s = readMarkdown def s
-- or, for pandoc 1.14 and greater, use:
-- readDoc s = case readMarkdown def s of
--                  Right doc -> doc
--                  Left err  -> error (show err)

writeDoc :: Pandoc -> String
writeDoc doc = writeMarkdown def doc

main :: IO ()
main = interact (writeDoc . walk behead . readDoc)
~~~~

The magic here is the `walk` function, which converts our
`behead` function (a function from `Block` to `Block`) into a
transformation on whole `Pandoc` documents. (See the [haddock
documentation for `Text.Pandoc.Walk`].)

[haddock documentation for `Text.Pandoc.Walk`]: http://hackage.haskell.org/package/pandoc-types

# Queries: listing URLs

We can use this same technique to do much more complex
transformations and queries. Here's how we could extract all the
URLs linked to in a markdown document (again, not an easy task
with regular expressions):

~~~~ {.haskell}
-- extracturls.hs
import Text.Pandoc
import Text.Pandoc.Walk (query)

extractURL :: Inline -> [String]
extractURL (Link _ _ (u,_)) = [u]
extractURL (Image _ _ (u,_)) = [u]
extractURL _ = []

extractURLs :: Pandoc -> [String]
extractURLs = query extractURL

readDoc :: String -> Pandoc
readDoc = readMarkdown def
-- or, for pandoc 1.14 and greater, use:
-- readDoc s = case readMarkdown def s of
--                  Right doc -> doc
--                  Left err  -> error (show err)

main :: IO ()
main = interact (unlines . extractURLs . readDoc)
~~~~

`query` is the query counterpart of `walk`: it lifts a function
that operates on `Inline` elements to one that operates on the
whole `Pandoc` AST. The results returned by applying
`extractURL` to each `Inline` element are concatenated in the
result.
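
Nothing restricts `query` to functions on `Inline` elements: it
works for any element type the AST can be walked over, including
`Block`. As a further illustration (not part of the original
tutorial; the file name and the idea of listing code-block
classes are just an example), here is a sketch in the same style
as `extracturls.hs`:

~~~~ {.haskell}
-- listclasses.hs (hypothetical example)
import Text.Pandoc
import Text.Pandoc.Walk (query)

-- Collect the classes of a single code block; every other kind
-- of block contributes nothing.
codeClasses :: Block -> [String]
codeClasses (CodeBlock (_, classes, _) _) = classes
codeClasses _ = []

-- Lift the per-block function to a query over the whole document.
listClasses :: Pandoc -> [String]
listClasses = query codeClasses

readDoc :: String -> Pandoc
readDoc = readMarkdown def
-- or, for pandoc 1.14 and greater, use:
-- readDoc s = case readMarkdown def s of
--                  Right doc -> doc
--                  Left err  -> error (show err)

main :: IO ()
main = interact (unlines . listClasses . readDoc)
~~~~

The only substantive change from `extracturls.hs` is the element
type the helper matches on; `query` handles the traversal in the
same way.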

# JSON filters

`behead.hs` is a very special-purpose program. It reads a
specific input format (markdown) and writes a specific output
format (markdown), with a specific set of options (here, the
defaults). But the basic operation it performs is one that would
be useful in many document transformations. It would be nice to
isolate the part of the program that transforms the pandoc AST,
leaving the rest to pandoc itself. What we want is a *filter*
that *just* operates on the AST---or rather, on a JSON
representation of the AST that pandoc can produce and consume:

    source format
         ↓
      (pandoc)
         ↓
    JSON-formatted AST
         ↓
      (filter)
         ↓
    JSON-formatted AST
         ↓
      (pandoc)
         ↓
    target format

The module `Text.Pandoc.JSON` contains a function `toJSONFilter`
that makes it easy to write such filters. Here is a filter
version of `behead.hs`:

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- behead2.hs
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter behead
  where behead (Header n _ xs) | n >= 2 = Para [Emph xs]
        behead x = x
~~~~

It can be used this way:

    pandoc -f SOURCEFORMAT -t json | runhaskell behead2.hs | \
      pandoc -f json -t TARGETFORMAT

But it is easier to use the `--filter` option with pandoc:

    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2.hs

Note that this approach requires that `behead2.hs` be
executable, so we must

    chmod +x behead2.hs

Alternatively, we could compile the filter:

    ghc --make behead2.hs
    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2

Note that if the filter is placed in the system PATH, then the
initial `./` is not needed. Note also that the command line can
include multiple instances of `--filter`: the filters will be
applied in sequence.
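
For instance, assuming the compiled `behead2` from above is in
the current directory, it could be chained with the
`pandoc-citeproc` filter mentioned earlier (the file names here
are only placeholders):

    pandoc input.txt --filter ./behead2 --filter pandoc-citeproc -o output.html

The filters run left to right, each one receiving the JSON AST
produced by the previous one.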

# LaTeX for WordPress

Another easy example. WordPress blogs require a special format
for LaTeX math. Instead of `$e=mc^2$`, you need: `$latex
e=mc^2$`. How can we convert a markdown document accordingly?

Again, it's difficult to do the job reliably with regexes. A `$`
might be a regular currency indicator, or it might occur in a
comment or code block or inline code span. We just want to find
the `$`s that begin LaTeX math. If only we had a parser...

We do. Pandoc already extracts LaTeX math, so:

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- wordpressify.hs
import Text.Pandoc.JSON

main = toJSONFilter wordpressify
  where wordpressify (Math x y) = Math x ("latex " ++ y)
        wordpressify x = x
~~~~

Mission accomplished. (I've omitted type signatures here, just
to show it can be done.)

# But I don't want to learn Haskell!

While it's easiest to write pandoc filters in Haskell, it is
fairly easy to write them in python using the `pandocfilters`
package. The package is in PyPI and can be installed using
`pip install pandocfilters` or `easy_install pandocfilters`.

Here's our "beheading" filter in python:

~~~ {.python}
#!/usr/bin/env python

"""
Pandoc filter to convert all level 2+ headers to paragraphs with
emphasized text.
"""

from pandocfilters import toJSONFilter, Emph, Para

def behead(key, value, format, meta):
    if key == 'Header' and value[0] >= 2:
        return Para([Emph(value[2])])

if __name__ == "__main__":
    toJSONFilter(behead)
~~~

`toJSONFilter(behead)` walks the AST and applies the `behead`
action to each element. If `behead` returns nothing, the node is
unchanged; if it returns an object, the node is replaced; if it
returns a list, the new list is spliced in.

Note that, although these parameters are not used in this
example, `format` provides access to the target format, and
`meta` provides access to the document's metadata.

There are many examples of python filters in [the pandocfilters
repository](http://github.com/jgm/pandocfilters).

For a more Pythonic alternative to pandocfilters, see the
[panflute](http://scorreia.com/software/panflute/) library.
Don't like Python? There are also ports of pandocfilters in
[PHP](https://github.com/vinai/pandocfilters-php),
[perl](https://metacpan.org/pod/Pandoc::Filter), and
[javascript/node.js](https://github.com/mvhenderson/pandoc-filter-node).

Starting with pandoc 2.0, pandoc includes built-in support for
writing filters in lua. The lua interpreter is built in to
pandoc, so a lua filter does not require any additional software
to run. See the [documentation on lua filters](lua-filters.html).

# Include files

So far, none of our transforms have involved IO. How about a
script that reads a markdown document, finds all the code blocks
with attribute `include`, and replaces their contents with the
contents of the file given?

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- includes.hs
import Text.Pandoc.JSON

doInclude :: Block -> IO Block
doInclude cb@(CodeBlock (id, classes, namevals) contents) =
  case lookup "include" namevals of
       Just f  -> return . (CodeBlock (id, classes, namevals)) =<< readFile f
       Nothing -> return cb
doInclude x = return x

main :: IO ()
main = toJSONFilter doInclude
~~~~

Try this on the following:

    Here's the pandoc README:

    ~~~~ {include="README"}
    this will be replaced by contents of README
    ~~~~

# Removing links

What if we want to remove every link from a document, retaining
the link's text?

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- delink.hs
import Text.Pandoc.JSON

main = toJSONFilter delink

delink :: Inline -> [Inline]
delink (Link _ txt _) = txt
delink x = [x]
~~~~

Note that `delink` can't be a function of type
`Inline -> Inline`, because the thing we want to replace the
link with is not a single `Inline` element, but a list of them.
So we make `delink` a function from an `Inline` element to a
list of `Inline` elements. `toJSONFilter` can still lift this
function to a transformation of type `Pandoc -> Pandoc`.
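
Returning a list also gives us a way to *delete* elements: an
empty list removes the element entirely. Here is a small sketch
along the same lines (the `private` class is just an invented
convention for this example, not something pandoc defines):

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- dropprivate.hs (hypothetical example)
import Text.Pandoc.JSON

-- Returning [] deletes the element; returning [x] keeps it unchanged.
dropPrivate :: Block -> [Block]
dropPrivate (Div (_, classes, _) _)
  | "private" `elem` classes = []
dropPrivate x = [x]

main :: IO ()
main = toJSONFilter dropPrivate
~~~~

As with `delink`, `toJSONFilter` lifts the list-returning
function to a transformation of the whole document.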

# A filter for ruby text

Finally, here's a nice real-world example, developed on the
[pandoc-discuss](http://groups.google.com/group/pandoc-discuss/browse_thread/thread/7baea325565878c8)
list. Qubyte wrote:

> I'm interested in using pandoc to turn my markdown notes on Japanese
> into nicely set HTML and (Xe)LaTeX. With HTML5, ruby (typically used to
> phonetically read chinese characters by placing text above or to the
> side) is standard, and support from browsers is emerging (Webkit based
> browsers appear to fully support it). For those browsers that don't
> support it yet (notably Firefox) the feature falls back in a nice way
> by placing the phonetic reading inside brackets to the side of each
> Chinese character, which is suitable for other output formats too. As
> for (Xe)LaTeX, ruby is not an issue.
>
> At the moment, I use inline HTML to achieve the result when the
> conversion is to HTML, but it's ugly and uses a lot of keystrokes, for
> example
>
> ~~~ {.xml}
> <ruby>ご<rt></rt>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby>
> ~~~
>
> sets ご飯 "gohan" with "han" spelt phonetically above the second
> character, or to the right of it in brackets if the browser does not
> support ruby. I'd like to have something more like
>
>     r[はん](飯)
>
> or any keystroke saving convention would be welcome.

We came up with the following script, which uses the convention
that a markdown link with a URL beginning with a hyphen is
interpreted as ruby:

    [はん](-飯)

~~~ {.haskell}
-- handleruby.hs
import Text.Pandoc.JSON

handleRuby :: Maybe Format -> Inline -> Inline
handleRuby (Just format) (Link _ [Str ruby] ('-':kanji,_))
  | format == Format "html" = RawInline format $
      "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "</rt><rp>)</rp></ruby>"
  | format == Format "latex" = RawInline format $
      "\\ruby{" ++ kanji ++ "}{" ++ ruby ++ "}"
  | otherwise = Str ruby
handleRuby _ x = x

main :: IO ()
main = toJSONFilter handleRuby
~~~

Note that, when a script is called using `--filter`, pandoc
passes it the target format as the first argument. When a
function's first argument is of type `Maybe Format`,
`toJSONFilter` will automatically assign it `Just` the target
format or `Nothing`.

We compile our script:

    ghc --make handleruby

Then run it:

    % pandoc -F ./handleruby -t html
    [はん](-飯)
    ^D
    <p><ruby>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby></p>
    % pandoc -F ./handleruby -t latex
    [はん](-飯)
    ^D
    \ruby{飯}{はん}

# Exercises

1. Put all the regular text in a markdown document in ALL CAPS
   (without touching text in URLs or link titles).

2. Remove all horizontal rules from a document.

3. Renumber all enumerated lists with roman numerals.

4. Replace each delimited code block with class `dot` with an
   image generated by running `dot -Tpng` (from graphviz) on the
   contents of the code block.

5. Find all code blocks with class `python` and run them using
   the python interpreter, printing the results to the console.

diff --git a/doc/using-the-pandoc-api.md b/doc/using-the-pandoc-api.md
index b567db968..e80c3641f 100644
--- a/doc/using-the-pandoc-api.md
+++ b/doc/using-the-pandoc-api.md
@@ -1,28 +1,29 @@
-# Using the pandoc API
+% Using the pandoc API
+% John MacFarlane
 
-## Concepts
+# Concepts
 
-## Basic usage
+# Basic usage
 
-## The Pandoc structure
+# The Pandoc structure
 
-## Reader options
+# Reader options
 
-## Writer options
+# Writer options
 
-## The PandocMonad class
+# The PandocMonad class
 
 custom PandocMonad instances
 
-## Builder
+# Builder
 
 example: report from CSV data
 
-## Generic transformations
+# Generic transformations
 
 Walk and syb for AST transformations
 
-## Filters
+# Filters
 
 writing filters in Haskell