author     John MacFarlane <jgm@berkeley.edu>  2017-09-16 23:00:20 -0700
committer  John MacFarlane <jgm@berkeley.edu>  2017-09-16 23:00:20 -0700
commit     91ab987a524a34b56b2763f639318bf6b800c09a (patch)
tree       d2f6adbf59259f60e283c9f802fdeb78ca447c67
parent     9add71365489cf21c07221e86f5705c6494c1efb (diff)
download   pandoc-91ab987a524a34b56b2763f639318bf6b800c09a.tar.gz
Removed customizing-pandoc.md from doc/, added filters.md.
filters.md is essentially the scripting tutorial from the website.
-rw-r--r--  doc/customizing-pandoc.md    18
-rw-r--r--  doc/filters.md              469
-rw-r--r--  doc/using-the-pandoc-api.md  21
3 files changed, 480 insertions(+), 28 deletions(-)
diff --git a/doc/customizing-pandoc.md b/doc/customizing-pandoc.md
deleted file mode 100644
index 37b77cf1f..000000000
--- a/doc/customizing-pandoc.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Customizing pandoc
-
-## Templates
-
-## Reference docx/odt
-
-## Custom lua writers
-
-## Custom syntax highlighting
-
-syntax definitions, styles
-
-## Filters
-
-including documentation of the JSON serialization format and
-AST definition
-
-
diff --git a/doc/filters.md b/doc/filters.md
new file mode 100644
index 000000000..0c9b77328
--- /dev/null
+++ b/doc/filters.md
@@ -0,0 +1,469 @@
+% Pandoc filters
+% John MacFarlane
+
+# Summary
+
+Pandoc provides an interface for users to write programs (known
+as filters) which act on pandoc’s AST.
+
+Pandoc consists of a set of readers and writers. When converting
+a document from one format to another, text is parsed by a
+reader into pandoc’s intermediate representation of the
+document---an "abstract syntax tree" or AST---which is then
+converted by the writer into the target format.
+The pandoc AST format is defined in the module
+`Text.Pandoc.Definition` in
+[pandoc-types](https://hackage.haskell.org/package/pandoc-types).
+
+A "filter" is a program that modifies the AST, between the
+reader and the writer:
+
+ INPUT --reader--> AST --filter--> AST --writer--> OUTPUT
+
+Filters are "pipes" that read from standard input and write to
+standard output. They consume and produce a JSON representation
+of the pandoc AST. (In recent versions, this representation
+includes a `pandoc-api-version` field which refers to a
+version of `pandoc-types`.) Filters may be written in any programming
+language. To use a filter, you need only specify it on the
+command line using `--filter`, e.g.
+
+    pandoc -s input.txt --filter pandoc-citeproc -o output.html
+
+For a gentle introduction to writing your own filters, continue
+with this guide. There’s also a [list of third-party filters
+on the wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).
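+
+If you are curious what this JSON representation looks like, you can
+ask pandoc for it directly. For example (output abbreviated and
+reformatted here; the exact `pandoc-api-version` value depends on the
+pandoc-types version your pandoc was built against):
+
+    % echo 'hello *world*' | pandoc -t json
+    {"pandoc-api-version":[1,17,0,4],
+     "meta":{},
+     "blocks":[{"t":"Para","c":[{"t":"Str","c":"hello"},{"t":"Space"},
+               {"t":"Emph","c":[{"t":"Str","c":"world"}]}]}]}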
+
+
+# A simple example
+
+Suppose you wanted to replace all level 2+ headers in a markdown
+document with regular paragraphs containing the header text in
+italics. How would you go about doing this?
+
+A first thought would be to use regular expressions. Something
+like this:
+
+ perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt
+
+This should work most of the time. But don't forget
+that ATX style headers can end with a sequence of `#`s
+that is not part of the header text:
+
+ ## My header ##
+
+And what if your document contains a line starting with `##` in an HTML
+comment or delimited code block?
+
+ <!--
+ ## This is just a comment
+ -->
+
+ ~~~~
+ ### A third level header in standard markdown
+ ~~~~
+
+We don't want to touch *these* lines. Moreover, what about setext
+style second-level headers?
+
+ A header
+ --------
+
+We need to handle those too. Finally, can we be sure that adding
+asterisks to each side of our string will put it in italics?
+What if the string already contains asterisks around it? Then we'll
+end up with bold text, which is not what we want. And what if it contains
+a regular unescaped asterisk?
+
+How would you modify your regular expression to handle these cases? It
+would be hairy, to say the least. What we need is a real parser.
+
+Well, pandoc has a real markdown parser, the library function
+`readMarkdown`. This transforms markdown text to an abstract syntax tree
+(AST) that represents the document structure. Why not manipulate the
+AST directly in a short Haskell script, then convert the result back to
+markdown using `writeMarkdown`?
+
+First, let's see what this AST looks like. We can use pandoc's `native`
+output format:
+
+~~~~
+% cat test.txt
+## my header
+
+text with *italics*
+% pandoc -s -t native test.txt
+Pandoc (Meta {unMeta = fromList []})
+[Header 2 ("my-header",[],[]) [Str "my",Space,Str "header"]
+,Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]]]
+~~~~
+
+A `Pandoc` document consists of a `Meta` block (containing
+metadata like title, authors, and date) and a list of `Block`
+ elements. In this case, we have two `Block`s, a `Header` and a `Para`.
+Each has as its content a list of `Inline` elements. For more details on
+the pandoc AST, see the [haddock documentation for `Text.Pandoc.Definition`].
+
+[haddock documentation for `Text.Pandoc.Definition`]: http://hackage.haskell.org/package/pandoc-types
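+
+For orientation, here is a heavily abridged sketch of those types
+(the real module defines many more constructors and fields; see the
+haddocks linked above):
+
+~~~~ {.haskell}
+-- Abridged from Text.Pandoc.Definition (pandoc-types):
+-- data Pandoc = Pandoc Meta [Block]
+-- data Block  = Plain [Inline] | Para [Inline]
+--             | Header Int Attr [Inline] | CodeBlock Attr String | ...
+-- data Inline = Str String | Emph [Inline] | Space
+--             | Math MathType String | Link Attr [Inline] Target | ...
+~~~~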
+
+Here's a short Haskell script that reads markdown, changes level
+2+ headers to regular paragraphs, and writes the result as markdown.
+If you save it as `behead.hs`, you can run it using `runhaskell behead.hs`.
+It will act like a unix pipe, reading from `stdin` and writing to `stdout`.
+Or, if you want, you can compile it, using `ghc --make behead`, then run
+the resulting executable `behead`.
+
+~~~~ {.haskell}
+-- behead.hs
+import Text.Pandoc
+import Text.Pandoc.Walk (walk)
+
+behead :: Block -> Block
+behead (Header n _ xs) | n >= 2 = Para [Emph xs]
+behead x = x
+
+readDoc :: String -> Pandoc
+readDoc s = readMarkdown def s
+-- or, for pandoc 1.14 and greater, use:
+-- readDoc s = case readMarkdown def s of
+--                  Right doc -> doc
+--                  Left err  -> error (show err)
+
+writeDoc :: Pandoc -> String
+writeDoc doc = writeMarkdown def doc
+
+main :: IO ()
+main = interact (writeDoc . walk behead . readDoc)
+~~~~
+
+The magic here is the `walk` function, which converts
+our `behead` function (a function from `Block` to `Block`) to
+a transformation on whole `Pandoc` documents.
+(See the [haddock documentation for `Text.Pandoc.Walk`].)
+
+[haddock documentation for `Text.Pandoc.Walk`]: http://hackage.haskell.org/package/pandoc-types
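+
+For instance, assuming the script compiles against your installed
+pandoc version (see the commented-out variant of `readDoc` above), a
+quick session might look like this:
+
+    % echo '## my header' | runhaskell behead.hs
+    *my header*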
+
+# Queries: listing URLs
+
+We can use this same technique to do much more complex transformations
+and queries. Here's how we could extract all the URLs linked to in
+a markdown document (again, not an easy task with regular expressions):
+
+~~~~ {.haskell}
+-- extracturls.hs
+import Text.Pandoc
+import Text.Pandoc.Walk (query)
+
+extractURL :: Inline -> [String]
+extractURL (Link _ _ (u,_)) = [u]
+extractURL (Image _ _ (u,_)) = [u]
+extractURL _ = []
+
+extractURLs :: Pandoc -> [String]
+extractURLs = query extractURL
+
+readDoc :: String -> Pandoc
+readDoc = readMarkdown def
+-- or, for pandoc 1.14 and greater, use:
+-- readDoc s = case readMarkdown def s of
+--                  Right doc -> doc
+--                  Left err  -> error (show err)
+
+main :: IO ()
+main = interact (unlines . extractURLs . readDoc)
+~~~~
+
+`query` is the query counterpart of `walk`: it lifts
+a function that operates on `Inline` elements to one that operates
+on the whole `Pandoc` AST. The results returned by applying
+`extractURL` to each `Inline` element are concatenated in the
+result.
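+
+For reference, the types of these two functions in
+`Text.Pandoc.Walk` are essentially:
+
+~~~~ {.haskell}
+walk  :: Walkable a b => (a -> a) -> b -> b
+query :: (Walkable a b, Monoid c) => (a -> c) -> b -> c
+~~~~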
+
+# JSON filters
+
+`behead.hs` is a very special-purpose program. It reads a
+specific input format (markdown) and writes a specific output format
+(markdown), with a specific set of options (here, the defaults).
+But the basic operation it performs is one that would be useful
+in many document transformations. It would be nice to isolate the
+part of the program that transforms the pandoc AST, leaving the rest
+to pandoc itself. What we want is a *filter* that *just* operates
+on the AST---or rather, on a JSON representation of the AST that
+pandoc can produce and consume:
+
+ source format
+ ↓
+ (pandoc)
+ ↓
+ JSON-formatted AST
+ ↓
+ (filter)
+ ↓
+ JSON-formatted AST
+ ↓
+ (pandoc)
+ ↓
+ target format
+
+The module `Text.Pandoc.JSON` contains a function `toJSONFilter`
+that makes it easy to write such filters. Here is a filter
+version of `behead.hs`:
+
+~~~~ {.haskell}
+#!/usr/bin/env runhaskell
+-- behead2.hs
+import Text.Pandoc.JSON
+
+main :: IO ()
+main = toJSONFilter behead
+ where behead (Header n _ xs) | n >= 2 = Para [Emph xs]
+ behead x = x
+~~~~
+
+It can be used this way:
+
+ pandoc -f SOURCEFORMAT -t json | runhaskell behead2.hs | \
+ pandoc -f json -t TARGETFORMAT
+
+But it is easier to use the `--filter` option with pandoc:
+
+ pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2.hs
+
+Note that this approach requires that `behead2.hs` be executable,
+so we must
+
+ chmod +x behead2.hs
+
+Alternatively, we could compile the filter:
+
+ ghc --make behead2.hs
+    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2
+
+Note that if the filter is placed in the system PATH, then the initial
+`./` is not needed. Note also that the command line can include
+multiple instances of `--filter`: the filters will be applied in
+sequence.
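+
+For example, to behead the headers and then run the `pandoc-citeproc`
+filter mentioned earlier (the paths and file names here are just
+illustrative):
+
+    pandoc input.md -s --filter ./behead2.hs --filter pandoc-citeproc -o output.html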
+
+# LaTeX for WordPress
+
+Another easy example. WordPress blogs require a special format for
+LaTeX math. Instead of `$e=mc^2$`, you need: `$latex e=mc^2$`.
+How can we convert a markdown document accordingly?
+
+Again, it's difficult to do the job reliably with regexes.
+A `$` might be a regular currency indicator, or it might occur in
+a comment or code block or inline code span. We just want to find
+the `$`s that begin LaTeX math. If only we had a parser...
+
+We do. Pandoc already extracts LaTeX math, so:
+
+~~~~ {.haskell}
+#!/usr/bin/env runhaskell
+-- wordpressify.hs
+import Text.Pandoc.JSON
+
+main = toJSONFilter wordpressify
+  where wordpressify (Math x y) = Math x ("latex " ++ y)
+ wordpressify x = x
+~~~~
+
+Mission accomplished. (I've omitted type signatures here,
+just to show it can be done.)
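+
+After `chmod +x wordpressify.hs`, a quick check (the exact output may
+vary slightly across pandoc versions):
+
+    % echo 'Einstein wrote $e=mc^2$.' | pandoc -F ./wordpressify.hs -t markdown
+    Einstein wrote $latex e=mc^2$.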
+
+
+# But I don't want to learn Haskell!
+
+While it's easiest to write pandoc filters in Haskell, it is fairly
+easy to write them in python using the `pandocfilters` package.
+The package is on PyPI and can be installed using `pip install
+pandocfilters` or `easy_install pandocfilters`.
+
+Here's our "beheading" filter in python:
+
+~~~ {.python}
+#!/usr/bin/env python
+
+"""
+Pandoc filter to convert all level 2+ headers to paragraphs with
+emphasized text.
+"""
+
+from pandocfilters import toJSONFilter, Emph, Para
+
+def behead(key, value, format, meta):
+ if key == 'Header' and value[0] >= 2:
+ return Para([Emph(value[2])])
+
+if __name__ == "__main__":
+ toJSONFilter(behead)
+~~~
+
+`toJSONFilter(behead)` walks the AST and applies the `behead` action
+to each element. If `behead` returns nothing, the node is unchanged;
+if it returns an object, the node is replaced; if it returns a list,
+the new list is spliced in.
+
+Note that, although these parameters are not used in this example,
+`format` provides access to the target format, and `meta` provides access to
+the document's metadata.
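+
+Assuming the script above is saved as `behead.py` and made
+executable, it is used exactly like the Haskell filters:
+
+    pandoc -f markdown -t html --filter ./behead.py input.md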
+
+There are many examples of python filters in [the pandocfilters
+repository](http://github.com/jgm/pandocfilters).
+
+For a more Pythonic alternative to pandocfilters, see
+the [panflute](http://scorreia.com/software/panflute/) library.
+Don't like Python? There are also ports of pandocfilters in
+[PHP](https://github.com/vinai/pandocfilters-php),
+[perl](https://metacpan.org/pod/Pandoc::Filter), and
+[javascript/node.js](https://github.com/mvhenderson/pandoc-filter-node).
+
+Starting with pandoc 2.0, pandoc includes built-in support for
+writing filters in lua. The lua interpreter is built into
+pandoc, so a lua filter does not require any additional software
+to run. See the [documentation on lua
+filters](lua-filters.html).
+
+# Include files
+
+So far, none of our transforms has involved IO. How about a script
+that reads a markdown document, finds all the code blocks with
+attribute `include`, and replaces their contents with the contents
+of the given file?
+
+~~~~ {.haskell}
+#!/usr/bin/env runhaskell
+-- includes.hs
+import Text.Pandoc.JSON
+
+doInclude :: Block -> IO Block
+doInclude cb@(CodeBlock (id, classes, namevals) contents) =
+ case lookup "include" namevals of
+ Just f -> return . (CodeBlock (id, classes, namevals)) =<< readFile f
+ Nothing -> return cb
+doInclude x = return x
+
+main :: IO ()
+main = toJSONFilter doInclude
+~~~~
+
+Try this on the following:
+
+ Here's the pandoc README:
+
+ ~~~~ {include="README"}
+ this will be replaced by contents of README
+ ~~~~
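+
+To run it, make `includes.hs` executable (or compile it) and use it
+like any other filter, e.g.:
+
+    pandoc --filter ./includes.hs input.md -o output.html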
+
+# Removing links
+
+What if we want to remove every link from a document, retaining
+the link's text?
+
+~~~~ {.haskell}
+#!/usr/bin/env runhaskell
+-- delink.hs
+import Text.Pandoc.JSON
+
+main = toJSONFilter delink
+
+delink :: Inline -> [Inline]
+delink (Link _ txt _) = txt
+delink x = [x]
+~~~~
+
+Note that `delink` can't be a function of type `Inline -> Inline`,
+because the thing we want to replace the link with is not a single
+`Inline` element, but a list of them. So we make `delink` a function
+from an `Inline` element to a list of `Inline` elements.
+`toJSONFilter` can still lift this function to a transformation of type
+`Pandoc -> Pandoc`.
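+
+More generally, `toJSONFilter` accepts functions of several shapes;
+the examples in this document use each of the following:
+
+~~~~ {.haskell}
+--   Block  -> Block                     -- pure, one-for-one (behead2.hs)
+--   Inline -> [Inline]                  -- pure, one-to-many (delink.hs)
+--   Block  -> IO Block                  -- effectful (includes.hs)
+--   Maybe Format -> Inline -> Inline    -- format-aware (handleruby.hs, below)
+~~~~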
+
+# A filter for ruby text
+
+Finally, here's a nice real-world example, developed on the
+[pandoc-discuss](http://groups.google.com/group/pandoc-discuss/browse_thread/thread/7baea325565878c8) list. Qubyte wrote:
+
+> I'm interested in using pandoc to turn my markdown notes on Japanese
+> into nicely set HTML and (Xe)LaTeX. With HTML5, ruby (typically used to
+> phonetically read chinese characters by placing text above or to the
+> side) is standard, and support from browsers is emerging (Webkit based
+> browsers appear to fully support it). For those browsers that don't
+> support it yet (notably Firefox) the feature falls back in a nice way
+> by placing the phonetic reading inside brackets to the side of each
+> Chinese character, which is suitable for other output formats too. As
+> for (Xe)LaTeX, ruby is not an issue.
+>
+> At the moment, I use inline HTML to achieve the result when the
+> conversion is to HTML, but it's ugly and uses a lot of keystrokes, for
+> example
+>
+> ~~~ {.xml}
+> <ruby>ご<rt></rt>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby>
+> ~~~
+>
+> sets ご飯 "gohan" with "han" spelt phonetically above the second
+> character, or to the right of it in brackets if the browser does not
+> support ruby. I'd like to have something more like
+>
+> r[はん](飯)
+>
+> or any keystroke saving convention would be welcome.
+
+We came up with the following script, which uses the convention that a
+markdown link with a URL beginning with a hyphen is interpreted as ruby:
+
+ [はん](-飯)
+
+~~~ {.haskell}
+-- handleruby.hs
+import Text.Pandoc.JSON
+
+handleRuby :: Maybe Format -> Inline -> Inline
+handleRuby (Just format) (Link _ [Str ruby] ('-':kanji,_))
+ | format == Format "html" = RawInline format
+ $ "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "</rt><rp>)</rp></ruby>"
+ | format == Format "latex" = RawInline format
+ $ "\\ruby{" ++ kanji ++ "}{" ++ ruby ++ "}"
+ | otherwise = Str ruby
+handleRuby _ x = x
+
+main :: IO ()
+main = toJSONFilter handleRuby
+~~~
+
+Note that, when a script is called using `--filter`, pandoc passes
+it the target format as the first argument. When a function's
+first argument is of type `Maybe Format`, `toJSONFilter` will
+automatically assign it `Just` the target format or `Nothing`.
+
+We compile our script:
+
+    ghc --make handleruby.hs
+
+Then run it:
+
+    % pandoc -F ./handleruby -t html
+ [はん](-飯)
+ ^D
+ <p><ruby>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby></p>
+    % pandoc -F ./handleruby -t latex
+ [はん](-飯)
+ ^D
+ \ruby{飯}{はん}
+
+# Exercises
+
+1. Put all the regular text in a markdown document in ALL CAPS
+ (without touching text in URLs or link titles).
+
+2. Remove all horizontal rules from a document.
+
+3. Renumber all enumerated lists with roman numerals.
+
+4. Replace each delimited code block with class `dot` with an
+ image generated by running `dot -Tpng` (from graphviz) on the
+ contents of the code block.
+
+5. Find all code blocks with class `python` and run them
+ using the python interpreter, printing the results to the console.
+
diff --git a/doc/using-the-pandoc-api.md b/doc/using-the-pandoc-api.md
index b567db968..e80c3641f 100644
--- a/doc/using-the-pandoc-api.md
+++ b/doc/using-the-pandoc-api.md
@@ -1,28 +1,29 @@
-# Using the pandoc API
+% Using the pandoc API
+% John MacFarlane
-## Concepts
+# Concepts
-## Basic usage
+# Basic usage
-## The Pandoc structure
+# The Pandoc structure
-## Reader options
+# Reader options
-## Writer options
+# Writer options
-## The PandocMonad class
+# The PandocMonad class
custom PandocMonad instances
-## Builder
+# Builder
example: report from CSV data
-## Generic transformations
+# Generic transformations
Walk and syb for AST transformations
-## Filters
+# Filters
writing filters in Haskell