| author    | John MacFarlane <jgm@berkeley.edu>                     | 2017-09-16 23:00:20 -0700 |
|-----------|--------------------------------------------------------|---------------------------|
| committer | John MacFarlane <jgm@berkeley.edu>                     | 2017-09-16 23:00:20 -0700 |
| commit    | 91ab987a524a34b56b2763f639318bf6b800c09a (patch)       |                           |
| tree      | d2f6adbf59259f60e283c9f802fdeb78ca447c67               |                           |
| parent    | 9add71365489cf21c07221e86f5705c6494c1efb (diff)        |                           |
| download  | pandoc-91ab987a524a34b56b2763f639318bf6b800c09a.tar.gz |                           |

Removed customizing-pandoc.md from doc/, added filters.md.
filters.md is essentially the scripting tutorial from the
website.

| -rw-r--r-- | doc/customizing-pandoc.md   | 18  |
| -rw-r--r-- | doc/filters.md              | 469 |
| -rw-r--r-- | doc/using-the-pandoc-api.md | 21  |

3 files changed, 480 insertions, 28 deletions

diff --git a/doc/customizing-pandoc.md b/doc/customizing-pandoc.md
deleted file mode 100644
index 37b77cf1f..000000000
--- a/doc/customizing-pandoc.md
+++ /dev/null
@@ -1,18 +0,0 @@

# Customizing pandoc

## Templates

## Reference docx/odt

## Custom lua writers

## Custom syntax highlighting

syntax definitions, styles

## Filters

including documentation of the JSON serialization format and
AST definition

diff --git a/doc/filters.md b/doc/filters.md
new file mode 100644
index 000000000..0c9b77328
--- /dev/null
+++ b/doc/filters.md
@@ -0,0 +1,469 @@

% Pandoc filters
% John MacFarlane

# Summary

Pandoc provides an interface for users to write programs (known
as filters) which act on pandoc's AST.

Pandoc consists of a set of readers and writers. When converting
a document from one format to another, text is parsed by a
reader into pandoc's intermediate representation of the
document---an "abstract syntax tree" or AST---which is then
converted by the writer into the target format.
The pandoc AST format is defined in the module
`Text.Pandoc.Definition` in
[pandoc-types](https://hackage.haskell.org/package/pandoc-types).

A "filter" is a program that modifies the AST, between the
reader and the writer:

    INPUT --reader--> AST --filter--> AST --writer--> OUTPUT

Filters are "pipes" that read from standard input and write to
standard output. They consume and produce a JSON representation
of the pandoc AST. (In recent versions, this representation
includes a `pandoc-api-version` field which refers to a version
of `pandoc-types`.) Filters may be written in any programming
language. To use a filter, you need only specify it on the
command line using `--filter`, e.g.

    pandoc -s input.txt --filter pandoc-citeproc -o output.html

For a gentle introduction to writing your own filters, read on.
There's also a [list of third party filters on the
wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).

# A simple example

Suppose you wanted to replace all level 2+ headers in a markdown
document with regular paragraphs, with their text in italics.
How would you go about doing this?

A first thought would be to use regular expressions. Something
like this:

    perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt

This should work most of the time. But don't forget that
ATX-style headers can end with a sequence of `#`s that is not
part of the header text:

    ## My header ##

And what if your document contains a line starting with `##` in
an HTML comment or delimited code block?

    <!--
    ## This is just a comment
    -->

    ~~~~
    ### A third level header in standard markdown
    ~~~~

We don't want to touch *these* lines. Moreover, what about
setext-style second-level headers?

    A header
    --------

We need to handle those too. Finally, can we be sure that adding
asterisks to each side of our string will put it in italics?
What if the string already contains asterisks around it? Then
we'll end up with bold text, which is not what we want. And what
if it contains a regular unescaped asterisk?

How would you modify your regular expression to handle these
cases? It would be hairy, to say the least. What we need is a
real parser.

Well, pandoc has a real markdown parser, the library function
`readMarkdown`. This transforms markdown text to an abstract
syntax tree (AST) that represents the document structure.
Why not manipulate the AST directly in a short Haskell script,
then convert the result back to markdown using `writeMarkdown`?

First, let's see what this AST looks like. We can use pandoc's
`native` output format:

~~~~
% cat test.txt
## my header

text with *italics*
% pandoc -s -t native test.txt
Pandoc (Meta {unMeta = fromList []})
[Header 2 ("my-header",[],[]) [Str "my",Space,Str "header"]
,Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]]]
~~~~

A `Pandoc` document consists of a `Meta` block (containing
metadata like title, authors, and date) and a list of `Block`
elements. In this case, we have two `Block`s, a `Header` and a
`Para`. Each has as its content a list of `Inline` elements.
For more details on the pandoc AST, see the [haddock
documentation for `Text.Pandoc.Definition`].

[haddock documentation for `Text.Pandoc.Definition`]: http://hackage.haskell.org/package/pandoc-types

Here's a short Haskell script that reads markdown, changes level
2+ headers to regular paragraphs, and writes the result as
markdown. If you save it as `behead.hs`, you can run it using
`runhaskell behead.hs`. It will act like a unix pipe, reading
from `stdin` and writing to `stdout`. Or, if you want, you can
compile it, using `ghc --make behead`, then run the resulting
executable `behead`.

~~~~ {.haskell}
-- behead.hs
import Text.Pandoc
import Text.Pandoc.Walk (walk)

behead :: Block -> Block
behead (Header n _ xs) | n >= 2 = Para [Emph xs]
behead x = x

readDoc :: String -> Pandoc
readDoc s = readMarkdown def s
-- or, for pandoc 1.14 and greater, use:
-- readDoc s = case readMarkdown def s of
--                  Right doc -> doc
--                  Left err  -> error (show err)

writeDoc :: Pandoc -> String
writeDoc doc = writeMarkdown def doc

main :: IO ()
main = interact (writeDoc . walk behead . readDoc)
~~~~

The magic here is the `walk` function, which converts our
`behead` function (a function from `Block` to `Block`) into a
transformation on whole `Pandoc` documents. (See the [haddock
documentation for `Text.Pandoc.Walk`].)

[haddock documentation for `Text.Pandoc.Walk`]: http://hackage.haskell.org/package/pandoc-types

# Queries: listing URLs

We can use this same technique to do much more complex
transformations and queries. Here's how we could extract all the
URLs linked to in a markdown document (again, not an easy task
with regular expressions):

~~~~ {.haskell}
-- extracturls.hs
import Text.Pandoc
import Text.Pandoc.Walk (query)

extractURL :: Inline -> [String]
extractURL (Link _ _ (u,_)) = [u]
extractURL (Image _ _ (u,_)) = [u]
extractURL _ = []

extractURLs :: Pandoc -> [String]
extractURLs = query extractURL

readDoc :: String -> Pandoc
readDoc = readMarkdown def
-- or, for pandoc 1.14 and greater, use:
-- readDoc s = case readMarkdown def s of
--                  Right doc -> doc
--                  Left err  -> error (show err)

main :: IO ()
main = interact (unlines . extractURLs . readDoc)
~~~~

`query` is the query counterpart of `walk`: it lifts a function
that operates on `Inline` elements to one that operates on the
whole `Pandoc` AST. The results returned by applying
`extractURL` to each `Inline` element are concatenated in the
result.
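
Nothing restricts `query` to functions on `Inline` elements: it
works for any element type the AST can be walked over, including
`Block`. As a further illustration (not part of the original
tutorial; the file name and the idea of listing code-block
classes are just an example), here is a sketch in the same style
as `extracturls.hs`:

~~~~ {.haskell}
-- listclasses.hs (hypothetical example)
import Text.Pandoc
import Text.Pandoc.Walk (query)

-- Collect the classes of a single code block; every other kind
-- of block contributes nothing.
codeClasses :: Block -> [String]
codeClasses (CodeBlock (_, classes, _) _) = classes
codeClasses _ = []

-- Lift the per-block function to a query over the whole document.
listClasses :: Pandoc -> [String]
listClasses = query codeClasses

readDoc :: String -> Pandoc
readDoc = readMarkdown def
-- or, for pandoc 1.14 and greater, use:
-- readDoc s = case readMarkdown def s of
--                  Right doc -> doc
--                  Left err  -> error (show err)

main :: IO ()
main = interact (unlines . listClasses . readDoc)
~~~~

The only substantive change from `extracturls.hs` is the element
type the helper matches on; `query` handles the traversal in the
same way.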

# JSON filters

`behead.hs` is a very special-purpose program. It reads a
specific input format (markdown) and writes a specific output
format (markdown), with a specific set of options (here, the
defaults). But the basic operation it performs is one that would
be useful in many document transformations. It would be nice to
isolate the part of the program that transforms the pandoc AST,
leaving the rest to pandoc itself. What we want is a *filter*
that *just* operates on the AST---or rather, on a JSON
representation of the AST that pandoc can produce and consume:

    source format
         ↓
      (pandoc)
         ↓
    JSON-formatted AST
         ↓
      (filter)
         ↓
    JSON-formatted AST
         ↓
      (pandoc)
         ↓
    target format

The module `Text.Pandoc.JSON` contains a function `toJSONFilter`
that makes it easy to write such filters. Here is a filter
version of `behead.hs`:

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- behead2.hs
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter behead
  where behead (Header n _ xs) | n >= 2 = Para [Emph xs]
        behead x = x
~~~~

It can be used this way:

    pandoc -f SOURCEFORMAT -t json | runhaskell behead2.hs | \
      pandoc -f json -t TARGETFORMAT

But it is easier to use the `--filter` option with pandoc:

    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2.hs

Note that this approach requires that `behead2.hs` be
executable, so we must

    chmod +x behead2.hs

Alternatively, we could compile the filter:

    ghc --make behead2.hs
    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2

Note that if the filter is placed in the system PATH, then the
initial `./` is not needed. Note also that the command line can
include multiple instances of `--filter`: the filters will be
applied in sequence.
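
For instance, assuming the compiled `behead2` from above is in
the current directory, it could be chained with the
`pandoc-citeproc` filter mentioned earlier (the file names here
are only placeholders):

    pandoc input.txt --filter ./behead2 --filter pandoc-citeproc -o output.html

The filters run left to right, each one receiving the JSON AST
produced by the previous one.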

# LaTeX for WordPress

Another easy example. WordPress blogs require a special format
for LaTeX math. Instead of `$e=mc^2$`, you need: `$latex
e=mc^2$`. How can we convert a markdown document accordingly?

Again, it's difficult to do the job reliably with regexes. A `$`
might be a regular currency indicator, or it might occur in a
comment or code block or inline code span. We just want to find
the `$`s that begin LaTeX math. If only we had a parser...

We do. Pandoc already extracts LaTeX math, so:

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- wordpressify.hs
import Text.Pandoc.JSON

main = toJSONFilter wordpressify
  where wordpressify (Math x y) = Math x ("latex " ++ y)
        wordpressify x = x
~~~~

Mission accomplished. (I've omitted type signatures here, just
to show it can be done.)

# But I don't want to learn Haskell!

While it's easiest to write pandoc filters in Haskell, it is
fairly easy to write them in python using the `pandocfilters`
package. The package is in PyPI and can be installed using
`pip install pandocfilters` or `easy_install pandocfilters`.

Here's our "beheading" filter in python:

~~~ {.python}
#!/usr/bin/env python

"""
Pandoc filter to convert all level 2+ headers to paragraphs with
emphasized text.
"""

from pandocfilters import toJSONFilter, Emph, Para

def behead(key, value, format, meta):
    if key == 'Header' and value[0] >= 2:
        return Para([Emph(value[2])])

if __name__ == "__main__":
    toJSONFilter(behead)
~~~

`toJSONFilter(behead)` walks the AST and applies the `behead`
action to each element. If `behead` returns nothing, the node is
unchanged; if it returns an object, the node is replaced; if it
returns a list, the new list is spliced in.

Note that, although these parameters are not used in this
example, `format` provides access to the target format, and
`meta` provides access to the document's metadata.

There are many examples of python filters in [the pandocfilters
repository](http://github.com/jgm/pandocfilters).

For a more Pythonic alternative to pandocfilters, see the
[panflute](http://scorreia.com/software/panflute/) library.
Don't like Python? There are also ports of pandocfilters in
[PHP](https://github.com/vinai/pandocfilters-php),
[perl](https://metacpan.org/pod/Pandoc::Filter), and
[javascript/node.js](https://github.com/mvhenderson/pandoc-filter-node).

Starting with pandoc 2.0, pandoc includes built-in support for
writing filters in lua. The lua interpreter is built in to
pandoc, so a lua filter does not require any additional software
to run. See the [documentation on lua filters](lua-filters.html).

# Include files

So far, none of our transforms have involved IO. How about a
script that reads a markdown document, finds all the code blocks
with attribute `include`, and replaces their contents with the
contents of the file given?

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- includes.hs
import Text.Pandoc.JSON

doInclude :: Block -> IO Block
doInclude cb@(CodeBlock (id, classes, namevals) contents) =
  case lookup "include" namevals of
       Just f  -> return . (CodeBlock (id, classes, namevals)) =<< readFile f
       Nothing -> return cb
doInclude x = return x

main :: IO ()
main = toJSONFilter doInclude
~~~~

Try this on the following:

    Here's the pandoc README:

    ~~~~ {include="README"}
    this will be replaced by contents of README
    ~~~~

# Removing links

What if we want to remove every link from a document, retaining
the link's text?

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- delink.hs
import Text.Pandoc.JSON

main = toJSONFilter delink

delink :: Inline -> [Inline]
delink (Link _ txt _) = txt
delink x = [x]
~~~~

Note that `delink` can't be a function of type
`Inline -> Inline`, because the thing we want to replace the
link with is not a single `Inline` element, but a list of them.
So we make `delink` a function from an `Inline` element to a
list of `Inline` elements. `toJSONFilter` can still lift this
function to a transformation of type `Pandoc -> Pandoc`.
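
Returning a list also gives us a way to *delete* elements: an
empty list removes the element entirely. Here is a small sketch
along the same lines (the `private` class is just an invented
convention for this example, not something pandoc defines):

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- dropprivate.hs (hypothetical example)
import Text.Pandoc.JSON

-- Returning [] deletes the element; returning [x] keeps it unchanged.
dropPrivate :: Block -> [Block]
dropPrivate (Div (_, classes, _) _)
  | "private" `elem` classes = []
dropPrivate x = [x]

main :: IO ()
main = toJSONFilter dropPrivate
~~~~

As with `delink`, `toJSONFilter` lifts the list-returning
function to a transformation of the whole document.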

# A filter for ruby text

Finally, here's a nice real-world example, developed on the
[pandoc-discuss](http://groups.google.com/group/pandoc-discuss/browse_thread/thread/7baea325565878c8)
list. Qubyte wrote:

> I'm interested in using pandoc to turn my markdown notes on Japanese
> into nicely set HTML and (Xe)LaTeX. With HTML5, ruby (typically used to
> phonetically read chinese characters by placing text above or to the
> side) is standard, and support from browsers is emerging (Webkit based
> browsers appear to fully support it). For those browsers that don't
> support it yet (notably Firefox) the feature falls back in a nice way
> by placing the phonetic reading inside brackets to the side of each
> Chinese character, which is suitable for other output formats too. As
> for (Xe)LaTeX, ruby is not an issue.
>
> At the moment, I use inline HTML to achieve the result when the
> conversion is to HTML, but it's ugly and uses a lot of keystrokes, for
> example
>
> ~~~ {.xml}
> <ruby>ご<rt></rt>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby>
> ~~~
>
> sets ご飯 "gohan" with "han" spelt phonetically above the second
> character, or to the right of it in brackets if the browser does not
> support ruby. I'd like to have something more like
>
>     r[はん](飯)
>
> or any keystroke saving convention would be welcome.

We came up with the following script, which uses the convention
that a markdown link with a URL beginning with a hyphen is
interpreted as ruby:

    [はん](-飯)

~~~ {.haskell}
-- handleruby.hs
import Text.Pandoc.JSON

handleRuby :: Maybe Format -> Inline -> Inline
handleRuby (Just format) (Link _ [Str ruby] ('-':kanji,_))
  | format == Format "html" = RawInline format $
      "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "</rt><rp>)</rp></ruby>"
  | format == Format "latex" = RawInline format $
      "\\ruby{" ++ kanji ++ "}{" ++ ruby ++ "}"
  | otherwise = Str ruby
handleRuby _ x = x

main :: IO ()
main = toJSONFilter handleRuby
~~~

Note that, when a script is called using `--filter`, pandoc
passes it the target format as the first argument. When a
function's first argument is of type `Maybe Format`,
`toJSONFilter` will automatically assign it `Just` the target
format or `Nothing`.

We compile our script:

    ghc --make handleruby

Then run it:

    % pandoc -F ./handleruby -t html
    [はん](-飯)
    ^D
    <p><ruby>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby></p>
    % pandoc -F ./handleruby -t latex
    [はん](-飯)
    ^D
    \ruby{飯}{はん}

# Exercises

1. Put all the regular text in a markdown document in ALL CAPS
   (without touching text in URLs or link titles).

2. Remove all horizontal rules from a document.

3. Renumber all enumerated lists with roman numerals.

4. Replace each delimited code block with class `dot` with an
   image generated by running `dot -Tpng` (from graphviz) on the
   contents of the code block.

5. Find all code blocks with class `python` and run them using
   the python interpreter, printing the results to the console.

diff --git a/doc/using-the-pandoc-api.md b/doc/using-the-pandoc-api.md
index b567db968..e80c3641f 100644
--- a/doc/using-the-pandoc-api.md
+++ b/doc/using-the-pandoc-api.md
@@ -1,28 +1,29 @@
-# Using the pandoc API
+% Using the pandoc API
+% John MacFarlane
 
-## Concepts
+# Concepts
 
-## Basic usage
+# Basic usage
 
-## The Pandoc structure
+# The Pandoc structure
 
-## Reader options
+# Reader options
 
-## Writer options
+# Writer options
 
-## The PandocMonad class
+# The PandocMonad class
 
 custom PandocMonad instances
 
-## Builder
+# Builder
 
 example: report from CSV data
 
-## Generic transformations
+# Generic transformations
 
 Walk and syb for AST transformations
 
-## Filters
+# Filters
 
 writing filters in Haskell