author | John MacFarlane <jgm@berkeley.edu> | 2020-01-14 11:18:24 -0800
---|---|---
committer | John MacFarlane <jgm@berkeley.edu> | 2020-01-14 11:18:24 -0800
commit | dfac1239d94401bb45fce65d74fa26f360c6decd (patch) |
tree | 808e242e83210932b7236eed0d036af96a761c23 |
parent | 9009bda1792e1db5d019d63c16f40ce9df269724 (diff) |
download | pandoc-dfac1239d94401bb45fce65d74fa26f360c6decd.tar.gz |
Update filter documentation.
Remove example using pandoc API directly (we have other
docs for that and it was outdated).
Closes #6065.
-rw-r--r-- | doc/filters.md | 226 |
1 file changed, 84 insertions, 142 deletions
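The JSON-filter mechanism this patch documents can be sketched outside Haskell as well. Below is a minimal, standalone Python sketch of the "behead" transformation the documentation uses as its running example. It is not part of the patch: the names `behead` and `filter_doc` are hypothetical, it walks only top-level blocks, and it assumes the pandoc-types >= 1.17 JSON encoding in which every AST element is an object with `t` (tag) and `c` (contents) fields, and a `Header`'s contents are `[level, attr, inlines]`.

```python
import json
import sys

def behead(block):
    """Replace a Header block of level >= 2 with a Para whose
    inlines are wrapped in Emph (assumes pandoc-types >= 1.17
    JSON shape: Header contents are [level, attr, inlines])."""
    if block.get("t") == "Header" and block["c"][0] >= 2:
        return {"t": "Para", "c": [{"t": "Emph", "c": block["c"][2]}]}
    return block

def filter_doc(doc):
    # A real filter would also walk nested blocks (e.g. inside
    # BlockQuote or Div); top-level blocks suffice to illustrate.
    return {**doc, "blocks": [behead(b) for b in doc["blocks"]]}

if __name__ == "__main__":
    # Like toJSONFilter: JSON AST in on stdin, JSON AST out on stdout.
    sys.stdout.write(json.dumps(filter_doc(json.load(sys.stdin))))
```

With pandoc installed, this could sit in a pipeline the same way as the Haskell version: `pandoc -t json | python behead.py | pandoc -f json -t markdown`.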
diff --git a/doc/filters.md b/doc/filters.md
index 0b48c8002..b7d6aa45d 100644
--- a/doc/filters.md
+++ b/doc/filters.md
@@ -16,19 +16,52 @@ The pandoc AST format is defined in the module
 ](https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html).
 
 A "filter" is a program that modifies the AST, between the
-reader and the writer:
+reader and the writer.
 
     INPUT --reader--> AST --filter--> AST --writer--> OUTPUT
 
-Filters are "pipes" that read from standard input and write to
-standard output.  They consume and produce a JSON representation
-of the pandoc AST.  (In recent versions, this representation
-includes a `pandoc-api-version` field which refers to a
-version of `pandoc-types`.)  Filters may be written in any programming
-language.  To use a filter, you need only specify it on the
-command line using `--filter`, e.g.
+Pandoc supports two kinds of filters:
 
-    pandoc -s input.txt --filter pandoc-citeproc -o output.html
+- **Lua filters** use the Lua language to
+  define transformations on the pandoc AST.  They are
+  described in a [separate document](lua-filters.html).
+
+- **JSON filters**, described here, are pipes that read from
+  standard input and write to standard output, consuming and
+  producing a JSON representation of the pandoc AST:
+
+         source format
+              ↓
+           (pandoc)
+              ↓
+      JSON-formatted AST
+              ↓
+        (JSON filter)
+              ↓
+      JSON-formatted AST
+              ↓
+           (pandoc)
+              ↓
+         target format
+
+Lua filters have a couple of advantages.  They use a Lua
+interpreter that is embedded in pandoc, so you don't need
+to have any external software installed.  And they are
+usually faster than JSON filters.  But if you wish to
+write your filter in a language other than Lua, you may
+prefer to use a JSON filter.  JSON filters may be written
+in any programming language.
+
+You can use a JSON filter directly in a pipeline:
+
+    pandoc -s input.txt -t json | \
+    pandoc-citeproc | \
+    pandoc -s -f json -o output.html
+
+But it is more convenient to use the `--filter` option,
+which handles the plumbing automatically:
+
+    pandoc -s input.txt --filter pandoc-citeproc -o output.html
 
 For a gentle introduction to writing your own filters,
 continue this guide.  There’s also a [list of third party filters
@@ -37,7 +70,7 @@ on the wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).
 
 # A simple example
 
-Suppose you wanted to replace all level 2+ headers in a markdown
+Suppose you wanted to replace all level 2+ headings in a markdown
 document with regular paragraphs, with text in italics.  How would
 you go about doing this?
 
@@ -47,10 +80,10 @@ like this:
 
     perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt
 
 This should work most of the time.  But don't forget
-that ATX style headers can end with a sequence of `#`s
-that is not part of the header text:
+that ATX style headings can end with a sequence of `#`s
+that is not part of the heading text:
 
-    ## My header ##
+    ## My heading ##
 
 And what if your document contains a line starting with `##` in an HTML
 comment or delimited code block?
 
@@ -60,14 +93,14 @@ comment or delimited code block?
     -->
 
     ~~~~
-    ### A third level header in standard markdown
+    ### A third level heading in standard markdown
     ~~~~
 
-We don't want to touch *these* lines.  Moreover, what about setext
-style second-level headers?
+We don't want to touch *these* lines.  Moreover, what about Setext
+style second-level headings?
 
-    A header
-    --------
+    A heading
+    ---------
 
 We need to handle those too.  Finally, can we be sure that adding
 asterisks to each side of our string will put it in italics?
@@ -76,25 +109,23 @@ end up with bold text, which is not what we want.  And what if it
 contains a regular unescaped asterisk?
 
 How would you modify your regular expression to handle these cases?  It
-would be hairy, to say the least.  What we need is a real parser.
+would be hairy, to say the least.
 
-Well, pandoc has a real markdown parser, the library function
-`readMarkdown`.  This transforms markdown text to an abstract syntax tree
-(AST) that represents the document structure.  Why not manipulate the
-AST directly in a short Haskell script, then convert the result back to
-markdown using `writeMarkdown`?
+A better approach is to let pandoc handle the parsing, and
+then modify the AST before the document is written.  For this,
+we can use a filter.
 
-First, let's see what this AST looks like.  We can use pandoc's `native`
-output format:
+To see what sort of AST is produced when pandoc parses our text,
+we can use pandoc's `native` output format:
 
 ~~~~
 % cat test.txt
-## my header
+## my heading
 
 text with *italics*
 % pandoc -s -t native test.txt
 Pandoc (Meta {unMeta = fromList []})
-[Header 3 ("my-header",[],[]) [Str "my",Space,Str "header"]
+[Header 2 ("my-heading",[],[]) [Str "my",Space,Str "heading"]
 ,Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]]]
 ~~~~
 
@@ -106,136 +137,46 @@ the pandoc AST, see the
 [haddock documentation for `Text.Pandoc.Definition`].
 
 [haddock documentation for `Text.Pandoc.Definition`]:
 https://hackage.haskell.org/package/pandoc-types
 
-Here's a short Haskell script that reads markdown, changes level
-2+ headers to regular paragraphs, and writes the result as markdown.
-If you save it as `behead.hs`, you can run it using `runhaskell behead.hs`.
-It will act like a unix pipe, reading from `stdin` and writing to `stdout`.
-Or, if you want, you can compile it, using `ghc --make behead`, then run
-the resulting executable `behead`.
-
-~~~~ {.haskell}
--- behead.hs
-import Text.Pandoc
-import Text.Pandoc.Walk (walk)
-
-behead :: Block -> Block
-behead (Header n _ xs) | n >= 2 = Para [Emph xs]
-behead x = x
-
-readDoc :: String -> Pandoc
-readDoc s = readMarkdown def s
--- or, for pandoc 1.14 and greater, use:
--- readDoc s = case readMarkdown def s of
---               Right doc -> doc
---               Left err  -> error (show err)
-
-writeDoc :: Pandoc -> String
-writeDoc doc = writeMarkdown def doc
-
-main :: IO ()
-main = interact (writeDoc . walk behead . readDoc)
-~~~~
-
-The magic here is the `walk` function, which converts
-our `behead` function (a function from `Block` to `Block`) to
-a transformation on whole `Pandoc` documents.
-(See the [haddock documentation for `Text.Pandoc.Walk`].)
-
-[haddock documentation for `Text.Pandoc.Walk`]:
-https://hackage.haskell.org/package/pandoc-types
-
-# Queries: listing URLs
-
-We can use this same technique to do much more complex transformations
-and queries.  Here's how we could extract all the URLs linked to in
-a markdown document (again, not an easy task with regular expressions):
-
-~~~~ {.haskell}
--- extracturls.hs
-import Text.Pandoc
-
-extractURL :: Inline -> [String]
-extractURL (Link _ _ (u,_)) = [u]
-extractURL (Image _ _ (u,_)) = [u]
-extractURL _ = []
-
-extractURLs :: Pandoc -> [String]
-extractURLs = query extractURL
-
-readDoc :: String -> Pandoc
-readDoc = readMarkdown def
--- or, for pandoc 1.14, use:
--- readDoc s = case readMarkdown def s of
---               Right doc -> doc
---               Left err  -> error (show err)
-
-main :: IO ()
-main = interact (unlines . extractURLs . readDoc)
-~~~~
-
-`query` is the query counterpart of `walk`: it lifts
-a function that operates on `Inline` elements to one that operates
-on the whole `Pandoc` AST.  The results returned by applying
-`extractURL` to each `Inline` element are concatenated in the
-result.
-
-# JSON filters
-
-`behead.hs` is a very special-purpose program.  It reads a
-specific input format (markdown) and writes a specific output format
-(markdown), with a specific set of options (here, the defaults).
-But the basic operation it performs is one that would be useful
-in many document transformations.  It would be nice to isolate the
-part of the program that transforms the pandoc AST, leaving the rest
-to pandoc itself.  What we want is a *filter* that *just* operates
-on the AST---or rather, on a JSON representation of the AST that
-pandoc can produce and consume:
-
-     source format
-          ↓
-       (pandoc)
-          ↓
-  JSON-formatted AST
-          ↓
-       (filter)
-          ↓
-  JSON-formatted AST
-          ↓
-       (pandoc)
-          ↓
-     target format
-
-The module `Text.Pandoc.JSON` (from `pandoc-types`) contains a
-function `toJSONFilter` that makes it easy to write such
-filters.  Here is a filter version of `behead.hs`:
+We can use Haskell to create a JSON filter that transforms this
+AST, replacing each `Header` block of level >= 2 with a `Para`
+whose contents are wrapped inside an `Emph` inline:
 
 ~~~~ {.haskell}
 #!/usr/bin/env runhaskell
--- behead2.hs
+-- behead.hs
 import Text.Pandoc.JSON
 
 main :: IO ()
 main = toJSONFilter behead
-  where behead (Header n _ xs) | n >= 2 = Para [Emph xs]
-        behead x = x
+
+behead :: Block -> Block
+behead (Header n _ xs) | n >= 2 = Para [Emph xs]
+behead x = x
 ~~~~
 
-It can be used this way:
+The `toJSONFilter` function does two things.  First, it lifts
+the `behead` function (which maps `Block -> Block`) onto a
+transformation of the entire `Pandoc` AST, walking the AST
+and transforming each block.  Second, it wraps this `Pandoc ->
+Pandoc` transformation with the necessary JSON serialization
+and deserialization, producing an executable that consumes
+JSON from stdin and produces JSON on stdout.
 
-    pandoc -f SOURCEFORMAT -t json | runhaskell behead2.hs | \
-    pandoc -f json -t TARGETFORMAT
+To use the filter, make it executable:
 
-But it is easier to use the `--filter` option with pandoc:
+    chmod +x behead.hs
 
-    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2.hs
+and then
 
-Note that this approach requires that `behead2.hs` be executable,
-so we must
+    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead.hs
 
-    chmod +x behead2.hs
+(It is also necessary that `pandoc-types` be installed in the
+local package repository: `cabal install pandoc-types` should
+ensure this.)  Alternatively, we could compile the filter:
 
-    ghc --make behead2.hs
+    ghc --make behead.hs
     pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead
 
 Note that if the filter is placed in the system PATH, then the initial
@@ -243,6 +184,7 @@ Note that if the filter is placed in the system PATH, then the initial
 multiple instances of `--filter`: the filters will be applied in
 sequence.
 
+
 # LaTeX for WordPress
 
 Another easy example.  WordPress blogs require a special format for
@@ -283,7 +225,7 @@ Here's our "beheading" filter in python:
 
 #!/usr/bin/env python
 """
-Pandoc filter to convert all level 2+ headers to paragraphs with
+Pandoc filter to convert all level 2+ headings to paragraphs with
 emphasized text.
 """