author | John MacFarlane <jgm@berkeley.edu> | 2020-01-14 11:18:24 -0800
---|---|---
committer | John MacFarlane <jgm@berkeley.edu> | 2020-01-14 11:18:24 -0800
commit | dfac1239d94401bb45fce65d74fa26f360c6decd (patch) |
tree | 808e242e83210932b7236eed0d036af96a761c23 |
parent | 9009bda1792e1db5d019d63c16f40ce9df269724 (diff) |
download | pandoc-dfac1239d94401bb45fce65d74fa26f360c6decd.tar.gz |
Update filter documentation.
Remove example using pandoc API directly (we have other
docs for that and it was outdated).
Closes #6065.
-rw-r--r-- | doc/filters.md | 226 |
1 file changed, 84 insertions, 142 deletions
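The JSON-filter mechanism this patch documents can be sketched outside Haskell as well. Below is a minimal, standalone Python sketch of the "behead" transformation the documentation uses as its running example. It is not part of the patch: the names `behead` and `filter_doc` are hypothetical, it walks only top-level blocks, and it assumes the pandoc-types >= 1.17 JSON encoding in which every AST element is an object with `t` (tag) and `c` (contents) fields, and a `Header`'s contents are `[level, attr, inlines]`.

```python
import json
import sys

def behead(block):
    """Replace a Header block of level >= 2 with a Para whose
    inlines are wrapped in Emph (assumes pandoc-types >= 1.17
    JSON shape: Header contents are [level, attr, inlines])."""
    if block.get("t") == "Header" and block["c"][0] >= 2:
        return {"t": "Para", "c": [{"t": "Emph", "c": block["c"][2]}]}
    return block

def filter_doc(doc):
    # A real filter would also walk nested blocks (e.g. inside
    # BlockQuote or Div); top-level blocks suffice to illustrate.
    return {**doc, "blocks": [behead(b) for b in doc["blocks"]]}

if __name__ == "__main__":
    # Like toJSONFilter: JSON AST in on stdin, JSON AST out on stdout.
    sys.stdout.write(json.dumps(filter_doc(json.load(sys.stdin))))
```

With pandoc installed, this could sit in a pipeline the same way as the Haskell version: `pandoc -t json | python behead.py | pandoc -f json -t markdown`.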
diff --git a/doc/filters.md b/doc/filters.md
index 0b48c8002..b7d6aa45d 100644
--- a/doc/filters.md
+++ b/doc/filters.md
@@ -16,19 +16,52 @@ The pandoc AST format is defined in the module
 ](https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html).
 
 A "filter" is a program that modifies the AST, between the
-reader and the writer:
+reader and the writer.
 
     INPUT --reader--> AST --filter--> AST --writer--> OUTPUT
 
-Filters are "pipes" that read from standard input and write to
-standard output.  They consume and produce a JSON representation
-of the pandoc AST.  (In recent versions, this representation
-includes a `pandoc-api-version` field which refers to a
-version of `pandoc-types`.)  Filters may be written in any programming
-language.  To use a filter, you need only specify it on the
-command line using `--filter`, e.g.
+Pandoc supports two kinds of filters:
 
-    pandoc -s input.txt --filter pandoc-citeproc -o output.html
+- **Lua filters** use the Lua language to
+  define transformations on the pandoc AST.  They are
+  described in a [separate document](lua-filters.html).
+
+- **JSON filters**, described here, are pipes that read from
+  standard input and write to standard output, consuming and
+  producing a JSON representation of the pandoc AST:
+
+         source format
+              ↓
+           (pandoc)
+              ↓
+      JSON-formatted AST
+              ↓
+        (JSON filter)
+              ↓
+      JSON-formatted AST
+              ↓
+           (pandoc)
+              ↓
+         target format
+
+Lua filters have a couple of advantages.  They use a Lua
+interpreter that is embedded in pandoc, so you don't need
+to have any external software installed.  And they are
+usually faster than JSON filters.  But if you wish to
+write your filter in a language other than Lua, you may
+prefer to use a JSON filter.  JSON filters may be written
+in any programming language.
+
+You can use a JSON filter directly in a pipeline:
+
+    pandoc -s input.txt -t json | \
+    pandoc-citeproc | \
+    pandoc -s -f json -o output.html
+
+But it is more convenient to use the `--filter` option,
+which handles the plumbing automatically:
+
+    pandoc -s input.txt --filter pandoc-citeproc -o output.html
 
 For a gentle introduction to writing your own filters,
 continue this guide.  There’s also a [list of third party filters
@@ -37,7 +70,7 @@ on the wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).
 
 # A simple example
 
-Suppose you wanted to replace all level 2+ headers in a markdown
+Suppose you wanted to replace all level 2+ headings in a markdown
 document with regular paragraphs, with text in italics.  How would
 you go about doing this?
 
@@ -47,10 +80,10 @@ like this:
 
     perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt
 
 This should work most of the time.  But don't forget
-that ATX style headers can end with a sequence of `#`s
-that is not part of the header text:
+that ATX style headings can end with a sequence of `#`s
+that is not part of the heading text:
 
-    ## My header ##
+    ## My heading ##
 
 And what if your document contains a line starting with `##` in an HTML
 comment or delimited code block?
 
@@ -60,14 +93,14 @@ comment or delimited code block?
     -->
 
     ~~~~
-    ### A third level header in standard markdown
+    ### A third level heading in standard markdown
     ~~~~
 
-We don't want to touch *these* lines.  Moreover, what about setext
-style second-level headers?
+We don't want to touch *these* lines.  Moreover, what about Setext
+style second-level headings?
 
-    A header
-    --------
+    A heading
+    ---------
 
 We need to handle those too.  Finally, can we be sure that adding
 asterisks to each side of our string will put it in italics?
@@ -76,25 +109,23 @@ end up with bold text, which is not what we want.  And what if it
 contains a regular unescaped asterisk?
 
 How would you modify your regular expression to handle these cases?  It
-would be hairy, to say the least.  What we need is a real parser.
+would be hairy, to say the least.
 
-Well, pandoc has a real markdown parser, the library function
-`readMarkdown`.  This transforms markdown text to an abstract syntax tree
-(AST) that represents the document structure.  Why not manipulate the
-AST directly in a short Haskell script, then convert the result back to
-markdown using `writeMarkdown`?
+A better approach is to let pandoc handle the parsing, and
+then modify the AST before the document is written.  For this,
+we can use a filter.
 
-First, let's see what this AST looks like.  We can use pandoc's `native`
-output format:
+To see what sort of AST is produced when pandoc parses our text,
+we can use pandoc's `native` output format:
 
 ~~~~
 % cat test.txt
-## my header
+## my heading
 
 text with *italics*
 % pandoc -s -t native test.txt
 Pandoc (Meta {unMeta = fromList []})
-[Header 3 ("my-header",[],[]) [Str "my",Space,Str "header"]
+[Header 2 ("my-heading",[],[]) [Str "my",Space,Str "heading"]
 ,Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]]]
 ~~~~
 
@@ -106,136 +137,46 @@ the pandoc AST, see the
 [haddock documentation for `Text.Pandoc.Definition`].
 
 [haddock documentation for `Text.Pandoc.Definition`]:
 https://hackage.haskell.org/package/pandoc-types
 
-Here's a short Haskell script that reads markdown, changes level
-2+ headers to regular paragraphs, and writes the result as markdown.
-If you save it as `behead.hs`, you can run it using `runhaskell behead.hs`.
-It will act like a unix pipe, reading from `stdin` and writing to `stdout`.
-Or, if you want, you can compile it, using `ghc --make behead`, then run
-the resulting executable `behead`.
-
-~~~~ {.haskell}
--- behead.hs
-import Text.Pandoc
-import Text.Pandoc.Walk (walk)
-
-behead :: Block -> Block
-behead (Header n _ xs) | n >= 2 = Para [Emph xs]
-behead x = x
-
-readDoc :: String -> Pandoc
-readDoc s = readMarkdown def s
--- or, for pandoc 1.14 and greater, use:
--- readDoc s = case readMarkdown def s of
---               Right doc -> doc
---               Left err  -> error (show err)
-
-writeDoc :: Pandoc -> String
-writeDoc doc = writeMarkdown def doc
-
-main :: IO ()
-main = interact (writeDoc . walk behead . readDoc)
-~~~~
-
-The magic here is the `walk` function, which converts
-our `behead` function (a function from `Block` to `Block`) to
-a transformation on whole `Pandoc` documents.
-(See the [haddock documentation for `Text.Pandoc.Walk`].)
-
-[haddock documentation for `Text.Pandoc.Walk`]:
-https://hackage.haskell.org/package/pandoc-types
-
-# Queries: listing URLs
-
-We can use this same technique to do much more complex transformations
-and queries.  Here's how we could extract all the URLs linked to in
-a markdown document (again, not an easy task with regular expressions):
-
-~~~~ {.haskell}
--- extracturls.hs
-import Text.Pandoc
-
-extractURL :: Inline -> [String]
-extractURL (Link _ _ (u,_)) = [u]
-extractURL (Image _ _ (u,_)) = [u]
-extractURL _ = []
-
-extractURLs :: Pandoc -> [String]
-extractURLs = query extractURL
-
-readDoc :: String -> Pandoc
-readDoc = readMarkdown def
--- or, for pandoc 1.14, use:
--- readDoc s = case readMarkdown def s of
---               Right doc -> doc
---               Left err  -> error (show err)
-
-main :: IO ()
-main = interact (unlines . extractURLs . readDoc)
-~~~~
-
-`query` is the query counterpart of `walk`: it lifts
-a function that operates on `Inline` elements to one that operates
-on the whole `Pandoc` AST.  The results returned by applying
-`extractURL` to each `Inline` element are concatenated in the
-result.
-
-# JSON filters
-
-`behead.hs` is a very special-purpose program.  It reads a
-specific input format (markdown) and writes a specific output format
-(markdown), with a specific set of options (here, the defaults).
-But the basic operation it performs is one that would be useful
-in many document transformations.  It would be nice to isolate the
-part of the program that transforms the pandoc AST, leaving the rest
-to pandoc itself.  What we want is a *filter* that *just* operates
-on the AST---or rather, on a JSON representation of the AST that
-pandoc can produce and consume:
-
-     source format
-          ↓
-       (pandoc)
-          ↓
-  JSON-formatted AST
-          ↓
-       (filter)
-          ↓
-  JSON-formatted AST
-          ↓
-       (pandoc)
-          ↓
-     target format
-
-The module `Text.Pandoc.JSON` (from `pandoc-types`) contains a
-function `toJSONFilter` that makes it easy to write such
-filters.  Here is a filter version of `behead.hs`:
+We can use Haskell to create a JSON filter that transforms this
+AST, replacing each `Header` block of level >= 2 with a `Para`
+whose contents are wrapped inside an `Emph` inline:
 
 ~~~~ {.haskell}
 #!/usr/bin/env runhaskell
--- behead2.hs
+-- behead.hs
 import Text.Pandoc.JSON
 
 main :: IO ()
 main = toJSONFilter behead
-  where behead (Header n _ xs) | n >= 2 = Para [Emph xs]
-        behead x = x
+
+behead :: Block -> Block
+behead (Header n _ xs) | n >= 2 = Para [Emph xs]
+behead x = x
 ~~~~
 
-It can be used this way:
+The `toJSONFilter` function does two things.  First, it lifts
+the `behead` function (which maps `Block -> Block`) onto a
+transformation of the entire `Pandoc` AST, walking the AST
+and transforming each block.  Second, it wraps this `Pandoc ->
+Pandoc` transformation with the necessary JSON serialization
+and deserialization, producing an executable that consumes
+JSON from stdin and produces JSON on stdout.
 
-    pandoc -f SOURCEFORMAT -t json | runhaskell behead2.hs | \
-    pandoc -f json -t TARGETFORMAT
+To use the filter, make it executable:
 
-But it is easier to use the `--filter` option with pandoc:
+    chmod +x behead.hs
 
-    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2.hs
+and then
 
-Note that this approach requires that `behead2.hs` be executable,
-so we must
+    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead.hs
 
-    chmod +x behead2.hs
+(It is also necessary that `pandoc-types` be installed in the
+local package repository: `cabal install pandoc-types` should
+ensure this.)  Alternatively, we could compile the filter:
 
-    ghc --make behead2.hs
+    ghc --make behead.hs
     pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead
 
 Note that if the filter is placed in the system PATH, then the initial
@@ -243,6 +184,7 @@ Note that if the filter is placed in the system PATH, then the initial
 multiple instances of `--filter`: the filters will be applied in
 sequence.
 
+
 # LaTeX for WordPress
 
 Another easy example.  WordPress blogs require a special format for
@@ -283,7 +225,7 @@ Here's our "beheading" filter in python:
 
 #!/usr/bin/env python
 """
-Pandoc filter to convert all level 2+ headers to paragraphs with
+Pandoc filter to convert all level 2+ headings to paragraphs with
 emphasized text.
 """