aboutsummaryrefslogtreecommitdiff
path: root/doc/filters.md
blob: c3edd0e46d4888113d23892e45b6208936470dc5 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
% Pandoc filters
% John MacFarlane

# Summary

Pandoc provides an interface for users to write programs (known
as filters) which act on pandoc’s AST.

Pandoc consists of a set of readers and writers. When converting
a document from one format to another, text is parsed by a
reader into pandoc’s intermediate representation of the
document---an "abstract syntax tree" or AST---which is then
converted by the writer into the target format.
The pandoc AST format is defined in the module
[`Text.Pandoc.Definition` in the `pandoc-types` package
](https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html).

A "filter" is a program that modifies the AST, between the
reader and the writer:

    INPUT --reader--> AST --filter--> AST --writer--> OUTPUT

Filters are "pipes" that read from standard input and write to
standard output.  They consume and produce a JSON representation
of the pandoc AST.  (In recent versions, this representation
includes a `pandoc-api-version` field which refers to a
version of `pandoc-types`.)  Filters may be written in any programming
language.  To use a filter, you need only specify it on the
command line using `--filter`, e.g.

    pandoc -s input.txt --filter pandoc-citeproc -o output.htl

For a gentle introduction into writing your own filters,
continue this guide. There’s also a [list of third party filters
on the wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).


# A simple example

Suppose you wanted to replace all level 2+ headers in a markdown
document with regular paragraphs, with text in italics. How would you go
about doing this?

A first thought would be to use regular expressions. Something
like this:

    perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt

This should work most of the time.  But don't forget
that ATX style headers can end with a sequence of `#`s
that is not part of the header text:

    ## My header ##

And what if your document contains a line starting with `##` in an HTML
comment or delimited code block?

    <!--
    ## This is just a comment
    -->

    ~~~~
    ### A third level header in standard markdown
    ~~~~

We don't want to touch *these* lines.  Moreover, what about setext
style second-level headers?

    A header
    --------

We need to handle those too.  Finally, can we be sure that adding
asterisks to each side of our string will put it in italics?
What if the string already contains asterisks around it? Then we'll
end up with bold text, which is not what we want. And what if it contains
a regular unescaped asterisk?

How would you modify your regular expression to handle these cases? It
would be hairy, to say the least. What we need is a real parser.

Well, pandoc has a real markdown parser, the library function
`readMarkdown`. This transforms markdown text to an abstract syntax tree
(AST) that represents the document structure. Why not manipulate the
AST directly in a short Haskell script, then convert the result back to
markdown using `writeMarkdown`?

First, let's see what this AST looks like. We can use pandoc's `native`
output format:

~~~~
% cat test.txt
## my header

text with *italics*
% pandoc -s -t native test.txt
Pandoc (Meta {unMeta = fromList []})
[Header 3 ("my-header",[],[]) [Str "My",Space,Str "header"]
, Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]] ]
~~~~

A `Pandoc` document consists of a `Meta` block (containing
metadata like title, authors, and date) and a list of `Block`
 elements.  In this case, we have two `Block`s, a `Header` and a `Para`.
Each has as its content a list of `Inline` elements.  For more details on
the pandoc AST, see the [haddock documentation for `Text.Pandoc.Definition`].

[haddock documentation for `Text.Pandoc.Definition`]: http://hackage.haskell.org/package/pandoc-types

Here's a short Haskell script that reads markdown, changes level
2+ headers to regular paragraphs, and writes the result as markdown.
If you save it as `behead.hs`, you can run it using `runhaskell behead.hs`.
It will act like a unix pipe, reading from `stdin` and writing to `stdout`.
Or, if you want, you can compile it, using `ghc --make behead`, then run
the resulting executable `behead`.

~~~~                          {.haskell}
-- behead.hs
import Text.Pandoc
import Text.Pandoc.Walk (walk)

behead :: Block -> Block
behead (Header n _ xs) | n >= 2 = Para [Emph xs]
behead x = x

readDoc :: String -> Pandoc
readDoc s = readMarkdown def s
-- or, for pandoc 1.14 and greater, use:
-- readDoc s = case readMarkdown def s of
--                  Right doc -> doc
--                  Left err  -> error (show err)

writeDoc :: Pandoc -> String
writeDoc doc = writeMarkdown def doc

main :: IO ()
main = interact (writeDoc . walk behead . readDoc)
~~~~

The magic here is the `walk` function, which converts
our `behead` function (a function from `Block` to `Block`) to
a transformation on whole `Pandoc` documents.
(See the [haddock documentation for `Text.Pandoc.Walk`].)

[haddock documentation for `Text.Pandoc.Walk`]: http://hackage.haskell.org/package/pandoc-types

# Queries: listing URLs

We can use this same technique to do much more complex transformations
and queries.  Here's how we could extract all the URLs linked to in
a markdown document (again, not an easy task with regular expressions):

~~~~                          {.haskell}
-- extracturls.hs
import Text.Pandoc

extractURL :: Inline -> [String]
extractURL (Link _ _ (u,_)) = [u]
extractURL (Image _ _ (u,_)) = [u]
extractURL _ = []

extractURLs :: Pandoc -> [String]
extractURLs = query extractURL

readDoc :: String -> Pandoc
readDoc = readMarkdown def
-- or, for pandoc 1.14, use:
-- readDoc s = case readMarkdown def s of
--                Right doc -> doc
--                Left err  -> error (show err)

main :: IO ()
main = interact (unlines . extractURLs . readDoc)
~~~~

`query` is the query counterpart of `walk`: it lifts
a function that operates on `Inline` elements to one that operates
on the whole `Pandoc` AST.  The results returned by applying
`extractURL` to each `Inline` element are concatenated in the
result.

# JSON filters

`behead.hs` is a very special-purpose program.  It reads a
specific input format (markdown) and writes a specific output format
(HTML), with a specific set of options (here, the defaults).
But the basic operation it performs is one that would be useful
in many document transformations.  It would be nice to isolate the
part of the program that transforms the pandoc AST, leaving the rest
to pandoc itself.  What we want is a *filter* that *just* operates
on the AST---or rather, on a JSON representation of the AST that
pandoc can produce and consume:

                             source format
                                  ↓
                               (pandoc)
                                  ↓
                          JSON-formatted AST
                                  ↓
                               (filter)
                                  ↓
                          JSON-formatted AST
                                  ↓
                               (pandoc)
                                  ↓
                            target format

The module `Text.Pandoc.JSON` (from `pandoc-types`) contains a
function `toJSONFilter` that makes it easy to write such
filters.  Here is a filter version of `behead.hs`:

~~~~                          {.haskell}
#!/usr/bin/env runhaskell
-- behead2.hs
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter behead
  where behead (Header n _ xs) | n >= 2 = Para [Emph xs]
        behead x = x
~~~~

It can be used this way:

    pandoc -f SOURCEFORMAT -t json | runhaskell behead2.hs | \
      pandoc -f json -t TARGETFORMAT

But it is easier to use the `--filter` option with pandoc:

    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead2.hs

Note that this approach requires that `behead2.hs` be executable,
so we must

    chmod +x behead2.hs

Alternatively, we could compile the filter:

    ghc --make behead2.hs
    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead

Note that if the filter is placed in the system PATH, then the initial
`./` is not needed.  Note also that the command line can include
multiple instances of `--filter`:  the filters will be applied in
sequence.

# LaTeX for WordPress

Another easy example. WordPress blogs require a special format for
LaTeX math.  Instead of `$e=mc^2$`, you need: `$LaTeX e=mc^2$`.
How can we convert a markdown document accordingly?

Again, it's difficult to do the job reliably with regexes.
A `$` might be a regular currency indicator, or it might occur in
a comment or code block or inline code span.  We just want to find
the `$`s that begin LaTeX math. If only we had a parser...

We do.  Pandoc already extracts LaTeX math, so:

~~~~                          {.haskell}
#!/usr/bin/env runhaskell
-- wordpressify.hs
import Text.Pandoc.JSON

main = toJSONFilter wordpressify
  where wordpressify (Math x y) = Math x ("LaTeX " ++ y)
        wordpressify x = x
~~~~

Mission accomplished. (I've omitted type signatures here,
just to show it can be done.)


# But I don't want to learn Haskell!

While it's easiest to write pandoc filters in Haskell, it is fairly
easy to write them in python using the `pandocfilters` package.
The package is in PyPI and can be installed using `pip install
pandocfilters` or `easy_install pandocfilters`.

Here's our "beheading" filter in python:

~~~ {.python}
#!/usr/bin/env python

"""
Pandoc filter to convert all level 2+ headers to paragraphs with
emphasized text.
"""

from pandocfilters import toJSONFilter, Emph, Para

def behead(key, value, format, meta):
  if key == 'Header' and value[0] >= 2:
    return Para([Emph(value[2])])

if __name__ == "__main__":
  toJSONFilter(behead)
~~~

`toJSONFilter(behead)` walks the AST and applies the `behead` action
to each element.  If `behead` returns nothing, the node is unchanged;
if it returns an object, the node is replaced; if it returns a list,
the new list is spliced in.

Note that, although these parameters are not used in this example,
`format` provides access to the target format, and `meta` provides access to
the document's metadata.

There are many examples of python filters in [the pandocfilters
repository](http://github.com/jgm/pandocfilters).

For a more Pythonic alternative to pandocfilters, see
the [panflute](http://scorreia.com/software/panflute/) library.
Don't like Python?   There are also ports of pandocfilters in
[PHP](https://github.com/vinai/pandocfilters-php),
[perl](https://metacpan.org/pod/Pandoc::Filter),
[javascript/node.js](https://github.com/mvhenderson/pandoc-filter-node),
[Groovy](https://github.com/dfrommi/groovy-pandoc), and
[Ruby](https://heerdebeer.org/Software/markdown/paru/).

Starting with pandoc 2.0, pandoc includes built-in support for
writing filters in lua.  The lua interpreter is built in to
pandoc, so a lua filter does not require any additional software
to run.  See the [documentation on lua
filters](lua-filters.html).

# Include files

So none of our transforms have involved IO. How about a script that
reads a markdown document, finds all the inline code blocks with
attribute `include`, and replaces their contents with the contents of
the file given?

~~~~                          {.haskell}
#!/usr/bin/env runhaskell
-- includes.hs
import Text.Pandoc.JSON

doInclude :: Block -> IO Block
doInclude cb@(CodeBlock (id, classes, namevals) contents) =
  case lookup "include" namevals of
       Just f     -> return . (CodeBlock (id, classes, namevals)) =<< readFile f
       Nothing    -> return cb
doInclude x = return x

main :: IO ()
main = toJSONFilter doInclude
~~~~

Try this on the following:

    Here's the pandoc README:

    ~~~~ {include="README"}
    this will be replaced by contents of README
    ~~~~

# Removing links

What if we want to remove every link from a document, retaining
the link's text?

~~~~                          {.haskell}
#!/usr/bin/env runhaskell
-- delink.hs
import Text.Pandoc.JSON

main = toJSONFilter delink

delink :: Inline -> [Inline]
delink (Link _ txt _) = txt
delink x              = [x]
~~~~

Note that `delink` can't be a function of type `Inline -> Inline`,
because the thing we want to replace the link with is not a single
`Inline` element, but a list of them. So we make `delink` a function
from an `Inline` element to a list of `Inline` elements.
`toJSONFilter` can still lift this function to a transformation of type
`Pandoc -> Pandoc`.

# A filter for ruby text

Finally, here's a nice real-world example, developed on the
[pandoc-discuss](http://groups.google.com/group/pandoc-discuss/browse_thread/thread/7baea325565878c8) list.  Qubyte wrote:

> I'm interested in using pandoc to turn my markdown notes on Japanese
> into nicely set HTML and (Xe)LaTeX. With HTML5, ruby (typically used to
> phonetically read chinese characters by placing text above or to the
> side) is standard, and support from browsers is emerging (Webkit based
> browsers appear to fully support it). For those browsers that don't
> support it yet (notably Firefox) the feature falls back in a nice way
> by placing the phonetic reading inside brackets to the side of each
> Chinese character, which is suitable for other output formats too. As
> for (Xe)LaTeX, ruby is not an issue.
>
> At the moment, I use inline HTML to achieve the result when the
> conversion is to HTML, but it's ugly and uses a lot of keystrokes, for
> example
>
> ~~~ {.xml}
> <ruby>ご<rt></rt>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby>
> ~~~
>
> sets ご飯 "gohan" with "han" spelt phonetically above the second
> character, or to the right of it in brackets if the browser does not
> support ruby.  I'd like to have something more like
>
>     r[はん](飯)
>
> or any keystroke saving convention would be welcome.

We came up with the following script, which uses the convention that a
markdown link with a URL beginning with a hyphen is interpreted as ruby:

    [はん](-飯)

~~~ {.haskell}
-- handleruby.hs
import Text.Pandoc.JSON
import System.Environment (getArgs)

handleRuby :: Maybe Format -> Inline -> Inline
handleRuby (Just format) (Link _ [Str ruby] ('-':kanji,_))
  | format == Format "html"  = RawInline format
    $ "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "</rt><rp>)</rp></ruby>"
  | format == Format "latex" = RawInline format
    $ "\\ruby{" ++ kanji ++ "}{" ++ ruby ++ "}"
  | otherwise = Str ruby
handleRuby _ x = x

main :: IO ()
main = toJSONFilter handleRuby
~~~

Note that, when a script is called using `--filter`, pandoc passes
it the target format as the first argument.  When a function's
first argument is of type `Maybe Format`, `toJSONFilter` will
automatically assign it `Just` the target format or `Nothing`.

We compile our script:

    ghc --make handleRuby

Then run it:

    % pandoc -F ./handleRuby -t html
    [はん](-飯)
    ^D
    <p><ruby>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby></p>
    % pandoc -F ./handleRuby -t latex
    [はん](-飯)
    ^D
    \ruby{飯}{はん}

# Exercises

1.  Put all the regular text in a markdown document in ALL CAPS
    (without touching text in URLs or link titles).

2.  Remove all horizontal rules from a document.

3.  Renumber all enumerated lists with roman numerals.

4.  Replace each delimited code block with class `dot` with an
    image generated by running `dot -Tpng` (from graphviz) on the
    contents of the code block.

5.  Find all code blocks with class `python` and run them
    using the python interpreter, printing the results to the console.