% Using the pandoc API
% John MacFarlane

Pandoc can be used as a Haskell library, to write your own
conversion tools or power a web application.  This document
offers an introduction to using the pandoc API.

Detailed API documentation at the level of individual functions
and types is available at
<https://hackage.haskell.org/package/pandoc>.

# Pandoc's architecture

Pandoc is structured as a set of *readers*, which translate
various input formats into an abstract syntax tree (the
Pandoc AST) representing a structured document, and a set of
*writers*, which render this AST into various output formats.
Pictorially:

```
[input format] ==reader==> [Pandoc AST] ==writer==> [output format]
```

This architecture allows pandoc to perform $M \times N$
conversions with $M$ readers and $N$ writers.

The Pandoc AST is defined in the
[pandoc-types](https://hackage.haskell.org/package/pandoc-types)
package.  You should start by looking at the Haddock
documentation for [Text.Pandoc.Definition].  As you'll see, a
`Pandoc` is composed of some metadata and a list of `Block`s.
There are various kinds of `Block`, including `Para`
(paragraph), `Header` (section heading), and `BlockQuote`.  Some
of the `Block`s (like `BlockQuote`) contain lists of `Block`s,
while others (like `Para`) contain lists of `Inline`s, and still
others (like `CodeBlock`) contain plain text or nothing.
`Inline`s are the basic elements of paragraphs.  The distinction
between `Block` and `Inline` in the type system makes it
impossible to represent, for example, a link (`Inline`) whose
link text is a block quote (`Block`).  This expressive
limitation is mostly a help rather than a hindrance, since many
of the formats pandoc supports have similar limitations.
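
To make the shape of the AST concrete, here is a minimal sketch that builds a small document directly from the constructors in `Text.Pandoc.Definition`.  The names `sampleDoc` and `topLevelBlocks` are made up for illustration, and the sketch assumes a version of `pandoc-types` in which `Str` and `CodeBlock` carry `Text` (hence the `OverloadedStrings` pragma).

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition

-- A tiny document written directly with AST constructors
-- (sampleDoc is a hypothetical name used only in this sketch).
sampleDoc :: Pandoc
sampleDoc = Pandoc nullMeta
  [ Header 1 nullAttr [Str "Greeting"]                      -- contains Inlines
  , Para [Str "Hello,", Space, Emph [Str "world"], Str "!"] -- contains Inlines
  , BlockQuote [Para [Str "A", Space, Str "quotation."]]    -- contains Blocks
  , CodeBlock nullAttr "main = return ()"                   -- contains plain text
  ]

-- Count the top-level blocks of a document.
topLevelBlocks :: Pandoc -> Int
topLevelBlocks (Pandoc _ blocks) = length blocks
```

Trying to put a `BlockQuote` inside the `Emph` above would be rejected by the type checker, which is the limitation described in the preceding paragraph.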

The best way to explore the pandoc AST is to use `pandoc -t
native`, which will display the AST corresponding to some
Markdown input:

```
% echo -e "1. *foo*\n2. bar" | pandoc -t native
[OrderedList (1,Decimal,Period)
 [[Plain [Emph [Str "foo"]]]
 ,[Plain [Str "bar"]]]]
```

# A simple example

Here is a simple example of the use of a pandoc reader and
writer to perform a conversion:

```haskell
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  result <- runIO $ do
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
  rst <- handleError result
  TIO.putStrLn rst
```

Some notes:

1. The first part constructs a conversion pipeline: the input
   string is passed to `readMarkdown`, and the resulting Pandoc
   AST (`doc`) is then rendered by `writeRST`.  The conversion
   pipeline is "run" by `runIO`---more on that below.

2. `result` has the type `Either PandocError Text`.  We could
   pattern-match on this manually, but it's simpler in this
   context to use the `handleError` function from
   Text.Pandoc.Error.  This exits with an appropriate error
   code and message if the value is a `Left`, and returns the
   `Text` if the value is a `Right`.
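
For comparison, the manual pattern match mentioned in the second note might look roughly like this.  It is only a sketch: `markdownToRST` and `printConversion` are hypothetical names, and the error is rendered with `show` simply because that needs nothing beyond the `Show` instance of `PandocError`.

```haskell
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (hPutStrLn, stderr)

-- Convert Markdown to reStructuredText, returning the Either unchanged.
markdownToRST :: T.Text -> IO (Either PandocError T.Text)
markdownToRST input = runIO $ do
  doc <- readMarkdown def input
  writeRST def doc

-- Handle both outcomes explicitly instead of calling handleError.
printConversion :: T.Text -> IO ()
printConversion input = do
  result <- markdownToRST input
  case result of
    Left err  -> hPutStrLn stderr ("conversion failed: " ++ show err)
    Right rst -> TIO.putStrLn rst
```

Unlike `handleError`, this version does not exit the program on failure, which is usually what a long-running application wants.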

# The PandocMonad class

Let's look at the types of `readMarkdown` and `writeRST`:

```haskell
readMarkdown :: (PandocMonad m, ToSources a)
             => ReaderOptions
             -> a
             -> m Pandoc
writeRST     :: PandocMonad m
             => WriterOptions
             -> Pandoc
             -> m Text
```

The `PandocMonad m =>` part is a typeclass constraint.
It says that `readMarkdown` and `writeRST` define computations
that can be used in any instance of the `PandocMonad`
type class.  `PandocMonad` is defined in the module
[Text.Pandoc.Class].

Two instances of `PandocMonad` are provided: `PandocIO` and
`PandocPure`. The difference is that computations run in
`PandocIO` are allowed to do IO (for example, read a file),
while computations in `PandocPure` are free of any side effects.
`PandocPure` is useful for sandboxed environments, when you want
to prevent users from doing anything malicious.  To run the
conversion in `PandocIO`, use `runIO` (as above).  To run it in
`PandocPure`, use `runPure`.
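
As a small sketch of the pure variant, the function below runs the same kind of pipeline with `runPure`, so no `IO` appears in its type.  The name `markdownToHtmlPure` is made up for this example.

```haskell
import Text.Pandoc
import qualified Data.Text as T

-- A pure conversion: the result is an ordinary Either value.
markdownToHtmlPure :: T.Text -> Either PandocError T.Text
markdownToHtmlPure input = runPure $ do
  doc <- readMarkdown def input
  writeHtml5String def doc
```

Because the pipeline is written against `PandocMonad` rather than a concrete monad, switching between `runPure` and `runIO` requires no change to the body of the `do` block.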

As you can see from the Haddocks, [Text.Pandoc.Class]
exports many auxiliary functions that can be used in any
instance of `PandocMonad`.  For example:

```haskell
-- | Get the verbosity level.
getVerbosity :: PandocMonad m => m Verbosity

-- | Set the verbosity level.
setVerbosity :: PandocMonad m => Verbosity -> m ()

-- Get the accumulated log messages (in temporal order).

getLog :: PandocMonad m => m [LogMessage]
getLog = reverse <$> getsCommonState stLog

-- | Log a message using 'logOutput'.  Note that 'logOutput' is
-- called only if the verbosity level exceeds the level of the
-- message, but the message is added to the list of log messages
-- that will be retrieved by 'getLog' regardless of its verbosity level.
report :: PandocMonad m => LogMessage -> m ()

-- | Fetch an image or other item from the local filesystem or the net.
-- Returns raw content and maybe mime type.
fetchItem :: PandocMonad m
          => T.Text
          -> m (B.ByteString, Maybe MimeType)

-- | Set the resource path searched by 'fetchItem'.
setResourcePath :: PandocMonad m => [FilePath] -> m ()
```

If we wanted more verbose informational messages
during the conversion we defined in the previous
section, we could do this:

```haskell
  result <- runIO $ do
    setVerbosity INFO
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
```

Note that `PandocIO` is an instance of `MonadIO`, so you can
use `liftIO` to perform arbitrary IO operations inside a pandoc
conversion chain.
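
Here is a minimal sketch of `liftIO` in use: the input file is read inside the same `runIO` block that performs the conversion.  The function name `convertFile` is hypothetical.

```haskell
import Control.Monad.IO.Class (liftIO)
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Read a Markdown file and convert it to reStructuredText.
convertFile :: FilePath -> IO T.Text
convertFile path = do
  result <- runIO $ do
    -- An ordinary IO action, lifted into PandocIO.
    input <- liftIO (TIO.readFile path)
    doc <- readMarkdown def input
    writeRST def doc
  handleError result
```

This only works in `PandocIO`; the same code would not type-check under `runPure`, since `PandocPure` is not an instance of `MonadIO`.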

`readMarkdown` is polymorphic in its second argument, which
can be any type that is an instance of the `ToSources`
typeclass.  You can use `Text`, as in the example above.
But you can also use `[(FilePath, Text)]`, if the input comes
from multiple files and you want to track source positions
accurately.
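
A sketch of the multi-file form follows.  The file names and contents are invented for illustration, and the explicit type signature on `sources` is needed because `OverloadedStrings` would otherwise leave the literal's type ambiguous.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc
import qualified Data.Text as T

-- Hypothetical file names paired with their contents.
sources :: [(FilePath, T.Text)]
sources =
  [ ("intro.md",   "# Introduction\n\nFirst file.\n\n")
  , ("details.md", "# Details\n\nSecond file.\n")
  ]

-- Parse all sources as a single document.
readAll :: Either PandocError Pandoc
readAll = runPure (readMarkdown def sources)

-- Count the top-level headers in a document.
headerCount :: Pandoc -> Int
headerCount (Pandoc _ blocks) = length [ () | Header{} <- blocks ]
```

The file names are used only for source positions in messages; nothing is read from disk here.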

# Options

The first argument of each reader or writer is for
options controlling the behavior of the reader or writer:
`ReaderOptions` for readers and `WriterOptions`
for writers.  These are defined in [Text.Pandoc.Options].  It is
a good idea to study these options to see what can be adjusted.

`def` (from Data.Default) denotes a default value for
each kind of option.  (You can also use `defaultWriterOptions`
and `defaultReaderOptions`.)  Generally you'll want to use
the defaults and modify them only when needed, for example:

```haskell
    writeRST def{ writerReferenceLinks = True }
```

Some particularly important options to know about:

1.  `writerTemplate`:  By default, this is `Nothing`, which
    means that a document fragment will be produced. If you
    want a full document, you need to specify `Just template`,
    where `template` is a `Template Text` from
    [Text.Pandoc.Templates] containing the template's
    contents (not the path).

2.  `readerExtensions` and `writerExtensions`:  These specify
    the extensions to be used in parsing and rendering.
    Extensions are defined in [Text.Pandoc.Extensions].
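
The two options above can be combined as in the following sketch.  It assumes that `compileDefaultTemplate` from [Text.Pandoc.Templates] is available in your pandoc version and that pandoc's data files can be found at run time; the function names are made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc
import qualified Data.Text as T

-- Reader options with pandoc's full Markdown extension set enabled.
readerOpts :: ReaderOptions
readerOpts = def { readerExtensions = pandocExtensions }

-- With writerTemplate left as Nothing, the writer emits a fragment.
htmlFragment :: T.Text -> Either PandocError T.Text
htmlFragment input = runPure $
  readMarkdown readerOpts input >>= writeHtml5String def

-- With a compiled template, the writer emits a complete document.
htmlStandalone :: T.Text -> IO (Either PandocError T.Text)
htmlStandalone input = runIO $ do
  tpl <- compileDefaultTemplate "html5"  -- assumed: default HTML5 template
  doc <- readMarkdown readerOpts input
  writeHtml5String def { writerTemplate = Just tpl } doc
```

Note that `def` for `ReaderOptions` enables no extensions at all, so setting `readerExtensions` explicitly is often the first adjustment an application needs.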

# Builder

Sometimes it's useful to construct a Pandoc document
programmatically.  To make this easier we provide the
module [Text.Pandoc.Builder] in `pandoc-types`.

Because concatenating lists is slow, we use special
types `Inlines` and `Blocks` that wrap a `Sequence` of
`Inline` and `Block` elements.  These are instances
of the Monoid typeclass and can easily be concatenated:

```haskell
import Text.Pandoc.Builder

mydoc :: Pandoc
mydoc = doc $ header 1 (text "Hello!")
           <> para (emph (text "hello world") <> text ".")

main :: IO ()
main = print mydoc
```

If you use the `OverloadedStrings` pragma, you can
simplify this further:

```haskell
mydoc = doc $ header 1 "Hello!"
           <> para (emph "hello world" <> ".")
```
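
Because `Blocks` and `Inlines` are monoids, documents can also be assembled from ordinary Haskell data.  The following sketch turns a list of (label, URL) pairs into a bulleted list of links; the data and the name `linkList` are invented for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder
import qualified Data.Text as T

-- Hypothetical (label, URL) pairs used only for illustration.
resources :: [(T.Text, T.Text)]
resources =
  [ ("pandoc",  "https://pandoc.org")
  , ("Hackage", "https://hackage.haskell.org/package/pandoc")
  ]

-- Build a heading followed by one bullet per link.
linkList :: [(T.Text, T.Text)] -> Pandoc
linkList items = doc $
     header 2 "Resources"
  <> bulletList [ plain (link url "" (text label)) | (label, url) <- items ]
```

The arguments to `link` are the URL, the title, and the link text, in that order.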

Here's a more realistic example.  Suppose your boss says: write
me a letter in Word listing all the filling stations in Chicago
that take the Voyager card.  You find some JSON data in this
format (`fuel.json`):

```json
[ {
  "state" : "IL",
  "city" : "Chicago",
  "fuel_type_code" : "CNG",
  "zip" : "60607",
  "station_name" : "Clean Energy - Yellow Cab",
  "cards_accepted" : "A D M V Voyager Wright_Exp CleanEnergy",
  "street_address" : "540 W Grenshaw"
}, ...
```

And then use aeson and pandoc to parse the JSON and create
the Word document:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder
import Text.Pandoc
import Data.Monoid ((<>), mempty, mconcat)
import Data.Aeson
import Control.Applicative
import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.List (intersperse)

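-- Fields are Text because the Builder functions ('text', 'plain') take Text.
data Station = Station{
    address        :: T.Text
  , name           :: T.Text
  , cardsAccepted  :: [T.Text]
  } deriving Show

instance FromJSON Station where
    parseJSON (Object v) = Station <$>
       v .: "street_address" <*>
       v .: "station_name" <*>
       -- a missing "cards_accepted" field is treated as an empty list
       (T.words <$> (v .:? "cards_accepted" .!= ""))
    parseJSON _          = mzero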

createLetter :: [Station] -> Pandoc
createLetter stations = doc $
    para "Dear Boss:" <>
    para "Here are the CNG stations that accept Voyager cards:" <>
    simpleTable [plain "Station", plain "Address", plain "Cards accepted"]
           (map stationToRow stations) <>
    para "Your loyal servant," <>
    plain (image "JohnHancock.png" "" mempty)
  where
    stationToRow station =
      [ plain (text $ name station)
      , plain (text $ address station)
      , plain (mconcat $ intersperse linebreak
                       $ map text $ cardsAccepted station)
      ]

main :: IO ()
main = do
  json <- BL.readFile "fuel.json"
  let letter = case decode json of
                    Just stations -> createLetter [s | s <- stations,
                                        "Voyager" `elem` cardsAccepted s]
                    Nothing       -> error "Could not decode JSON"
  docx <- runIO (writeDocx def letter) >>= handleError
  BL.writeFile "letter.docx" docx
  putStrLn "Created letter.docx"
```

Voila!  You've written the letter without using Word and
without looking at the data.

# Data files

Pandoc has a number of data files, which can be found in the
`data/` subdirectory of the repository.  These are installed
with pandoc (or, if pandoc was compiled with the
`embed_data_files` flag, they are embedded in the binary).
You can retrieve data files using `readDataFile` from
Text.Pandoc.Class.  `readDataFile` will first look for the
file in the "user data directory" (`setUserDataDir`,
`getUserDataDir`), and if it is not found there, it will
return the default installed with the system.
To force the use of the default, call `setUserDataDir Nothing`.

# Templates

Pandoc has its own template system, described in the User's
Guide.  To retrieve the default template for an output format,
use `getDefaultTemplate` from [Text.Pandoc.Templates].
Note that this looks first in the
`templates` subdirectory of the user data directory, allowing
users to override the system defaults.  If you want to disable
this behavior, use `setUserDataDir Nothing`.

To render a template, use `renderTemplate'`, which takes two
arguments, a template (String) and a context (any instance
of ToJSON).  If you want to create a context from the metadata
part of a Pandoc document, use `metaToJSON'` from
[Text.Pandoc.Writers.Shared].  If you also want to incorporate
values from variables, use `metaToJSON` instead, and make sure
`writerVariables` is set in `WriterOptions`.


# Handling errors and warnings

`runIO` and `runPure` return an `Either PandocError a`. All errors
raised in running a `PandocMonad` computation will be trapped
and returned as a `Left` value, so they can be handled by
the calling program.  To see the constructors for `PandocError`,
see the documentation for [Text.Pandoc.Error].

To raise a `PandocError` from inside a `PandocMonad` computation,
use `throwError`.
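
As a sketch, the check below rejects empty documents by raising a `PandocError` in the middle of a pipeline.  It assumes the `PandocSomeError` constructor (which carries a `Text` message) and takes `throwError` from `Control.Monad.Except` in the `mtl` package; `requireContent` and `convertNonEmpty` are hypothetical names.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.Except (throwError)
import Text.Pandoc
import qualified Data.Text as T

-- Fail with a PandocError if the document has no blocks.
requireContent :: PandocMonad m => Pandoc -> m Pandoc
requireContent d@(Pandoc _ blocks)
  | null blocks = throwError (PandocSomeError "document has no content")
  | otherwise   = return d

-- The error surfaces as a Left value from runPure.
convertNonEmpty :: T.Text -> Either PandocError T.Text
convertNonEmpty input = runPure $
  readMarkdown def input >>= requireContent >>= writeRST def
```

Since `requireContent` is written against `PandocMonad`, it can be used unchanged under `runIO` as well.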

In addition to errors, which stop execution of the conversion
pipeline, one can generate informational messages.
Use `report` from [Text.Pandoc.Class] to issue a `LogMessage`.
For a list of constructors for `LogMessage`, see
[Text.Pandoc.Logging].  Note that each type of log message
is associated with a verbosity level.  The verbosity level
(`setVerbosity`/`getVerbosity`) determines whether the report
will be printed to stderr (when running in `PandocIO`), but
regardless of verbosity level, all reported messages are stored
internally and may be retrieved using `getLog`.
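
A minimal sketch of collecting the log alongside the output is shown below.  It assumes `messageVerbosity` from [Text.Pandoc.Logging] and the ordering `ERROR < WARNING < INFO` on `Verbosity`; the function names are made up.

```haskell
import Text.Pandoc
import qualified Data.Text as T

-- Run a conversion and return the output together with all log messages.
convertWithLog :: T.Text -> Either PandocError (T.Text, [LogMessage])
convertWithLog input = runPure $ do
  doc  <- readMarkdown def input
  out  <- writeRST def doc
  msgs <- getLog
  return (out, msgs)

-- Keep only messages at WARNING level or more severe.
seriousMessages :: [LogMessage] -> [LogMessage]
seriousMessages = filter ((<= WARNING) . messageVerbosity)
```

Returning the messages as data lets a web application or GUI decide how to present them instead of relying on stderr.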

# Walking the AST

It is often useful to walk the Pandoc AST either to extract
information (e.g., which URLs does this document link to?  do
all the code samples compile?) or to transform a document (e.g.,
increase the level of every section header, remove emphasis, or
replace specially marked code blocks with images).  To make this
easier and more efficient, `pandoc-types` includes a module
[Text.Pandoc.Walk].

Here's the essential documentation:

```haskell
class Walkable a b where
  -- | @walk f x@ walks the structure @x@ (bottom up) and replaces every
  -- occurrence of an @a@ with the result of applying @f@ to it.
  walk  :: (a -> a) -> b -> b
  walk f = runIdentity . walkM (return . f)
  -- | A monadic version of 'walk'.
  walkM :: (Monad m, Functor m) => (a -> m a) -> b -> m b
  -- | @query f x@ walks the structure @x@ (bottom up) and applies @f@
  -- to every @a@, appending the results.
  query :: Monoid c => (a -> c) -> b -> c
```

`Walkable` instances are defined for most combinations of
Pandoc types.  For example, the `Walkable Inline Block`
instance allows you to take a function `Inline -> Inline`
and apply it over every inline in a `Block`.  And
`Walkable [Inline] Pandoc` allows you to take a function
`[Inline] -> [Inline]` and apply it over every maximal
list of `Inline`s in a `Pandoc`.
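
For instance, the `Walkable Inline Block` instance can be exercised as in this sketch, which turns every `Emph` inside a block into `Strong`.  The names `emphToStrong`, `strengthen`, and `sampleBlock` are invented for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)

-- Rewrite a single inline; everything except Emph is left alone.
emphToStrong :: Inline -> Inline
emphToStrong (Emph xs) = Strong xs
emphToStrong x         = x

-- Apply the rewrite to every inline inside a block.
strengthen :: Block -> Block
strengthen = walk emphToStrong

-- A sample block used only for illustration.
sampleBlock :: Block
sampleBlock = Para [Str "plain", Space, Emph [Str "loud"]]
```

The type of the function passed to `walk` selects the instance, so an explicit signature on that function is usually needed.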

Here's a simple example of a function that increases
the level of every header by one:

```haskell
promoteHeaderLevels :: Pandoc -> Pandoc
promoteHeaderLevels = walk promote
  where promote :: Block -> Block
        promote (Header lev attr ils) = Header (lev + 1) attr ils
        promote x = x
```

`walkM` is a monadic version of `walk`; it can be used, for
example, when you need your transformations to perform IO
operations, use PandocMonad operations, or update internal
state.  Here's an example using the State monad to add unique
identifiers to each code block:

```haskell
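import Control.Monad.State (State, evalState, get, put)
import qualified Data.Text as T
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walkM)

addCodeIdentifiers :: Pandoc -> Pandoc
addCodeIdentifiers doc = evalState (walkM addCodeId doc) 1
  where addCodeId :: Block -> State Int Block
        addCodeId (CodeBlock (_,classes,kvs) code) = do
          curId <- get
          put (curId + 1)
          -- identifiers are Text, so convert the counter explicitly
          return $ CodeBlock (T.pack (show curId),classes,kvs) code
        addCodeId x = return x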

`query` is used to collect information from the AST.
Its argument is a query function that produces a result
in some monoidal type (e.g. a list).  The results are
concatenated together.  Here's an example that returns a
list of the URLs linked to in a document:

```haskell
-- Link targets are (url, title) pairs of Text (T is Data.Text).
listURLs :: Pandoc -> [T.Text]
listURLs = query urls
  where urls (Link _ _ (src, _)) = [src]
        urls _                   = []
```
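
The result type does not have to be a list.  As a sketch, the query below counts code blocks by accumulating in the `Sum Int` monoid from `Data.Monoid`; `countCodeBlocks` and `sampleDoc` are hypothetical names.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid (Sum(..))
import Text.Pandoc.Definition
import Text.Pandoc.Walk (query)

-- Count every code block in a document, including nested ones.
countCodeBlocks :: Pandoc -> Int
countCodeBlocks = getSum . query count
  where count :: Block -> Sum Int
        count (CodeBlock _ _) = Sum 1
        count _               = Sum 0

-- A sample document with one top-level and one nested code block.
sampleDoc :: Pandoc
sampleDoc = Pandoc nullMeta
  [ Para [Str "Intro"]
  , CodeBlock nullAttr "x = 1"
  , BlockQuote [CodeBlock nullAttr "y = 2"]
  ]
```

Using a numeric monoid avoids building an intermediate list when only a total is needed.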

# Creating a front-end

All of the functionality of the command-line program `pandoc`
has been abstracted out in `convertWithOpts` in
the module [Text.Pandoc.App].  Creating a GUI front-end for
pandoc is thus just a matter of populating the `Opts`
structure and calling this function.

# Notes on using pandoc in web applications

1. Pandoc's parsers can exhibit pathological behavior on some
   inputs.  So it is always a good idea to wrap uses of pandoc
   in a timeout function (e.g. `System.Timeout.timeout` from `base`)
   to prevent DoS attacks.

2. If pandoc generates HTML from untrusted user input, it is
   always a good idea to filter the generated HTML through
   a sanitizer (such as `xss-sanitize`) to avoid security
   problems.

3. Using `runPure` rather than `runIO` will ensure that
   pandoc's functions perform no IO operations (e.g. writing
   files).  If some resources need to be made available, a
   "fake environment" is provided inside the state available
   to `runPure` (see `PureState` and its associated functions
   in [Text.Pandoc.Class]).  It is also possible to write
   a custom instance of `PandocMonad` that, for example,
   makes wiki resources available as files in the fake environment,
   while isolating pandoc from the rest of the system.
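
Points 1 and 3 can be combined as in the following sketch, which runs a pure conversion under a time limit.  The name `renderUntrusted` is made up, the time limit is left to the caller, and sanitizing the resulting HTML (point 2) is deliberately not shown.

```haskell
import Control.Exception (evaluate)
import System.Timeout (timeout)
import Text.Pandoc
import qualified Data.Text as T

-- Convert untrusted Markdown to an HTML fragment without IO access,
-- giving up after the given number of microseconds.
-- Nothing means the conversion timed out.
renderUntrusted :: Int -> T.Text -> IO (Maybe (Either PandocError T.Text))
renderUntrusted micros input = timeout micros $
  case runPure (readMarkdown def input >>= writeHtml5String def) of
    Left err   -> return (Left err)
    -- Force the strict Text inside the timeout, so that the work
    -- is actually done before the time limit is checked.
    Right html -> Right <$> evaluate html
```

Forcing the result matters because `runPure` is lazy: without it, the expensive parsing could happen after `timeout` has already returned.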


[Text.Pandoc.Definition]: https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html
[Text.Pandoc.Walk]: https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Walk.html
[Text.Pandoc.Class]: https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Class.html
[Text.Pandoc.Options]: https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Options.html
[Text.Pandoc.Extensions]: https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Extensions.html
[Text.Pandoc.Builder]: https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Builder.html
[Text.Pandoc.Templates]: https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Templates.html
[Text.Pandoc.Logging]: https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Logging.html
[Text.Pandoc.App]: https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-App.html
[Text.Pandoc.Error]: https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Error.html
[Text.Pandoc.Writers.Shared]: https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Writers-Shared.html