aboutsummaryrefslogtreecommitdiff
path: root/doc/custom-readers.md
diff options
context:
space:
mode:
Diffstat (limited to 'doc/custom-readers.md')
-rw-r--r--doc/custom-readers.md684
1 files changed, 684 insertions, 0 deletions
diff --git a/doc/custom-readers.md b/doc/custom-readers.md
new file mode 100644
index 000000000..37b6d6a3e
--- /dev/null
+++ b/doc/custom-readers.md
@@ -0,0 +1,684 @@
+---
+author:
+- John MacFarlane
+date: 'November 18, 2021'
+title: Creating Custom Pandoc Readers in Lua
+---
+
+# Introduction
+
+If you need to parse a format not already handled by pandoc,
+you can create a custom reader using the [Lua] language.
+Pandoc has a built-in Lua interpreter, so you needn't
+install any additional software to do this.
+
+[Lua]: https://www.lua.org
+
+A custom reader is a Lua file that defines a function
+called `Reader`, which takes two arguments:
+
+- the raw input to be parsed, as a list of sources
+- optionally, a table of reader options, e.g.
+ `{ columns = 62, standalone = true }`.
+
+The `Reader` function should return a `Pandoc` AST.
+This can be created using functions in the [`pandoc` module],
+which is automatically in scope. (Indeed, all of the utility
+functions that are available for [Lua filters] are available
+in custom readers, too.)
+
+Each source item corresponds to a file or stream passed to pandoc
+containing its text and name. E.g., if a single file `input.txt`
+is passed to pandoc, then the list of sources will contain just a
+single element `s`, where `s.name == 'input.txt'` and `s.text`
+contains the file contents as a string.
+
+The sources list, as well as each of its elements, can be
+converted to a string via the Lua standard library function
+`tostring`.
+
+[Lua filters]: https://pandoc.org/lua-filters.html
+[`pandoc` module]: https://pandoc.org/lua-filters.html#module-pandoc
+
+A minimal example would be
+
+```lua
+function Reader(input)
+ return pandoc.Pandoc({ pandoc.CodeBlock(tostring(input)) })
+end
+```
+
+This just returns a document containing a big code block with all
+of the input. Or, to create a separate code block for each input
+file, one might write
+
+``` lua
+function Reader(input)
+ return pandoc.Pandoc(input:map(
+ function (s) return pandoc.CodeBlock(s.text) end))
+end
+```
+
+In a nontrivial reader, you'll want to parse the input.
+You can do this using standard Lua library functions
+(for example, the [patterns] library), or with the powerful
+and fast [lpeg] parsing library, which is automatically in scope.
+You can also use external Lua libraries (for example,
+an XML parser).
+
+A previous pandoc version passed a raw string instead of a list
+of sources to the Reader function. Reader functions that rely on
+this are obsolete, but still supported: Pandoc analyzes any
+script error, detecting when code assumed the old behavior. The
+code is rerun with raw string input in this case, thereby
+ensuring backwards compatibility.
+
+[patterns]: http://lua-users.org/wiki/PatternsTutorial
+[lpeg]: http://www.inf.puc-rio.br/~roberto/lpeg/
+
+# Example: plain text reader
+
+This is a simple example using [lpeg] to parse the input
+into space-separated strings and blankline-separated paragraphs.
+
+```lua
+-- A sample custom reader that just parses text into blankline-separated
+-- paragraphs with space-separated words.
+
+-- For better performance we put these functions in local variables:
+local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
+ lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
+ lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt
+
+local whitespacechar = S(" \t\r\n")
+local wordchar = (1 - whitespacechar)
+local spacechar = S(" \t")
+local newline = P"\r"^-1 * P"\n"
+local blanklines = newline * (spacechar^0 * newline)^1
+local endline = newline - blanklines
+
+-- Grammar
+G = P{ "Pandoc",
+ Pandoc = Ct(V"Block"^0) / pandoc.Pandoc;
+ Block = blanklines^0 * V"Para" ;
+ Para = Ct(V"Inline"^1) / pandoc.Para;
+ Inline = V"Str" + V"Space" + V"SoftBreak" ;
+ Str = wordchar^1 / pandoc.Str;
+ Space = spacechar^1 / pandoc.Space;
+ SoftBreak = endline / pandoc.SoftBreak;
+}
+
+function Reader(input)
+ return lpeg.match(G, tostring(input))
+end
+```
+
+Example of use:
+
+```
+% pandoc -f plain.lua -t native
+*Hello there*, this is plain text with no formatting
+except paragraph breaks.
+
+- Like this one.
+^D
+[ Para
+ [ Str "*Hello"
+ , Space
+ , Str "there*,"
+ , Space
+ , Str "this"
+ , Space
+ , Str "is"
+ , Space
+ , Str "plain"
+ , Space
+ , Str "text"
+ , Space
+ , Str "with"
+ , Space
+ , Str "no"
+ , Space
+ , Str "formatting"
+ , SoftBreak
+ , Str "except"
+ , Space
+ , Str "paragraph"
+ , Space
+ , Str "breaks."
+ ]
+, Para
+ [ Str "-"
+ , Space
+ , Str "Like"
+ , Space
+ , Str "this"
+ , Space
+ , Str "one."
+ ]
+]
+```
+
+# Example: a RIS bibliography reader
+
+This is a parser for [RIS bibliography] files. It can be used
+to convert them to CSL JSON or YAML, BibTeX, or BibLaTeX.
+
+[RIS bibliography]: https://en.wikipedia.org/wiki/RIS_(file_format)
+
+```lua
+-- A sample custom reader for RIS bibliography format
+-- https://en.wikipedia.org/wiki/RIS_(file_format)
+-- The references are converted to inline pandoc/CSL YAML
+-- references in the metadata.
+
+local inspect = require"inspect"
+
+local types =
+ { ABST = "article",
+ ADVS = "motion-picture",
+ AGGR = "dataset",
+ ANCIENT = "book",
+ ART = "graphic",
+ BILL = "bill",
+ BLOG = "post-weblog",
+ BOOK = "book",
+ CASE = "legal_case",
+ CHAP = "chapter",
+ CHART = "graphic",
+ CLSWK = "book",
+ COMP = "program",
+ CONF = "paper-conference",
+ CPAPER = "paper-conference",
+ CTLG = "catalog",
+ DATA = "dataset",
+ DBASE = "dataset",
+ DICT = "book",
+ EBOOK = "book",
+ ECHAP = "chapter",
+ EDBOOK = "book",
+ EJOUR = "article",
+ WEB = "webpage",
+ ENCYC = "entry-encyclopedia",
+ EQUA = "figure",
+ FIGURE = "figure",
+ GEN = "entry",
+ GOVDOC = "report",
+ GRANT = "report",
+ HEAR = "report",
+ ICOMM = "personal_communication",
+ INPR = "article-journal",
+ JFULL = "article-journal",
+ JOUR = "article-journal",
+ LEGAL = "legal_case",
+ MANSCPT = "manuscript",
+ MAP = "map",
+ MGZN = "article-magazine",
+ MPCT = "motion-picture",
+ MULTI = "webpage",
+ MUSIC = "musical_score",
+ NEWS = "article-newspaper",
+ PAMP = "pamphlet",
+ PAT = "patent",
+ PCOMM = "personal_communication",
+ RPRT = "report",
+ SER = "article",
+ SLIDE = "graphic",
+ SOUND = "musical_score",
+ STAND = "report",
+ STAT = "legislation",
+ THES = "thesis",
+ UNBILL = "bill",
+ UNPB = "unpublished",
+ VIDEO = "graphic"
+ }
+
+local function clean(refpairs)
+ local ref = {}
+ for i = 1, #refpairs do
+ local k,v = table.unpack(refpairs[i])
+ if k == "TY" then
+ ref["type"] = types[v]
+ elseif k == "VL" then
+ ref.volume = v
+ elseif k == "KW" then
+ ref.keyword = v
+ elseif k == "PB" then
+ ref.publisher = v
+ elseif k == "CY" or k == "PP" then
+ ref["publisher-place"] = v
+ elseif k == "SP" then
+ if ref.page then
+ ref.page = v .. ref.page
+ else
+ ref.page = v
+ end
+ elseif k == "EP" then
+ if ref.page then
+ ref.page = ref.page .. "-" .. v
+ else
+ ref.page = "-" .. v
+ end
+ elseif k == "AU" or k == "A1" or k == "A2" or k == "A3" then
+ if ref.author then
+ table.insert(ref.author, v)
+ else
+ ref.author = {v}
+ end
+ elseif k == "TI" or k == "T1" or k == "CT" or
+ (k == "BT" and ref.type == "book") then
+ ref.title = v
+ elseif k == "ET" then
+ ref.edition = v
+ elseif k == "NV" then
+ ref["number-of-volumes"] = v
+ elseif k == "AB" then
+ ref.abstract = v
+ elseif k == "ED" then
+ if ref.editor then
+ table.insert(ref.editor, v)
+ else
+ ref.editor = {v}
+ end
+ elseif k == "JO" or k == "JF" or k == "T2" or
+ (k == "BT" and ref.type ~= "book") then
+ ref["container-title"] = v
+ elseif k == "PY" or k == "Y1" then
+ ref.issued = v
+ elseif k == "IS" then
+ ref.issue = v
+ elseif k == "SN" then
+ ref.ISSN = v
+ elseif k == "L" then
+ ref.lang = v
+ elseif k == "UR" or k == "LK" then
+ ref.URL = v
+ end
+ end
+ return ref
+end
+
+function Reader(input, reader_options)
+ local refs = {}
+ local thisref = {}
+ local ids = {}
+ for line in string.gmatch(tostring(input), "[^\n]*") do
+ key, val = string.match(line, "([A-Z][A-Z0-9]) %- (.*)")
+ if key == "ER" then
+ -- clean up fields
+ local newref = clean(thisref)
+ -- ensure we have an id and if not, create a sensible one
+ if not newref.id then
+ newref.id = ""
+ for _,x in ipairs(newref.author) do
+ newref.id = newref.id .. string.match(pandoc.utils.stringify(x), "%a+")
+ end
+ if newref.issued then
+ newref.id = newref.id .. string.match(newref.issued, "%d+")
+ end
+ if ids[newref.id] then -- add disambiguator if needed
+ newref.id = newref.id .. "-" .. #ids
+ end
+ end
+ table.insert(ids, newref.id)
+ table.insert(refs, newref)
+ thisref = {}
+ elseif key then
+ table.insert(thisref, {key, val})
+ end
+ end
+ return pandoc.Pandoc({}, pandoc.Meta { references = refs } )
+end
+```
+
+Example of use:
+
+```
+% pandoc -f ris.lua -t bibtex
+TY - JOUR
+AU - Shannon, Claude E.
+PY - 1948
+DA - July
+TI - A Mathematical Theory of Communication
+T2 - Bell System Technical Journal
+SP - 379
+EP - 423
+VL - 27
+ER -
+TY - JOUR
+T1 - On computable numbers, with an application to the Entscheidungsproblem
+A1 - Turing, Alan Mathison
+JO - Proc. of London Mathematical Society
+VL - 47
+IS - 1
+SP - 230
+EP - 265
+Y1 - 1937
+ER -
+^D
+@article{Shannon1948,
+ author = {Shannon, Claude E.},
+ title = {A {Mathematical} {Theory} of {Communication}},
+ journal = {Bell System Technical Journal},
+ volume = {27},
+ pages = {379-423},
+ year = {1948}
+}
+@article{Turing1937,
+ author = {Turing, Alan Mathison},
+ title = {On Computable Numbers, with an Application to the
+ {Entscheidungsproblem}},
+ journal = {Proc. of London Mathematical Society},
+ volume = {47},
+ number = {1},
+ pages = {230-265},
+ year = {1937}
+}
+```
+
+# Example: a wiki Creole reader
+
+This is a parser for [Creole common wiki markup].
+It uses an [lpeg] grammar. Fun fact: this custom reader is faster than
+pandoc's built-in creole reader! This shows that high-performance
+readers can be designed in this way.
+
+[Creole common wiki markup]: http://www.wikicreole.org/wiki/CheatSheet
+
+
+```lua
+-- A sample custom reader for Creole 1.0 (common wiki markup)
+-- http://www.wikicreole.org/wiki/CheatSheet
+
+-- For better performance we put these functions in local variables:
+local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
+ lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
+ lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt
+
+local whitespacechar = S(" \t\r\n")
+local specialchar = S("/*~[]\\{}|")
+local wordchar = (1 - (whitespacechar + specialchar))
+local spacechar = S(" \t")
+local newline = P"\r"^-1 * P"\n"
+local blankline = spacechar^0 * newline
+local endline = newline * #-blankline
+local endequals = spacechar^0 * P"="^0 * spacechar^0 * newline
+local cellsep = spacechar^0 * P"|"
+
+local function trim(s)
+ return (s:gsub("^%s*(.-)%s*$", "%1"))
+end
+
+local function ListItem(lev, ch)
+ local start
+ if ch == nil then
+ start = S"*#"
+ else
+ start = P(ch)
+ end
+ local subitem = function(c)
+ if lev < 6 then
+ return ListItem(lev + 1, c)
+ else
+ return (1 - 1) -- fails
+ end
+ end
+ local parser = spacechar^0
+ * start^lev
+ * #(- start)
+ * spacechar^0
+ * Ct((V"Inline" - (newline * spacechar^0 * S"*#"))^0)
+ * newline
+ * (Ct(subitem("*")^1) / pandoc.BulletList
+ +
+ Ct(subitem("#")^1) / pandoc.OrderedList
+ +
+ Cc(nil))
+ / function (ils, sublist)
+ return { pandoc.Plain(ils), sublist }
+ end
+ return parser
+end
+
+-- Grammar
+G = P{ "Doc",
+ Doc = Ct(V"Block"^0)
+ / pandoc.Pandoc ;
+ Block = blankline^0
+ * ( V"Header"
+ + V"HorizontalRule"
+ + V"CodeBlock"
+ + V"List"
+ + V"Table"
+ + V"Para") ;
+ Para = Ct(V"Inline"^1)
+ * newline
+ / pandoc.Para ;
+ HorizontalRule = spacechar^0
+ * P"----"
+ * spacechar^0
+ * newline
+ / pandoc.HorizontalRule;
+ Header = (P("=")^1 / string.len)
+ * spacechar^1
+ * Ct((V"Inline" - endequals)^1)
+ * endequals
+ / pandoc.Header;
+ CodeBlock = P"{{{"
+ * blankline
+ * C((1 - (newline * P"}}}"))^0)
+ * newline
+ * P"}}}"
+ / pandoc.CodeBlock;
+ Placeholder = P"<<<"
+ * C(P(1) - P">>>")^0
+ * P">>>"
+ / function() return pandoc.Div({}) end;
+ List = V"BulletList"
+ + V"OrderedList" ;
+ BulletList = Ct(ListItem(1,'*')^1)
+ / pandoc.BulletList ;
+ OrderedList = Ct(ListItem(1,'#')^1)
+ / pandoc.OrderedList ;
+ Table = (V"TableHeader" + Cc{})
+ * Ct(V"TableRow"^1)
+ / function(headrow, bodyrows)
+ local numcolumns = #(bodyrows[1])
+ local aligns = {}
+ local widths = {}
+ for i = 1,numcolumns do
+ aligns[i] = pandoc.AlignDefault
+ widths[i] = 0
+ end
+ return pandoc.utils.from_simple_table(
+ pandoc.SimpleTable({}, aligns, widths, headrow, bodyrows))
+ end ;
+ TableHeader = Ct(V"HeaderCell"^1)
+ * cellsep^-1
+ * spacechar^0
+ * newline ;
+ TableRow = Ct(V"BodyCell"^1)
+ * cellsep^-1
+ * spacechar^0
+ * newline ;
+ HeaderCell = cellsep
+ * P"="
+ * spacechar^0
+ * Ct((V"Inline" - (newline + cellsep))^0)
+ / function(ils) return { pandoc.Plain(ils) } end ;
+ BodyCell = cellsep
+ * spacechar^0
+ * Ct((V"Inline" - (newline + cellsep))^0)
+ / function(ils) return { pandoc.Plain(ils) } end ;
+ Inline = V"Emph"
+ + V"Strong"
+ + V"LineBreak"
+ + V"Link"
+ + V"URL"
+ + V"Image"
+ + V"Str"
+ + V"Space"
+ + V"SoftBreak"
+ + V"Escaped"
+ + V"Placeholder"
+ + V"Code"
+ + V"Special" ;
+ Str = wordchar^1
+ / pandoc.Str;
+ Escaped = P"~"
+ * C(P(1))
+ / pandoc.Str ;
+ Special = specialchar
+ / pandoc.Str;
+ Space = spacechar^1
+ / pandoc.Space ;
+ SoftBreak = endline
+ * # -(V"HorizontalRule" + V"CodeBlock")
+ / pandoc.SoftBreak ;
+ LineBreak = P"\\\\"
+ / pandoc.LineBreak ;
+ Code = P"{{{"
+ * C((1 - P"}}}")^0)
+ * P"}}}"
+ / trim / pandoc.Code ;
+ Link = P"[["
+ * C((1 - (P"]]" + P"|"))^0)
+ * (P"|" * Ct((V"Inline" - P"]]")^1))^-1 * P"]]"
+ / function(url, desc)
+ local txt = desc or {pandoc.Str(url)}
+ return pandoc.Link(txt, url)
+ end ;
+ Image = P"{{"
+ * #-P"{"
+ * C((1 - (S"}"))^0)
+ * (P"|" * Ct((V"Inline" - P"}}")^1))^-1
+ * P"}}"
+ / function(url, desc)
+ local txt = desc or ""
+ return pandoc.Image(txt, url)
+ end ;
+ URL = P"http"
+ * P"s"^-1
+ * P":"
+ * (1 - (whitespacechar + (S",.?!:;\"'" * #whitespacechar)))^1
+ / function(url)
+ return pandoc.Link(pandoc.Str(url), url)
+ end ;
+ Emph = P"//"
+ * Ct((V"Inline" - P"//")^1)
+ * P"//"
+ / pandoc.Emph ;
+ Strong = P"**"
+ * Ct((V"Inline" -P"**")^1)
+ * P"**"
+ / pandoc.Strong ;
+}
+
+function Reader(input, reader_options)
+ return lpeg.match(G, tostring(input))
+end
+```
+
+Example of use:
+
+```
+% pandoc -f creole.lua -t markdown
+== Wiki Creole
+
+You can make things **bold** or //italic// or **//both//** or //**both**//.
+
+Character formatting extends across line breaks: **bold,
+this is still bold. This line deliberately does not end in star-star.
+
+Not bold. Character formatting does not cross paragraph boundaries.
+
+You can use [[internal links]] or [[http://www.wikicreole.org|external links]],
+give the link a [[internal links|different]] name.
+^D
+## Wiki Creole
+
+You can make things **bold** or *italic* or ***both*** or ***both***.
+
+Character formatting extends across line breaks: \*\*bold, this is still
+bold. This line deliberately does not end in star-star.
+
+Not bold. Character formatting does not cross paragraph boundaries.
+
+You can use [internal links](internal links) or [external
+links](http://www.wikicreole.org), give the link a
+[different](internal links) name.
+```
+
+# Example: parsing JSON from an API
+
+This custom reader consumes the JSON output of
+<https://www.reddit.com/r/haskell.json> and produces
+a document containing the current top articles on the
+Haskell subreddit.
+
+It assumes that the `luajson` library is available. (It can be
+installed using `luarocks install luajson`---but be sure you are
+installing it for Lua 5.3, which is the version packaged with
+pandoc.)
+
+
+```lua
+-- consumes the output of https://www.reddit.com/r/haskell.json
+
+local json = require'json' -- luajson must be available
+
+local function read_inlines(raw)
+ local doc = pandoc.read(raw, "commonmark")
+ return pandoc.utils.blocks_to_inlines(doc.blocks)
+end
+
+local function read_blocks(raw)
+ local doc = pandoc.read(raw, "commonmark")
+ return doc.blocks
+end
+
+function Reader(input)
+
+ local parsed = json.decode(tostring(input))
+ local blocks = {}
+
+ for _,entry in ipairs(parsed.data.children) do
+ local d = entry.data
+ table.insert(blocks, pandoc.Header(2,
+ pandoc.Link(read_inlines(d.title), d.url)))
+ for _,block in ipairs(read_blocks(d.selftext)) do
+ table.insert(blocks, block)
+ end
+ end
+
+ return pandoc.Pandoc(blocks)
+
+end
+```
+
+Similar code can be used to consume JSON output from other APIs.
+
+Note that the content of the text fields is markdown, so we
+convert it using `pandoc.read()`.
+
+
+# Example: syntax-highlighted code files
+
+This is a reader that puts the content of each input file into a
+code block, sets the file's extension as the block's class to
+enable code highlighting, and places the filename as a header
+above each code block.
+
+``` lua
+function to_code_block (source)
+ local _, lang = pandoc.path.split_extension(source.name)
+ return pandoc.Div{
+ pandoc.Header(1, source.name == '' and '<stdin>' or source.name),
+ pandoc.CodeBlock(source.text, {class=lang}),
+ }
+end
+
+function Reader (input, opts)
+ return pandoc.Pandoc(input:map(to_code_block))
+end
+```