-rw-r--r-- | README | 122
-rw-r--r-- | Setup.hs | 4
-rwxr-xr-x | html2markdown | 221
-rw-r--r-- | man/man1/hsmarkdown.1.md | 42
-rw-r--r-- | man/man1/html2markdown.1.md | 95
-rw-r--r-- | pandoc.cabal | 19
-rw-r--r-- | src/hsmarkdown.hs | 47
7 files changed, 44 insertions, 506 deletions
@@ -127,92 +127,49 @@ will convert `source.txt` from the local encoding to UTF-8, then convert it to HTML, then convert back to the local encoding, putting the output in `output.html`. -The wrapper scripts (described below) automatically convert the input -from the local encoding to UTF-8 before running them through `pandoc`, -then convert the output back to the local encoding. - Wrappers ======== -Three wrapper scripts, `markdown2pdf`, `html2markdown`, and -`hsmarkdown`, are included in the standard Pandoc installation. (The -Windows binary package does not include `html2markdown`, which is -a POSIX shell script. It does include portable Haskell versions of -`markdown2pdf` and `hsmarkdown`.) - -1. `markdown2pdf` produces a PDF file from markdown-formatted - text, using `pandoc` and `pdflatex`. The default - behavior of `markdown2pdf` is to create a file with the same - base name as the first argument and the extension `pdf`; thus, - for example, - - markdown2pdf sample.txt endnotes.txt - - will produce `sample.pdf`. (If `sample.pdf` exists already, - it will be backed up before being overwritten.) An output file - name can be specified explicitly using the `-o` option: - - markdown2pdf -o book.pdf chap1 chap2 - - If no input file is specified, input will be taken from stdin. - All of `pandoc`'s options will work with `markdown2pdf` as well. - - `markdown2pdf` assumes that `pdflatex` is in the path. It also - assumes that the following LaTeX packages are available: - `unicode`, `fancyhdr` (if you have verbatim text in footnotes), - `graphicx` (if you use images), `array` (if you use tables), - and `ulem` (if you use strikeout text). If they are not already - included in your LaTeX distribution, you can get them from - [CTAN]. A full [TeX Live] or [MacTeX] distribution will have all of - these packages. - -2. `html2markdown` grabs a web page from a file or URL and converts - it to markdown-formatted text, using `tidy` and `pandoc`. 
- - All of `pandoc`'s options will work with `html2markdown` as well. - In addition, the following special options may be used. - The special options must be separated from the `html2markdown` - command and any regular Pandoc options by the delimiter `--`: - - html2markdown -o out.txt -- -e latin1 -g curl google.com - - The `-e` or `--encoding` option specifies the character encoding - of the HTML input. If this option is not specified, and input - is not from stdin, `html2markdown` will attempt to determine the - page's character encoding from the "Content-type" meta tag. - If this is not present, UTF-8 is assumed. - - The `-g` or `--grabber` option specifies the command to be used to - fetch the contents of a URL: - - html2markdown -g 'curl --user foo:bar' www.mysite.com - - If this option is not specified, `html2markdown` searches for an - available program (`wget`, `curl`, or a text-mode browser) to fetch - the contents of a URL. - - `html2markdown` requires [HTML Tidy], which must be in the path. - It uses [`iconv`] for character encoding conversions; if `iconv` - is absent, it will still work, but it will treat everything as UTF-8. - -3. `hsmarkdown` is designed to be used as a drop-in replacement for - `Markdown.pl`. It forces `pandoc` to convert from markdown to - HTML, and to use the `--strict` flag for maximal compliance with - official markdown syntax. (All of Pandoc's syntax extensions and - variants, described below, are disabled.) No other command-line - options are allowed. (In fact, options will be interpreted as - filenames.) - - As an alternative to using the `hsmarkdown` script, the - user may create a symbolic link to `pandoc` called `hsmarkdown`. - When invoked under the name `hsmarkdown`, `pandoc` will behave - as if the `--strict` flag had been selected, and no command-line - options will be recognized. However, this approach does not work - under Cygwin, due to problems with its simulation of symbolic - links. 
+`markdown2pdf` +-------------- + +The standard Pandoc installation includes `markdown2pdf`, a wrapper +around `pandoc` and `pdflatex` that produces PDFs directly from markdown +sources. The default behavior of `markdown2pdf` is to create a file with +the same base name as the first argument and the extension `pdf`; thus, +for example, + + markdown2pdf sample.txt endnotes.txt + +will produce `sample.pdf`. (If `sample.pdf` exists already, +it will be backed up before being overwritten.) An output file +name can be specified explicitly using the `-o` option: + + markdown2pdf -o book.pdf chap1 chap2 + +If no input file is specified, input will be taken from stdin. +All of `pandoc`'s options will work with `markdown2pdf` as well. + +`markdown2pdf` assumes that `pdflatex` is in the path. It also +assumes that the following LaTeX packages are available: +`unicode`, `fancyhdr` (if you have verbatim text in footnotes), +`graphicx` (if you use images), `array` (if you use tables), +and `ulem` (if you use strikeout text). If they are not already +included in your LaTeX distribution, you can get them from +[CTAN]. A full [TeX Live] or [MacTeX] distribution will have all of +these packages. + +`hsmarkdown` +------------ + +A user who wants a drop-in replacement for `Markdown.pl` may create +a symbolic link to the `pandoc` executable called `hsmarkdown`. When +invoked under the name `hsmarkdown`, `pandoc` will behave as if the +`--strict` flag had been selected, and no command-line options will be +recognized. However, this approach does not work under Cygwin, due to +problems with its simulation of symbolic links. [Cygwin]: http://www.cygwin.com/ -[HTML Tidy]: http://tidy.sourceforge.net/ [`iconv`]: http://www.gnu.org/software/libiconv/ [CTAN]: http://www.ctan.org "Comprehensive TeX Archive Network" [TeX Live]: http://www.tug.org/texlive/ @@ -562,8 +519,7 @@ Pandoc's markdown vs. 
standard markdown In parsing markdown, Pandoc departs from and extends [standard markdown] in a few respects. Except where noted, these differences can -be suppressed by specifying the `--strict` command-line option or by -using the `hsmarkdown` wrapper. +be suppressed by specifying the `--strict` command-line option. [standard markdown]: http://daringfireball.net/projects/markdown/syntax "Markdown syntax description" @@ -51,7 +51,7 @@ makeManPages :: Args -> BuildFlags -> PackageDescription -> LocalBuildInfo -> IO makeManPages _ flags _ _ = mapM_ (makeManPage (fromFlag $ buildVerbosity flags)) manpages manpages :: [FilePath] -manpages = ["pandoc.1", "hsmarkdown.1", "html2markdown.1", "markdown2pdf.1"] +manpages = ["pandoc.1", "markdown2pdf.1"] manDir :: FilePath manDir = "man" </> "man1" @@ -80,7 +80,7 @@ installScripts pkg lbi verbosity copy = (zip (repeat ".") (wrappers \\ exes)) where exes = map exeName $ filter isBuildable $ executables pkg isBuildable = buildable . buildInfo - wrappers = ["html2markdown", "hsmarkdown", "markdown2pdf"] + wrappers = ["markdown2pdf"] installManpages :: PackageDescription -> LocalBuildInfo -> Verbosity -> CopyDest -> IO () diff --git a/html2markdown b/html2markdown deleted file mode 100755 index 0649e0478..000000000 --- a/html2markdown +++ /dev/null @@ -1,221 +0,0 @@ -#!/bin/sh -e -# converts HTML from a URL, file, or stdin to markdown -# uses an available program to fetch URL and tidy to normalize it first - -REQUIRED="tidy" -SYNOPSIS="converts HTML from a URL, file, or STDIN to markdown-formatted text." - -THIS=${0##*/} - -NEWLINE=' -' - -err () { echo "$*" | fold -s -w ${COLUMNS:-110} >&2; } -errn () { printf "$*" | fold -s -w ${COLUMNS:-110} >&2; } - -usage () { - err "$1 - $2" # short description - err "See the $1(1) man page for usage." -} - -# Portable which(1). 
-pathfind () { - oldifs="$IFS"; IFS=':' - for _p in $PATH; do - if [ -x "$_p/$*" ] && [ -f "$_p/$*" ]; then - IFS="$oldifs" - return 0 - fi - done - IFS="$oldifs" - return 1 -} - -for p in pandoc $REQUIRED; do - pathfind $p || { - err "You need '$p' to use this program!" - exit 1 - } -done - -CONF=$(pandoc --dump-args "$@" 2>&1) || { - errcode=$? - echo "$CONF" | sed -e '/^pandoc \[OPTIONS\] \[FILES\]/,$d' >&2 - [ $errcode -eq 2 ] && usage "$THIS" "$SYNOPSIS" - exit $errcode -} - -OUTPUT=$(echo "$CONF" | sed -ne '1p') -ARGS=$(echo "$CONF" | sed -e '1d') - - -grab_url_with () { - url="${1:?internal error: grab_url_with: url required}" - - shift - cmdline="$@" - - prog= - prog_opts= - if [ -n "$cmdline" ]; then - eval "set -- $cmdline" - prog=$1 - shift - prog_opts="$@" - fi - - if [ -z "$prog" ]; then - # Locate a sensible web grabber (note the order). - for p in wget lynx w3m curl links w3c; do - if pathfind $p; then - prog=$p - break - fi - done - - [ -n "$prog" ] || { - errn "$THIS: Couldn't find a program to fetch the file from URL " - err "(e.g. wget, w3m, lynx, w3c, or curl)." - return 1 - } - else - pathfind "$prog" || { - err "$THIS: No such web grabber '$prog' found; aborting." - return 1 - } - fi - - # Setup proper base options for known grabbers. - base_opts= - case "$prog" in - wget) base_opts="-O-" ;; - lynx) base_opts="-source" ;; - w3m) base_opts="-dump_source" ;; - curl) base_opts="" ;; - links) base_opts="-source" ;; - w3c) base_opts="-n -get" ;; - *) err "$THIS: unhandled web grabber '$prog'; hope it succeeds." - esac - - err "$THIS: invoking '$prog $base_opts $prog_opts $url'..." 
- eval "set -- $base_opts $prog_opts" - $prog "$@" "$url" -} - -# Parse command-line arguments -parse_arguments () { - while [ $# -gt 0 ]; do - case "$1" in - --encoding=*) - wholeopt="$1" - # extract encoding from after = - encoding="${wholeopt#*=}" ;; - -e|--encoding|-encoding) - shift - encoding="$1" ;; - --grabber=*) - wholeopt="$1" - # extract encoding from after = - grabber="\"${wholeopt#*=}\"" ;; - -g|--grabber|-grabber) - shift - grabber="$1" ;; - *) - if [ -z "$argument" ]; then - argument="$1" - else - err "Warning: extra argument '$1' will be ignored." - fi ;; - esac - shift - done -} - -argument= -encoding= -grabber= - -oldifs="$IFS" -IFS=$NEWLINE -parse_arguments $ARGS -IFS="$oldifs" - -inurl= -if [ -n "$argument" ] && ! [ -f "$argument" ]; then - # Treat given argument as an URL. - inurl="$argument" -fi - -# As a security measure refuse to proceed if mktemp is not available. -pathfind mktemp || { err "Couldn't find 'mktemp'; aborting."; exit 1; } - -# Avoid issues with /tmp directory on Windows/Cygwin -cygwin= -cygwin=$(uname | sed -ne '/^CYGWIN/p') -if [ -n "$cygwin" ]; then - TMPDIR=. - export TMPDIR -fi - -THIS_TEMPDIR= -THIS_TEMPDIR="$(mktemp -d -t $THIS.XXXXXXXX)" || exit 1 -readonly THIS_TEMPDIR - -trap 'exitcode=$? - [ -z "$THIS_TEMPDIR" ] || rm -rf "$THIS_TEMPDIR" - exit $exitcode' 0 1 2 3 13 15 - -if [ -n "$inurl" ]; then - err "Attempting to fetch file from '$inurl'..." - - grabber_out=$THIS_TEMPDIR/grabber.out - grabber_log=$THIS_TEMPDIR/grabber.log - if ! grab_url_with "$inurl" "$grabber" 1>$grabber_out 2>$grabber_log; then - errn "grab_url_with failed" - if [ -f $grabber_log ]; then - err " with the following error log." - err - cat >&2 $grabber_log - else - err . - fi - exit 1 - fi - - argument="$grabber_out" -fi - -if [ -z "$encoding" ] && [ "x$argument" != "x" ]; then - # Try to determine character encoding if not specified - # and input is not STDIN. 
- encoding=$( - head "$argument" | - LC_ALL=C tr 'A-Z' 'a-z' | - sed -ne '/<meta .*content-type.*charset=/ { - s/.*charset=["'\'']*\([-a-zA-Z0-9]*\).*["'\'']*/\1/p - }' - ) -fi - -if [ -n "$encoding" ] && pathfind iconv; then - alias to_utf8='iconv -f "$encoding" -t utf-8' -else # assume UTF-8 - alias to_utf8='cat' -fi - -htmlinput=$THIS_TEMPDIR/htmlinput - -if [ -z "$argument" ]; then - to_utf8 > $htmlinput # read from STDIN -elif [ -f "$argument" ]; then - to_utf8 "$argument" > $htmlinput # read from file -else - err "File '$argument' not found." - exit 1 -fi - -if ! cat $htmlinput | pandoc --ignore-args -r html -w markdown "$@" ; then - err "Failed to parse HTML. Trying again with tidy..." - tidy -q -asxhtml -utf8 $htmlinput | \ - pandoc --ignore-args -r html -w markdown "$@" -fi diff --git a/man/man1/hsmarkdown.1.md b/man/man1/hsmarkdown.1.md deleted file mode 100644 index a197ef2ca..000000000 --- a/man/man1/hsmarkdown.1.md +++ /dev/null @@ -1,42 +0,0 @@ -% HSMARKDOWN(1) Pandoc User Manuals -% John MacFarlane -% January 8, 2008 - -# NAME - -hsmarkdown - convert markdown-formatted text to HTML - -# SYNOPSIS - -hsmarkdown [*input-file*]... - -# DESCRIPTION - -`hsmarkdown` converts markdown-formatted text to HTML. It is designed -to be usable as a drop-in replacement for John Gruber's `Markdown.pl`. - -If no *input-file* is specified, input is read from *stdin*. -Otherwise, the *input-files* are concatenated (with a blank -line between each) and used as input. Output goes to *stdout* by -default. For output to a file, use shell redirection: - - hsmarkdown input.txt > output.html - -`hsmarkdown` uses the UTF-8 character encoding for both input and output. -If your local character encoding is not UTF-8, you should pipe input -and output through `iconv`: - - iconv -t utf-8 input.txt | hsmarkdown | iconv -f utf-8 - -`hsmarkdown` is implemented as a wrapper around `pandoc`(1). 
It -calls `pandoc` with the options `--from markdown --to html ---strict` and disables all other options. (Command-line options -will be interpreted as filenames, as they are by `Markdown.pl`.) - -# SEE ALSO - -`pandoc`(1). The *README* -file distributed with Pandoc contains full documentation. - -The Pandoc source code and all documentation may be downloaded from -<http://johnmacfarlane.net/pandoc/>. diff --git a/man/man1/html2markdown.1.md b/man/man1/html2markdown.1.md deleted file mode 100644 index 73e3420dd..000000000 --- a/man/man1/html2markdown.1.md +++ /dev/null @@ -1,95 +0,0 @@ -% HTML2MARKDOWN(1) Pandoc User Manuals -% John MacFarlane and Recai Oktas -% January 8, 2008 - -# NAME - -html2markdown - converts HTML to markdown-formatted text - -# SYNOPSIS - -html2markdown [*pandoc-options*] [\-- *special-options*] [*input-file* or -*URL*] - -# DESCRIPTION - -`html2markdown` converts *input-file* or *URL* (or text -from *stdin*) from HTML to markdown-formatted plain text. -If a URL is specified, `html2markdown` uses an available program -(e.g. wget, w3m, lynx or curl) to fetch its contents. Output is sent -to *stdout* unless an output file is specified using the `-o` -option. - -`html2markdown` uses the character encoding specified in the -"Content-type" meta tag. If this is not present, or if input comes -from *stdin*, UTF-8 is assumed. A character encoding may be specified -explicitly using the `-e` special option. - -# OPTIONS - -`html2markdown` is a wrapper for `pandoc`, so all of -`pandoc`'s options may be used. See `pandoc`(1) for -a complete list. The following options are most relevant: - --s, \--standalone -: Include title, author, and date information (if present) at the - top of markdown output. - --o *FILE*, \--output=*FILE* -: Write output to *FILE* instead of *stdout*. - -\--strict -: Use strict markdown syntax, with no extensions or variants. 
- -\--reference-links -: Use reference-style links, rather than inline links, in writing markdown - or reStructuredText. - --R, \--parse-raw -: Parse untranslatable HTML codes as raw HTML. - -\--no-wrap -: Disable text wrapping in output. (Default is to wrap text.) - --H *FILE*, \--include-in-header=*FILE* -: Include contents of *FILE* at the end of the header. Implies - `-s`. - --B *FILE*, \--include-before-body=*FILE* -: Include contents of *FILE* at the beginning of the document body. - --A *FILE*, \--include-after-body=*FILE* -: Include contents of *FILE* at the end of the document body. - --C *FILE*, \--custom-header=*FILE* -: Use contents of *FILE* - as the document header (overriding the default header, which can be - printed using `pandoc -D markdown`). Implies `-s`. - -# SPECIAL OPTIONS - -In addition, the following special options may be used. The special -options must be separated from the `html2markdown` command and any -regular `pandoc` options by the delimiter \``--`', as in - - html2markdown -o foo.txt -- -g 'curl -u bar:baz' -e latin1 \ - www.foo.com - --e *encoding*, \--encoding=*encoding* -: Assume the character encoding *encoding* in reading HTML. - (Note: *encoding* will be passed to `iconv`; a list of - available encodings may be obtained using `iconv -l`.) - If this option is not specified and input is not from - *stdin*, `html2markdown` will try to extract the character encoding - from the "Content-type" meta tag. If no character encoding is - specified in this way, or if input is from *stdin*, UTF-8 will be - assumed. - --g *command*, \--grabber=*command* -: Use *command* to fetch the contents of a URL. (By default, - `html2markdown` searches for an available program or text-based - browser to fetch the contents of a URL.) 
- -# SEE ALSO - -`pandoc`(1), `iconv`(1) diff --git a/pandoc.cabal b/pandoc.cabal index 4a2120079..57ad24b78 100644 --- a/pandoc.cabal +++ b/pandoc.cabal @@ -59,11 +59,10 @@ Data-Files: -- documentation README, INSTALL, COPYRIGHT, BUGS, changelog, -- wrappers - markdown2pdf, html2markdown, hsmarkdown + markdown2pdf Extra-Source-Files: -- sources for man pages man/man1/pandoc.1.md, man/man1/markdown2pdf.1.md, - man/man1/html2markdown.1.md, man/man1/hsmarkdown.1.md, -- tests tests/bodybg.gif, tests/writer.latex, @@ -120,8 +119,7 @@ Extra-Source-Files: tests/lhs-test.html+lhs, tests/lhs-test.fragment.html+lhs, tests/RunTests.hs -Extra-Tmp-Files: man/man1/pandoc.1, man/man1/hsmarkdown.1, - man/man1/html2markdown.1, man/man1/markdown2pdf.1 +Extra-Tmp-Files: man/man1/pandoc.1, man/man1/markdown2pdf.1 Flag highlighting Description: Compile in support for syntax highlighting of code blocks. @@ -130,7 +128,7 @@ Flag executable Description: Build the pandoc executable. Default: True Flag wrappers - Description: Build the wrappers (hsmarkdown, markdown2pdf). + Description: Build the wrappers (markdown2pdf). Default: True Flag library Description: Build the pandoc library. @@ -219,17 +217,6 @@ Executable pandoc else Buildable: False -Executable hsmarkdown - Hs-Source-Dirs: src - Main-Is: hsmarkdown.hs - Ghc-Options: -Wall -threaded - Ghc-Prof-Options: -auto-all - Extensions: CPP - if flag(wrappers) - Buildable: True - else - Buildable: False - Executable markdown2pdf Hs-Source-Dirs: src Main-Is: markdown2pdf.hs diff --git a/src/hsmarkdown.hs b/src/hsmarkdown.hs deleted file mode 100644 index 3f689d4ec..000000000 --- a/src/hsmarkdown.hs +++ /dev/null @@ -1,47 +0,0 @@ -{- -Copyright (C) 2006-8 John MacFarlane <jgm@berkeley.edu> - -This program is free software; you can redistribute it and/or modify -it under the terms of the GNU General Public License as published by -the Free Software Foundation; either version 2 of the License, or -(at your option) any later version. 
- -This program is distributed in the hope that it will be useful, -but WITHOUT ANY WARRANTY; without even the implied warranty of -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -GNU General Public License for more details. - -You should have received a copy of the GNU General Public License -along with this program; if not, write to the Free Software -Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA --} - -{- | - Copyright : Copyright (C) 2009 John MacFarlane - License : GNU GPL, version 2 or above - - Maintainer : John MacFarlane <jgm@berkeley@edu> - Stability : alpha - Portability : portable - -Wrapper around pandoc that emulates Markdown.pl as closely as possible. --} -module Main where -import System.Process -import System.Environment ( getArgs ) --- Note: ghc >= 6.12 (base >=4.2) supports unicode through iconv --- So we use System.IO.UTF8 only if we have an earlier version -#if MIN_VERSION_base(4,2,0) -#else -import Prelude hiding ( putStr, putStrLn, writeFile, readFile, getContents ) -import System.IO.UTF8 -#endif -import Control.Monad (forM_) - -main :: IO () -main = do - files <- getArgs - let runPandoc inp = readProcess "pandoc" ["--from", "markdown", "--to", "html", "--strict"] inp >>= putStrLn - if null files - then getContents >>= runPandoc - else forM_ files $ \f -> readFile f >>= runPandoc