path: root/src/Text/Pandoc/Parsing.hs
Age | Commit message | Author | Files | Lines
2012-02-05 | Parsing: Make characterReference fail if entity not found. | John MacFarlane | 1 | -2/+2
2012-02-05 | Removed module Text.Pandoc.CharacterReferences. | John MacFarlane | 1 | -1/+11
Moved characterReference parser to Text.Pandoc.Parsing. decodeCharacterReferences is now replaced by fromEntities in Text.Pandoc.XML.
2012-02-04 | Complete rewrite of LaTeX reader. | John MacFarlane | 1 | -4/+20
* The new reader is more robust, accurate, and extensible. It is still quite incomplete, but it should be easier now to add features.
* Text.Pandoc.Parsing: Added withRaw combinator.
* Markdown reader: do escapedChar before raw latex inline. Otherwise we capture commands like \{.
* Fixed latex citation tests for new citeproc.
* Handle \include{} commands in latex. This is done in pandoc.hs, not the (pure) latex reader. But the reader exports the needed function, handleIncludes.
* Moved err and warn from pandoc.hs to Shared.
* Fixed tests - raw tex should sometimes have trailing space.
* Updated lhs-test for highlighting-kate changes.
2012-01-27 | Fixed table parsing with wide or combining characters. | John MacFarlane | 1 | -1/+1
Closes #348. Closes #108.
2012-01-01 | New treatment of dashes in --smart mode. | John MacFarlane | 1 | -5/+29
* `---` is always em-dash, `--` is always en-dash.
* pandoc no longer tries to guess when `-` should be en-dash.
* A new option, `--old-dashes`, is provided for legacy documents.
Rationale: The rules for en-dash are too complex and language-dependent for a guesser to work reliably. This change gives users greater control. The alternative of using unicode isn't very good, since unicode em- and en- dashes are barely distinguishable in a monospace font.
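The dash rule above can be sketched as a small pure function. This is only an illustration of the rule, not the parser in Text.Pandoc.Parsing; the name `smartDashes` is hypothetical and the function works on plain strings rather than pandoc inlines.

```haskell
-- Hypothetical helper (not pandoc's API): apply the fixed dash rule
-- to a plain string. Three hyphens are tried before two, so "---"
-- never turns into an en-dash followed by a hyphen.
smartDashes :: String -> String
smartDashes ('-':'-':'-':rest) = '\8212' : smartDashes rest  -- U+2014 em-dash
smartDashes ('-':'-':rest)     = '\8211' : smartDashes rest  -- U+2013 en-dash
smartDashes (c:rest)           = c : smartDashes rest        -- single '-' is left alone
smartDashes []                 = []
```

A single `-` passes through unchanged, which matches the statement that pandoc no longer guesses when a hyphen should be an en-dash.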
2011-12-29 | Better smart quote parsing. | John MacFarlane | 1 | -1/+7
* Added stateLastStrPos to ParserState. This lets us keep track of whether we're parsing the position immediately after a 'str'. If we encounter a ' in such a location, it must be an apostrophe, and can't be a single quote start.
* Set this in the markdown, textile, html, and rst str parsers.
* Closes #360.
2011-12-27 | Replaced Apostrophe, Ellipses, EmDash, EnDash w/ unicode strings. | John MacFarlane | 1 | -6/+6
2011-12-27 | Pretty: return Str with unicode instead of Apostrophe. | John MacFarlane | 1 | -1/+1
2011-12-05 | Parsing: Removed charsInBalanced', added param to charsInBalanced. | John MacFarlane | 1 | -20/+13
The extra parameter is a character parser. This is needed for proper handling of escapes, etc.
2011-12-05 | Parsing: Changed type of escaped to return Char | John MacFarlane | 1 | -5/+2
2011-07-30 | Added nonspaceChar to Text.Pandoc.Parsing. | John MacFarlane | 1 | -0/+5
2011-07-25 | Smart quotes: handle '...hi' properly. | John MacFarlane | 1 | -1/+2
Also added test case.
2011-07-23 | Properly handle characters in the 128..159 range. | John MacFarlane | 1 | -7/+7
These aren't valid in HTML, but many HTML files produced by Windows tools contain them. We substitute correct unicode characters.
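As an illustration of this kind of substitution, the sketch below maps a few code points in the 128..159 range to the characters that Windows-1252 assigns to them. The table is a deliberately partial subset, and the names `cp1252Subset` and `fixC1` are hypothetical; the commit does not say which table pandoc uses.

```haskell
import Data.Char (ord)
import Data.Maybe (fromMaybe)

-- Partial Windows-1252 table for code points 128..159.
-- Illustrative subset only; a full table has more entries.
cp1252Subset :: [(Int, Char)]
cp1252Subset =
  [ (0x80, '\x20AC')  -- euro sign
  , (0x85, '\x2026')  -- horizontal ellipsis
  , (0x91, '\x2018')  -- left single quotation mark
  , (0x92, '\x2019')  -- right single quotation mark
  , (0x93, '\x201C')  -- left double quotation mark
  , (0x94, '\x201D')  -- right double quotation mark
  , (0x96, '\x2013')  -- en dash
  , (0x97, '\x2014')  -- em dash
  , (0x99, '\x2122')  -- trade mark sign
  ]

-- Hypothetical helper: replace a character if it is in the table,
-- otherwise return it unchanged.
fixC1 :: Char -> Char
fixC1 c = fromMaybe c (lookup (ord c) cp1252Subset)
```

For example, `map fixC1 "\147Hi\148"` wraps `Hi` in proper curly double quotes.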
2011-04-29 | Revert "Parsing: Use new type aliases, PandocParser, GeneralParser." | John MacFarlane | 1 | -123/+118
This reverts commit ec5410bc4e9d228b7dc0123061d80f9addf825bf.
2011-04-29 | Parsing: Use new type aliases, PandocParser, GeneralParser. | John MacFarlane | 1 | -118/+123
This should make it easier to change the types later.
2011-03-18 | Changed uri parser so it doesn't include trailing punctuation. | John MacFarlane | 1 | -3/+19
So, in RST, 'http://google.com.' should be parsed as a link to 'http://google.com' followed by a period. The parser is smart enough to recognize balanced parentheses, as often occur in wikipedia links: 'http://foo.bar/baz_(bam)'. Also added ()s to RST specialChars, so '(http://google.com)' will be parsed as a link in parens. Added test cases. Resolves Issue #291.
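The behaviour described here can be approximated on plain strings as follows. This is a sketch under stated assumptions, not the Parsec-based uri parser in Text.Pandoc.Parsing: `trimUri` is a hypothetical name, and the set of characters treated as trailing punctuation is a guess.

```haskell
-- Hypothetical helper: split a candidate URL into (link, trailing text).
-- Trailing punctuation is moved out of the link. A trailing ')' is kept
-- only when the parentheses inside the link are balanced.
trimUri :: String -> (String, String)
trimUri s = go s ""
  where
    go "" rest = ("", rest)
    go url rest
      | last url `elem` ".,;:!?"          = go (init url) (last url : rest)
      | last url == ')' && unbalanced url = go (init url) (')' : rest)
      | otherwise                         = (url, rest)
    -- More closing than opening parentheses means the last ')' is not ours.
    unbalanced u = count ')' u > count '(' u
    count c = length . filter (== c)
```

With this sketch, `trimUri "http://google.com."` separates the final period, while `trimUri "http://foo.bar/baz_(bam)"` leaves the link whole.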
2011-01-26 | Add support for attributes in inline Code. | John MacFarlane | 1 | -1/+1
Additional related changes:
* URLs in Code in autolinks now use class "url".
* Require highlighting-kate 0.2.8.2, which omits the final <br/> tag, essential for inline code.
2011-01-26 | Bumped version to 1.8; depend on pandoc-types 1.8. | John MacFarlane | 1 | -7/+6
The old TeX, HtmlInline and RawHtml elements have been removed and replaced by generic RawInline and RawBlock elements. All modules updated to use the new raw elements.
2011-01-19 | More small parser rewrites for small performance gains. | John MacFarlane | 1 | -9/+11
2011-01-19 | Parsing: Rewrote spaceChar for significant speedup in readers. | John MacFarlane | 1 | -1/+1
2011-01-14 | Parsing: Fixed bug in grid table parser. | John MacFarlane | 1 | -5/+5
Spaces at end of line were not being stripped properly, resulting in unintended LineBreaks.
2011-01-05 | Fixed macro parsing. | John MacFarlane | 1 | -8/+10
2011-01-04 | Moved 'macro' and 'applyMacros'' from markdown reader to Parsing. | John MacFarlane | 1 | -2/+27
2010-12-30 | New HTML reader using tagsoup as a lexer. | John MacFarlane | 1 | -3/+3
* The new reader is faster and more accurate.
* API changes for Text.Pandoc.Readers.HTML:
  - removed rawHtmlBlock, anyHtmlBlockTag, anyHtmlInlineTag, anyHtmlTag, anyHtmlEndTag, htmlEndTag, extractTagType, htmlBlockElement, htmlComment
  - added htmlTag, htmlInBalanced, isInlineTag, isBlockTag, isTextTag
* tagsoup is a new dependency.
* Text.Pandoc.Parsing: Generalized type on readWith.
* Benchmark.hs: Added length calculation to force full evaluation.
* Updated HTML reader tests.
* Updated markdown and textile readers to use the functions from the HTML reader.
* Note: The markdown reader now correctly handles some cases it did not before. For example:
  <hr/> is reproduced without adding a space.
  <script> a = '<b>'; </script> is parsed correctly.
2010-12-24 | Use functions from Text.Pandoc.Generic instead of processWith(M). | John MacFarlane | 1 | -1/+2
2010-12-17 | Added new prettyprinting module. | John MacFarlane | 1 | -2/+3
* Added Text.Pandoc.Pretty. This is better suited for pandoc than the 'pretty' package. One advantage is that we now get proper wrapping; Emph [Inline] is no longer treated as a big unwrappable unit. Previously we only got breaks for spaces at the "outer level." We can also more easily avoid doubled blank lines. Performance is significantly better as well.
* Removed Text.Pandoc.Blocks. Text.Pandoc.Pretty allows you to define blocks and concatenate them.
* Modified markdown, RST, org writers to use Text.Pandoc.Pretty instead of Text.PrettyPrint.HughesPJ.
* Text.Pandoc.Shared: Added writerColumns to WriterOptions.
* Markdown, RST, Org writers now break text at writerColumns.
* Added --columns command-line option, which sets stColumns and writerColumns.
* Table parsing: If the size of the header > stColumns, use the header size as 100% for purposes of calculating relative widths of columns.
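The table-width rule in the last item can be illustrated with a small calculation. This sketch assumes that the "size of the header" is the sum of the column widths in characters; the function name `relativeWidths` and its argument layout are hypothetical, not taken from the pandoc source.

```haskell
-- Hypothetical helper: relative column widths as fractions of the line.
-- 'columns' plays the role of the configured column count; if the header
-- is wider than that, the header width is used as 100% instead.
relativeWidths :: Int -> [Int] -> [Double]
relativeWidths columns colChars =
  map (\w -> fromIntegral w / fromIntegral total) colChars
  where
    headerSize = sum colChars
    total      = max columns headerSize
```

With 80 columns, two 20-character columns each get a quarter of the width, while two 60-character columns (a 120-character header) each get half.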
2010-12-10 | Removed HTML sanitization. | John MacFarlane | 1 | -2/+0
This is better done on the resulting HTML; use the xss-sanitize library for this. xss-sanitize is based on pandoc's sanitization, but improves it.
- Removed stateSanitize from ParserState.
- Removed --sanitize-html option.
2010-12-07 | Smart punctuation: recognize entities. | John MacFarlane | 1 | -8/+22
Now &ldquo;Hi&rdquo; gets parsed as a Quoted DoubleQuote inline.
2010-12-07 | Smart punctuation: don't allow ellipses containing spaces. | John MacFarlane | 1 | -1/+1
Previously we allowed '. . .', ' . . . ', etc. This caused too many complications, and removed author's flexibility in combining ellipses with spaces and periods.
2010-12-07 | Moved smartPunctuation from Markdown to Parsing. | John MacFarlane | 1 | -3/+92
+ Parameterized smartPunctuation on an inline parser.
+ Handle smartPunctuation in Textile reader.
2010-12-05 | Fix regression: markdown references should be case-insensitive. | John MacFarlane | 1 | -38/+17
This broke when we added the Key type. We had assumed that the custom case-insensitive Ord instance would ensure case-insensitive matching, but that is not how Data.Map works.
* Added a test case for case-insensitivity in markdown-reader-more
* Removed old refsMatch from Text.Pandoc.Parsing module;
* hid the 'Key' constructor;
* dropped the custom Ord and Eq instances, deriving instead;
* added fromKey and toKey to convert between Keys and Inline lists;
* toKey ensures that keys are case-insensitive, since this is the only way the API provides to construct a Key.
Resolves Issue #272.
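The design described above, normalizing at construction time and deriving the instances, might look roughly like the sketch below. It is simplified on purpose: real pandoc keys wrap inline lists, whereas this version wraps a String, and `lookupRef` and `KeyTable` are hypothetical names. It depends on Data.Map from the containers package.

```haskell
import Data.Char (toLower)
import qualified Data.Map as M

-- Simplified stand-in for pandoc's Key. In a real module the constructor
-- would not be exported, so toKey is the only way to build a Key.
newtype Key = Key String
  deriving (Eq, Ord, Show)

-- Normalize case once, at construction, so the derived Eq and Ord
-- instances compare keys case-insensitively.
toKey :: String -> Key
toKey = Key . map toLower

fromKey :: Key -> String
fromKey (Key s) = s

type KeyTable = M.Map Key String

-- Hypothetical lookup helper: the label is normalized the same way
-- before the map is consulted.
lookupRef :: String -> KeyTable -> Maybe String
lookupRef label = M.lookup (toKey label)
```

Because every Key passes through `toKey`, a reference defined as `[Foo]` and used as `[FOO]` resolves to the same map entry.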
2010-11-06 | Removed CITEPROC CPP conditionals from library code. | John MacFarlane | 1 | -4/+0
By Cabal policy, the API should not change depending on flags.
2010-10-26 | Process LaTeX macros in markdown, and apply to TeX math. | John MacFarlane | 1 | -2/+7
Example:
\newcommand{\plus}[2]{#1 + #2}
$\plus{3}{4}$
yields: 3+4
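The argument substitution in this example can be sketched on plain strings. This is not how pandoc applies macros (the commit does not show that code); `expandBody` is a hypothetical helper that only replaces `#1` to `#9` in a macro body with the supplied arguments.

```haskell
import Data.Char (digitToInt, isDigit)

-- Hypothetical helper: substitute positional parameters (#1..#9) in a
-- macro body. A '#' followed by anything else, or by a number with no
-- matching argument, is copied through unchanged.
expandBody :: String -> [String] -> String
expandBody ('#':d:rest) args
  | isDigit d
  , let n = digitToInt d
  , n >= 1 && n <= length args
  = (args !! (n - 1)) ++ expandBody rest args
expandBody (c:rest) args = c : expandBody rest args
expandBody [] _ = []
```

Here `expandBody "#1 + #2" ["3", "4"]` gives the string `3 + 4`; in TeX math the spaces are not significant, which is consistent with the rendered 3+4 in the example.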
2010-07-13 | Parse \chapter{} in latex. | John MacFarlane | 1 | -2/+4
+ Added stateHasChapters to ParserState.
+ If a \chapter command is encountered, this is set to True and subsequent \section commands (etc.) will be bumped up one level.
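The level adjustment might be sketched as below. The base levels (section = 1, subsection = 2, subsubsection = 3, with chapter taking level 1) are assumptions made for the illustration, and `sectionLevel` is a hypothetical function that takes the flag directly instead of reading it from ParserState.

```haskell
-- Hypothetical helper: header level for a LaTeX sectioning command.
-- The Bool stands for stateHasChapters. The base levels are assumed,
-- not taken from the pandoc source.
sectionLevel :: Bool -> String -> Maybe Int
sectionLevel hasChapters name = fmap bump (lookup name base)
  where
    base = [ ("chapter", 1), ("section", 1)
           , ("subsection", 2), ("subsubsection", 3) ]
    -- Once a chapter has been seen, everything below it moves down a level.
    bump n
      | hasChapters && name /= "chapter" = n + 1
      | otherwise                        = n
```

Under these assumptions `sectionLevel False "section"` is `Just 1`, and `sectionLevel True "section"` is `Just 2`.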
2010-07-11 | Merge branch 'atlists'. Added auto-numbered example lists. | John MacFarlane | 1 | -5/+27
2010-07-06 | Allow language-neutral table captions. | John MacFarlane | 1 | -1/+4
+ Captions may now begin simply with ':', instead of 'Table:'
+ Captions may now appear either above or below the table.
+ Resolves Issue #227.
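The two accepted caption prefixes can be illustrated with a line-based check. This is a sketch only: `captionText` is a hypothetical name, it works on a single line of text, and it ignores where the caption sits relative to the table.

```haskell
import Data.Char (isSpace)
import Data.List (stripPrefix)

-- Hypothetical helper: return the caption text if the line starts with
-- "Table:" or with a bare ":", and Nothing otherwise.
captionText :: String -> Maybe String
captionText line =
  case stripPrefix "Table:" line of
    Just rest -> Just (trim rest)
    Nothing   -> case line of
      (':':rest) -> Just (trim rest)
      _          -> Nothing
  where
    trim = dropWhile isSpace  -- drop the space after the prefix
```

Both `captionText "Table: Results"` and `captionText ": Results"` yield `Just "Results"`.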
2010-07-05 | More refactoring of grid table code. | John MacFarlane | 1 | -8/+60
2010-07-05 | Minor reformatting. | John MacFarlane | 1 | -2/+4
2010-07-05 | Moved generic grid table functions from RST reader -> Parsing. | John MacFarlane | 1 | -3/+85
Here they can be used by the Markdown reader as well.
2010-07-05 | Moved parsing functions from Text.Pandoc.Shared to new module. | John MacFarlane | 1 | -0/+537
+ Text.Pandoc.Parsing