diff options
author | John MacFarlane <jgm@berkeley.edu> | 2016-10-23 23:12:36 +0200 |
---|---|---|
committer | John MacFarlane <jgm@berkeley.edu> | 2016-10-23 23:12:36 +0200 |
commit | bf72a482ebf8483028f587fb538d35e2b18dade4 (patch) | |
tree | a98ab74ccd743eb5ddd298f8e3c1d35951b621cc /src/Text/Pandoc | |
parent | 738806112bc0ee1711c6f170361d382c7d4265e8 (diff) | |
download | pandoc-bf72a482ebf8483028f587fb538d35e2b18dade4.tar.gz |
Tighten up parsing of raw email addresses.
Technically `**@user` is a valid email address, but if we
allow things like this, we get bad results in markdown flavors
that autolink raw email addresses. (See #2940.)
So we exclude a few valid email addresses in order to
avoid these more common bad cases.
Closes #2940.
Diffstat (limited to 'src/Text/Pandoc')
-rw-r--r-- | src/Text/Pandoc/Parsing.hs | 17 |
1 files changed, 13 insertions, 4 deletions
diff --git a/src/Text/Pandoc/Parsing.hs b/src/Text/Pandoc/Parsing.hs index daf8e867d..110e34c6a 100644 --- a/src/Text/Pandoc/Parsing.hs +++ b/src/Text/Pandoc/Parsing.hs @@ -178,7 +178,7 @@ import qualified Text.Pandoc.UTF8 as UTF8 (putStrLn) import Text.Parsec hiding (token) import Text.Parsec.Pos (newPos) import Data.Char ( toLower, toUpper, ord, chr, isAscii, isAlphaNum, - isHexDigit, isSpace ) + isHexDigit, isSpace, isPunctuation ) import Data.List ( intercalate, transpose, isSuffixOf ) import Text.Pandoc.Shared import qualified Data.Map as M @@ -405,9 +405,18 @@ emailAddress = try $ toResult <$> mailbox <*> (char '@' *> domain) domain = intercalate "." <$> (subdomain `sepby1` dot) dot = char '.' subdomain = many1 $ alphaNum <|> innerPunct - innerPunct = try (satisfy (\c -> isEmailPunct c || c == '@') <* - notFollowedBy space) - emailWord = many1 $ satisfy isEmailChar + -- this excludes some valid email addresses, since an + -- email could contain e.g. '__', but gives better results + -- for our purposes, when combined with markdown parsing: + innerPunct = try (satisfy (\c -> isEmailPunct c || c == '@') + <* notFollowedBy space + <* notFollowedBy (satisfy isPunctuation)) + -- technically an email address could begin with a symbol, + -- but allowing this creates too many problems. + -- See e.g. https://github.com/jgm/pandoc/issues/2940 + emailWord = do x <- satisfy isAlphaNum + xs <- many (satisfy isEmailChar) + return (x:xs) isEmailChar c = isAlphaNum c || isEmailPunct c isEmailPunct c = c `elem` "!\"#$%&'*+-/=?^_{|}~;" -- note: sepBy1 from parsec consumes input when sep |