author | John MacFarlane <jgm@berkeley.edu> | 2021-02-22 14:17:22 -0800 |
---|---|---|
committer | John MacFarlane <jgm@berkeley.edu> | 2021-02-22 14:17:22 -0800 |
commit | d30791a38166538be60a134196f1d2675275017d (patch) | |
tree | 3294ac5a972807e28aa43e05d21bf9ce712f3f4b /src/Text/Pandoc | |
parent | 5a73c5d3f8136c7fba7429c3ae3a8ae31c58030b (diff) | |
Fall back to latin1 if UTF-8 decoding fails when handling a URL argument served with no charset in the MIME type.
The assumption is that most pages that don't specify a charset
in the mime type are either UTF-8 or latin1. I think that's a good
assumption, though I'm not sure.
Diffstat (limited to 'src/Text/Pandoc')
-rw-r--r-- | src/Text/Pandoc/App.hs | 8 |
1 files changed, 7 insertions, 1 deletions
diff --git a/src/Text/Pandoc/App.hs b/src/Text/Pandoc/App.hs
index 59af029b5..40fb34834 100644
--- a/src/Text/Pandoc/App.hs
+++ b/src/Text/Pandoc/App.hs
@@ -1,3 +1,4 @@
+{-# LANGUAGE LambdaCase #-}
 {-# LANGUAGE OverloadedStrings #-}
 {-# LANGUAGE CPP #-}
 {-# LANGUAGE ScopedTypeVariables #-}
@@ -352,7 +353,12 @@ readURI src = do
     Just "UTF-8" -> return $ UTF8.toText bs
     Just "ISO-8859-1" -> return $ T.pack $ B8.unpack bs
     Just charset -> throwError $ PandocUnsupportedCharsetError charset
-    Nothing -> return $ UTF8.toText bs
+    Nothing -> liftIO $ -- try first as UTF-8, then as latin1
+      E.catch (return $! UTF8.toText bs)
+              (\case
+                  TSE.DecodeError{} ->
+                    return $ T.pack $ B8.unpack bs
+                  e -> E.throwIO e)
 
 readFile' :: MonadIO m => FilePath -> m BL.ByteString
 readFile' "-" = liftIO BL.getContents
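The fallback described in the commit message can also be sketched in pure code, without catching an exception. This is a minimal illustration, not the code in `App.hs`: the name `decodeWithFallback` is hypothetical, and it uses `decodeUtf8'` and `decodeLatin1` from the `text` package rather than pandoc's own `UTF8.toText`, so it does not reproduce any extra processing that helper may perform.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: decode bytes as UTF-8 and, if they are not
-- valid UTF-8, reinterpret each byte as a Latin-1 (ISO-8859-1) code point.
decodeWithFallback :: B.ByteString -> T.Text
decodeWithFallback bs =
  case TE.decodeUtf8' bs of
    Right t -> t                     -- the input was valid UTF-8
    Left _  -> TE.decodeLatin1 bs    -- any decoding error: fall back to Latin-1
```

Under this sketch, the two-byte sequence `0xC3 0xA9` (UTF-8 for "é") and the single byte `0xE9` (Latin-1 for "é") both decode to the same one-character text. The Latin-1 step cannot fail, since every byte maps to a code point, which is why the order matters: UTF-8 has to be tried first.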