From 3776e828a83048697e5c64d9fb4bedc0145197dc Mon Sep 17 00:00:00 2001
From: John MacFarlane <jgm@berkeley.edu>
Date: Thu, 10 Jun 2021 16:47:02 -0700
Subject: Fix MediaBag regressions.

With the 2.14 release `--extract-media` stopped working as before;
there could be mismatches between the paths in the rendered document and
the extracted media.

This patch makes several changes (while keeping the same API).

The `mediaPath` in 2.14 was always constructed from the SHA1 hash of
the media contents.  Now, we preserve the original path unless it's
an absolute path or contains `..` segments (in that case we use a path
based on the SHA1 hash of the contents).

When constructing a path from the SHA1 hash, we always use the
original extension, if there is one. Otherwise we look up an
appropriate extension for the mime type.

`mediaDirectory` and `mediaItems` now use the `mediaPath`, rather
than the mediabag key, for the first component of the tuple.
This makes more sense, I think, and fits with the documentation
of these functions; eventually, though, we should rework the API so that
`mediaItems` returns both the keys and the MediaItems.

Rewriting of source paths in `extractMedia` has been fixed.

`fillMediaBag` has been modified so that it doesn't modify
image paths (that was part of the problem in #7345).

We now do path normalization (e.g. `\` separators on Windows) only
in writing the media; the paths are left unchanged in the image
links (sensibly, since they might be URLs and not file paths).

These changes should restore the original behavior from before 2.14.

Closes #7345.
---
 MANUAL.txt | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

(limited to 'MANUAL.txt')

diff --git a/MANUAL.txt b/MANUAL.txt
index b3a1f95e2..ef569433a 100644
--- a/MANUAL.txt
+++ b/MANUAL.txt
@@ -675,12 +675,12 @@ header when requesting a document from a URL:
 :   Extract images and other media contained in or linked from
     the source document to the path *DIR*, creating it if
     necessary, and adjust the images references in the document
-    so they point to the extracted files.  If the source format is
-    a binary container (docx, epub, or odt), the media is
-    extracted from the container and the original
-    filenames are used. Otherwise the media is read from the
-    file system or downloaded, and new filenames are constructed
-    based on SHA1 hashes of the contents.
+    so they point to the extracted files.  Media are downloaded,
+    read from the file system, or extracted from a binary
+    container (e.g. docx), as needed.  The original file paths
+    are used if they are relative paths not containing `..`.
+    Otherwise filenames are constructed from the SHA1 hash of
+    the contents.
 
 `--abbreviations=`*FILE*
 
-- 
cgit v1.2.3