From 63fd961cda8374a5f3f7d7d550a6aad448d64523 Mon Sep 17 00:00:00 2001 From: heyarne Date: Fri, 5 Mar 2021 12:17:16 +0000 Subject: [PATCH] Detailed write-up about 02c Corrupted Zip File --- sources/02c Corrupted Zip File.ipynb | 82 +++++++++++++++------------- 1 file changed, 45 insertions(+), 37 deletions(-) diff --git a/sources/02c Corrupted Zip File.ipynb b/sources/02c Corrupted Zip File.ipynb index f1b0dc4..5339870 100644 --- a/sources/02c Corrupted Zip File.ipynb +++ b/sources/02c Corrupted Zip File.ipynb @@ -4,11 +4,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Corrupted Zipfiles\n", + "# ZIP-File Corruption Issues\n", "\n", - "Out of the 40 files we were trying to download, some needed to be fetched from the [Long-Term Archive](https://scihub.copernicus.eu/userguide/#LTA_Long_Term_Archive_Access).\n", - "After retrying the download several times, all files could be retrieved.\n", - "However, some of the downloaded zip files are suspiciously small:" + "The Scihub platform has a policy that moves data into a Long-Term Archive after a certain period of time.[^lta]\n", + "Retrieval of files from this offline archive is a two-step process:\n", + "\n", + "1. A request to the URL that initiates the download for online products (`https://scihub.copernicus.eu/dhus/odata/v1/Products('$UUID')/$value`) instead initiates a data retrieval request, which moves the archived file from offline to online storage.\n", + "2. After several minutes or hours, when the product has finished moving to online storage, a request to the same URL initiates the download.\n", + "\n", + "However, due to technical issues on which the Copernicus Open Access Hub issue channels provided no additional information, some of the offline files were incorrectly restored. 
The MD5 checksum of the files as delivered was identical to that of the downloaded product, but the products appeared to be incomplete or incorrectly encoded ZIP files.\n", + "\n", + "The workaround was to manually copy the file names into the search interface of the [Open Hub](https://scihub.copernicus.eu/dhus/) and start the download from there.\n", + "\n", + "This notebook describes a process for identifying corrupted ZIP files and manually initiating their correct retrieval, which is feasible as long as the number of corrupted files is sufficiently small.\n", + "\n", + "[^lta]: Detailed information about this is provided in the [Open Access Hub User Guide](https://web.archive.org/web/20210117042627/https://scihub.copernicus.eu/userguide/LongTermArchive).\n", + "\n", + "## Identification Process\n", + "\n", + "The following command uses Unix command-line tools to list all files in a target folder in ascending order by size. The size is printed in a human-readable format in the left column." ] }, { @@ -71,7 +85,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Trying to extract them causes an error:" + "The first 10 files are significantly smaller than expected.\n", + "The following piped command tries to extract one of the undersized files, which raises an error:" ] }, { @@ -101,26 +116,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## What does the API say?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import sentinelsat" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "api = sentinelsat.SentinelAPI(os.getenv('SCIHUB_USERNAME'), os.getenv('SCIHUB_PASSWORD'))" + "## API Responses\n", + "\n", + "Continuing with the file above, `S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509`, the downloaded file is compared to what the API indicates it should look like:" ] }, { @@ -138,6 +136,10 @@ } ], "source": [ + "import os\n", + "import sentinelsat\n", + "\n", + "api = sentinelsat.SentinelAPI(os.getenv('SCIHUB_USERNAME'), os.getenv('SCIHUB_PASSWORD'))\n", "res = api.to_geodataframe(api.query(raw='S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509'))" ] }, @@ -173,9 +175,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Do the downloads fail repeatedly?\n", + "## Verification through Repetition\n", "\n", - "All files have been downloaded again to another folder, `input/tempelhofer_feld_test`." + "To check whether repeated downloads fail in the same way, the products were downloaded again into a separate target folder, `input/tempelhofer_feld_test`.\n", + "\n", + "The piped commands below calculate the MD5 checksum of every ZIP file smaller than 500 MB, once for the original download folder `input/tempelhofer_feld` and once for `input/tempelhofer_feld_test`.\n", + "\n", + "The checksums are identical, which shows that both downloads retrieved the same (corrupted) files." ] }, { @@ -234,14 +240,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The downloads are failing in exactly the same way when trying the downloads repeatedly." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Manual Download" + "## Manual Download\n", + "\n", + "Another approach was to use the link provided in the API response directly and compare this manual download to the downloads initiated through the `sentinelsat` API above.\n", + "\n", + "The fact that the API response from the Open Access Hub API contained a bad checksum is already a strong indicator that the error is introduced on the server side during the retrieval process. This manual verification tries to further rule out the `sentinelsat` module as a possible source of error.\n", + "\n", + "Since the link also started a download of a broken ZIP file, all indicators point toward a server-side error at the Copernicus Open Access Hub." ] }, { @@ -268,8 +273,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "When following the link above, the target file is 25MB.\n", - "This points towards an error on the side of scihub." + "## Solution\n", + "\n", + "A temporary solution, as indicated above, is to manually copy the product names (the file names without the `.zip` extension) into the [Search Mask of the Open Hub](https://scihub.copernicus.eu/dhus/#/home), which will show the product as Offline. The LTA retrieval can then be initiated manually and completes after several minutes.\n", + "\n", + "While this approach restores the products without corruption, the Open Access Hub API is expected to resume operation as advertised once the issue has been escalated." ] }, {