Finish section 02

heyarne 2021-03-08 14:50:04 +00:00
commit 0cb8da0286
5 changed files with 230 additions and 287 deletions

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@ -6,7 +6,7 @@
"source": [
"# ZIP-File Corruption Issues\n",
"\n",
"The Scihub platform has a policy that moves data into a Long-Term-Archive after a certain period of time.[^lta]\n",
"The Copernicus Open Access Hub has a policy of moving data into a Long-Term-Archive after a certain period of time.[^lta]\n",
"Retrieval of files from this offline archive is a two step process:\n",
"\n",
"1. A request to the URL that initializes the download for online products (`https://scihub.copernicus.eu/dhus/odata/v1/Products('$UUID')/$value`) instead initializes a data retrieval request which moves the archived file from offline to online storage.\n",

@ -4,11 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi-Threading Performance Benchmark\n",
"# Multi-Processing Performance Benchmark\n",
"\n",
"This notebook contains a performance comparison of different methods to process the NDVI calculations.\n",
"This section contains a performance comparison of single- and multiprocess-based calculation methods used in [](02b Timeseries.ipynb).\n",
"\n",
"The `%%timeit` cell magic runs the cell content multiple times and outputs statistics on those multiple runs, thereby reducing factors such as garbage collection pauses etc."
"The `%%timeit` cell magic runs the cell content multiple times and prints statistics on those runs. This reduces the influence of factors such as garbage collection pauses on the measured runtime, and the results can be used to verify performance assumptions."
]
},
{
@ -32,12 +32,12 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Number of files: 27\n"
"Number of files: 30\n"
]
}
],
"source": [
"test_files = list(Path('output/ndvi').glob('*.tif'))\n",
"test_files = list(Path('resources/tempelhofer_feld/ndvi').glob('*.tif'))\n",
"print(f'Number of files: {len(test_files)}')"
]
},
@ -45,7 +45,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The function we test with:"
"The performance is tested with the following function call:"
]
},
{
@ -64,91 +64,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## In a single process\n",
"### Time to process a single file"
"The function receives a string or `pathlib.Path`, reads the data using `rasterio`, and calculates the average value of the data using the `ma.average` function provided by `numpy`.\n",
"\n",
"The benchmark is performed on 4 CPUs:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"36.2 ms ± 42.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"average(test_files[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time to process all files"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"980 ms ± 7.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"averages = [avg for avg in map(average, test_files)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Increasing the list size"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4.86 s ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"averages = [avg for avg in map(average, test_files * 5)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time when using a worker pool\n",
"\n",
"Number of CPUs the multiprocessing pools can access:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
@ -156,7 +80,7 @@
"4"
]
},
"execution_count": 7,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@ -169,19 +93,45 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### On One element"
"## Single File\n",
"### In a Single Process"
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"277 ms ± 3.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
"4.35 ms ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"average(test_files[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiprocessing"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"187 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
@ -195,52 +145,104 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### On the complete list"
"## All Files in Folder\n",
"### In a Single Process"
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"630 ms ± 8.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
"131 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"with Pool() as pool:\n",
" averages = [avg for avg in pool.map(average, test_files)]"
"averages = list(map(average, test_files))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Increasing the list size"
"### With Multiprocessing"
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.1 s ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
"248 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"with Pool() as pool:\n",
" averages = [avg for avg in pool.map(average, test_files * 5)]"
" averages = list(pool.map(average, test_files))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## All Files Multiple Times\n",
"### In a Single Process"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"654 ms ± 952 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"averages = list(map(average, test_files * 5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### With Multiprocessing"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"434 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"with Pool() as pool:\n",
" averages = list(pool.map(average, test_files * 5))"
]
},
{
@ -250,9 +252,9 @@
"## Result\n",
"\n",
"As we can see when processing a single element, multiprocessing comes with an overhead.\n",
"When the list to be processed is sufficiently large, we get a reduction in processing time of roughly 30%-50%, depending on list size.\n",
"When the list to be processed is sufficiently large, we get a reduction in processing time: even with its higher standard deviation, the multiprocessing version is faster than the single-process version (434 ms versus 654 ms for the fivefold file list).\n",
"\n",
"Averaging the masked array is a fairly simple operation that scales in $O(N)$ with the size of the input array.\n",
"Averaging the masked array is an operation that can be implemented to scale in $O(N)$ with the size of the input array.\n",
"The time reduction should be even higher for more complex tasks."
]
},
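The result statement above can be checked against the `%%timeit` outputs recorded in this diff. The following is a minimal sketch that only does the arithmetic; the timings are copied from the new outputs shown above, while the names `timings_ms` and `relative_change` are hypothetical and not part of the notebook.

```python
# Mean timings in milliseconds, copied from the new %%timeit outputs above:
# (single process, multiprocessing)
timings_ms = {
    'single file': (4.35, 187),
    'all files': (131, 248),
    'all files x5': (654, 434),
}


def relative_change(single_ms, multi_ms):
    """Return the change of the multiprocessing run relative to the single-process run.

    Negative values mean multiprocessing was faster.
    """
    return (multi_ms - single_ms) / single_ms


for name, (single_ms, multi_ms) in timings_ms.items():
    change = relative_change(single_ms, multi_ms)
    verdict = 'faster' if change < 0 else 'slower'
    print(f'{name}: multiprocessing is {abs(change):.0%} {verdict}')
```

With these figures, only the fivefold file list benefits from the worker pool, by roughly a third; for a single file and for the 30 files in the folder, the pool start-up overhead outweighs the gain.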

@ -6,7 +6,10 @@
"source": [
"# Spectral Index Pipeline\n",
"\n",
"Before running this notebook the files have to be downloaded. See [01a Download Process.ipynb](01a Download Process.ipynb) for further information.\n",
"Elaborating on the work done in previous sections, this section contains a complete implementation of the calculation of various spectral indicators.\n",
"\n",
"It does not contain code to download products from the Open Access Hub[^download_process].\n",
"Instead, it is a reusable notebook intended solely for the calculation of indices.\n",
"\n",
"The calculation in this notebook depends on three parameters:\n",
"\n",
@ -18,51 +21,21 @@
" - ndwi -- normalized difference in water\n",
"- `fill_value`, the value which is used to represent invalid pixels to handle division by zero.\n",
"\n",
"Change the values below and select _Kernel → Restart and Run All Cells_ to re-evaluate all cells in this notebook.\n",
"The path of the output file will be printed below the last cell."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[PosixPath('resources/forest_fires/S2A_MSIL2A_20180807T101021_N0208_R022_T33UUT_20180809T112302.zip'),\n",
" PosixPath('resources/forest_fires/S2A_MSIL2A_20180919T102021_N0208_R065_T33UUT_20180919T132226.zip'),\n",
" PosixPath('resources/forest_fires/S2A_MSIL2A_20190603T101031_N0212_R022_T33UUT_20190603T114652.zip'),\n",
" PosixPath('resources/forest_fires/S2A_MSIL2A_20190613T101031_N0212_R022_T33UUT_20190614T125329.zip'),\n",
" PosixPath('resources/forest_fires/S2A_MSIL2A_20190626T102031_N0212_R065_T33UUT_20190626T125319.zip'),\n",
" PosixPath('resources/forest_fires/S2A_MSIL2A_20190629T103031_N0212_R108_T32UPE_20190629T135351.zip'),\n",
" PosixPath('resources/forest_fires/S2A_MSIL2A_20190726T102031_N0213_R065_T32UPE_20190726T125507.zip'),\n",
" PosixPath('resources/forest_fires/S2B_MSIL2A_20180822T101019_N0208_R022_T33UUT_20180822T161243.zip'),\n",
" PosixPath('resources/forest_fires/S2B_MSIL2A_20190701T102029_N0212_R065_T32UPE_20190701T134657.zip')]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#downloads = ['input/forest_fires/S2A_MSIL2A_20190726T102031_N0213_R065_T32UPE_20190726T125507.zip',\n",
"# 'input/forest_fires/S2B_MSIL2A_20190701T102029_N0212_R065_T32UPE_20190701T134657.zip',\n",
"# 'input/forest_fires/S2A_MSIL2A_20190629T103031_N0212_R108_T32UPE_20190629T135351.zip']\n",
"When running this notebook interactively, _Kernel → Restart and Run All Cells_ can be used to re-evaluate all cells in this notebook after configuring the pipeline.\n",
"The path of the output file containing the processed values will be printed below the last cell.\n",
"\n",
"from pathlib import Path\n",
"sources = list(sorted(Path('resources/forest_fires').glob('*.zip')))\n",
"sources"
"[^download_process]: See [](01 Download Process.ipynb) for details"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"product_path = Path(sources[0])\n",
"from pathlib import Path\n",
"\n",
"product_path = Path('resources/forest_fires/S2A_MSIL2A_20180807T101021_N0208_R022_T33UUT_20180809T112302.zip')\n",
"index_to_calculate = 'nbr'"
]
},
@ -77,7 +50,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@ -95,8 +68,8 @@
"source": [
"## Define Formulas\n",
"\n",
"We define the formulas as data so we can substitute the bands with actual values later on and execute the operations when needed.\n",
"We use a lisp-like language with prefix notation for this.\n",
"Formulas are defined as data so that the bands can be substituted with actual values later on. Expressing the calculations declaratively allows them to be executed lazily, only when needed.\n",
"This way the formulas can be defined independently of the actual data, which is needed only much later.\n",
"\n",
"### Operators\n",
"\n",
@ -106,12 +79,12 @@
"\\text{NDVI} = \\frac{\\text{B08} - \\text{B04}}{\\text{B08} + \\text{B04}}\n",
"$$\n",
"\n",
"To define the index calculation formulas in this way, first the the basic arithmetic operations `+`, `-`, `*` and `/` are wrapped in functions taking variadic arguments:"
"To define the index calculation formulas in this way, first the basic arithmetic operations `+`, `-`, `*` and `/` are wrapped in functions taking variadic arguments:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
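The cell body is not part of this hunk. As a rough sketch of what wrapping the arithmetic operations in variadic functions could look like, assuming left-to-right folding with `functools.reduce`: the name `add` appears in the notebook text, while `sub`, `mul` and `div` are assumed names chosen for illustration.

```python
from functools import reduce
import operator


def add(*args):
    """Fold all arguments with +, from left to right."""
    return reduce(operator.add, args)


def sub(*args):
    """Fold all arguments with -, from left to right."""
    return reduce(operator.sub, args)


def mul(*args):
    """Fold all arguments with *, from left to right."""
    return reduce(operator.mul, args)


def div(*args):
    """Fold all arguments with /, from left to right."""
    return reduce(operator.truediv, args)


print(add(1, 2, 3))   # 1 + 2 + 3
print(sub(10, 2, 3))  # (10 - 2) - 3
print(div(8, 2, 2))   # (8 / 2) / 2
```

Because the wrappers only rely on the operators themselves, they would work the same way on `numpy` arrays as on plain numbers.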
@ -141,21 +114,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"These function are used to define formulas for a selection of indices.\n",
"These definitions are not exhaustive - there are many spectral indices which are not implemented in this notebook - however the general shape of theses formulas allows for enough flexibility to implement other indices as well.\n",
"These functions are used to define formulas for the selection of indices mentioned in the introduction.\n",
"This selection is not exhaustive: there are many spectral indices which are not implemented in this notebook. The general shape of these formulas, however, allows for enough flexibility to implement other indices as well.\n",
"\n",
"The formulas are defined in a lisp-like prefix notation: `(add, 1, 2, 3)` translates to `1 + 2 + 3`.\n",
"Each element in a formula can be either a function, a string or a tuple.\n",
"Each element in a formula can be either a function, a string or a tuple. Tuples are delimited using `()`. The first element of these tuples is one of the operations defined above. It is followed by at least one other element, which can be any of the following:\n",
"\n",
"This is done so the formulas can be defined independent from actual data.\n",
"The data is passed in much later.\n",
"\n",
"Instead they are defined using tuples, which are delimited using `()`. These tuples have as their first element one of the operations defined above and continue with a flexible amount of other tuples (allowing the recursive expression of formulas), strings, which encode band numbers, or constants:"
"- Tuples, allowing the recursive expression of formulas.\n",
"- Strings, which encode band numbers.\n",
"- Constants, i.e. integers or floats."
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
@ -176,12 +148,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"An error is thrown if `index_to_calculate` did not define an implemented index:"
"An error is thrown if `index_to_calculate` does not match any of the indices implemented above:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
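The cells that define the formulas and perform this check are not shown in the diff. A minimal sketch of formulas stored as data plus a lookup that fails loudly might look as follows. The NDVI formula is the one given in the notebook text; the dictionary `formulas`, the helper `select_formula` and the choice of `ValueError` are assumptions made for illustration.

```python
from functools import reduce
import operator


# Variadic stand-ins for the operator wrappers described earlier (assumed names).
def add(*args):
    return reduce(operator.add, args)


def sub(*args):
    return reduce(operator.sub, args)


def div(*args):
    return reduce(operator.truediv, args)


# Formulas in prefix notation: (operation, operand, operand, ...).
# Strings encode band numbers, nested tuples are sub-formulas.
formulas = {
    'ndvi': (div, (sub, 'B08', 'B04'), (add, 'B08', 'B04')),
}


def select_formula(index_to_calculate):
    """Return the formula for an index, or raise if it is not implemented."""
    try:
        return formulas[index_to_calculate]
    except KeyError:
        known = ', '.join(sorted(formulas))
        raise ValueError(
            f'Unknown index {index_to_calculate!r}; implemented indices: {known}'
        ) from None


formula = select_formula('ndvi')
```

Failing at this point keeps a misspelled index name from surfacing only after the band files have been read.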
@ -201,7 +173,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
@ -221,7 +193,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The resolving process needs a `band_map` in the form of `band_num` → `numpy.array`:"
"The resolving process needs a `band_map` in the form of `band_num` → `numpy.array`. Because the arithmetic operations are defined as functions above, each one can be treated like any other Python function: it is read from the formula and called using `op(args)`:"
]
},
{
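The resolving cell itself lies outside this hunk. The following sketch shows one way such a recursive resolution could work, under the assumption that tuples are sub-formulas, strings are keys into `band_map`, and everything else is a constant. The function name `resolve` is hypothetical, and plain floats stand in for the `numpy` arrays to keep the example free of dependencies.

```python
from functools import reduce
import operator


# Variadic stand-ins for the operator wrappers described earlier (assumed names).
def add(*args):
    return reduce(operator.add, args)


def sub(*args):
    return reduce(operator.sub, args)


def div(*args):
    return reduce(operator.truediv, args)


def resolve(formula, band_map):
    """Recursively evaluate a prefix-notation formula against band values."""
    if isinstance(formula, tuple):
        # First element is the operation, the rest are its operands.
        op, *args = formula
        return op(*(resolve(arg, band_map) for arg in args))
    if isinstance(formula, str):
        # Strings encode band numbers and are looked up in the band map.
        return band_map[formula]
    # Anything else is treated as a numeric constant.
    return formula


ndvi = (div, (sub, 'B08', 'B04'), (add, 'B08', 'B04'))
band_map = {'B08': 0.75, 'B04': 0.25}  # floats instead of arrays, for illustration
print(resolve(ndvi, band_map))
```

Since the operands are unpacked with `*`, the same function handles operations with two or more operands without special cases.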
@ -250,7 +222,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the prefix-notation is not commonly used to define mathematical formula, a function is given that converts prefix a formula from above to infix notation. This should help to avoid errors when transcribing the index formulas, which are usually given in common infix notation:"
"Because prefix notation is not commonly used to define mathematical formulas, a function is defined that converts a prefix formula from above to infix notation. This should help to avoid errors when transcribing the index formulas, which are usually given in common infix notation:"
]
},
{
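The conversion function is not included in this hunk. As an illustration of the idea, the sketch below walks a prefix formula and joins the operands with the symbol of the operation. The names `to_infix` and `symbols` are hypothetical, and the fully parenthesized output format is an assumption.

```python
from functools import reduce
import operator


# Variadic stand-ins for the operator wrappers described earlier (assumed names).
def add(*args):
    return reduce(operator.add, args)


def sub(*args):
    return reduce(operator.sub, args)


def mul(*args):
    return reduce(operator.mul, args)


def div(*args):
    return reduce(operator.truediv, args)


# Maps each operation to the symbol used in infix notation.
symbols = {add: '+', sub: '-', mul: '*', div: '/'}


def to_infix(formula):
    """Render a prefix-notation formula as a fully parenthesized infix string."""
    if isinstance(formula, tuple):
        op, *args = formula
        joined = f' {symbols[op]} '.join(to_infix(arg) for arg in args)
        return f'({joined})'
    # Band names and constants are printed as they are.
    return str(formula)


ndvi = (div, (sub, 'B08', 'B04'), (add, 'B08', 'B04'))
print(to_infix(ndvi))  # ((B08 - B04) / (B08 + B04))
```

Parenthesizing every sub-formula avoids having to reason about operator precedence when comparing the output with a published formula.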
@ -294,7 +266,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Test Case"
"#### Test Cases"
]
},
{
@ -372,7 +344,8 @@
"source": [
"## Extraction of Relevant Band File Paths\n",
"\n",
"The index calculation starts with the list of bands are referenced by the index formula given by `index_to_calculate`:"
"The subsections from here on contain the actual calculations.\n",
"They start with the list of bands that are referenced by the index formula given by `index_to_calculate`:"
]
},
{
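The cell that collects the bands is not shown here. One possible way to gather the referenced bands from a prefix formula is a recursive walk that keeps every string it encounters. In this sketch the function name `referenced_bands` is hypothetical, and the functions from the standard `operator` module stand in for the notebook's own operator wrappers.

```python
import operator


def referenced_bands(formula):
    """Collect the band names (strings) that occur anywhere in a formula."""
    if isinstance(formula, tuple):
        # Skip the operation in first position and descend into the operands.
        _, *args = formula
        return set().union(*(referenced_bands(arg) for arg in args))
    if isinstance(formula, str):
        return {formula}
    # Numeric constants reference no band.
    return set()


# NDVI as given in the notebook text, with stand-in operations.
ndvi = (operator.truediv,
        (operator.sub, 'B08', 'B04'),
        (operator.add, 'B08', 'B04'))

print(sorted(referenced_bands(ndvi)))  # ['B04', 'B08']
```

Returning a set means a band that appears several times in a formula only has to be located and read once.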
@ -480,58 +453,6 @@
"These are encoded as metadata in the raster file:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PosixPath('zip+file:/home/jovyan/sources/resources/forest_fires/S2A_MSIL2A_20180807T101021_N0208_R022_T33UUT_20180809T112302.zip!/S2A_MSIL2A_20180807T101021_N0208_R022_T33UUT_20180809T112302.SAFE/GRANULE/L2A_T33UUT_A016321_20180807T101024/IMG_DATA/R10m/T33UUT_20180807T101021_B08_10m.jp2')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sentinel_helpers import scihub_cloud_mask\n",
"import rasterio as r\n",
"import matplotlib.pyplot as plt\n",
"\n",
"highres_raster_path = [band_path for band_path in highest_resolution_band_paths if resolution(band_path) == target_resolution][0]\n",
"highres_raster_path"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((10980, 10980),\n",
" Affine(10.0, 0.0, 300000.0,\n",
" 0.0, -10.0, 5800020.0))"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with r.open(highres_raster_path) as src:\n",
" target_transform = src.transform\n",
" target_shape = src.shape\n",
"\n",
"# shape is height and width of a 2d-array, target_transform is an affine matrix\n",
"target_shape, target_transform"
]
},
{
"cell_type": "code",
"execution_count": 19,
@ -562,11 +483,7 @@
],
"source": [
"# pixels with clouds are True, pixels without are False\n",
"raster_cloud_mask = scihub_cloud_mask(product_path,\n",
" rasterize=True,\n",
" target_shape=target_shape,\n",
" # the affine matrix is used to calculate array indices from world coordinates\n",
" target_transform=target_transform)\n",
"raster_cloud_mask = scihub_cloud_mask(product_path)\n",
"\n",
"plt.imshow(raster_cloud_mask)"
]