Change chapter file order and fix references

2026-05-06 19:13:40 +02:00 · 2021-03-18 12:46:37 +01:00 · 2021-03-18 12:46:37 +01:00 · 414e7745a1
commit 414e7745a1
parent 86096cabc6
12 changed files with 21 additions and 21 deletions
--- a/sources/02c-time-series-multiprocessing-benchmarks.ipynb
+++ b/sources/02c-time-series-multiprocessing-benchmarks.ipynb
@ -0,0 +1,290 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Multi-Processing Performance Benchmarks\n",
+    "\n",
+    "This section contains a performance comparison of single- and multiprocess-based calculation methods used in [](02c-time-series.ipynb).\n",
+    "\n",
+    "The `%%timeit` cell magic runs the cell content multiple times and prints statistics on those multiple runs. This reduces factors such as garbage collection pauses which influence the runtime performance and can be used to verify performance assumptions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from multiprocessing import Pool, cpu_count\n",
+    "from numpy import ma\n",
+    "from pathlib import Path\n",
+    "import rasterio as r"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of files: 30\n"
+     ]
+    }
+   ],
+   "source": [
+    "test_files = list(Path('resources/tempelhofer_feld/ndvi').glob('*.tif'))\n",
+    "print(f'Number of files: {len(test_files)}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The performance is tested with the following function call:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def average(file_path):\n",
+    "    with r.open(file_path) as src:\n",
+    "        data = src.read(1, masked=True)\n",
+    "        return file_path, ma.average(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The function receives a string or `pathlib.Path`, reads the data using `rasterio`, and calculates the average value inside the data using the `ma.average` function provided by `numpy`.\n",
+    "\n",
+    "The benchmark is performed on 4 CPUs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "4"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "cpu_count()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Single File\n",
+    "### In a Single Process"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "4.35 ms ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "average(test_files[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### With Multiprocessing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "187 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "with Pool() as pool:\n",
+    "    averages = [avg for avg in pool.map(average, test_files[:1])]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## All Files in Folder\n",
+    "### In a Single Process"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "131 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "averages = list(map(average, test_files))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### With Multiprocessing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "248 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "with Pool() as pool:\n",
+    "    averages = list(pool.map(average, test_files))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## All Files Multiple Times\n",
+    "### In a Single Process"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "654 ms ± 952 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "averages = list(map(average, test_files * 5))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### With Multiprocessing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "434 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "with Pool() as pool:\n",
+    "    averages = list(pool.map(average, test_files * 5))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Result\n",
+    "\n",
+    "Multiprocessing comes with an overhead, which is significant enough to rule it out when processing a single element.\n",
+    "When the list of elements is sufficiently large, there is a slight reduction in processing time, that, even with a higher standard deviation, manages to be faster than the single-process version.\n",
+    "\n",
+    "Averaging the masked array is an operation that can be implemented to scale in $O(N)$ with the size of the input array.\n",
+    "The time reduction should be even higher for more complex tasks."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}