Re-name benchmark notebook

2026-05-06 19:13:40 +02:00 · 2021-03-08 14:54:12 +00:00 · 2021-03-08 14:54:12 +00:00 · d9c55e3554
commit d9c55e3554
parent 4d1ac2296f
2 changed files with 3 additions and 3 deletions
--- a/Benchmarks.ipynb
+++ b/Benchmarks.ipynb
@ -0,0 +1,290 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Multi-Processing Performance Benchmarks\n",
+    "\n",
+    "This section contains a performance comparison of single- and multiprocess-based calculation methods used in [](02b Timeseries.ipynb).\n",
+    "\n",
+    "The `%%timeit` cell magic runs the cell content multiple times and prints statistics on those multiple runs. This reduces factors such as garbage collection pauses which influence the runtime performance and can be used to verify performance assumptions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from multiprocessing import Pool, cpu_count\n",
+    "from numpy import ma\n",
+    "from pathlib import Path\n",
+    "import rasterio as r"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of files: 30\n"
+     ]
+    }
+   ],
+   "source": [
+    "test_files = list(Path('resources/tempelhofer_feld/ndvi').glob('*.tif'))\n",
+    "print(f'Number of files: {len(test_files)}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The performance is tested with the following function call:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def average(file_path):\n",
+    "    with r.open(file_path) as src:\n",
+    "        data = src.read(1, masked=True)\n",
+    "        return file_path, ma.average(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The function receives a string or `pathlib.Path`, reads the data using `rasterio`, and calculates the average value inside the data using the `ma.average` function provided by `numpy`.\n",
+    "\n",
+    "The benchmark is performed on 4 CPUs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "4"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "cpu_count()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Single File\n",
+    "### In a Single Process"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "4.35 ms ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "average(test_files[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### With Multiprocessing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "187 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "with Pool() as pool:\n",
+    "    averages = [avg for avg in pool.map(average, test_files[:1])]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## All Files in Folder\n",
+    "### In a Single Process"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "131 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "averages = list(map(average, test_files))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### With Multiprocessing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "248 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "with Pool() as pool:\n",
+    "    averages = list(pool.map(average, test_files))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## All Files Multiple Times\n",
+    "### In a Single Process"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "654 ms ± 952 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "averages = list(map(average, test_files * 5))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### With Multiprocessing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "434 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "with Pool() as pool:\n",
+    "    averages = list(pool.map(average, test_files * 5))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Result\n",
+    "\n",
+    "As we can see when processing a single element, multiprocessing comes with an overhead.\n",
+    "When the list to be processed is sufficiently large, we get a slight reduction in processing time, that, even with a higher standard deviation, manages to be faster than the single-process version.\n",
+    "\n",
+    "Averaging the masked array is an operation that can be implemented to scale in $O(N)$ with the size of the input array.\n",
+    "The time reduction should be even higher for more complex tasks."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}