{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Replicate a plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your predecessor created the graph shown below for a publication. Since then things have changed and you need to reproduce the plot with new data. However, the code that produced it was lost (it only ever existed in an interactive `ipython` session, and the computer on which it was run has long been replaced). In addition, the person who created the plot is no longer in academia and cannot be reached.\n", "\n", "From looking at the plot you notice multiple features:\n", "\n", "\n", "- There are two subplots (`plt.subplot` or `plt.subplots`?). The lower one is smaller (`gridspec`) and the two subplots share an $x$-axis.\n", "- The top subplot shows both the PDF and a normalized histogram of $N = 1000$ randomly generated values of (presumably) a normal distribution (`scipy.stats.norm`). It has a meaningful title.\n", "- The top plot also contains the corresponding CDF on a separate $y$-axis (`ax.twinx`). It is a different color than the other plots.\n", "- The second plot contains the residual between the PDF and the histogram, using `plt.step` in order to match the binning of the histogram.\n", "- The overall plot style is not the default one. Hopefully a preset style was used. The top plot contains a grid matching the right $y$-axis and the bottom plot has a grid matching both the $x$- and $y$-axis.\n", "- In addition, the number of events is added to the plot (`plt.text`) as well as a `plt.legend` in the upper left corner, which even has a `title`. It only contains the labels for the histogram and the PDF.\n", "- The axes are all properly labeled. The $x$-axis even has a fancy \\LaTeX label.\n", "\n", "Try to replicate the plot as closely as possible." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](plot_replication.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-06-26T14:39:39.955120Z", "start_time": "2020-06-26T14:39:38.031868Z" } }, "outputs": [], "source": [ "from scipy import stats\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "np.random.seed(13)\n", "\n", "gauss = stats.norm(loc=0, scale=1)\n", "x = np.linspace(-5, 5)\n", "pdf_values = gauss.pdf(x)\n", "generated = gauss.rvs(1000)\n", "hist, bins = np.histogram(generated, bins=len(x), density=True)\n", "# evaluate the PDF at the bin centers so the residual compares matching x positions\n", "bin_centers = 0.5 * (bins[:-1] + bins[1:])\n", "diff = hist - gauss.pdf(bin_centers)\n", "cdf_values = gauss.cdf(x)\n", "\n", "with plt.style.context(\"ggplot\"):\n", "    fig, (ax1, ax2) = plt.subplots(2, sharex=True, figsize=(10, 10),\n", "                                   gridspec_kw={'height_ratios': [2, 1]})\n", "    ax1.set_title(\"A very important measurement\")\n", "    ax1.grid(False)\n", "    ax1.plot(x, pdf_values, label=\"PDF\")\n", "    ax1.set_ylim(0)\n", "    ax1.set_ylabel(\"Values\")\n", "    ax1.hist(generated, bins=len(x), density=True, label=\"Histogram\")\n", "    ax1.legend(loc=2, title=\"Legend\")\n", "    ax1.text(3, 0.2, \"$N = 1000$\")\n", "\n", "    # second y-axis for the CDF, sharing the x-axis of the top plot\n", "    ax3 = ax1.twinx()\n", "    ax3.set_ylabel(\"CDF\")\n", "    ax3.plot(x, cdf_values, \"g\", label=\"CDF\")\n", "    ax3.set_ylim(0)\n", "\n", "    ax2.step(bins[:-1], diff)\n", "    ax2.set_xlim(-5, 5)\n", "    ax2.set_ylabel(\"Residual\")\n", "    ax2.set_xlabel(r\"$x_i$\")\n", "\n", "    plt.savefig(\"plot_replication.png\")\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Geospatial data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `geopandas` to create an interesting visualization of data on a map.\n", "\n", "You can find a shapefile for the boroughs of London at the link https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london. 
Download the file `statistical-gis-boundaries-london.zip`.\n", "\n", "For a lot of statistical data per borough, visit https://data.london.gov.uk/dataset/london-borough-profiles and download the `csv` file.\n", "\n", "Think about how to best join the two dataframes.\n", "\n", "Use the `plot` method of the `geopandas.GeoDataFrame` to plot some statistics per borough. An example of what this can look like is shown below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](geopandas_population.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-06-26T14:39:40.165435Z", "start_time": "2020-06-26T14:39:39.957665Z" } }, "outputs": [], "source": [ "import geopandas\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-06-26T14:39:40.223738Z", "start_time": "2020-06-26T14:39:40.168275Z" } }, "outputs": [], "source": [ "df = geopandas.read_file(\"London_Borough_Excluding_MHW.shp\").set_index(\"GSS_CODE\")\n", "df = df.join(pd.read_csv(\"london-borough-profiles.csv\", header=0, encoding='iso-8859-1').set_index(\"Code\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-06-26T14:39:40.230368Z", "start_time": "2020-06-26T14:39:40.225586Z" } }, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-06-26T14:39:40.831597Z", "start_time": "2020-06-26T14:39:40.232867Z" } }, "outputs": [], "source": [ "fig, ax = plt.subplots(1, figsize=(10, 6))\n", "df[\"GLA_Population_Estimate_2017\"] /= 1000\n", "df.plot(column=\"GLA_Population_Estimate_2017\", cmap=\"Blues\", linewidth=0.8, ax=ax, edgecolor='0.8')\n", "ax.axis('off')\n", "ax.set_title(\"London population estimate 2017\", fontdict={\"fontsize\": 25, \"fontweight\": 3})\n", "vmin, vmax = df[\"GLA_Population_Estimate_2017\"].min(), 
df[\"GLA_Population_Estimate_2017\"].max()\n", "sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=vmin, vmax=vmax))\n", "sm.set_array([])\n", "cbar = fig.colorbar(sm, ax=ax, label=\"Thousand\")\n", "plt.savefig(\"geopandas_population.png\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# More tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Web scraping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `requests` library and `bs4.BeautifulSoup` to parse our homepage for the material of each lecture.\n", "Use the developer tools of your browser to figure out the names and attributes of elements.\n", "You can apply CSS selectors with `soup.select`, or directly operate on the tags with `soup.find`/`soup.find_all`.\n", "\n", "Write a `download` function that automatically downloads the material to a specified directory. For this, use `response.content` instead of `response.text`, and `open(file_name, \"wb\")` in order to directly write the binary content to a file.\n", "Make sure to create the directory if it does not exist and to handle file names that would be illegal (for example file names containing `/`)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-06-26T14:39:41.216421Z", "start_time": "2020-06-26T14:39:40.833711Z" } }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "from pathlib import Path\n", "\n", "def get_soup(session, url):\n", "    \"\"\"Convenience function to use a session to get the soup of a webpage.\"\"\"\n", "    response = session.get(url)\n", "    response.raise_for_status()\n", "    return BeautifulSoup(response.text, \"lxml\")\n", "\n", "def download(session, url, file_name, directory):\n", "    \"\"\"Download `url` and write its binary content to `directory / file_name`.\"\"\"\n", "    response = session.get(url)\n", "    response.raise_for_status()\n", "    directory = Path(directory)\n", "    # create the target directory (and any missing parents) if needed\n", "    directory.mkdir(parents=True, exist_ok=True)\n", "    # replace path separators, which are illegal in file names\n", "    file_name = file_name.replace(\"/\", \"_\")\n", "    with open(directory / file_name, \"wb\") as f:\n", "        f.write(response.content)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-06-26T14:39:41.309900Z", "start_time": "2020-06-26T14:39:41.218767Z" } }, "outputs": [], "source": [ "session = requests.Session()\n", "base_url = \"https://www.physik.uzh.ch/~python/python\"\n", "soup = get_soup(session, f\"{base_url}/programme.php\")\n", "links = [a['href'] for a in soup.select(\"a.internal\")]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-06-26T14:39:41.517925Z", "start_time": "2020-06-26T14:39:41.315529Z" } }, "outputs": [], "source": [ "for link in links:\n", "    soup = get_soup(session, f\"{base_url}/{link}\")\n", "    print(soup.title.text.split(\" - \")[-1])\n", "    for file_name in [a['href'] for a in soup.select(\"a.download\")]:\n", "        url = f\"{base_url}/{link}/{file_name}\"\n", "        print(url)\n", "        # uncomment this to actually download the files\n", "        # download(session, url, file_name, \".\")\n", "    print()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": 
"python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "384px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }