{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas\n",
    "  * https://pandas.pydata.org\n",
    "  * very high-level data containers with corresponding functionality\n",
    "  * many useful tools to work with time-series (look at `Series.rolling`)\n",
    "  * many SQL-like data operations (group, join, merge)\n",
    "  * Interface to a large variety of file formats (see `pd.read_[...]` functions)\n",
    "  * additional package with data-interface/API to many data repositories\n",
    "    (https://pandas-datareader.readthedocs.io/en/latest/remote_data.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Data Structures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Series\n",
    "One-dimensional ndarray with axis labels (called index).\n",
    "\n",
    "Series can be created like an array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.Series([11,13,17,19,23])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or, if you want a special index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "series = pd.Series([11,13,17,19,23], index=['a', 'b', 'c', 'd', 'e'])\n",
    "print(series)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "to get the content back you can use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "series.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "series.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "but the power of pandas lies in all the other attributes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#series. [TAB]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DataFrame\n",
    "The primary pandas data structure. \n",
    "\n",
    "Two-dimensional size-mutable, potentially heterogeneous tabular data\n",
    "structure with labeled axes. (index: row labels, columns: column labels)\n",
    "Can be thought of as a dict-like container for Series objects. \n",
    "\n",
    "The easiest way to create a DataFrame is to read it from an input file (see later)\n",
    "\n",
    "In addition there are many ways to create DataFrames manually. Most straight forward probably is to use a dict of iterables. (Series, Lists, Arrays). Pandas tries to choose sensible indexes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame({\"primes\": series, \"fibo\": [1,1,2,3,5]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Refugee Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now want to use pandas to work with data from the World Bank. \n",
    "My goal is to create a plot showing the burden refugees put on different countries.\n",
    "For this we will plot the fraction of refugee in a give countries population versus that countries GDP.\n",
    "\n",
    "I downloaded and extracted the following data-sets from the Worldbank website manually:\n",
    "  * Refugee population by country or territory of asylum: https://data.worldbank.org/indicator/SM.POP.REFG\n",
    "  * Population, total: https://data.worldbank.org/indicator/SP.POP.TOTL\n",
    "  * GDP per capita (current US$): https://data.worldbank.org/indicator/NY.GDP.PCAP.CD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading and Accessing Data\n",
    "\n",
    "loading a data file with pandas is trivial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees = pd.read_csv(\"data/refugee-population.csv\", skiprows=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see pandas choose the right column labels and numbered the rows continously.\n",
    "\n",
    "We can easily change the row labels (the index) to one of the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees.set_index([\"Country Name\"], inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it's easy to select rows or columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees.loc[[\"Switzerland\",\"Germany\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees[[\"1990\",\"2000\"]].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees.get([\"1990\",\"2000\"]).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with a Single Country\n",
    "\n",
    "With this we now choose the data for one country, remove all missing values and then create a plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "che = refugees.loc[\"Switzerland\"][[\"{0}\".format(year) for year in range(1990,2018)]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "che.dropna().plot()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Usually it is easier to work with real datetime objects instead of strings. So we convert the index to datetime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "che.index.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "che.index = pd.to_datetime(che.index, format=\"%Y\")\n",
    "print(che.index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned in the introduction, `pandas` offers a very usefull `rolling` method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "che.plot()\n",
    "che.rolling(center=False,window=5).mean().plot()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Removing Unwanted Data\n",
    "\n",
    "We now want to create a scatter plot with refugees divided by gdp vs. gdp-per-captita. For each data set we will use the mean of the last 7 years.\n",
    "\n",
    "Some of the rows and columns in the World-Bank Files are of no interest for this. We can remove these easily."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Excluding Non-Countries\n",
    "The World-Bank provides meta-data for each country, where we can identify rows with non-countries (e.g. regional aggregates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head data/metadata-countries_population.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load this file and extract the two relevant columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta = pd.read_csv(\"data/metadata-countries_population.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta = meta[['Country Code', 'Region']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta.set_index(\"Country Code\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this we create a list of non-countries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "non_countries = meta.loc[meta.Region.isnull()].index\n",
    "print(non_countries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and finally exclude the relevant rows "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees = refugees[~refugees[\"Country Code\"].isin(non_countries)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Excluding Columns\n",
    "\n",
    "The data contains a few rows with unneeded text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, the 2017 column is empty"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.any(refugees[\"2017\"].notnull())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "so we can create a list of all interesting columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "useful_cols = ['Country Code']\n",
    "for year in range(2010,2017):\n",
    "    useful_cols.append(\"{0}\".format(year))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "useful_cols"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "with this, we:\n",
    "  \n",
    "  * select the reduced datase\n",
    "  * switch the index to Country Code\n",
    "  * calculate the mean for each country"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees = refugees[useful_cols]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugees = refugees.set_index(\"Country Code\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "refugee_means = refugees.mean(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Additional Files\n",
    "\n",
    "Of course we could execute these commands again manually for the two remaining data-files.\n",
    "However, the proper way to solve this is to create a function for this. Especially since all files have the exact same structure. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_file(file):\n",
    "    \"\"\"Load and process a Worldbank File\"\"\"\n",
    "    data = pd.read_csv(file, skiprows=4)\n",
    "    data.set_index(\"Country Code\", inplace=True)\n",
    "    data.drop(non_countries, inplace=True)\n",
    "    data = data[[str(year) for year in range(2010,2017)]]\n",
    "    return data.mean(axis=1), data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdp_means, gdp = load_file(\"data/gdp-per-capita.csv\") "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdp_means.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdp.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "population_means, population = load_file(\"data/population.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the Plot\n",
    "We now combine our three Series with means into one DataFrame and create our plot. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.DataFrame({\"gdp\": gdp_means, \"refugees\": refugee_means/population_means}).dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Here we loose some countries with missing data.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.plot.scatter(\"gdp\", \"refugees\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can quickly find out who the three top countries are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.where(data[\"refugees\"]>0.1).dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To improve readability:\n",
    "\n",
    "  * we switch to a log-log axis (we need to exclude countries with too small refugee numbers)\n",
    "  * we highlight one selected country\n",
    "  * We add a title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = data[data.refugees > 1e-10].plot.scatter(y=\"refugees\", x=\"gdp\", loglog=True)\n",
    "ax = data.loc[[\"CHE\"]].plot.scatter(y=\"refugees\", x=\"gdp\", ax=ax, color=\"r\", label=\"Switzerland\")\n",
    "plt.title(\"refugees fraction vs. gdp\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "again we can print the info for one country"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.loc[\"CHE\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Highlighting a Full Region\n",
    "\n",
    "Based on th meta data provided by the World Bank, we can highlight a region"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "europe = meta.loc[meta.Region == \"Europe & Central Asia\"].index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "europe[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = data[data.refugees > 1e-10].plot.scatter(y=\"refugees\", x=\"gdp\", loglog=True)\n",
    "ax = data.loc[data.index.intersection(europe)].plot.scatter(y=\"refugees\", x=\"gdp\", ax=ax, color=\"r\", label=\"Europe & Central Asia\")\n",
    "plt.title(\"refugees fraction vs. gdp\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(As we lost some countries with missing data when we called `dropna` above, we need the `data.index.intersection`-call to select only country codes really contained in our data.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fitting\n",
    "\n",
    "We now look at a tiny subset of this data and look at ways to fit a function to it. \n",
    "\n",
    "Scipy preparse a huge number of options, we will look at three options of increasing complexity and flexibility."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preparations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "first we select our subset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "europe_small = ['AUT',\n",
    " 'DEU',\n",
    " 'FRA',\n",
    " 'ITA',\n",
    " #'CHE',\n",
    " #'ESP',\n",
    " #'PRT',\n",
    " #'BEL',\n",
    " #'LUX',\n",
    " #'NLD',\n",
    " #'NOR',\n",
    " #'DEN',\n",
    " #'SWE',\n",
    " #'FIN',\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_eu = data.loc[europe_small].dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = data_eu.plot.scatter(y=\"refugees\", x=\"gdp\", color=\"r\")\n",
    "plt.title(\"refugees fraction vs. gdp\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and we create a vector with all the x values we will need to plot our fit result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(data_eu[\"gdp\"].min(), data_eu[\"gdp\"].max(), 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### polyfit\n",
    "\n",
    "Polyfit is probably the easiest way to fit a polynome to given data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy import polyfit, polyval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "res = polyfit(data_eu[\"gdp\"], data_eu[\"refugees\"],1)\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = data_eu.plot.scatter(y=\"refugees\", x=\"gdp\", color=\"r\")\n",
    "ax.plot(x, polyval(res, x))\n",
    "plt.title(\"refugees fraction vs. gdp\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### curve_fit\n",
    "\n",
    "With `curve_fit` you can define a complex fit function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.optimize import curve_fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_function(x,b,c):\n",
    "    return b*x+c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "res = curve_fit(fit_function, data_eu[\"gdp\"], data_eu[\"refugees\"])\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = data_eu.plot.scatter(y=\"refugees\", x=\"gdp\", color=\"r\")\n",
    "ax.plot(x, fit_function(x, *(res[0])))\n",
    "plt.title(\"refugees fraction vs. gdp\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### leastsq\n",
    "\n",
    "Finally, least-squares allows you to even specify the cost function. With this you can factor in uncertainties or weights for your data points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.optimize import leastsq"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_function(x, p):\n",
    "    return x*p[0]+p[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def error_function(params):\n",
    "    return data_eu[\"refugees\"] - fit_function(data_eu[\"gdp\"], params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "res = leastsq(error_function, [0,0])\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = data_eu.plot.scatter(y=\"refugees\", x=\"gdp\", color=\"r\")\n",
    "ax.plot(x, fit_function(x, res[0]))\n",
    "plt.title(\"refugees fraction vs. gdp\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}