{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas\n", " * https://pandas.pydata.org\n", " * very high-level data containers with corresponding functionality\n", " * many useful tools to work with time-series (look at `Series.rolling`)\n", " * many SQL-like data operations (group, join, merge)\n", " * Interface to a large variety of file formats (see `pd.read_[...]` functions)\n", " * additional package with data-interface/API to many data repositories\n", " (https://pandas-datareader.readthedocs.io/en/latest/remote_data.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Data Structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series\n", "One-dimensional ndarray with axis labels (called index).\n", "\n", "Series can be created like an array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.Series([11,13,17,19,23])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or, if you want a special index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "series = pd.Series([11,13,17,19,23], index=['a', 'b', 'c', 'd', 'e'])\n", "print(series)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to get the content back you can use" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "series.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "series.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but the power of pandas lies in all the other attributes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#series. [TAB]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrame\n", "The primary pandas data structure. \n", "\n", "Two-dimensional size-mutable, potentially heterogeneous tabular data\n", "structure with labeled axes. (index: row labels, columns: column labels)\n", "Can be thought of as a dict-like container for Series objects. \n", "\n", "The easiest way to create a DataFrame is to read it from an input file (see later)\n", "\n", "In addition there are many ways to create DataFrames manually. Most straight forward probably is to use a dict of iterables. (Series, Lists, Arrays). Pandas tries to choose sensible indexes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({\"primes\": series, \"fibo\": [1,1,2,3,5]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Refugee Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now want to use pandas to work with data from the World Bank. \n", "My goal is to create a plot showing the burden refugees put on different countries.\n", "For this we will plot the fraction of refugee in a give countries population versus that countries GDP.\n", "\n", "I downloaded and extracted the following data-sets from the Worldbank website manually:\n", " * Refugee population by country or territory of asylum: https://data.worldbank.org/indicator/SM.POP.REFG\n", " * Population, total: https://data.worldbank.org/indicator/SP.POP.TOTL\n", " * GDP per capita (current US$): https://data.worldbank.org/indicator/NY.GDP.PCAP.CD" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading and Accessing Data\n", "\n", "loading a data file with pandas is trivial" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees = pd.read_csv(\"data/refugee-population.csv\", skiprows=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see pandas choose the right column labels and numbered the rows continously.\n", "\n", "We can easily change the row labels (the index) to one of the columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees.set_index([\"Country Name\"], inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's easy to select rows or columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees.loc[[\"Switzerland\",\"Germany\"]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees[[\"1990\",\"2000\"]].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees.get([\"1990\",\"2000\"]).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with a Single Country\n", "\n", "With this we now choose the data for one country, remove all missing values and then create a plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "che = refugees.loc[\"Switzerland\"][[\"{0}\".format(year) for year in range(1990,2018)]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "che.dropna().plot()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usually it is easier to work with real datetime objects instead of strings. So we convert the index to datetime" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "che.index.values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "che.index = pd.to_datetime(che.index, format=\"%Y\")\n", "print(che.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned in the introduction, `pandas` offers a very usefull `rolling` method" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "che.plot()\n", "che.rolling(center=False,window=5).mean().plot()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Removing Unwanted Data\n", "\n", "We now want to create a scatter plot with refugees divided by gdp vs. gdp-per-captita. For each data set we will use the mean of the last 7 years.\n", "\n", "Some of the rows and columns in the World-Bank Files are of no interest for this. We can remove these easily." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excluding Non-Countries\n", "The World-Bank provides meta-data for each country, where we can identify rows with non-countries (e.g. regional aggregates)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head data/metadata-countries_population.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load this file and extract the two relevant columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "meta = pd.read_csv(\"data/metadata-countries_population.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "meta.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "meta = meta[['Country Code', 'Region']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "meta.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "meta.set_index(\"Country Code\", inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this we create a list of non-countries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "non_countries = meta.loc[meta.Region.isnull()].index\n", "print(non_countries)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and finally exclude the relevant rows " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees = refugees[~refugees[\"Country Code\"].isin(non_countries)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excluding Columns\n", "\n", "The data contains a few rows with unneeded text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, the 2017 column is empty" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.any(refugees[\"2017\"].notnull())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "so we can create a list of all interesting columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "useful_cols = ['Country Code']\n", "for year in range(2010,2017):\n", " useful_cols.append(\"{0}\".format(year))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "useful_cols" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "with this, we:\n", " \n", " * select the reduced datase\n", " * switch the index to Country Code\n", " * calculate the mean for each country" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees = refugees[useful_cols]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugees = refugees.set_index(\"Country Code\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refugee_means = refugees.mean(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Additional Files\n", "\n", "Of course we could execute these commands again manually for the two remaining data-files.\n", "However, the proper way to solve this is to create a function for this. Especially since all files have the exact same structure. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_file(file):\n", " \"\"\"Load and process a Worldbank File\"\"\"\n", " data = pd.read_csv(file, skiprows=4)\n", " data.set_index(\"Country Code\", inplace=True)\n", " data.drop(non_countries, inplace=True)\n", " data = data[[str(year) for year in range(2010,2017)]]\n", " return data.mean(axis=1), data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdp_means, gdp = load_file(\"data/gdp-per-capita.csv\") " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdp_means.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdp.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population_means, population = load_file(\"data/population.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the Plot\n", "We now combine our three Series with means into one DataFrame and create our plot. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.DataFrame({\"gdp\": gdp_means, \"refugees\": refugee_means/population_means}).dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Here we loose some countries with missing data.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.plot.scatter(\"gdp\", \"refugees\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can quickly find out who the three top countries are:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.where(data[\"refugees\"]>0.1).dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To improve readability:\n", "\n", " * we switch to a log-log axis (we need to exclude countries with too small refugee numbers)\n", " * we highlight one selected country\n", " * We add a title" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = data[data.refugees > 1e-10].plot.scatter(y=\"refugees\", x=\"gdp\", loglog=True)\n", "ax = data.loc[[\"CHE\"]].plot.scatter(y=\"refugees\", x=\"gdp\", ax=ax, color=\"r\", label=\"Switzerland\")\n", "plt.title(\"refugees fraction vs. gdp\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "again we can print the info for one country" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.loc[\"CHE\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Highlighting a Full Region\n", "\n", "Based on th meta data provided by the World Bank, we can highlight a region" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "europe = meta.loc[meta.Region == \"Europe & Central Asia\"].index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "europe[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = data[data.refugees > 1e-10].plot.scatter(y=\"refugees\", x=\"gdp\", loglog=True)\n", "ax = data.loc[data.index.intersection(europe)].plot.scatter(y=\"refugees\", x=\"gdp\", ax=ax, color=\"r\", label=\"Europe & Central Asia\")\n", "plt.title(\"refugees fraction vs. gdp\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(As we lost some countries with missing data when we called `dropna` above, we need the `data.index.intersection`-call to select only country codes really contained in our data.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting\n", "\n", "We now look at a tiny subset of this data and look at ways to fit a function to it. \n", "\n", "Scipy preparse a huge number of options, we will look at three options of increasing complexity and flexibility." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "first we select our subset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "europe_small = ['AUT',\n", " 'DEU',\n", " 'FRA',\n", " 'ITA',\n", " #'CHE',\n", " #'ESP',\n", " #'PRT',\n", " #'BEL',\n", " #'LUX',\n", " #'NLD',\n", " #'NOR',\n", " #'DEN',\n", " #'SWE',\n", " #'FIN',\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_eu = data.loc[europe_small].dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = data_eu.plot.scatter(y=\"refugees\", x=\"gdp\", color=\"r\")\n", "plt.title(\"refugees fraction vs. gdp\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and we create a vector with all the x values we will need to plot our fit result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(data_eu[\"gdp\"].min(), data_eu[\"gdp\"].max(), 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### polyfit\n", "\n", "Polyfit is probably the easiest way to fit a polynome to given data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import polyfit, polyval" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = polyfit(data_eu[\"gdp\"], data_eu[\"refugees\"],1)\n", "print(res)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = data_eu.plot.scatter(y=\"refugees\", x=\"gdp\", color=\"r\")\n", "ax.plot(x, polyval(res, x))\n", "plt.title(\"refugees fraction vs. gdp\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### curve_fit\n", "\n", "With `curve_fit` you can define a complex fit function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.optimize import curve_fit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fit_function(x,b,c):\n", " return b*x+c" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = curve_fit(fit_function, data_eu[\"gdp\"], data_eu[\"refugees\"])\n", "print(res)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = data_eu.plot.scatter(y=\"refugees\", x=\"gdp\", color=\"r\")\n", "ax.plot(x, fit_function(x, *(res[0])))\n", "plt.title(\"refugees fraction vs. gdp\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### leastsq\n", "\n", "Finally, least-squares allows you to even specify the cost function. With this you can factor in uncertainties or weights for your data points." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.optimize import leastsq" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fit_function(x, p):\n", " return x*p[0]+p[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def error_function(params):\n", " return data_eu[\"refugees\"] - fit_function(data_eu[\"gdp\"], params)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = leastsq(error_function, [0,0])\n", "print(res)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = data_eu.plot.scatter(y=\"refugees\", x=\"gdp\", color=\"r\")\n", "ax.plot(x, fit_function(x, res[0]))\n", "plt.title(\"refugees fraction vs. gdp\")\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6rc1" } }, "nbformat": 4, "nbformat_minor": 2 }