Series.rolling
)pd.read_[...]
functions)import pandas as pd
pd.Series([11,13,17,19,23])
or, if you want a special index
series = pd.Series([11,13,17,19,23], index=['a', 'b', 'c', 'd', 'e'])
print(series)
to get the content back you can use
series.index
series.values
but the power of pandas lies in all the other attributes
#series. [TAB]
The primary pandas data structure.
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes. (index: row labels, columns: column labels) Can be thought of as a dict-like container for Series objects.
The easiest way to create a DataFrame is to read it from an input file (see later)
In addition there are many ways to create DataFrames manually. Most straight forward probably is to use a dict of iterables. (Series, Lists, Arrays). Pandas tries to choose sensible indexes.
pd.DataFrame({"primes": series, "fibo": [1,1,2,3,5]})
We now want to use pandas to work with data from the World Bank. My goal is to create a plot showing the burden refugees put on different countries. For this we will plot the fraction of refugee in a give countries population versus that countries GDP.
I downloaded and extracted the following data-sets from the Worldbank website manually:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
loading a data file with pandas is trivial
refugees = pd.read_csv("data/refugee-population.csv", skiprows=4)
refugees.head()
As you can see pandas choose the right column labels and numbered the rows continously.
We can easily change the row labels (the index) to one of the columns.
refugees.set_index(["Country Name"], inplace=True)
refugees.head()
Now it's easy to select rows or columns
refugees.loc[["Switzerland","Germany"]]
refugees[["1990","2000"]].head()
refugees.get(["1990","2000"]).head()
With this we now choose the data for one country, remove all missing values and then create a plot:
che = refugees.loc["Switzerland"][["{0}".format(year) for year in range(1990,2018)]]
che.dropna().plot()
plt.show()
Usually it is easier to work with real datetime objects instead of strings. So we convert the index to datetime
che.index.values
che.index = pd.to_datetime(che.index, format="%Y")
print(che.index)
As mentioned in the introduction, pandas
offers a very usefull rolling
method
che.plot()
che.rolling(center=False,window=5).mean().plot()
plt.show()
We now want to create a scatter plot with refugees divided by gdp vs. gdp-per-captita. For each data set we will use the mean of the last 7 years.
Some of the rows and columns in the World-Bank Files are of no interest for this. We can remove these easily.
The World-Bank provides meta-data for each country, where we can identify rows with non-countries (e.g. regional aggregates)
!head data/metadata-countries_population.csv
We load this file and extract the two relevant columns
meta = pd.read_csv("data/metadata-countries_population.csv")
meta.columns
meta = meta[['Country Code', 'Region']]
meta.head()
meta.set_index("Country Code", inplace=True)
From this we create a list of non-countries
non_countries = meta.loc[meta.Region.isnull()].index
print(non_countries)
and finally exclude the relevant rows
refugees = refugees[~refugees["Country Code"].isin(non_countries)]
The data contains a few rows with unneeded text
refugees.columns
In addition, the 2017 column is empty
np.any(refugees["2017"].notnull())
so we can create a list of all interesting columns
useful_cols = ['Country Code']
for year in range(2010,2017):
useful_cols.append("{0}".format(year))
useful_cols
with this, we:
refugees = refugees[useful_cols]
refugees = refugees.set_index("Country Code")
refugee_means = refugees.mean(axis=1)
Of course we could execute these commands again manually for the two remaining data-files. However, the proper way to solve this is to create a function for this. Especially since all files have the exact same structure.
def load_file(file):
"""Load and process a Worldbank File"""
data = pd.read_csv(file, skiprows=4)
data.set_index("Country Code", inplace=True)
data.drop(non_countries, inplace=True)
data = data[[str(year) for year in range(2010,2017)]]
return data.mean(axis=1), data
gdp_means, gdp = load_file("data/gdp-per-capita.csv")
gdp_means.head()
gdp.head()
population_means, population = load_file("data/population.csv")
We now combine our three Series with means into one DataFrame and create our plot.
data = pd.DataFrame({"gdp": gdp_means, "refugees": refugee_means/population_means}).dropna()
(Here we loose some countries with missing data.)
data.plot.scatter("gdp", "refugees")
plt.show()
We can quickly find out who the three top countries are:
data.where(data["refugees"]>0.1).dropna()
To improve readability:
ax = data[data.refugees > 1e-10].plot.scatter(y="refugees", x="gdp", loglog=True)
ax = data.loc[["CHE"]].plot.scatter(y="refugees", x="gdp", ax=ax, color="r", label="Switzerland")
plt.title("refugees fraction vs. gdp")
plt.show()
again we can print the info for one country
data.loc["CHE"]
Based on th meta data provided by the World Bank, we can highlight a region
europe = meta.loc[meta.Region == "Europe & Central Asia"].index
europe[:10]
ax = data[data.refugees > 1e-10].plot.scatter(y="refugees", x="gdp", loglog=True)
ax = data.loc[data.index.intersection(europe)].plot.scatter(y="refugees", x="gdp", ax=ax, color="r", label="Europe & Central Asia")
plt.title("refugees fraction vs. gdp")
plt.show()
(As we lost some countries with missing data when we called dropna
above, we need the data.index.intersection
-call to select only country codes really contained in our data.)
We now look at a tiny subset of this data and look at ways to fit a function to it.
Scipy preparse a huge number of options, we will look at three options of increasing complexity and flexibility.
first we select our subset
europe_small = ['AUT',
'DEU',
'FRA',
'ITA',
#'CHE',
#'ESP',
#'PRT',
#'BEL',
#'LUX',
#'NLD',
#'NOR',
#'DEN',
#'SWE',
#'FIN',
]
data_eu = data.loc[europe_small].dropna()
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
plt.title("refugees fraction vs. gdp")
plt.show()
and we create a vector with all the x values we will need to plot our fit result
x = np.linspace(data_eu["gdp"].min(), data_eu["gdp"].max(), 100)
Polyfit is probably the easiest way to fit a polynome to given data.
from scipy import polyfit, polyval
res = polyfit(data_eu["gdp"], data_eu["refugees"],1)
print(res)
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
ax.plot(x, polyval(res, x))
plt.title("refugees fraction vs. gdp")
plt.show()
With curve_fit
you can define a complex fit function.
from scipy.optimize import curve_fit
def fit_function(x,b,c):
return b*x+c
res = curve_fit(fit_function, data_eu["gdp"], data_eu["refugees"])
print(res)
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
ax.plot(x, fit_function(x, *(res[0])))
plt.title("refugees fraction vs. gdp")
plt.show()
Finally, least-squares allows you to even specify the cost function. With this you can factor in uncertainties or weights for your data points.
from scipy.optimize import leastsq
def fit_function(x, p):
return x*p[0]+p[1]
def error_function(params):
return data_eu["refugees"] - fit_function(data_eu["gdp"], params)
res = leastsq(error_function, [0,0])
print(res)
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
ax.plot(x, fit_function(x, res[0]))
plt.title("refugees fraction vs. gdp")
plt.show()