Series.rolling
)pd.read_[...]
functions)import pandas as pd
pd.Series([11,13,17,19,23])
or, if you want a special index
series = pd.Series([11,13,17,19,23], index=['a', 'b', 'c', 'd', 'e'])
print(series)
to get the content back you can use
series.index
series.values
but the power of pandas lies in all the other attributes
#series. [TAB]
The primary pandas data structure.
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes. (index: row labels, columns: column labels) Can be thought of as a dict-like container for Series objects.
The easiest way to create a DataFrame is to read it from an input file (see later)
In addition there are many ways to create DataFrames manually. Most straight forward probably is to use a dict of iterables. (Series, Lists, Arrays). Pandas tries to choose sensible indexes.
frame = pd.DataFrame({"primes": series, "fibo": [1,1,2,3,5], "0-4": range(5)})
print(frame)
We now want to use pandas to work with data from the World Bank. My goal is to create a plot showing the burden refugees put on different countries. For this we will plot the fraction of refugee in a give countries population versus that countries GDP.
I downloaded and extracted the following data-sets from the Worldbank website manually:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
loading a data file with pandas is trivial
refugees = pd.read_csv("data/refugee-population.csv", skiprows=4)
refugees.head()
As you can see pandas choose the right column labels and numbered the rows continously.
We can easily change the row labels (the index) to one of the columns.
refugees.set_index(["Country Code"], inplace=True)
refugees.head()
Now it's easy to select rows or columns
refugees.loc[["CHE","DEU"]]
refugees[["1990","2000"]].head()
refugees.get(["1990","2000"]).head()
With this we now choose the data for one country, remove all missing values and then create a plot:
che = refugees.loc["CHE"][[str(year) for year in range(1990,2020)]]
che.dropna().plot()
plt.show()
Usually it is easier to work with real datetime objects instead of strings. So we convert the index to datetime
che.index.values
che.index = pd.to_datetime(che.index, format="%Y")
print(che.index)
As mentioned in the introduction, pandas
offers a very usefull rolling
method
che.plot()
che.rolling(center=False,window=5).mean().plot()
plt.show()
We now want to create a scatter plot with refugees divided by gdp vs. gdp-per-captita. For each data set we will use the mean of the last 7 years.
Some of the rows and columns in the World-Bank Files are of no interest for this. We can remove these easily.
The World-Bank provides meta-data for each country, where we can identify rows with non-countries (e.g. regional aggregates)
!head data/metadata-countries_population.csv
We load this file and extract the two relevant columns
meta = pd.read_csv("data/metadata-countries_population.csv")
meta.columns
meta = meta[['Country Code', 'Region']]
meta.head()
meta.set_index("Country Code", inplace=True)
From this we create a list of non-countries
non_countries = meta.loc[meta.Region.isnull()].index
print(non_countries)
and finally exclude the relevant rows
refugees = refugees.drop(non_countries)
The data contains a few rows with unneeded text
refugees.columns
In addition, the 2019 column is empty
np.any(refugees["2019"].notnull())
so we can create a list of all interesting columns
useful_cols = []
for year in range(2010,2019):
useful_cols.append(str(year))
useful_cols
with this, we:
refugees = refugees[useful_cols]
refugee_means = refugees.mean(axis=1)
Of course we could execute these commands again manually for the two remaining data-files. However, the proper way to solve this is to create a function for this. Especially since all files have the exact same structure.
def load_file(file):
"""Load and process a Worldbank File"""
data = pd.read_csv(file, skiprows=4)
data.set_index("Country Code", inplace=True)
data.drop(non_countries, inplace=True)
data = data[[str(year) for year in range(2010,2017)]]
return data.mean(axis=1), data
gdp_means, gdp = load_file("data/gdp-per-capita.csv")
gdp_means.head()
gdp.head()
population_means, population = load_file("data/population.csv")
We now combine our three Series with means into one DataFrame and create our plot.
data = pd.DataFrame({"gdp": gdp_means, "refugees": refugee_means/population_means}).dropna()
(Here we loose some countries with missing data.)
data.plot.scatter("gdp", "refugees")
plt.show()
We can quickly find out who the three top countries are:
data.where(data["refugees"]>0.1).dropna()
To improve readability:
ax = data[data["refugees"] > 1e-10].plot.scatter(y="refugees", x="gdp", loglog=True)
ax = data.loc[["CHE"]].plot.scatter(y="refugees", x="gdp", ax=ax, color="r", label="Switzerland")
plt.title("refugees fraction vs. gdp")
plt.show()
again we can print the info for one country
data.loc["CHE"]
Based on th meta data provided by the World Bank, we can highlight a region
europe = meta.loc[meta.Region == "Europe & Central Asia"].index
europe[:10]
ax = data[data["refugees"] > 1e-10].plot.scatter(y="refugees", x="gdp", loglog=True)
ax = data.loc[data.index.intersection(europe)].plot.scatter(y="refugees", x="gdp", ax=ax, color="r", label="Europe & Central Asia")
plt.title("refugees fraction vs. gdp")
plt.show()
(As we lost some countries with missing data when we called dropna
above, we need the data.index.intersection
-call to select only country codes really contained in our data.)
We now look at a tiny subset of this data and look at ways to fit a function to it.
Scipy preparse a huge number of options, we will look at three options of increasing complexity and flexibility.
first we select our subset
europe_small = ['AUT',
'DEU',
'FRA',
'ITA',
]
data_eu = data.loc[europe_small].dropna()
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
plt.title("refugees fraction vs. gdp")
plt.show()
and we create a vector with all the x values we will need to plot our fit result
x = np.linspace(data_eu["gdp"].min(), data_eu["gdp"].max(), 100)
Polyfit is probably the easiest way to fit a polynome to given data.
from scipy import polyfit, polyval
res = polyfit(data_eu["gdp"], data_eu["refugees"],1)
print(res)
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
ax.plot(x, polyval(res, x))
plt.title("refugees fraction vs. gdp")
plt.show()
With curve_fit
you can define a complex fit function.
from scipy.optimize import curve_fit
def fit_function(x,b,c):
return b*x+c
res = curve_fit(fit_function, data_eu["gdp"], data_eu["refugees"])
print(res)
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
ax.plot(x, fit_function(x, *(res[0])))
plt.title("refugees fraction vs. gdp")
plt.show()
Finally, least-squares allows you to even specify the cost function. With this you can factor in uncertainties or weights for your data points.
from scipy.optimize import leastsq
def fit_function(x, p):
return x*p[0]+p[1]
def error_function(params):
return data_eu["refugees"] - fit_function(data_eu["gdp"], params)
res = leastsq(error_function, [0,0])
print(res)
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
ax.plot(x, fit_function(x, res[0]))
plt.title("refugees fraction vs. gdp")
plt.show()
import statsmodels.formula.api as smf
res = smf.ols("refugees ~ gdp", data=data_eu).fit()
print(res.summary())
print(res.params)
ax = data_eu.plot.scatter(y="refugees", x="gdp", color="r")
ax.plot(x, res.params[1]*x+res.params[0])
plt.title("refugees fraction vs. gdp")
plt.show()
Passing a single value to loc
returns a Series
frame.loc["a"]
Passing a list to loc
returns a DataFrame
(even if the list contains a single a single value)
frame.loc[["a"]]
frame.loc[["a","c"]]
Also slicing works (but includes the upper boundary)
frame.loc["b":"d"]
A list of boolean values with n-Rows entries, is considered a mask to select rows
frame.loc[[True,False,True,False,True]]
Instead of a list, a boolean-series can be used. Rows are matched on the index. (frame[["primes"]] > 20
would not work as this returns a frame instead of a series.)
frame.loc[frame["primes"] > 20]
When using a mask, .loc
is optional (but recommended to avoid confusion with columns).
frame[frame["primes"] > 20]
Using iloc
it is possible to access rows by position as well. (without using the index)
frame.iloc[2:-1]
The frame is subscripted directly. Again, passing a singel value returns a series.
frame["primes"]
While a list returns a DataFrame
frame[["primes"]]
frame[["primes","0-4"]]
Instead of subscripting, the get
-method can be used.
frame.get(["primes","0-4"])
For single columns, an attribute with the same name exists
frame.primes
But this fails, if the column-name is not a valid attribute-name
# Raises SyntaxError
#frame.0-4
For even more options have a look at the pandas-website: https://pandas.pydata.org/pandas-docs/stable/indexing.html