Context
I am developing my data science skills in areas outside of my previous work. An interesting problem for me was to identify which factors influence life expectancy on a national level. There is an existing Kaggle data set that explored this, but that information was corrupted. Part of the problem solving process is to step back periodically and ask "does this make sense?" Without reasonable data, it is harder to notice mistakes in my analysis code (as opposed to unusual behavior due to the data itself). I wanted to make a similar data set, but with reliable information.
This is my first time exploring life expectancy, so I had to guess which features might be of interest when making the data set. Some were included for comparison with the other Kaggle data set. A number of potentially interesting features (like air pollution) were left off due to limited year or country coverage. Since the data was collected from more than one server, some features are present more than once, to explore the differences.
Content
A goal of the World Health Organization (WHO) is to ensure that a billion more people are protected from health emergencies, and provided better health and well-being. They provide public data collected from many sources to identify and monitor factors that are important to reach this goal. This set was primarily made using GHO (Global Health Observatory) and UNESCO (United Nations Educational Scientific and Culture Organization) information. The set covers the years 2000-2016 for 183 countries, in a single CSV file. Missing data is left in place, for the user to decide how to deal with it.
Three notebooks are provided for my cursory analysis, a comparison with the other Kaggle set, and a template for creating this data set.
Inspiration
There is a lot to explore, if the user is interested. The GHO server alone has over 2000 "indicators".
How are the GHO and UNESCO life expectancies calculated, and what is causing the difference? That could also be asked for Gross National Income (GNI) and mortality features.
How does the life expectancy after age 60 compare to the life expectancy at birth? Is the relationship with the features in this data set different for those two targets?
What other indicators on the servers might be interesting to use? Some of the GHO indicators are different studies with different coverage. Can they be combined to make a more useful and robust data feature?
Unraveling the correlations between the features would take significant work.