Tanzania is the largest country in East Africa within the African Great Lakes region, with a population of 59 million people. Like many poor nations around the world, Tanzania suffers from a serious lack of access to clean water. According to water.org, 4 million people in Tanzania lack access to an improved source of safe water, and 30 million lack access to improved sanitation. According to the Tanzania National Website, water-borne illnesses such as malaria and cholera “account for over half of the diseases affecting the population,” because people don’t have access to sanitary options (Shore, n.d.).
Using data provided by The Tanzanian Water Ministry and Taarifa, DrivenData began a competition to solve this problem by improving clean water sources. This project involved using information given about water-points in Tanzania to predict whether or not a given water source was working correctly.
There are three datasets: a training set, a test set, and a training-labels set containing the status of the wells.
- The given data included a target with three classes — ‘functional’, ‘non-functional’, and ‘functional needs repair’.
- The train and test sets contain 59,400 water-point records with 40 features.
The idea was to build a model that could predict which of these three classes a given water point falls into. Since there are three classes, this is a multi-class (ternary) classification problem.
functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
We can see that non-functional (38%) and functional-needs-repair (7%) combined account for nearly as many water points as functional (54%). So not only is it very difficult for Tanzanians to find access to clean and sanitary water, but the sources they do have are estimated to be only 54% functional. It seems that although the investment and technology are made available to these communities, sustainability is overlooked, if not neglected.
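The class shares above can be reproduced with pandas’ value_counts. Here is a minimal sketch on a hypothetical stand-in for the labels frame (the real labels come from the DrivenData files; the toy proportions below only mirror the split discussed):

```python
import pandas as pd

# Hypothetical stand-in for the training-labels frame; proportions are
# illustrative only, not the real counts.
labels = pd.Series(
    ["functional"] * 54
    + ["non functional"] * 38
    + ["functional needs repair"] * 8,
    name="status_group",
)

# Normalized value counts give each class's share of the target.
dist = labels.value_counts(normalize=True)
print(dist)
```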
Data Cleaning Methodology
1. Merge the train and test sets so that we can clean both at the same time
2. Clean up missing, null, and zero values by replacing them with the mean or median, or classifying them as ‘other’
3. Remove unnecessary columns such as ‘date_recorded’, ‘recorded_by’, ‘num_private’
4. Fix misspellings and variations as much as we can. For instance:
array(['germany republi', 'germany', 'a/co germany', 'aco/germany',
'germany misionary', 'germany cristians',
'bingo foundation germany', 'africa project ev germany',
'germany missionary'], dtype=object)
5. For features with too many unique values, keep the 20 most common and categorize the rest as ‘other’. For example, let’s look at ‘funder_group’
After grouping, we can explore the interaction between each target class and each of ‘funder_group’’s unique values using a crosstab, also called a contingency table.
Let’s visualize this table:
By running this function, we brought ‘funder_group’ from 1,897 unique values down to just the 20 most common (plus ‘other’).
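A minimal sketch of such a grouping step, assuming a pandas Series; the function name, toy column values, and synthetic target below are illustrative, not the project’s actual code:

```python
import pandas as pd

def group_top_n(series: pd.Series, n: int = 20) -> pd.Series:
    """Keep the n most common values and lump the rest into 'other'."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), "other")

# Toy high-cardinality funder column (values are illustrative)
funder = pd.Series(
    ["germany"] * 5 + ["danida"] * 3 + ["hesawa"] * 2 + ["x", "y", "z"]
)
funder_group = group_top_n(funder, n=3)

# Contingency table against a synthetic target, as with the crosstab above
target = pd.Series((["functional", "non functional"] * 7)[:13])
print(pd.crosstab(funder_group, target))
```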
We repeat this process of keeping the top 20 most common values for other features such as ‘installer’, ‘wpt_name’, ‘subvillage’, ‘lga’, ‘ward’, and ‘scheme_name’. We later dropped ‘wpt_name’, ‘subvillage’, ‘ward’, and ‘scheme_name’, since they don’t contribute useful information to the classification task. We kept ‘installer’ and ‘lga’, since reputable installers and authorities are more likely to keep a water point functioning. More research into these organizations could tell us more about their quality of work; due to time constraints, we will not do that in this project.
All location features, such as ‘longitude’, ‘latitude’, ‘region’, ‘region_code’, and ‘district_code’, are kept, since we know from previous projects that location is one of the most important pieces of information.
Redundant features such as ‘extraction_type_group’, ‘extraction_type_class’, ‘management’, ‘payment_type’, ‘quality_group’, ‘quantity_group’, ‘source_type’, and ‘waterpoint_type_group’ are dropped.
Summary of Data Cleaning
- date_recorded — replaced with ‘year_recorded’, fill missing values with mean — dummies encoding
- funder — select for 20 most common — dummies encoding
- installer — select for 20 most common — dummies encoding
- wpt_name — select for 20 most common — dummies encoding
- basin — dummies encoding
- subvillage — select for 20 most common — dummies encoding
- region — dummies encoding
- region_code — dummies encoding
- district_code — dummies encoding
- lga — select for 20 most common — dummies encoding
- ward — select for 20 most common — dummies encoding
- population — fill missing values with median
- public_meeting — dummies encoding
- scheme_management — dummies encoding
- scheme_name — select for 20 most common — dummies encoding
- permit — dummies encoding
- construction_year — compute ‘age’ — dummies encoding
- extraction_type — dummies encoding
- management_group — dummies encoding
- payment — dummies encoding
- water_quality — dummies encoding
- quantity — dummies encoding
- source — dummies encoding
- source_class — dummies encoding
- waterpoint_type — dummies encoding
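Two of the transforms in this summary can be sketched as follows. The column names follow the dataset, but the tiny frame and the exact fill rules shown are assumptions for illustration:

```python
import pandas as pd

# Tiny illustrative frame; in this dataset construction_year == 0
# encodes "unknown".
df = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-02-04"],
    "construction_year": [1999, 0],
    "basin": ["Lake Victoria", "Pangani"],
})

# date_recorded -> year_recorded
df["year_recorded"] = pd.to_datetime(df["date_recorded"]).dt.year

# construction_year -> age: treat 0 as missing, fill with the median year
year = df["construction_year"].replace(0, float("nan"))
year = year.fillna(year.median())
df["age"] = df["year_recorded"] - year

# Dummy-encode a categorical column
df = pd.get_dummies(df, columns=["basin"])
print(df[["year_recorded", "age"]])
```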
We can see that water sources funded by Germany and private companies are well-maintained and functional, while water sources funded by Finland are due for major repairs. However, private companies and international aid contribute only a minority of water points. The majority are funded by the Government of Tanzania, which has a higher non-functional count than functional.
Finland, again, is not up to speed with their maintenance and repair. DMDD (Diocese of Mbulu Development Department), CES (Consulting Engineers Salzgitter), and rc church (Roman Catholic Church) are doing a great job. DWE (District Water Engineer), who is in charge of the functionality status of water points, is the main installer and is struggling to keep up, with almost 50% of its water points either non-functional or in need of repair. Meanwhile, the district councils, the Tanzanian government, RWE (a German electricity generation company), and LGA (Local Government Authority) have more non-functional water points than functional.
Many LGAs (Local Government Authorities) are seriously struggling; those especially in need of help are Kyela, Magu, and Mbozi. If we want to increase long-term sustainability, we need to strengthen management and regulation.
The majority of water points are managed by WWC (World Water Council). All management groups keep their functional water-point counts higher than non-functional, with the exception of SWC (State Water Contractor).
Private operators, although a minority contributor, are doing the best job among all managements.
User groups do a bigger share of the work managing water points than commercial and parastatal groups. Under NAWAPO (National Water Policy), user groups are to take full responsibility for operating, maintaining, and sustaining water points at the village level. However, disbursement of funds and reporting of functionality must follow a long bureaucratic process of accountability, requiring upward reporting at each level of government, from the village to the district and, finally, to the Ministry of Water (Lemmens et al., 2017). The problem is not only miscommunication but also the power struggle around roles, responsibilities, and accountability between the many levels of government.
On top of that, the data published by the ministry, based on the coverage reported by the districts, is not reliable, as recognized by DWEs (District Water Engineers) (Jiménez, 2011). The user group, which operates at the village level and is responsible for essentially everything regarding a water point, should be the one in direct communication for funding and functionality reporting.
As also seen here, public meetings do not help much. Miscommunication is a huge issue between the many levels of government.
If the water point’s management charges money, it is more likely to be well maintained and kept functional. Regular payment is a better approach, funding preventative treatment rather than trying to secure a large sum when the system breaks down. Regular payments can be put toward regular maintenance and upkeep.
However, a stable payment system requires restructuring authority so that there is a system of co-responsibility between the central, regional, and local levels, which has been seriously lacking in Tanzania, as we have seen so far. With ‘never pay’ being the most common category here, we can also see that villagers, the people who directly benefit from the water points, are left to their own devices after a water point is funded and installed, without further technical support for long-term sustainability.
To further emphasize the sustainability problem, we look at ‘age’. According to the plot, 30% of water points become non-functional within the first year, and only 54% are still working 15 years after installation.
At first look, it makes sense that the older the water point, the more likely it is to be non-functional or in need of repair. However, Jiménez (2011) showed that within just the first five years of operation, about 30% of water points become non-functional, and only between 35% and 47% are still working 15 years after installation.
Despite the Water Sector Development Program (WSDP) running hundreds of millions of dollars over budget and years past its original deadline, local governments and communities find themselves unable to raise the money to cover their water points’ operation and maintenance costs.
As suggested above, a local payment system should be put in place so that user groups can be independently responsible for their own water points. Direct funding from international donors to the village level should also be implemented, instead of going through the long bureaucratic process of accountability in which money gets lost along the way between ministry and district.
Static head measures the total vertical distance that a pump raises water.
Extraction types can be grouped into 3 categories: hand pumps, motorized systems, and gravity-fed systems. Jiménez (2011) found that:
- Hand pumps had the least favorable functionality over time, dropping from 61% in the first five years to 6% over the 25-year period.
- Motorized systems started at 77% and dropped to 13% in the same period.
- Gravity-fed systems worked better in the long run than any other category of water points, dropping from 66% to 20%.
We can observe this in the visualization above: gravity is the most commonly used method and has a higher functional-to-non-functional ratio.
Water points with ‘fluoride’ quality are highly functional, while ‘fluoride abandoned’ water points, meaning those with too much fluoride, are more likely to be abandoned and left non-functional. Water points with unidentified water quality are also uncared for and left broken.
Dry water points can’t function, for obvious reason.
Water points fed by lakes, which are just large accumulations of water in low places, are more likely to be non-functional, probably because they lack the large water flow of other natural sources such as rivers and springs (including shallow wells). Water points that are human-operated, like hand dtw, machine dbh, rainwater harvesting, and dam, are more likely to be functional.
Groundwater is located underground and must be pumped out of the ground after drilling a deep well. Surface water is found in lakes, rivers and streams.
Since in-home water connections are not feasible in a poor country like Tanzania, communal standpipes, i.e., public water dispensers, are where people come to get water. Since these are usually free, over time the inability to recover costs has resulted in a decline in functional units.
1. In order for the classification model to correctly predict the target, we need to change the labels from strings to integers.
Now we have:
- Nonfunctional = 0
- Functional = 1
- Functional needs repair = 2
2. Dummies-encode the categorical variables
3. The main two classes (functional, non-functional) are fairly balanced, but the third class (functional needs repair) makes up only 7% of the total target. The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn perform poorly on, the minority class, although it is often the minority class that matters most. To address this, we use the oversampling technique SMOTE (Synthetic Minority Over-sampling Technique), one of the most popular oversampling techniques. SMOTE creates synthetic minority samples based on nearest neighbors.
The value counts of ‘status_group’ before and after oversampling confirm that the minority class was brought up to parity (output omitted).
4. Scale: After splitting the data into training and test sets, we use MinMaxScaler() to fit and transform X_train and to transform X_test for the continuous variables. We fit on the training data only, because in a real-world setting this is the only data we have access to; we then use the same scaler object to transform the test data. If we were to first transform all the data and then split into training and test sets, it would lead to data leakage.
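The steps above can be sketched end-to-end on synthetic data. The label mapping and the train-only scaler fit follow the text; the SMOTE step is replaced here by a hand-rolled interpolation that only illustrates its core idea (the project itself uses a SMOTE implementation, e.g. from imbalanced-learn):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Step 1: map the target strings to integers
label_map = {"non functional": 0, "functional": 1, "functional needs repair": 2}

# Synthetic continuous features and balanced labels standing in for the real data
rng = np.random.default_rng(21)
X = rng.normal(size=(100, 3))
y = np.tile([0, 1, 2], 34)[:100]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# Step 4: fit the scaler on the training split only, then reuse it on the
# test split; fitting on all the data first would leak test information.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 3 (SMOTE's core idea): a synthetic minority sample lies on the
# segment between a real minority sample and one of its nearest neighbors.
minority = X_train_s[y_train == 2]
a, b = minority[0], minority[1]
synthetic = a + rng.random() * (b - a)
```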
Our first model, a Decision Tree, generated a feature-importance list, which we used to decide whether to keep all the features or trim them down further.
We dropped all ‘ward’, ‘subvillage’, and ‘wpt_name’ related features because, being just names, they don’t provide much information for our classification problem.
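The pruning decision can be sketched like this on synthetic data; the 0.01 threshold is an assumption for illustration, not the project’s rule:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=21)
tree = DecisionTreeClassifier(random_state=21).fit(X, y)

# Importances sum to 1; near-zero entries mark candidates for dropping.
importances = tree.feature_importances_
keep = np.flatnonzero(importances > 0.01)
print(importances.round(3))
```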
Two sets of nine models were built: one set not corrected for class imbalance, the other corrected by oversampling. The accuracy score is higher for the uncorrected set, and its training time is much faster than for the oversampled set.
Building baseline models is quick and simple. However, we want to tune the models’ parameters to find the best one. This tuning step is done with GridSearchCV, which takes a long time to run, so we settled on just a few selected parameters and a narrow range of options. With more time, we would try a wider variety.
- Decision Tree: the simplest tree-based method. It is very popular because of its robustness to noise, tolerance of missing information, handling of irrelevant and redundant predictive attributes, low computational cost, interpretability, fast run time, and robust predictors (Mithrakumar, 2019).
- Logistic Regression: logistic regression, by default, is limited to two-class problems. For this 3-class problem, we use the one-vs-rest extension to apply logistic regression to multi-class classification.
- K-nearest Neighbor: An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.
- Bagged Tree: improves on the Decision Tree model, which is a high-variance estimator, i.e., a small number of additional training observations can dramatically alter the prediction performance of a learned tree. Bagging is a general-purpose procedure for reducing variance.
- Random Forest: builds a forest of many decision trees on bootstrapped training sets.
- AdaBoost: the boosting idea involves growing trees sequentially, meaning that each tree is built using information from previously grown trees. For AdaBoost, predictions are made by majority vote, with instances classified according to which class receives the most votes from the weak learners.
- Gradient Boosting: is the AdaBoost method combined with weighted minimization, after which the classifiers and weighted inputs are recalculated.
- XGBoost: is a refined and customized Gradient Boosting Decision Tree system a.k.a “eXtreme Gradient Boosting”, optimized for speed and performance.
- Support Vector Machine: The idea is to find nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.
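The baseline sweep over these models can be sketched on synthetic data as follows. XGBoost is omitted here since it lives outside scikit-learn, and all hyperparameters are defaults; the real project’s data and settings differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Small synthetic 3-class problem standing in for the real features
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=21)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=21)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=21),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Bagged Tree": BaggingClassifier(random_state=21),
    "Random Forest": RandomForestClassifier(random_state=21),
    "AdaBoost": AdaBoostClassifier(random_state=21),
    "Gradient Boosting": GradientBoostingClassifier(random_state=21),
    "SVM": SVC(random_state=21),
}

# Fit each baseline and record its test accuracy
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```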
- Accuracy: out of all the classes, how many we predicted correctly. It should be as high as possible.
- Cross validation: the cross-validation score should be as close as possible to the model’s test score; otherwise, the model is overfit.
- Precision: what proportion of positive identifications was actually correct? i.e., TP / (TP + FP). It should be as high as possible.
- Recall: what proportion of actual positives was identified correctly? i.e., TP / (TP + FN). It should be as high as possible.
- F1: It is difficult to compare two models with low precision and high recall or vice versa. It is often convenient to combine precision and recall into a single metric called the F1 score. The F1 score is the harmonic mean of precision and recall. The F1 score favors classifiers that have similar precision and recall. Unfortunately, we can’t have it both ways: increasing precision reduces recall and vice versa. This is called the precision/recall tradeoff.
- MAE: Lower values of MAE indicate better fit.
- MSE: Lower values of MSE indicate better fit.
- RMSE: Lower values of RMSE indicate better fit.
- Bias & variance: under-fitting models have high bias error; the model makes overly simplistic assumptions. Over-fitting models have high variance error; the model learns too much from the training data and cannot generalize.
- Confusion matrix: The matrix compares the actual target values with those predicted by the machine learning model. The general idea is to count the number of times instances of class 1 are classified as a different class such as 0 or 2, etc.
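Most of these metrics come straight from scikit-learn. A toy example with the same label encoding (0 = non-functional, 1 = functional, 2 = functional-needs-repair); the labels below are made up for illustration:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy true/predicted labels for illustration
y_true = [0, 0, 1, 1, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 1, 1, 2, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)      # share of correct predictions
cm = confusion_matrix(y_true, y_pred)     # rows = actual, columns = predicted
print(acc)
print(cm)
print(classification_report(y_true, y_pred))
```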
Model Building Steps:
- Build a baseline model with random_state = 21 for reproducibility
- Use GridSearchCV to search for optimal parameter values and train an improved model with those values. Due to time constraints, only a few parameters and value options are chosen
- Fit the model with optimized parameters and hyperparameters
- Make prediction yhat
- Select whichever of the baseline or optimized model gives better accuracy and fit
- More evaluation metrics are done on the chosen model for further analysis
- Finally interpret the result
Summary of Models & Their Performance
Best Model: #14 with GradientBoostingClassifier
The best model is the imbalanced Gradient Boosting model, i.e., without class-imbalance correction (model #14), with an accuracy of 81.3%. For comparison, the #1 score on the DrivenData leaderboard is 82.9%.
Gradient boosting refers to a group of machine learning algorithms that combine many weak learners into a strong predictive model. The Gradient Boosting Classifier is, as the name suggests, the variant used for classification tasks. Its objective is to minimize the loss, i.e., the difference between the actual class of a training example and the predicted class. Its power lies in handling not only binary classification but also multi-class classification and even regression problems.
Using GridSearchCV, we identified optimal values for parameters.
A gradient boosting system has two other necessary parts: a weak learner and an additive component. The weak learner used here is the DecisionTreeClassifier (model #9). We trained a baseline GradientBoostingClassifier with the values identified above and those identified for the DecisionTreeClassifier model.
Decision Tree Classifier GridSearchCV:
- criterion: the function that measures the quality of a split. The chosen value here is ‘entropy’, a measure of disorder/impurity.
- max_depth: the maximum depth of the tree. The deeper the tree grows, the more complex the model becomes, which can cause overfitting. Conversely, a low depth can cause underfitting.
- max_features: the number of features to consider when looking for the best split.
- min_samples_leaf: the minimum number of samples required at a leaf node (an external node with no children); requiring more than one element per leaf helps control overfitting.
- min_samples_split: the minimum number of samples required to split an internal node (a node that can split further).
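A hedged sketch of the Decision Tree grid search over these parameters on synthetic data; the value options shown are assumptions, not the project’s exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real features
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=21)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, None],
    "max_features": [None, "sqrt"],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
}

# Exhaustive search with 3-fold cross-validation over the grid
search = GridSearchCV(DecisionTreeClassifier(random_state=21),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```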
Gradient Boosting Classifier GridSearchCV:
Train with both results from GridSearchCV:
Train accuracy: 99.32659932659934
Test accuracy: 81.3047138047138
Although the model’s accuracy is high, it is overfit: train and test accuracy are very different, so the model did not generalize well. The high variance error also indicates this overfitting.
More features, such as ‘scheme_name’ and ‘year_recorded’, were also dropped in an attempt to reduce overfitting, but this did not help.
Model: Gradient Boosting
              precision    recall  f1-score   support

           0       0.85      0.79      0.82      4614
           1       0.81      0.89      0.85      6434
           2       0.54      0.35      0.43       832

    accuracy                           0.81     11880
   macro avg       0.73      0.68      0.70     11880
weighted avg       0.81      0.81      0.81     11880
- Macro average is the unweighted mean of the per-class precision/recall/f1-score.
- Weighted average weights each class’s precision/recall/f1-score by its support.
Class functional (1) and class non-functional (0) have similar precision and recall score but class functional-needs-repair (2) has very low precision and recall.
The f1 score favors classifiers that have similar precision and recall and this model has the highest f1 score out of all models.
The model does very well at classifying functional (1) as functional (1) and non-functional (0) as non-functional (0). However, it does less well on functional-needs-repair (2), which it tends to classify as functional (1). Although functional-needs-repair is still functional, misclassifying it as functional (1) is more costly than misclassifying it as non-functional (0), because repair and maintenance will be overdue, causing more damage and eventually leading to non-functionality. The minority class, functional-needs-repair, should be the top priority: preventative repair and maintenance done on time and on schedule are more cost- and time-effective than letting the well go non-functional, which takes much more to fix, if it can be fixed at all.
We tried to fix the class imbalance with SMOTE(). However, as seen in the confusion matrix below for the same GradientBoostingClassifier model with class-imbalance correction, the situation improved only slightly. For reasons that need more research and further analysis, correcting the class imbalance makes the overall accuracy worse.
Train accuracy: 99.63472087770249
Test accuracy: 79.6969696969697
- Since correcting the class imbalance did not improve the model, we could try model stacking, i.e., build one binary classifier for functional vs. non-functional and another for functional vs. functional-needs-repair.
- Try tuning more parameters with a wider range of options
- Work to reduce overfit while maintaining and/or improving accuracy score
- Find out why correcting class imbalance affects accuracy negatively
Solutions to the Water Crisis in Tanzania
- Focus on sustainability: adopt an early preventative strategy rather than letting systems break down
- Decentralized management: If we want to increase sustainability long-term, we need to restructure authority so that there is a system of co-responsibility between the central, regional and local levels.
- Improved payment system:
- A local payment system should be put in place so that the user-group can be independently responsible for their own water points
- Direct funding from international donors to the village level should also be implemented, instead of the long bureaucratic process of accountability in which money gets lost along the way between ministry and district.
References
DrivenData. (n.d.). Pump it Up: Data Mining the Water Table. Retrieved from https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/
Jiménez, A., & Pérez-Foguet, A. (2011). The relationship between technology and functionality of rural water points: evidence from Tanzania. Water Science and Technology, 63(5), 948–955. https://doi.org/10.2166/wst.2011.274
Lemmens, R., Lungo, J., Georgiadou, Y., & Verplanke, J. (2017). Monitoring Rural Water Points in Tanzania with Mobile Phones: The Evolution of the SEMA App. ISPRS International Journal of Geo-Information, 6(10), 316. https://doi.org/10.3390/ijgi6100316
Mithrakumar, M. (2019, November 12). How to tune a decision tree? Retrieved March 29, 2021, from https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680
Nelson, D. (n.d.). Gradient boosting classifiers in Python with scikit-learn. Retrieved March 23, 2021, from https://stackabuse.com/gradient-boosting-classifiers-in-python-with-scikit-learn/
Shore, R. (n.d.). Water In Crisis — Spotlight Tanzania. The Water Project. Retrieved February 28, 2021, from https://thewaterproject.org/water-crisis/water-in-crisis-tanzania
The Water Project. (n.d.). Facts and Statistics about Water and Its Effects. The Water Project. Retrieved February 28, 2021, from https://thewaterproject.org/water-scarcity/water_stats
water.org. (n.d.). Tanzania’s Water Crisis — Tanzania’s Water In 2020. Water.Org. Retrieved February 28, 2021, from https://water.org/our-impact/where-we-work/tanzania/