Data Science vs. Pump It Up Competition

Introduction

Tanzania is the largest country in East Africa within the African Great Lakes region, with a population of 59 million people. Like many low-income nations around the world, Tanzania faces serious problems with access to clean water. According to water.org statistics, 4 million people in Tanzania lack access to an improved source of safe water, and 30 million lack access to improved sanitation. According to the Tanzania National Website, water-borne illnesses such as malaria and cholera “account for over half of the diseases affecting the population,” because people don’t have access to sanitary options (Shore, n.d.).

Using data provided by the Tanzanian Ministry of Water and Taarifa, DrivenData began a competition aimed at this problem. The project uses information about water points in Tanzania to predict whether or not a given water point is working correctly.

There are three datasets: a training set, a test set, and a training-labels set that contains the status of each well.

  • The given data included a target with three classes — ‘functional’, ‘non-functional’, and ‘functional needs repair’.

The idea is to build a model that predicts which of these three classes a given water point falls into. Since there are three classes, this is a multi-class (ternary) classification problem.

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677

Non-functional (38%) and functional-needs-repair (7%) water points together make up about 46% of the total, not far behind functional ones (54%). So not only is it very difficult for people in Tanzania to access clean and sanitary water, but only an estimated 54% of the sources they do have are fully functional. It seems that although investment and technology have been made available to these communities, sustainability has been overlooked, if not neglected.
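Class shares like these can be obtained by normalizing the label counts. The following is a minimal sketch, assuming the labels sit in a pandas DataFrame with a status_group column; the toy rows are made up for illustration and are not the competition data.

```python
import pandas as pd

# Toy labels standing in for the real training-labels file (illustrative only).
labels = pd.DataFrame({
    "status_group": ["functional"] * 5
    + ["non functional"] * 4
    + ["functional needs repair"],
})

# normalize=True returns the share of each class instead of raw counts.
class_share = labels["status_group"].value_counts(normalize=True)
print(class_share)
```

On the real labels, the same call would produce the three proportions listed above.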

Data Cleaning Methodology

  1. Merge train set and test set data so that we can do cleaning on both of them at the same time

3. Remove unnecessary columns such as ‘date_recorded’, ‘recorded_by’, ‘num_private’

4. Fix misspellings and variations as much as we can. For instance:

array(['germany republi', 'germany', 'a/co germany', 'aco/germany',
'germany misionary', 'germany cristians',
'bingo foundation germany', 'africa project ev germany',
'germany missionary'], dtype=object)

5. For features with too many unique values, we keep the 20 most common ones and categorize the rest as ‘other’. For example, let’s look at ‘funder_group’.

After grouping, we can explore the relationship between each target class and each unique value of ‘funder_group’ using a crosstab, also called a contingency table.
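A contingency table of this kind can be sketched with pandas.crosstab. The rows below are toy values chosen for illustration, not the real funders' counts, and the row-normalized variant is an optional extra that makes funders of different sizes easier to compare.

```python
import pandas as pd

# Toy rows (illustrative only); the real data has one row per water point.
df = pd.DataFrame({
    "funder_group": ["government", "government", "germany",
                     "germany", "other", "government"],
    "status_group": ["non functional", "functional", "functional",
                     "functional", "functional needs repair", "non functional"],
})

# Rows: funder groups; columns: target classes; cells: number of water points.
counts = pd.crosstab(df["funder_group"], df["status_group"])

# normalize="index" converts each row to shares that sum to 1.
shares = pd.crosstab(df["funder_group"], df["status_group"], normalize="index")

print(counts)
print(shares.round(2))
```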

Let’s visualize this table:

Top 20 Funders

Running this function reduced ‘funder_group’ from 1,897 unique values to only 20.
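A grouping step of this kind might look like the sketch below. The helper name keep_top_n and the toy values are hypothetical; the idea is simply to keep the most frequent values and relabel everything else as ‘other’.

```python
import pandas as pd

def keep_top_n(series: pd.Series, n: int = 20, other: str = "other") -> pd.Series:
    """Keep the n most frequent values of a column and relabel the rest."""
    # value_counts() sorts by frequency, so the first n index entries are the top n.
    top_values = series.value_counts().index[:n]
    # where() keeps values where the condition holds and replaces the others.
    return series.where(series.isin(top_values), other)

# Toy column (illustrative only), grouped with n=2 to keep the example small.
funder = pd.Series(["government", "government", "government",
                    "germany", "germany", "unicef", "a/co germany"])
grouped = keep_top_n(funder, n=2)
print(grouped.value_counts())
```

In this sketch the ‘other’ bucket is an extra label on top of the n kept values, and missing values also end up in ‘other’.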

We repeat this process of keeping the 20 most common values for other features such as ‘installer’, ‘wpt_name’, ‘subvillage’, ‘lga’, ‘ward’, and ‘scheme_name’. We later dropped ‘wpt_name’, ‘subvillage’, ‘ward’, and ‘scheme_name’ since they don’t contribute useful information to the classifier. We kept ‘installer’ and ‘lga’ since reputable installers and authorities are more likely to keep a water point well maintained and functioning. More research into these organizations could tell us more about the quality of their work, but time constraints put that outside the scope of this project.

All location features, such as ‘longitude’, ‘latitude’, ‘region’, ‘region_code’, and ‘district_code’, are kept since we know from previous projects that location is one of the most important pieces of information.

Redundant features such as ‘extraction_type_group’, ‘extraction_type_class’, ‘management’, ‘payment_type’, ‘quality_group’, ‘quantity_group’, ‘source_type’, and ‘waterpoint_type_group’ are dropped.

Summary of Data Cleaning

  • amount_tsh

Data Exploration

‘funder_group’

We can see that water sources funded by Germany and private companies are well maintained and functional, while water sources funded by Finland are due for major repair work. However, private companies and international donors fund only a minority of water points. The majority are funded by the Government of Tanzania, which has more non-functional water points than functional ones.

‘installer_group’

Finland, again, is behind on maintenance and repair. DMDD (Diocese of Mbulu Development Department), CES (Consulting Engineers Salzgitter), and rc church (Roman Catholic Church) are doing a great job. DWE (District Water Engineer), which is in charge of the functionality status of water points, is the main installer and is struggling to keep up, with almost 50% of its water points either non-functional or in need of repair. District council, the Tanzanian government, RWE (Regional Water Engineer), and LGA (Local Government Authority), on the other hand, have more non-functional water points than functional ones.

‘lga_group’

Many LGAs (Local Government Authorities) are seriously struggling. Those most in need of help are Kyela, Magu, and Mbozi. If we want to increase long-term sustainability, we need to strengthen management and regulation.

‘scheme_management’

The majority of water points are managed by VWC (Village Water Committee). Every management type keeps its count of functional water points higher than its non-functional count, with the exception of SWC (State Water Contractor).

Private operators, although a minority contributor, are doing the best job among all managements.

‘management_group’

User groups do a far larger share of the work of managing water points than commercial and parastatal groups. Under NAWAPO (National Water Policy), user groups are to take full responsibility for operating, maintaining, and sustaining water points at the village level. However, disbursement of funds and reporting of functionality must follow a long bureaucratic process of accountability, requiring upward reporting at each level of government, all the way from the village to the district and, finally, to the Ministry of Water (Lemmens et al., 2017). The problem is not only miscommunication but also a power struggle over roles, responsibilities, and accountability among the many levels of government.

On top of that, the data published by the ministry, which are based on the coverage reported by districts, are not reliable, as recognized by DWEs (District Water Engineers) (Jiménez, 2011). User groups, which operate at the village level and are responsible for essentially everything regarding a water point, should be the ones in direct communication about funds and functionality reports.

‘public_meeting’

As also seen here, public meetings do not help much. Miscommunication is a huge issue between the many levels of government.

‘payment’

If the water point’s management charges for water, the water point is more likely to be well maintained and kept functional. Regular payment supports preventative maintenance better than trying to secure a large sum once the system breaks down, because the income can go toward routine maintenance and upkeep.

However, putting a stable payment system in place will require restructuring authority so that there is a system of co-responsibility between the central, regional, and local levels, which has been seriously lacking in Tanzania, as we have seen so far. With ‘never pay’ being the most common value, we can also see that villagers, the people who directly benefit from the water points, are left to their own devices after a water point is funded and installed, without further technical support for long-term sustainability.

‘age’

To further emphasize the sustainability problem, we will look at ‘age’. According to the plot, 30% of water points become non-functional within the first year, and only 54% of water points are working 15 years after installation.

At first glance, it makes sense that the older the water point, the more likely it is to be non-functional or in need of repair. However, Jiménez (2011) showed that about 30% of water points become non-functional within just the first five years of operation, and only between 35% and 47% are working 15 years after installation.

Even though the Water Sector Development Program (WSDP) has run hundreds of millions of dollars over budget and years past its original deadline, local governments and communities find themselves unable to raise the money to operate and maintain their water points.

As suggested above, a local payment system should be put in place so that user groups can be independently responsible for their own water points. Direct funding from international donors to the village level should also be implemented, instead of going through the long bureaucratic process of accountability in which money gets lost along the way between ministry and district.

‘amount_tsh’

Static head measures the total vertical distance that a pump raises water.

‘extraction_type’

Extraction types can be grouped into 3 categories: hand pumps, motorized systems, and gravity-fed systems. Jiménez (2011) found that:

  • Hand pumps had the least favorable functionality-time function, dropping from 61% in the first five years to 6% in the 25-year period.

We can observe this in the visualization above: gravity is the most commonly used method and has a higher ratio of functional to non-functional water points.

‘water_quality’

Fluoride water points are highly functional, while ‘fluoride abandoned’ water points, meaning those with too much fluoride, are more likely to be abandoned and left non-functional. Water points of unknown quality also tend to be neglected and left broken.

‘water_quantity’

Dry water points can’t function, for obvious reasons.

‘source’

Water points fed by lakes, which are large accumulations of water in low-lying places, are more likely to be non-functional, probably because they lack the large inflow and outflow of other natural sources such as rivers and springs (including shallow wells). Water points that are human-operated, such as hand dtw, machine dbh, rainwater harvesting, and dam, are more likely to be functional.

‘source_class’

Groundwater is located underground and must be pumped out of the ground after drilling a deep well. Surface water is found in lakes, rivers and streams.

‘water_point_type’

Since in-home water connections are largely unavailable in a poor country like Tanzania, communal standpipes, a.k.a. public water dispensers, are where people come to get water. Since these are usually free, over time the inability to recover costs has resulted in a decline in functional units.

Data Preparation

Outline of Training and Validating Steps
  1. In order for the classification model to correctly predict the target, we need to change the labels from strings to integers.

Now we have:

  • Nonfunctional = 0
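The label conversion can be sketched with a plain dictionary mapping. The codes follow the ones used in this article (0 = non functional, 1 = functional, 2 = functional needs repair); the variable names and the toy series are illustrative assumptions.

```python
import pandas as pd

# Mapping consistent with the class codes used in this article.
STATUS_CODES = {
    "non functional": 0,
    "functional": 1,
    "functional needs repair": 2,
}

# Toy labels (illustrative only).
status = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)

encoded = status.map(STATUS_CODES)

# map() yields NaN for unknown labels, so check before casting to int.
if encoded.isna().any():
    raise ValueError("Unexpected label in status_group")
encoded = encoded.astype(int)
print(encoded.tolist())
```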

2. Dummy-encode the categorical variables.

3. The two main classes (functional, non-functional) are fairly balanced, though the third class (functional needs repair) is only 7% of the total target. The challenge of working with imbalanced datasets is that many machine learning techniques tend to ignore, and in turn perform poorly on, the minority class, although typically it is performance on the minority class that is most important. To address this, we use the oversampling technique SMOTE (Synthetic Minority Over-sampling Technique). SMOTE builds on the idea of nearest neighbors to create synthetic data points. It is one of the most popular techniques for oversampling.

Class Distribution Before:
Train Set
1    25825
0    18210
2     3485
Name: status_group, dtype: int64
Class Distribution After:
Train Set
2    25825
1    25825
0    25825
Name: status_group, dtype: int64
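To make the nearest-neighbor idea concrete, here is a simplified NumPy sketch of how SMOTE-style synthetic rows are generated. It is a stand-in written for illustration, not the implementation used in this project; the function name smote_like and the toy points are hypothetical, and libraries such as imbalanced-learn provide a full version.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=21):
    """Create n_new synthetic rows by interpolating between minority rows
    and their k nearest minority-class neighbours (the core SMOTE idea).
    Needs at least two minority rows."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random minority row and one of its neighbours for each new row.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new row at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class (illustrative only): the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies. Note that interpolation on dummy-encoded columns produces fractional values, which is one reason oversampled results can differ from expectations.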

4. Scale: After splitting the data into training and test sets, we use MinMaxScaler() to fit and transform the continuous variables of X_train and to transform those of X_test. We fit only on the training data because, in a real-world setting, that is the only data we have access to. We can then use the same scaler object to transform the test data. If we were to transform the data first and then split it into training and test sets, it would lead to data leakage.
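The encoding, splitting, and scaling steps can be sketched as follows. The toy columns, the 80/20 split, and stratification are assumptions made for the example; the point is the order of operations, with the scaler fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(21)

# Toy data (illustrative only): one categorical and one continuous feature.
n = 200
df = pd.DataFrame({
    "payment": rng.choice(["never pay", "pay monthly", "pay per bucket"], size=n),
    "age": rng.integers(0, 50, size=n).astype(float),
})
y = pd.Series(rng.choice([0, 1, 2], size=n, p=[0.38, 0.55, 0.07]), name="status_group")

# Dummy-encode the categorical column into 0/1 indicator columns.
X = pd.get_dummies(df, columns=["payment"], dtype=float)

# Split first so that nothing learned from the test rows leaks into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training rows only, then reuse it on the test rows.
continuous_cols = ["age"]
scaler = MinMaxScaler()
X_train[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])
X_test[continuous_cols] = scaler.transform(X_test[continuous_cols])

print(X_train["age"].min(), X_train["age"].max())
```

Scaled test values can fall slightly outside the 0 to 1 range when the test set contains a value the training set never saw, which is expected with this approach.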

Modeling

Our first model, a Decision Tree, generated a feature importance list that helps us decide whether to keep all the features or trim them down further.
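A feature importance list of this kind can be read from a fitted tree's feature_importances_ attribute. In the sketch below, the column names and the toy label are hypothetical: the label is built to depend on one column only, so that column should dominate the ranking.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(21)

# Toy data (illustrative only): the label copies "quantity_dry" and ignores "noise".
X = pd.DataFrame({
    "quantity_dry": rng.integers(0, 2, size=300),
    "noise": rng.random(300),
})
y = X["quantity_dry"]

tree = DecisionTreeClassifier(random_state=21).fit(X, y)

# Importances sum to 1; higher means the feature contributed more to the splits.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```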

We dropped all ‘ward’, ‘subvillage’, and ‘wpt_name’ related features because they don’t provide much information for our classification problem since they are just names after all.

Two sets of 9 models were built: one set without correction for class imbalance and one with correction. The accuracy score is higher for the imbalanced set, and its training time is much shorter than for the oversampled set.

Building baseline models is simple and quick. However, we want to tune the models’ parameters to find the best configuration. This tuning step is done with GridSearchCV, which takes a long time to run, so we settled on just a few selected parameters and a narrow range of options. With more time, we would want to search a wider variety.
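A narrow grid search of this kind might look like the sketch below. The synthetic data, the parameter ranges, and the 3-fold setting are assumptions chosen to keep the example fast; they are not the grid used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the prepared training set.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=21
)

# A deliberately small grid: 2 x 3 x 2 = 12 combinations.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 40],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=21),
    param_grid,
    cv=3,                # each combination is fitted 3 times
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The run time grows with the product of all option counts times the number of folds, which is why a wider grid quickly becomes expensive.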

Models:

  • Decision Tree: is the simplest tree-based method. It is a very popular method because of its robustness to noise, tolerance against missing information, handling of irrelevant, redundant predictive attribute values, low computational cost, interpretability, fast run time and robust predictors (Mithrakumar, 2019).

Evaluation Metrics:

  • Accuracy: Out of all predictions across all classes, the share we predicted correctly. It should be as high as possible.

Model Building Steps:

  • Build a baseline model with random_state = 21 for reproducibility

Summary of Models & Their Performance

Best Model: #14 with GradientBoostingClassifier

The best model is the imbalanced Gradient Boost (model #14), trained without correcting class imbalance, with an accuracy of 81.3%. The #1 score on the DrivenData leaderboard is 82.9% accuracy.

Gradient boosting is a family of machine learning algorithms that combine many weak learning models to create a strong predictive model. Gradient Boosting Classifier is the variant used for classification tasks, as the name suggests. Its objective is to minimize the loss, or the difference between the actual class value of a training example and the predicted class value. A strength of gradient boosting is that it handles not only binary classification but also multi-class classification and even regression problems.

Model Building

Using GridSearchCV, we identified optimal values for parameters.

A Gradient Boosting Classifier has two other necessary parts: a weak learner and an additive component. The weak learner used is DecisionTreeClassifier (model #9). We trained a baseline GradientBoostingClassifier with the values identified above and those identified for the DecisionTreeClassifier model.

Decision Tree Classifier GridSearchCV:

'criterion': 'entropy',
'max_depth': 40,
'max_features': 'auto',
'min_samples_leaf': 4,
'min_samples_split': 5
  • criterion: the function used to measure the quality of a split. The chosen value here is ‘entropy’, which is a measure of disorder or impurity.

Gradient Boosting Classifier GridSearchCV:

'learning_rate': 0.05, 
'subsample': 0.5

Train with both results from GridSearchCV:

Train accuracy: 99.32659932659934
Test accuracy: 81.3047138047138
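One plausible way to combine the two sets of values in scikit-learn is sketched below, on synthetic data. The tree-related settings (max_depth, min_samples_leaf, min_samples_split) are passed straight to GradientBoostingClassifier; ‘criterion’ and ‘max_features’ are left out here because the boosting estimator uses its own split criteria and recent scikit-learn releases no longer accept 'auto'. The n_estimators value and the data are assumptions, so the printed accuracies will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the prepared water-point features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=21
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21, stratify=y
)

model = GradientBoostingClassifier(
    learning_rate=0.05,      # from the gradient boosting search
    subsample=0.5,           # from the gradient boosting search
    max_depth=40,            # from the decision tree search
    min_samples_leaf=4,      # from the decision tree search
    min_samples_split=5,     # from the decision tree search
    n_estimators=50,         # assumption, kept small for a quick run
    random_state=21,
)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Train accuracy: {100 * train_acc:.2f}")
print(f"Test accuracy: {100 * test_acc:.2f}")
```

Very deep trees such as max_depth=40 let each boosting stage memorize much of the training data, which is one likely contributor to a large gap between train and test accuracy.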

Although the model’s accuracy is high, it is overfit: train and test accuracy are very different, so the model did not generalize well. The high variance error also indicates overfitting.

Bias: 0.00202020202020202
Variance: 0.30902626149259155

More features, such as ‘scheme_name’ and ‘year_recorded’, were also dropped in an attempt to reduce overfitting, but this did not succeed.

Result Interpretation

Model: Gradient Boosting
              precision    recall  f1-score   support

           0       0.85      0.79      0.82      4614
           1       0.81      0.89      0.85      6434
           2       0.54      0.35      0.43       832

    accuracy                           0.81     11880
   macro avg       0.73      0.68      0.70     11880
weighted avg       0.81      0.81      0.81     11880
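  • Macro average is the unweighted mean of each metric (precision, recall, f1-score) across the three classes, so every class counts equally regardless of its size.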
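The two averages can be recomputed by hand from the per-class rows of the report above. This short sketch uses the rounded values as printed, so it is an arithmetic check rather than a replacement for the library's report.

```python
# Per-class scores copied from the report: (precision, recall, f1-score, support).
report = {
    0: (0.85, 0.79, 0.82, 4614),
    1: (0.81, 0.89, 0.85, 6434),
    2: (0.54, 0.35, 0.43, 832),
}

def macro_avg(rows, col):
    """Unweighted mean over classes: every class counts equally."""
    return sum(r[col] for r in rows.values()) / len(rows)

def weighted_avg(rows, col):
    """Mean over classes weighted by support (number of true samples)."""
    total = sum(r[3] for r in rows.values())
    return sum(r[col] * r[3] for r in rows.values()) / total

for name, col in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(name, round(macro_avg(report, col), 2), round(weighted_avg(report, col), 2))
```

The macro averages are pulled down by class 2, while the weighted averages stay near 0.81 because class 2 contributes only 832 of the 11,880 samples.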

Class functional (1) and class non-functional (0) have similar precision and recall score but class functional-needs-repair (2) has very low precision and recall.

The f1 score favors classifiers that have similar precision and recall and this model has the highest f1 score out of all models.

Imbalance GradientBoostingClassifier

The model does very well at classifying functional (1) as functional (1) and non-functional (0) as non-functional (0). However, it does not do as well with functional-needs-repair (2), which it tends to classify as functional (1). Although a functional-needs-repair water point is still functional, misclassifying it as functional (1) is more costly than misclassifying it as non-functional (0), because repair and maintenance will be overdue, causing more damage and eventually leading to a non-functional water point. The minority class of functional-needs-repair should be the number one priority, since preventative repair and maintenance done on time and on schedule are more cost- and time-effective than letting the well become non-functional, which will take much more to fix, if it can be fixed at all.

We tried to fix the class imbalance with SMOTE(). However, as seen in the confusion matrix below for the same GradientBoostingClassifier model with class imbalance correction, the situation improved, but not by much. For reasons that need more research and further analysis, correcting class imbalance makes the accuracy worse.
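For readers who want to build such a matrix themselves, here is a minimal sketch with scikit-learn. The ten toy labels are invented to mimic the pattern described (class 2 mostly predicted as class 1) and are not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): 0 = non functional, 1 = functional,
# 2 = functional needs repair.
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1, 1, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Recall for class 2: correctly predicted class-2 rows over all true class-2 rows.
recall_needs_repair = cm[2, 2] / cm[2].sum()
print(round(recall_needs_repair, 2))
```

Reading along the last row shows where the needs-repair points end up, which is the quantity that matters most for the maintenance argument made here.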

Balanced GradientBoostingClassifier
Train accuracy: 99.63472087770249
Test accuracy: 79.6969696969697

Future Work

  1. Since correcting class imbalance did not improve the model, we can try model stacking, i.e., build one binary classifier for functional vs. non-functional and another for functional vs. functional needs repair.
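One possible reading of this idea is a two-stage cascade, sketched below on synthetic data. The first stage here separates non-functional points from working ones (functional or needs repair), which is an interpretation rather than something this project has tested; the helper name predict_two_stage and the class weights are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with an imbalance similar to the article's classes:
# 0 = non functional, 1 = functional, 2 = functional needs repair.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.38, 0.55, 0.07], random_state=21,
)

# Stage 1: non-functional (0) vs. working (1 or 2).
stage1 = GradientBoostingClassifier(random_state=21)
stage1.fit(X, (y != 0).astype(int))

# Stage 2: among working points only, functional (1) vs. needs repair (2).
working = y != 0
stage2 = GradientBoostingClassifier(random_state=21)
stage2.fit(X[working], y[working])

def predict_two_stage(X_new):
    """Route rows predicted as working to the second classifier."""
    pred = np.zeros(len(X_new), dtype=int)      # default: non-functional
    is_working = stage1.predict(X_new) == 1
    if is_working.any():
        pred[is_working] = stage2.predict(X_new[is_working])
    return pred

print(np.bincount(predict_two_stage(X), minlength=3))
```

A design like this lets the second stage be tuned or resampled specifically for the needs-repair class without disturbing the easier first decision.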

Solutions to the Water Crisis in Tanzania

  1. Focus on sustainability: early preventative strategy rather than letting things go broken
  • A local payment system should be put in place so that the user-group can be independently responsible for their own water points

Github

Reference

DrivenData. (n.d.). Pump it Up: Data Mining the Water Table. Retrieved from https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/

Jiménez, A., & Pérez-Foguet, A. (2011). The relationship between technology and functionality of rural water points: evidence from Tanzania. Water Science and Technology, 63(5), 948–955. https://doi.org/10.2166/wst.2011.274

Lemmens, R., Lungo, J., Georgiadou, Y., & Verplanke, J. (2017). Monitoring Rural Water Points in Tanzania with Mobile Phones: The Evolution of the SEMA App. ISPRS International Journal of Geo-Information, 6(10), 316. doi:10.3390/ijgi6100316

Mithrakumar, M. (2019, November 12). How to tune a decision tree? Retrieved March 29, 2021, from https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680

Nelson, D. (n.d.). Gradient boosting classifiers in Python with scikit-learn. Retrieved March 23, 2021, from https://stackabuse.com/gradient-boosting-classifiers-in-python-with-scikit-learn/

Shore, R. (n.d.). Water In Crisis — Spotlight Tanzania. The Water Project. Retrieved February 28, 2021, from https://thewaterproject.org/water-crisis/water-in-crisis-tanzania

The Water Project. (n.d.). Facts and Statistics about Water and Its Effects. The Water Project. Retrieved February 28, 2021, from https://thewaterproject.org/water-scarcity/water_stats

water.org. (n.d.). Tanzania’s Water Crisis — Tanzania’s Water In 2020. Water.Org. Retrieved February 28, 2021, from https://water.org/our-impact/where-we-work/tanzania/
