Utilities for applying scikit-learn to spatial datasets
Project description
## Python module for geospatial prediction using scikit-learn and GDAL
*Impute*: Estimating missing data by inference.
The naming and core concept is based on the [yaImpute](http://cran.r-project.org/web/packages/yaImpute/index.html) R Package.
### Overview
*Imputation* and *Geospatial prediction* are broad terms for techniques aimed at estimating spatially-explicity characteristics of the landscape based on sparse observations.
The observations, known as the **training data**, contain:
* **explanatory** variables: relatively inexpensive to measure or predict; explain the spatial patterns of response variables
* **response** variables: relatively expensive or impossible to obtain
The **target data** contains *only* explanatory variables represented by full coverage raster data. There are no response variables available for the target data.
The ultimate goal is to predict spatially-explicit responses based on the target data. It is very similar to the concept of "supervised classification" in remote sensing but can also be used to predict continuious variables (i.e. regression)
### Installation
Install some prerequisites (unless you want to have pip compile them for you)...
```
python setup.py python-numpy python-gdal python-pandas
```
and go...
```
pip install pyimpute
```
Alternatively, grab the source code and go...
```
git clone https://github.com/perrygeo/pyimpute.git
cd pyimpute
python setup.py develop
python setup.py test
```
Finally, check out the [examples](https://github.com/perrygeo/python-impute/blob/master/examples/).
### The goal of pyimpute
`pyimpute` helps optimize and streamline the process of predicting spatial data through supervised classification and regression.
`pyimpute` doesn't do much other than provide some powerful, high-level utility functions
that leverage the raster data tools of GDAL and the machine learning algorithms of scikit-learn
Having a clean workflow for spatial predictions will allow us to:
* frequently update predictions with new information (e.g. new Landsat imagery as it becomes available)
* explore new variables more easily
* bring the technique to other disciplines and geographies
### The process
1. Loading spatial data
* Easiest method: import table containing training observations with both explanatory and response variables.
* Alternate method: perform random stratified sampling on a response data
and generate training data from rasters
* Another alternate method: Use python-rasters-stats to grab the data from explanatory rasters using point data
2. Fit a classification or regression model
* Leverage scikit-learn classifiers or regressors
* Fit to training data
* optional: scale and optionally reduce dimensionality of data
* optional: calibrate using grid search cv to find optimal parameters
* optional: create your own ensemble [2]
* Evaluate:
* crossvalidation (average score over k-folds)
* train_test split
* metrics [3]
* confusion matrix
* compare to dummy estimators
* identify most informative features [1]
3. Generate spatial prediction from target data
* scikit classifiers to make predictions and generate certainty estimates
* GDAL to write predicted classes array to new raster
* write raster of prediction probability for each pixel
* write rasters (one for each class) with probability of that class over space
### Some resources
[1] http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers?rq=1
[2] http://stackoverflow.com/questions/21506128/best-way-to-combine-probabilistic-classifiers-in-scikit-learn/21544196#21544196
[3] http://scikit-learn.org/stable/modules/model_evaluation.html#prediction-error-metrics
[4] http://scikit-learn.org/stable/auto_examples/plot_classification_probability.html
### Example
Please check out [the examples](https://github.com/perrygeo/python-impute/blob/master/examples/)
This example walks through the main steps of loading training data, setting up and evaluating a classifier, and using it to predict a raster of the response variable.
The resulting prediction raster
![alt tag](https://raw.github.com/perrygeo/python-impute/master/img/example_responses.png)
The certainty estimates
![alt tag](https://raw.github.com/perrygeo/python-impute/master/img/example_certainty.png)
#### Note about performance and memory limitations
Depending on the classifier you use, memory and/or time might become limited.
The `impute` method takes an optional argument, (`linechunk`) which calibrates the performance.
Specifically, it determines how many lines/rows of the raster file are processed at once.
**tl;dr;** You want to set `linechunk` as high as possible without exceeding your memory capacity.
In this example, I use the RandomForest classifier. Other classifiers may exhibit different behavior
but, in general, there is a tradeoff between speed and memory;
as you increase `linechunk` memory increases *linearly*
![alt tag](https://raw.github.com/perrygeo/python-impute/master/img/memory.png)
while performance increases *exponentially*.
![alt tag](https://raw.github.com/perrygeo/python-impute/master/img/time.png)
#### Note about geostatistics
While kriging and other geostatistical techniques are technically "geospatial prediction", they rely on spatial dependence between observations. The problems for which this module was built are landscape scale
and rarely suited to such approaches. There is great potential to meld the two approaches (i.e. consider spatial autocorrelation between training data as well as explanatory variables) but this is currently outside the scope of this module.
*Impute*: Estimating missing data by inference.
The naming and core concept is based on the [yaImpute](http://cran.r-project.org/web/packages/yaImpute/index.html) R Package.
### Overview
*Imputation* and *Geospatial prediction* are broad terms for techniques aimed at estimating spatially-explicity characteristics of the landscape based on sparse observations.
The observations, known as the **training data**, contain:
* **explanatory** variables: relatively inexpensive to measure or predict; explain the spatial patterns of response variables
* **response** variables: relatively expensive or impossible to obtain
The **target data** contains *only* explanatory variables represented by full coverage raster data. There are no response variables available for the target data.
The ultimate goal is to predict spatially-explicit responses based on the target data. It is very similar to the concept of "supervised classification" in remote sensing but can also be used to predict continuious variables (i.e. regression)
### Installation
Install some prerequisites (unless you want to have pip compile them for you)...
```
python setup.py python-numpy python-gdal python-pandas
```
and go...
```
pip install pyimpute
```
Alternatively, grab the source code and go...
```
git clone https://github.com/perrygeo/pyimpute.git
cd pyimpute
python setup.py develop
python setup.py test
```
Finally, check out the [examples](https://github.com/perrygeo/python-impute/blob/master/examples/).
### The goal of pyimpute
`pyimpute` helps optimize and streamline the process of predicting spatial data through supervised classification and regression.
`pyimpute` doesn't do much other than provide some powerful, high-level utility functions
that leverage the raster data tools of GDAL and the machine learning algorithms of scikit-learn
Having a clean workflow for spatial predictions will allow us to:
* frequently update predictions with new information (e.g. new Landsat imagery as it becomes available)
* explore new variables more easily
* bring the technique to other disciplines and geographies
### The process
1. Loading spatial data
* Easiest method: import table containing training observations with both explanatory and response variables.
* Alternate method: perform random stratified sampling on a response data
and generate training data from rasters
* Another alternate method: Use python-rasters-stats to grab the data from explanatory rasters using point data
2. Fit a classification or regression model
* Leverage scikit-learn classifiers or regressors
* Fit to training data
* optional: scale and optionally reduce dimensionality of data
* optional: calibrate using grid search cv to find optimal parameters
* optional: create your own ensemble [2]
* Evaluate:
* crossvalidation (average score over k-folds)
* train_test split
* metrics [3]
* confusion matrix
* compare to dummy estimators
* identify most informative features [1]
3. Generate spatial prediction from target data
* scikit classifiers to make predictions and generate certainty estimates
* GDAL to write predicted classes array to new raster
* write raster of prediction probability for each pixel
* write rasters (one for each class) with probability of that class over space
### Some resources
[1] http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers?rq=1
[2] http://stackoverflow.com/questions/21506128/best-way-to-combine-probabilistic-classifiers-in-scikit-learn/21544196#21544196
[3] http://scikit-learn.org/stable/modules/model_evaluation.html#prediction-error-metrics
[4] http://scikit-learn.org/stable/auto_examples/plot_classification_probability.html
### Example
Please check out [the examples](https://github.com/perrygeo/python-impute/blob/master/examples/)
This example walks through the main steps of loading training data, setting up and evaluating a classifier, and using it to predict a raster of the response variable.
The resulting prediction raster
![alt tag](https://raw.github.com/perrygeo/python-impute/master/img/example_responses.png)
The certainty estimates
![alt tag](https://raw.github.com/perrygeo/python-impute/master/img/example_certainty.png)
#### Note about performance and memory limitations
Depending on the classifier you use, memory and/or time might become limited.
The `impute` method takes an optional argument, (`linechunk`) which calibrates the performance.
Specifically, it determines how many lines/rows of the raster file are processed at once.
**tl;dr;** You want to set `linechunk` as high as possible without exceeding your memory capacity.
In this example, I use the RandomForest classifier. Other classifiers may exhibit different behavior
but, in general, there is a tradeoff between speed and memory;
as you increase `linechunk` memory increases *linearly*
![alt tag](https://raw.github.com/perrygeo/python-impute/master/img/memory.png)
while performance increases *exponentially*.
![alt tag](https://raw.github.com/perrygeo/python-impute/master/img/time.png)
#### Note about geostatistics
While kriging and other geostatistical techniques are technically "geospatial prediction", they rely on spatial dependence between observations. The problems for which this module was built are landscape scale
and rarely suited to such approaches. There is great potential to meld the two approaches (i.e. consider spatial autocorrelation between training data as well as explanatory variables) but this is currently outside the scope of this module.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
pyimpute-0.0.3.zip
(15.2 kB
view details)
pyimpute-0.0.3.tar.gz
(8.7 kB
view details)
File details
Details for the file pyimpute-0.0.3.zip
.
File metadata
- Download URL: pyimpute-0.0.3.zip
- Upload date:
- Size: 15.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7897dc5de2b7e0b479d48458c4314abb3c59578a836a6536ca3836e8390ec54c |
|
MD5 | 38d246cf8c8a19f05de135ed8f3ef28c |
|
BLAKE2b-256 | 9ae60dd09f354534e0eeedf849d9ec8a46cb0a6f8cf4e4512ee67b2318ba1a80 |
File details
Details for the file pyimpute-0.0.3.tar.gz
.
File metadata
- Download URL: pyimpute-0.0.3.tar.gz
- Upload date:
- Size: 8.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e09f018250f18b747745f6545f001751479727eaf5fa7cadcd704bad02bc1215 |
|
MD5 | 10ebb2fd6c662fad21ab3447310d2a75 |
|
BLAKE2b-256 | 77875f45b268686a1d894ec51fd14378901165ea9d5bff9a36985774094099ed |