Commit d099bd6d authored by lukas leufen

Merge branch 'release_v0.11.0' into 'master'

Release v0.11.0

Closes #136, #138, #139, #140, #141, #142, #143, #144, #151, #152

See merge request !135
parents 7f0b5ac2 a15bf606
Pipeline #46089 passed with stages in 7 minutes and 17 seconds
@@ -60,7 +60,7 @@ Thumbs.db
htmlcov/
.pytest_cache
/test/data/
/test/test_modules/data/
/test/test_run_modules/data/
report.html
/TestExperiment/
/testrun_network*/
......
# Changelog
All notable changes to this project will be documented in this file.
## v0.11.0 - 2020-08-24 - Advanced Data Handling for MLAir
### general
- Introduce advanced data handling with much more flexibility (independent of TOAR DB, custom data handling is
pluggable), #144
- default data handler is still using TOAR DB
### new features
- default data handler using TOAR DB refactored according to advanced data handling, #140, #141, #152
- data sets are handled as collections, #142, and are iterable in a standard way (StandardIterator) and optimised for
  keras (KerasIterator), #143
- automatically moving station map plot, #136
### technical
- model modules available from package, #139
- renaming of parameter time dimension, #151
- refactoring of README.md, #138
## v0.10.0 - 2020-07-15 - MLAir is official name, Workflows, easy Model plug-in
### general
......
#!/usr/bin/env bash
# run coverage twice, 1) for html deploy 2) for success evaluation
python3.6 -m pytest --cov=src --cov-report term --cov-report html test/ | tee coverage_results.out
python3.6 -m pytest --cov=mlair --cov-report term --cov-report html test/ | tee coverage_results.out
IS_FAILED=$?
......
MIT License
Copyright (c) 2020 Lukas Leufen
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
\ No newline at end of file
@@ -17,7 +17,7 @@ sys.path.insert(0, os.path.abspath('../..'))
# -- Project information -----------------------------------------------------
project = 'machinelearningtools'
project = 'MLAir'
copyright = '2020, Lukas H Leufen, Felix Kleinert'
author = 'Lukas H Leufen, Felix Kleinert'
@@ -118,7 +118,7 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'machinelearningtools.tex', 'MachineLearningTools Documentation',
(master_doc, 'mlair.tex', 'MLAir Documentation',
author, 'manual'),
]
......
Get started with MachineLearningTools
=====================================
Get started with MLAir
======================
<what is machinelearningtools?>
Install MLAir
-------------
MLT Module and Function Documentation
-------------------------------------
MLAir is based on several python frameworks. To work properly, you have to install all packages from the
`requirements.txt` file. Additionally, to support the geographical plotting part, it is required to install geo
packages built for your operating system. The names of these packages may differ between systems; we refer
here to the openSUSE / Leap OS. The geo plot can be removed from the `plot_list`; in this case there is no need to
install the geo packages.
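For example, a minimal sketch to run MLAir without the geo plot (the plot names used here are assumptions and may differ in your MLAir version):

.. code-block:: python

    import mlair

    # run with a plot list that leaves out the geographical station map,
    # so neither proj nor cartopy are required
    mlair.run(plot_list=["PlotMonthlySummary", "PlotTimeSeries"])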
Install MachineLearningTools
----------------------------
* (geo) Install **proj** on your machine using the console. E.g. for opensuse / leap `zypper install proj`
* (geo) A c++ compiler is required for the installation of the program **cartopy**
* Install all requirements from [`requirements.txt`](https://gitlab.version.fz-juelich.de/toar/machinelearningtools/-/blob/master/requirements.txt)
preferably in a virtual environment
* (tf) Currently, TensorFlow-1.13 is mentioned in the requirements. We already tested the TensorFlow-1.15 version and couldn't
  find any compatibility errors. Please note that tf-1.13 and tf-1.15 each come in two distinct variants: the default variant
  with CPU support, and the "-gpu" variant with GPU support. If the GPU version is installed, MLAir will make use of the GPU
  device.
* Installation of **MLAir**:
* Either clone MLAir from the [gitlab repository](https://gitlab.version.fz-juelich.de/toar/machinelearningtools.git)
  and use it without installation (besides the requirements)
* or download the distribution file (?? .whl) and install it via `pip install <??>`. In this case, you can simply
import MLAir in any python script inside your virtual environment using `import mlair`.
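To verify the setup afterwards, a quick sanity check could look like this (``get_version`` is defined in ``mlair/__init__.py``; ``tf.test.is_gpu_available`` is the TensorFlow-1.x API):

.. code-block:: python

    import tensorflow as tf
    import mlair

    # print the installed MLAir version and check whether a GPU device is usable
    print(mlair.get_version())
    print(tf.test.is_gpu_available())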
Dependencies
How to start with MLAir
-----------------------
In this section, we show three examples of how to work with MLAir.
Example 1
~~~~~~~~~
We start MLAir in a dry run without any modification. Just import mlair and run it.
.. code-block:: python
import mlair
# just give it a dry run without any modification
mlair.run()
The logging output will show you a lot of information. Additional information (including debug messages) is collected
inside the experiment path in the logging folder.
.. code-block::
INFO: mlair started
INFO: ExperimentSetup started
INFO: Experiment path is: /home/<usr>/mlair/testrun_network
...
INFO: load data for DEBW001 from JOIN
...
INFO: Training started
...
INFO: mlair finished after 00:00:12 (hh:mm:ss)
Example 2
~~~~~~~~~
Now we update the stations and customise the window history size parameter.
.. code-block:: python
import mlair
# our new stations to use
stations = ['DEBW030', 'DEBW037', 'DEBW031', 'DEBW015', 'DEBW107']
# expanded temporal context to 14 (days, because of default sampling="daily")
window_history_size = 14
# restart the experiment with little customisation
mlair.run(stations=stations,
window_history_size=window_history_size)
The output looks similar, but we can see that the new stations are loaded.
.. code-block::
INFO: mlair started
INFO: ExperimentSetup started
...
INFO: load data for DEBW030 from JOIN
INFO: load data for DEBW037 from JOIN
...
INFO: Training started
...
INFO: mlair finished after 00:00:24 (hh:mm:ss)
Example 3
~~~~~~~~~
Let's apply our trained model to new data. We therefore keep the window history size parameter but change the stations.
In the run method, we need to disable the `trainable` and `create_new_model` parameters. MLAir will use the model we
have trained before. Note that this only works if the experiment path has not changed or a suitable trained model is
placed inside the experiment path.
.. code-block:: python
import mlair
# our new stations to use
stations = ['DEBY002', 'DEBY079']
# same setting for window_history_size
window_history_size = 14
# run experiment without training
mlair.run(stations=stations,
window_history_size=window_history_size,
create_new_model=False,
trainable=False)
We can see from the terminal that no training was performed. The analysis is now performed on the new stations.
.. code-block::
INFO: mlair started
...
INFO: No training has started, because trainable parameter was false.
...
INFO: mlair finished after 00:00:06 (hh:mm:ss)
Customised workflows and models
-------------------------------
Custom Workflow
~~~~~~~~~~~~~~~
MLAir provides a default workflow. If additional steps are to be performed, you have to append custom run modules to
the workflow.
.. code-block:: python
import mlair
import logging
class CustomStage(mlair.RunEnvironment):
"""A custom MLAir stage for demonstration."""
def __init__(self, test_string):
super().__init__() # always call super init method
self._run(test_string) # call a class method
def _run(self, test_string):
logging.info("Just running a custom stage.")
logging.info("test_string = " + test_string)
epochs = self.data_store.get("epochs")
logging.info("epochs = " + str(epochs))
# create your custom MLAir workflow
CustomWorkflow = mlair.Workflow()
# provide stages without initialisation
CustomWorkflow.add(mlair.ExperimentSetup, epochs=128)
# add also keyword arguments for a specific stage
CustomWorkflow.add(CustomStage, test_string="Hello World")
# finally execute custom workflow in order of adding
CustomWorkflow.run()
.. code-block::
INFO: mlair started
...
INFO: ExperimentSetup finished after 00:00:12 (hh:mm:ss)
INFO: CustomStage started
INFO: Just running a custom stage.
INFO: test_string = Hello World
INFO: epochs = 128
INFO: CustomStage finished after 00:00:01 (hh:mm:ss)
INFO: mlair finished after 00:00:13 (hh:mm:ss)
Custom Model
~~~~~~~~~~~~
Data
~~~~
Each model has to inherit from the abstract model class to ensure a smooth training and evaluation behaviour. It is
required to implement the `set_model` and `set_compile_options` methods. The latter has to set the loss at least.
.. code-block:: python
import keras
from keras.losses import mean_squared_error as mse
from keras.optimizers import SGD
from mlair.model_modules import AbstractModelClass
class MyLittleModel(AbstractModelClass):
"""
A customised model with a 1x1 Conv, and 3 Dense layers (32, 16
window_lead_time). Dropout is used after Conv layer.
"""
def __init__(self, window_history_size, window_lead_time, channels):
super().__init__()
# settings
self.window_history_size = window_history_size
self.window_lead_time = window_lead_time
self.channels = channels
self.dropout_rate = 0.1
self.activation = keras.layers.PReLU
self.lr = 1e-2
# apply to model
self.set_model()
self.set_compile_options()
self.set_custom_objects(loss=self.compile_options['loss'])
def set_model(self):
# add 1 to window_size to include current time step t0
shape = (self.window_history_size + 1, 1, self.channels)
x_input = keras.layers.Input(shape=shape)
x_in = keras.layers.Conv2D(32, (1, 1), padding='same')(x_input)
x_in = self.activation()(x_in)
x_in = keras.layers.Flatten()(x_in)
x_in = keras.layers.Dropout(self.dropout_rate)(x_in)
x_in = keras.layers.Dense(32)(x_in)
x_in = self.activation()(x_in)
x_in = keras.layers.Dense(16)(x_in)
x_in = self.activation()(x_in)
x_in = keras.layers.Dense(self.window_lead_time)(x_in)
out = self.activation()(x_in)
self.model = keras.Model(inputs=x_input, outputs=[out])
def set_compile_options(self):
self.compile_options = {"optimizer": SGD(lr=self.lr),
"loss": mse,
"metrics": ["mse"]}
__version_info__ = {
'major': 0,
'minor': 10,
'minor': 11,
'micro': 0,
}
from mlair.run_modules import *
from mlair.run_modules import RunEnvironment, ExperimentSetup, PreProcessing, ModelSetup, Training, PostProcessing
from mlair.workflows import DefaultWorkflow, Workflow
from mlair.run_script import run
from mlair.model_modules import AbstractModelClass
def get_version():
......
@@ -13,7 +13,8 @@ DEFAULT_START = "1997-01-01"
DEFAULT_END = "2017-12-31"
DEFAULT_WINDOW_HISTORY_SIZE = 13
DEFAULT_OVERWRITE_LOCAL_DATA = False
DEFAULT_TRANSFORMATION = {"scope": "data", "method": "standardise", "mean": "estimate"}
# DEFAULT_TRANSFORMATION = {"scope": "data", "method": "standardise", "mean": "estimate"}
DEFAULT_TRANSFORMATION = {"scope": "data", "method": "standardise"}
DEFAULT_HPC_LOGIN_LIST = ["ju", "hdfmll"]  # ju(wels), hdfmll(ogin)
DEFAULT_HPC_HOST_LIST = ["jw", "hdfmlc"]  # first part of node names for Juwels (jw(comp)) and HDF-ML (hdfmlc(ompute))
DEFAULT_CREATE_NEW_MODEL = True
@@ -28,9 +29,9 @@ DEFAULT_TARGET_VAR = "o3"
DEFAULT_TARGET_DIM = "variables"
DEFAULT_WINDOW_LEAD_TIME = 3
DEFAULT_DIMENSIONS = {"new_index": ["datetime", "Stations"]}
DEFAULT_INTERPOLATION_DIM = "datetime"
DEFAULT_TIME_DIM = "datetime"
DEFAULT_INTERPOLATION_METHOD = "linear"
DEFAULT_LIMIT_NAN_FILL = 1
DEFAULT_INTERPOLATION_LIMIT = 1
DEFAULT_TRAIN_START = "1997-01-01"
DEFAULT_TRAIN_END = "2007-12-31"
DEFAULT_TRAIN_MIN_LENGTH = 90
......
@@ -33,13 +33,13 @@ def prepare_host(create_new=True, data_path=None, sampling="daily") -> str:
elif hostname == "zam347":
data_path = f"/home/{user}/Data/toar_{sampling}/"
elif hostname == "linux-aa9b":
data_path = f"/home/{user}/machinelearningtools/data/toar_{sampling}/"
data_path = f"/home/{user}/mlair/data/toar_{sampling}/"
elif (len(hostname) > 2) and (hostname[:2] == "jr"):
data_path = f"/p/project/cjjsc42/{user}/DATA/toar_{sampling}/"
elif (len(hostname) > 2) and (hostname[:2] in ['jw', 'ju'] or hostname[:5] in ['hdfml']):
data_path = f"/p/project/deepacf/intelliaq/{user}/DATA/toar_{sampling}/"
elif runner_regex.match(hostname) is not None:
data_path = f"/home/{user}/machinelearningtools/data/toar_{sampling}/"
data_path = f"/home/{user}/mlair/data/toar_{sampling}/"
else:
data_path = os.path.join(os.getcwd(), "data", sampling)
# raise OSError(f"unknown host '{hostname}'")
......
@@ -10,6 +10,6 @@ __date__ = '2020-04-17'
from .bootstraps import BootStraps
from .data_preparation_join import DataPrepJoin
from .data_generator import DataGenerator
from .data_distributor import Distributor
from .iterator import KerasIterator, DataCollection
from .advanced_data_handler import DefaultDataPreparation, AbstractDataPreparation
from .data_preparation_neighbors import DataPreparationNeighbors
"""
Collections of bootstrap methods and classes.
How to use
----------
test
"""
__author__ = 'Felix Kleinert, Lukas Leufen'
__date__ = '2020-02-07'
import os
from collections.abc import Iterator, Iterable
from itertools import chain
import numpy as np
import xarray as xr
from mlair.data_handler.advanced_data_handler import AbstractDataPreparation
class BootstrapIterator(Iterator):
_position: int = None
def __init__(self, data: "BootStraps"):
assert isinstance(data, BootStraps)
self._data = data
self._dimension = data.bootstrap_dimension
self._collection = self._data.bootstraps()
self._position = 0
def __next__(self):
"""Return next element or stop iteration."""
try:
index, dimension = self._collection[self._position]
nboot = self._data.number_of_bootstraps
_X, _Y = self._data.data.get_data(as_numpy=False)
            # add a new "boots" axis holding all bootstrap realisations
            _X = list(map(lambda x: x.expand_dims({'boots': range(nboot)}, axis=-1), _X))
            _Y = _Y.expand_dims({"boots": range(nboot)}, axis=-1)
            # shuffle only the selected variable of the selected input stream
            single_variable = _X[index].sel({self._dimension: [dimension]})
            shuffled_variable = self.shuffle(single_variable.values)
            shuffled_data = xr.DataArray(shuffled_variable, coords=single_variable.coords, dims=single_variable.dims)
            _X[index] = shuffled_data.combine_first(_X[index]).reindex_like(_X[index])
self._position += 1
except IndexError:
raise StopIteration()
_X, _Y = self._to_numpy(_X), self._to_numpy(_Y)
return self._reshape(_X), self._reshape(_Y), (index, dimension)
    @staticmethod
    def _reshape(d):
        # move the trailing boots axis to the front and fold it into the sample axis
        if isinstance(d, list):
            return list(map(lambda x: np.rollaxis(x, -1, 0).reshape(x.shape[0] * x.shape[-1], *x.shape[1:-1]), d))
        else:
            shape = d.shape
            return np.rollaxis(d, -1, 0).reshape(shape[0] * shape[-1], *shape[1:-1])
@staticmethod
def _to_numpy(d):
if isinstance(d, list):
return list(map(lambda x: x.values, d))
else:
return d.values
@staticmethod
def shuffle(data: np.ndarray) -> np.ndarray:
"""
Shuffle randomly from given data (draw elements with replacement).
:param data: data to shuffle
:return: shuffled data as numpy array
"""
size = data.shape
return np.random.choice(data.reshape(-1, ), size=size)
class BootStraps(Iterable):
"""
Main class to perform bootstrap operations.
This class requires a data handler following the definition of the AbstractDataPreparation, the number of bootstraps
to create and the dimension along this bootstrapping is performed (default dimension is `variables`).
When iterating on this class, it returns the bootstrapped X, Y and a tuple with (position of variable in X, name of
this variable). The tuple is interesting if X consists on mutliple input streams X_i (e.g. two or more stations)
because it shows which variable of which input X_i has been bootstrapped. All bootstrap combinations can be
retrieved by calling the .bootstraps() method. Further more, by calling the .get_orig_prediction() this class
imitates according to the set number of bootstraps the original prediction
"""
def __init__(self, data: AbstractDataPreparation, number_of_bootstraps: int = 10,
bootstrap_dimension: str = "variables"):
"""
Create iterable class to be ready to iter.
:param data: a data generator object to get data / history
:param number_of_bootstraps: the number of bootstrap realisations
"""
self.data = data
self.number_of_bootstraps = number_of_bootstraps
self.bootstrap_dimension = bootstrap_dimension
def __iter__(self):
return BootstrapIterator(self)
def __len__(self):
return len(self.bootstraps())
    def bootstraps(self):
        # list all (stream index, variable name) combinations that can be bootstrapped,
        # e.g. [(0, 'o3'), (0, 'temp'), (1, 'o3'), ...] for multiple input streams
        l = []
        for i, x in enumerate(self.data.get_X(as_numpy=False)):
            l.append(list(map(lambda y: (i, y), x.indexes['variables'])))
        return list(chain(*l))
def get_orig_prediction(self, path: str, file_name: str, prediction_name: str = "CNN") -> np.ndarray:
"""
Repeat predictions from given file(_name) in path by the number of boots.
:param path: path to file
:param file_name: file name
:param prediction_name: name of the prediction to select from loaded file (default CNN)
:return: repeated predictions
"""
file = os.path.join(path, file_name)
prediction = xr.open_dataarray(file).sel(type=prediction_name).squeeze()
vals = np.tile(prediction.data, (self.number_of_bootstraps, 1))
return vals[~np.isnan(vals).any(axis=1), :]
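# A sketch of the intended usage of BootStraps (`data` is a hypothetical data
# handler instance following the AbstractDataPreparation interface):
#
#     boots = BootStraps(data, number_of_bootstraps=100)
#     for X, Y, (index, dimension) in boots:
#         ...  # X holds realisations with variable `dimension` of stream `index` shuffled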
__author__ = 'Lukas Leufen'
__date__ = '2020-07-17'
from mlair.helpers import to_list
from mlair.data_handler.station_preparation import StationPrep
from mlair.data_handler.advanced_data_handler import DefaultDataPreparation
import os
from typing import Union, List
number = Union[float, int]
num_or_list = Union[number, List[number]]
class DataPreparationNeighbors(DefaultDataPreparation):
def __init__(self, id_class, data_path, neighbors=None, min_length=0,
extreme_values: num_or_list = None, extremes_on_right_tail_only: bool = False):
self.neighbors = to_list(neighbors) if neighbors is not None else []
super().__init__(id_class, data_path, min_length=min_length, extreme_values=extreme_values,
extremes_on_right_tail_only=extremes_on_right_tail_only)
    @classmethod
    def build(cls, station, **kwargs):
        sp_keys = {k: kwargs[k] for k in cls._requirements if k in kwargs}
        sp = StationPrep(station, **sp_keys)
        n_list = []
        for neighbor in kwargs.get("neighbors", []):
            n_list.append(StationPrep(neighbor, **sp_keys))
        kwargs["neighbors"] = n_list if len(n_list) > 0 else None
        dp_args = {k: kwargs[k] for k in cls.own_args("id_class") if k in kwargs}
        return cls(sp, **dp_args)
def _create_collection(self):
return [self.id_class] + self.neighbors
    def get_coordinates(self, include_neighbors=False):
        neighbors = list(map(lambda n: n.get_coordinates(), self.neighbors)) if include_neighbors is True else []
        # concatenate instead of append, as list.append returns None
        return [super(DataPreparationNeighbors, self).get_coordinates()] + neighbors
if __name__ == "__main__":
a = DataPreparationNeighbors
requirements = a.requirements()
kwargs = {"path": os.path.join(os.path.dirname(os.path.abspath(__file__)), "testdata"),
"station_type": None,
"network": 'UBA',
"sampling": 'daily',
"target_dim": 'variables',
"target_var": 'o3',
"time_dim": 'datetime',
"window_history_size": 7,
"window_lead_time": 3,
"neighbors": ["DEBW034"],
"data_path": os.path.join(os.path.dirname(os.path.abspath(__file__)), "testdata"),
"statistics_per_var": {'o3': 'dma8eu', 'temp': 'maximum'},
"transformation": None,}
a_inst = a.build("DEBW011", **kwargs)
print(a_inst)
__author__ = 'Lukas Leufen'
__date__ = '2020-07-07'
from collections.abc import Iterator, Iterable
import keras
import numpy as np
import math
import os
import shutil
import pickle
from typing import Tuple, List
class StandardIterator(Iterator):
_position: int = None
def __init__(self, collection: list):
assert isinstance(collection, list)
self._collection = collection
self._position = 0
def __next__(self):
"""Return next element or stop iteration."""
try:
value = self._collection[self._position]