# Custom Dataset

In this example, we will load a dataset from `scikit-learn` and use it to create a custom `Dataset` object in _Olympus_.

In [1]:
import pandas as pd
import numpy as np
from olympus import Dataset

In [2]:
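# load the Boston housing dataset from sklearn
# (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires scikit-learn < 1.2)
from sklearn.datasets import load_boston
boston = load_boston()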

In [3]:
# concatenate the features and targets into a single array and use it to create a pandas DataFrame
data = np.c_[boston['data'], boston['target']]
columns = list(boston['feature_names'])
columns.append('target')

df = pd.DataFrame(data=data, columns=columns)
df

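|     | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |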
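
To make the column-stacking step concrete, here is a minimal sketch on hypothetical toy data (the arrays and the column names `x0`, `x1` are invented for illustration and are not part of the Boston dataset). It shows how `np.c_` appends a 1-D target array as the last column of a 2-D feature array before the result is wrapped in a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 samples, 2 features, 1 target value per sample
features = np.array([[0.1, 10.0],
                     [0.2, 20.0],
                     [0.3, 30.0]])
target = np.array([1.0, 2.0, 3.0])

# np.c_ treats the 1-D target as a column and appends it after the features
stacked = np.c_[features, target]

# Name the columns so that the target can later be selected by name
toy_columns = ['x0', 'x1', 'target']
toy_df = pd.DataFrame(data=stacked, columns=toy_columns)
print(toy_df)
```

Keeping the target in the same DataFrame as the features is what later allows it to be selected by column name.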


In [4]:
# pass the DataFrame as the data argument for Dataset and specify which column is the target variable
dataset = Dataset(data=df, target_ids=['target'])

Now `dataset` is an instance of the _Olympus_ class `Dataset`. However, before we can use it to train a custom `Emulator`, we need to specify the parameter space for this dataset/problem.

In [5]:
from olympus import ParameterSpace, Parameter

# initialise a parameter space object
param_space = ParameterSpace()

# add all features in the dataset as a variable in the parameter space
for feature in dataset.features:
    low = np.min(dataset.data[feature])   # take the min in the data
    high = np.max(dataset.data[feature])  # take the max in the data
    param = Parameter(kind='continuous', name=feature, low=low, high=high)
    param_space.add(param)
    
dataset.set_param_space(param_space)

Note that the code above sets the bounds of each parameter from the minimum and maximum values observed in the dataset. The same result can be obtained with the `infer_param_space` method of `Dataset`, as follows:

In [6]:
dataset.infer_param_space()
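
As an illustration of what min/max inference amounts to, the sketch below computes per-feature bounds from a small DataFrame. It is not the _Olympus_ implementation: the helper `infer_bounds` and the three-row toy frame are hypothetical, and only plain pandas is used.

```python
import pandas as pd

def infer_bounds(frame, feature_names):
    """Return {feature: (low, high)} from the observed min/max of each column."""
    return {
        name: (float(frame[name].min()), float(frame[name].max()))
        for name in feature_names
    }

# Hypothetical toy frame with two features and one target column
toy = pd.DataFrame({
    'RM': [6.575, 6.421, 7.185],
    'AGE': [65.2, 78.9, 61.1],
    'target': [24.0, 21.6, 34.7],
})

# Only the feature columns are passed in; the target gets no bounds
bounds = infer_bounds(toy, ['RM', 'AGE'])
print(bounds)
```

Bounds obtained this way are only as wide as the sampled data, so a region of the parameter space that was never measured falls outside them.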

However, most often you will want these bounds to depend on the details of your problem, in which case you can explicitly specify the bounds for all parameters.
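
One possible way to organise explicit bounds is sketched below in plain Python. The helper `apply_explicit_bounds` is hypothetical and not part of _Olympus_, and all numeric bounds are illustrative. In practice, the resulting `low`/`high` values would be passed to `Parameter` as in the loop shown earlier.

```python
def apply_explicit_bounds(inferred, explicit):
    """Override inferred (low, high) bounds with user-specified ones."""
    merged = dict(inferred)
    for name, (low, high) in explicit.items():
        if name not in merged:
            raise KeyError(f"unknown parameter: {name}")
        if not low < high:
            raise ValueError(f"invalid bounds for {name}: {low} >= {high}")
        data_low, data_high = inferred[name]
        # Design choice of this sketch: explicit bounds must enclose the data
        if low > data_low or high < data_high:
            raise ValueError(f"bounds for {name} exclude observed data")
        merged[name] = (float(low), float(high))
    return merged

# Illustrative bounds, as they might be inferred from the data
inferred_bounds = {'AGE': (2.9, 100.0), 'RM': (3.5, 8.8)}

# AGE is a percentage, so (0, 100) is a natural domain; RM bounds are hypothetical
explicit_bounds = {'AGE': (0.0, 100.0), 'RM': (1.0, 10.0)}

final_bounds = apply_explicit_bounds(inferred_bounds, explicit_bounds)
print(final_bounds)
```

Rejecting bounds that exclude observed samples is a choice made for this sketch; whether that check is appropriate depends on the problem.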

Now we define a small Bayesian neural network and test how well it emulates this dataset. Note that, by default, `Dataset` creates 5 random folds for cross-validation and reserves 20% of the data for testing.
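
The default split described above can be pictured with the following sketch. It is an assumption about the general scheme (hold out a test set, then divide the remaining samples into folds), not the code _Olympus_ actually runs. The function `split_indices`, its arguments, and the seed are hypothetical.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.2, n_folds=5, seed=0):
    """Shuffle sample indices, hold out a test set, and split the rest into folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    test_idx = shuffled[:n_test]
    train_idx = shuffled[n_test:]
    # Each fold serves once as the validation set during cross-validation
    folds = np.array_split(train_idx, n_folds)
    return train_idx, test_idx, folds

# 506 is the number of samples in the Boston housing dataset
train_idx, test_idx, folds = split_indices(506)
print(len(train_idx), len(test_idx), [len(fold) for fold in folds])
```

The test indices are set aside before the folds are formed, so no test sample takes part in cross-validation.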

In [7]:
from olympus import Emulator
from olympus.models import BayesNeuralNet

mymodel = BayesNeuralNet(hidden_depth=2, hidden_nodes=12, hidden_act='leaky_relu', out_act='relu',
                         batch_size=50, reg=0.005, max_epochs=10000)
emulator = Emulator(dataset=dataset, model=mymodel, feature_transform='normalize', target_transform='normalize')
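
The `'normalize'` transforms requested above are assumed here to mean min-max scaling to the interval [0, 1]. The source does not spell this out, so the sketch below should be read as an illustration of that assumption rather than as the _Olympus_ implementation. The helper names and sample values are hypothetical.

```python
import numpy as np

def fit_minmax(values):
    """Return the minimum and maximum used to scale the data."""
    values = np.asarray(values, dtype=float)
    return values.min(axis=0), values.max(axis=0)

def normalize(values, low, high):
    """Map values linearly so that low -> 0 and high -> 1."""
    return (np.asarray(values, dtype=float) - low) / (high - low)

def denormalize(scaled, low, high):
    """Invert normalize, mapping [0, 1] back to the original units."""
    return np.asarray(scaled, dtype=float) * (high - low) + low

# Five illustrative target values
targets = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
low, high = fit_minmax(targets)
scaled = normalize(targets, low, high)
restored = denormalize(scaled, low, high)
print(scaled)
```

A constant column would make `high - low` zero, so a real implementation needs to handle that case separately.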

In [8]:
emulator.train()

[INFO] >>> Training model on 80% of the dataset, testing on 20%...
Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
[INFO]           Epoch       Train R2     Train RMSD        Test R2      Test RMSD
[INFO]               0         -4.011          0.483         -5.083          0.404 *
[INFO]             100         -4.011          0.483         -5.083          0.404
[INFO]             200         -4.011          0.483         -5.083          0.404
[INFO]             300         -4.011          0.483         -5.083          0.404
[INFO]             400         -4.011          0.483         -5.083          0.404
[INFO]             500         -4.011          0.483         -5.083          0.404
[INFO]             600         -4.011          0.483         -5.083          0.404
[INFO]

{'train_r2': 0.932159417784379,
 'test_r2': 0.9044223215989219,
 'train_rmsd': 0.05624565811984018,
 'test_rmsd': 0.05059158705037693}
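
For reference, the two reported metrics can be computed as in the sketch below, which uses the standard definitions of the coefficient of determination (R2) and the root-mean-square deviation (RMSD). That _Olympus_ uses exactly these definitions is an assumption. The function names and toy values are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmsd(y_true, y_pred):
    """Root-mean-square deviation between predictions and observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8])  # close to the observations
bad_pred = np.zeros(4)                      # worse than predicting the mean

print(r2_score(y_true, good_pred), rmsd(y_true, good_pred))
print(r2_score(y_true, bad_pred), rmsd(y_true, bad_pred))
```

Under this definition, R2 is negative whenever the model predicts worse than a constant equal to the mean of the observations, which is how negative values like those in the early training log can arise.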

Suppose you would now like to share this dataset with the community by uploading it to the _Olympus Datasets_. You can do this with the `upload` command-line tool in _Olympus_, as described in the documentation. First, however, you need to prepare the dataset in the expected format. An easy way to do this is to use the `to_disk` method available to `Dataset` objects.

In [9]:
# save dataset to disk
dataset.to_disk('custom_dataset')

In [10]:
!ls custom_dataset/

config.json     data.csv        description.txt
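
Before uploading, it can be useful to confirm that the exported folder contains the three files listed above. The following sketch does this with the standard library only. The helper `missing_files` is hypothetical and not part of _Olympus_, and it checks only that the files exist, not their contents.

```python
from pathlib import Path

# File names taken from the directory listing above
EXPECTED_FILES = ('config.json', 'data.csv', 'description.txt')

def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from the folder."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]

if __name__ == '__main__':
    absent = missing_files('custom_dataset')
    if absent:
        print('Missing files:', ', '.join(absent))
    else:
        print('All expected files are present.')
```

An empty list means that all three files were found in the folder.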
