Datasets
Olympus provides various datasets from across the natural sciences that form the basis of realistic and challenging benchmarks for optimization algorithms. Models trained on these datasets provide Emulators that are used to simulate an experimental campaign.
While you can load pre-trained Emulators based on these datasets, you can also load the datasets themselves with the Dataset class:
from olympus.datasets import Dataset
dataset = Dataset(kind='snar')
The datasets currently available are the following:
No. | Dataset | Kind Keyword | Objective | Goal |
---|---|---|---|---|
1 | | alkox | reaction rate | Max |
2 | | colors_bob | green-ness | Min |
3 | | colors_n9 | green-ness | Min |
4 | | fullerenes | yield of X1+X2 | Max |
5 | | hplc | peak area | Max |
6 | | photo_pce10 | stability | Min |
7 | | photo_wf3 | stability | Min |
8 | | snar | e_factor | Min |
9 | | benzylation | e_factor | Min |
10 | | suzuki | yield | Max |
In addition to the Olympus datasets, you can load your own custom ones:
from olympus.datasets import Dataset
import pandas as pd
mydata = pd.read_csv('mydata.csv')
dataset = Dataset(data=mydata)
Dataset Class
class olympus.datasets.Dataset(kind=None, data=None, columns=None, target_ids=None, test_frac=0.2, num_folds=5, random_seed=None)
A Dataset object stores the data of a dataset by wrapping a pandas.DataFrame in its data attribute, provides additional information on the dataset, and provides convenience methods to access features and targets as well as to generate training/validation/test splits.
Parameters
kind (str) – kind of the Olympus dataset to load.
data (array) – custom dataset. Same input as for pandas.DataFrame.
columns (list) – column names. Same input as for pandas.DataFrame.
target_ids (list) – list of column indices, or names if provided, that identify the targets for the predictions.
test_frac (float) – fraction of the data to be used as test set.
num_folds (int) – number of cross-validation folds the training set will be split into.
random_seed (int) – random seed for numpy. Setting a seed makes the random splits reproducible.
Methods
dataset_info() – Provide summary info about dataset.
set_param_space(param_space) – Define the parameter space of the dataset.
get_cv_fold(fold) – Get the data for a specific cross-validation fold.
create_train_validate_test_splits(test_frac=0.2, num_folds=5, test_indices=None)
Parameters
test_frac (float) – fraction of the data to be used as test set.
num_folds (int) – number of cross-validation folds the training set will be split into.
test_indices (array) – array with the indices of the samples to be used as test set.
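The roles of test_frac, num_folds, and test_indices can be illustrated with a minimal sketch. This is not the Olympus implementation: split_indices is a hypothetical helper written in plain Python, and the rounding rule and the way indices are dealt into folds are assumptions made for illustration only.

```python
import random


def split_indices(num_samples, test_frac=0.2, num_folds=5,
                  test_indices=None, random_seed=None):
    """Hypothetical helper (not part of Olympus).

    Returns (test, folds), where folds is a list of
    (train_indices, validation_indices) pairs.
    """
    rng = random.Random(random_seed)  # seeded generator makes splits reproducible
    indices = list(range(num_samples))

    if test_indices is None:
        # Assumption: the test set is a random sample of size test_frac * N
        shuffled = indices[:]
        rng.shuffle(shuffled)
        num_test = int(round(test_frac * num_samples))
        test = sorted(shuffled[:num_test])
    else:
        # Explicit indices take precedence over test_frac
        test = sorted(test_indices)

    test_set = set(test)
    remaining = [i for i in indices if i not in test_set]
    rng.shuffle(remaining)

    # Deal the remaining samples into num_folds groups of near-equal size
    groups = [remaining[k::num_folds] for k in range(num_folds)]

    folds = []
    for k in range(num_folds):
        validate = sorted(groups[k])
        train = sorted(i for j, group in enumerate(groups) if j != k for i in group)
        folds.append((train, validate))
    return test, folds


test, folds = split_indices(num_samples=50, test_frac=0.2, num_folds=5, random_seed=42)
print(len(test), [len(validate) for _, validate in folds])
```

In this sketch every non-test sample appears in exactly one validation group, and reusing the same random_seed reproduces the same split.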
dataset_info()
Provide summary info about dataset.
get_cv_fold(fold)
Get the data for a specific cross-validation fold.
Parameters
fold (int) – fold id.
Returns
data (DataFrame) – data for the chosen fold.
infer_param_space()
Guess the parameter space from the dataset. The range for each parameter will be defined by the minimum and maximum values of that variable in the dataset. All variables will be assumed not to be periodic.
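The min/max rule described above can be sketched in a few lines. This is only an illustration of the idea, not the Olympus code: infer_ranges, the dictionary layout, and the example variable names and values are all hypothetical, and the sketch uses plain Python lists rather than a pandas.DataFrame so that it runs without dependencies.

```python
def infer_ranges(columns):
    """Hypothetical helper (not part of Olympus).

    Maps each variable name to the range observed in the data and
    marks every variable as non-periodic.
    """
    return {
        name: {"low": min(values), "high": max(values), "periodic": False}
        for name, values in columns.items()
    }


# Made-up feature columns, for illustration only
features = {
    "temperature": [30.0, 45.5, 60.0, 52.3],
    "residence_time": [0.5, 2.0, 1.2, 1.8],
}

ranges = infer_ranges(features)
for name, bounds in ranges.items():
    print(name, bounds)
```

Ranges inferred this way cover only the region already sampled, so a parameter space defined explicitly with set_param_space may be preferable when the true bounds are known.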
set_param_space(param_space)
Define the parameter space of the dataset.
Parameters
param_space (ParameterSpace) – ParameterSpace object with information about all variables in the dataset.
to_disk(folder='custom_dataset')
Save the dataset to disk in the format expected by Olympus for its own datasets. This can be useful if you plan to upload the dataset to the community datasets available online.
Parameters
folder (str) – folder in which to save the dataset files.