Datasets

Olympus provides various datasets from across the natural sciences that form the basis of realistic and challenging benchmarks for optimization algorithms. Models trained on these datasets provide Emulators that are used to simulate an experimental campaign.

Besides loading pre-trained Emulators based on these datasets, you can load the datasets themselves with the Dataset class:

from olympus.datasets import Dataset
dataset = Dataset(kind='snar')

The datasets currently available are the following:

No.  Dataset                       Kind Keyword  Objective       Goal
---  ----------------------------  ------------  --------------  ----
1    Alkoxylation                  alkox         reaction rate   Max
2    Colors Bob                    colors_bob    green-ness      Min
3    Colors N9                     colors_n9     green-ness      Min
4    Buckminsterfullerene adducts  fullerenes    yield of X1+X2  Max
5    HPLC                          hplc          peak area       Max
6    Photobleaching PCE10          photo_pce10   stability       Min
7    Photobleaching WF3            photo_wf3     stability       Min
8    SnAr reaction                 snar          e_factor        Min
9    N-benzylation                 benzylation   e_factor        Min
10   Suzuki reaction               suzuki        yield           Max
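The Goal column states whether the objective of each dataset is to be maximized or minimized. As a minimal sketch of how that distinction might be used when comparing measurements, the snippet below copies the goals of three datasets from the table. The GOALS dictionary and the best_value helper are hypothetical names chosen for illustration and are not part of Olympus.

```python
# Goals copied from the table above for three of the datasets.
GOALS = {"alkox": "max", "snar": "min", "suzuki": "max"}


def best_value(kind, values):
    """Return the best objective value for a dataset kind, given its goal.

    Hypothetical helper for illustration; not part of Olympus.
    """
    return max(values) if GOALS[kind] == "max" else min(values)


# For 'snar' the e_factor is minimized, so the smallest value is the best one.
best_snar = best_value("snar", [3.2, 1.5, 2.8])
```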

In addition to the Olympus datasets, you can load your own custom ones:

from olympus.datasets import Dataset
import pandas as pd

mydata = pd.read_csv('mydata.csv')
dataset = Dataset(data=mydata)

Dataset Class

class olympus.datasets.Dataset(kind=None, data=None, columns=None, target_ids=None, test_frac=0.2, num_folds=5, random_seed=None)

A Dataset object stores the data of a dataset by wrapping a pandas.DataFrame in its data attribute, provides additional information on the dataset, and provides convenience methods to access features and targets as well as to generate training/validation/test splits.

Parameters
  • kind (str) – kind of the Olympus dataset to load.

  • data (array) – custom dataset. Same input as for pandas.DataFrame.

  • columns (list) – column names. Same input as for pandas.DataFrame.

  • target_ids (list) – list of column indices, or names if provided, that identify the targets for the predictions.

  • test_frac (float) – fraction of the data to be used as test set.

  • num_folds (int) – number of cross validation folds the training set will be split into.

  • random_seed (int) – random seed for numpy. Setting a seed makes the random splits reproducible.
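To illustrate how target_ids can identify target columns either by index or by name, here is a minimal pure-Python sketch. It is not the Olympus implementation: the split_features_targets helper and the example column names are assumptions made for illustration only.

```python
def split_features_targets(columns, rows, target_ids):
    """Split rows into feature and target parts.

    target_ids may hold column indices (int) or column names (str).
    Hypothetical helper for illustration; not part of Olympus.
    """
    # Resolve every target id to a column index.
    target_idx = [t if isinstance(t, int) else columns.index(t) for t in target_ids]
    # Every column that is not a target is treated as a feature.
    feature_idx = [i for i in range(len(columns)) if i not in target_idx]
    features = [[row[i] for i in feature_idx] for row in rows]
    targets = [[row[i] for i in target_idx] for row in rows]
    return features, targets


# Made-up example data with one target column.
columns = ["temperature", "flow_rate", "yield"]
rows = [[30.0, 0.5, 0.42], [60.0, 1.0, 0.77]]

by_name = split_features_targets(columns, rows, ["yield"])
by_index = split_features_targets(columns, rows, [2])
```

In this sketch, both calls select the same column, so by_name and by_index hold identical features and targets.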

Methods

dataset_info()

Provide summary info about dataset.

set_param_space(param_space)

Define the parameter space of the dataset.

get_cv_fold(fold)

Get the data for a specific cross-validation fold.

create_train_validate_test_splits(test_frac=0.2, num_folds=5, test_indices=None)

Parameters
  • test_frac (float) – fraction of the data to be used as test set.

  • num_folds (int) – number of cross validation folds the training set will be split into.

  • test_indices (array) – Array with the indices of the samples to be used as test set.
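The interplay of test_frac, num_folds, test_indices and a random seed can be sketched in plain Python as follows. This is an illustration of the general technique (hold out a test set, then split the remainder into cross-validation folds), under the assumption that folds partition the non-test samples; the actual Olympus splitting logic may differ, and the function name and return layout here are hypothetical.

```python
import random


def train_validate_test_splits(num_samples, test_frac=0.2, num_folds=5,
                               test_indices=None, random_seed=None):
    """Return (test_indices, folds); each fold is a (train, validate) pair.

    Illustrative only; the actual Olympus splitting logic may differ.
    """
    rng = random.Random(random_seed)  # seeded generator gives reproducible splits
    indices = list(range(num_samples))

    if test_indices is None:
        # Draw the test set at random when it is not given explicitly.
        num_test = int(round(test_frac * num_samples))
        test_indices = sorted(rng.sample(indices, num_test))

    test_set = set(test_indices)
    remaining = [i for i in indices if i not in test_set]
    rng.shuffle(remaining)

    folds = []
    for k in range(num_folds):
        # Every num_folds-th shuffled sample goes to the validation set of fold k.
        validate = sorted(remaining[k::num_folds])
        held_out = set(validate)
        train = sorted(i for i in remaining if i not in held_out)
        folds.append((train, validate))
    return list(test_indices), folds


test, folds = train_validate_test_splits(50, test_frac=0.2, num_folds=5,
                                         random_seed=42)
```

With 50 samples, this sketch holds out 10 for testing and splits the remaining 40 into 5 folds, each with 32 training and 8 validation samples. Passing the same random_seed again reproduces the same split.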

dataset_info()

Provide summary info about dataset.

get_cv_fold(fold)

Get the data for a specific cross-validation fold.

Parameters

fold (int) – fold id.

Returns

data for the chosen fold.

Return type

data (DataFrame)

infer_param_space()

Guess the parameter space from the dataset. The range for each parameter will be defined based on the minimum and maximum values of that variable in the dataset. All variables will be assumed not to be periodic.
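The min/max rule described here can be sketched in a few lines of plain Python. The infer_ranges helper and the example columns below are hypothetical and only illustrate the idea; they do not reproduce the Olympus ParameterSpace API.

```python
def infer_ranges(columns, rows):
    """Map each column name to the (low, high) range observed in rows.

    Hypothetical helper for illustration; not the Olympus ParameterSpace API.
    """
    ranges = {}
    for i, name in enumerate(columns):
        values = [row[i] for row in rows]
        # The inferred range spans the smallest and largest observed value.
        ranges[name] = (min(values), max(values))
    return ranges


# Made-up example data with two variables.
columns = ["temperature", "flow_rate"]
rows = [[30.0, 0.5], [60.0, 1.0], [45.0, 0.2]]
ranges = infer_ranges(columns, rows)
```

A range inferred this way covers only the values actually observed, so it may be narrower than the range that is experimentally accessible.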

set_param_space(param_space)

Define the parameter space of the dataset.

Parameters

param_space (ParameterSpace) – ParameterSpace object with information about all variables in the dataset.

to_disk(folder='custom_dataset')

Save the dataset to disk in the format expected by Olympus for its own datasets. This can be useful if you plan to upload the dataset to the community datasets available online.

Parameters

folder (str) – Folder in which to save the dataset files.