Data
Data Objects, Data Blocks and Dataset
Dataset
The dataset has two parts: a `Dataset` object and, in separate folders, the files containing the data (samples).
Each dataset requires its own unique folder, which is automatically created with the name of the dataset inside the specified root path. The samples are stored in fixed subfolders.
The `Dataset` object holds the data objects and data blocks that define the data.
It also provides methods to, for example:

- generate random samples,
- import data from another dataset, or
- get random samples without incorporating them into the dataset.
Data Objects
All variables in the project (both design parameters and performance attributes) need to be defined as one of the following types:
- `DataInt` - for integers
- `DataReal` - for real/float values
- `DataCategorical` - for categorical variables such as labels, e.g. “cat”, “dog”
- `DataOrdinal` - for ordered categories, e.g. “small”, “medium”, “big”
- `DataBool` - for True/False
For every variable, create a corresponding data object with a user-defined name, indicating its dimensionality. Variable names must be unique and cannot be “error” or “uid”, which are reserved for other purposes.
Attention
These data objects do not store sample values - they only define the variables.
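The naming constraints above can be sketched as a simple check. This is a hypothetical helper for illustration, not part of the toolkit:

```python
RESERVED_NAMES = {"error", "uid"}  # reserved for the failure flag and sample id

def validate_names(names):
    """Check that variable names are unique and avoid reserved words."""
    seen = set()
    for name in names:
        if name in RESERVED_NAMES:
            raise ValueError(f"'{name}' is reserved and cannot be used")
        if name in seen:
            raise ValueError(f"duplicate variable name: '{name}'")
        seen.add(name)
    return True
```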
For variables that are design parameters, the user also needs to specify their domain, for example:
- `Interval(min_val, max_val)` - for real and integer variables
- `Options(<list of items>)` - for integer, ordinal and categorical variables
For performance attributes, the domains are usually not known beforehand. They will be automatically updated from samples in the dataset.
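For illustration only, deriving a domain from observed samples might look like the following sketch. The helper and its return format are hypothetical, not the toolkit's implementation:

```python
def derive_domain(values, kind):
    """Derive a domain from observed performance-attribute values.

    kind 'real' or 'int' -> an Interval-like (min, max) pair;
    anything else -> an Options-like sorted set of observed values.
    """
    if kind in ("real", "int"):
        return ("Interval", min(values), max(values))
    return ("Options", sorted(set(values)))
```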
Data Blocks
Data objects are collected into data blocks according to their role in the dataset or in the neural network model.
The dataset comprises the following data blocks:

- `DesignParameters`
- `PerformanceAttributes`

The neural network model requires two data blocks:

- `InputML` - for variables for which the generative model will generate new values, conditioned on the user’s request
- `OutputML` - for variables used in the user’s request
In many design scenarios, `DesignParameters` contains the same data objects as `InputML`, and `PerformanceAttributes` the same as `OutputML`. However, the `InputML` and `OutputML` blocks can be defined from any two (disjoint) subsets of data objects.
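The disjointness constraint can be sketched as a simple check. The helper name is hypothetical, not toolkit API:

```python
def blocks_are_valid(input_ml, output_ml):
    """InputML and OutputML may be any two subsets of the data objects,
    as long as no data object appears in both blocks (disjointness)."""
    return set(input_ml).isdisjoint(set(output_ml))

# Typical case: design parameters as inputs, performance attributes as outputs
blocks_are_valid(["a", "b"], ["c", "d"])  # True
```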
Create a dataset
If you don’t have a dataset yet for your project, you can create one from scratch. First, set up a new dataset, and then generate the data samples.
Set up a dataset
Here is a dummy example of how to define a dataset, data objects and data blocks:
```python
# Data objects - design parameters
a = DataReal(name="a", dim=1, domain=Interval(0, 10))
b = DataCategorical(name="b", dim=5, domain=Options(["red", "green", "blue"]))

# Data objects - performance attributes
c = DataInt(name="c", dim=3, domain=Options([1, 2, 3, 4, 5]))
d = DataBool(name="d", dim=10)

# Data blocks
dp = DesignParameters(name="DP", dobj_list=[a, b])
pa = PerformanceAttributes(name="PA", dobj_list=[c, d])

# Dataset
dataset = Dataset(root_path=ROOT,
                  name="my_dataset",
                  design_par=dp,
                  perf_attributes=pa)
```
Generate data samples
To generate data samples, you need to perform two steps:
1. sampling: generate random values for the variables declared as design parameters
2. analysis: feed these values through the parametric model to obtain the values of the performance attributes
The toolkit provides tools to easily perform both steps, with multiple options to customize the process through custom callback functions.
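Conceptually, the two steps amount to the following loop. This is a plain-Python sketch, where `analysis_fn` stands in for your parametric model; none of these names are toolkit functions:

```python
import random

def generate_dataset(n_samples, analysis_fn, seed=0):
    """Sample random design parameters, then compute performance attributes."""
    rng = random.Random(seed)
    # Step 1: sampling - random values for each design parameter (here just "a")
    design_parameters = [{"a": rng.uniform(0, 10)} for _ in range(n_samples)]
    # Step 2: analysis - feed each sample through the parametric model
    performance_attributes = [analysis_fn(dp) for dp in design_parameters]
    return design_parameters, performance_attributes

# Example with a trivial "parametric model"
dps, pas = generate_dataset(5, lambda dp: {"c": dp["a"] ** 2})
```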
Sampling
`dataset.sampling()` is an easy way to generate a batch of random values for design parameters, and automatically save the data to files.

```python
dataset.sampling(n_samples=10000,
                 samples_perfile=100,
                 callbacks_class=None,
                 engine="random")
```
This will generate 10000 samples and save them in files with 100 samples each, in the folder `design_parameters` inside the dataset directory. For more advanced sampling options, see section Sampler.
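The number of resulting files follows directly from these two parameters:

```python
import math

n_samples = 10000
samples_perfile = 100

# Samples are split across files of samples_perfile each;
# a final, partially filled file is created if they do not divide evenly.
n_files = math.ceil(n_samples / samples_perfile)  # -> 100
```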
Analysis
If you can define your “parametric design model” as a Python function, you can use it as a callback to `dataset.analysis()`.
It will automatically retrieve the pre-generated design parameters, calculate the performance attributes for each sample, and automatically save them to files.

```python
from aixd.data.custom_callbacks import AnalysisCallback

analyzer_class = AnalysisCallback('Analysis function',
                                  func_callback=[analysis_pipeline],
                                  dataset=dataset)

dataset.analysis(analyzer=analyzer_class)
```
The example will calculate performance attributes for all samples by passing the design parameter values to the analysis callback function. The performance attribute data will be saved in corresponding files in the subfolder `performance_attributes`. If the analysis task is computationally demanding, you can run it in batches by indicating the number of files of design parameters to process with the parameter `n_files`. Each call then automatically runs only for the next batch of not yet processed samples.
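Conceptually, a failed analysis does not abort the run; each sample is flagged roughly as in this simplified sketch (plain Python, not the toolkit's internals):

```python
def analyze_batch(samples, analysis_fn):
    """Run the analysis callback per sample, flagging failures in an 'error' column."""
    rows = []
    for uid, dp in enumerate(samples):
        try:
            pa = analysis_fn(dp)
            rows.append({"uid": uid, "error": False, **pa})
        except Exception:
            rows.append({"uid": uid, "error": True})
    return rows

def sqrt_model(dp):
    """A toy parametric model that fails for negative inputs."""
    if dp["a"] < 0:
        raise ValueError("negative input")
    return {"c": dp["a"] ** 0.5}

rows = analyze_batch([{"a": 4.0}, {"a": -1.0}], sqrt_model)
```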
If the analysis - or the parametric design model - is carried out in a separate pipeline (e.g. in a CAD software such as Rhino/Grasshopper), it is possible to calculate the performance attributes externally. To add them to the dataset, you can use the methods to import data explained in Importing an external/custom dataset.
If you resume/continue …
If you want to add more samples to your dataset, simply repeat the sampling and analysis campaign. The steps of sampling and analysis can be run any number of times.
Data contained in the dataset object
Once the data is loaded using the method `dataset.load()`, it is additionally stored in the `Dataset` object. There are several ways to access it.
`dataset.data`

This returns all the data contained in the `Dataset` object.
The data is returned as a dictionary with the keys `'design_parameters'` and `'performance_attributes'`, each pointing to a pandas DataFrame where rows correspond to the samples and columns to the data values.
These data frames are flattened, which means that values of multi-dimensional data objects are stored as separate columns.
These columns are named automatically by adding a suffix to the original data object’s name (e.g. `_0`, `_1`, etc.) and can be retrieved with:
```python
dataset.design_par.columns_df
dataset.perf_attributes.columns_df
```
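The suffix convention can be illustrated with a small sketch (hypothetical helper, not toolkit API):

```python
def flatten_columns(name, dim):
    """Flattened column names for a multi-dimensional data object:
    one column per dimension, suffixed _0, _1, ..."""
    return [f"{name}_{i}" for i in range(dim)]

flatten_columns("d", 3)  # -> ['d_0', 'd_1', 'd_2']
```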
Besides, the `DesignParameters` DataFrame has an additional column, `uid`, which identifies each sample.
The `PerformanceAttributes` DataFrame also contains the additional column `uid`, plus a column called `error`,
which is used to flag samples for which the calculation of performance attributes has failed.
`dataset.data_mats`

This similarly returns a dictionary, but in this case the data is formatted as numpy arrays instead of data frames.
Importing a dataset
Importing an existing AIXD dataset
If you already have a dataset created earlier with the AIXD toolkit, simply load it using the pickled dataset object file.
The sample files are expected to reside in the same directory as the dataset object file, in the respective subfolders.
```python
from aixd.data.dataset import Dataset

# recreate the Dataset object from the pickled file:
dataset = Dataset.from_dataset_object(filepath=path_to_datasetobject_file)

# load the data
dataset.load()
```
The method `load()` takes care of loading the data into memory so that it is accessible from the `Dataset` object.
By default, all samples, i.e. tuples of design parameters and performance attributes, are loaded.
Through the parameter `n_samples` you can specify a smaller number of samples to load, in case you prefer to explore the data, or train the model, on just a subset.
Importing an external/custom dataset
If you already have data samples of your design or project, but in a different format, you can import them to make them compatible with the AIXD workflow.
First, set up a dataset to define the data objects, data blocks and a dataset object in a way that corresponds to the type, dimensionality and order of your data. Then use:
```python
dataset.import_data_from_df(data=dataframe_with_my_samples,
                            samples_perfile=100)
```

or

```python
dataset.import_data_from_csv(file_path=path_to_csv_file)
```
This will reorganize your data and save it in the correct format and locations. In both cases, the column names are used to match the imported data with the data objects defined in the dataset. For more details and advanced options, see API Reference.
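The column matching can be pictured as a simple name lookup (simplified sketch, not the actual implementation):

```python
def match_columns(imported_columns, expected_columns):
    """Split imported column names into those that match the dataset's
    expected (flattened) columns and those that could not be matched."""
    expected = set(expected_columns)
    matched = [c for c in imported_columns if c in expected]
    unmatched = [c for c in imported_columns if c not in expected]
    return matched, unmatched
```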
Data analysis & exploration
The data itself can be a valuable source of insights about the design task. See the section about the plotter, which includes methods for data exploration and plotting.