Preprocessing
In-built preprocessing functions to handle graph data.
apply_scaler(dataset, method='zero_mean', target='node')
Applies the selected scaling method to the provided dataset. After scaling, the used scaler instance is
accessible through the StaticGraphDataset instance as either node_scaler or edge_scaler depending on the
given scaling target.
The dataset must first be split with one of the create_train_test_split methods so that scaling is applied correctly: the scaler is fitted on the training data only and then applied to both the training and test data.
Scaling is done per feature and is applied to both the inputs and the labels. For time-series data, this means that each feature of every graph in the input sequence is scaled independently to avoid weighting repetitions in the sequence too much.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `StaticGraphDataset` | Dataset to be scaled | required |
| `method` | `str` | Scaling method to be applied. Either | `'zero_mean'` |
| `target` | `str` | Either | `'node'` |
Returns:
| Type | Description |
|---|---|
| `Tuple[GraphList, GraphList] \| Tuple[GraphList, GraphList, GraphList, GraphList]` | A 4-tuple of scaled data if the dataset consists of time-series data; otherwise a 2-tuple of the scaled train and test data. |
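
The fit-on-train, apply-to-both rule described above can be illustrated with a minimal sketch in plain Python. It does not use the library itself: the helper names `fit_zero_mean_scaler` and `transform` are hypothetical, graphs are reduced to plain rows of feature values, and it assumes that `'zero_mean'` means subtracting the mean and dividing by the standard deviation, which the text above does not confirm.

```python
from statistics import fmean, pstdev


def fit_zero_mean_scaler(train_rows):
    """Compute per-feature mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, scaler):
    """Apply an already fitted scaler to any split (train or test)."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data: two features per row.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

scaler = fit_zero_mean_scaler(train)      # statistics come from train only
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)     # test reuses the train statistics
```

Because the test split reuses the training statistics, its scaled values are not guaranteed to have zero mean, which is the intended behaviour when avoiding information leakage.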
Source code in graphs_on_grids/preprocessing/preprocessing.py
create_train_test_split(dataset, train_size=0.8, random_state=None, shuffle=True)
Creates a train-test split from an instance of gog.structure.graph.StaticGraphDataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `StaticGraphDataset` | Dataset to be split | required |
| `train_size` | | Relative size of the training set as a value between 0 and 1. The test set contains the remaining instances. | `0.8` |
| `random_state` | | Sets the random state for shuffling | `None` |
| `shuffle` | | Whether to shuffle the data before splitting. | `True` |
Returns:
| Type | Description |
|---|---|
| `Tuple[GraphList, GraphList]` | Tuple of |
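
How `train_size`, `random_state` and `shuffle` interact can be sketched with the standard library alone. This is an illustration under assumptions, not the library's implementation: `split_train_test` is a hypothetical helper, the graphs are stand-in strings, and rounding the cut point is a choice made here.

```python
import random


def split_train_test(items, train_size=0.8, random_state=None, shuffle=True):
    """Split a list into a train part and a test part."""
    indices = list(range(len(items)))
    if shuffle:
        # A seeded generator makes the shuffle reproducible.
        random.Random(random_state).shuffle(indices)
    cut = round(len(items) * train_size)
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test


# Hypothetical stand-ins for graph instances.
graphs = [f"graph_{i}" for i in range(10)]

train, test = split_train_test(graphs, train_size=0.8, random_state=42)
train_again, test_again = split_train_test(graphs, train_size=0.8, random_state=42)
ordered_train, ordered_test = split_train_test(graphs, shuffle=False)
```

With a fixed `random_state` the two shuffled calls yield identical splits, and with `shuffle=False` the first 80 percent of the instances form the training set in their original order.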
Source code in graphs_on_grids/preprocessing/preprocessing.py
create_train_test_split_windowed(dataset, window_size, len_labels=1, step=1, start=0, train_size=0.8, random_state=None, shuffle=False)
Creates a windowed dataset from the provided StaticGraphDataset instance. After that, a train-test split
is created from the windowed data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `StaticGraphDataset` | Dataset to be windowed and split | required |
| `window_size` | `int` | Sequence length to be provided as input to the model | required |
| `len_labels` | `int` | The output sequence length to be predicted by the model | `1` |
| `step` | `int` | Step size of the windowing algorithm. Describes how much the window start is shifted after creating a window instance. If set to | `1` |
| `start` | `int` | Start index for windowing | `0` |
| `train_size` | `float` | Relative size of the training set as a value between 0 and 1. The test set contains the remaining instances. | `0.8` |
| `random_state` | `int` | Sets the random state for shuffling | `None` |
| `shuffle` | `bool` | Whether to shuffle the data before splitting. | `False` |
Returns:
| Type | Description |
|---|---|
| `Tuple[GraphList, GraphList, GraphList, GraphList]` | Tuple of |
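
The roles of `window_size`, `len_labels`, `step` and `start` can be made concrete with a small sketch. It is not taken from the library: `make_windows` is a hypothetical helper, integers stand in for graph snapshots, and it assumes that the label sequence starts directly after each input window.

```python
def make_windows(sequence, window_size, len_labels=1, step=1, start=0):
    """Slice a sequence into (input window, label window) pairs."""
    inputs, labels = [], []
    i = start
    # Stop once a full input window plus its labels no longer fits.
    while i + window_size + len_labels <= len(sequence):
        inputs.append(sequence[i : i + window_size])
        labels.append(sequence[i + window_size : i + window_size + len_labels])
        i += step  # shift the window start by the step size
    return inputs, labels


# Hypothetical stand-ins for six consecutive graph snapshots.
snapshots = list(range(6))

X, y = make_windows(snapshots, window_size=3, len_labels=1, step=1)
X_sparse, y_sparse = make_windows(snapshots, window_size=3, len_labels=1, step=2)
```

With `step=1` consecutive windows overlap in all but one element; a larger step reduces that overlap and the number of resulting instances. The windowed pairs would then be passed on to the train-test split.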
Source code in graphs_on_grids/preprocessing/preprocessing.py
create_validation_set(X, y, validation_size=0.2)
Creates a validation set from provided data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `GraphList` | Training data to be split | required |
| `y` | `GraphList` | Labels of training data to be split | required |
| `validation_size` | `float` | Relative size of the validation set. The training set keeps the remaining share of the original training set. | `0.2` |
Returns:
| Type | Description |
|---|---|
| `Tuple[GraphList, GraphList, GraphList, GraphList]` | A 4-tuple of the training and validation set inputs and targets. |
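
A rough sketch of carving a validation set out of existing training data is shown below. It makes several assumptions that the documentation above does not confirm: the helper name `split_validation` is hypothetical, the validation instances are taken from the end without shuffling, and the return order is (training inputs, validation inputs, training targets, validation targets).

```python
def split_validation(X, y, validation_size=0.2):
    """Hold out the last share of the training data for validation."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of instances")
    cut = len(X) - round(len(X) * validation_size)
    # Assumed order: train inputs, validation inputs, train targets, validation targets.
    return X[:cut], X[cut:], y[:cut], y[cut:]


# Hypothetical stand-ins for graph inputs and their labels.
inputs = [f"x_{i}" for i in range(10)]
targets = [f"y_{i}" for i in range(10)]

X_train, X_val, y_train, y_val = split_validation(inputs, targets, validation_size=0.2)
```

Inputs and targets are cut at the same index, so each input stays aligned with its label in both resulting sets.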
Source code in graphs_on_grids/preprocessing/preprocessing.py
mask_features(X_train, X_test, targets, node_indices, method='zeros')
Masks the selected features of the nodes at the provided indices with either a fixed or a random value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `X_train` | `GraphList` | Training set | required |
| `X_test` | `GraphList` | Test set | required |
| `targets` | `List[str]` | Which node features to mask | required |
| `node_indices` | `List \| ndarray` | Which nodes to apply the feature masking to | required |
| `method` | `str` | Either | `'zeros'` |
Returns:
| Type | Description |
|---|---|
| `Tuple[GraphList, GraphList]` | A pair of the masked train and test split |
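
Feature masking on a single graph can be sketched as follows. None of this is the library's code: `mask_node_features` is a hypothetical helper, a graph is reduced to a list of per-node feature rows, the feature names `P`, `Q` and `V` are invented for the example, and the method name `"random"` is an assumption (only `'zeros'` appears in the documentation above).

```python
import random


def mask_node_features(graph, feature_names, targets, node_indices,
                       method="zeros", rng=None):
    """Return a copy of the graph with selected node features masked."""
    if method not in ("zeros", "random"):  # "random" is an assumed name
        raise ValueError(f"unknown masking method: {method}")
    rng = rng or random.Random()
    columns = [feature_names.index(name) for name in targets]
    masked = [list(row) for row in graph]  # copy so the input stays untouched
    for node in node_indices:
        for col in columns:
            masked[node][col] = 0.0 if method == "zeros" else rng.random()
    return masked


# Hypothetical graph with three nodes and three features per node.
feature_names = ["P", "Q", "V"]
graph = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

masked = mask_node_features(graph, feature_names, targets=["Q", "V"], node_indices=[0, 2])
```

Applying the same helper to every graph in both the train and the test split would yield a masked pair comparable to the one described in the Returns table.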