Adding a Custom Dataset Tutorial#
Tutorial Overview#
This comprehensive guide walks you through the process of integrating your custom dataset into our library. The process is divided into three main steps:
Dataset Creation
Implement data loading mechanisms
Define preprocessing steps
Structure data in the required format
Integrate with Dataset APIs
Add dataset to the library framework
Ensure compatibility with existing systems
Set up proper inheritance structure
Configuration Setup
Define dataset parameters
Specify data paths and formats
Configure preprocessing options
Tutorial Structure#
This tutorial follows a unique structure to provide the clearest possible learning experience:
Main Notebook (Current File)
High-level concepts and explanations
Step-by-step workflow description
References to implementation files
Supporting Files
Detailed code implementations
Specific examples and use cases
Technical documentation
Technical Framework#
This tutorial demonstrates custom dataset integration using:
torch_geometric.data.InMemoryDataset as the base class
the library's dataset management system
Important Notes#
To make the learning process concrete, we'll work with a practical toy "language" dataset example:
While we use the "language" dataset as an example, all file references use the generic <dataset_name> format for better generalization.
Step 1: Create a Dataset#
Overview#
Adding your custom dataset to the library requires implementing specific loading and preprocessing functionality. We use the torch_geometric.data.InMemoryDataset interface to make this process straightforward.
Required Methods#
To implement your dataset, you need to override two key methods from the torch_geometric.data.InMemoryDataset class:
download(): Handles dataset acquisition
process(): Manages data preprocessing
Reference Implementation: For a complete example, check
topobenchmark/data/datasets/language_dataset.py
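Before diving into the individual methods, the sketch below shows the overall shape of such a dataset class. This is a minimal, hypothetical skeleton (the class name, file names, and toy data are placeholders, not the library's actual implementation); the real language dataset in the file above contains additional logic such as URL handling and parameter management.

```python
import torch
from torch_geometric.data import Data, InMemoryDataset


class MyCustomDataset(InMemoryDataset):
    """Minimal sketch of a custom dataset; all names here are placeholders."""

    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        # Classic PyG pattern: load the collated data produced by process().
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        # Files expected in raw_dir; download() runs only if they are missing.
        return ["my_dataset.zip"]

    @property
    def processed_file_names(self):
        # Files expected in processed_dir; process() runs only if they are missing.
        return ["data.pt"]

    def download(self):
        # Fetch raw files into self.raw_dir (see the deep dive below).
        ...

    def process(self):
        # Convert raw files into a list of Data objects, then collate and save.
        data_list = [
            Data(x=torch.randn(4, 3), edge_index=torch.tensor([[0, 1], [1, 2]]))
        ]
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
```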
Deep Dive: The Download Method#
The download() method is responsible for acquiring dataset files from external resources. Let's examine its implementation using our language dataset example, where we store data in a Google Drive-hosted zip file.
Implementation Steps#
Download Data
Fetch data from the specified source URL
Save to the raw directory
Extract Content
Unzip the downloaded file
Place contents in appropriate directory
Organize Files
Move extracted files to named folders
Clean up temporary files and directories
Code Implementation#
```python
def download(self) -> None:
    r"""Download the dataset from a URL and save it to the raw directory.

    Raises:
        FileNotFoundError: If the dataset URL is not found.
    """
    # Step 1: Download data from the source
    self.url = self.URLS[self.name]
    self.file_format = self.FILE_FORMAT[self.name]
    download_file_from_drive(
        file_link=self.url,
        path_to_save=self.raw_dir,
        dataset_name=self.name,
        file_format=self.file_format,
    )

    # Step 2: Extract the zip file
    folder = self.raw_dir
    filename = f"{self.name}.{self.file_format}"
    path = osp.join(folder, filename)
    extract_zip(path, folder)
    # Delete the zip file
    os.unlink(path)

    # Step 3: Organize files
    # Move files from osp.join(folder, self.name) to folder
    for file in os.listdir(osp.join(folder, self.name)):
        shutil.move(osp.join(folder, self.name, file), folder)
    # Delete the osp.join(folder, self.name) directory
    shutil.rmtree(osp.join(folder, self.name))
```
Deep Dive: The Process Method#
The process() method handles data preprocessing and organization. Here's the method's structure:
```python
def process(self) -> None:
    r"""Handle the data for the dataset.

    This method loads the Language dataset, applies preprocessing
    transformations, and saves processed data.
    """
    # Step 1: Extract the data
    ...  # Convert raw data to a list of torch_geometric.data.Data objects

    # Step 2: Collate the graphs
    self.data, self.slices = self.collate(graph_sentences)

    # Step 3: Save processed data
    fs.torch_save(
        (self._data.to_dict(), self.slices, {}, self._data.__class__),
        self.processed_paths[0],
    )
```
self.collate collates a list of Data or HeteroData objects into the internal storage format; that is, it transforms a list of torch_geometric.data.Data objects into one torch_geometric.data.BaseData.
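For intuition, here is a small standalone sketch (independent of any dataset class) of what collation does with plain torch_geometric objects; the toy graphs below are made up for illustration:

```python
import torch
from torch_geometric.data import Data, InMemoryDataset

# Two toy graphs with different numbers of nodes.
graphs = [
    Data(x=torch.randn(3, 2), edge_index=torch.tensor([[0, 1], [1, 2]])),
    Data(x=torch.randn(4, 2), edge_index=torch.tensor([[0, 2], [1, 3]])),
]

# collate() merges the list into a single storage object plus a slice
# dictionary that remembers where each graph starts and ends.
data, slices = InMemoryDataset.collate(graphs)
print(data)           # one big Data object holding both graphs
print(slices["x"])    # node boundaries per graph, e.g. tensor([0, 3, 7])
```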
Step 2: Integrate with Dataset APIs#
Now that we have created a dataset class, we need to integrate it with the library. In this section we describe where to add the dataset files and how to make it available through data loaders.
Here's how to structure your files; the files you add are marked with comments, and __init__.py is updated automatically (see below):
```
topobenchmark/
├── data/
│   ├── datasets/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── <dataset_name>.py      # Your dataset file
│   │   └── ...
│   └── loaders/
│       ├── __init__.py
│       ├── base.py
│       ├── graph/
│       │   └── <loader_name>.py   # Your loader file
│       ├── hypergraph/
│       │   └── <loader_name>.py   # Your loader file
│       └── .../
```
To make your dataset available to the library:
The file <dataset_name>.py was created in the previous step (us_county_demos_dataset.py in our case) and should be placed in the topobenchmark/data/datasets/ directory.
The registry topobenchmark/data/datasets/__init__.py discovers the files in topobenchmark/data/datasets and updates the __all__ variable of topobenchmark/data/datasets/__init__.py automatically. Hence there is no need to update the __init__.py file manually for your dataset to be loaded by the library. Simply create a file <dataset_name>.py and place it in the topobenchmark/data/datasets/ directory.
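As a hypothetical sanity check (it relies only on the automatic __all__ update described above; the class name is a placeholder), you can verify that the new class was picked up after placing the file:

```python
import topobenchmark.data.datasets as datasets

# The registry scans topobenchmark/data/datasets/ and fills __all__ automatically,
# so your class should appear here without any manual edits.
print("MyCustomDataset" in datasets.__all__)
```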
Next, it is required to update the data loader system by adding a loader file for your dataset. For the example dataset we add the file topobenchmark/data/loaders/graph/us_county_demos_dataset_loader.py, which consists of the following:
```python
class USCountyDemosDatasetLoader(AbstractLoader):
    """Load US County Demos dataset with configurable year and task variable.

    Parameters
    ----------
    parameters : DictConfig
        Configuration parameters containing:
        - data_dir: Root directory for data
        - data_name: Name of the dataset
        - year: Year of the dataset (if applicable)
        - task_variable: Task variable for the dataset
    """

    def __init__(self, parameters: DictConfig) -> None:
        super().__init__(parameters)

    def load_dataset(self) -> USCountyDemosDataset:
        """Load the US County Demos dataset.

        Returns
        -------
        USCountyDemosDataset
            The loaded US County Demos dataset with the appropriate `data_dir`.

        Raises
        ------
        RuntimeError
            If dataset loading fails.
        """
        dataset = self._initialize_dataset()
        self.data_dir = self._redefine_data_dir(dataset)
        return dataset

    def _initialize_dataset(self) -> USCountyDemosDataset:
        """Initialize the US County Demos dataset.

        Returns
        -------
        USCountyDemosDataset
            The initialized dataset instance.
        """
        return USCountyDemosDataset(
            root=str(self.root_data_dir),
            name=self.parameters.data_name,
            parameters=self.parameters,
        )

    def _redefine_data_dir(self, dataset: USCountyDemosDataset) -> Path:
        """Redefine the data directory based on the chosen (year, task_variable) pair.

        Parameters
        ----------
        dataset : USCountyDemosDataset
            The dataset instance.

        Returns
        -------
        Path
            The redefined data directory path.
        """
        return dataset.processed_root
```
where the method load_dataset is required, while the other methods are optional and used for convenience and structure.
The load_dataset method of the AbstractLoader class is required to return a torch.utils.data.Dataset object.
Important: to allow automatic registration of the loader, make sure to include "DatasetLoader" in the name of the loader class (Example: USCountyDemosDatasetLoader).
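Putting these requirements together, a generic loader might look like the sketch below. The class and dataset names, as well as the import path of AbstractLoader, are assumptions for illustration; only load_dataset is strictly required, and the class name must contain "DatasetLoader" so it is registered automatically.

```python
from omegaconf import DictConfig

# Assumed import path for the abstract base class (loaders/base.py in the tree above).
from topobenchmark.data.loaders.base import AbstractLoader


class MyCustomDatasetLoader(AbstractLoader):
    """Sketch of a loader; "DatasetLoader" in the name enables auto-registration."""

    def __init__(self, parameters: DictConfig) -> None:
        super().__init__(parameters)

    def load_dataset(self):
        # Must return a torch.utils.data.Dataset; an InMemoryDataset qualifies.
        # MyCustomDataset is the placeholder class from Step 1; real loaders also
        # pass the dataset name and parameters, as in the US County example above.
        return MyCustomDataset(root=str(self.root_data_dir))
```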
Step 3: Define Configuration#
Now that we've integrated our dataset, we need to define its configuration parameters. In this section, we'll explain how to create and structure the configuration file for your dataset.
Configuration File Structure#
Create a new YAML file for your dataset in configs/dataset/<dataset_name>.yaml with the following structure:
While creating a configuration file, you will need to specify:#
The loader class (topobenchmark.data.loaders.USCountyDemosDatasetLoader) for automatic instantiation inside the provided pipeline, and the parameters for the loader.
```yaml
# Dataset loader config
loader:
  _target_: topobenchmark.data.loaders.USCountyDemosDatasetLoader
  parameters:
    data_domain: graph          # Primary data domain. Options: ['graph', 'hypergraph', 'cell', 'simplicial']
    data_type: cornel           # Data type. String indicating where the dataset comes from.
    data_name: US-county-demos  # Name of the dataset
    year: 2012                  # US-county-demos has multiple versions. Options: [2012, 2016]
    task_variable: 'Election'   # Target variable. Options: ['Election', 'MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']
    data_dir: ${paths.data_dir}/${dataset.loader.parameters.data_domain}/${dataset.loader.parameters.data_type}
```
The dataset parameters:
```yaml
# Dataset parameters
parameters:
  num_features: 6       # Number of features in the dataset
  num_classes: 1        # Dimension of the target variable
  task: regression      # Dataset task. Options: [classification, regression]
  loss_type: mse        # Task-specific loss function
  monitor_metric: mae   # Metric to monitor during training
  task_level: node      # Task level. Options: [node, graph]
```
The dataset split parameters:
```yaml
# Splits
split_params:
  learning_setting: transductive  # Type of learning. Options: ['transductive', 'inductive']
  data_seed: 0                    # Seed for data splitting
  split_type: random              # Type of splitting. Options: ['k-fold', 'random']
  k: 10                           # Number of folds in case of "k-fold" cross-validation
  train_prop: 0.5                 # Training proportion in case of 'random' splitting strategy
  standardize: True               # Standardize the data or not. Options: [True, False]
  data_split_dir: ${paths.data_dir}/data_splits/${dataset.loader.parameters.data_name}
```
Finally, the dataloader parameters:
```yaml
# Dataloader parameters
dataloader_params:
  batch_size: 1      # Number of graphs per batch. In the transductive setting always 1, as there is only one graph.
  num_workers: 0     # Number of workers for data loading
  pin_memory: False  # Pin memory for data loading
```
Notes:#
The paths section in the configuration file is automatically populated with the paths to the data directory and the data splits directory.
Some of the dataset parameters are used to configure model.yaml and other files. Hence we suggest always including the above parameters in the dataset configuration file.
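To illustrate how such interpolations behave, here is a small standalone OmegaConf example, independent of the library (all values are made up):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "paths": {"data_dir": "/tmp/datasets"},
    "dataset": {
        "loader": {
            "parameters": {
                "data_domain": "graph",
                "data_type": "cornel",
                # Resolved lazily from the values above when accessed.
                "data_dir": "${paths.data_dir}/${dataset.loader.parameters.data_domain}/${dataset.loader.parameters.data_type}",
            }
        }
    },
})

print(cfg.dataset.loader.parameters.data_dir)  # /tmp/datasets/graph/cornel
```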
Preparing to Load the Custom Dataset: Understanding Configuration Imports#
Before loading our dataset, it's crucial to understand the configuration imports, particularly those from the topobenchmark.utils.config_resolvers module. These utility functions play a key role in dynamically configuring your machine learning pipeline.
Key Imports for Dynamic Configuration#
Let's import the essential configuration resolver functions:
```python
from topobenchmark.utils.config_resolvers import (
    get_default_transform,
    get_monitor_metric,
    get_monitor_mode,
    infer_in_channels,
)
```
Why These Imports Matter#
In our previous step, we explored configuration variables that use dynamic lookups, such as:
```yaml
data_dir: ${paths.data_dir}/${dataset.loader.parameters.data_domain}/${dataset.loader.parameters.data_type}
```
However, some configurations require more advanced automation, which is where these imported functions become invaluable.
Practical Example: Dynamic Transforms#
Consider the configuration in projects/TopoBenchmark/configs/run.yaml, where the transforms parameter uses the get_default_transform function:
```yaml
transforms: ${get_default_transform:${dataset},${model}}
```
This syntax allows for automatic transformation selection based on the dataset and model, demonstrating the power of these configuration resolver functions.
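Under the hood, this syntax relies on OmegaConf custom resolvers. The library registers its own resolvers (including get_default_transform) for you; the standalone sketch below only mimics the mechanism with a simplified, made-up resolver body, so the real function's signature and behaviour may differ:

```python
from omegaconf import OmegaConf

# Toy resolver: pick a lifting whenever the dataset and model domains differ.
def get_default_transform(dataset_domain: str, model_domain: str) -> str:
    if dataset_domain == model_domain:
        return "no_transform"
    return f"{dataset_domain}2{model_domain}_lifting"

OmegaConf.register_new_resolver("get_default_transform", get_default_transform)

cfg = OmegaConf.create({
    "dataset_domain": "graph",
    "model_domain": "hypergraph",
    "transforms": "${get_default_transform:${dataset_domain},${model_domain}}",
})
print(cfg.transforms)  # graph2hypergraph_lifting
```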
By importing and utilizing these functions, you gain:
Flexible configuration management
Automatic parameter inference
Reduced manual configuration overhead
These facilitate seamless dataset loading and preprocessing across multiple topological domains and provide an easy and intuitive interface for incorporating novel functionality.
In [1]:
```python
from hydra import compose, initialize
from hydra.utils import instantiate

from topobenchmark.utils.config_resolvers import (
    get_default_transform,
    get_monitor_metric,
    get_monitor_mode,
    infer_in_channels,
)

initialize(config_path="../configs", job_name="job")
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/US-county-demos",
    ],
    return_hydra_config=True,
)

loader = instantiate(cfg.dataset.loader)
dataset, dataset_dir = loader.load()
```
/tmp/ipykernel_1170891/1713955081.py:14: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
initialize(config_path="../configs", job_name="job")
In [2]:
print(dataset)
US-county-demos(self.root=/home/lev/projects/TopoBenchmark/datasets/graph/cornel, self.name=US-county-demos, self.parameters={'data_domain': 'graph', 'data_type': 'cornel', 'data_name': 'US-county-demos', 'year': 2012, 'task_variable': 'Election', 'data_dir': '/home/lev/projects/TopoBenchmark/datasets/graph/cornel'}, self.force_reload=False)
In [3]:
print(dataset[0])
Data(x=[3224, 6], edge_index=[2, 18966], y=[3224])
Step 4.1: Default Data Transformations#
While most datasets can be used directly after integration, some require specific preprocessing transformations. These transformations might vary depending on the task, model, or other conditions.
Example Case: US-county-demos Dataset#
Let's look again at the compose call used to load our example dataset:
```python
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/US-county-demos",
    ],
    return_hydra_config=True,
)
```
We can see that the model hypergraph/unignn2 is from the hypergraph domain, while the dataset is from the graph domain. This implies that the get_default_transform function discussed above,
```yaml
transforms: ${get_default_transform:${dataset},${model}}
```
inferred a default transform from the graph to the hypergraph domain.
In [4]:
```python
print('Transform name:', cfg.transforms.keys())
print('Transform parameters:', cfg.transforms['graph2hypergraph_lifting'])
```
Transform name: dict_keys(['graph2hypergraph_lifting'])
Transform parameters: {'_target_': 'topobenchmark.transforms.data_transform.DataTransform', 'transform_type': 'lifting', 'transform_name': 'HypergraphKHopLifting', 'k_value': 1, 'feature_lifting': 'ProjectionSum', 'neighborhoods': '${oc.select:model.backbone.neighborhoods,null}'}
Some datasets might require default transforms, which are applied whenever the dataset is used.
The topobenchmark library provides a simple way to define such default transformations and apply them to the dataset. Take a look at the TopoBenchmark/configs/transforms/dataset_defaults folder, where you can find default transformations for different datasets.
For example, REDDIT-BINARY does not have initial node features, and it is a common practice to define initial features as Gaussian noise. Hence the TopoBenchmark/configs/transforms/dataset_defaults/REDDIT-BINARY.yaml file incorporates a Gaussian-noise feature transform (equal_gaus_features) by default. Whenever you load the REDDIT-BINARY dataset (and do not modify the transforms parameter), this transform will be applied to the dataset.
```yaml
defaults:
  - data_manipulations: equal_gaus_features
  - liftings@_here_: ${get_required_lifting:graph,${model}}
```
Below we provide a quick tutorial on how to create data transformations and build a sequence of default transformations that will be executed whenever you use the defined dataset config file.
In [5]:
```python
# Note: we do not override the transforms parameter, so the dataset defaults apply
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/REDDIT-BINARY",
    ],
    return_hydra_config=True,
)

loader = instantiate(cfg.dataset.loader)
dataset, dataset_dir = loader.load()
```
REDDIT_BINARY dataset does not have any initial node features
In [6]:
dataset[0]
Out [6]:
Data(edge_index=[2, 480], y=[1], num_nodes=218)
Take a look at the default transforms and the parameters of the equal_gaus_features transform:
In [7]:
```python
print('Transform name:', cfg.transforms.keys())
print('Transform parameters:', cfg.transforms['equal_gaus_features'])
```
Transform name: dict_keys(['equal_gaus_features', 'graph2hypergraph_lifting'])
Transform parameters: {'_target_': 'topobenchmark.transforms.data_transform.DataTransform', 'transform_name': 'EqualGausFeatures', 'transform_type': 'data manipulation', 'mean': 0, 'std': 0.1, 'num_features': '${dataset.parameters.num_features}'}
In [8]:
```python
from topobenchmark.data.preprocessor import PreProcessor

preprocessed_dataset = PreProcessor(dataset, dataset_dir, cfg['transforms'])
```
Processing...
Done!
In [9]:
preprocessed_dataset[0]
Out [9]:
Data(x=[218, 10], edge_index=[2, 480], y=[1], incidence_hyperedges=[218, 218], num_hyperedges=[1], x_0=[218, 10], x_hyperedges=[218, 10], num_nodes=218)
The preprocessed dataset now has the features generated by the preprocessor, and its connectivity has been lifted into the hypergraph domain.
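As a quick check, reusing the objects from the cells above, you can compare a raw sample with its preprocessed counterpart:

```python
raw_sample = dataset[0]
processed_sample = preprocessed_dataset[0]

# The raw REDDIT-BINARY graph has no node features, while the preprocessed one
# carries the generated Gaussian features and the hypergraph incidence.
print("raw x:", getattr(raw_sample, "x", None))
print("processed x shape:", processed_sample.x.shape)
print("has incidence_hyperedges:", hasattr(processed_sample, "incidence_hyperedges"))
```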
Creating your own default transforms#
Now that we have seen how to add a custom dataset and how the default transforms work, you might want to create your own default transforms for a new dataset, which will be executed whenever the dataset is used under its default configuration.
To configure the default transforms, navigate to configs/transforms/dataset_defaults, create a <dataset_name>.yaml file, and add the following structure:
```yaml
defaults:
  - transform_1: transform_1
  - transform_2: transform_2
  - transform_3: transform_3
```
Important: There are different types of transforms, including data_manipulations, liftings, and feature_liftings. In case you want to use multiple transforms from the same category, say data_manipulations, it is required to stick to a special syntax. See the Hydra documentation for more information, or the example below:
```yaml
defaults:
  - data_manipulations@first_usage: transform_1
  - data_manipulations@second_usage: transform_2
```
Notes:#
Transforms from the same category: If there are two transforms from the same category, for example data_manipulations, it is required to use the @ operator to assign different names (first_usage and second_usage) to each transform.
In the case of equal_gaus_features, the equal_gaus_features.yaml config uses an interpolation to infer the feature dimension (the resolver logic is described in Step 3). If for some reason we want to specify the num_features parameter ourselves, we can override it in the defaults file without changing the transform config file:
```yaml
defaults:
  - data_manipulations@equal_gaus_features: equal_gaus_features
  - data_manipulations@some_transform: some_transform
  - liftings@_here_: ${get_required_lifting:graph,${model}}

equal_gaus_features:
  num_features: 100

some_transform:
  some_param: bla
```
We recommend always adding liftings@_here_: ${get_required_lifting:graph,${model}} so that a default lifting is applied when running any domain-specific topological model.
Step 4.2: Custom Data Transformations#
In general, any transform in the library inherits from the torch_geometric.transforms.BaseTransform class, which allows applying a sequence of transforms to the data. Our interface requires implementing the forward method. The important part of every transform is that it takes a torch_geometric.data.Data object and returns an updated torch_geometric.data.Data object.
For the language dataset, we have created the equal_gaus_features transform, which is a data manipulation transform; hence we place it into the topobenchmark/transforms/data_manipulation/ folder. Below you can see the EqualGausFeatures class:
```python
class EqualGausFeatures(torch_geometric.transforms.BaseTransform):
    r"""A transform that generates equal Gaussian features for all nodes.

    Parameters
    ----------
    **kwargs : optional
        Additional arguments for the class. It should contain the following keys:
        - mean (float): The mean of the Gaussian distribution.
        - std (float): The standard deviation of the Gaussian distribution.
        - num_features (int): The number of features to generate.
    """

    def __init__(self, **kwargs):
        super().__init__()
        self.type = "generate_non_informative_features"

        # Generate the feature vector from a Gaussian distribution
        self.mean = kwargs["mean"]
        self.std = kwargs["std"]
        self.feature_vector = kwargs["num_features"]
        self.feature_vector = torch.normal(
            mean=self.mean, std=self.std, size=(1, self.feature_vector)
        )

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(type={self.type!r}, mean={self.mean!r}, std={self.std!r}, feature_vector={self.feature_vector!r})"

    def forward(self, data: torch_geometric.data.Data):
        r"""Apply the transform to the input data.

        Parameters
        ----------
        data : torch_geometric.data.Data
            The input data.

        Returns
        -------
        torch_geometric.data.Data
            The transformed data.
        """
        data.x = self.feature_vector.expand(data.num_nodes, -1)
        return data
```
As said above, the forward function takes a torch_geometric.data.Data object as input, modifies it, and returns it.
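For a quick sanity check, the transform can be applied directly to a toy graph. The import path below is an assumption based on the folder layout described above; inside the framework the transform is normally instantiated through its configuration file instead:

```python
import torch
import torch_geometric

# Assumed import path, following topobenchmark/transforms/data_manipulation/.
from topobenchmark.transforms.data_manipulation import EqualGausFeatures

transform = EqualGausFeatures(mean=0.0, std=0.1, num_features=10)

data = torch_geometric.data.Data(
    edge_index=torch.tensor([[0, 1, 2], [1, 2, 0]]),
    num_nodes=3,
)
data = transform(data)
print(data.x.shape)  # torch.Size([3, 10]); every node gets the same feature vector
```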
Similarly to adding a dataset, the transformations you create and place in the right folder are registered automatically.
Now that the transform is registered, we can finally create the configuration file and use it in the framework:
```yaml
_target_: topobenchmark.transforms.data_transform.DataTransform
transform_name: "EqualGausFeatures"
transform_type: "data manipulation"
mean: 0
std: 0.1
num_features: ${dataset.parameters.num_features}
```
Please refer to configs/transforms/dataset_defaults/equal_gaus_features.yaml for the example.
Notes:
You might notice an interesting key, _target_, in the configuration file. In general, for any new transform the _target_ is always topobenchmark.transforms.data_transform.DataTransform. For more information please refer to the Hydra documentation, "Instantiating objects with Hydra" section.
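To see what _target_ buys you, here is a hedged sketch of instantiating such a config directly with Hydra. The num_features value is hard-coded here because the ${dataset.parameters.num_features} interpolation only resolves inside the full pipeline config; the keyword arguments follow the config shown above:

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

transform_cfg = OmegaConf.create({
    "_target_": "topobenchmark.transforms.data_transform.DataTransform",
    "transform_name": "EqualGausFeatures",
    "transform_type": "data manipulation",
    "mean": 0,
    "std": 0.1,
    "num_features": 10,  # hard-coded; the pipeline resolves it from the dataset config
})

# Hydra imports the class referenced by _target_ and passes the remaining
# keys as keyword arguments, returning a ready-to-use transform object.
transform = instantiate(transform_cfg)
print(transform)
```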