Adding a Custom Dataset Tutorial#
Tutorial Overview#
This comprehensive guide walks you through the process of integrating your custom dataset into our library. The process is divided into three main steps:
Dataset Creation
Implement data loading mechanisms
Define preprocessing steps
Structure data in the required format
Integrate with Dataset APIs
Add dataset to the library framework
Ensure compatibility with existing systems
Set up proper inheritance structure
Configuration Setup
Define dataset parameters
Specify data paths and formats
Configure preprocessing options
Tutorial Structure#
This tutorial follows a unique structure to provide the clearest possible learning experience:
Main Notebook (Current File)
High-level concepts and explanations
Step-by-step workflow description
References to implementation files
Supporting Files
Detailed code implementations
Specific examples and use cases
Technical documentation
Technical Framework#
This tutorial demonstrates custom dataset integration using:
torch_geometric.data.InMemoryDataset as the base class
the library's dataset management system
Important Notes#
To make the learning process concrete, we'll work with a practical toy "language" dataset example:
While we use the "language" dataset as an example, all file references use the generic <dataset_name> format for better generalization.
Step 1: Create a Dataset#
Overview#
Adding your custom dataset to the library requires implementing specific loading and preprocessing functionality. We use the torch_geometric.data.InMemoryDataset interface to make this process straightforward.
Required Methods#
To implement your dataset, you need to override two key methods from the torch_geometric.data.InMemoryDataset class:
download(): Handles dataset acquisition
process(): Manages data preprocessing
Reference Implementation: For a complete example, check
topobenchmark/data/datasets/language_dataset.py
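Before diving into the individual methods, the sketch below shows the overall shape of such a dataset class. This is a minimal, hypothetical skeleton (the class name, file names, and toy data are placeholders, not the library's actual implementation); the real language dataset in the file above contains additional logic such as URL handling and parameter management.

```python
import torch
from torch_geometric.data import Data, InMemoryDataset


class MyCustomDataset(InMemoryDataset):
    """Minimal sketch of a custom dataset; all names here are placeholders."""

    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        # Classic PyG pattern: load the collated data produced by process().
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        # Files expected in raw_dir; download() runs only if they are missing.
        return ["my_dataset.zip"]

    @property
    def processed_file_names(self):
        # Files expected in processed_dir; process() runs only if they are missing.
        return ["data.pt"]

    def download(self):
        # Fetch raw files into self.raw_dir (see the deep dive below).
        ...

    def process(self):
        # Convert raw files into a list of Data objects, then collate and save.
        data_list = [
            Data(x=torch.randn(4, 3), edge_index=torch.tensor([[0, 1], [1, 2]]))
        ]
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
```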
Deep Dive: The Download Method#
The download() method is responsible for acquiring dataset files from external resources. Let's examine its implementation using our language dataset example, where we store data in a Google Drive-hosted zip file.
Implementation Steps#
Download Data
Fetch data from the specified source URL
Save to the raw directory
Extract Content
Unzip the downloaded file
Place contents in appropriate directory
Organize Files
Move extracted files to named folders
Clean up temporary files and directories
Code Implementation#
```python
def download(self) -> None:
    r"""Download the dataset from a URL and save it to the raw directory.

    Raises:
        FileNotFoundError: If the dataset URL is not found.
    """
    # Step 1: Download data from the source
    self.url = self.URLS[self.name]
    self.file_format = self.FILE_FORMAT[self.name]
    download_file_from_drive(
        file_link=self.url,
        path_to_save=self.raw_dir,
        dataset_name=self.name,
        file_format=self.file_format,
    )

    # Step 2: Extract the zip file
    folder = self.raw_dir
    filename = f"{self.name}.{self.file_format}"
    path = osp.join(folder, filename)
    extract_zip(path, folder)
    # Delete the zip file
    os.unlink(path)

    # Step 3: Organize files
    # Move files from osp.join(folder, self.name) to folder
    for file in os.listdir(osp.join(folder, self.name)):
        shutil.move(osp.join(folder, self.name, file), folder)
    # Delete the osp.join(folder, self.name) directory
    shutil.rmtree(osp.join(folder, self.name))
```
Deep Dive: The Process Method#
The process() method handles data preprocessing and organization. Here's the method's structure:
```python
def process(self) -> None:
    r"""Handle the data for the dataset.

    This method loads the Language dataset, applies preprocessing
    transformations, and saves processed data.
    """
    # Step 1: Extract the data
    ...  # Convert raw data to a list of torch_geometric.data.Data objects

    # Step 2: Collate the graphs
    self.data, self.slices = self.collate(graph_sentences)

    # Step 3: Save processed data
    fs.torch_save(
        (self._data.to_dict(), self.slices, {}, self._data.__class__),
        self.processed_paths[0],
    )
```
self.collate collates a list of Data or HeteroData objects into the internal storage format; that is, it transforms a list of torch_geometric.data.Data objects into one torch_geometric.data.BaseData.
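For intuition, here is a small standalone sketch (independent of any dataset class) of what collation does with plain torch_geometric objects; the toy graphs below are made up for illustration:

```python
import torch
from torch_geometric.data import Data, InMemoryDataset

# Two toy graphs with different numbers of nodes.
graphs = [
    Data(x=torch.randn(3, 2), edge_index=torch.tensor([[0, 1], [1, 2]])),
    Data(x=torch.randn(4, 2), edge_index=torch.tensor([[0, 2], [1, 3]])),
]

# collate() merges the list into a single storage object plus a slice
# dictionary that remembers where each graph starts and ends.
data, slices = InMemoryDataset.collate(graphs)
print(data)           # one big Data object holding both graphs
print(slices["x"])    # node boundaries per graph, e.g. tensor([0, 3, 7])
```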
Step 2: Integrate with Dataset APIs#
Now that we have created a dataset class, we need to integrate it with the library. In this section we describe where to add the dataset files and how to make it available through data loaders.
Here's how to structure your files; the files you add are marked with comments, and __init__.py is updated automatically (see below):
```
topobenchmark/
├── data/
│   ├── datasets/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── <dataset_name>.py      # Your dataset file
│   │   └── ...
│   └── loaders/
│       ├── __init__.py
│       ├── base.py
│       ├── graph/
│       │   └── <loader_name>.py   # Your loader file
│       ├── hypergraph/
│       │   └── <loader_name>.py   # Your loader file
│       └── .../
```
To make your dataset available to the library:
The file <dataset_name>.py was created in the previous step (us_county_demos_dataset.py in our case) and should be placed in the topobenchmark/data/datasets/ directory.
The registry topobenchmark/data/datasets/__init__.py discovers the files in topobenchmark/data/datasets and updates the __all__ variable of topobenchmark/data/datasets/__init__.py automatically. Hence there is no need to update the __init__.py file manually for your dataset to be loaded by the library. Simply create a file <dataset_name>.py and place it in the topobenchmark/data/datasets/ directory.
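As a hypothetical sanity check (it relies only on the automatic __all__ update described above; the class name is a placeholder), you can verify that the new class was picked up after placing the file:

```python
import topobenchmark.data.datasets as datasets

# The registry scans topobenchmark/data/datasets/ and fills __all__ automatically,
# so your class should appear here without any manual edits.
print("MyCustomDataset" in datasets.__all__)
```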
Next, it is required to update the data loader system by adding a loader file for your dataset. For the example dataset we add the file topobenchmark/data/loaders/graph/us_county_demos_dataset_loader.py, which consists of the following:
```python
class USCountyDemosDatasetLoader(AbstractLoader):
    """Load US County Demos dataset with configurable year and task variable.

    Parameters
    ----------
    parameters : DictConfig
        Configuration parameters containing:
        - data_dir: Root directory for data
        - data_name: Name of the dataset
        - year: Year of the dataset (if applicable)
        - task_variable: Task variable for the dataset
    """

    def __init__(self, parameters: DictConfig) -> None:
        super().__init__(parameters)

    def load_dataset(self) -> USCountyDemosDataset:
        """Load the US County Demos dataset.

        Returns
        -------
        USCountyDemosDataset
            The loaded US County Demos dataset with the appropriate `data_dir`.

        Raises
        ------
        RuntimeError
            If dataset loading fails.
        """
        dataset = self._initialize_dataset()
        self.data_dir = self._redefine_data_dir(dataset)
        return dataset

    def _initialize_dataset(self) -> USCountyDemosDataset:
        """Initialize the US County Demos dataset.

        Returns
        -------
        USCountyDemosDataset
            The initialized dataset instance.
        """
        return USCountyDemosDataset(
            root=str(self.root_data_dir),
            name=self.parameters.data_name,
            parameters=self.parameters,
        )

    def _redefine_data_dir(self, dataset: USCountyDemosDataset) -> Path:
        """Redefine the data directory based on the chosen (year, task_variable) pair.

        Parameters
        ----------
        dataset : USCountyDemosDataset
            The dataset instance.

        Returns
        -------
        Path
            The redefined data directory path.
        """
        return dataset.processed_root
```
where the method load_dataset is required, while the other methods are optional and used for convenience and structure.
The load_dataset method of the AbstractLoader class is required to return a torch.utils.data.Dataset object.
Important: to allow automatic registration of the loader, make sure to include "DatasetLoader" in the name of the loader class (Example: USCountyDemosDatasetLoader).
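Putting these requirements together, a generic loader might look like the sketch below. The class and dataset names, as well as the import path of AbstractLoader, are assumptions for illustration; only load_dataset is strictly required, and the class name must contain "DatasetLoader" so it is registered automatically.

```python
from omegaconf import DictConfig

# Assumed import path for the abstract base class (loaders/base.py in the tree above).
from topobenchmark.data.loaders.base import AbstractLoader


class MyCustomDatasetLoader(AbstractLoader):
    """Sketch of a loader; "DatasetLoader" in the name enables auto-registration."""

    def __init__(self, parameters: DictConfig) -> None:
        super().__init__(parameters)

    def load_dataset(self):
        # Must return a torch.utils.data.Dataset; an InMemoryDataset qualifies.
        # MyCustomDataset is the placeholder class from Step 1; real loaders also
        # pass the dataset name and parameters, as in the US County example above.
        return MyCustomDataset(root=str(self.root_data_dir))
```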
Step 3: Define Configuration#
Now that we've integrated our dataset, we need to define its configuration parameters. In this section, we'll explain how to create and structure the configuration file for your dataset.
Configuration File Structure#
Create a new YAML file for your dataset in configs/dataset/<dataset_name>.yaml with the following structure:
While creating a configuration file, you will need to specify:#
The loader class (topobenchmark.data.loaders.USCountyDemosDatasetLoader) for automatic instantiation inside the provided pipeline, and the parameters for the loader.
```yaml
# Dataset loader config
loader:
  _target_: topobenchmark.data.loaders.USCountyDemosDatasetLoader
  parameters:
    data_domain: graph          # Primary data domain. Options: ['graph', 'hypergraph', 'cell', 'simplicial']
    data_type: cornel           # Data type. String indicating where the dataset comes from.
    data_name: US-county-demos  # Name of the dataset
    year: 2012                  # US-county-demos has multiple versions. Options: [2012, 2016]
    task_variable: 'Election'   # Target variable. Options: ['Election', 'MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']
    data_dir: ${paths.data_dir}/${dataset.loader.parameters.data_domain}/${dataset.loader.parameters.data_type}
```
The dataset parameters:
```yaml
# Dataset parameters
parameters:
  num_features: 6       # Number of features in the dataset
  num_classes: 1        # Dimension of the target variable
  task: regression      # Dataset task. Options: [classification, regression]
  loss_type: mse        # Task-specific loss function
  monitor_metric: mae   # Metric to monitor during training
  task_level: node      # Task level. Options: [node, graph]
```
The dataset split parameters:
```yaml
# Splits
split_params:
  learning_setting: transductive  # Type of learning. Options: ['transductive', 'inductive']
  data_seed: 0                    # Seed for data splitting
  split_type: random              # Type of splitting. Options: ['k-fold', 'random']
  k: 10                           # Number of folds in case of "k-fold" cross-validation
  train_prop: 0.5                 # Training proportion in case of 'random' splitting strategy
  standardize: True               # Standardize the data or not. Options: [True, False]
  data_split_dir: ${paths.data_dir}/data_splits/${dataset.loader.parameters.data_name}
```
Finally, the dataloader parameters:
```yaml
# Dataloader parameters
dataloader_params:
  batch_size: 1      # Number of graphs per batch. In the transductive setting always 1, as there is only one graph.
  num_workers: 0     # Number of workers for data loading
  pin_memory: False  # Pin memory for data loading
```
Notes:#
The paths section in the configuration file is automatically populated with the paths to the data directory and the data splits directory.
Some of the dataset parameters are used to configure model.yaml and other files. Hence we suggest always including the above parameters in the dataset configuration file.
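To illustrate how such interpolations behave, here is a small standalone OmegaConf example, independent of the library (all values are made up):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "paths": {"data_dir": "/tmp/datasets"},
    "dataset": {
        "loader": {
            "parameters": {
                "data_domain": "graph",
                "data_type": "cornel",
                # Resolved lazily from the values above when accessed.
                "data_dir": "${paths.data_dir}/${dataset.loader.parameters.data_domain}/${dataset.loader.parameters.data_type}",
            }
        }
    },
})

print(cfg.dataset.loader.parameters.data_dir)  # /tmp/datasets/graph/cornel
```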
Preparing to Load the Custom Dataset: Understanding Configuration Imports#
Before loading our dataset, it's crucial to understand the configuration imports, particularly those from the topobenchmark.utils.config_resolvers module. These utility functions play a key role in dynamically configuring your machine learning pipeline.
Key Imports for Dynamic Configuration#
Let's import the essential configuration resolver functions:
```python
from topobenchmark.utils.config_resolvers import (
    get_default_transform,
    get_monitor_metric,
    get_monitor_mode,
    infer_in_channels,
)
```
Why These Imports Matter#
In our previous step, we explored configuration variables that use dynamic lookups, such as:
```yaml
data_dir: ${paths.data_dir}/${dataset.loader.parameters.data_domain}/${dataset.loader.parameters.data_type}
```
However, some configurations require more advanced automation, which is where these imported functions become invaluable.
Practical Example: Dynamic Transforms#
Consider the configuration in projects/TopoBenchmark/configs/run.yaml, where the transforms parameter uses the get_default_transform function:
```yaml
transforms: ${get_default_transform:${dataset},${model}}
```
This syntax allows for automatic transformation selection based on the dataset and model, demonstrating the power of these configuration resolver functions.
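Under the hood, this syntax relies on OmegaConf custom resolvers. The library registers its own resolvers (including get_default_transform) for you; the standalone sketch below only mimics the mechanism with a simplified, made-up resolver body, so the real function's signature and behaviour may differ:

```python
from omegaconf import OmegaConf

# Toy resolver: pick a lifting whenever the dataset and model domains differ.
def get_default_transform(dataset_domain: str, model_domain: str) -> str:
    if dataset_domain == model_domain:
        return "no_transform"
    return f"{dataset_domain}2{model_domain}_lifting"

OmegaConf.register_new_resolver("get_default_transform", get_default_transform)

cfg = OmegaConf.create({
    "dataset_domain": "graph",
    "model_domain": "hypergraph",
    "transforms": "${get_default_transform:${dataset_domain},${model_domain}}",
})
print(cfg.transforms)  # graph2hypergraph_lifting
```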
By importing and utilizing these functions, you gain:
Flexible configuration management
Automatic parameter inference
Reduced manual configuration overhead
These facilitate seamless dataset loading and preprocessing across multiple topological domains and provide an easy and intuitive interface for incorporating novel functionality.
In [1]:
```python
from hydra import compose, initialize
from hydra.utils import instantiate

from topobenchmark.utils.config_resolvers import (
    get_default_transform,
    get_monitor_metric,
    get_monitor_mode,
    infer_in_channels,
)

initialize(config_path="../configs", job_name="job")
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/US-county-demos",
    ],
    return_hydra_config=True,
)

loader = instantiate(cfg.dataset.loader)
dataset, dataset_dir = loader.load()
```
/tmp/ipykernel_1170891/1713955081.py:14: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
initialize(config_path="../configs", job_name="job")
In [2]:
print(dataset)
US-county-demos(self.root=/home/lev/projects/TopoBenchmark/datasets/graph/cornel, self.name=US-county-demos, self.parameters={'data_domain': 'graph', 'data_type': 'cornel', 'data_name': 'US-county-demos', 'year': 2012, 'task_variable': 'Election', 'data_dir': '/home/lev/projects/TopoBenchmark/datasets/graph/cornel'}, self.force_reload=False)
In [3]:
print(dataset[0])
Data(x=[3224, 6], edge_index=[2, 18966], y=[3224])
Step 4.1: Default Data Transformations#
While most datasets can be used directly after integration, some require specific preprocessing transformations. These transformations might vary depending on the task, model, or other conditions.
Example Case: US-county-demos Dataset#
Let's look again at the compose call used to load our example dataset:
```python
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/US-county-demos",
    ],
    return_hydra_config=True,
)
```
We can see that the model hypergraph/unignn2 is from the hypergraph domain, while the dataset is from the graph domain. This implies that the get_default_transform function discussed above,
```yaml
transforms: ${get_default_transform:${dataset},${model}}
```
inferred a default transform from the graph to the hypergraph domain.
In [4]:
```python
print('Transform name:', cfg.transforms.keys())
print('Transform parameters:', cfg.transforms['graph2hypergraph_lifting'])
```
Transform name: dict_keys(['graph2hypergraph_lifting'])
Transform parameters: {'_target_': 'topobenchmark.transforms.data_transform.DataTransform', 'transform_type': 'lifting', 'transform_name': 'HypergraphKHopLifting', 'k_value': 1, 'feature_lifting': 'ProjectionSum', 'neighborhoods': '${oc.select:model.backbone.neighborhoods,null}'}
Some datasets might require default transforms, which are applied whenever the dataset is used.
The topobenchmark library provides a simple way to define such default transformations and apply them to the dataset. Take a look at the TopoBenchmark/configs/transforms/dataset_defaults folder, where you can find default transformations for different datasets.
For example, REDDIT-BINARY does not have initial node features, and it is a common practice to define initial features as Gaussian noise. Hence the TopoBenchmark/configs/transforms/dataset_defaults/REDDIT-BINARY.yaml file incorporates a Gaussian-noise feature transform (equal_gaus_features) by default. Whenever you load the REDDIT-BINARY dataset (and do not modify the transforms parameter), this transform will be applied to the dataset.
```yaml
defaults:
  - data_manipulations: equal_gaus_features
  - liftings@_here_: ${get_required_lifting:graph,${model}}
```
Below we provide a quick tutorial on how to create data transformations and build a sequence of default transformations that will be executed whenever you use the defined dataset config file.
In [5]:
```python
# Note: we do not override the transforms parameter, so the dataset defaults apply
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/REDDIT-BINARY",
    ],
    return_hydra_config=True,
)

loader = instantiate(cfg.dataset.loader)
dataset, dataset_dir = loader.load()
```
REDDIT_BINARY dataset does not have any initial node features
In [6]:
dataset[0]
Out [6]:
Data(edge_index=[2, 480], y=[1], num_nodes=218)
Take a look at the default transforms and the parameters of the equal_gaus_features transform:
In [7]:
```python
print('Transform name:', cfg.transforms.keys())
print('Transform parameters:', cfg.transforms['equal_gaus_features'])
```
Transform name: dict_keys(['equal_gaus_features', 'graph2hypergraph_lifting'])
Transform parameters: {'_target_': 'topobenchmark.transforms.data_transform.DataTransform', 'transform_name': 'EqualGausFeatures', 'transform_type': 'data manipulation', 'mean': 0, 'std': 0.1, 'num_features': '${dataset.parameters.num_features}'}
In [8]:
```python
from topobenchmark.data.preprocessor import PreProcessor

preprocessed_dataset = PreProcessor(dataset, dataset_dir, cfg['transforms'])
```
Processing...
Done!
In [9]:
preprocessed_dataset[0]
Out [9]:
Data(x=[218, 10], edge_index=[2, 480], y=[1], incidence_hyperedges=[218, 218], num_hyperedges=[1], x_0=[218, 10], x_hyperedges=[218, 10], num_nodes=218)
The preprocessed dataset now has the features generated by the preprocessor, and its connectivity has been lifted into the hypergraph domain.
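As a quick check, reusing the objects from the cells above, you can compare a raw sample with its preprocessed counterpart:

```python
raw_sample = dataset[0]
processed_sample = preprocessed_dataset[0]

# The raw REDDIT-BINARY graph has no node features, while the preprocessed one
# carries the generated Gaussian features and the hypergraph incidence.
print("raw x:", getattr(raw_sample, "x", None))
print("processed x shape:", processed_sample.x.shape)
print("has incidence_hyperedges:", hasattr(processed_sample, "incidence_hyperedges"))
```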
Creating your own default transforms#
Now that we have seen how to add a custom dataset and how the default transforms work, you might want to create your own default transforms for a new dataset, which will be executed whenever the dataset is used under its default configuration.
To configure the default transforms, navigate to configs/transforms/dataset_defaults, create a <dataset_name>.yaml file, and add the following structure:
```yaml
defaults:
  - transform_1: transform_1
  - transform_2: transform_2
  - transform_3: transform_3
```
Important: There are different types of transforms, including data_manipulations, liftings, and feature_liftings. In case you want to use multiple transforms from the same category, say data_manipulations, it is required to stick to a special syntax. See the Hydra documentation for more information, or the example below:
```yaml
defaults:
  - data_manipulations@first_usage: transform_1
  - data_manipulations@second_usage: transform_2
```
Notes:#
Transforms from the same category: If there are two transforms from the same category, for example data_manipulations, it is required to use the @ operator to assign different names (first_usage and second_usage) to each transform.
In the case of equal_gaus_features, the equal_gaus_features.yaml config uses an interpolation to infer the feature dimension (the resolver logic is described in Step 3). If for some reason we want to specify the num_features parameter ourselves, we can override it in the defaults file without changing the transform config file:
```yaml
defaults:
  - data_manipulations@equal_gaus_features: equal_gaus_features
  - data_manipulations@some_transform: some_transform
  - liftings@_here_: ${get_required_lifting:graph,${model}}

equal_gaus_features:
  num_features: 100

some_transform:
  some_param: bla
```
We recommend always adding liftings@_here_: ${get_required_lifting:graph,${model}} so that a default lifting is applied when running any domain-specific topological model.
Step 4.2: Custom Data Transformations#
In general, any transform in the library inherits from the torch_geometric.transforms.BaseTransform class, which allows applying a sequence of transforms to the data. Our interface requires implementing the forward method. The important part of every transform is that it takes a torch_geometric.data.Data object and returns an updated torch_geometric.data.Data object.
For the language dataset, we have created the equal_gaus_features transform, which is a data manipulation transform; hence we place it into the topobenchmark/transforms/data_manipulation/ folder. Below you can see the EqualGausFeatures class:
```python
class EqualGausFeatures(torch_geometric.transforms.BaseTransform):
    r"""A transform that generates equal Gaussian features for all nodes.

    Parameters
    ----------
    **kwargs : optional
        Additional arguments for the class. It should contain the following keys:
        - mean (float): The mean of the Gaussian distribution.
        - std (float): The standard deviation of the Gaussian distribution.
        - num_features (int): The number of features to generate.
    """

    def __init__(self, **kwargs):
        super().__init__()
        self.type = "generate_non_informative_features"

        # Generate the feature vector from a Gaussian distribution
        self.mean = kwargs["mean"]
        self.std = kwargs["std"]
        self.feature_vector = kwargs["num_features"]
        self.feature_vector = torch.normal(
            mean=self.mean, std=self.std, size=(1, self.feature_vector)
        )

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(type={self.type!r}, mean={self.mean!r}, std={self.std!r}, feature_vector={self.feature_vector!r})"

    def forward(self, data: torch_geometric.data.Data):
        r"""Apply the transform to the input data.

        Parameters
        ----------
        data : torch_geometric.data.Data
            The input data.

        Returns
        -------
        torch_geometric.data.Data
            The transformed data.
        """
        data.x = self.feature_vector.expand(data.num_nodes, -1)
        return data
```
As said above, the forward function takes a torch_geometric.data.Data object as input, modifies it, and returns it.
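For a quick sanity check, the transform can be applied directly to a toy graph. The import path below is an assumption based on the folder layout described above; inside the framework the transform is normally instantiated through its configuration file instead:

```python
import torch
import torch_geometric

# Assumed import path, following topobenchmark/transforms/data_manipulation/.
from topobenchmark.transforms.data_manipulation import EqualGausFeatures

transform = EqualGausFeatures(mean=0.0, std=0.1, num_features=10)

data = torch_geometric.data.Data(
    edge_index=torch.tensor([[0, 1, 2], [1, 2, 0]]),
    num_nodes=3,
)
data = transform(data)
print(data.x.shape)  # torch.Size([3, 10]); every node gets the same feature vector
```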
Similarly to adding a dataset, the transformations you create and place in the right folder are registered automatically.
Now that the transform is registered, we can finally create the configuration file and use it in the framework:
```yaml
_target_: topobenchmark.transforms.data_transform.DataTransform
transform_name: "EqualGausFeatures"
transform_type: "data manipulation"
mean: 0
std: 0.1
num_features: ${dataset.parameters.num_features}
```
Please refer to configs/transforms/dataset_defaults/equal_gaus_features.yaml for the example.
Notes:
You might notice an interesting key, _target_, in the configuration file. In general, for any new transform the _target_ is always topobenchmark.transforms.data_transform.DataTransform. For more information please refer to the Hydra documentation, "Instantiating objects with Hydra" section.
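To see what _target_ buys you, here is a hedged sketch of instantiating such a config directly with Hydra. The num_features value is hard-coded here because the ${dataset.parameters.num_features} interpolation only resolves inside the full pipeline config; the keyword arguments follow the config shown above:

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

transform_cfg = OmegaConf.create({
    "_target_": "topobenchmark.transforms.data_transform.DataTransform",
    "transform_name": "EqualGausFeatures",
    "transform_type": "data manipulation",
    "mean": 0,
    "std": 0.1,
    "num_features": 10,  # hard-coded; the pipeline resolves it from the dataset config
})

# Hydra imports the class referenced by _target_ and passes the remaining
# keys as keyword arguments, returning a ready-to-use transform object.
transform = instantiate(transform_cfg)
print(transform)
```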