topobench.data.loaders.graph.tu_datasets module#

Loaders for TU datasets.

class AbstractLoader(parameters)#

Bases: ABC

Abstract class that provides an interface to load data.

Parameters:
parameters : DictConfig

Configuration parameters.

__init__(parameters)#
get_data_dir()#

Get the data directory.

Returns:
Path

The path to the dataset directory.

load(**kwargs)#

Load data.

Parameters:
**kwargs : dict

Additional keyword arguments.

Returns:
tuple[torch_geometric.data.Data, str]

Tuple containing the loaded data and the data directory.

abstract load_dataset()#

Load data into a dataset.

Returns:
Union[torch_geometric.data.Dataset, torch.utils.data.Dataset]

The loaded dataset, which could be a PyG or PyTorch dataset.

Raises:
NotImplementedError

If the method is not implemented.
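As a minimal sketch of the intended usage, a concrete loader only needs to implement load_dataset(). The PlanetoidLoader name, the Planetoid dataset, and the self.parameters attribute below are illustrative assumptions, not part of this module:

from omegaconf import OmegaConf
from torch_geometric.datasets import Planetoid

from topobench.data.loaders.graph.tu_datasets import AbstractLoader

class PlanetoidLoader(AbstractLoader):
    # Hypothetical subclass, for illustration only.
    def load_dataset(self):
        # get_data_dir() resolves the dataset directory from the
        # configuration; data_name is assumed to be a config key.
        return Planetoid(root=str(self.get_data_dir()),
                         name=self.parameters.data_name)

loader = PlanetoidLoader(
    OmegaConf.create({"data_dir": "./data", "data_name": "Cora"})
)
dataset = loader.load_dataset()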

class Dataset(root=None, transform=None, pre_transform=None, pre_filter=None, log=True, force_reload=False)#

Bases: Dataset

Dataset base class for creating graph datasets. See the PyTorch Geometric tutorial on creating your own datasets for an accompanying guide.

Parameters:
  • root (str, optional) – Root directory where the dataset should be saved. (default: None)

  • transform (callable, optional) – A function/transform that takes in a Data or HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a Data or HeteroData object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a Data or HeteroData object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • log (bool, optional) – Whether to print any console output while downloading and processing the dataset. (default: True)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
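
A hedged sketch of a minimal subclass: since download() and process() are only invoked when a subclass overrides them, an in-memory dataset only has to implement len() and get() (all names below are illustrative):

import torch
from torch_geometric.data import Data, Dataset

class ToyGraphDataset(Dataset):
    def __init__(self):
        super().__init__(root=None)
        # Three tiny random graphs with 4-dimensional node features.
        self._graphs = [
            Data(x=torch.randn(n, 4),
                 edge_index=torch.tensor([[0, 1], [1, 0]]))
            for n in (2, 3, 4)
        ]

    def len(self):
        return len(self._graphs)

    def get(self, idx):
        return self._graphs[idx]

dataset = ToyGraphDataset()
assert len(dataset) == 3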

__init__(root=None, transform=None, pre_transform=None, pre_filter=None, log=True, force_reload=False)#
download()#

Downloads the dataset to the self.raw_dir folder.

get(idx)#

Gets the data object at index idx.

get_summary()#

Collects summary statistics for the dataset.

index_select(idx)#

Creates a subset of the dataset from specified indices idx. Indices idx can be a slicing object, e.g., [2:5], a list, a tuple, or a torch.Tensor or np.ndarray of type long or bool.
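
For example (a sketch; dataset stands for any instance of this class):

import torch

subset = dataset[2:5]                     # slicing delegates to index_select
subset = dataset.index_select([0, 2, 4])  # explicit list of indices
mask = torch.zeros(len(dataset), dtype=torch.bool)
mask[:10] = True
subset = dataset.index_select(mask)       # boolean mask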

indices()#
len()#

Returns the number of data objects stored in the dataset.

print_summary(fmt='psql')#

Prints summary statistics of the dataset to the console.

Parameters:

fmt (str, optional) – Summary tables format. Available table formats are those supported by the tabulate library. (default: "psql")

process()#

Processes the dataset to the self.processed_dir folder.

shuffle(return_perm=False)#

Randomly shuffles the examples in the dataset.

Parameters:

return_perm (bool, optional) – If set to True, will also return the random permutation used to shuffle the dataset. (default: False)
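
For example:

# With return_perm=True the permutation tensor is returned alongside
# the shuffled dataset: perm[i] is the original index of example i.
dataset, perm = dataset.shuffle(return_perm=True)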

to_datapipe()#

Converts the dataset into a torch.utils.data.DataPipe.

The returned instance can then be used with PyG's built-in DataPipes for batching graphs as follows:

from torch_geometric.datasets import QM9

# Convert the dataset into a DataPipe and batch two graphs at a time.
dp = QM9(root='./data/QM9/').to_datapipe()
dp = dp.batch_graphs(batch_size=2, drop_last=True)

# Iterate over batches of graphs.
for batch in dp:
    pass

See the PyTorch tutorial for further background on DataPipes.

property has_download: bool#

Checks whether the dataset defines a download() method.

property has_process: bool#

Checks whether the dataset defines a process() method.

property num_classes: int#

Returns the number of classes in the dataset.

property num_edge_features: int#

Returns the number of features per edge in the dataset.

property num_features: int#

Returns the number of features per node in the dataset. Alias for num_node_features.

property num_node_features: int#

Returns the number of features per node in the dataset.

property processed_dir: str#

The directory where processed data is stored.

property processed_file_names: str | List[str] | Tuple[str, ...]#

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

property processed_paths: List[str]#

The absolute filepaths that must be present in order to skip processing.

property raw_dir: str#

The directory where the raw data is stored.

property raw_file_names: str | List[str] | Tuple[str, ...]#

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

property raw_paths: List[str]#

The absolute filepaths that must be present in order to skip downloading.

class DictConfig(content, key=None, parent=None, ref_type=typing.Any, key_type=typing.Any, element_type=typing.Any, is_optional=True, flags=None)#

Bases: BaseContainer, MutableMapping[Any, Any]

__init__(content, key=None, parent=None, ref_type=typing.Any, key_type=typing.Any, element_type=typing.Any, is_optional=True, flags=None)#
copy()#
get(key, default_value=None)#

Return the value for key if key is in the dictionary, else default_value (defaulting to None).

items() → a set-like object providing a view on D's items#
items_ex(resolve=True, keys=None)#
keys() → a set-like object providing a view on D's keys#
pop(k[, d]) → v, remove specified key and return the corresponding value.#

If key is not found, d is returned if given, otherwise KeyError is raised.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D#
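
A short usage sketch of the methods above (values are illustrative; OmegaConf.create returns a DictConfig for dict input):

from omegaconf import OmegaConf

cfg = OmegaConf.create({"data_dir": "./data", "data_name": "MUTAG"})
print(cfg.get("data_name"))             # MUTAG
print(cfg.get("data_type", "unknown"))  # unknown, since the key is absent
cfg.setdefault("data_type", "graph_classification")
print(cfg["data_type"])                 # graph_classification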
class TUDataset(root, name, transform=None, pre_transform=None, pre_filter=None, force_reload=False, use_node_attr=False, use_edge_attr=False, cleaned=False)#

Bases: InMemoryDataset

A variety of graph kernel benchmark datasets, e.g., "IMDB-BINARY", "REDDIT-BINARY" or "PROTEINS", collected from TU Dortmund University. In addition, this dataset wrapper provides cleaned dataset versions, as motivated by the “Understanding Isomorphism Bias in Graph Data Sets” paper, containing only non-isomorphic graphs.

Note

Some datasets may not come with any node labels. You can then either make use of the argument use_node_attr to load additional continuous node attributes (if present) or provide synthetic node features using transforms such as torch_geometric.transforms.Constant or torch_geometric.transforms.OneHotDegree.
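
For instance, a hedged sketch for IMDB-BINARY, which ships without node features (see the table below); 135 is the maximum node degree commonly used for this dataset:

from torch_geometric.datasets import TUDataset
from torch_geometric.transforms import OneHotDegree

# Synthesize one-hot degree features for a dataset without node labels.
dataset = TUDataset(root='./data/TU', name='IMDB-BINARY',
                    transform=OneHotDegree(135))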

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset.

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)

  • use_node_attr (bool, optional) – If True, the dataset will contain additional continuous node attributes (if present). (default: False)

  • use_edge_attr (bool, optional) – If True, the dataset will contain additional continuous edge attributes (if present). (default: False)

  • cleaned (bool, optional) – If True, the dataset will contain only non-isomorphic graphs. (default: False)

STATS:

Name           #graphs  #nodes  #edges   #features  #classes
MUTAG          188      ~17.9   ~39.6    7          2
ENZYMES        600      ~32.6   ~124.3   3          6
PROTEINS       1,113    ~39.1   ~145.6   3          2
COLLAB         5,000    ~74.5   ~4914.4  0          3
IMDB-BINARY    1,000    ~19.8   ~193.1   0          2
REDDIT-BINARY  2,000    ~429.6  ~995.5   0          2
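
For example, loading MUTAG reproduces the statistics above:

from torch_geometric.datasets import TUDataset

dataset = TUDataset(root='./data/TU', name='MUTAG')
print(len(dataset))         # 188
print(dataset.num_classes)  # 2
data = dataset[0]           # a single molecular graph as a Data object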

__init__(root, name, transform=None, pre_transform=None, pre_filter=None, force_reload=False, use_node_attr=False, use_edge_attr=False, cleaned=False)#
download()#

Downloads the dataset to the self.raw_dir folder.

process()#

Processes the dataset to the self.processed_dir folder.

cleaned_url = 'https://raw.githubusercontent.com/nd7141/graph_datasets/master/datasets'#
property num_edge_attributes: int#

The number of continuous edge attributes in the dataset.

property num_edge_labels: int#

The number of (one-hot encoded) edge labels in the dataset.

property num_node_attributes: int#

The number of continuous node attributes in the dataset.

property num_node_labels: int#

The number of (one-hot encoded) node labels in the dataset.

property processed_dir: str#

The directory where this dataset's processed files are stored.

property processed_file_names: str#

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

property raw_dir: str#

The directory where this dataset's raw files are stored.

property raw_file_names: List[str]#

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

url = 'https://www.chrsmrrs.com/graphkerneldatasets'#
class TUDatasetLoader(parameters)#

Bases: AbstractLoader

Load TU datasets.

Parameters:
parameters : DictConfig
Configuration parameters containing:
  • data_dir: Root directory for data

  • data_name: Name of the dataset

  • data_type: Type of the dataset (e.g., “graph_classification”)

__init__(parameters)#
load_dataset()#

Load TU dataset.

Returns:
Dataset

The loaded TU dataset.

Raises:
RuntimeError

If dataset loading fails.
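
A usage sketch under the configuration keys documented above (values are illustrative):

from omegaconf import OmegaConf

from topobench.data.loaders.graph.tu_datasets import TUDatasetLoader

parameters = OmegaConf.create({
    "data_dir": "./data",
    "data_name": "MUTAG",
    "data_type": "graph_classification",
})
loader = TUDatasetLoader(parameters)
dataset = loader.load_dataset()  # the TU dataset
data, data_dir = loader.load()   # loaded data plus the data directory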