topobench.transforms.data_manipulations.add_gpse_information module#

A transform that adds positional information using PyG 2.7’s GPSE implementation.

class AddGPSEInformation(**kwargs)#

Bases: BaseTransform

A transform that uses PyG 2.7’s pretrained GPSE to add positional and structural information to the graph.

Parameters:
**kwargs : optional

Parameters for the transform.

__init__(**kwargs)#
aggregate_inter_nbhd(x_out_per_route)#

Aggregate the outputs of the GNN for each rank.

The GNN handles intra-neighborhood aggregation; this method handles inter-neighborhood aggregation. Default: sum.

Parameters:
x_out_per_route : dict

The outputs of the GNN for each route.

Returns:
dict

The aggregated outputs of the GNN for each rank.
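The inter-neighborhood step can be pictured as grouping route outputs by destination rank and summing them. The following is a minimal sketch in plain Python, not the library's implementation. It assumes a hypothetical layout in which `x_out_per_route` maps a route index to a list of per-cell feature vectors and a parallel `routes` list gives each route's `[src_rank, dst_rank]`; the helper name `aggregate_by_dst_rank` is illustrative only, and the real method operates on tensors.

```python
def aggregate_by_dst_rank(routes, x_out_per_route):
    """Sum route outputs that land on the same destination rank.

    Assumed (hypothetical) layout: routes[i] == [src_rank, dst_rank] and
    x_out_per_route[i] is a list of feature vectors, one per destination cell.
    """
    aggregated = {}
    for route_index, (_, dst_rank) in enumerate(routes):
        x_out = x_out_per_route[route_index]
        if dst_rank not in aggregated:
            # First contribution for this rank: copy so the inputs stay untouched.
            aggregated[dst_rank] = [list(row) for row in x_out]
        else:
            # Element-wise sum with what has been accumulated so far.
            aggregated[dst_rank] = [
                [a + b for a, b in zip(acc_row, row)]
                for acc_row, row in zip(aggregated[dst_rank], x_out)
            ]
    return aggregated


routes = [[0, 0], [1, 0], [1, 1]]
x_out_per_route = {
    0: [[1.0, 2.0], [3.0, 4.0]],  # node -> node route
    1: [[0.5, 0.5], [1.0, 1.0]],  # edge -> node route
    2: [[7.0, 8.0]],              # edge -> edge route
}
x_out_per_rank = aggregate_by_dst_rank(routes, x_out_per_route)
```

Here the two routes ending at rank 0 are summed, while rank 1 receives a single contribution unchanged.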

forward(data)#

Apply the transform to the input data.

Parameters:
data : torch_geometric.data.Data

The input data.

Returns:
torch_geometric.data.Data

The transformed data.

forward_interank(src_rank, dst_rank, nbhd_cache, data)#

Forward for cells where src_rank!=dst_rank.

Parameters:
src_rank : int

Source rank of the transmitting cell.

dst_rank : int

Destination rank of the receiving cell.

nbhd_cache : dict

Cache of the neighborhood information.

data : torch_geometric.data.Data

The input data.

Returns:
data

The data object with messages passed.

forward_intrarank(src_rank, route_index, data)#

Forward for cells where src_rank==dst_rank.

Parameters:
src_rank : int

Source rank of the transmitting cell.

route_index : int

The index of this particular message passing route.

data : torch_geometric.data.Data

The input data.

Returns:
data

The data object with messages passed.

get_nbhd_cache(params)#

Cache the neighborhood information into a dict for the complex at hand.

Parameters:
params : dict

The parameters of the batch, containing the complex.

Returns:
dict

The neighborhood cache.

interrank_boundary_index(boundary_index, n_dst_nodes)#

Recover lifted graph.

Edge-to-node boundary relationships of a graph with n_nodes and n_edges can be represented as up-adjacency node relations. There are n_nodes+n_edges nodes in this lifted graph. Designed to work for regular (edge-to-node and face-to-edge) boundary relationships.

Parameters:
x_src : torch.tensor

Source node features. Shape [n_src_nodes, n_features]. Should represent edge or face features.

boundary_index : list of lists or list of tensors

boundary_index[0] stores the ids of the nodes in the boundary of the edges stored in boundary_index[1]; boundary_index[1] stores the edge ids.

n_dst_nodes : int

Number of destination nodes.

Returns:
edge_index : list of lists

The edge_index[0][i] and edge_index[1][i] are the two nodes of edge i.

edge_attr : tensor

Edge features are given by the features of the bounding node representing an edge. Shape [n_edges, n_features].

interrank_expand(params, src_rank, dst_rank, nbhd_cache)#

Expand the complex into an interrank Hasse graph.

Parameters:
params : dict

The parameters of the batch, containing the complex.

src_rank : int

The source rank.

dst_rank : int

The destination rank.

nbhd_cache : dict

The neighborhood cache containing the expanded boundary index and edge attributes.

Returns:
torch_geometric.data.Data

The expanded batch of interrank Hasse graphs for this route.

intrarank_expand(params, src_rank, nbhd)#

Expand the complex into an intrarank Hasse graph.

Parameters:
params : dict

The parameters of the batch, containing the complex.

src_rank : int

The source rank.

nbhd : str

The neighborhood to use.

Returns:
torch_geometric.data.Data

The expanded batch of intrarank Hasse graphs for this route.

class Data(x=None, edge_index=None, edge_attr=None, y=None, pos=None, time=None, **kwargs)#

Bases: BaseData, FeatureStore, GraphStore

A data object describing a homogeneous graph. The data object can hold node-level, link-level and graph-level attributes. In general, Data tries to mimic the behavior of a regular Python dictionary. In addition, it provides useful functionality for analyzing graph structures, and provides basic PyTorch tensor functionalities. See the accompanying PyG tutorial for details.

from torch_geometric.data import Data

data = Data(x=x, edge_index=edge_index, ...)

# Add additional arguments to `data`:
data.train_idx = torch.tensor([...], dtype=torch.long)
data.test_mask = torch.tensor([...], dtype=torch.bool)

# Analyzing the graph structure:
data.num_nodes
>>> 23

data.is_directed()
>>> False

# PyTorch tensor functionality:
data = data.pin_memory()
data = data.to('cuda:0', non_blocking=True)
Parameters:
  • x (torch.Tensor, optional) – Node feature matrix with shape [num_nodes, num_node_features]. (default: None)

  • edge_index (LongTensor, optional) – Graph connectivity in COO format with shape [2, num_edges]. (default: None)

  • edge_attr (torch.Tensor, optional) – Edge feature matrix with shape [num_edges, num_edge_features]. (default: None)

  • y (torch.Tensor, optional) – Graph-level or node-level ground-truth labels with arbitrary shape. (default: None)

  • pos (torch.Tensor, optional) – Node position matrix with shape [num_nodes, num_dimensions]. (default: None)

  • time (torch.Tensor, optional) – The timestamps for each event with shape [num_edges] or [num_nodes]. (default: None)

  • **kwargs (optional) – Additional attributes.

classmethod from_dict(mapping)#

Creates a Data object from a dictionary.

__init__(x=None, edge_index=None, edge_attr=None, y=None, pos=None, time=None, **kwargs)#
connected_components()#

Extracts connected components of the graph using a union-find algorithm. The components are returned as a list of Data objects, where each object represents a connected component of the graph.

data = Data()
data.x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
data.y = torch.tensor([[1.1], [2.1], [3.1], [4.1]])
data.edge_index = torch.tensor(
    [[0, 1, 2, 3], [1, 0, 3, 2]], dtype=torch.long
)

components = data.connected_components()
print(len(components))
>>> 2

print(components[0])
>>> Data(x=[2, 1], y=[2, 1], edge_index=[2, 2])
Returns:

A list of disconnected components.

Return type:

List[Data]
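To make the union-find idea concrete, here is a minimal sketch in plain Python that groups node ids into components from a COO-style `edge_index`. It is an illustration under simplifying assumptions (lists instead of tensors, node ids returned instead of Data objects), and the function name below is a standalone helper, not the PyG method.

```python
def connected_components(num_nodes, edge_index):
    """Group node ids into connected components with union-find."""
    parent = list(range(num_nodes))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union the endpoints of every edge.
    for src, dst in zip(edge_index[0], edge_index[1]):
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src

    # Collect nodes that share the same root.
    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Same connectivity as the example above: edges 0-1 and 2-3.
components = connected_components(4, [[0, 1, 2, 3], [1, 0, 3, 2]])
```

Each returned group lists the node ids of one component; building a per-component data object would then amount to slicing the node and edge attributes with these ids.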

debug()#
edge_subgraph(subset)#

Returns the induced subgraph given by the edge indices subset. Will currently preserve all the nodes in the graph, even if they are isolated after subgraph computation.

Parameters:

subset (LongTensor or BoolTensor) – The edges to keep.
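The "preserve all nodes" behavior can be illustrated with a small sketch in plain Python. It assumes `subset` is a list of edge positions and uses lists rather than tensors; the helper name `keep_edges` is hypothetical.

```python
def keep_edges(num_nodes, edge_index, subset):
    """Keep only the edges at the given positions; the node count is unchanged."""
    kept = [
        [edge_index[0][e] for e in subset],
        [edge_index[1][e] for e in subset],
    ]
    return num_nodes, kept


# Path graph 0-1-2 stored as directed edge pairs; keep only the 0-1 edges.
num_nodes, kept_edge_index = keep_edges(
    3, [[0, 1, 1, 2], [1, 0, 2, 1]], subset=[0, 1]
)
```

After the call, node 2 no longer appears in any edge but is still counted, which matches the note about isolated nodes being preserved.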

get_all_edge_attrs()#

Returns all registered edge attributes.

get_all_tensor_attrs()#

Obtains all feature attributes stored in Data.

is_edge_attr(key)#

Returns True if the object at key key denotes an edge-level tensor attribute.

is_node_attr(key)#

Returns True if the object at key key denotes a node-level tensor attribute.

stores_as(data)#
subgraph(subset)#

Returns the induced subgraph given by the node indices subset.

Parameters:

subset (LongTensor or BoolTensor) – The nodes to keep.
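An induced subgraph keeps only the edges whose two endpoints both lie in `subset`. The sketch below shows this in plain Python, assuming `subset` is a list of node ids and that kept nodes are relabeled to consecutive ids (an assumption about the relabeling convention; the helper name is hypothetical).

```python
def induced_subgraph(edge_index, subset):
    """Return the edges among `subset`, with nodes relabeled to 0..len(subset)-1."""
    # Map each kept node id to its new, compact id.
    new_id = {old: new for new, old in enumerate(subset)}
    src_out, dst_out = [], []
    for src, dst in zip(edge_index[0], edge_index[1]):
        # An edge survives only if both endpoints are kept.
        if src in new_id and dst in new_id:
            src_out.append(new_id[src])
            dst_out.append(new_id[dst])
    return [src_out, dst_out]


# Path graph 0-1-2; keeping nodes 1 and 2 leaves only the 1-2 edge pair.
sub_edge_index = induced_subgraph([[0, 1, 1, 2], [1, 0, 2, 1]], subset=[1, 2])
```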

to_dict()#

Returns a dictionary of stored key/value pairs.

to_heterogeneous(node_type=None, edge_type=None, node_type_names=None, edge_type_names=None)#

Converts a Data object to a heterogeneous HeteroData object. For this, node and edge attributes are split according to the node-level and edge-level vectors node_type and edge_type, respectively. node_type_names and edge_type_names can be used to give meaningful node and edge type names, respectively. That is, the node_type 0 is given by node_type_names[0]. If the Data object was constructed via to_homogeneous(), the object can be reconstructed without any need to pass in additional arguments.

Parameters:
  • node_type (torch.Tensor, optional) – A node-level vector denoting the type of each node. (default: None)

  • edge_type (torch.Tensor, optional) – An edge-level vector denoting the type of each edge. (default: None)

  • node_type_names (List[str], optional) – The names of node types. (default: None)

  • edge_type_names (List[Tuple[str, str, str]], optional) – The names of edge types. (default: None)
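The splitting rule for nodes can be sketched in a few lines of plain Python. This is only an illustration of how a node-level type vector partitions a feature matrix under the naming rule described above; the helper `split_by_type` and the type names are hypothetical, and edge handling is omitted.

```python
def split_by_type(x, node_type, node_type_names):
    """Partition node feature rows by their type id, keyed by type name."""
    out = {name: [] for name in node_type_names}
    for row, type_id in zip(x, node_type):
        # node_type value k is named node_type_names[k].
        out[node_type_names[type_id]].append(row)
    return out


x_dict = split_by_type(
    x=[[1.0], [2.0], [3.0]],
    node_type=[0, 1, 0],
    node_type_names=["paper", "author"],
)
```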

to_namedtuple()#

Returns a NamedTuple of stored key/value pairs.

update(data)#

Updates the data object with the elements from another data object. Added elements will override existing ones (in case of duplicates).

validate(raise_on_error=True)#

Validates the correctness of the data.

property batch: Tensor | None#


property edge_attr: Tensor | None#


property edge_index: Tensor | None#


property edge_stores: List[EdgeStorage]#


property edge_weight: Tensor | None#


property face: Tensor | None#


property node_stores: List[NodeStorage]#


property num_edge_features: int#

Returns the number of features per edge in the graph.

property num_edge_types: int#

Returns the number of edge types in the graph.

property num_faces: int | None#

Returns the number of faces in the mesh.

property num_features: int#

Returns the number of features per node in the graph. Alias for num_node_features.

property num_node_features: int#

Returns the number of features per node in the graph.

property num_node_types: int#

Returns the number of node types in the graph.

property num_nodes: int | None#

Returns the number of nodes in the graph.

Note

The number of nodes in the data object is automatically inferred when node-level attributes are present, e.g., data.x. In some cases, however, a graph may be given without any node-level attributes. PyG then guesses the number of nodes according to edge_index.max().item() + 1. If the graph contains isolated nodes, this number may be incorrect, which can result in unexpected behavior. Thus, we recommend setting the number of nodes in your data object explicitly via data.num_nodes = .... You will be given a warning that requests you to do so.
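A small worked example of why the guess can be wrong, written with plain Python lists instead of tensors (the variable names are illustrative):

```python
# A graph with 4 nodes where only nodes 0 and 1 are connected;
# nodes 2 and 3 are isolated and never appear in edge_index.
edge_index = [[0, 1], [1, 0]]
true_num_nodes = 4

# The fallback rule from the note: largest node id seen in any edge, plus one.
inferred_num_nodes = max(max(edge_index[0]), max(edge_index[1])) + 1

# The isolated nodes are missed, so the inferred count is too small.
undercount = true_num_nodes - inferred_num_nodes
```

Setting the node count explicitly avoids this undercount.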

property pos: Tensor | None#


property stores: List[BaseStorage]#


property time: Tensor | None#


property x: Tensor | None#


property y: Tensor | int | float | None#


class GPSE(dim_in=20, dim_out=51, dim_inner=512, layer_type='resgatedgcnconv', layers_pre_mp=1, layers_mp=20, layers_post_mp=2, num_node_targets=51, num_graph_targets=11, stage_type='skipsum', has_bn=True, head_bn=False, final_l2norm=True, has_l2norm=True, dropout=0.2, has_act=True, final_act=True, act='relu', virtual_node=True, multi_head_dim_inner=32, graph_pooling='add', use_repr=True, repr_type='no_post_mp', bernoulli_threshold=0.5)#

Bases: Module

The Graph Positional and Structural Encoder (GPSE) model from the “Graph Positional and Structural Encoder” paper.

The GPSE model consists of a (1) deep GNN that consists of stacked message passing layers, and a (2) prediction head to predict pre-computed positional and structural encodings (PSE). When used on downstream datasets, these prediction heads are removed and the final fully-connected layer outputs are used as learned PSE embeddings.

GPSE also provides a static method from_pretrained() to load pre-trained GPSE models trained on a variety of molecular datasets.

from torch_geometric.nn import GPSE, GPSENodeEncoder
from torch_geometric.transforms import AddGPSE
from torch_geometric.nn.models.gpse import precompute_GPSE
from torch_geometric.datasets import ZINC

gpse_model = GPSE.from_pretrained('molpcba')

# Option 1: Precompute GPSE encodings in-place for a given dataset
dataset = ZINC(path, subset=True, split='train')
precompute_GPSE(gpse_model, dataset)

# Option 2: Use the GPSE model with AddGPSE as a pre_transform to save
# the encodings
dataset = ZINC(path, subset=True, split='train',
               pre_transform=AddGPSE(gpse_model, vn=True,
               rand_type='NormalSE'))

Both approaches append the generated encodings to the pestat_GPSE attribute of Data objects. To use the GPSE encodings for a downstream task, one may need to add these encodings to the x attribute of the Data objects. To do so, one can use the GPSENodeEncoder provided to map these encodings to a desired dimension before appending them to x.

Let’s say we have a graph dataset with 64 original node features, and we have generated GPSE encodings of dimension 32, i.e. data.pestat_GPSE has 32 columns. Additionally, we want to use a GNN with an inner dimension of 128. To do so, we can map the 32-dimensional GPSE encodings to a higher dimension of 64, and then append them to the x attribute of the Data objects to obtain a 128-dimensional node feature representation. GPSENodeEncoder handles both this mapping and concatenation to x, the outputs of which can be used as input to a GNN:

encoder = GPSENodeEncoder(dim_emb=128, dim_pe_in=32, dim_pe_out=64,
                          expand_x=False)
gnn = GNN(...)

for batch in loader:
    x = encoder(batch.x, batch.pestat_GPSE)
    out = gnn(x, batch.edge_index)
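The dimension bookkeeping in this example (32 mapped to 64, then concatenated with 64 original features to give 128) can be checked with a small sketch in plain Python. It only mirrors the shape arithmetic: the fixed weight matrix and the `linear` helper are stand-ins for the learned projection inside the encoder, not its actual implementation.

```python
def linear(rows, weight):
    """Multiply an [n, d_in] list of rows by a [d_in, d_out] weight matrix."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(row[k] * weight[k][j] for k in range(d_in)) for j in range(d_out)]
        for row in rows
    ]


num_nodes, dim_x, dim_pe_in, dim_pe_out = 3, 64, 32, 64

x = [[0.0] * dim_x for _ in range(num_nodes)]        # original node features
pe = [[1.0] * dim_pe_in for _ in range(num_nodes)]   # GPSE encodings

# Stand-in for the learned 32 -> 64 projection.
weight = [[0.01] * dim_pe_out for _ in range(dim_pe_in)]
pe_projected = linear(pe, weight)

# Concatenate per node: 64 original features + 64 projected encodings.
h = [x_row + pe_row for x_row, pe_row in zip(x, pe_projected)]
```

Each row of `h` then has the 128 features expected by a GNN with an inner dimension of 128.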
Parameters:
  • dim_in (int, optional) – Input dimension. (default: 20)

  • dim_out (int, optional) – Output dimension. (default: 51)

  • dim_inner (int, optional) – Width of the encoder layers. (default: 512)

  • layer_type (str, optional) – Type of graph convolutional layer for message-passing. (default: resgatedgcnconv)

  • layers_pre_mp (int, optional) – Number of MLP layers before message-passing. (default: 1)

  • layers_mp (int, optional) – Number of layers for message-passing. (default: 20)

  • layers_post_mp (int, optional) – Number of MLP layers after message-passing. (default: 2)

  • num_node_targets (int, optional) – Number of individual PSEs used as node-level targets in pretraining GPSE. (default: 51)

  • num_graph_targets (int, optional) – Number of graph-level targets used in pretraining GPSE. (default: 11)

  • stage_type (str, optional) – The type of staging to apply. Possible values are: skipsum, skipconcat. Any other value will default to no skip connections. (default: skipsum)

  • has_bn (bool, optional) – Whether to apply batch normalization in the layer. (default: True)

  • final_l2norm (bool, optional) – Whether to apply L2 normalization to the outputs. (default: True)

  • has_l2norm (bool, optional) – Whether to apply L2 normalization after the layer. (default: True)

  • dropout (float, optional) – Dropout ratio at layer output. (default: 0.2)

  • has_act (bool, optional) – Whether to apply an activation after the layer. (default: True)

  • final_act (bool, optional) – Whether to apply activation after the layer stack. (default: True)

  • act (str, optional) – Activation to apply to layer output if has_act is True. (default: relu)

  • virtual_node (bool, optional) – Whether a virtual node is added to graphs in GPSE computation. (default: True)

  • multi_head_dim_inner (int, optional) – Width of MLPs for PSE target prediction heads. (default: 32)

  • graph_pooling (str, optional) – Type of graph pooling applied before post_mp. Options are add, max, mean. (default: add)

  • use_repr (bool, optional) – Whether to use the hidden representation of the final layer as GPSE encodings. (default: True)

  • repr_type (str, optional) – Type of representation to use. Options are no_post_mp, one_layer_before. (default: no_post_mp)

  • bernoulli_threshold (float, optional) – Threshold for Bernoulli sampling of virtual nodes. (default: 0.5)
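The difference between the stage_type options can be sketched on plain Python lists. This is an illustration of the usual meaning of sum-skip and concat-skip connections, assuming each layer is a function from a feature vector to a feature vector; the concatenation order and the `run_stage` helper are assumptions, not the GPSE source.

```python
def run_stage(x, layers, stage_type):
    """Apply layers in sequence with the chosen skip-connection scheme."""
    for layer in layers:
        out = layer(x)
        if stage_type == "skipsum":
            # Residual connection: add the layer input to its output.
            x = [a + b for a, b in zip(x, out)]
        elif stage_type == "skipconcat":
            # Concatenate the layer input with its output (width grows).
            x = x + out
        else:
            # Any other value: plain stacking, no skip connection.
            x = out
    return x


def double(v):
    """Toy layer that works for any input width."""
    return [2 * a for a in v]


x0 = [1.0, 2.0]
y_sum = run_stage(x0, [double, double], "skipsum")
y_cat = run_stage(x0, [double, double], "skipconcat")
y_plain = run_stage(x0, [double, double], "stack")
```

Note that with concatenation the feature width doubles at every layer in this toy setup, so real layers need input sizes that account for the growth.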

classmethod from_pretrained(name, root='GPSE_pretrained')#

Returns a pretrained GPSE model on a dataset.

Parameters:
  • name (str) – The name of the dataset ("molpcba", "zinc", "pcqm4mv2", "geom", "chembl").

  • root (str, optional) – The root directory to save the pre-trained model. (default: "GPSE_pretrained")

__init__(dim_in=20, dim_out=51, dim_inner=512, layer_type='resgatedgcnconv', layers_pre_mp=1, layers_mp=20, layers_post_mp=2, num_node_targets=51, num_graph_targets=11, stage_type='skipsum', has_bn=True, head_bn=False, final_l2norm=True, has_l2norm=True, dropout=0.2, has_act=True, final_act=True, act='relu', virtual_node=True, multi_head_dim_inner=32, graph_pooling='add', use_repr=True, repr_type='no_post_mp', bernoulli_threshold=0.5)#

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(batch)#

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reset_parameters()#
url_dict = {'chembl': 'https://zenodo.org/record/8145095/files/gpse_model_chembl_1.0.pt', 'geom': 'https://zenodo.org/record/8145095/files/gpse_model_geom_1.0.pt', 'molpcba': 'https://zenodo.org/record/8145095/files/gpse_model_molpcba_1.0.pt', 'pcqm4mv2': 'https://zenodo.org/record/8145095/files/gpse_model_pcqm4mv2_1.0.pt', 'zinc': 'https://zenodo.org/record/8145095/files/gpse_model_zinc_1.0.pt'}#
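As an illustration of how a dataset name relates to the entries of url_dict, the sketch below resolves a name to its URL and a candidate local path under root. The helper `resolve_checkpoint` is hypothetical, and the local file naming is an assumption; the documentation above does not state how from_pretrained() names the downloaded file.

```python
import posixpath

# URLs copied from url_dict above.
URL_DICT = {
    "chembl": "https://zenodo.org/record/8145095/files/gpse_model_chembl_1.0.pt",
    "geom": "https://zenodo.org/record/8145095/files/gpse_model_geom_1.0.pt",
    "molpcba": "https://zenodo.org/record/8145095/files/gpse_model_molpcba_1.0.pt",
    "pcqm4mv2": "https://zenodo.org/record/8145095/files/gpse_model_pcqm4mv2_1.0.pt",
    "zinc": "https://zenodo.org/record/8145095/files/gpse_model_zinc_1.0.pt",
}


def resolve_checkpoint(name, root="GPSE_pretrained"):
    """Return (url, local_path) for a pretrained model name (hypothetical helper)."""
    if name not in URL_DICT:
        raise ValueError(
            f"Unknown dataset {name!r}; expected one of {sorted(URL_DICT)}"
        )
    url = URL_DICT[name]
    # Assumption: the checkpoint keeps the file name from the URL.
    filename = url.rsplit("/", 1)[-1]
    return url, posixpath.join(root, filename)


url, local_path = resolve_checkpoint("zinc")
```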
get_routes_from_neighborhoods(neighborhoods)#

Get the routes from the neighborhoods.

Each route is a [src_rank, dst_rank] pair, e.g. [[0, 0], [1, 0], [1, 1], [1, 1], [2, 1]].

Parameters:
neighborhoods : list

List of neighborhoods of interest.

Returns:
list

List of routes.
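A minimal sketch of how such routes could be derived is shown below. It assumes neighborhood names of the form `<direction>_<type>-<src_rank>` (for example `up_adjacency-0` or `down_incidence-1`), where adjacency-like neighborhoods stay within a rank and incidences move one rank up or down. Both the naming convention and the helper name are assumptions for illustration, and multi-hop prefixes are not handled.

```python
def routes_from_neighborhoods(neighborhoods):
    """Map neighborhood names to [src_rank, dst_rank] routes (illustrative)."""
    routes = []
    for neighborhood in neighborhoods:
        # Assumed format: "<direction>_<type>-<src_rank>".
        name, src_rank_str = neighborhood.rsplit("-", 1)
        src_rank = int(src_rank_str)
        if "incidence" in name:
            # Incidences connect neighboring ranks.
            step = -1 if name.startswith("down") else 1
            dst_rank = src_rank + step
        else:
            # Adjacencies (and similar) stay within the same rank.
            dst_rank = src_rank
        routes.append([src_rank, dst_rank])
    return routes


example_routes = routes_from_neighborhoods(
    [
        "up_adjacency-0",
        "down_incidence-1",
        "up_adjacency-1",
        "down_adjacency-1",
        "down_incidence-2",
    ]
)
```

With these inputs the sketch reproduces the example list of routes given above, including the repeated [1, 1] route from two different rank-1 adjacencies.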

interrank_boundary_index(x_src, boundary_index, n_dst_nodes)#

Recover lifted graph.

Edge-to-node boundary relationships of a graph with n_nodes and n_edges can be represented as up-adjacency node relations. There are n_nodes+n_edges nodes in this lifted graph. Designed to work for regular (edge-to-node and face-to-edge) boundary relationships.

Parameters:
x_src : torch.tensor

Source node features. Shape [n_src_nodes, n_features]. Should represent edge or face features.

boundary_index : list of lists or list of tensors

boundary_index[0] stores the ids of the nodes in the boundary of the edges stored in boundary_index[1]; boundary_index[1] stores the edge ids.

n_dst_nodes : int

Number of destination nodes.

Returns:
edge_index : list of lists

The edge_index[0][i] and edge_index[1][i] are the two nodes of edge i.

edge_attr : tensor

Edge features are given by the features of the bounding node representing an edge. Shape [n_edges, n_features].
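The lifting described above can be sketched with plain Python lists. The sketch assumes that destination cells keep ids 0..n_dst_nodes-1, that each source cell j becomes lifted node n_dst_nodes + j, and that lifted edges point from source cell to destination cell; the orientation and the helper name `lift_boundary_index` are assumptions, and the actual function works on tensors.

```python
def lift_boundary_index(x_src, boundary_index, n_dst_nodes):
    """Build the lifted graph's edge_index and edge_attr (illustrative)."""
    dst_ids, src_ids = boundary_index
    # Source cells are renumbered to come after the destination nodes.
    lifted_src_ids = [n_dst_nodes + s for s in src_ids]
    edge_index = [lifted_src_ids, list(dst_ids)]
    # Each lifted edge carries the features of its source cell.
    edge_attr = [x_src[s] for s in src_ids]
    return edge_index, edge_attr


# Path graph with 3 nodes and 2 edges: edge 0 = (0, 1), edge 1 = (1, 2).
boundary_index = [[0, 1, 1, 2], [0, 0, 1, 1]]
x_src = [[10.0], [20.0]]  # one feature per edge
edge_index, edge_attr = lift_boundary_index(x_src, boundary_index, n_dst_nodes=3)
```

The lifted graph here has 3 + 2 = 5 nodes, with ids 3 and 4 standing for the two original edges.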