topobench.data.loaders.graph.adme_datasets module#
Loaders for TDC (Therapeutics Data Commons) ADME datasets with SMILES to graph conversion.
- class ADME(name, path='./data', label_name=None, print_stats=False, convert_format=None)#
Bases: DataLoader
Data loader class for datasets in the ADME task. More info: https://tdcommons.ai/single_pred_tasks/adme/
- Parameters:
name (str) – the dataset name.
path (str, optional) – The path to save the data file, defaults to ‘./data’
label_name (str, optional) – For multi-label dataset, specify the label name, defaults to None
print_stats (bool, optional) – Whether to print basic statistics of the dataset, defaults to False
convert_format (str, optional) – Automatically convert SMILES to another molecular format using the MolConvert class. The converted values are stored as a separate column in the dataframe; defaults to None
- __init__(name, path='./data', label_name=None, print_stats=False, convert_format=None)#
Create ADME dataloader object.
- get_approved_set()#
- get_other_species(species=None)#
- harmonize(mode=None)#
Remove duplicated experimental readouts.
- class ADMEDatasetLoader(parameters)#
Bases: AbstractLoader
Load TDC ADME datasets with SMILES to graph conversion using OGB featurization.
This loader:
1. Loads ADME datasets from TDC (Therapeutics Data Commons).
2. Converts SMILES strings to PyG graphs using OGB’s standard featurization.
3. Uses fixed scaffold splits from TDC.
4. Returns graphs compatible with OGB molecular property prediction.
- Node features (9-dimensional):
Atomic number
Chirality
Degree
Formal charge
Number of hydrogens
Number of radical electrons
Hybridization
Is aromatic
Is in ring
- Edge features (3-dimensional):
Bond type
Bond stereochemistry
Is conjugated
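As a rough illustration of the layout listed above, the sketch below pairs a feature row with the nine node feature names in the documented order. The short names (`atomic_num`, `num_hs`, and so on), the helper `label_features`, and the sample values are hypothetical labels chosen for this example; they are not identifiers exposed by the loader.

```python
# Feature names in the order listed above. These labels are illustrative only.
NODE_FEATURES = (
    "atomic_num", "chirality", "degree", "formal_charge", "num_hs",
    "num_radical_e", "hybridization", "is_aromatic", "is_in_ring",
)
EDGE_FEATURES = ("bond_type", "bond_stereo", "is_conjugated")


def label_features(vector, names):
    """Pair each feature value with its name, checking the dimensionality."""
    if len(vector) != len(names):
        raise ValueError(f"expected {len(names)} features, got {len(vector)}")
    return dict(zip(names, vector))


# Hypothetical feature row for one atom; the values are placeholders.
atom_row = [5, 0, 4, 5, 3, 0, 2, 0, 0]
labeled = label_features(atom_row, NODE_FEATURES)
print(labeled["degree"])
```

A check like this can help catch a node feature matrix whose second dimension is not 9, or an edge feature matrix whose second dimension is not 3.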
- Parameters:
- parameters (DictConfig)
Configuration parameters containing:
data_dir: Root directory for data
data_name: Name of the ADME dataset
data_type: Type of the dataset (e.g., “ADME”)
- __init__(parameters)#
- get_data_dir()#
Get the data directory.
- Returns:
- Path
The path to the dataset directory. Format: {root_data_dir}/{dataset_name}/. Example: data/graph/ADME/BBB_Martins/.
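The documented format can be reproduced with a small path join. This is only a sketch of the `{root_data_dir}/{dataset_name}/` pattern; the helper `dataset_dir` is hypothetical, and how the loader derives the root directory from `data_dir` is not stated here, so the root is passed in directly.

```python
from pathlib import PurePosixPath


def dataset_dir(root_data_dir: str, dataset_name: str) -> PurePosixPath:
    """Join the root directory and the dataset name, as in the documented format."""
    return PurePosixPath(root_data_dir) / dataset_name


# Values taken from the example in the description above.
example = dataset_dir("data/graph/ADME", "BBB_Martins")
print(example)
```

`PurePosixPath` is used so the printed form uses forward slashes on every platform; real code would more likely use `Path`.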
- load_dataset()#
Load the ADME dataset with predefined scaffold splits.
- Returns:
- InMemoryDataset
The dataset with converted graphs and predefined splits.
- Raises:
- RuntimeError
If dataset loading or SMILES conversion fails.
- ValueError
If invalid SMILES strings are encountered.
- ImportError
If PyTDC or rdkit (via ogb) are not installed.
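One way a caller might react to the three documented exception types is sketched below. The wrapper `try_load` and the stand-in class `FailingLoader` are hypothetical and exist only to exercise the handler; the only assumption about the real loader is that it exposes `load_dataset()` and raises the exceptions listed above.

```python
def try_load(loader):
    """Call loader.load_dataset() and map documented errors to a status string."""
    try:
        return loader.load_dataset(), "ok"
    except ImportError as err:
        # Raised when PyTDC or rdkit (via ogb) is missing.
        return None, f"missing dependency: {err}"
    except ValueError as err:
        # Raised when an invalid SMILES string is encountered.
        return None, f"invalid SMILES: {err}"
    except RuntimeError as err:
        # Raised when loading or conversion fails for another reason.
        return None, f"loading failed: {err}"


class FailingLoader:
    """Stand-in used only for this example; not part of TopoBench."""

    def load_dataset(self):
        raise ValueError("could not parse 'C1CC'")


dataset, status = try_load(FailingLoader())
print(status)
```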
- class AbstractLoader(parameters)#
Bases: ABC
Abstract class that provides an interface to load data.
- Parameters:
- parameters (DictConfig)
Configuration parameters.
- __init__(parameters)#
- get_data_dir()#
Get the data directory.
- Returns:
- Path
The path to the dataset directory.
- load(**kwargs)#
Load data.
- Parameters:
- **kwargs (dict)
Additional keyword arguments.
- Returns:
- tuple[torch_geometric.data.Data, str]
Tuple containing the loaded data and the data directory.
- abstractmethod load_dataset()#
Load data into a dataset.
- Returns:
- Union[torch_geometric.data.Dataset, torch.utils.data.Dataset]
The loaded dataset, which could be a PyG or PyTorch dataset.
- Raises:
- NotImplementedError
If the method is not implemented.
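The interface described above follows the usual abstract base class pattern: subclasses supply `load_dataset()`, and `load()` returns the data together with the data directory. The sketch below is a simplified stand-in rather than the real `AbstractLoader`; `SketchLoader`, `ListLoader`, the plain-dict configuration, and the way `get_data_dir()` reads `data_dir` are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchLoader(ABC):
    """Simplified stand-in for the loader interface described above."""

    def __init__(self, parameters: dict):
        self.parameters = parameters

    def get_data_dir(self) -> Path:
        # Assumption: the directory is read directly from the configuration.
        return Path(self.parameters["data_dir"])

    @abstractmethod
    def load_dataset(self):
        """Subclasses build and return the dataset here."""

    def load(self, **kwargs):
        # Mirrors the documented return shape: (data, data directory).
        return self.load_dataset(), str(self.get_data_dir())


class ListLoader(SketchLoader):
    """Toy subclass that returns a list instead of a PyG dataset."""

    def load_dataset(self):
        return ["graph_0", "graph_1"]


data, data_dir = ListLoader({"data_dir": "data/toy"}).load()
print(data, data_dir)
```

Because `load_dataset` is abstract, instantiating `SketchLoader` directly raises `TypeError`, which is how Python enforces that subclasses implement it.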
- class Data(x=None, edge_index=None, edge_attr=None, y=None, pos=None, time=None, **kwargs)#
Bases: BaseData, FeatureStore, GraphStore
A data object describing a homogeneous graph. The data object can hold node-level, link-level and graph-level attributes. In general, Data tries to mimic the behavior of a regular Python dictionary. In addition, it provides useful functionality for analyzing graph structures, and provides basic PyTorch tensor functionalities. See the accompanying tutorial in the PyG documentation.
    from torch_geometric.data import Data

    data = Data(x=x, edge_index=edge_index, ...)

    # Add additional arguments to `data`:
    data.train_idx = torch.tensor([...], dtype=torch.long)
    data.test_mask = torch.tensor([...], dtype=torch.bool)

    # Analyzing the graph structure:
    data.num_nodes
    >>> 23

    data.is_directed()
    >>> False

    # PyTorch tensor functionality:
    data = data.pin_memory()
    data = data.to('cuda:0', non_blocking=True)
- Parameters:
x (torch.Tensor, optional) – Node feature matrix with shape [num_nodes, num_node_features]. (default: None)
edge_index (LongTensor, optional) – Graph connectivity in COO format with shape [2, num_edges]. (default: None)
edge_attr (torch.Tensor, optional) – Edge feature matrix with shape [num_edges, num_edge_features]. (default: None)
y (torch.Tensor, optional) – Graph-level or node-level ground-truth labels with arbitrary shape. (default: None)
pos (torch.Tensor, optional) – Node position matrix with shape [num_nodes, num_dimensions]. (default: None)
time (torch.Tensor, optional) – The timestamps for each event with shape [num_edges] or [num_nodes]. (default: None)
**kwargs (optional) – Additional attributes.
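The COO layout expected for edge_index stores all source nodes in one row and all target nodes in a second row. The sketch below builds that layout from an edge list using plain Python lists so it runs without PyTorch; the helper `to_coo` is hypothetical. With PyG installed, the result would typically be wrapped as `torch.tensor(edge_index, dtype=torch.long)`.

```python
def to_coo(edges, undirected=True):
    """Return [sources, targets], i.e. a 2 x num_edges layout."""
    src, dst = [], []
    for u, v in edges:
        src.append(u)
        dst.append(v)
        if undirected and u != v:
            # An undirected edge is stored as two directed edges.
            src.append(v)
            dst.append(u)
    return [src, dst]


edge_index = to_coo([(0, 1), (2, 3)])
print(edge_index)  # [[0, 1, 2, 3], [1, 0, 3, 2]]
```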
- classmethod from_dict(mapping)#
Creates a Data object from a dictionary.
- __init__(x=None, edge_index=None, edge_attr=None, y=None, pos=None, time=None, **kwargs)#
- connected_components()#
Extracts connected components of the graph using a union-find algorithm. The components are returned as a list of Data objects, where each object represents a connected component of the graph.
    data = Data()
    data.x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
    data.y = torch.tensor([[1.1], [2.1], [3.1], [4.1]])
    data.edge_index = torch.tensor(
        [[0, 1, 2, 3], [1, 0, 3, 2]], dtype=torch.long
    )

    components = data.connected_components()
    print(len(components))
    >>> 2

    print(components[0])
    >>> Data(x=[2, 1], y=[2, 1], edge_index=[2, 2])
- Returns:
A list of disconnected components.
- Return type:
List[Data]
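For readers unfamiliar with union-find, the following is a minimal pure-Python sketch of the idea, applied to the same edge_index as the example above. It is not PyG's implementation: the standalone function and its `num_nodes` argument are assumptions for illustration, and it returns lists of node indices rather than Data objects.

```python
def connected_components(num_nodes, edge_index):
    """Group node indices into connected components with union-find."""
    parent = list(range(num_nodes))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union the endpoints of every edge.
    for u, v in zip(*edge_index):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    # Collect nodes that share a root.
    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


components = connected_components(4, [[0, 1, 2, 3], [1, 0, 3, 2]])
print(components)  # [[0, 1], [2, 3]]
```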
- debug()#
- edge_subgraph(subset)#
Returns the induced subgraph given by the edge indices subset. Will currently preserve all the nodes in the graph, even if they are isolated after subgraph computation.
- Parameters:
subset (LongTensor or BoolTensor) – The edges to keep.
- get_all_edge_attrs()#
Returns all registered edge attributes.
- get_all_tensor_attrs()#
Obtains all feature attributes stored in Data.
- stores_as(data)#
- subgraph(subset)#
Returns the induced subgraph given by the node indices subset.
- Parameters:
subset (LongTensor or BoolTensor) – The nodes to keep.
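An induced subgraph keeps only the edges whose two endpoints are both in the chosen node subset. The sketch below shows that rule on plain lists. The helper `induced_subgraph` is hypothetical, and relabeling the kept nodes to 0..k-1 is an assumption about the desired output rather than a statement of how this method behaves.

```python
def induced_subgraph(subset, edge_index):
    """Keep edges with both endpoints in subset and relabel nodes to 0..k-1."""
    keep = set(subset)
    new_id = {old: new for new, old in enumerate(sorted(keep))}
    src, dst = [], []
    for u, v in zip(*edge_index):
        if u in keep and v in keep:
            src.append(new_id[u])
            dst.append(new_id[v])
    return [src, dst]


# Nodes 1, 2, 3 are kept; the 0-1 edge is dropped because node 0 is excluded.
sub_edges = induced_subgraph([1, 2, 3], [[0, 1, 2, 3], [1, 0, 3, 2]])
print(sub_edges)  # [[1, 2], [2, 1]]
```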
- to_dict()#
Returns a dictionary of stored key/value pairs.
- to_heterogeneous(node_type=None, edge_type=None, node_type_names=None, edge_type_names=None)#
Converts a Data object to a heterogeneous HeteroData object. For this, node and edge attributes are split according to the node-level and edge-level vectors node_type and edge_type, respectively. node_type_names and edge_type_names can be used to give meaningful node and edge type names, respectively. That is, the node_type 0 is given by node_type_names[0]. If the Data object was constructed via to_homogeneous(), the object can be reconstructed without any need to pass in additional arguments.
- Parameters:
node_type (torch.Tensor, optional) – A node-level vector denoting the type of each node. (default: None)
edge_type (torch.Tensor, optional) – An edge-level vector denoting the type of each edge. (default: None)
node_type_names (List[str], optional) – The names of node types. (default: None)
edge_type_names (List[Tuple[str, str, str]], optional) – The names of edge types. (default: None)
- to_namedtuple()#
Returns a NamedTuple of stored key/value pairs.
- update(data)#
Updates the data object with the elements from another data object. Added elements will override existing ones (in case of duplicates).
- validate(raise_on_error=True)#
Validates the correctness of the data.
- property num_features: int#
Returns the number of features per node in the graph. Alias for num_node_features.
- property num_nodes: int | None#
Returns the number of nodes in the graph.
Note
The number of nodes in the data object is automatically inferred in case node-level attributes are present, e.g., data.x. In some cases, however, a graph may only be given without any node-level attributes. PyG then guesses the number of nodes according to edge_index.max().item() + 1. However, if there are isolated nodes, this number may be incorrect, which can result in unexpected behavior. Thus, we recommend setting the number of nodes in your data object explicitly via data.num_nodes = .... You will be given a warning that requests you to do so.
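The pitfall in the note above can be reproduced with plain lists. In this sketch the hypothetical helper `infer_num_nodes` mimics the `edge_index.max() + 1` guess, and the graph is assumed to have three nodes, of which node 2 has no edges.

```python
def infer_num_nodes(edge_index):
    """Guess the node count as the largest index in edge_index plus one."""
    return max(max(edge_index[0]), max(edge_index[1])) + 1


edge_index = [[0, 1], [1, 0]]  # node 2 exists but is isolated
true_num_nodes = 3
guessed = infer_num_nodes(edge_index)
print(guessed, true_num_nodes)  # 2 3
```

The guess undercounts by one here, which is why setting `data.num_nodes` explicitly is the safer choice when no node-level attributes are stored.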
- class DictConfig(content, key=None, parent=None, ref_type=typing.Any, key_type=typing.Any, element_type=typing.Any, is_optional=True, flags=None)#
Bases: BaseContainer, MutableMapping[Any, Any]
- __init__(content, key=None, parent=None, ref_type=typing.Any, key_type=typing.Any, element_type=typing.Any, is_optional=True, flags=None)#
- copy()#
- get(key, default_value=None)#
Return the value for key if key is in the dictionary, else default_value (defaulting to None).
- items()#
Return a set-like object providing a view on D's items.
- items_ex(resolve=True, keys=None)#
- keys()#
Return a set-like object providing a view on D's keys.
- pop(k[, d])#
Remove the specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised.
- setdefault(k[, d])#
Return D.get(k, d), and also set D[k] = d if k is not in D.
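The get, pop, and setdefault descriptions above are the standard mapping semantics. The sketch below demonstrates them on a plain `dict` so it runs without OmegaConf. It assumes DictConfig behaves the same way for these calls, which may not hold in every configuration mode; the keys are borrowed from the loader parameters documented earlier.

```python
params = {"data_dir": "data/graph", "data_name": "BBB_Martins"}

# get() returns the default without storing it.
data_type = params.get("data_type", "ADME")
stored_after_get = "data_type" in params

# setdefault() returns the default and stores it when the key is missing.
params.setdefault("data_type", "ADME")
stored_after_setdefault = "data_type" in params

# pop() removes the key and returns its value; a default avoids KeyError.
name = params.pop("data_name")
missing = params.pop("not_a_key", None)

print(data_type, stored_after_get, stored_after_setdefault, name, missing)
```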
- class InMemoryDataset(root=None, transform=None, pre_transform=None, pre_filter=None, log=True, force_reload=False)#
Bases: Dataset
Dataset base class for creating graph datasets which easily fit into CPU memory. See the accompanying tutorial in the PyG documentation.
- Parameters:
root (str, optional) – Root directory where the dataset should be saved. (default: None)
transform (callable, optional) – A function/transform that takes in a Data or HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a Data or HeteroData object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a Data or HeteroData object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
log (bool, optional) – Whether to print any console output while downloading and processing the dataset. (default: True)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
- classmethod save(data_list, path)#
Saves a list of data objects to the file path path.
- static collate(data_list)#
Collates a list of Data or HeteroData objects to the internal storage format of InMemoryDataset.
- __init__(root=None, transform=None, pre_transform=None, pre_filter=None, log=True, force_reload=False)#
- copy(idx=None)#
Performs a deep-copy of the dataset. If idx is not given, will clone the full dataset. Otherwise, will only clone a subset of the dataset from indices idx. Indices can be slices, lists, tuples, and a torch.Tensor or np.ndarray of type long or bool.
- cpu(*args)#
Moves the dataset to CPU memory.
- cuda(device=None)#
Moves the dataset to CUDA memory.
- get(idx)#
Gets the data object at index idx.
- len()#
Returns the number of data objects stored in the dataset.
- load(path, data_cls=<class 'torch_geometric.data.data.Data'>)#
Loads the dataset from the file path path.
- to(device)#
Performs device conversion of the whole dataset.
- to_on_disk_dataset(root=None, backend='sqlite', log=True)#
Converts the InMemoryDataset to an OnDiskDataset variant. Useful for distributed training and hardware instances with limited amount of shared memory.
- Parameters:
root (str, optional) – Root directory where the dataset should be saved. If set to None, will save the dataset in root/on_disk. Note that it is important to specify root to account for different dataset splits. (default: None)
backend (str) – The Database backend to use. (default: "sqlite")
log (bool, optional) – Whether to print any console output while processing the dataset. (default: True)
- class Path(*args, **kwargs)#
Bases: PurePath
PurePath subclass that can make system calls.
Path represents a filesystem path but unlike PurePath, also offers methods to do system calls on path objects. Depending on your system, instantiating a Path will return either a PosixPath or a WindowsPath object. You can also instantiate a PosixPath or WindowsPath directly, but cannot instantiate a WindowsPath on a POSIX system or vice versa.
- classmethod cwd()#
Return a new path pointing to the current working directory (as returned by os.getcwd()).
- classmethod home()#
Return a new path pointing to the user’s home directory (as returned by os.path.expanduser('~')).
- absolute()#
Return an absolute version of this path by prepending the current working directory. No normalization or symlink resolution is performed.
Use resolve() to get the canonical path to a file.
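The difference between absolute() and resolve() is easiest to see on a path containing "..". The short sketch below uses only the standard library; the relative path `a/../b` is an arbitrary example and does not need to exist on disk.

```python
from pathlib import Path

relative = Path("a") / ".." / "b"

# absolute() only prepends the current working directory; ".." is kept.
absolute_path = relative.absolute()

# resolve() also normalizes the path, so the ".." component disappears.
resolved_path = relative.resolve()

print(absolute_path)
print(resolved_path)
```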
- chmod(mode, *, follow_symlinks=True)#
Change the permissions of the path, like os.chmod().
- exists()#
Whether this path exists.
- expanduser()#
Return a new path with expanded ~ and ~user constructs (as returned by os.path.expanduser)
- glob(pattern)#
Iterate over this subtree and yield all existing files (of any kind, including directories) matching the given relative pattern.
- group()#
Return the group name of the file gid.
- hardlink_to(target)#
Make this path a hard link pointing to the same file as target.
Note the order of arguments (self, target) is the reverse of os.link’s.
- is_block_device()#
Whether this path is a block device.
- is_char_device()#
Whether this path is a character device.
- is_dir()#
Whether this path is a directory.
- is_fifo()#
Whether this path is a FIFO.
- is_file()#
Whether this path is a regular file (also True for symlinks pointing to regular files).
- is_mount()#
Check if this path is a POSIX mount point
- is_socket()#
Whether this path is a socket.
- is_symlink()#
Whether this path is a symbolic link.
- iterdir()#
Iterate over the files in this directory. Does not yield any result for the special paths ‘.’ and ‘..’.
- lchmod(mode)#
Like chmod(), except if the path points to a symlink, the symlink’s permissions are changed, rather than its target’s.
- link_to(target)#
Make the target path a hard link pointing to this path.
Note this function does not make this path a hard link to target, despite the implication of the function and argument names. The order of arguments (target, link) is the reverse of Path.symlink_to, but matches that of os.link.
Deprecated since Python 3.10 and scheduled for removal in Python 3.12. Use hardlink_to() instead.
- lstat()#
Like stat(), except if the path points to a symlink, the symlink’s status information is returned, rather than its target’s.
- mkdir(mode=511, parents=False, exist_ok=False)#
Create a new directory at this given path.
- open(mode='r', buffering=-1, encoding=None, errors=None, newline=None)#
Open the file pointed by this path and return a file object, as the built-in open() function does.
- owner()#
Return the login name of the file owner.
- read_bytes()#
Open the file in bytes mode, read it, and close the file.
- read_text(encoding=None, errors=None)#
Open the file in text mode, read it, and close the file.
- readlink()#
Return the path to which the symbolic link points.
- rename(target)#
Rename this path to the target path.
The target path may be absolute or relative. Relative paths are interpreted relative to the current working directory, not the directory of the Path object.
Returns the new Path instance pointing to the target path.
- replace(target)#
Rename this path to the target path, overwriting if that path exists.
The target path may be absolute or relative. Relative paths are interpreted relative to the current working directory, not the directory of the Path object.
Returns the new Path instance pointing to the target path.
- resolve(strict=False)#
Make the path absolute, resolving all symlinks on the way and also normalizing it.
- rglob(pattern)#
Recursively yield all existing files (of any kind, including directories) matching the given relative pattern, anywhere in this subtree.
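To contrast glob() with rglob(), the sketch below creates a throwaway directory tree and matches the same pattern both ways. The file and directory names are arbitrary placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw").mkdir()
    (root / "raw" / "a.csv").write_text("smiles,label\n")
    (root / "b.csv").write_text("smiles,label\n")

    # glob() matches relative to root only; rglob() searches the whole subtree.
    top_level = sorted(p.name for p in root.glob("*.csv"))
    recursive = sorted(p.name for p in root.rglob("*.csv"))

print(top_level)  # ['b.csv']
print(recursive)  # ['a.csv', 'b.csv']
```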
- rmdir()#
Remove this directory. The directory must be empty.
- samefile(other_path)#
Return whether other_path is the same or not as this file (as returned by os.path.samefile()).
- stat(*, follow_symlinks=True)#
Return the result of the stat() system call on this path, like os.stat() does.
- symlink_to(target, target_is_directory=False)#
Make this path a symlink pointing to the target path. Note the order of arguments (link, target) is the reverse of os.symlink.
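The argument order is a common source of confusion: the path object the method is called on becomes the link, and the argument is what it points to. The sketch below shows this in a temporary directory. It assumes a platform where the current user may create symlinks, which is typical on POSIX systems but can require extra privileges on Windows.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.txt"
    target.write_text("hello")

    link = Path(tmp) / "alias.txt"
    link.symlink_to(target)  # link is created and points at target

    link_is_symlink = link.is_symlink()
    target_is_symlink = target.is_symlink()
    content_via_link = link.read_text()

print(link_is_symlink, target_is_symlink, content_via_link)
```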
- touch(mode=438, exist_ok=True)#
Create this file with the given access mode, if it doesn’t exist.
- unlink(missing_ok=False)#
Remove this file or link. If the path is a directory, use rmdir() instead.
- write_bytes(data)#
Open the file in bytes mode, write to it, and close the file.
- write_text(data, encoding=None, errors=None, newline=None)#
Open the file in text mode, write to it, and close the file.
- smiles2graph(smiles_string)#
Converts a SMILES string to a graph Data object.
- Parameters:
smiles_string (str) – The SMILES string.
- Returns:
The graph object.