topobench.data.datasets.us_county_demos_dataset module#
Dataset class for US County Demographics dataset.
- class Data(x=None, edge_index=None, edge_attr=None, y=None, pos=None, time=None, **kwargs)#
Bases: BaseData, FeatureStore, GraphStore
A data object describing a homogeneous graph. The data object can hold node-level, link-level and graph-level attributes. In general, Data tries to mimic the behavior of a regular Python dictionary. In addition, it provides useful functionality for analyzing graph structures, and provides basic PyTorch tensor functionalities. See the PyG documentation for the accompanying tutorial.

    import torch
    from torch_geometric.data import Data

    data = Data(x=x, edge_index=edge_index, ...)

    # Add additional arguments to `data`:
    data.train_idx = torch.tensor([...], dtype=torch.long)
    data.test_mask = torch.tensor([...], dtype=torch.bool)

    # Analyzing the graph structure:
    data.num_nodes
    >>> 23

    data.is_directed()
    >>> False

    # PyTorch tensor functionality:
    data = data.pin_memory()
    data = data.to('cuda:0', non_blocking=True)
- Parameters:
x (torch.Tensor, optional) – Node feature matrix with shape [num_nodes, num_node_features]. (default: None)
edge_index (LongTensor, optional) – Graph connectivity in COO format with shape [2, num_edges]. (default: None)
edge_attr (torch.Tensor, optional) – Edge feature matrix with shape [num_edges, num_edge_features]. (default: None)
y (torch.Tensor, optional) – Graph-level or node-level ground-truth labels with arbitrary shape. (default: None)
pos (torch.Tensor, optional) – Node position matrix with shape [num_nodes, num_dimensions]. (default: None)
time (torch.Tensor, optional) – The timestamps for each event with shape [num_edges] or [num_nodes]. (default: None)
**kwargs (optional) – Additional attributes.
- __init__(x=None, edge_index=None, edge_attr=None, y=None, pos=None, time=None, **kwargs)#
- connected_components()#
Extracts connected components of the graph using a union-find algorithm. The components are returned as a list of Data objects, where each object represents a connected component of the graph.

    data = Data()
    data.x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
    data.y = torch.tensor([[1.1], [2.1], [3.1], [4.1]])
    data.edge_index = torch.tensor(
        [[0, 1, 2, 3], [1, 0, 3, 2]], dtype=torch.long
    )

    components = data.connected_components()
    print(len(components))
    >>> 2

    # Print the whole component, not only its `x` attribute.
    print(components[0])
    >>> Data(x=[2, 1], y=[2, 1], edge_index=[2, 2])
- Returns:
A list of disconnected components.
- Return type:
List[Data]
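The union-find idea mentioned above can be illustrated without any tensor library. The following is a minimal sketch on plain Python lists, not the PyG implementation; the function name `connected_components` and its `(num_nodes, edges)` signature are assumptions made for this illustration.

```python
def connected_components(num_nodes, edges):
    """Group node ids into connected components with union-find (sketch)."""
    parent = list(range(num_nodes))  # every node starts as its own root

    def find(v):
        # Path halving: shorten the chain to the root while walking up.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # merge the two sets

    # Collect the members of each set under their root.
    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Same toy graph as in the example above: edges 0-1 and 2-3, both directions.
edges = [(0, 1), (1, 0), (2, 3), (3, 2)]
components = connected_components(4, edges)
print(components)
```

The two resulting groups correspond to the two Data objects returned in the example above; the real method additionally slices node and edge attributes for each component.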
- debug()#
- edge_subgraph(subset)#
Returns the induced subgraph given by the edge indices subset. Will currently preserve all the nodes in the graph, even if they are isolated after subgraph computation.
- Parameters:
subset (LongTensor or BoolTensor) – The edges to keep.
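To make the "preserves all nodes" behavior concrete, here is a small sketch on plain Python lists. It is an illustration only and does not call PyG; the helper name `edge_subgraph` and its arguments are hypothetical, and only index subsets (not boolean masks) are handled.

```python
def edge_subgraph(num_nodes, edge_index, subset):
    """Keep only the edges at positions `subset`; the node set is unchanged."""
    src, dst = edge_index
    kept = [[src[i] for i in subset], [dst[i] for i in subset]]
    return num_nodes, kept


# Path graph 0 - 1 - 2 - 3 stored as two rows (sources, targets).
edge_index = [[0, 1, 2], [1, 2, 3]]
num_nodes, kept = edge_subgraph(4, edge_index, subset=[0])

# Nodes that no longer appear in any edge are still part of the graph.
touched = set(kept[0]) | set(kept[1])
isolated = [n for n in range(num_nodes) if n not in touched]
print(kept)
print(isolated)
```

Here nodes 2 and 3 become isolated but are still counted, which mirrors the behavior described above.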
- classmethod from_dict(mapping)#
Creates a Data object from a dictionary.
- get_all_edge_attrs()#
Returns all registered edge attributes.
- get_all_tensor_attrs()#
Obtains all feature attributes stored in Data.
- stores_as(data)#
- subgraph(subset)#
Returns the induced subgraph given by the node indices subset.
- Parameters:
subset (LongTensor or BoolTensor) – The nodes to keep.
- to_dict()#
Returns a dictionary of stored key/value pairs.
- to_heterogeneous(node_type=None, edge_type=None, node_type_names=None, edge_type_names=None)#
Converts a Data object to a heterogeneous HeteroData object. For this, node and edge attributes are split according to the node-level and edge-level vectors node_type and edge_type, respectively. node_type_names and edge_type_names can be used to give meaningful node and edge type names, respectively. That is, the node_type 0 is given by node_type_names[0]. If the Data object was constructed via to_homogeneous(), the object can be reconstructed without any need to pass in additional arguments.
- Parameters:
node_type (torch.Tensor, optional) – A node-level vector denoting the type of each node. (default: None)
edge_type (torch.Tensor, optional) – An edge-level vector denoting the type of each edge. (default: None)
node_type_names (List[str], optional) – The names of node types. (default: None)
edge_type_names (List[Tuple[str, str, str]], optional) – The names of edge types. (default: None)
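The splitting step can be pictured as grouping per-node (or per-edge) values by an integer type vector and then naming each group. The sketch below does this on plain Python lists; it is not the PyG implementation, the helper `split_by_type` is hypothetical, and the type names are made up for illustration. The real conversion also re-indexes edge endpoints within each node type, which this sketch omits.

```python
def split_by_type(values, type_vector, type_names):
    """Group values by their integer type id, keyed by the matching name."""
    groups = {name: [] for name in type_names}
    for value, type_id in zip(values, type_vector):
        groups[type_names[type_id]].append(value)
    return groups


# Three nodes with one feature each; node 1 has a different type.
x = [[1.0], [2.0], [3.0]]
node_type = [0, 1, 0]
node_type_names = ["county", "state"]  # hypothetical names
nodes = split_by_type(x, node_type, node_type_names)

# Two edges of a single type, named by a (source, relation, target) triple.
edges = [(0, 1), (2, 1)]
edge_type = [0, 0]
edge_type_names = [("county", "in", "state")]  # hypothetical name
grouped_edges = split_by_type(edges, edge_type, edge_type_names)

print(nodes)
print(grouped_edges)
```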
- to_namedtuple()#
Returns a NamedTuple of stored key/value pairs.
- update(data)#
Updates the data object with the elements from another data object. Added elements will override existing ones (in case of duplicates).
- validate(raise_on_error=True)#
Validates the correctness of the data.
- property num_features: int#
Returns the number of features per node in the graph. Alias for num_node_features.
- property num_nodes: int | None#
Returns the number of nodes in the graph.
Note
The number of nodes in the data object is automatically inferred in case node-level attributes are present, e.g., data.x. In some cases, however, a graph may only be given without any node-level attributes. PyG then guesses the number of nodes according to edge_index.max().item() + 1. However, in case there exist isolated nodes, this number does not have to be correct, which can result in unexpected behavior. Thus, we recommend setting the number of nodes in your data object explicitly via data.num_nodes = .... You will be given a warning that requests you to do so.
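The following sketch reproduces the guess described in the note on plain Python lists, to show why isolated nodes make it unreliable. The helper `infer_num_nodes` is a hypothetical stand-in for the `edge_index.max().item() + 1` rule, not a PyG function.

```python
def infer_num_nodes(edge_index):
    """Guess the node count as the largest referenced index plus one."""
    flat = [i for row in edge_index for i in row]
    return max(flat) + 1 if flat else 0


# A graph that really has 5 nodes, but nodes 3 and 4 have no edges.
edge_index = [[0, 1, 2], [1, 2, 0]]
true_num_nodes = 5

guessed = infer_num_nodes(edge_index)
print(guessed, true_num_nodes)  # the guess misses the isolated nodes
```

Because the two isolated nodes never appear in `edge_index`, the guess undercounts, which is why setting `data.num_nodes` explicitly is recommended.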
- class DictConfig(content, key=None, parent=None, ref_type=typing.Any, key_type=typing.Any, element_type=typing.Any, is_optional=True, flags=None)#
Bases: BaseContainer, MutableMapping[Any, Any]
- __init__(content, key=None, parent=None, ref_type=typing.Any, key_type=typing.Any, element_type=typing.Any, is_optional=True, flags=None)#
- copy()#
- get(key, default_value=None)#
Return the value for key if key is in the dictionary, else default_value (defaulting to None).
- items() → a set-like object providing a view on D's items#
- items_ex(resolve=True, keys=None)#
- keys() → a set-like object providing a view on D's keys#
- pop(k[, d]) → v, remove specified key and return the corresponding value.#
If key is not found, d is returned if given, otherwise KeyError is raised.
- setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D#
- class InMemoryDataset(root=None, transform=None, pre_transform=None, pre_filter=None, log=True, force_reload=False)#
Bases: Dataset
Dataset base class for creating graph datasets which easily fit into CPU memory. See the PyG documentation for the accompanying tutorial.
- Parameters:
root (str, optional) – Root directory where the dataset should be saved. (default: None)
transform (callable, optional) – A function/transform that takes in a Data or HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a Data or HeteroData object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a Data or HeteroData object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
log (bool, optional) – Whether to print any console output while downloading and processing the dataset. (default: True)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
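The difference between the three callables is when they run: pre_filter and pre_transform are applied once before the data is stored, while transform is applied on every access. The sketch below mimics that timing with a tiny stand-in class on plain Python values. `TinyDataset` is hypothetical and is not the PyG class; applying the filter before the pre-transform is an assumption of this sketch.

```python
class TinyDataset:
    """Hypothetical stand-in showing when each callable is applied."""

    def __init__(self, raw_items, transform=None, pre_transform=None,
                 pre_filter=None):
        self.transform = transform
        items = list(raw_items)
        # Applied once, before storing (assumed order: filter, then transform).
        if pre_filter is not None:
            items = [item for item in items if pre_filter(item)]
        if pre_transform is not None:
            items = [pre_transform(item) for item in items]
        self._stored = items  # stands in for what would be saved to disk

    def __len__(self):
        return len(self._stored)

    def __getitem__(self, idx):
        item = self._stored[idx]
        # Applied on every access; the stored item itself is not modified.
        return self.transform(item) if self.transform is not None else item


ds = TinyDataset(
    [1, 2, 3, 4],
    transform=lambda v: v + 1,        # runs at access time
    pre_transform=lambda v: v * 10,   # runs once before storing
    pre_filter=lambda v: v % 2 == 0,  # keeps only even raw items
)
print(len(ds), ds[0], ds._stored)
```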
- __init__(root=None, transform=None, pre_transform=None, pre_filter=None, log=True, force_reload=False)#
- static collate(data_list)#
Collates a list of Data or HeteroData objects to the internal storage format of InMemoryDataset.
- copy(idx=None)#
Performs a deep-copy of the dataset. If idx is not given, will clone the full dataset. Otherwise, will only clone a subset of the dataset from indices idx. Indices can be slices, lists, tuples, and a torch.Tensor or np.ndarray of type long or bool.
- cpu(*args)#
Moves the dataset to CPU memory.
- cuda(device=None)#
Moves the dataset to CUDA memory.
- get(idx)#
Gets the data object at index idx.
- len()#
Returns the number of data objects stored in the dataset.
- load(path, data_cls=<class 'torch_geometric.data.data.Data'>)#
Loads the dataset from the file path path.
- classmethod save(data_list, path)#
Saves a list of data objects to the file path path.
- to(device)#
Performs device conversion of the whole dataset.
- to_on_disk_dataset(root=None, backend='sqlite', log=True)#
Converts the InMemoryDataset to an OnDiskDataset variant. Useful for distributed training and hardware instances with a limited amount of shared memory.
- root (str, optional): Root directory where the dataset should be saved. If set to None, will save the dataset in root/on_disk. Note that it is important to specify root to account for different dataset splits. (default: None)
- backend (str): The Database backend to use. (default: "sqlite")
- log (bool, optional): Whether to print any console output while processing the dataset. (default: True)
- class USCountyDemosDataset(root, name, parameters)#
Bases: InMemoryDataset
Dataset class for US County Demographics dataset.
- Parameters:
- root : str
Root directory where the dataset will be saved.
- name : str
Name of the dataset.
- parameters : DictConfig
Configuration parameters for the dataset.
- Attributes:
- URLS (dict): Dictionary containing the URLs for downloading the dataset.
- FILE_FORMAT (dict): Dictionary containing the file formats for the dataset.
- RAW_FILE_NAMES (dict): Dictionary containing the raw file names for the dataset.
- __init__(root, name, parameters)#
- download()#
Download the dataset from a URL and save it to the raw directory.
- Raises:
FileNotFoundError – If the dataset URL is not found.
- process()#
Handle the data for the dataset.
This method loads the US county demographics data, applies any pre-processing transformations if specified, and saves the processed data to the appropriate location.
- URLS: ClassVar = {'US-county-demos': 'https://drive.google.com/file/d/1FNF_LbByhYNICPNdT6tMaJI9FxuSvvLK/view?usp=sharing'}#
- property processed_dir: str#
Return the path to the processed directory of the dataset.
- Returns:
- str
Path to the processed directory.
- property processed_file_names: str#
Return the processed file name for the dataset.
- Returns:
- str
Processed file name.
- download_file_from_drive(file_link, path_to_save, dataset_name, file_format='tar.gz')#
Download a file from a Google Drive link and save it to the specified path.
- Parameters:
- file_link : str
The Google Drive link of the file to download.
- path_to_save : str
The path where the downloaded file will be saved.
- dataset_name : str
The name of the dataset.
- file_format : str, optional
The format of the downloaded file. Defaults to "tar.gz".
- Raises:
- None
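A Google Drive share link such as the one in URLS carries the file id between `/file/d/` and the next slash. The sketch below only shows how that id could be extracted with the standard library; it is not the implementation of download_file_from_drive. The helper name `drive_file_id` is hypothetical, and the direct-download URL form is an assumption that the documentation above does not confirm.

```python
import re


def drive_file_id(file_link):
    """Extract the file id from a '.../file/d/<id>/...' share link."""
    match = re.search(r"/file/d/([^/]+)", file_link)
    if match is None:
        raise ValueError(f"No file id found in link: {file_link}")
    return match.group(1)


# Share link taken from the URLS attribute documented above.
link = (
    "https://drive.google.com/file/d/"
    "1FNF_LbByhYNICPNdT6tMaJI9FxuSvvLK/view?usp=sharing"
)
file_id = drive_file_id(link)

# Assumed direct-download form; treat this as a guess, not documented API.
download_url = f"https://drive.google.com/uc?id={file_id}"
print(file_id)
print(download_url)
```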
- extract_zip(path, folder, log=True)#
Extracts a zip archive to a specific folder.
- read_us_county_demos(path, year=2012, y_col='Election')#
Load US County Demos dataset.
- Parameters:
- path : str
Path to the dataset.
- year : int, optional
Year to load the features (default: 2012).
- y_col : str, optional
Column to use as label. Can be one of ['Election', 'MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate'] (default: "Election").
- Returns:
- torch_geometric.data.Data
Data object of the graph for the US County Demos dataset.
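Since y_col accepts only the label columns listed above, it can be useful to check a configured value before calling read_us_county_demos. The sketch below is a hypothetical pre-check, not part of the module; the names `VALID_Y_COLS` and `check_y_col` are introduced here for illustration, and whether the function itself validates y_col is not stated in the documentation.

```python
# Label columns listed in the y_col description above.
VALID_Y_COLS = (
    "Election",
    "MedianIncome",
    "MigraRate",
    "BirthRate",
    "DeathRate",
    "BachelorRate",
    "UnemploymentRate",
)


def check_y_col(y_col="Election"):
    """Return y_col unchanged if it is a documented label column."""
    if y_col not in VALID_Y_COLS:
        raise ValueError(
            f"Unknown y_col {y_col!r}; expected one of {list(VALID_Y_COLS)}"
        )
    return y_col


label = check_y_col("MedianIncome")
default_label = check_y_col()
print(label, default_label)
```

A checked value could then be passed on, for example as `read_us_county_demos(path, year=2012, y_col=label)`, using the signature documented above.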