topobench.transforms.liftings.pointcloud2hypergraph.mogmst_lifting module#
Mixture of Gaussians and Minimum Spanning Tree (MoGMST) Lifting.
- class GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)#
Bases:
BaseMixtureGaussian Mixture.
Representation of a Gaussian mixture model probability distribution. This class allows to estimate the parameters of a Gaussian mixture distribution.
Read more in the User Guide.
Added in version 0.18.
- Parameters:
- n_componentsint, default=1
The number of mixture components.
- covariance_type{‘full’, ‘tied’, ‘diag’, ‘spherical’}, default=’full’
String describing the type of covariance parameters to use. Must be one of:
‘full’: each component has its own general covariance matrix.
‘tied’: all components share the same general covariance matrix.
‘diag’: each component has its own diagonal covariance matrix.
‘spherical’: each component has its own single variance.
For an example of using covariance_type, refer to sphx_glr_auto_examples_mixture_plot_gmm_selection.py.
- tolfloat, default=1e-3
The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.
- reg_covarfloat, default=1e-6
Non-negative regularization added to the diagonal of covariance. Allows to assure that the covariance matrices are all positive.
- max_iterint, default=100
The number of EM iterations to perform.
- n_initint, default=1
The number of initializations to perform. The best results are kept.
- init_params{‘kmeans’, ‘k-means++’, ‘random’, ‘random_from_data’}, default=’kmeans’
The method used to initialize the weights, the means and the precisions. String must be one of:
‘kmeans’ : responsibilities are initialized using kmeans.
‘k-means++’ : use the k-means++ method to initialize.
‘random’ : responsibilities are initialized randomly.
‘random_from_data’ : initial means are randomly selected data points.
Changed in version v1.1: init_params now accepts ‘random_from_data’ and ‘k-means++’ as initialization methods.
- weights_initarray-like of shape (n_components, ), default=None
The user-provided initial weights. If it is None, weights are initialized using the init_params method.
- means_initarray-like of shape (n_components, n_features), default=None
The user-provided initial means, If it is None, means are initialized using the init_params method.
- precisions_initarray-like, default=None
The user-provided initial precisions (inverse of the covariance matrices). If it is None, precisions are initialized using the ‘init_params’ method. The shape depends on ‘covariance_type’:
(n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full'
- random_stateint, RandomState instance or None, default=None
Controls the random seed given to the method chosen to initialize the parameters (see init_params). In addition, it controls the generation of random samples from the fitted distribution (see the method sample). Pass an int for reproducible output across multiple function calls. See Glossary.
- warm_startbool, default=False
If ‘warm_start’ is True, the solution of the last fitting is used as initialization for the next call of fit(). This can speed up convergence when fit is called several times on similar problems. In that case, ‘n_init’ is ignored and only a single initialization occurs upon the first call. See the Glossary.
- verboseint, default=0
Enable verbose output. If 1 then it prints the current initialization and each iteration step. If greater than 1 then it prints also the log probability and the time needed for each step.
- verbose_intervalint, default=10
Number of iteration done before the next print.
- Attributes:
- weights_array-like of shape (n_components,)
The weights of each mixture components.
- means_array-like of shape (n_components, n_features)
The mean of each mixture component.
- covariances_array-like
The covariance of each mixture component. The shape depends on covariance_type:
(n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full'
For an example of using covariances, refer to sphx_glr_auto_examples_mixture_plot_gmm_covariances.py.
- precisions_array-like
The precision matrices for each component in the mixture. A precision matrix is the inverse of a covariance matrix. A covariance matrix is symmetric positive definite so the mixture of Gaussian can be equivalently parameterized by the precision matrices. Storing the precision matrices instead of the covariance matrices makes it more efficient to compute the log-likelihood of new samples at test time. The shape depends on covariance_type:
(n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full'
- precisions_cholesky_array-like
The cholesky decomposition of the precision matrices of each mixture component. A precision matrix is the inverse of a covariance matrix. A covariance matrix is symmetric positive definite so the mixture of Gaussian can be equivalently parameterized by the precision matrices. Storing the precision matrices instead of the covariance matrices makes it more efficient to compute the log-likelihood of new samples at test time. The shape depends on covariance_type:
(n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full'
- converged_bool
True when convergence of the best fit of EM was reached, False otherwise.
- n_iter_int
Number of step used by the best fit of EM to reach the convergence.
- lower_bound_float
Lower bound value on the log-likelihood (of the training data with respect to the model) of the best fit of EM.
- lower_bounds_array-like of shape (n_iter_,)
The list of lower bound values on the log-likelihood from each iteration of the best fit of EM.
- n_features_in_int
Number of features seen during fit.
Added in version 0.24.
- feature_names_in_ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
Added in version 1.0.
See also
BayesianGaussianMixtureGaussian mixture model fit with a variational inference.
Examples
>>> import numpy as np >>> from sklearn.mixture import GaussianMixture >>> X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]) >>> gm = GaussianMixture(n_components=2, random_state=0).fit(X) >>> gm.means_ array([[10., 2.], [ 1., 2.]]) >>> gm.predict([[0, 0], [12, 3]]) array([1, 0])
For a comparison of Gaussian Mixture with other clustering algorithms, see sphx_glr_auto_examples_cluster_plot_cluster_comparison.py
- __init__(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)#
- aic(X)#
Akaike information criterion for the current model on the input X.
You can refer to this mathematical section for more details regarding the formulation of the AIC used.
- Parameters:
- Xarray of shape (n_samples, n_dimensions)
The input samples.
- Returns:
- aicfloat
The lower the better.
- bic(X)#
Bayesian information criterion for the current model on the input X.
You can refer to this mathematical section for more details regarding the formulation of the BIC used.
For an example of GMM selection using bic information criterion, refer to sphx_glr_auto_examples_mixture_plot_gmm_selection.py.
- Parameters:
- Xarray of shape (n_samples, n_dimensions)
The input samples.
- Returns:
- bicfloat
The lower the better.
- class MoGMSTLifting(min_components=None, max_components=None, random_state=None, **kwargs)#
Bases:
PointCloud2HypergraphLiftingLift a point cloud to a hypergraph.
We find a Mixture of Gaussians and then create a Minimum Spanning Tree (MST) between the means of the Gaussians.
- Parameters:
- min_componentsint or None, optional
The minimum number of components for the Mixture of Gaussians model. It needs to be at least 1 (default: None).
- max_componentsint or None, optional
The maximum number of components for the Mixture of Gaussians model. It needs to be greater or equal than min_components (default: None).
- random_stateint, optional
The random state for the Mixture of Gaussians model (default: None).
- **kwargsoptional
Additional arguments for the class.
- __init__(min_components=None, max_components=None, random_state=None, **kwargs)#
- find_mog(data)#
Find the best number of components for a Mixture of Gaussians model.
- Parameters:
- datanp.ndarray
The input data to be fitted.
- Returns:
- tuple[np.ndarray, int, np.ndarray]
The labels of the data, the number of components and the means of the components.
- lift_topology(data)#
Lift the topology of a graph to a hypergraph.
- Parameters:
- datatorch_geometric.data.Data
The input data to be lifted.
- Returns:
- dict
The lifted topology.
- class PointCloud2HypergraphLifting(**kwargs)#
Bases:
PointCloudLiftingAbstract class for lifting pointclouds to hypergraphs.
- Parameters:
- **kwargsoptional
Additional arguments for the class.
- __init__(**kwargs)#
- from_numpy_array(A, parallel_edges=False, create_using=None)#
Returns a graph from a 2D NumPy array.
The 2D NumPy array is interpreted as an adjacency matrix for the graph.
- Parameters:
- Aa 2D numpy.ndarray
An adjacency matrix representation of a graph
- parallel_edgesBoolean
If this is True, create_using is a multigraph, and A is an integer array, then entry (i, j) in the array is interpreted as the number of parallel edges joining vertices i and j in the graph. If it is False, then the entries in the array are interpreted as the weight of a single edge joining the vertices.
- create_usingNetworkX graph constructor, optional (default=nx.Graph)
Graph type to create. If graph instance, then cleared before populated.
See also
to_numpy_array
Notes
For directed graphs, explicitly mention create_using=nx.DiGraph, and entry i,j of A corresponds to an edge from i to j.
If create_using is
networkx.MultiGraphornetworkx.MultiDiGraph, parallel_edges is True, and the entries of A are of typeint, then this function returns a multigraph (of the same type as create_using) with parallel edges.If create_using indicates an undirected multigraph, then only the edges indicated by the upper triangle of the array A will be added to the graph.
If the NumPy array has a single data type for each array entry it will be converted to an appropriate Python data type.
If the NumPy array has a user-specified compound data type the names of the data fields will be used as attribute keys in the resulting NetworkX graph.
Examples
Simple integer weights on edges:
>>> import numpy as np >>> A = np.array([[1, 1], [2, 1]]) >>> G = nx.from_numpy_array(A) >>> G.edges(data=True) EdgeDataView([(0, 0, {'weight': 1}), (0, 1, {'weight': 2}), (1, 1, {'weight': 1})])
If create_using indicates a multigraph and the array has only integer entries and parallel_edges is False, then the entries will be treated as weights for edges joining the nodes (without creating parallel edges):
>>> A = np.array([[1, 1], [1, 2]]) >>> G = nx.from_numpy_array(A, create_using=nx.MultiGraph) >>> G[1][1] AtlasView({0: {'weight': 2}})
If create_using indicates a multigraph and the array has only integer entries and parallel_edges is True, then the entries will be treated as the number of parallel edges joining those two vertices:
>>> A = np.array([[1, 1], [1, 2]]) >>> temp = nx.MultiGraph() >>> G = nx.from_numpy_array(A, parallel_edges=True, create_using=temp) >>> G[1][1] AtlasView({0: {'weight': 1}, 1: {'weight': 1}})
User defined compound data type on edges:
>>> dt = [("weight", float), ("cost", int)] >>> A = np.array([[(1.0, 2)]], dtype=dt) >>> G = nx.from_numpy_array(A) >>> G.edges() EdgeView([(0, 0)]) >>> G[0][0]["cost"] 2 >>> G[0][0]["weight"] 1.0
- minimum_spanning_tree(G, weight='weight', algorithm='kruskal', ignore_nan=False)#
Returns a minimum spanning tree or forest on an undirected graph G.
- Parameters:
- Gundirected graph
An undirected graph. If G is connected, then the algorithm finds a spanning tree. Otherwise, a spanning forest is found.
- weightstr
Data key to use for edge weights.
- algorithmstring
The algorithm to use when finding a minimum spanning tree. Valid choices are ‘kruskal’, ‘prim’, or ‘boruvka’. The default is ‘kruskal’.
- ignore_nanbool (default: False)
If a NaN is found as an edge weight normally an exception is raised. If ignore_nan is True then that edge is ignored instead.
- Returns:
- GNetworkX Graph
A minimum spanning tree or forest.
Notes
For Borůvka’s algorithm, each edge must have a weight attribute, and each edge weight must be distinct.
For the other algorithms, if the graph edges do not have a weight attribute a default weight of 1 will be used.
There may be more than one tree with the same minimum or maximum weight. See
networkx.tree.recognitionfor more detailed definitions.Isolated nodes with self-loops are in the tree as edgeless isolated nodes.
Examples
>>> G = nx.cycle_graph(4) >>> G.add_edge(0, 3, weight=2) >>> T = nx.minimum_spanning_tree(G) >>> sorted(T.edges(data=True)) [(0, 1, {}), (1, 2, {}), (2, 3, {})]
- pairwise_distances(X, Y=None, metric='euclidean', *, n_jobs=None, force_all_finite='deprecated', ensure_all_finite=None, **kwds)#
Compute the distance matrix from a feature array X and optional Y.
This function takes one or two feature arrays or a distance matrix, and returns a distance matrix.
If X is a feature array, of shape (n_samples_X, n_features), and:
Y is None and metric is not ‘precomputed’, the pairwise distances between X and itself are returned.
Y is a feature array of shape (n_samples_Y, n_features), the pairwise distances between X and Y is returned.
If X is a distance matrix, of shape (n_samples_X, n_samples_X), metric should be ‘precomputed’. Y is thus ignored and X is returned as is.
If the input is a collection of non-numeric data (e.g. a list of strings or a boolean array), a custom metric must be passed.
This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.
Valid values for metric are:
From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘nan_euclidean’]. All metrics support sparse matrix inputs except ‘nan_euclidean’.
From
scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]. These metrics do not support sparse matrix inputs.
Note
‘kulsinski’ is deprecated from SciPy 1.9 and will be removed in SciPy 1.11.
Note
‘matching’ has been removed in SciPy 1.9 (use ‘hamming’ instead).
Note that in the case of ‘cityblock’, ‘cosine’ and ‘euclidean’ (which are valid
scipy.spatial.distancemetrics), the scikit-learn implementation will be used, which is faster and has support for sparse matrices (except for ‘cityblock’). For a verbose description of the metrics from scikit-learn, seesklearn.metrics.pairwise.distance_metrics()function.Read more in the User Guide.
- Parameters:
- X{array-like, sparse matrix} of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)
Array of pairwise distances between samples, or a feature array. The shape of the array should be (n_samples_X, n_samples_X) if metric == “precomputed” and (n_samples_X, n_features) otherwise.
- Y{array-like, sparse matrix} of shape (n_samples_Y, n_features), default=None
An optional second feature array. Only allowed if metric != “precomputed”.
- metricstr or callable, default=’euclidean’
The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by
scipy.spatial.distance.pdist()for its metric parameter, or a metric listed inpairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.- n_jobsint, default=None
The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them using multithreading.
Nonemeans 1 unless in ajoblib.parallel_backendcontext.-1means using all processors. See Glossary for more details.The “euclidean” and “cosine” metrics rely heavily on BLAS which is already multithreaded. So, increasing n_jobs would likely cause oversubscription and quickly degrade performance.
- force_all_finitebool or ‘allow-nan’, default=True
Whether to raise an error on np.inf, np.nan, pd.NA in array. Ignored for a metric listed in
pairwise.PAIRWISE_DISTANCE_FUNCTIONS. The possibilities are:True: Force all values of array to be finite.
False: accepts np.inf, np.nan, pd.NA in array.
‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.
Added in version 0.22:
force_all_finiteaccepts the string'allow-nan'.Changed in version 0.23: Accepts pd.NA and converts it into np.nan.
Deprecated since version 1.6: force_all_finite was renamed to ensure_all_finite and will be removed in 1.8.
- ensure_all_finitebool or ‘allow-nan’, default=True
Whether to raise an error on np.inf, np.nan, pd.NA in array. Ignored for a metric listed in
pairwise.PAIRWISE_DISTANCE_FUNCTIONS. The possibilities are:True: Force all values of array to be finite.
False: accepts np.inf, np.nan, pd.NA in array.
‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.
Added in version 1.6: force_all_finite was renamed to ensure_all_finite.
- **kwdsoptional keyword parameters
Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.
- Returns:
- Dndarray of shape (n_samples_X, n_samples_X) or (n_samples_X, n_samples_Y)
A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.
See also
pairwise_distances_chunkedPerforms the same calculation as this function, but returns a generator of chunks of the distance matrix, in order to limit memory usage.
sklearn.metrics.pairwise.paired_distancesComputes the distances between corresponding elements of two arrays.
Notes
If metric is a callable, no restrictions are placed on X and Y dimensions.
Examples
>>> from sklearn.metrics.pairwise import pairwise_distances >>> X = [[0, 0, 0], [1, 1, 1]] >>> Y = [[1, 0, 0], [1, 1, 0]] >>> pairwise_distances(X, Y, metric='sqeuclidean') array([[1., 2.], [2., 1.]])