topobench.transforms.liftings.pointcloud2hypergraph.mogmst_lifting module#

Mixture of Gaussians and Minimum Spanning Tree (MoGMST) Lifting.

class GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)#

Bases: BaseMixture

Gaussian Mixture.

Representation of a Gaussian mixture model probability distribution. This class allows to estimate the parameters of a Gaussian mixture distribution.

Read more in the User Guide.

Added in version 0.18.

Parameters:
n_componentsint, default=1

The number of mixture components.

covariance_type{‘full’, ‘tied’, ‘diag’, ‘spherical’}, default=’full’

String describing the type of covariance parameters to use. Must be one of:

  • ‘full’: each component has its own general covariance matrix.

  • ‘tied’: all components share the same general covariance matrix.

  • ‘diag’: each component has its own diagonal covariance matrix.

  • ‘spherical’: each component has its own single variance.

For an example of using covariance_type, refer to sphx_glr_auto_examples_mixture_plot_gmm_selection.py.

tolfloat, default=1e-3

The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.

reg_covarfloat, default=1e-6

Non-negative regularization added to the diagonal of covariance. Allows to assure that the covariance matrices are all positive.

max_iterint, default=100

The number of EM iterations to perform.

n_initint, default=1

The number of initializations to perform. The best results are kept.

init_params{‘kmeans’, ‘k-means++’, ‘random’, ‘random_from_data’}, default=’kmeans’

The method used to initialize the weights, the means and the precisions. String must be one of:

  • ‘kmeans’ : responsibilities are initialized using kmeans.

  • ‘k-means++’ : use the k-means++ method to initialize.

  • ‘random’ : responsibilities are initialized randomly.

  • ‘random_from_data’ : initial means are randomly selected data points.

Changed in version v1.1: init_params now accepts ‘random_from_data’ and ‘k-means++’ as initialization methods.

weights_initarray-like of shape (n_components, ), default=None

The user-provided initial weights. If it is None, weights are initialized using the init_params method.

means_initarray-like of shape (n_components, n_features), default=None

The user-provided initial means, If it is None, means are initialized using the init_params method.

precisions_initarray-like, default=None

The user-provided initial precisions (inverse of the covariance matrices). If it is None, precisions are initialized using the ‘init_params’ method. The shape depends on ‘covariance_type’:

(n_components,)                        if 'spherical',
(n_features, n_features)               if 'tied',
(n_components, n_features)             if 'diag',
(n_components, n_features, n_features) if 'full'
random_stateint, RandomState instance or None, default=None

Controls the random seed given to the method chosen to initialize the parameters (see init_params). In addition, it controls the generation of random samples from the fitted distribution (see the method sample). Pass an int for reproducible output across multiple function calls. See Glossary.

warm_startbool, default=False

If ‘warm_start’ is True, the solution of the last fitting is used as initialization for the next call of fit(). This can speed up convergence when fit is called several times on similar problems. In that case, ‘n_init’ is ignored and only a single initialization occurs upon the first call. See the Glossary.

verboseint, default=0

Enable verbose output. If 1 then it prints the current initialization and each iteration step. If greater than 1 then it prints also the log probability and the time needed for each step.

verbose_intervalint, default=10

Number of iteration done before the next print.

Attributes:
weights_array-like of shape (n_components,)

The weights of each mixture components.

means_array-like of shape (n_components, n_features)

The mean of each mixture component.

covariances_array-like

The covariance of each mixture component. The shape depends on covariance_type:

(n_components,)                        if 'spherical',
(n_features, n_features)               if 'tied',
(n_components, n_features)             if 'diag',
(n_components, n_features, n_features) if 'full'

For an example of using covariances, refer to sphx_glr_auto_examples_mixture_plot_gmm_covariances.py.

precisions_array-like

The precision matrices for each component in the mixture. A precision matrix is the inverse of a covariance matrix. A covariance matrix is symmetric positive definite so the mixture of Gaussian can be equivalently parameterized by the precision matrices. Storing the precision matrices instead of the covariance matrices makes it more efficient to compute the log-likelihood of new samples at test time. The shape depends on covariance_type:

(n_components,)                        if 'spherical',
(n_features, n_features)               if 'tied',
(n_components, n_features)             if 'diag',
(n_components, n_features, n_features) if 'full'
precisions_cholesky_array-like

The cholesky decomposition of the precision matrices of each mixture component. A precision matrix is the inverse of a covariance matrix. A covariance matrix is symmetric positive definite so the mixture of Gaussian can be equivalently parameterized by the precision matrices. Storing the precision matrices instead of the covariance matrices makes it more efficient to compute the log-likelihood of new samples at test time. The shape depends on covariance_type:

(n_components,)                        if 'spherical',
(n_features, n_features)               if 'tied',
(n_components, n_features)             if 'diag',
(n_components, n_features, n_features) if 'full'
converged_bool

True when convergence of the best fit of EM was reached, False otherwise.

n_iter_int

Number of step used by the best fit of EM to reach the convergence.

lower_bound_float

Lower bound value on the log-likelihood (of the training data with respect to the model) of the best fit of EM.

lower_bounds_array-like of shape (n_iter_,)

The list of lower bound values on the log-likelihood from each iteration of the best fit of EM.

n_features_in_int

Number of features seen during fit.

Added in version 0.24.

feature_names_in_ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

Added in version 1.0.

See also

BayesianGaussianMixture

Gaussian mixture model fit with a variational inference.

Examples

>>> import numpy as np
>>> from sklearn.mixture import GaussianMixture
>>> X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
>>> gm = GaussianMixture(n_components=2, random_state=0).fit(X)
>>> gm.means_
array([[10.,  2.],
       [ 1.,  2.]])
>>> gm.predict([[0, 0], [12, 3]])
array([1, 0])

For a comparison of Gaussian Mixture with other clustering algorithms, see sphx_glr_auto_examples_cluster_plot_cluster_comparison.py

__init__(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)#
aic(X)#

Akaike information criterion for the current model on the input X.

You can refer to this mathematical section for more details regarding the formulation of the AIC used.

Parameters:
Xarray of shape (n_samples, n_dimensions)

The input samples.

Returns:
aicfloat

The lower the better.

bic(X)#

Bayesian information criterion for the current model on the input X.

You can refer to this mathematical section for more details regarding the formulation of the BIC used.

For an example of GMM selection using bic information criterion, refer to sphx_glr_auto_examples_mixture_plot_gmm_selection.py.

Parameters:
Xarray of shape (n_samples, n_dimensions)

The input samples.

Returns:
bicfloat

The lower the better.

class MoGMSTLifting(min_components=None, max_components=None, random_state=None, **kwargs)#

Bases: PointCloud2HypergraphLifting

Lift a point cloud to a hypergraph.

We find a Mixture of Gaussians and then create a Minimum Spanning Tree (MST) between the means of the Gaussians.

Parameters:
min_componentsint or None, optional

The minimum number of components for the Mixture of Gaussians model. It needs to be at least 1 (default: None).

max_componentsint or None, optional

The maximum number of components for the Mixture of Gaussians model. It needs to be greater or equal than min_components (default: None).

random_stateint, optional

The random state for the Mixture of Gaussians model (default: None).

**kwargsoptional

Additional arguments for the class.

__init__(min_components=None, max_components=None, random_state=None, **kwargs)#
find_mog(data)#

Find the best number of components for a Mixture of Gaussians model.

Parameters:
datanp.ndarray

The input data to be fitted.

Returns:
tuple[np.ndarray, int, np.ndarray]

The labels of the data, the number of components and the means of the components.

lift_topology(data)#

Lift the topology of a graph to a hypergraph.

Parameters:
datatorch_geometric.data.Data

The input data to be lifted.

Returns:
dict

The lifted topology.

class PointCloud2HypergraphLifting(**kwargs)#

Bases: PointCloudLifting

Abstract class for lifting pointclouds to hypergraphs.

Parameters:
**kwargsoptional

Additional arguments for the class.

__init__(**kwargs)#
from_numpy_array(A, parallel_edges=False, create_using=None)#

Returns a graph from a 2D NumPy array.

The 2D NumPy array is interpreted as an adjacency matrix for the graph.

Parameters:
Aa 2D numpy.ndarray

An adjacency matrix representation of a graph

parallel_edgesBoolean

If this is True, create_using is a multigraph, and A is an integer array, then entry (i, j) in the array is interpreted as the number of parallel edges joining vertices i and j in the graph. If it is False, then the entries in the array are interpreted as the weight of a single edge joining the vertices.

create_usingNetworkX graph constructor, optional (default=nx.Graph)

Graph type to create. If graph instance, then cleared before populated.

See also

to_numpy_array

Notes

For directed graphs, explicitly mention create_using=nx.DiGraph, and entry i,j of A corresponds to an edge from i to j.

If create_using is networkx.MultiGraph or networkx.MultiDiGraph, parallel_edges is True, and the entries of A are of type int, then this function returns a multigraph (of the same type as create_using) with parallel edges.

If create_using indicates an undirected multigraph, then only the edges indicated by the upper triangle of the array A will be added to the graph.

If the NumPy array has a single data type for each array entry it will be converted to an appropriate Python data type.

If the NumPy array has a user-specified compound data type the names of the data fields will be used as attribute keys in the resulting NetworkX graph.

Examples

Simple integer weights on edges:

>>> import numpy as np
>>> A = np.array([[1, 1], [2, 1]])
>>> G = nx.from_numpy_array(A)
>>> G.edges(data=True)
EdgeDataView([(0, 0, {'weight': 1}), (0, 1, {'weight': 2}), (1, 1, {'weight': 1})])

If create_using indicates a multigraph and the array has only integer entries and parallel_edges is False, then the entries will be treated as weights for edges joining the nodes (without creating parallel edges):

>>> A = np.array([[1, 1], [1, 2]])
>>> G = nx.from_numpy_array(A, create_using=nx.MultiGraph)
>>> G[1][1]
AtlasView({0: {'weight': 2}})

If create_using indicates a multigraph and the array has only integer entries and parallel_edges is True, then the entries will be treated as the number of parallel edges joining those two vertices:

>>> A = np.array([[1, 1], [1, 2]])
>>> temp = nx.MultiGraph()
>>> G = nx.from_numpy_array(A, parallel_edges=True, create_using=temp)
>>> G[1][1]
AtlasView({0: {'weight': 1}, 1: {'weight': 1}})

User defined compound data type on edges:

>>> dt = [("weight", float), ("cost", int)]
>>> A = np.array([[(1.0, 2)]], dtype=dt)
>>> G = nx.from_numpy_array(A)
>>> G.edges()
EdgeView([(0, 0)])
>>> G[0][0]["cost"]
2
>>> G[0][0]["weight"]
1.0
minimum_spanning_tree(G, weight='weight', algorithm='kruskal', ignore_nan=False)#

Returns a minimum spanning tree or forest on an undirected graph G.

Parameters:
Gundirected graph

An undirected graph. If G is connected, then the algorithm finds a spanning tree. Otherwise, a spanning forest is found.

weightstr

Data key to use for edge weights.

algorithmstring

The algorithm to use when finding a minimum spanning tree. Valid choices are ‘kruskal’, ‘prim’, or ‘boruvka’. The default is ‘kruskal’.

ignore_nanbool (default: False)

If a NaN is found as an edge weight normally an exception is raised. If ignore_nan is True then that edge is ignored instead.

Returns:
GNetworkX Graph

A minimum spanning tree or forest.

Notes

For Borůvka’s algorithm, each edge must have a weight attribute, and each edge weight must be distinct.

For the other algorithms, if the graph edges do not have a weight attribute a default weight of 1 will be used.

There may be more than one tree with the same minimum or maximum weight. See networkx.tree.recognition for more detailed definitions.

Isolated nodes with self-loops are in the tree as edgeless isolated nodes.

Examples

>>> G = nx.cycle_graph(4)
>>> G.add_edge(0, 3, weight=2)
>>> T = nx.minimum_spanning_tree(G)
>>> sorted(T.edges(data=True))
[(0, 1, {}), (1, 2, {}), (2, 3, {})]
pairwise_distances(X, Y=None, metric='euclidean', *, n_jobs=None, force_all_finite='deprecated', ensure_all_finite=None, **kwds)#

Compute the distance matrix from a feature array X and optional Y.

This function takes one or two feature arrays or a distance matrix, and returns a distance matrix.

  • If X is a feature array, of shape (n_samples_X, n_features), and:

    • Y is None and metric is not ‘precomputed’, the pairwise distances between X and itself are returned.

    • Y is a feature array of shape (n_samples_Y, n_features), the pairwise distances between X and Y is returned.

  • If X is a distance matrix, of shape (n_samples_X, n_samples_X), metric should be ‘precomputed’. Y is thus ignored and X is returned as is.

If the input is a collection of non-numeric data (e.g. a list of strings or a boolean array), a custom metric must be passed.

This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.

Valid values for metric are:

  • From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘nan_euclidean’]. All metrics support sparse matrix inputs except ‘nan_euclidean’.

  • From scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]. These metrics do not support sparse matrix inputs.

Note

‘kulsinski’ is deprecated from SciPy 1.9 and will be removed in SciPy 1.11.

Note

‘matching’ has been removed in SciPy 1.9 (use ‘hamming’ instead).

Note that in the case of ‘cityblock’, ‘cosine’ and ‘euclidean’ (which are valid scipy.spatial.distance metrics), the scikit-learn implementation will be used, which is faster and has support for sparse matrices (except for ‘cityblock’). For a verbose description of the metrics from scikit-learn, see sklearn.metrics.pairwise.distance_metrics() function.

Read more in the User Guide.

Parameters:
X{array-like, sparse matrix} of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)

Array of pairwise distances between samples, or a feature array. The shape of the array should be (n_samples_X, n_samples_X) if metric == “precomputed” and (n_samples_X, n_features) otherwise.

Y{array-like, sparse matrix} of shape (n_samples_Y, n_features), default=None

An optional second feature array. Only allowed if metric != “precomputed”.

metricstr or callable, default=’euclidean’

The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist() for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.

n_jobsint, default=None

The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them using multithreading.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

The “euclidean” and “cosine” metrics rely heavily on BLAS which is already multithreaded. So, increasing n_jobs would likely cause oversubscription and quickly degrade performance.

force_all_finitebool or ‘allow-nan’, default=True

Whether to raise an error on np.inf, np.nan, pd.NA in array. Ignored for a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. The possibilities are:

  • True: Force all values of array to be finite.

  • False: accepts np.inf, np.nan, pd.NA in array.

  • ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.

Added in version 0.22: force_all_finite accepts the string 'allow-nan'.

Changed in version 0.23: Accepts pd.NA and converts it into np.nan.

Deprecated since version 1.6: force_all_finite was renamed to ensure_all_finite and will be removed in 1.8.

ensure_all_finitebool or ‘allow-nan’, default=True

Whether to raise an error on np.inf, np.nan, pd.NA in array. Ignored for a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. The possibilities are:

  • True: Force all values of array to be finite.

  • False: accepts np.inf, np.nan, pd.NA in array.

  • ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.

Added in version 1.6: force_all_finite was renamed to ensure_all_finite.

**kwdsoptional keyword parameters

Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Returns:
Dndarray of shape (n_samples_X, n_samples_X) or (n_samples_X, n_samples_Y)

A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.

See also

pairwise_distances_chunked

Performs the same calculation as this function, but returns a generator of chunks of the distance matrix, in order to limit memory usage.

sklearn.metrics.pairwise.paired_distances

Computes the distances between corresponding elements of two arrays.

Notes

If metric is a callable, no restrictions are placed on X and Y dimensions.

Examples

>>> from sklearn.metrics.pairwise import pairwise_distances
>>> X = [[0, 0, 0], [1, 1, 1]]
>>> Y = [[1, 0, 0], [1, 1, 0]]
>>> pairwise_distances(X, Y, metric='sqeuclidean')
array([[1., 2.],
       [2., 1.]])