hoops_ai.dataset.dataset_explorer

Functions

process_bins(data_array, bin_edges, ...)

Helper function used by create_distribution and create_distribution_incore to group file_id_codes into numeric bins.

Classes

DatasetExplorer([flow_output_file, ...])

Provides methods to explore queries/filters on a merged Zarr dataset and retrieve metadata from a Parquet file.

class hoops_ai.dataset.dataset_explorer.DatasetExplorer(flow_output_file=None, merged_store_path=None, parquet_file_path=None, parquet_file_attribs=None, dask_client_params=None)

Bases: object

Provides methods to explore queries/filters on a merged Zarr dataset and retrieve metadata from a Parquet file. This class focuses on read/analysis logic.

Parameters:

flow_output_file (str | None)
merged_store_path (str | None)
parquet_file_path (str | None)
parquet_file_attribs (str | None)
dask_client_params (Dict[str, Any] | None)

available_arrays(group_name)

Returns the set of available arrays in the specified group.

Parameters:

group_name (str): Name of the group to inspect

Returns:

set: Set of array names in the group

available_groups()

Returns the set of available groups in the dataset.

Return type: set

build_membership_matrix(group, key, bins_or_categories, as_counts=False)

Builds a file-by-bin matrix (or file-by-category) for a given numeric or categorical variable. If as_counts=False, each cell is 1 if the file has at least one item in that bin/category, else 0. If as_counts=True, each cell is the count of items in that bin/category.

Parameters:

group (str)
key (str)
bins_or_categories (int | List | numpy.ndarray)
as_counts (bool)

Return type:

tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
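The difference between the two modes of as_counts can be illustrated with a small, library-independent sketch. It does not use hoops_ai and is not the actual implementation: the function name membership_matrix, the plain-list inputs, and the order of the returned tuple are all assumptions made for illustration.

```python
from collections import Counter


def membership_matrix(file_codes, labels, categories, as_counts=False):
    """Build a file-by-category matrix from parallel per-item sequences.

    file_codes[i] is the file that item i belongs to; labels[i] is its category.
    Hypothetical helper: it only mirrors the documented semantics of as_counts.
    """
    files = sorted(set(file_codes))
    counts = Counter(zip(file_codes, labels))
    matrix = []
    for f in files:
        row = [counts[(f, c)] for c in categories]
        if not as_counts:
            # Presence/absence: 1 if the file has at least one item in the category.
            row = [1 if n > 0 else 0 for n in row]
        matrix.append(row)
    return matrix, files, list(categories)


# Made-up data: six items spread over three files.
file_codes = [0, 0, 0, 1, 1, 2]
labels = ["plane", "plane", "cylinder", "cone", "cone", "plane"]
categories = ["plane", "cylinder", "cone"]

presence, files, cols = membership_matrix(file_codes, labels, categories)
counts, _, _ = membership_matrix(file_codes, labels, categories, as_counts=True)
```

With this data, file 0 contributes the row [1, 1, 0] in presence mode and [2, 1, 0] in count mode.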

close(close_dask=True)

Close all resources used by the DatasetExplorer, including ZipStores and Dask resources.

Parameters:

close_dask (bool): Whether to also close Dask resources

Return type: None
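This page does not document DatasetExplorer as a context manager, so one way to make sure close() always runs is to wrap the object in contextlib.closing from the standard library. The sketch below assumes only the documented available_groups() and close() methods; count_groups and _StubExplorer are hypothetical names, and the stub stands in for a real explorer so the example runs without a dataset.

```python
from contextlib import closing


def count_groups(explorer):
    """Return the number of groups, closing the explorer afterwards."""
    # closing() calls explorer.close() on exit, even if an exception is raised.
    # close() is called without arguments, so close_dask keeps its default (True).
    with closing(explorer) as ex:
        return len(ex.available_groups())


class _StubExplorer:
    """Stand-in exposing the two documented methods used above (illustration only)."""

    def __init__(self):
        self.closed = False

    def available_groups(self):
        return {"faces", "edges"}

    def close(self, close_dask=True):
        self.closed = True


stub = _StubExplorer()
n_groups = count_groups(stub)
```

If Dask resources should stay alive for later use, calling close(close_dask=False) explicitly in a try/finally block is the alternative.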

close_all_file_stores()

Closes all currently opened single-file ZipStores.

Return type: None

close_dask_resources(close_global=False)

Close the Dask client and cluster that this class manages. If close_global is True, also close any globally active Dask client.

Parameters: close_global (bool)
Return type: None

close_file_store(file_id)

Closes a previously opened single-file ZipStore for the given file_id.

Parameters: file_id (str)
Return type: None

create_distribution(key, bins=10, group=None)

Uses Dask to compute histogram distribution of key in the specified group, then returns the bin edges, histogram counts, and the file_id_codes in each bin.

Parameters:

key (str): Name of the array to analyze
bins (int): Number of bins for the histogram
group (str, optional): Group name to search in. If None, searches all available groups.

Return type:

Dict[str, Any]
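The idea of a histogram that also records which files fall into each bin can be sketched in plain Python. This is a conceptual illustration, not the Dask-based implementation: the function name distribution_with_files and the dictionary keys "bin_edges", "hist", and "files" are assumptions, since the documented return type is only Dict[str, Any].

```python
def distribution_with_files(values, file_codes, bins=10):
    """Equal-width histogram plus the file codes that fall in each bin.

    values[i] is a numeric item and file_codes[i] is the file it came from.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    edges = [lo + i * width for i in range(bins + 1)]
    hist = [0] * bins
    files_per_bin = [set() for _ in range(bins)]
    for v, code in zip(values, file_codes):
        # Bins are half-open; the last bin also includes its right edge,
        # which matches the convention of numpy.histogram.
        idx = min(int((v - lo) / width), bins - 1)
        hist[idx] += 1
        files_per_bin[idx].add(code)
    return {
        "bin_edges": edges,
        "hist": hist,
        "files": [sorted(s) for s in files_per_bin],
    }


# Made-up data: five values coming from three files.
result = distribution_with_files(
    values=[0.0, 1.0, 2.0, 3.0, 4.0],
    file_codes=[0, 0, 1, 1, 2],
    bins=2,
)
```

Here the two bins cover [0, 2) and [2, 4], so the first bin holds two items from file 0 and the second holds three items from files 1 and 2.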

create_distribution_incore(key, bins=10, group=None)

Non-Dask version: loads the data fully into memory, then computes the histogram and bin mapping.

Parameters:

key (str): Name of the array to analyze
bins (int): Number of bins for the histogram
group (str, optional): Group name to search in. If None, searches all available groups.

decode_file_id_code(code)

Converts an integer file_id_code back to the file ID (string) if we loaded them from Parquet or have them in memory.

Parameters: code (int)
Return type: str

edgegroup()

Access the ‘edges’ group. Maintained for backward compatibility.

Return type: xarray.Dataset

facefacegroup()

Access the ‘faceface’ group. Maintained for backward compatibility.

Return type: xarray.Dataset

facegroup()

Access the ‘faces’ group. Maintained for backward compatibility.

Return type: xarray.Dataset

file_dataset(file_id_code, group)

Returns a sub-Dataset filtered by the specified file_id_code within a group. Automatically detects dimension names for custom groups.

Parameters:

file_id_code (int)
group (str)

Return type:

xarray.Dataset

filegroup()

Access the ‘file’ group. Maintained for backward compatibility.

Return type: xarray.Dataset

filter_by_array_condition(group_name, array_name, condition)

Generic filtering method that works with any array in any group.

Parameters:

group_name (str): Name of the group containing the array
array_name (str): Name of the array to filter on
condition (Callable[[xarray.DataArray], xarray.DataArray]): Function that takes a DataArray and returns a boolean mask

Returns:

List[int]: List of file_id_codes that match the condition

find_subgraph(reference_graph, chunk_size=1000)

Example approach for subgraph-isomorphism, scanning each file code in ‘graphgroup’.

Parameters:

reference_graph (networkx.Graph)
chunk_size (int)

Return type:

list

get_array_data(group_name, array_name)

Get data for a specific array within a group.

Parameters:

group_name (str): Name of the group containing the array
array_name (str): Name of the array to retrieve

Returns:

xarray.DataArray | None: The data array, or None if not found

get_array_statistics(group_name, array_name)

Compute basic statistics for any numeric array.

Parameters:

group_name (str): Name of the group containing the array
array_name (str): Name of the array to analyze

Returns:

Dict[str, Any]: Dictionary containing basic statistics (min, max, mean, std, etc.)

get_descriptions(table_name, key_id=None, use_wildchar=False)

Retrieves a metadata table from Parquet, for example “label_descriptions”, “face_types”, or “edge_types”.

Parameters:

table_name (str)
key_id (int | None)
use_wildchar (bool | None)

Return type:

pandas.DataFrame

get_file_info_all()

Loads the entire ‘file_info’ table from Parquet (once), returning it as a DataFrame.

Return type: pandas.DataFrame

get_file_list(group, where)

Returns unique file_id_codes matching a provided boolean filter in the specified group.

Parameters:

group (str)
where (Callable[[xarray.Dataset], xarray.DataArray])

Return type:

List[str]

get_group_data(group_name)

Get the complete dataset for a specified group.

Parameters:

group_name (str): Name of the group to retrieve

Returns:

xarray.Dataset | None: The dataset for the group, or None if not found

get_parquet_info_by_code(file_id_code)

Returns rows from the Parquet file matching a given file_id_code, or None if none found.

Parameters: file_id_code (int)

get_stream_cache_paths(file_id_code=None)

Get stream cache paths (PNG images and 3D representations) for files.

Parameters:

file_id_code (int | None): If provided, returns cache paths only for this specific file ID code. If None, returns cache paths for all files.

Returns:

pandas.DataFrame: DataFrame with columns id, name, stream_cache_png, stream_cache_3d

global_cluster = None

global_dask_client = None

graphgroup()

Access the ‘graph’ group. Maintained for backward compatibility.

Return type: xarray.Dataset

print_dataset_structure()

Print a comprehensive overview of the dataset structure.

Return type: None

print_table_of_contents()

Generic method that enumerates the dataset contents, printing shapes, etc. Works with any discovered groups and arrays.

query_cross_group(primary_group, secondary_group, join_strategy='file_id_code')

Perform cross-group queries by joining data from different groups.

Parameters:

primary_group (str): Name of the primary group
secondary_group (str): Name of the secondary group to join with
join_strategy (str): Strategy for joining (‘file_id_code’ for now)

Returns:

xarray.Dataset | None: Combined dataset with data from both groups
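Joining two groups on file_id_code can be pictured as a key-based join over per-file records. The sketch below is a conceptual illustration in plain Python, not the xarray-based implementation: the function name join_on_file_code, the record fields, and the choice of an inner join (keeping only files present in both groups) are assumptions, since this page does not state how unmatched files are handled.

```python
def join_on_file_code(primary, secondary):
    """Inner-join two {file_id_code: record} mappings on their shared keys."""
    shared = sorted(primary.keys() & secondary.keys())
    # Fields from the secondary record are merged into the primary record.
    return {code: {**primary[code], **secondary[code]} for code in shared}


# Made-up per-file summaries for two groups.
faces = {0: {"n_faces": 12}, 1: {"n_faces": 30}, 2: {"n_faces": 6}}
edges = {1: {"n_edges": 48}, 2: {"n_edges": 12}, 3: {"n_edges": 4}}

joined = join_on_file_code(faces, edges)
```

Only file codes 1 and 2 appear in both mappings, so only those two survive the join.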

classmethod start_cluster(dask_client_params=None)

Starts the global Dask client if not already running. If dask_client_params is provided, uses those parameters; otherwise, starts a default LocalCluster. Returns the global Dask client.

Parameters: dask_client_params (Dict[str, Any] | None)

hoops_ai.dataset.dataset_explorer.process_bins(data_array, bin_edges, file_id_codes, code_to_file_id)

Helper function used by create_distribution and create_distribution_incore to group file_id_codes into numeric bins.
