hoops_ai.dataset.dataset_explorer

Functions

process_bins(data_array, bin_edges, ...)

Helper function used by create_distribution and create_distribution_incore to group file_id_codes into numeric bins.

Classes

DatasetExplorer([flow_output_file, ...])

Provides methods to explore queries/filters on a merged Zarr dataset and retrieve metadata from a Parquet file.

class hoops_ai.dataset.dataset_explorer.DatasetExplorer(flow_output_file=None, merged_store_path=None, parquet_file_path=None, parquet_file_attribs=None, dask_client_params=None)

Bases: object

Provides methods to explore queries/filters on a merged Zarr dataset and retrieve metadata from a Parquet file. This class focuses on read/analysis logic.

Parameters:
  • flow_output_file (str | None)

  • merged_store_path (str | None)

  • parquet_file_path (str | None)

  • parquet_file_attribs (str | None)

  • dask_client_params (Dict[str, Any] | None)

available_arrays(group_name)

Returns the set of available arrays in the specified group.

Parameters:

group_name (str): Name of the group to inspect

Returns:

Set of array names in the group

Return type:

set

available_groups()

Returns the set of available groups in the dataset.

Return type:

set

build_membership_matrix(group, key, bins_or_categories, as_counts=False)

Builds a file-by-bin matrix (or file-by-category) for a given numeric or categorical variable. If as_counts=False, each cell is 1 if the file has at least one item in that bin/category, else 0. If as_counts=True, each cell is the count of items in that bin/category.

Return type:

tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
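As a rough illustration of the as_counts semantics, the sketch below builds a file-by-bin matrix in plain Python. It is not the library's implementation: the helper name `membership_matrix`, the flat list of `(file_id_code, value)` pairs, the simplified two-value return, and the half-open bin convention are all assumptions made for this example.

```python
from bisect import bisect_right


def membership_matrix(items, bin_edges, as_counts=False):
    """Build a file-by-bin matrix from (file_id_code, value) pairs.

    Hypothetical helper for illustration only. Bins are half-open,
    [edge_i, edge_{i+1}), except the last bin, which also includes its
    right edge (an assumed convention).
    """
    file_codes = sorted({code for code, _ in items})
    row_of = {code: row for row, code in enumerate(file_codes)}
    n_bins = len(bin_edges) - 1
    matrix = [[0] * n_bins for _ in file_codes]

    for code, value in items:
        if value < bin_edges[0] or value > bin_edges[-1]:
            continue  # value falls outside every bin
        # bisect_right locates the bin; the right-most edge is clamped
        # into the last bin
        col = min(bisect_right(bin_edges, value) - 1, n_bins - 1)
        if as_counts:
            matrix[row_of[code]][col] += 1
        else:
            matrix[row_of[code]][col] = 1
    return matrix, file_codes


# Two files: file 0 has three items, file 1 has one item
items = [(0, 0.5), (0, 0.7), (0, 2.5), (1, 1.5)]
edges = [0.0, 1.0, 2.0, 3.0]

flags, codes = membership_matrix(items, edges)
counts, _ = membership_matrix(items, edges, as_counts=True)
```

With as_counts=False, file 0 gets a single 1 in the first bin even though two of its items fall there; with as_counts=True the same cell holds 2.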

close(close_dask=True)

Close all resources used by the DatasetExplorer, including ZipStores and Dask resources.

Parameters:

close_dask (bool): Whether to also close Dask resources

Return type:

None
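Because close() has no required arguments, an explorer can be wrapped in the standard library's contextlib.closing so that resources are released even if analysis code raises. The sketch below uses a hypothetical stand-in class, `StandInExplorer`, that mimics only the documented close() signature, so that it runs without the library installed; with the real class, the `with` line would wrap a DatasetExplorer instance instead.

```python
from contextlib import closing


class StandInExplorer:
    """Hypothetical stand-in that mimics only the documented close() signature."""

    def __init__(self):
        self.closed = False
        self.dask_closed = False

    def close(self, close_dask=True):
        self.closed = True
        self.dask_closed = close_dask


explorer = StandInExplorer()
try:
    with closing(explorer) as ex:
        # Analysis code would go here; a failure must not leak resources
        raise RuntimeError("simulated failure during analysis")
except RuntimeError:
    pass

# closing() called ex.close() with no arguments, so close_dask kept its default
print(explorer.closed, explorer.dask_closed)
```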

close_all_file_stores()

Closes all currently opened single-file ZipStores.

Return type:

None

close_dask_resources(close_global=False)

Close the Dask client and cluster that this class manages. If close_global is True, also close any globally active Dask client.

Parameters:

close_global (bool)

Return type:

None

close_file_store(file_id)

Closes a previously opened single-file ZipStore for the given file_id.

Parameters:

file_id (str)

Return type:

None

create_distribution(key, bins=10, group=None)

Uses Dask to compute histogram distribution of key in the specified group, then returns the bin edges, histogram counts, and the file_id_codes in each bin.

Parameters:
  • key (str): Name of the array to analyze

  • bins (int): Number of bins for the histogram

  • group (str, optional): Group name to search in. If None, searches all available groups.

Return type:

Dict[str, Any]
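The idea of pairing histogram counts with the files that fall in each bin can be sketched in plain Python. This is an in-memory illustration, not the Dask-based implementation; the function name `distribution_with_files`, the result keys, and the equal-width binning are assumptions, since the source does not specify the layout of the returned dictionary.

```python
def distribution_with_files(values, file_id_codes, bins=10):
    """Histogram `values` and record which file_id_codes land in each bin.

    Hypothetical helper: values[i] is assumed to belong to the file
    file_id_codes[i]. Bins are equal-width over [min, max].
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    bin_edges = [lo + i * width for i in range(bins + 1)]
    hist = [0] * bins
    files_in_bin = [set() for _ in range(bins)]

    for value, code in zip(values, file_id_codes):
        # The maximum value is clamped into the last bin
        idx = min(int((value - lo) / width), bins - 1)
        hist[idx] += 1
        files_in_bin[idx].add(code)

    return {
        "bin_edges": bin_edges,
        "hist": hist,
        "file_id_codes_in_bins": [sorted(codes) for codes in files_in_bin],
    }


values = [0.0, 1.0, 2.0, 3.0, 4.0]
codes = [7, 7, 8, 9, 9]
result = distribution_with_files(values, codes, bins=2)
```

Here the two bins cover [0, 2) and [2, 4]; file 7 contributes both items of the first bin, while files 8 and 9 share the second.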

create_distribution_incore(key, bins=10, group=None)

Non-Dask version: load data fully into memory, then compute histogram and bin mapping.

Parameters:
  • key (str): Name of the array to analyze

  • bins (int): Number of bins for the histogram

  • group (str, optional): Group name to search in. If None, searches all available groups.

decode_file_id_code(code)

Converts an integer file_id_code back to its file ID (string), provided the file IDs were loaded from Parquet or are already in memory.

Parameters:

code (int)

Return type:

str
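A minimal sketch of the kind of integer coding this implies: each file ID string is paired with a compact integer code, and decoding is a lookup in the inverse mapping. The helper `build_code_table`, the sorted-order assignment, and the sample file names are assumptions; the source does not state how codes are assigned. The name `code_to_file_id` follows the process_bins parameter of the same name.

```python
def build_code_table(file_ids):
    """Assign consecutive integer codes to unique file IDs (assumed scheme)."""
    unique_ids = sorted(set(file_ids))
    file_id_to_code = {fid: code for code, fid in enumerate(unique_ids)}
    code_to_file_id = {code: fid for fid, code in file_id_to_code.items()}
    return file_id_to_code, code_to_file_id


# Illustrative file IDs; duplicates map to the same code
file_id_to_code, code_to_file_id = build_code_table(
    ["bracket.step", "gear.step", "bracket.step"]
)

# Decoding is a plain dictionary lookup on the inverse mapping
decoded = code_to_file_id[file_id_to_code["gear.step"]]
```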

edgegroup()

Access the ‘edges’ group. Maintained for backward compatibility.

Return type:

xarray.Dataset

facefacegroup()

Access the ‘faceface’ group. Maintained for backward compatibility.

Return type:

xarray.Dataset

facegroup()

Access the ‘faces’ group. Maintained for backward compatibility.

Return type:

xarray.Dataset

file_dataset(file_id_code, group)

Returns a sub-Dataset filtered by the specified file_id_code within a group.

Parameters:
  • file_id_code (int)

  • group (str)

Return type:

xarray.Dataset

filegroup()

Access the ‘file’ group. Maintained for backward compatibility.

Return type:

xarray.Dataset

filter_by_array_condition(group_name, array_name, condition)

Generic filtering method that works with any array in any group.

Parameters:
  • group_name (str): Name of the group containing the array

  • array_name (str): Name of the array to filter on

  • condition (Callable[[xarray.DataArray], xarray.DataArray]): Function that takes a DataArray and returns a boolean mask

Returns:

List of file_id_codes that match the condition

Return type:

List[int]
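The condition-callable pattern can be illustrated without xarray. In this sketch a plain list stands in for the DataArray and the mask is a list of booleans; `filter_codes_by_condition` and the sample data are hypothetical.

```python
from typing import Callable, List, Sequence


def filter_codes_by_condition(
    values: Sequence[float],
    file_id_codes: Sequence[int],
    condition: Callable[[Sequence[float]], Sequence[bool]],
) -> List[int]:
    """Return the unique file_id_codes whose items satisfy `condition`.

    Hypothetical stand-in: `condition` receives the whole array and returns
    one boolean per item, mirroring the mask-returning callable described above.
    """
    mask = condition(values)
    matching = {code for code, keep in zip(file_id_codes, mask) if keep}
    return sorted(matching)


face_areas = [0.2, 5.0, 7.5, 0.1]  # illustrative values, one per face
face_files = [0, 0, 1, 2]          # file_id_code of each face

large = filter_codes_by_condition(
    face_areas, face_files, lambda arr: [a > 1.0 for a in arr]
)
```

With the real class the callable operates on an xarray.DataArray, where a comparison such as `lambda arr: arr > 1.0` already yields a boolean DataArray.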

find_subgraph(reference_graph, chunk_size=1000)

Example approach for subgraph-isomorphism, scanning each file code in ‘graphgroup’.

Return type:

list

get_array_data(group_name, array_name)

Get data for a specific array within a group.

Parameters:
  • group_name (str): Name of the group containing the array

  • array_name (str): Name of the array to retrieve

Returns:

The data array, or None if not found

Return type:

xarray.DataArray | None

get_array_statistics(group_name, array_name)

Compute basic statistics for any numeric array.

Parameters:
  • group_name (str): Name of the group containing the array

  • array_name (str): Name of the array to analyze

Returns:

Dictionary containing basic statistics (min, max, mean, std, etc.)

Return type:

Dict[str, Any]
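For orientation, the statistics named above can be computed with the standard library as follows. The dictionary keys, the extra `count` entry, and the use of the population standard deviation are assumptions; the source lists the statistics only as "min, max, mean, std, etc.".

```python
import statistics


def basic_statistics(values):
    """Compute a small statistics dictionary for a numeric sequence.

    Hypothetical helper; the keys and the population standard deviation
    (rather than the sample one) are assumed choices.
    """
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "count": len(values),
    }


stats = basic_statistics([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```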

get_descriptions(table_name, key_id=None, use_wildchar=False)

Retrieves a metadata table from Parquet, e.g. “label_descriptions”, “face_types”, or “edge_types”.

Parameters:
  • table_name (str)

  • key_id (int | None)

  • use_wildchar (bool | None)

Return type:

pandas.DataFrame

get_file_info_all()

Loads the entire ‘file_info’ table from Parquet (once), returning it as a DataFrame.

Return type:

pandas.DataFrame

get_file_list(group, where)

Returns unique file_id_codes matching a provided boolean filter in the specified group.

Parameters:
  • group (str)

  • where (Callable[[xarray.Dataset], xarray.DataArray])

Return type:

List[str]

get_group_data(group_name)

Get the complete dataset for a specified group.

Parameters:

group_name (str): Name of the group to retrieve

Returns:

The dataset for the group, or None if not found

Return type:

xarray.Dataset | None

get_parquet_info_by_code(file_id_code)

Returns rows from the Parquet file matching a given file_id_code, or None if none found.

Parameters:

file_id_code (int)

get_stream_cache_paths(file_id_code=None)

Get stream cache paths (PNG images and 3D representations) for files.

Parameters:

file_id_code (int | None): If provided, returns cache paths only for this specific file ID code. If None, returns cache paths for all files.

Returns:

DataFrame with columns: id, name, stream_cache_png, stream_cache_3d

Return type:

pandas.DataFrame

global_cluster = None

global_dask_client = None

graphgroup()

Access the ‘graph’ group. Maintained for backward compatibility.

Return type:

xarray.Dataset

print_dataset_structure()

Print a comprehensive overview of the dataset structure.

Return type:

None

print_table_of_contents()

Generic method enumerating dataset contents, printing shapes, etc. Works with any discovered groups and arrays.

query_cross_group(primary_group, secondary_group, join_strategy='file_id_code')

Perform cross-group queries by joining data from different groups.

Parameters:
  • primary_group (str): Name of the primary group

  • secondary_group (str): Name of the secondary group to join with

  • join_strategy (str): Strategy for joining (currently only ‘file_id_code’)

Returns:

Combined dataset with data from both groups

Return type:

xarray.Dataset | None
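Joining on file_id_code can be pictured as an inner join between per-file records of the two groups. The sketch below uses plain dictionaries keyed by file_id_code; the function name `join_on_file_id_code`, the inner-join behaviour, and the sample fields are hypothetical, and the real method returns an xarray.Dataset rather than a dictionary.

```python
def join_on_file_id_code(primary, secondary):
    """Inner-join two {file_id_code: record} mappings.

    Hypothetical helper. Only codes present in both mappings are kept
    (an assumed inner-join behaviour); fields from `secondary` are
    prefixed to avoid name clashes.
    """
    joined = {}
    for code in sorted(primary.keys() & secondary.keys()):
        row = dict(primary[code])
        row.update({f"secondary_{k}": v for k, v in secondary[code].items()})
        joined[code] = row
    return joined


# Illustrative per-file records for two groups
faces = {0: {"face_count": 12}, 1: {"face_count": 30}}
edges = {1: {"edge_count": 54}, 2: {"edge_count": 8}}

combined = join_on_file_id_code(faces, edges)
```

Only file 1 appears in both inputs, so it is the only row in the joined result.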

classmethod start_cluster(dask_client_params=None)

Starts the global Dask client if not already running. If dask_client_params is provided, uses those parameters; otherwise, starts a default LocalCluster. Returns the global Dask client.

Parameters:

dask_client_params (Dict[str, Any] | None)

hoops_ai.dataset.dataset_explorer.process_bins(data_array, bin_edges, file_id_codes, code_to_file_id)

Helper function used by create_distribution and create_distribution_incore to group file_id_codes into numeric bins.