hoops_ai.dataset.DatasetExplorer
- class hoops_ai.dataset.DatasetExplorer(flow_output_file=None, merged_store_path=None, parquet_file_path=None, parquet_file_attribs=None, dask_client_params=None)
Bases: object
Provides methods to run exploratory queries and filters on a merged Zarr dataset and to retrieve metadata from a Parquet file. This class focuses on read and analysis logic.
- Parameters:
- available_arrays(group_name)
Returns the set of available arrays in the specified group.
- build_membership_matrix(group, key, bins_or_categories, as_counts=False)
Builds a file-by-bin matrix (or file-by-category) for a given numeric or categorical variable. If as_counts=False, each cell is 1 if the file has at least one item in that bin/category, else 0. If as_counts=True, each cell is the count of items in that bin/category.
- Parameters:
- Return type:
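The difference between the two cell modes of build_membership_matrix can be illustrated on toy data. The sketch below is plain Python and is not the library's implementation. The input layout (a list of (file_id_code, value) pairs), the half-open bin convention, and the helper name membership_matrix are all assumptions made for illustration.

```python
from bisect import bisect_right

def membership_matrix(items, bin_edges, as_counts=False):
    """Build a file-by-bin matrix from (file_id_code, value) pairs.

    Hypothetical helper: bins are half-open [edge[i], edge[i+1]), with the
    last bin closed on the right. Values outside the edges are ignored.
    """
    n_bins = len(bin_edges) - 1
    file_codes = sorted({code for code, _ in items})
    row_of = {code: i for i, code in enumerate(file_codes)}
    matrix = [[0] * n_bins for _ in file_codes]
    for code, value in items:
        if value < bin_edges[0] or value > bin_edges[-1]:
            continue
        # bisect_right returns the index of the first edge greater than value.
        b = min(bisect_right(bin_edges, value) - 1, n_bins - 1)
        if as_counts:
            matrix[row_of[code]][b] += 1   # count every item in the bin
        else:
            matrix[row_of[code]][b] = 1    # presence flag only
    return file_codes, matrix

items = [(0, 0.5), (0, 0.7), (0, 2.5), (1, 1.5)]
edges = [0.0, 1.0, 2.0, 3.0]
codes, presence = membership_matrix(items, edges)
_, counts = membership_matrix(items, edges, as_counts=True)
```

Here file 0 has two items in the first bin, so its first cell is 1 in the presence matrix and 2 in the count matrix.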
- close(close_dask=True)
Close all resources used by the DatasetExplorer, including ZipStores and Dask resources.
- Parameters:
close_dask (bool) – Whether to also close Dask resources
- Return type:
None
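Because close() can be called without arguments, the standard-library contextlib.closing helper is one way to make sure cleanup happens even if a query raises. The sketch below uses a stand-in class (FakeExplorer, hypothetical) so that it runs without the library. With a real DatasetExplorer instance the with-statement would presumably look the same.

```python
from contextlib import closing

class FakeExplorer:
    """Hypothetical stand-in exposing only the documented close() signature."""

    def __init__(self):
        self.closed = False
        self.dask_closed = False

    def close(self, close_dask=True):
        self.closed = True
        self.dask_closed = close_dask

explorer = FakeExplorer()
with closing(explorer) as ex:
    # Queries against the dataset would go here.
    pass
# On exit, closing() has called ex.close() with its default arguments.
```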
- close_all_file_stores()
Closes all currently opened single-file ZipStores.
- Return type:
None
- close_dask_resources(close_global=False)
Close the Dask client and cluster that this class manages. If close_global is True, also close any globally active Dask client.
- Parameters:
close_global (bool)
- Return type:
None
- close_file_store(file_id)
Closes the previously opened single-file ZipStore for the given file_id.
- Parameters:
file_id (str)
- Return type:
None
- create_distribution(key, bins=10, group=None)
Uses Dask to compute histogram distribution of key in the specified group, then returns the bin edges, histogram counts, and the file_id_codes in each bin.
- create_distribution_incore(key, bins=10, group=None)
Non-Dask version of create_distribution: loads the data fully into memory, then computes the histogram and the bin mapping.
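Both distribution methods are described as producing bin edges, histogram counts, and the file_id_codes found in each bin. A minimal in-memory sketch of that computation follows. It is plain Python, not the library's code, and it assumes equal-width bins and parallel lists of values and file codes. The helper name distribution_incore is hypothetical.

```python
def distribution_incore(values, file_codes, bins=10):
    """Return (bin_edges, counts, codes_per_bin) for equal-width bins.

    Hypothetical helper: values[i] belongs to the file file_codes[i].
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    codes_per_bin = [set() for _ in range(bins)]
    for value, code in zip(values, file_codes):
        # Clamp so the maximum value lands in the last bin, not one past it.
        b = min(int((value - lo) / width), bins - 1)
        counts[b] += 1
        codes_per_bin[b].add(code)
    return edges, counts, [sorted(s) for s in codes_per_bin]

values = [0.0, 1.0, 2.0, 3.0, 4.0]
file_codes = [0, 0, 1, 1, 2]
edges, counts, codes_per_bin = distribution_incore(values, file_codes, bins=2)
```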
- decode_file_id_code(code)
Converts an integer file_id_code back to the file ID (string), provided the file IDs were loaded from Parquet or are already in memory.
- edgegroup()
Access the ‘edges’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- facefacegroup()
Access the ‘faceface’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- facegroup()
Access the ‘faces’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- file_dataset(file_id_code, group)
Returns a sub-Dataset filtered by the specified file_id_code within a group. Automatically detects dimension names for custom groups.
- filegroup()
Access the ‘file’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- filter_by_array_condition(group_name, array_name, condition)
Generic filtering method that works with any array in any group.
- Parameters:
- Returns:
List of file_id_codes that match the condition
- Return type:
List[int]
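The parameter descriptions for filter_by_array_condition are not available here, so the exact form of condition is not confirmed. Purely as an illustration, the sketch below assumes condition is a predicate applied to each array value and that every value carries a file_id_code. The helper name filter_codes and the toy data are hypothetical.

```python
def filter_codes(values, file_codes, condition):
    """Return sorted unique file codes with at least one value satisfying condition.

    Hypothetical helper; condition is assumed to be a callable value -> bool.
    """
    return sorted({code for value, code in zip(values, file_codes) if condition(value)})

# Toy data: one area value per face, with the file code each face belongs to.
face_areas = [0.2, 5.0, 0.1, 7.5, 0.3]
codes = [0, 0, 1, 2, 2]
matches = filter_codes(face_areas, codes, lambda area: area > 1.0)
```

Files 0 and 2 each contain at least one face with an area above 1.0, so both codes are returned once.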
- find_subgraph(reference_graph, chunk_size=1000)
Example approach for subgraph-isomorphism, scanning each file code in ‘graphgroup’.
- Parameters:
reference_graph (networkx.Graph)
chunk_size (int)
- Return type:
- get_array_data(group_name, array_name)
Get data for a specific array within a group.
- get_array_statistics(group_name, array_name)
Compute basic statistics for any numeric array.
- get_descriptions(table_name, key_id=None, use_wildchar=False)
Retrieves metadata from the Parquet file for the given table_name, e.g. “label_descriptions”, “face_types”, or “edge_types”.
- get_file_info_all()
Loads the entire ‘file_info’ table from Parquet (once), returning it as a DataFrame.
- Return type:
pandas.DataFrame
- get_file_list(group, where)
Returns unique file_id_codes matching a provided boolean filter in the specified group.
- get_group_data(group_name)
Get the complete dataset for a specified group.
- Parameters:
group_name (str) – Name of the group to retrieve
- Returns:
The dataset for the group, or None if not found
- Return type:
xarray.Dataset | None
- get_parquet_info_by_code(file_id_code)
Returns rows from the Parquet file matching a given file_id_code, or None if none found.
- Parameters:
file_id_code (int)
- get_stream_cache_paths(file_id_code=None)
Get stream cache paths (PNG images and 3D representations) for files.
- Parameters:
file_id_code (int | None) – If provided, returns cache paths only for this specific file ID code. If None, returns cache paths for all files.
- Returns:
DataFrame with columns: id, name, stream_cache_png, stream_cache_3d
- Return type:
pandas.DataFrame
- graphgroup()
Access the ‘graph’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- print_dataset_structure()
Print a comprehensive overview of the dataset structure.
- Return type:
None
- print_table_of_contents()
Enumerates the dataset contents and prints details such as array shapes. Works with any discovered groups and arrays.
- query_cross_group(primary_group, secondary_group, join_strategy='file_id_code')
Perform cross-group queries by joining data from different groups.
- Parameters:
- Returns:
Combined dataset with data from both groups
- Return type:
xarray.Dataset | None
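The default join_strategy of 'file_id_code' suggests that rows from the two groups are matched on their file code. A minimal sketch of such a join, written in plain Python over lists of dictionaries rather than xarray datasets, is shown below. The inner-join behaviour, the helper name join_on_file_code, and the field names are assumptions for illustration and are not taken from the library.

```python
from collections import defaultdict

def join_on_file_code(primary_rows, secondary_rows, key="file_id_code"):
    """Inner-join two lists of dict rows on a shared key (hypothetical helper)."""
    by_code = defaultdict(list)
    for row in secondary_rows:
        by_code[row[key]].append(row)
    joined = []
    for left in primary_rows:
        for right in by_code.get(left[key], []):
            # Primary fields win if both rows define the same field name.
            joined.append({**right, **left})
    return joined

faces = [{"file_id_code": 0, "face_area": 1.5}, {"file_id_code": 1, "face_area": 2.0}]
files = [{"file_id_code": 0, "label": 3}]
combined = join_on_file_code(faces, files)
```

Only file code 0 appears in both inputs, so the result holds a single combined row.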
- classmethod start_cluster(dask_client_params=None)
Starts the global Dask client if not already running. If dask_client_params is provided, uses those parameters; otherwise, starts a default LocalCluster. Returns the global Dask client.
- global_cluster = None
- global_dask_client = None
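The start_cluster description, together with the class attributes global_cluster and global_dask_client, reads like a lazily created, class-level singleton. The sketch below shows that pattern in isolation. ClusterOwner and fake_client_factory are hypothetical stand-ins, and the client_factory argument exists only so the example runs without Dask; the documented method takes dask_client_params alone.

```python
class ClusterOwner:
    """Hypothetical stand-in showing a lazily created, class-level client."""

    global_dask_client = None

    @classmethod
    def start_cluster(cls, client_factory, client_params=None):
        # Create the client only on the first call; later calls reuse it.
        if cls.global_dask_client is None:
            cls.global_dask_client = client_factory(**(client_params or {}))
        return cls.global_dask_client

created = []

def fake_client_factory(**params):
    """Stand-in for a Dask client constructor; records each creation."""
    created.append(params)
    return object()

first = ClusterOwner.start_cluster(fake_client_factory, {"n_workers": 2})
second = ClusterOwner.start_cluster(fake_client_factory)
```

The second call returns the client created by the first, and the factory is invoked only once.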