hoops_ai.dataset.dataset_explorer
Functions
- process_bins: Helper function used by create_distribution and create_distribution_incore to group file_id_codes into numeric bins.
Classes
- DatasetExplorer: Provides methods to explore queries/filters on a merged Zarr dataset and retrieve metadata from a Parquet file.
- class hoops_ai.dataset.dataset_explorer.DatasetExplorer(flow_output_file=None, merged_store_path=None, parquet_file_path=None, parquet_file_attribs=None, dask_client_params=None)
Bases: object
Provides methods to explore queries/filters on a merged Zarr dataset and retrieve metadata from a Parquet file. This class focuses on read/analysis logic.
- available_arrays(group_name)
Returns the set of available arrays in the specified group.
Parameters:
- group_name : str
Name of the group to inspect
Returns:
- set
Set of array names in the group
- build_membership_matrix(group, key, bins_or_categories, as_counts=False)
Builds a file-by-bin matrix (or file-by-category) for a given numeric or categorical variable. If as_counts=False, each cell is 1 if the file has at least one item in that bin/category, else 0. If as_counts=True, each cell is the count of items in that bin/category.
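The difference between the two as_counts modes can be illustrated with a small standalone sketch. It uses plain Python lists rather than the library's own data structures; the function name `membership_matrix`, the `(file_id_code, category)` input layout, and the sample categories are assumptions made for illustration, not part of hoops_ai.

```python
from collections import Counter

def membership_matrix(items, categories, as_counts=False):
    """Hypothetical helper. items is a list of (file_id_code, category) pairs,
    one pair per item (for example, one per face)."""
    wanted = set(categories)
    file_codes = sorted({code for code, _ in items})
    # Count how many items each file has in each category of interest.
    counts = Counter((code, cat) for code, cat in items if cat in wanted)
    matrix = []
    for code in file_codes:
        row = [counts[(code, cat)] for cat in categories]
        if not as_counts:
            # Binary mode: 1 if the file has at least one item in the category.
            row = [1 if n > 0 else 0 for n in row]
        matrix.append(row)
    return file_codes, matrix

# Made-up sample data: file 0 has two planes and one cylinder, and so on.
items = [(0, "plane"), (0, "plane"), (0, "cylinder"), (1, "cone"), (2, "plane")]
categories = ["plane", "cylinder", "cone"]

codes, binary = membership_matrix(items, categories)
_, counted = membership_matrix(items, categories, as_counts=True)
```

Rows correspond to files and columns to categories, so `binary[0]` and `counted[0]` describe the same file in the two modes.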
- close(close_dask=True)
Close all resources used by the DatasetExplorer, including ZipStores and Dask resources.
Parameters:
- close_dask : bool
Whether to also close Dask resources
- Return type:
None
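Because cleanup is exposed through a plain close() method, one way to make sure it always runs is the standard library's `contextlib.closing`. The sketch below uses a hypothetical stand-in class that only mimics the documented `close(close_dask=True)` signature; whether DatasetExplorer itself supports the `with` statement directly is not stated here.

```python
from contextlib import closing

class StandInExplorer:
    """Hypothetical stand-in; it only mimics the documented close() signature."""

    def __init__(self):
        self.closed = False
        self.dask_closed = False

    def close(self, close_dask=True):
        self.closed = True
        self.dask_closed = close_dask

# closing() calls close() with no arguments on exit, even if the body raises,
# so close_dask keeps its default value of True.
with closing(StandInExplorer()) as explorer:
    pass  # queries would run here
```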
- close_all_file_stores()
Closes all currently opened single-file ZipStores.
- Return type:
None
- close_dask_resources(close_global=False)
Close the Dask client and cluster that this class manages. If close_global is True, also close any globally active Dask client.
- Parameters:
close_global (bool)
- Return type:
None
- close_file_store(file_id)
Closes a previously opened single-file ZipStore.
- Parameters:
file_id (str)
- Return type:
None
- create_distribution(key, bins=10, group=None)
Uses Dask to compute histogram distribution of key in the specified group, then returns the bin edges, histogram counts, and the file_id_codes in each bin.
Parameters:
- key : str
Name of the array to analyze
- bins : int
Number of bins for the histogram
- group : str, optional
Group name to search in. If None, searches all available groups.
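The three results described above (bin edges, counts, and the file_id_codes falling in each bin) can be sketched in plain Python. This is not the library's Dask implementation: the function name `distribution`, the equal-width binning, and the convention that the last bin includes the maximum are assumptions chosen for illustration.

```python
def distribution(values, file_id_codes, bins=10):
    """Hypothetical sketch: equal-width histogram plus file codes per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins)] + [hi]
    counts = [0] * bins
    codes_per_bin = [set() for _ in range(bins)]
    for value, code in zip(values, file_id_codes):
        # Assumed convention: the last bin is closed on the right, so the
        # maximum value is counted in it. All-equal data goes to bin 0.
        idx = min(int((value - lo) / width), bins - 1) if width > 0 else 0
        counts[idx] += 1
        codes_per_bin[idx].add(code)
    return edges, counts, [sorted(s) for s in codes_per_bin]

# Made-up values, each tagged with the code of the file it came from.
values = [0.0, 1.0, 2.0, 3.0, 4.0, 8.0]
file_codes = [0, 0, 1, 1, 2, 2]
edges, counts, codes_in_bin = distribution(values, file_codes, bins=4)
```

Keeping the file codes per bin is what lets a histogram bar be traced back to the files that contributed to it.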
- create_distribution_incore(key, bins=10, group=None)
Non-Dask version: loads the data fully into memory, then computes the histogram and bin mapping.
Parameters:
- key : str
Name of the array to analyze
- bins : int
Number of bins for the histogram
- group : str, optional
Group name to search in. If None, searches all available groups.
- decode_file_id_code(code)
Converts an integer file_id_code back to the file ID (string) if we loaded them from Parquet or have them in memory.
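The idea of an integer code standing in for a string file ID can be shown with a pair of dictionaries. The file IDs below are made up, and returning None for an unknown code is an assumption of this sketch; the documentation does not say how the real method handles that case.

```python
# Made-up file IDs; in practice these would come from the Parquet metadata.
file_ids = ["bracket_001", "gear_017", "housing_203"]

# Integer code -> file ID, and the reverse lookup.
code_to_file_id = dict(enumerate(file_ids))
file_id_to_code = {fid: code for code, fid in code_to_file_id.items()}

def decode(code):
    """Hypothetical decoder: returns the file ID, or None if the code is unknown."""
    return code_to_file_id.get(code)
```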
- edgegroup()
Access the ‘edges’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- facefacegroup()
Access the ‘faceface’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- facegroup()
Access the ‘faces’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- file_dataset(file_id_code, group)
Returns a sub-Dataset filtered by the specified file_id_code within a group.
- filegroup()
Access the ‘file’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- filter_by_array_condition(group_name, array_name, condition)
Generic filtering method that works with any array in any group.
Parameters:
- group_name : str
Name of the group containing the array
- array_name : str
Name of the array to filter on
- condition : Callable
Function that takes a DataArray and returns a boolean mask
Returns:
- List[int]
List of file_id_codes that match the condition
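The condition-callable pattern can be sketched without the library: the callable receives the whole array and returns one boolean per element, and the file codes of the matching elements are collected. The sketch uses plain lists, and `filter_codes`, `face_areas`, and the threshold are hypothetical. With an xarray DataArray, a condition such as `lambda a: a > 10.0` would produce the mask directly.

```python
def filter_codes(values, file_id_codes, condition):
    """Hypothetical sketch: unique file codes whose elements satisfy condition."""
    mask = condition(values)  # one boolean per element
    return sorted({code for code, keep in zip(file_id_codes, mask) if keep})

# Made-up per-face areas, each tagged with the code of its file.
face_areas = [0.5, 12.0, 3.2, 40.0, 0.1]
codes = [0, 0, 1, 2, 2]

large = filter_codes(face_areas, codes, lambda arr: [a > 10.0 for a in arr])
```

A file is reported once even if several of its elements match, which mirrors the "list of file_id_codes" return described above.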
- find_subgraph(reference_graph, chunk_size=1000)
Example approach for subgraph-isomorphism, scanning each file code in ‘graphgroup’.
- Parameters:
reference_graph (networkx.Graph)
chunk_size (int)
- get_array_data(group_name, array_name)
Get data for a specific array within a group.
Parameters:
- group_name : str
Name of the group containing the array
- array_name : str
Name of the array to retrieve
Returns:
- Optional[xr.DataArray]
The data array, or None if not found
- get_array_statistics(group_name, array_name)
Compute basic statistics for any numeric array.
Parameters:
- group_name : str
Name of the group containing the array
- array_name : str
Name of the array to analyze
Returns:
- Dict[str, Any]
Dictionary containing basic statistics (min, max, mean, std, etc.)
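A dictionary of this shape can be sketched with the standard library. The key names follow the ones listed above, but the exact keys the real method returns, and whether its standard deviation is the population or the sample form, are not documented here; the sketch assumes the population form.

```python
import statistics

def basic_stats(values):
    """Hypothetical sketch of a basic-statistics dictionary."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        # Assumption: population standard deviation (divides by N, not N - 1).
        "std": statistics.pstdev(values),
    }

stats = basic_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```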
- get_descriptions(table_name, key_id=None, use_wildchar=False)
Retrieves a metadata table from Parquet, e.g. “label_descriptions”, “face_types”, or “edge_types”.
- get_file_info_all()
Loads the entire ‘file_info’ table from Parquet (once), returning it as a DataFrame.
- Return type:
pandas.DataFrame
- get_file_list(group, where)
Returns unique file_id_codes matching a provided boolean filter in the specified group.
- get_group_data(group_name)
Get the complete dataset for a specified group.
Parameters:
- group_name : str
Name of the group to retrieve
Returns:
- Optional[xr.Dataset]
The dataset for the group, or None if not found
- get_parquet_info_by_code(file_id_code)
Returns rows from the Parquet file matching a given file_id_code, or None if none found.
- Parameters:
file_id_code (int)
- get_stream_cache_paths(file_id_code=None)
Get stream cache paths (PNG images and 3D representations) for files.
Parameters:
- file_id_code : int, optional
If provided, returns cache paths only for this specific file ID code. If None, returns cache paths for all files.
Returns:
- pd.DataFrame
DataFrame with columns: id, name, stream_cache_png, stream_cache_3d
- global_cluster = None
- global_dask_client = None
- graphgroup()
Access the ‘graph’ group. Maintained for backward compatibility.
- Return type:
xarray.Dataset
- print_dataset_structure()
Print a comprehensive overview of the dataset structure.
- Return type:
None
- print_table_of_contents()
Generic method enumerating dataset contents, printing shapes, etc. Works with any discovered groups and arrays.
- query_cross_group(primary_group, secondary_group, join_strategy='file_id_code')
Perform cross-group queries by joining data from different groups.
Parameters:
- primary_group : str
Name of the primary group
- secondary_group : str
Name of the secondary group to join with
- join_strategy : str
Strategy for joining; ‘file_id_code’ is the only strategy for now
Returns:
- Optional[xr.Dataset]
Combined dataset with data from both groups
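Joining two groups on file_id_code can be pictured as a key-based join. The sketch below works on lists of dictionaries rather than xarray Datasets, and it assumes inner-join behavior (rows without a match in the other group are dropped); the documentation does not state which join semantics the real method uses. The function and field names other than file_id_code are hypothetical.

```python
def join_on_file_code(primary_rows, secondary_rows):
    """Hypothetical inner join of two row lists on the 'file_id_code' key."""
    # Index the secondary rows by file code for fast lookup.
    secondary_by_code = {}
    for row in secondary_rows:
        secondary_by_code.setdefault(row["file_id_code"], []).append(row)
    joined = []
    for p in primary_rows:
        for s in secondary_by_code.get(p["file_id_code"], []):
            joined.append({**p, **s})  # merge the fields of both rows
    return joined

# Made-up rows: per-face data and per-file data.
faces = [
    {"file_id_code": 0, "face_area": 1.5},
    {"file_id_code": 1, "face_area": 2.0},
    {"file_id_code": 3, "face_area": 0.7},  # no matching file row
]
files = [
    {"file_id_code": 0, "num_faces": 12},
    {"file_id_code": 1, "num_faces": 30},
]
joined = join_on_file_code(faces, files)
```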
- hoops_ai.dataset.dataset_explorer.process_bins(data_array, bin_edges, file_id_codes, code_to_file_id)
Helper function used by create_distribution and create_distribution_incore to group file_id_codes into numeric bins.