HOOPS AI: Use the Dataset Explorer to navigate the dataset

The dataset module provides tools for exploring, navigating, and loading CAD model datasets for machine learning applications. It consists of two primary components that work together to simplify data handling:

  1. DatasetExplorer - For exploring and querying dataset contents

  2. DatasetLoader - For loading and preparing datasets for machine learning training

These components are designed to work with the processed data from the cadaccess and cadencoder modules, as well as the outputs from the flow pipeline system. They provide high-level abstractions that allow users to focus on machine learning tasks rather than data handling complexities.

DatasetExplorer

The DatasetExplorer class (dataset_explorer.py) provides methods for exploring and querying datasets stored in Zarr format (.dataset) with accompanying metadata (.infoset) in Parquet files. This class focuses on data discovery, filtering, and statistical analysis.

Key Methods

Data Discovery and Metadata

  • available_groups() -> set: Returns the set of available dataset groups (for example faces, edges, graph, faceface)

  • get_descriptions(table_name: str, key_id: Optional[int] = None, use_wildchar: Optional[bool] = False) -> pd.DataFrame: Retrieves metadata descriptions (labels, face types, edge types, etc.)

  • get_parquet_info_by_code(file_id_code: int): Returns rows from the Parquet file for a specific file ID code

  • get_file_info_all() -> pd.DataFrame: Returns all file info from the Parquet metadata

Data Distribution Analysis

  • create_distribution(key: str, bins: int = 10, group: str = "faces") -> Dict[str, Any]: Computes histograms of data distributions using Dask for parallel processing
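
To make the shape of such a result concrete, here is a minimal pure-Python sketch of a file-aware histogram. It is not the library's implementation (which uses Dask). The function name `sketch_distribution` and the `hist` key are assumptions made for illustration, while `bin_edges` and `file_id_codes_in_bins` mirror the keys read by the plotting helper later in this tutorial. In the real result the per-bin entries appear to be arrays (the helper reads their `.size`); plain lists are used here.

```python
def sketch_distribution(values, file_id_codes, bins=10):
    """Histogram `values` into equal-width bins and track which files land in each bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    bin_edges = [lo + i * width for i in range(bins + 1)]

    hist = [0] * bins
    files_in_bins = [set() for _ in range(bins)]
    for value, code in zip(values, file_id_codes):
        # The last bin is closed on the right, as in numpy.histogram.
        index = min(int((value - lo) / width), bins - 1)
        hist[index] += 1
        files_in_bins[index].add(code)

    return {
        "bin_edges": bin_edges,
        "hist": hist,  # hypothetical key name, not confirmed by the tutorial
        "file_id_codes_in_bins": [sorted(codes) for codes in files_in_bins],
    }


# Toy data: one value per entity (for example a face), tagged with its file ID code.
values = [1, 2, 2, 9, 10, 5]
file_id_codes = [0, 0, 1, 1, 2, 2]
dist = sketch_distribution(values, file_id_codes, bins=3)
print(dist["hist"])                   # entity count per bin
print(dist["file_id_codes_in_bins"])  # distinct files per bin
```

Tracking file ID codes per bin, rather than only counts, is what lets a histogram answer "which files contain values in this range".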

Data Filtering and Selection

  • get_file_list(group: str, where: Callable[[xr.Dataset], xr.DataArray]) -> List[str]: Returns file IDs matching a boolean filter condition

  • file_dataset(file_id_code: int, group: str) -> xr.Dataset: Returns a subset of the dataset for a specific file

  • build_membership_matrix(group: str, key: str, bins_or_categories: Union[int, List, np.ndarray], as_counts: bool = False) -> tuple[np.ndarray, np.ndarray, np.ndarray]: Builds a file-by-bin membership matrix for stratified splitting

  • decode_file_id_code(code: int) -> str: Converts an integer file ID code to the original string identifier
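
The idea behind a file-by-bin membership matrix can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the behavior of `build_membership_matrix` itself: the function name `sketch_membership_matrix` is hypothetical, the real method returns NumPy arrays, and the meaning assigned here to the three returned values (matrix, file codes, categories) is a guess based on the signature.

```python
def sketch_membership_matrix(values, file_id_codes, categories, as_counts=False):
    """Build a file-by-category matrix: rows are files, columns are categories."""
    file_codes = sorted(set(file_id_codes))
    row_of = {code: row for row, code in enumerate(file_codes)}
    col_of = {category: col for col, category in enumerate(categories)}

    matrix = [[0] * len(categories) for _ in file_codes]
    for value, code in zip(values, file_id_codes):
        if value not in col_of:
            continue  # ignore values outside the requested categories
        row, col = row_of[code], col_of[value]
        # Either count occurrences or just flag presence with 1.
        matrix[row][col] = matrix[row][col] + 1 if as_counts else 1
    # ASSUMPTION: the order and meaning of the returned triple is illustrative only.
    return matrix, file_codes, list(categories)


# Toy data: one label per face, plus the file each face belongs to.
face_labels = [15, 15, 3, 3, 7, 15]
file_id_codes = [0, 0, 0, 1, 2, 2]

matrix, file_codes, categories = sketch_membership_matrix(
    face_labels, file_id_codes, categories=[3, 7, 15]
)
for code, row in zip(file_codes, matrix):
    print(code, row)
```

A stratified splitter can read each row as "which categories does this file contain" and assign files to train, validation, and test sets so that every category stays represented in each split.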

[1]:
import hoops_ai
import os
import sys

license_key = os.environ.get("HOOPS_AI_LICENSE")
if not license_key:
    sys.exit("HOOPS_AI_LICENSE environment variable is required.")

hoops_ai.set_license(license_key, validate=True)
------------------------------------------------------------
HOOPS AI
------------------------------------------------------------
  Platform      : Windows 11
  Architecture  : AMD64
  Python        : 3.9.21
------------------------------------------------------------
  Core          : hoops-ai             1.0.0  (build: 39b99a8 2026-03-23T19:25:21Z)
  CAD Access    : hoops-exchange       26.2.0  (build: 1e11169 2026-03-23T19:16:49Z)
  Conversion    : hoops-converter      26.1.0  (build: 39b99a8 2026-03-23T19:15:42Z)
  Insights      : hoops-web-viewer     26.1.0  (build: 25137b2 2026-03-23T19:20:34Z)
------------------------------------------------------------
======================================================================
[OK] HOOPS AI License: Valid
======================================================================
[2]:
from hoops_ai.dataset import DatasetExplorer
import pathlib

# Define paths
flow_name = "cadsynth10k"

flow_root_dir = pathlib.Path.cwd().parent.joinpath("packages", "flows", flow_name)
print(flow_root_dir)

# Flow outputs: Parquet metadata (.infoset), Zarr store (.dataset), and attribute set (.attribset)
parquet_file_path = str(flow_root_dir.joinpath(f"{flow_name}.infoset"))
merged_store_path = str(flow_root_dir.joinpath(f"{flow_name}.dataset"))
parquet_file_attribs = str(flow_root_dir.joinpath(f"{flow_name}.attribset"))

explorer = DatasetExplorer(
    merged_store_path=merged_store_path,
    parquet_file_path=parquet_file_path,
    parquet_file_attribs=parquet_file_attribs,
)
explorer.print_table_of_contents()
INFO:To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:State start
C:\Users\LuisSalazar.LY-LS-LEGION\Documents\repos\HOOPS-AI-tutorials\packages\flows\cadsynth10k
INFO:  Scheduler at:     tcp://127.0.0.1:58441
INFO:  dashboard at:  http://127.0.0.1:61426/status
INFO:Registering Worker plugin shuffle
INFO:        Start Nanny at: 'tcp://127.0.0.1:58444'
INFO:Register worker <WorkerState 'tcp://127.0.0.1:60434', name: 0, status: init, memory: 0, processing: 0>
INFO:Starting worker compute stream, tcp://127.0.0.1:60434
INFO:Starting established connection to tcp://127.0.0.1:60436
INFO:Receive client connection: Client-5b221936-2854-11f1-9ee0-f4289de57fc2
INFO:Starting established connection to tcp://127.0.0.1:52117
[DatasetExplorer] Default local cluster started: <Client: 'tcp://127.0.0.1:58441' processes=1 threads=16, memory=7.45 GiB>

--- Dataset Table of Contents ---

LABELS_GROUP:
  FACE_LABELS_DATA: Shape: (275567,), Dims: ('faces',), Size: 275567
  FILE_ID_CODE_LABELS_DATA: Shape: (275567,), Dims: ('faces',), Size: 275567

EDGES_GROUP:
  EDGE_CONVEXITIES_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
  EDGE_DIHEDRAL_ANGLES_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
  EDGE_INDICES_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
  EDGE_LENGTHS_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
  EDGE_TYPES_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
  EDGE_U_GRIDS_DATA: Shape: (719473, 5, 6), Dims: ('edge', 'u', 'component'), Size: 21584190
  FILE_ID_CODE_EDGES_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473

FACEFACE_GROUP:
  A3_DISTANCE_DATA: Shape: (8418913, 64), Dims: ('facepair', 'bin'), Size: 538810432
  D2_DISTANCE_DATA: Shape: (8418913, 64), Dims: ('facepair', 'bin'), Size: 538810432
  EXTENDED_ADJACENCY_DATA: Shape: (8418913,), Dims: ('facepair',), Size: 8418913
  FACE_PAIR_EDGES_PATH_DATA: Shape: (8418913, 32), Dims: ('facepair', 'dim_path'), Size: 269405216
  FILE_ID_CODE_FACEFACE_DATA: Shape: (8418913,), Dims: ('facepair',), Size: 8418913

FACES_GROUP:
  FACE_AREAS_DATA: Shape: (275567,), Dims: ('face',), Size: 275567
  FACE_CENTROIDS_DATA: Shape: (275567, 3), Dims: ('face', 'dim'), Size: 826701
  FACE_DISCRETIZATION_DATA: Shape: (275567, 25, 7), Dims: ('face', 'sample', 'component'), Size: 48224225
  FACE_INDICES_DATA: Shape: (275567,), Dims: ('face',), Size: 275567
  FACE_LOOPS_DATA: Shape: (275567,), Dims: ('face',), Size: 275567
  FACE_NEIGHBORSCOUNT_DATA: Shape: (275567,), Dims: ('face',), Size: 275567
  FACE_TYPES_DATA: Shape: (275567,), Dims: ('face',), Size: 275567
  FILE_ID_CODE_FACES_DATA: Shape: (275567,), Dims: ('face',), Size: 275567

GRAPH_GROUP:
  EDGES_DESTINATION_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
  EDGES_SOURCE_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
  FILE_ID_CODE_GRAPH_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
  NUM_NODES_DATA: Shape: (719473,), Dims: ('edge',), Size: 719473
==================================
Columns in file_info:
                                    name    id                             description subset table_name
0     000080b997b0b9c0d859ae5b74addcf4_0     0  ...data\Cadsynth_aag\step\00034194.stp    N/A  file_info
1     0003f988487ad0a1f1d50aed28f05dcf_0     1  ...data\Cadsynth_aag\step\00053007.stp    N/A  file_info
2     001a37b15ea4bb7065f1166e643ffa68_0     2  ...data\Cadsynth_aag\step\00087640.stp    N/A  file_info
3     001c8424af4a9f35cc2ab2444eeedf56_0     3  ...data\Cadsynth_aag\step\00035105.stp    N/A  file_info
4     0026e60735c03ea48773ebc71ebe7367_0     4  ...data\Cadsynth_aag\step\00000274.stp    N/A  file_info
5     002f896e5c3e6b6c8737cb3f5d0e04ba_0     5  ...data\Cadsynth_aag\step\00076740.stp    N/A  file_info
6     004ec648abea38ca4f44301fc54a87d0_0     6  ...data\Cadsynth_aag\step\00081606.stp    N/A  file_info
7     0054c9bf69eb357ca4459f8ee4afd27a_0     7  ...data\Cadsynth_aag\step\00092144.stp    N/A  file_info
8     005aa928503056bb5e16a988d549c050_0     8  ...data\Cadsynth_aag\step\00042088.stp    N/A  file_info
9     00618fd168908e96f8fd13fc0854b08d_0     9  ...data\Cadsynth_aag\step\00064074.stp    N/A  file_info
...                                  ...   ...                                     ...    ...        ...
9982  ff7dce83dbebb5b174a21089bbd80bd8_0  9982  ...data\Cadsynth_aag\step\00064217.stp    N/A  file_info
9983  ff977a7101f72db521365bfe91c1c473_0  9983  ...aag\step\20221121_154648_19729.step    N/A  file_info
9984  ffa764d164969163013fb276f64c1c51_0  9984  ...aag\step\20221124_154714_12931.step    N/A  file_info
9985  ffb2f1fcb6c66c59c303aa5b3ea56807_0  9985  ...aag\step\20221124_154714_15489.step    N/A  file_info
9986  ffc9c0cc5a87664ee3908adfdc50dbc5_0  9986  ...data\Cadsynth_aag\step\00021727.stp    N/A  file_info
9987  ffd53e1d4911d3d6b51c4f9b8f17f4e5_0  9987  ...data\Cadsynth_aag\step\00084454.stp    N/A  file_info
9988  ffdd225c98a1627f37c5f1ae6730d816_0  9988  ...data\Cadsynth_aag\step\00052773.stp    N/A  file_info
9989  ffdd519826ffe790c3fe27fff924ba72_0  9989  ...data\Cadsynth_aag\step\00050745.stp    N/A  file_info
9990  ffe278e0d87153c84363f6769d93659c_0  9990  ...data\Cadsynth_aag\step\00043168.stp    N/A  file_info
9991  fffecfe9bda0ca767f8b4379af10578f_0  9991  ..._aag\step\20221124_154714_5326.step    N/A  file_info
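
Because the explorer depends on three separate flow outputs, it can help to check that they all exist before constructing it. The helper below is a small sketch using only `pathlib`; the name `resolve_flow_artifacts` is hypothetical, and the file suffixes and keyword names are taken from the cell above.

```python
import pathlib


def resolve_flow_artifacts(flow_root_dir, flow_name):
    """Return the three artifact paths for a flow and raise if any is missing."""
    root = pathlib.Path(flow_root_dir)
    paths = {
        "merged_store_path": root / f"{flow_name}.dataset",
        "parquet_file_path": root / f"{flow_name}.infoset",
        "parquet_file_attribs": root / f"{flow_name}.attribset",
    }
    missing = [str(path) for path in paths.values() if not path.exists()]
    if missing:
        raise FileNotFoundError("Missing flow artifacts: " + ", ".join(missing))
    # Return strings, matching how the paths are passed in the cell above.
    return {name: str(path) for name, path in paths.items()}
```

Since the dictionary keys match the constructor's keyword arguments, the result could be passed as `DatasetExplorer(**resolve_flow_artifacts(flow_root_dir, flow_name))`.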
[3]:
groups = explorer.available_groups()
print(groups)
{'edges', 'graph', 'faces', 'faceface', 'Labels'}
[4]:
face_type_description = explorer.get_descriptions("face_types")
print(type(face_type_description), face_type_description)
<class 'pandas.core.frame.DataFrame'>        id        name description  table_name
9993    0       Plane     not set  face_types
9994    1    Cylinder     not set  face_types
9995    3      Sphere     not set  face_types
9996    2        Cone     not set  face_types
9997    4       Torus     not set  face_types
9998   14   Extrusion     not set  face_types
9999    5       Nurbs     not set  face_types
10000  13  Revolution     not set  face_types
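
The table above maps integer codes to surface type names, which is what makes the raw `face_types` arrays readable. As a minimal sketch, the mapping can be held in a dictionary and applied to a list of codes. The ID-to-name pairs are copied from the output above; the helper name `decode_face_types` and the sample codes are made up for illustration.

```python
# ID-to-name pairs copied from the face_types table printed above.
FACE_TYPE_NAMES = {
    0: "Plane", 1: "Cylinder", 2: "Cone", 3: "Sphere", 4: "Torus",
    5: "Nurbs", 13: "Revolution", 14: "Extrusion",
}


def decode_face_types(codes):
    """Translate integer face-type codes into names, flagging unknown codes."""
    return [FACE_TYPE_NAMES.get(int(code), f"unknown({code})") for code in codes]


# Made-up codes standing in for one file's face_types array.
sample_codes = [0, 0, 1, 14, 99]
print(decode_face_types(sample_codes))
```

With the real DataFrame, the same dictionary could be built from its `id` and `name` columns, for example `dict(zip(face_type_description["id"], face_type_description["name"]))`.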
[5]:
# Look up the metadata row for one file by its integer file ID code
file_id = 25
df_info = explorer.get_parquet_info_by_code(file_id)
#print(type(df_info), df_info)
[6]:
# Load each dataset group, restricted to this file
file_datasets = {grp: explorer.file_dataset(file_id_code=file_id, group=grp) for grp in groups}

print(f"Datasets (Table of Contents) for file ID '{file_id}':")
for grp, ds in file_datasets.items():
    for name, da in ds.data_vars.items():
        print(f"  [{grp}] DATA: {name}, Shape: {da.shape}, Dims: {da.dims}, Size: {da.size}")
print()

file_dataset = file_datasets["faces"]
print("type of file_data_arrays", type(file_dataset))

# Materialize the lazily loaded face areas as a NumPy array (one area per face)
array_areas = file_dataset["face_areas"].data.compute()
print("type of array_areas", type(array_areas))
print("brep surfaces", array_areas.shape)
Datasets (Table of Contents) for file ID '25':
  [edges] DATA: edge_convexities, Shape: (33,), Dims: ('edge',), Size: 33
  [edges] DATA: edge_dihedral_angles, Shape: (33,), Dims: ('edge',), Size: 33
  [edges] DATA: edge_indices, Shape: (33,), Dims: ('edge',), Size: 33
  [edges] DATA: edge_lengths, Shape: (33,), Dims: ('edge',), Size: 33
  [edges] DATA: edge_types, Shape: (33,), Dims: ('edge',), Size: 33
  [edges] DATA: edge_u_grids, Shape: (33, 5, 6), Dims: ('edge', 'u', 'component'), Size: 990
  [edges] DATA: file_id_code_edges, Shape: (33,), Dims: ('edge',), Size: 33
  [graph] DATA: edges_destination, Shape: (33,), Dims: ('edge',), Size: 33
  [graph] DATA: edges_source, Shape: (33,), Dims: ('edge',), Size: 33
  [graph] DATA: file_id_code_graph, Shape: (33,), Dims: ('edge',), Size: 33
  [graph] DATA: num_nodes, Shape: (33,), Dims: ('edge',), Size: 33
  [faces] DATA: face_areas, Shape: (14,), Dims: ('face',), Size: 14
  [faces] DATA: face_centroids, Shape: (14, 3), Dims: ('face', 'dim'), Size: 42
  [faces] DATA: face_discretization, Shape: (14, 25, 7), Dims: ('face', 'sample', 'component'), Size: 2450
  [faces] DATA: face_indices, Shape: (14,), Dims: ('face',), Size: 14
  [faces] DATA: face_loops, Shape: (14,), Dims: ('face',), Size: 14
  [faces] DATA: face_neighborscount, Shape: (14,), Dims: ('face',), Size: 14
  [faces] DATA: face_types, Shape: (14,), Dims: ('face',), Size: 14
  [faces] DATA: file_id_code_faces, Shape: (14,), Dims: ('face',), Size: 14
  [faceface] DATA: a3_distance, Shape: (196, 64), Dims: ('facepair', 'bin'), Size: 12544
  [faceface] DATA: d2_distance, Shape: (196, 64), Dims: ('facepair', 'bin'), Size: 12544
  [faceface] DATA: extended_adjacency, Shape: (196,), Dims: ('facepair',), Size: 196
  [faceface] DATA: face_pair_edges_path, Shape: (196, 32), Dims: ('facepair', 'dim_path'), Size: 6272
  [faceface] DATA: file_id_code_faceface, Shape: (196,), Dims: ('facepair',), Size: 196
  [Labels] DATA: face_labels, Shape: (14,), Dims: ('faces',), Size: 14
  [Labels] DATA: file_id_code_Labels, Shape: (14,), Dims: ('faces',), Size: 14

type of file_data_arrays <class 'xarray.core.dataset.Dataset'>
type of array_areas <class 'numpy.ndarray'>
brep surfaces (14,)
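
The shapes above show 14 faces and 196 face pairs for this file, and 196 = 14 × 14, which suggests that the faceface group stores one row per ordered pair of faces. The sketch below converts between a flat pair index and a (face, face) pair. The row-major ordering is an assumption that the tutorial does not confirm, and both function names are hypothetical.

```python
# ASSUMPTION: face pairs are laid out in row-major order (i * num_faces + j).
# The tutorial output only confirms the count, not the ordering.

def pair_to_faces(pair_index, num_faces):
    """Map a flat face-pair index to (face_i, face_j) under the row-major assumption."""
    return divmod(pair_index, num_faces)


def faces_to_pair(face_i, face_j, num_faces):
    """Inverse of pair_to_faces under the same assumption."""
    return face_i * num_faces + face_j


num_faces = 14                 # faces in file 25, from the output above
num_pairs = num_faces ** 2
print(num_pairs)               # 196, matching the faceface shapes above
print(pair_to_faces(30, num_faces))
```

If the ordering assumption holds, a per-pair array of shape (196,) could be viewed as a 14 × 14 matrix; it is worth verifying this against known adjacent faces before relying on it.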
[7]:
# This example assumes some familiarity with pandas and Dask.
# compute() loads the Dask-backed variable into memory as a NumPy array.
discretization_data = file_dataset["face_discretization"].data.compute()

print("numpy array shape", discretization_data.shape)
numpy array shape (14, 25, 7)
[8]:
#print(discretization_data)
[9]:
explorer.get_file_info_all()
[9]:
                                    name    id                                        description subset table_name
0     000080b997b0b9c0d859ae5b74addcf4_0     0  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
1     0003f988487ad0a1f1d50aed28f05dcf_0     1  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
2     001a37b15ea4bb7065f1166e643ffa68_0     2  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
3     001c8424af4a9f35cc2ab2444eeedf56_0     3  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
4     0026e60735c03ea48773ebc71ebe7367_0     4  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
...                                  ...   ...                                                ...    ...        ...
9987  ffd53e1d4911d3d6b51c4f9b8f17f4e5_0  9987  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
9988  ffdd225c98a1627f37c5f1ae6730d816_0  9988  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
9989  ffdd519826ffe790c3fe27fff924ba72_0  9989  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
9990  ffe278e0d87153c84363f6769d93659c_0  9990  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info
9991  fffecfe9bda0ca767f8b4379af10578f_0  9991  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\da...    N/A  file_info

9992 rows × 5 columns

[10]:
# Visualization libraries
import matplotlib.pyplot as plt

def print_distribution_info(dist, title="Distribution"):
    """Plot a histogram of the number of files that fall into each bin."""
    # Count the files in each bin
    dist['file_count'] = [bin_files.size for bin_files in dist['file_id_codes_in_bins']]

    # Visualization with matplotlib
    fig, ax = plt.subplots(figsize=(12, 4))

    bin_centers = 0.5 * (dist['bin_edges'][1:] + dist['bin_edges'][:-1])
    ax.bar(bin_centers, dist['file_count'], width=(dist['bin_edges'][1] - dist['bin_edges'][0]),
           alpha=0.7, color='steelblue', edgecolor='black', linewidth=1)

    # Add file count annotations
    for i, count in enumerate(dist['file_count']):
        if count > 0:  # Only annotate non-empty bins
            ax.text(bin_centers[i], count + 0.5, f"{count}",
                    ha='center', va='bottom', fontsize=8)

    ax.set_xlabel('Value')
    ax.set_ylabel('Count')
    ax.set_title(f'{title} Histogram')
    ax.grid(True, linestyle='--', alpha=0.7)

    plt.tight_layout()
    plt.show()
[11]:
import time
start_time = time.time()
face_dist = explorer.create_distribution(key="face_labels", bins=None, group="faces")
print(f"Face-label distribution created in {(time.time() - start_time):.2f} seconds\n")
print_distribution_info(face_dist, title="Labels")

Variable 'face_labels' found in fallback group 'Labels'.
Face-label distribution created in 0.23 seconds

[Figure: "Labels Histogram", bar chart of file count per bin; x-axis: Value, y-axis: Count]
[12]:
start_time = time.time()
dist = explorer.create_distribution(key="num_nodes", bins=12, group="graph")
print(f"BRep face-count distribution created in {(time.time() - start_time):.2f} seconds\n")
print_distribution_info(dist, title="BRep face-count distribution")
BRep face-count distribution created in 0.16 seconds

[Figure: "BRep face-count distribution Histogram", bar chart of file count per bin; x-axis: Value, y-axis: Count]

Gather files that satisfy a given condition

[13]:
start_time = time.time()

# Filter condition: True for every face whose label equals 15
def has_label_15(ds):
    return ds['face_labels'] == 15

filelist = explorer.get_file_list(group="Labels", where=has_label_15)
print(f"Filtering completed in {(time.time() - start_time):.2f} seconds")
print(f"Found {len(filelist)} files with face_labels == 15 (chamfer)\n")
print(filelist)
Filtering completed in 0.05 seconds
Found 2088 files with face_labels == 15 (chamfer)

[   0    2    3 ... 9984 9986 9987]
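
Conceptually, the `where` callable produces a boolean mask over the per-face rows, and the file ID codes of the matching rows are collapsed into a unique list. The pure-Python sketch below illustrates that reduction on toy data. It is not the library's implementation: the name `sketch_get_file_list` is hypothetical, the predicate here takes a single label rather than an `xr.Dataset`, and treating a file as a match when at least one of its faces satisfies the condition is an assumption.

```python
def sketch_get_file_list(face_labels, file_id_codes, where):
    """Return sorted file ID codes that own at least one row satisfying `where`."""
    # ASSUMPTION: one matching face is enough for a file to be selected.
    matching = {code for label, code in zip(face_labels, file_id_codes) if where(label)}
    return sorted(matching)


# Toy per-face data: a label for each face and the file each face belongs to.
face_labels = [15, 2, 2, 7, 15, 15]
file_id_codes = [0, 0, 1, 1, 2, 2]

selected = sketch_get_file_list(face_labels, file_id_codes, where=lambda label: label == 15)
print(selected)
```

Each integer in such a list is a file ID code; `decode_file_id_code` converts a code back to the original string identifier.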

Query data for single file

[14]:
def demo_query_single_file(explorer, file_id):
    """Show how to access and query dataset details for a single file."""
    print("=== Single File Dataset Access ===")
    import time
    # Get and print parquet info
    df_info = explorer.get_parquet_info_by_code(file_id)
    print("File info:")
    for column in df_info.columns:
        print(f"Column: {column}")
        for value in df_info[column]:
            print(f"  {value}")
    print()

    # Access every dataset group for this file; querying the explorer avoids relying on a global
    datasets = {grp: explorer.file_dataset(file_id_code=file_id, group=grp) for grp in explorer.available_groups()}

    print(f"Datasets for file ID '{file_id}':")
    for grp, ds in datasets.items():
        for name, da in ds.data_vars.items():
            print(f"  [{grp}] VARIABLE: {name}, Shape: {da.shape}, Dims: {da.dims}, Size: {da.size}")
    print()

    # Query face discretization data for a specific face
    start_time = time.time()
    face_discretization_data = datasets["faces"]["face_discretization"].isel(face=2)
    print("Face discretization data for face index 2:")
    # compute() materializes the lazy selection as a NumPy array; only the timing is reported
    np_face_discretization = face_discretization_data.data.compute()
    print(f"Query took {(time.time() - start_time):.2f} seconds\n")
[15]:
demo_query_single_file(explorer, file_id=4500)
=== Single File Dataset Access ===
File info:
Column: name
  7369dda24b8737152ef738380c3f76c9_0
Column: id
  4500
Column: description
  C:\Users\LuisSalazar.LY-LS-LEGION\Documents\data\Cadsynth_aag\step\00027023.stp
Column: subset
  N/A
Column: table_name
  file_info

Datasets for file ID '4500':
  [edges] VARIABLE: edge_convexities, Shape: (92,), Dims: ('edge',), Size: 92
  [edges] VARIABLE: edge_dihedral_angles, Shape: (92,), Dims: ('edge',), Size: 92
  [edges] VARIABLE: edge_indices, Shape: (92,), Dims: ('edge',), Size: 92
  [edges] VARIABLE: edge_lengths, Shape: (92,), Dims: ('edge',), Size: 92
  [edges] VARIABLE: edge_types, Shape: (92,), Dims: ('edge',), Size: 92
  [edges] VARIABLE: edge_u_grids, Shape: (92, 5, 6), Dims: ('edge', 'u', 'component'), Size: 2760
  [edges] VARIABLE: file_id_code_edges, Shape: (92,), Dims: ('edge',), Size: 92
  [graph] VARIABLE: edges_destination, Shape: (92,), Dims: ('edge',), Size: 92
  [graph] VARIABLE: edges_source, Shape: (92,), Dims: ('edge',), Size: 92
  [graph] VARIABLE: file_id_code_graph, Shape: (92,), Dims: ('edge',), Size: 92
  [graph] VARIABLE: num_nodes, Shape: (92,), Dims: ('edge',), Size: 92
  [faces] VARIABLE: face_areas, Shape: (34,), Dims: ('face',), Size: 34
  [faces] VARIABLE: face_centroids, Shape: (34, 3), Dims: ('face', 'dim'), Size: 102
  [faces] VARIABLE: face_discretization, Shape: (34, 25, 7), Dims: ('face', 'sample', 'component'), Size: 5950
  [faces] VARIABLE: face_indices, Shape: (34,), Dims: ('face',), Size: 34
  [faces] VARIABLE: face_loops, Shape: (34,), Dims: ('face',), Size: 34
  [faces] VARIABLE: face_neighborscount, Shape: (34,), Dims: ('face',), Size: 34
  [faces] VARIABLE: face_types, Shape: (34,), Dims: ('face',), Size: 34
  [faces] VARIABLE: file_id_code_faces, Shape: (34,), Dims: ('face',), Size: 34
  [faceface] VARIABLE: a3_distance, Shape: (1156, 64), Dims: ('facepair', 'bin'), Size: 73984
  [faceface] VARIABLE: d2_distance, Shape: (1156, 64), Dims: ('facepair', 'bin'), Size: 73984
  [faceface] VARIABLE: extended_adjacency, Shape: (1156,), Dims: ('facepair',), Size: 1156
  [faceface] VARIABLE: face_pair_edges_path, Shape: (1156, 32), Dims: ('facepair', 'dim_path'), Size: 36992
  [faceface] VARIABLE: file_id_code_faceface, Shape: (1156,), Dims: ('facepair',), Size: 1156
  [Labels] VARIABLE: face_labels, Shape: (34,), Dims: ('faces',), Size: 34
  [Labels] VARIABLE: file_id_code_Labels, Shape: (34,), Dims: ('faces',), Size: 34

Face discretization data for face index 2:
Query took 0.06 seconds
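
The cells in this tutorial repeat the same start-time and print pattern around each query. As an optional convenience, that pattern can be wrapped in a small context manager. This is a sketch using only the standard library; the name `timed` is hypothetical and not part of HOOPS AI. It uses `time.perf_counter`, which is better suited to measuring short intervals than `time.time`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, using a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed queries are still timed.
        print(f"{label} took {time.perf_counter() - start:.2f} seconds")


# Stand-in workload; in the tutorial this would be an explorer query.
with timed("Summation"):
    total = sum(range(1_000_000))
```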
