Use DatasetExplorer to navigate the dataset

The dataset module provides objects for exploring, navigating, and loading CAD-encoded datasets for machine learning applications. It consists of two primary components that work together to simplify data handling:

  1. DatasetExplorer - For exploring and querying dataset contents

  2. DatasetLoader - For loading and preparing datasets for machine learning training

These components are designed to work with processed data from the cadaccess and cadencoder modules, as well as the outputs from the flow pipeline system. They provide high-level abstractions that let users focus on machine learning tasks rather than data-handling complexity.

DatasetExplorer

The DatasetExplorer class (dataset_explorer.py) provides methods for exploring and querying datasets stored in a compressed and optimized format (.dataset) with accompanying metadata (.infoset) files. This class focuses on data discovery, filtering, and statistical analysis.

Key Methods

Data Discovery and Metadata

  • available_groups() -> set: Returns the set of available dataset groups (faces, edges, file, etc.)

  • get_descriptions(table_name: str, key_id: Optional[int] = None, use_wildchar: Optional[bool] = False) -> pd.DataFrame: Retrieves metadata descriptions (labels, face types, edge types, etc.)

  • get_parquet_info_by_code(file_id_code: int): Returns rows from the Parquet file for a specific file ID code

  • get_file_info_all() -> pd.DataFrame: Returns all file info from the Parquet metadata

Data Distribution Analysis

  • create_distribution(key: str, bins: int = 10, group: str = "faces") -> Dict[str, Any]: Computes histograms of data distributions using Dask for parallel processing
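To make the idea of a distribution that also tracks its source files concrete, here is a minimal NumPy sketch. It is not the DatasetExplorer implementation (which, per the description above, relies on Dask); the helper name `sketch_distribution`, the toy arrays, and every returned key other than `file_id_codes_in_bins` are assumptions made for illustration.

```python
import numpy as np


def sketch_distribution(values, file_id_codes, bins=10):
    """Histogram `values` and collect the file ID codes found in each bin.

    Hypothetical helper for illustration only; not part of hoops_ai.
    """
    values = np.asarray(values)
    file_id_codes = np.asarray(file_id_codes)

    hist, bin_edges = np.histogram(values, bins=bins)

    # Digitizing against the interior edges yields a bin index in
    # [0, bins - 1] and puts the maximum value in the last bin,
    # which matches how np.histogram treats its closed final bin.
    bin_index = np.digitize(values, bin_edges[1:-1])

    file_id_codes_in_bins = [
        np.unique(file_id_codes[bin_index == b]) for b in range(len(hist))
    ]
    return {
        "hist": hist,
        "bin_edges": bin_edges,
        "file_id_codes_in_bins": file_id_codes_in_bins,
    }


# Toy data: one value per row, each row tagged with the file it came from.
values = np.array([1.0, 2.0, 2.5, 9.0, 10.0])
file_id_codes = np.array([0, 0, 1, 2, 2])
dist = sketch_distribution(values, file_id_codes, bins=3)
print(dist["hist"])
print(dist["file_id_codes_in_bins"])
```

Keeping the per-bin file codes next to the counts is what makes it possible to go from "this bin looks interesting" straight to the list of files to inspect.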

Data Filtering and Selection

  • get_file_list(group: str, where: Callable[[xr.Dataset], xr.DataArray]) -> List[str]: Returns file IDs matching a boolean filter condition

  • file_dataset(file_id_code: int, group: str) -> xr.Dataset: Returns a subset of the dataset for a specific file

  • build_membership_matrix(group: str, key: str, bins_or_categories: Union[int, List, np.ndarray], as_counts: bool = False) -> tuple[np.ndarray, np.ndarray, np.ndarray]: Builds a file-by-bin membership matrix for stratified splitting

  • decode_file_id_code(code: int) -> str: Converts an integer file ID code to the original string identifier
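The file-by-bin membership matrix mentioned above can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the library's code: the helper `sketch_membership_matrix`, the toy data, and the meaning and order of the three returned arrays (matrix, file codes, bin edges) are guesses, since the source only gives the return type.

```python
import numpy as np


def sketch_membership_matrix(values, file_id_codes, bins, as_counts=False):
    """Build an (n_files, n_bins) matrix marking which bins each file occupies.

    Hypothetical helper for illustration only; not part of hoops_ai.
    """
    values = np.asarray(values)
    file_id_codes = np.asarray(file_id_codes)

    bin_edges = np.histogram_bin_edges(values, bins=bins)
    n_bins = len(bin_edges) - 1
    bin_index = np.digitize(values, bin_edges[1:-1])

    # Map each row's file ID code to a dense row position in the matrix.
    unique_files, file_row = np.unique(file_id_codes, return_inverse=True)

    matrix = np.zeros((len(unique_files), n_bins), dtype=np.int64)
    # np.add.at accumulates correctly when (row, column) pairs repeat.
    np.add.at(matrix, (file_row, bin_index), 1)

    if not as_counts:
        # Collapse counts to a 0/1 membership indicator.
        matrix = (matrix > 0).astype(np.int64)
    return matrix, unique_files, bin_edges


values = np.array([1.0, 2.0, 2.5, 9.0, 10.0])
file_id_codes = np.array([7, 7, 8, 9, 9])
counts, files, edges = sketch_membership_matrix(
    values, file_id_codes, bins=3, as_counts=True
)
print(files)
print(counts)
```

A matrix of this shape lets a splitter check that each bin is represented in every subset, which is the point of stratified splitting.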

[ ]:
from scripts.helper_tutorials import load_environement_variables

load_environement_variables()
True
[25]:
import hoops_ai
import os
import sys

license_key = os.environ.get("HOOPS_AI_LICENSE")
if not license_key:
    sys.exit("HOOPS_AI_LICENSE environment variable is required.")

hoops_ai.set_license(license_key, validate=True)
[OK] HOOPS AI License: Already validated in this session. Skipping re-validation.
[26]:
from hoops_ai.dataset import DatasetExplorer
import pathlib
# Define paths
flow_name = "ETL_Fabwave_training_b2"

flow_root_dir = pathlib.Path.cwd().parent.joinpath("packages", "flows", flow_name)
print(flow_root_dir)

parquet_file_path        = str(flow_root_dir.joinpath(f"{flow_name}.infoset"))
merged_store_path     = str(flow_root_dir.joinpath(f"{flow_name}.dataset"))
parquet_file_attribs  = str(flow_root_dir.joinpath(f"{flow_name}.attribset"))


explorer = DatasetExplorer(merged_store_path=merged_store_path, parquet_file_path=parquet_file_path, parquet_file_attribs=parquet_file_attribs)
explorer.print_table_of_contents()
INFO:State start
INFO:  Scheduler at:     tcp://127.0.0.1:35035
INFO:  dashboard at:  http://127.0.0.1:38683/status
INFO:Registering Worker plugin shuffle
INFO:        Start Nanny at: 'tcp://127.0.0.1:34499'
/home/maxime.marechal/Projects/HAI-Tutorials/packages/flows/ETL_Fabwave_training_b2
INFO:Register worker <WorkerState 'tcp://127.0.0.1:38669', name: 0, status: init, memory: 0, processing: 0>
INFO:Starting worker compute stream, tcp://127.0.0.1:38669
INFO:Starting established connection to tcp://127.0.0.1:60650
INFO:Receive client connection: Client-eae95386-544a-11f1-aec6-e89744867d30
INFO:Starting established connection to tcp://127.0.0.1:60662
[DatasetExplorer] Default local cluster started: <Client: 'tcp://127.0.0.1:35035' processes=1 threads=16, memory=7.45 GiB>

--- Dataset Table of Contents ---

LABELS_GROUP:
  FILE_ID_CODE_LABELS_DATA: Shape: (4546,), Dims: ('part',), Size: 4546
  PART_LABEL_DATA: Shape: (4546,), Dims: ('part',), Size: 4546

EDGES_GROUP:
  EDGE_CONVEXITIES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_DIHEDRAL_ANGLES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_INDICES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_LENGTHS_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_TYPES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_U_GRIDS_DATA: Shape: (337065, 10, 6), Dims: ('edge', 'u', 'component'), Size: 20223900
  FILE_ID_CODE_EDGES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065

FACE_MESH_GROUP:
  FACE_MESH_ADJ_DATA: Shape: (130923, 10000), Dims: ('face', 'adjacency_flat'), Size: 1309230000
  FILE_ID_CODE_FACE_MESH_DATA: Shape: (130923,), Dims: ('face',), Size: 130923

FACEFACE_GROUP:
  FACE_PAIR_EDGES_PATH_DATA: Shape: (30829551, 16), Dims: ('facepair', 'dim_path'), Size: 493272816
  FILE_ID_CODE_FACEFACE_DATA: Shape: (30829551,), Dims: ('facepair',), Size: 30829551

FACES_GROUP:
  FACE_AREAS_DATA: Shape: (130923,), Dims: ('face',), Size: 130923
  FACE_CENTROIDS_DATA: Shape: (130923, 3), Dims: ('face', 'dim'), Size: 392769
  FACE_DISCRETIZATION_DATA: Shape: (130923, 100, 7), Dims: ('face', 'sample', 'component'), Size: 91646100
  FACE_INDICES_DATA: Shape: (130923,), Dims: ('face',), Size: 130923
  FACE_LOOPS_DATA: Shape: (130923,), Dims: ('face',), Size: 130923
  FACE_TYPES_DATA: Shape: (130923,), Dims: ('face',), Size: 130923
  FILE_ID_CODE_FACES_DATA: Shape: (130923,), Dims: ('face',), Size: 130923

GRAPH_GROUP:
  EDGES_DESTINATION_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGES_SOURCE_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  FILE_ID_CODE_GRAPH_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  NUM_NODES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
==================================
Columns in file_info:
                                    name    id                             description subset table_name
0     00167e547565435944db7f387a002921_0     0  ...f42-23a1-4b7d-a6c3-af5b3b760d51.stp    N/A  file_info
1     0019acb61ba91dd1ebb65c343d86f128_0     1  ...53a-2097-4e45-8d11-06df8e19ddc7.stp    N/A  file_info
2     001eeef8382dbcfc00e1bfce137e369f_0     2  ...bf1-29f5-4c33-aa04-134f5c5e43c9.stp    N/A  file_info
3     002c05eda3f3c4037ae686db679418df_0     3  ...f6e-3134-4385-9865-a30a04313a30.stp    N/A  file_info
4     0038012c22d354c843b5065a227392af_0     4  ...34e-d49f-4435-a7c1-7ae9d827c536.stp    N/A  file_info
5     00514cb06a191a83e3bc0b1f3df1f95c_0     5  ...732-6c63-4c08-8e9c-6bf18c43a753.stp    N/A  file_info
6     00544cebbb7a7ce97e2bf66258419436_0     6  ...081-590f-43a1-b345-4805bc04731d.stp    N/A  file_info
7     0060fbc260390cc51f88c3495f28c1f0_0     7  ...f55-ff40-42aa-a22e-ffcec267ee8b.stp    N/A  file_info
8     0064446bfb36aa489351a195264f1d0b_0     8  .../Hex_Head_Screws/STEP/stlfile12.stp    N/A  file_info
9     006e6ea34dcf22720da7cb8ca8c9fb22_0     9  ...fa1-bb1c-4d7a-b9b0-fc539dbe15ac.stp    N/A  file_info
...                                  ...   ...                                     ...    ...        ...
4536  ff12cdc31a505e45eb31b44ca70294ed_0  4536  ...571-5d8c-43ea-b8e5-544fadb86f44.stp    N/A  file_info
4537  ff1392cd629c37f955938bcd224bb7f3_0  4537  ...s/STEP/jazukassocketheadscrew70.stp    N/A  file_info
4538  ff34b29e1e432c102c58d653b4d3e4c0_0  4538  ...dcf-9536-47b2-8269-ce41a86486bd.stp    N/A  file_info
4539  ff67ece1fefefbf1afc87c068d122f30_0  4539  ...847-7d59-457b-878e-314c93b7dab8.stp    N/A  file_info
4540  ff90bbcff07840c839e3c4bd868464e1_0  4540  ...f12-6335-4a0d-a1df-17df600fc30a.stp    N/A  file_info
4541  ffb938b31feaa5f68ede4e5ff393b19e_0  4541  ...xternal Retaining Rings/STEP/47.stp    N/A  file_info
4542  ffc616bd22a7e17b7e58d6955b53c1df_0  4542  ...2a5-e2bf-4988-8e2f-173ae5f77584.stp    N/A  file_info
4543  ffd8cb22904f9846c332817f17d6d45f_0  4543  ...a27-45b8-4a3d-9fdd-4147900d03c7.stp    N/A  file_info
4544  ffe5b38465928009f9f8343de9af1d20_0  4544  ...44f-89c4-4aa4-883e-20fb78b77360.stp    N/A  file_info
4545  fffb060726399b0a057a355dbedc7d9d_0  4545  ...683-3941-40db-a4cb-8d9867831b51.stp    N/A  file_info
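The constructor call above takes three paths derived from the flow name. As a hedged convenience, the sketch below builds those paths and fails early if one is missing. `resolve_flow_artifacts` is a hypothetical helper, not part of hoops_ai; only the three file extensions and the keyword names come from the cell above.

```python
import pathlib
import tempfile


def resolve_flow_artifacts(flow_root_dir, flow_name):
    """Return the three artifact paths for a flow, raising if any is missing."""
    root = pathlib.Path(flow_root_dir)
    paths = {
        "merged_store_path": root / f"{flow_name}.dataset",
        "parquet_file_path": root / f"{flow_name}.infoset",
        "parquet_file_attribs": root / f"{flow_name}.attribset",
    }
    missing = [str(p) for p in paths.values() if not p.exists()]
    if missing:
        raise FileNotFoundError(f"Missing flow artifacts: {missing}")
    return {name: str(p) for name, p in paths.items()}


# Demonstration against a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for suffix in (".dataset", ".infoset", ".attribset"):
        (pathlib.Path(tmp) / f"demo_flow{suffix}").touch()
    kwargs = resolve_flow_artifacts(tmp, "demo_flow")
    print(sorted(kwargs))
```

With real flow outputs in place, the result could be passed as `DatasetExplorer(**resolve_flow_artifacts(flow_root_dir, flow_name))`, since the dictionary keys mirror the keyword arguments used in the tutorial.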
[27]:
explorer.get_file_info_all()
[27]:
                                    name    id                                        description subset table_name
0     00167e547565435944db7f387a002921_0     0  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
1     0019acb61ba91dd1ebb65c343d86f128_0     1  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
2     001eeef8382dbcfc00e1bfce137e369f_0     2  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
3     002c05eda3f3c4037ae686db679418df_0     3  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
4     0038012c22d354c843b5065a227392af_0     4  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
...                                  ...   ...                                                ...    ...        ...
4541  ffb938b31feaa5f68ede4e5ff393b19e_0  4541  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
4542  ffc616bd22a7e17b7e58d6955b53c1df_0  4542  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
4543  ffd8cb22904f9846c332817f17d6d45f_0  4543  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
4544  ffe5b38465928009f9f8343de9af1d20_0  4544  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
4545  fffb060726399b0a057a355dbedc7d9d_0  4545  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info

4546 rows × 5 columns

[28]:
groups = explorer.available_groups()
print(groups)
{'face_mesh', 'graph', 'faces', 'edges', 'faceface', 'Labels'}
[29]:
face_type_description = explorer.get_descriptions("face_types")
print(type(face_type_description), face_type_description)
<class 'pandas.core.frame.DataFrame'>      id      name description  table_name
4547  1  Cylinder     not set  face_types
4548  0     Plane     not set  face_types
4549  4     Torus     not set  face_types
4550  2      Cone     not set  face_types
4551  5     Nurbs     not set  face_types
4552  3    Sphere     not set  face_types
[30]:
import time
from scripts import helper_tutorials
start_time = time.time()
dist = explorer.create_distribution(key="num_nodes", bins=12, group="graph")
print(f"BRep face-count distribution created in {(time.time() - start_time):.2f} seconds\n")
helper_tutorials.print_distribution_info(dist, title="BRep face-count distribution")
BRep face-count distribution created in 0.67 seconds

[Figure: BRep face-count distribution]
[31]:
largest_files = dist['file_id_codes_in_bins'][9]
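The lookup above reads a single bin by position. As a hedged illustration, the sketch below gathers the file ID codes from the last few bins of such a result. The stand-in dictionary is invented toy data and `codes_in_top_bins` is a hypothetical helper; only the `file_id_codes_in_bins` key is taken from the tutorial, and treating its value as a sequence of per-bin code collections is an assumption.

```python
def codes_in_top_bins(dist, n_bins=1):
    """Return sorted, de-duplicated file ID codes from the last `n_bins` bins."""
    if n_bins <= 0:
        return []
    selected = set()
    for codes in dist["file_id_codes_in_bins"][-n_bins:]:
        selected.update(int(c) for c in codes)
    return sorted(selected)


# Toy stand-in for a create_distribution result with four bins.
toy_dist = {"file_id_codes_in_bins": [[0, 1, 2], [3], [], [4, 5]]}
top_codes = codes_in_top_bins(toy_dist, n_bins=2)
print(top_codes)  # [4, 5]
```

Indexing from the end avoids hard-coding a bin number that silently changes meaning when `bins` is adjusted.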

Dataset Visualization with DatasetViewer

The DatasetViewer connects dataset queries to visual inspection. It displays the files returned by a query in two ways:

  1. Image Grids: Generate collages of PNG previews for rapid visual scanning

  2. Interactive 3D Views: Open inline 3D viewers for detailed model inspection

[32]:
# Import the DatasetViewer from the insights module
from hoops_ai.insights import DatasetViewer

dataset_viewer = DatasetViewer.from_explorer(explorer)
[33]:
# Filter condition: keep rows whose num_nodes value exceeds 200
brepcount_is_large = lambda ds: ds['num_nodes'] > 200
filelist = explorer.get_file_list(group="graph", where=brepcount_is_large)
print(len(filelist))
107
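The `where` argument is an ordinary callable that receives the group's data and returns a boolean mask. The sketch below mimics that flow on toy data so it runs without the dataset files. It is an illustration, not the library's code: a plain dictionary of NumPy arrays stands in for the `xr.Dataset`, and the values are invented. The variable names `num_nodes` and `file_id_code_graph` are the ones listed for the graph group in this tutorial.

```python
import numpy as np

# Toy stand-in for the "graph" group: arrays keyed by variable name.
toy_group = {
    "num_nodes": np.array([12, 250, 250, 40, 310]),
    "file_id_code_graph": np.array([0, 1, 1, 2, 3]),
}

# Same style of predicate as in the tutorial cell.
brepcount_is_large = lambda ds: ds["num_nodes"] > 200

# The predicate yields one boolean per row of the group.
mask = brepcount_is_large(toy_group)

# Rows that pass the filter are reduced to their distinct file ID codes.
matching = np.unique(toy_group["file_id_code_graph"][mask])
print(matching)
```

Because the predicate only uses indexing and comparison, the same lambda works unchanged on an `xr.Dataset`, where the comparison returns an `xr.DataArray`.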
[34]:
fig = dataset_viewer.show_preview_as_image(filelist[80:100])

[Figure: image grid of previews for filelist[80:100]]
[35]:
import time
start_time = time.time()
# bins=None: part_label is categorical, so no numeric bin count is given
label_dist = explorer.create_distribution(key="part_label", bins=None, group="Labels")
print(f"Label distribution created in {(time.time() - start_time):.2f} seconds\n")
helper_tutorials.print_distribution_info(label_dist, title="Labels")

Label distribution created in 0.18 seconds

[Figure: Labels distribution]
[36]:
# Filter condition: keep parts whose part_label equals 15 (Pipe Fittings)
label_is_pipefittings = lambda ds: ds['part_label'] == 15
filelist = explorer.get_file_list(group="Labels", where=label_is_pipefittings)
print(len(filelist))
107
[37]:
fig = dataset_viewer.show_preview_as_image(filelist[0:36])
[Figure: image grid of previews for filelist[0:36]]
[38]:
# Get and print metadata information
file_id = 33

df_info = explorer.get_parquet_info_by_code(file_id)
print(type(df_info), df_info)
<class 'pandas.core.frame.DataFrame'>                                  name  id  \
0  01c1cdddc78f24db9098e829de118c5b_0  33

                                         description subset table_name
0  /home/maxime.marechal/Projects/HAI-repo/ML-Ini...    N/A  file_info
[39]:
print(explorer.get_descriptions("part_label_description"))
[DatasetExplorer] No records found for part_label_description.
Empty DataFrame
Columns: []
Index: []
[40]:
start_time = time.time()
dist = explorer.create_distribution(key="num_nodes", bins=12, group="graph")
print(f"BRep face-count distribution created in {(time.time() - start_time):.2f} seconds\n")
helper_tutorials.print_distribution_info(dist, title="BRep face-count distribution")
BRep face-count distribution created in 0.61 seconds

[Figure: BRep face-count distribution]

Gather files that satisfy a given condition

[41]:
start_time = time.time()

# Filter condition: keep parts whose part_label equals 15 (Pipe Fittings)
label_is_pipefittings = lambda ds: ds['part_label'] == 15

filelist = explorer.get_file_list(group="Labels", where=label_is_pipefittings)
print(f"Filtering completed in {(time.time() - start_time):.2f} seconds")
print(f"Found {len(filelist)} files with part_label == 15 (Pipe Fittings)\n")
print(filelist)
Filtering completed in 0.06 seconds
Found 107 files with part_label == 15 (Pipe Fittings)

[   8   32  114  193  254  267  344  347  352  367  402  405  411  461
  492  544  561  571  572  673  675  710  794  799  856  862  879  890
  946  961  978 1017 1031 1051 1053 1085 1091 1146 1205 1223 1320 1404
 1449 1479 1497 1508 1560 1565 1595 1658 1712 1722 1874 1952 1965 2012
 2036 2040 2042 2172 2203 2209 2273 2344 2380 2460 2462 2466 2543 2875
 3012 3016 3032 3040 3098 3129 3157 3194 3221 3230 3252 3332 3379 3387
 3472 3636 3661 3677 3725 3745 3769 3810 3872 3890 3962 3991 4046 4078
 4111 4112 4191 4263 4276 4303 4308 4334 4473]

Query data for single file

[42]:
def demo_query_single_file(explorer, file_id):
    """Show how to access and query dataset details for a single file."""
    print("=== Single File Dataset Access ===")
    import time
    # Get and print parquet info
    df_info = explorer.get_parquet_info_by_code(file_id)
    print("File info:")
    for column in df_info.columns:
        print(f"Column: {column}")
        for value in df_info[column]:
            print(f"  {value}")
    print()

    # Access various dataset groups
    groups = ["faces", "Labels", "edges", "graph"]
    datasets = {grp: explorer.file_dataset(file_id_code=file_id, group=grp) for grp in groups}

    print(f"Datasets for file ID '{file_id}':")
    for grp, ds in datasets.items():
        for name, da in ds.data_vars.items():
            print(f"  [{grp}] VARIABLE: {name}, Shape: {da.shape}, Dims: {da.dims}, Size: {da.size}")
    print()

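    # Query a single face's value from face_areas. Note that this reads the
    # face area, not a UV grid, despite the message printed below.
    start_time = time.time()
    face_da = datasets["faces"]["face_areas"]
    # Clamp the index so files with fewer than three faces still work
    face_index = min(2, face_da.sizes["face"] - 1)
    face_area = face_da.isel(face=face_index)
    print(f"face grid data for face index {face_index}:")
    # compute() materializes the lazily loaded array
    np_face_area = face_area.data.compute()
    print(f"Query took {(time.time() - start_time):.2f} seconds\n")
    # print(np_face_area)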

[43]:
demo_query_single_file(explorer, file_id=4500)
=== Single File Dataset Access ===
File info:
Column: name
  fc2c360dbd9f8ab6968702bc468647ac_0
Column: id
  4500
Column: description
  /home/maxime.marechal/Projects/HAI-repo/ML-Initiative/test_packages/cadfiles/fabwave/CAD_1_15_Classes/O_Rings/STEP/4be34a5d-20fd-47b9-9ee9-02fa697ceb83.stp
Column: subset
  N/A
Column: table_name
  file_info

Datasets for file ID '4500':
  [faces] VARIABLE: face_areas, Shape: (1,), Dims: ('face',), Size: 1
  [faces] VARIABLE: face_centroids, Shape: (1, 3), Dims: ('face', 'dim'), Size: 3
  [faces] VARIABLE: face_discretization, Shape: (1, 100, 7), Dims: ('face', 'sample', 'component'), Size: 700
  [faces] VARIABLE: face_indices, Shape: (1,), Dims: ('face',), Size: 1
  [faces] VARIABLE: face_loops, Shape: (1,), Dims: ('face',), Size: 1
  [faces] VARIABLE: face_types, Shape: (1,), Dims: ('face',), Size: 1
  [faces] VARIABLE: file_id_code_faces, Shape: (1,), Dims: ('face',), Size: 1
  [Labels] VARIABLE: file_id_code_Labels, Shape: (1,), Dims: ('part',), Size: 1
  [Labels] VARIABLE: part_label, Shape: (1,), Dims: ('part',), Size: 1
  [edges] VARIABLE: edge_convexities, Shape: (1,), Dims: ('edge',), Size: 1
  [edges] VARIABLE: edge_dihedral_angles, Shape: (1,), Dims: ('edge',), Size: 1
  [edges] VARIABLE: edge_indices, Shape: (1,), Dims: ('edge',), Size: 1
  [edges] VARIABLE: edge_lengths, Shape: (1,), Dims: ('edge',), Size: 1
  [edges] VARIABLE: edge_types, Shape: (1,), Dims: ('edge',), Size: 1
  [edges] VARIABLE: edge_u_grids, Shape: (1, 10, 6), Dims: ('edge', 'u', 'component'), Size: 60
  [edges] VARIABLE: file_id_code_edges, Shape: (1,), Dims: ('edge',), Size: 1
  [graph] VARIABLE: edges_destination, Shape: (1,), Dims: ('edge',), Size: 1
  [graph] VARIABLE: edges_source, Shape: (1,), Dims: ('edge',), Size: 1
  [graph] VARIABLE: file_id_code_graph, Shape: (1,), Dims: ('edge',), Size: 1
  [graph] VARIABLE: num_nodes, Shape: (1,), Dims: ('edge',), Size: 1

face grid data for face index 0:
Query took 0.03 seconds
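The output reports face index 0 because file 4500 has a single face, so the requested index 2 is clamped. The sketch below reproduces that selection logic on toy NumPy arrays. It is an illustration only: `select_file_rows` is a hypothetical helper, the values are invented, and how `file_dataset` selects rows internally is an assumption.

```python
import numpy as np

# Toy per-face table: one area per face, each face tagged with its file.
face_areas = np.array([3.5, 1.2, 8.0, 0.4, 2.2])
file_id_code_faces = np.array([10, 10, 11, 12, 12])


def select_file_rows(values, file_codes, file_id_code):
    """Keep only the rows that belong to one file (hypothetical helper)."""
    return values[file_codes == file_id_code]


# File 11 owns exactly one face in the toy data.
one_file = select_file_rows(face_areas, file_id_code_faces, 11)

# Same clamp as in demo_query_single_file: never index past the last face.
face_index = min(2, one_file.shape[0] - 1)
print(face_index, one_file[face_index])
```

The clamp still yields -1 when a file has no faces at all, so a caller that may see empty files should check the size before indexing.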