
HOOPS AI - Minimal ETL Demo

This notebook demonstrates the core features of the HOOPS AI data-engineering workflows.

This demo uses a small subset of the CADSynth dataset, with 101 files prepared for this tutorial. See the citation cell below for the required dataset reference.

Key Components

  • Schema-Based Dataset Organization: Define structured data schemas for consistent data merging

  • Parallel Task Decorators: Simplify CAD processing with type-safe task definitions

  • Generic Flow Orchestration: Automatically handle task dependencies and data flow

  • Automatic Dataset Merging: Process multiple files into a unified dataset structure

  • Integrated Exploration Tools: Analyze and prepare data for ML workflows

The framework automatically generates visualization assets and stream-cache data to support downstream applications.

CADSynth Dataset Citation

This notebook uses a 101-file subset derived from the CADSynth dataset for demonstration purposes.

Required citation: Zhang, Shuming (2024). CADSynth: A Dataset for Machining Feature Recognition in B-rep Models. V1. Science Data Bank. https://doi.org/10.57760/sciencedb.17011

Dataset page: https://www.scidb.cn/en/detail?dataSetId=931c088fd44f4d3e82891a5180f10d90

Subset note: The local tutorial folder packages/cadfiles/cadsynth100/step is a small sample prepared for this ETL demo. It is not the full CADSynth release.

[1]:
import hoops_ai
import os
import sys

license_key = os.environ.get("HOOPS_AI_LICENSE")
if not license_key:
    sys.exit("HOOPS_AI_LICENSE environment variable is required.")

hoops_ai.set_license(license_key, validate=True)
------------------------------------------------------------
HOOPS AI
------------------------------------------------------------
  Platform      : Windows 11
  Architecture  : AMD64
  Python        : 3.9.21
------------------------------------------------------------
  Core          : hoops-ai             1.0.0  (build: 39b99a8 2026-03-23T19:25:21Z)
  CAD Access    : hoops-exchange       26.2.0  (build: 1e11169 2026-03-23T19:16:49Z)
  Conversion    : hoops-converter      26.1.0  (build: 39b99a8 2026-03-23T19:15:42Z)
  Insights      : hoops-web-viewer     26.1.0  (build: 25137b2 2026-03-23T19:20:34Z)
------------------------------------------------------------
======================================================================
[OK] HOOPS AI License: Valid
======================================================================

Import Dependencies

The HOOPS AI framework provides several key modules:

  • flowmanager: Core orchestration engine with task decorators

  • cadaccess: CAD file loading and model access utilities

  • storage: Data persistence and retrieval components

  • dataset: Tools for exploring and preparing merged datasets

[2]:
import os
import pathlib
from typing import List

# Import the flow builder framework from the library
import hoops_ai
from hoops_ai.flowmanager import flowtask


from hoops_ai.cadaccess import HOOPSLoader, HOOPSTools
from hoops_ai.cadencoder import BrepEncoder
from hoops_ai.dataset import DatasetExplorer
from hoops_ai.storage import DataStorage, CADFileRetriever, LocalStorageProvider
from hoops_ai.storage.datasetstorage.schema_builder import SchemaBuilder

Configuration Setup

Define input and output paths for CAD processing:

  • Input directory containing source CAD files

  • Output directory for processed results

  • Source directory with a small CADSynth subset used by this demo

The framework will automatically organize outputs into structured directories.

[3]:
# Configuration: resolve input and output paths relative to the notebook
nb_dir = pathlib.Path.cwd()

datasources_dir = nb_dir.parent.joinpath("packages","cadfiles","cadsynth100","step")

if not datasources_dir.exists():
    raise FileNotFoundError(f"Data source directory not found: {datasources_dir}")

flows_outputdir = nb_dir.joinpath("out")

Schema Definition - The Foundation of Dataset Organization

The SchemaBuilder defines a structured blueprint for how CAD data should be organized:

  • Domain & Version: Namespace and versioning for schema tracking

  • Groups: Logical data categories (e.g., “machining”, “faces”, “edges”)

  • Arrays: Typed data containers with defined dimensions

  • Metadata Routing: Rules for routing metadata to appropriate storage

Schemas ensure consistent data organization across all processed files, enabling automatic merging and exploration.
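Under the hood, the schema printed by `print(cad_schema)` in the next cell is a plain nested dictionary. A minimal sketch of just the "machining" group, mirroring that printed structure (illustrative only, not imported from the library):

```python
# Sketch of the "machining" schema group as a plain nested dict,
# matching the layout printed by `print(cad_schema)`.
cad_schema_sketch = {
    "version": "1.0",
    "domain": "Manufacturing_Analysis",
    "groups": {
        "machining": {
            "primary_dimension": "part",
            "arrays": {
                "machining_category": {
                    "dims": ["part"], "dtype": "int32",
                    "description": "Machining complexity category (1-5)",
                },
                "material_type": {
                    "dims": ["part"], "dtype": "int32",
                    "description": "Material type (1-5)",
                },
                "estimated_machining_time": {
                    "dims": ["part"], "dtype": "float32",
                    "description": "Estimated machining time in hours",
                },
            },
            "description": "Manufacturing and machining classification data",
        }
    },
    "description": "Minimal schema for manufacturing classification",
}

# Invariant: every array in a group is indexed by the group's primary dimension.
for group in cad_schema_sketch["groups"].values():
    for array in group["arrays"].values():
        assert group["primary_dimension"] in array["dims"]
```

Because every array declares its dims and dtype up front, the merger can concatenate per-file stores along the shared primary dimension without guessing shapes.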

[4]:
# Schema is now defined in cad_tasks.py for ProcessPoolExecutor compatibility
# Import it from there to view or customize
from scripts.cad_tasks import cad_schema
print(cad_schema)
{'version': '1.0',
 'domain': 'Manufacturing_Analysis',
 'groups': {'machining': {'primary_dimension': 'part',
                          'arrays': {'machining_category': {'dims': ['part'], 'dtype': 'int32', 'description': 'Machining complexity category (1-5)'},
                                     'material_type': {'dims': ['part'], 'dtype': 'int32', 'description': 'Material type (1-5)'},
                                     'estimated_machining_time': {'dims': ['part'], 'dtype': 'float32', 'description': 'Estimated machining time in hours'}},
                          'description': 'Manufacturing and machining classification data'}},
 'description': 'Minimal schema for manufacturing classification',
 'metadata': {'metadata': {'file_level': {},
                           'categorical': {'material_type_description': {'dtype': 'str', 'required': False, 'description': 'Material classification'}},
                           'routing_rules': {'file_level_patterns': [],
                                             'categorical_patterns': ['material_type_description', 'category', 'type'],
                                             'default_numeric': 'file_level',
                                             'default_categorical': 'categorical',
                                             'default_string': 'categorical'}}}}
[5]:
# Import task functions from external module for ProcessPoolExecutor compatibility
from scripts.cad_tasks import gather_files, encode_manufacturing_data

[6]:
from display_utils import display_task_source
display_task_source(gather_files, "gather_files")

gather_files

@flowtask.extract(
    name="gather cad files",
    inputs=["cad_datasources"],
    outputs=["cad_dataset"],
    parallel_execution=True
)
def gather_files(source: str) -> List[str]:
    # Use simple glob pattern matching for ProcessPoolExecutor compatibility
    patterns = ["*.stp", "*.step", "*.iges", "*.igs"]
    source_files = []

    for pattern in patterns:
        search_path = os.path.join(source, pattern)
        files = glob.glob(search_path)
        source_files.extend(files)

    print(f"Found {len(source_files)} CAD files in {source}")
    return source_files
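The extension-by-extension glob in `gather_files` can be exercised in isolation. A small self-contained check against a temporary directory (not the tutorial data):

```python
import glob
import os
import tempfile

def gather_cad_files(source: str) -> list:
    """Collect CAD files using the same glob patterns as gather_files."""
    patterns = ["*.stp", "*.step", "*.iges", "*.igs"]
    source_files = []
    for pattern in patterns:
        source_files.extend(glob.glob(os.path.join(source, pattern)))
    return source_files

with tempfile.TemporaryDirectory() as tmp:
    # Three CAD files plus one non-CAD file that should be ignored.
    for name in ["a.stp", "b.step", "c.igs", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = gather_cad_files(tmp)
    assert len(found) == 3
```

Note that glob patterns are case-sensitive on most filesystems, so datasets with `.STP` or `.STEP` extensions would need additional patterns.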
[7]:
display_task_source(encode_manufacturing_data, "encode_manufacturing_data")

encode_manufacturing_data

@flowtask.transform(
    name="Manufacturing data encoding",
    inputs=["cad_dataset"],
    outputs=["cad_files_encoded"],
    parallel_execution=True
)
def encode_manufacturing_data(cad_file: str, cad_loader: HOOPSLoader, storage: DataStorage) -> str:
    # Load CAD model using the process-local HOOPSLoader
    cad_model = cad_loader.create_from_file(cad_file)

    # Set the schema for structured data organization
    # Schema is defined at module level, so it's available in all worker processes
    storage.set_schema(cad_schema)

    # Prepare BREP for feature extraction
    hoopstools = HOOPSTools()
    hoopstools.adapt_brep(cad_model, None)

    # Extract geometric features using BrepEncoder
    brep_encoder = BrepEncoder(cad_model.get_brep(), storage)

    # Topology & Graph
    graph = brep_encoder.push_face_adjacency_graph()
    extended_adj = brep_encoder.push_extended_adjacency()
    neighbors_count = brep_encoder.push_face_neighbors_count()
    edge_paths = brep_encoder.push_face_pair_edges_path(max_allow_edge_length=16)

    # Geometric Indices & Attributes
    face_attrs, face_types_dict = brep_encoder.push_face_attributes()
    face_discretization = brep_encoder.push_face_discretization(pointsamples=100)
    edge_attrs, edge_types_dict = brep_encoder.push_edge_attributes()
    curve_grids = brep_encoder.push_curvegrid(ugrid=10)

    # Face-Pair Histograms
    distance_hists = brep_encoder.push_average_face_pair_distance_histograms(grid=10, num_bins=64)
    angle_hists = brep_encoder.push_average_face_pair_angle_histograms(grid=10, num_bins=64)


    # Generate manufacturing classification data
    file_basename = os.path.basename(cad_file)
    file_name = os.path.splitext(file_basename)[0]

    # Set seed for reproducible results based on filename
    random.seed(hash(file_basename) % 1000)

    # Generate classification values
    machining_category = random.randint(1, 5)
    material_type = random.randint(1, 5)
    estimated_time = random.uniform(0.5, 10.0)

    # Material type descriptions
    material_descriptions = ["Steel", "Aluminum", "Titanium", "Plastic", "Composite"]

    # Save data using the OptStorage API (data_key format: "group/array_name")
    storage.save_data("machining/machining_category", np.array([machining_category], dtype=np.int32))
    storage.save_data("machining/material_type", np.array([material_type], dtype=np.int32))
    storage.save_data("machining/estimated_machining_time", np.array([estimated_time], dtype=np.float32))

    # Save categorical metadata (will be routed to .attribset)
    storage.save_metadata("material_type_description", material_descriptions[material_type - 1])

    # Save file-level metadata (will be routed to .infoset)
    storage.save_metadata("Item", str(cad_file))
    storage.save_metadata("Flow name", "minimal_manufacturing_flow")

    # Compress the storage into a .data file
    storage.compress_store()


    return storage.get_file_path("")
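One caveat about the seeding above: Python's built-in `hash()` on strings is salted per interpreter process, so `hash(file_basename) % 1000` is only reproducible within a single run (or with `PYTHONHASHSEED` pinned). A sketch of a seed that is stable across runs, using `hashlib` instead (a hypothetical helper, not part of the task code):

```python
import hashlib
import random

def stable_seed(file_basename: str) -> int:
    """Derive a process-independent seed in [0, 1000) from a filename via MD5."""
    digest = hashlib.md5(file_basename.encode("utf-8")).hexdigest()
    return int(digest, 16) % 1000

# The same filename yields the same seed in every process, on every run.
seed = stable_seed("00000043.stp")
assert seed == stable_seed("00000043.stp")

# Reseeding reproduces the same classification draw.
random.seed(seed)
first_draw = random.randint(1, 5)
random.seed(seed)
assert random.randint(1, 5) == first_draw
```

For the demo the distinction rarely matters, but for a dataset you intend to rebuild byte-for-byte, a content-based digest is the safer choice.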

Flow Orchestration and Automatic Dataset Generation

The hoops_ai.create_flow() function orchestrates the data flow execution. The tasks parameter accepts any user-defined task functions, so you can plug in your own encoding logic in place of the demo tasks.

[8]:
# Create and run the Data Flow
flow_name = "minimal_manufacturing_flow"
cad_flow = hoops_ai.create_flow(
    name=flow_name,
    tasks=[gather_files, encode_manufacturing_data],  # Imported from cad_tasks.py
    max_workers=12,  # number of parallel worker processes
    flows_outputdir=str(flows_outputdir),
    ml_task="Manufacturing Classification Demo",
    export_visualization=True
)

# Run the flow to process all files
print("Starting flow execution with ProcessPoolExecutor...")
print("Schema is defined in cad_tasks.py, available to all worker processes")
flow_output, output_dict, flow_file = cad_flow.process(inputs={'cad_datasources': [str(datasources_dir)]})

# Display results
print("\n" + "="*70)
print("FLOW EXECUTION COMPLETED SUCCESSFULLY")
print("="*70)
print(f"\nDataset files created:")
print(f"  Main dataset: {output_dict.get('flow_data', 'N/A')}")
print(f"  Info dataset: {output_dict.get('flow_info', 'N/A')}")
print(f"  Attributes: {output_dict.get('flow_attributes', 'N/A')}")
print(f"  Flow file: {flow_file}")
print(f"\nTotal processing time: {output_dict.get('Duration [seconds]', {}).get('total', 0):.2f} seconds")
print(f"Files processed: {output_dict.get('file_count', 0)}")
Cleaning up existing flow directory: C:\Users\LuisSalazar.LY-LS-LEGION\Documents\repos\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow
Removing all previous outputs for flow 'minimal_manufacturing_flow' to avoid build conflicts.
Starting flow execution with ProcessPoolExecutor...
Schema is defined in cad_tasks.py, available to all worker processes
[DatasetMerger] Saved schema with 5 groups to metadata.json
Sequential Task end=====================

======================================================================
FLOW EXECUTION COMPLETED SUCCESSFULLY
======================================================================

Dataset files created:
  Main dataset: C:\Users\LuisSalazar.LY-LS-LEGION\Documents\repos\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow\minimal_manufacturing_flow.dataset
  Info dataset: C:\Users\LuisSalazar.LY-LS-LEGION\Documents\repos\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow\minimal_manufacturing_flow.infoset
  Attributes: C:\Users\LuisSalazar.LY-LS-LEGION\Documents\repos\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow\minimal_manufacturing_flow.attribset
  Flow file: C:\Users\LuisSalazar.LY-LS-LEGION\Documents\repos\HOOPS-AI-tutorials\notebooks\out/flows/minimal_manufacturing_flow/minimal_manufacturing_flow.flow

Total processing time: 58.36 seconds
Files processed: 101

Data Serving: Use the DatasetExplorer to Navigate Your Data

[9]:
# Explore the generated dataset
explorer = DatasetExplorer(flow_output_file=str(flow_file))
explorer.print_table_of_contents()
[DatasetExplorer] Default local cluster started: <Client: 'tcp://127.0.0.1:61678' processes=1 threads=16, memory=7.45 GiB>

--- Dataset Table of Contents ---

EDGES_GROUP:
  EDGE_CONVEXITIES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_DIHEDRAL_ANGLES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_INDICES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_LENGTHS_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_TYPES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_U_GRIDS_DATA: Shape: (7269, 10, 6), Dims: ('edge', 'u', 'component'), Size: 436140
  FILE_ID_CODE_EDGES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269

FACEFACE_GROUP:
  A3_DISTANCE_DATA: Shape: (87426, 64), Dims: ('facepair', 'bin'), Size: 5595264
  D2_DISTANCE_DATA: Shape: (87426, 64), Dims: ('facepair', 'bin'), Size: 5595264
  EXTENDED_ADJACENCY_DATA: Shape: (87426,), Dims: ('facepair',), Size: 87426
  FACE_PAIR_EDGES_PATH_DATA: Shape: (87426, 16), Dims: ('facepair', 'dim_path'), Size: 1398816
  FILE_ID_CODE_FACEFACE_DATA: Shape: (87426,), Dims: ('facepair',), Size: 87426

FACES_GROUP:
  FACE_AREAS_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FACE_CENTROIDS_DATA: Shape: (2796, 3), Dims: ('face', 'dim'), Size: 8388
  FACE_DISCRETIZATION_DATA: Shape: (2796, 100, 7), Dims: ('face', 'sample', 'component'), Size: 1957200
  FACE_INDICES_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FACE_LOOPS_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FACE_NEIGHBORSCOUNT_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FACE_TYPES_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FILE_ID_CODE_FACES_DATA: Shape: (2796,), Dims: ('face',), Size: 2796

GRAPH_GROUP:
  EDGES_DESTINATION_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGES_SOURCE_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  FILE_ID_CODE_GRAPH_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  NUM_NODES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269

MACHINING_GROUP:
  ESTIMATED_MACHINING_TIME_DATA: Shape: (101,), Dims: ('part',), Size: 101
  FILE_ID_CODE_MACHINING_DATA: Shape: (101,), Dims: ('part',), Size: 101
  MACHINING_CATEGORY_DATA: Shape: (101,), Dims: ('part',), Size: 101
  MATERIAL_TYPE_DATA: Shape: (101,), Dims: ('part',), Size: 101
==================================
Columns in file_info:
                                   name   id                             description                   flow_name                        stream_cache_png                         stream_cache_3d subset table_name
0    037038a6fb31e24dd00b8cbc5a2c693a_0    0  ...files\cadsynth100\step\00000043.stp  minimal_manufacturing_flow  ...038a6fb31e24dd00b8cbc5a2c693a_0.png  ...038a6fb31e24dd00b8cbc5a2c693a_0.scs    N/A  file_info
1    047767563b4889323d4c9c5b188df482_0    1  ...files\cadsynth100\step\00000052.stp  minimal_manufacturing_flow  ...767563b4889323d4c9c5b188df482_0.png  ...767563b4889323d4c9c5b188df482_0.scs    N/A  file_info
2    08623406092a2186f2cb344c91bbc04e_0    2  ...files\cadsynth100\step\00000057.stp  minimal_manufacturing_flow  ...23406092a2186f2cb344c91bbc04e_0.png  ...23406092a2186f2cb344c91bbc04e_0.scs    N/A  file_info
3    08b91689a4ce513fec9c4c337b3d0d35_0    3  ...files\cadsynth100\step\00000082.stp  minimal_manufacturing_flow  ...91689a4ce513fec9c4c337b3d0d35_0.png  ...91689a4ce513fec9c4c337b3d0d35_0.scs    N/A  file_info
4    0f0ff2623e3dac9c2d24ea11b577229e_0    4  ...files\cadsynth100\step\00000044.stp  minimal_manufacturing_flow  ...ff2623e3dac9c2d24ea11b577229e_0.png  ...ff2623e3dac9c2d24ea11b577229e_0.scs    N/A  file_info
5    0fd9e97473132a463e64d00753192c94_0    5  ...files\cadsynth100\step\00000061.stp  minimal_manufacturing_flow  ...9e97473132a463e64d00753192c94_0.png  ...9e97473132a463e64d00753192c94_0.scs    N/A  file_info
6    16238c300472c562839efd3ef9d47669_0    6  ...files\cadsynth100\step\00000100.stp  minimal_manufacturing_flow  ...38c300472c562839efd3ef9d47669_0.png  ...38c300472c562839efd3ef9d47669_0.scs    N/A  file_info
7    17ec19f45ce044f4796a596a98a1d443_0    7  ...files\cadsynth100\step\00000051.stp  minimal_manufacturing_flow  ...c19f45ce044f4796a596a98a1d443_0.png  ...c19f45ce044f4796a596a98a1d443_0.scs    N/A  file_info
8    1807f54dff391c2fb4448d3be723ff58_0    8  ...files\cadsynth100\step\00000009.stp  minimal_manufacturing_flow  ...7f54dff391c2fb4448d3be723ff58_0.png  ...7f54dff391c2fb4448d3be723ff58_0.scs    N/A  file_info
9    1911436a9fabe61e0d9ca3c39e693cd9_0    9  ...files\cadsynth100\step\00000070.stp  minimal_manufacturing_flow  ...1436a9fabe61e0d9ca3c39e693cd9_0.png  ...1436a9fabe61e0d9ca3c39e693cd9_0.scs    N/A  file_info
..                                  ...  ...                                     ...                         ...                                     ...                                     ...    ...        ...
91   f0ecf08425d229f076bc71fda2766b3a_0   91  ...files\cadsynth100\step\00000072.stp  minimal_manufacturing_flow  ...cf08425d229f076bc71fda2766b3a_0.png  ...cf08425d229f076bc71fda2766b3a_0.scs    N/A  file_info
92   f2cfaad6fe6f9ae7b714ff69c3117cd2_0   92  ...files\cadsynth100\step\00000001.stp  minimal_manufacturing_flow  ...faad6fe6f9ae7b714ff69c3117cd2_0.png  ...faad6fe6f9ae7b714ff69c3117cd2_0.scs    N/A  file_info
93   f4b70db6d4ebeeddec49870d3a35f562_0   93  ...files\cadsynth100\step\00000089.stp  minimal_manufacturing_flow  ...70db6d4ebeeddec49870d3a35f562_0.png  ...70db6d4ebeeddec49870d3a35f562_0.scs    N/A  file_info
94   f4ea199971bef1d722d2302b0be61daa_0   94  ...files\cadsynth100\step\00000050.stp  minimal_manufacturing_flow  ...a199971bef1d722d2302b0be61daa_0.png  ...a199971bef1d722d2302b0be61daa_0.scs    N/A  file_info
95   f7ba841d211c1a05c691f2c7ca476a60_0   95  ...files\cadsynth100\step\00000013.stp  minimal_manufacturing_flow  ...a841d211c1a05c691f2c7ca476a60_0.png  ...a841d211c1a05c691f2c7ca476a60_0.scs    N/A  file_info
96   f8e6d2067bce2ade90af909449ff6a6f_0   96  ...files\cadsynth100\step\00000078.stp  minimal_manufacturing_flow  ...6d2067bce2ade90af909449ff6a6f_0.png  ...6d2067bce2ade90af909449ff6a6f_0.scs    N/A  file_info
97   f9415c79815249a74fa353dd738d006e_0   97  ...files\cadsynth100\step\00000097.stp  minimal_manufacturing_flow  ...15c79815249a74fa353dd738d006e_0.png  ...15c79815249a74fa353dd738d006e_0.scs    N/A  file_info
98   fae8ce4c304f0c354892f504fca69092_0   98  ...files\cadsynth100\step\00000093.stp  minimal_manufacturing_flow  ...8ce4c304f0c354892f504fca69092_0.png  ...8ce4c304f0c354892f504fca69092_0.scs    N/A  file_info
99   fbc88251063b354f0c28a21c9e435b8e_0   99  ...files\cadsynth100\step\00000077.stp  minimal_manufacturing_flow  ...88251063b354f0c28a21c9e435b8e_0.png  ...88251063b354f0c28a21c9e435b8e_0.scs    N/A  file_info
100  fe94b93ef6dbffc4347decd5afc48917_0  100  ...files\cadsynth100\step\00000083.stp  minimal_manufacturing_flow  ...4b93ef6dbffc4347decd5afc48917_0.png  ...4b93ef6dbffc4347decd5afc48917_0.scs    N/A  file_info

ML-Ready Dataset Preparation

The DatasetLoader provides tools for preparing the merged dataset for machine learning:

Key Capabilities:

  • Stratified Splitting: Create train/validation/test splits while preserving class distributions

  • Subset Tracking: Records file assignments in the dataset metadata

[10]:
# Load and split dataset for machine learning
from hoops_ai.dataset import DatasetLoader

flow_path = pathlib.Path(flow_file)
loader = DatasetLoader(
    merged_store_path=str(flow_path.parent / f"{flow_path.stem}.dataset"),
    parquet_file_path=str(flow_path.parent / f"{flow_path.stem}.infoset")
)

# Split dataset by machining category with explicit group parameter
train_size, val_size, test_size = loader.split(
    key="machining_category",
    group="machining",  # Explicitly specify the group for clarity
    train=0.6,
    validation=0.2,
    test=0.2,
    random_state=42
)

print(f"Dataset split: Train={train_size}, Validation={val_size}, Test={test_size}")

# Access training dataset
train_dataset = loader.get_dataset("train")
print(f"Training dataset ready with {len(train_dataset)} samples")

loader.close_resources()
[DatasetExplorer] Default local cluster started: <Client: 'tcp://127.0.0.1:53482' processes=1 threads=16, memory=7.45 GiB>
DEBUG: Successfully built file lists with 101 files out of 101 original file codes

============================================================
DATASET STRUCTURE OVERVIEW
============================================================

Group: edges
------------------------------
  edge_convexities: (7269,) (int32)
  edge_dihedral_angles: (7269,) (float32)
  edge_indices: (7269,) (int32)
  edge_lengths: (7269,) (float32)
  edge_types: (7269,) (int32)
  edge_u_grids: (7269, 10, 6) (float32)
  file_id_code_edges: (7269,) (int64)

Group: faceface
------------------------------
  a3_distance: (87426, 64) (float32)
  d2_distance: (87426, 64) (float32)
  extended_adjacency: (87426,) (float32)
  face_pair_edges_path: (87426, 16) (int32)
  file_id_code_faceface: (87426,) (int64)

Group: faces
------------------------------
  face_areas: (2796,) (float32)
  face_centroids: (2796, 3) (float32)
  face_discretization: (2796, 100, 7) (float32)
  face_indices: (2796,) (int32)
  face_loops: (2796,) (int32)
  face_neighborscount: (2796,) (int32)
  face_types: (2796,) (int32)
  file_id_code_faces: (2796,) (int64)

Group: graph
------------------------------
  edges_destination: (7269,) (int32)
  edges_source: (7269,) (int32)
  file_id_code_graph: (7269,) (int64)
  num_nodes: (7269,) (int32)

Group: machining
------------------------------
  estimated_machining_time: (101,) (float32)
  file_id_code_machining: (101,) (int64)
  machining_category: (101,) (int32)
  material_type: (101,) (int32)

============================================================
Dataset split by machining_category: Train=59, Validation=21, Test=21
Dataset split: Train=59, Validation=21, Test=21
Training dataset ready with 59 samples
[DatasetExplorer] Shutting down this Dask client...
[DatasetExplorer] Closing the LocalCluster...
[DatasetExplorer] All resources closed.
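The stratified split performed by loader.split() can be approximated in plain numpy: shuffle the sample indices within each class, then carve each class 60/20/20 so every split preserves the class mix. A simplified sketch of that idea (not the DatasetLoader internals):

```python
import numpy as np

def stratified_split(labels, train=0.6, validation=0.2, random_state=42):
    """Split sample indices per class so each split keeps the class ratios."""
    rng = np.random.default_rng(random_state)
    splits = {"train": [], "validation": [], "test": []}
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_train = int(round(train * len(idx)))
        n_val = int(round(validation * len(idx)))
        splits["train"].extend(idx[:n_train])
        splits["validation"].extend(idx[n_train:n_train + n_val])
        splits["test"].extend(idx[n_train + n_val:])
    return {k: np.array(v) for k, v in splits.items()}

# 101 parts with 5 machining categories, as in this demo (synthetic labels).
labels = np.random.default_rng(0).integers(1, 6, size=101)
parts = stratified_split(labels)
assert sum(len(v) for v in parts.values()) == 101
```

Per-class rounding explains why the actual split above came out 59/21/21 rather than an exact 60.6/20.2/20.2.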

Summary: The HOOPS AI - Data Flow Advantage

This minimal example demonstrates how HOOPS AI simplifies CAD data processing for ML:

  1. Schema-First Approach: Define your data structure before processing

  2. Decorator-Based Tasks: Easily inject custom processing logic

  3. Automatic Orchestration: Let the framework handle execution complexity

  4. Unified Dataset: Get consistently merged data ready for ML

  5. Built-in Exploration: Analyze and prepare datasets with powerful tools

The framework automatically generates visualization assets and stream caches, making it easy to integrate with downstream visualization tools.

[11]:
# Visualization libraries
import matplotlib.pyplot as plt

def print_distribution_info(dist, title="Distribution"):
    """Helper function to print and visualize distribution data."""
    # Count files in each bin
    list_filecount = [bin_files.size for bin_files in dist['file_id_codes_in_bins']]
    dist['file_count'] = list_filecount
    # Visualization with matplotlib
    fig, ax = plt.subplots(figsize=(12, 4))

    bin_centers = 0.5 * (dist['bin_edges'][1:] + dist['bin_edges'][:-1])
    ax.bar(bin_centers, dist['file_count'], width=(dist['bin_edges'][1] - dist['bin_edges'][0]),
           alpha=0.7, color='steelblue', edgecolor='black', linewidth=1)

    # Add file count annotations
    for i, count in enumerate(dist['file_count']):
        if count > 0:  # Only annotate non-empty bins
            ax.text(bin_centers[i], count + 0.5, f"{count}",
                    ha='center', va='bottom', fontsize=8)

    ax.set_xlabel('Value')
    ax.set_ylabel('Count')
    ax.set_title(f'{title} Histogram')
    ax.grid(True, linestyle='--', alpha=0.7)

    plt.tight_layout()
    plt.show()
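The bin-center arithmetic in the helper above, `0.5 * (edges[1:] + edges[:-1])`, pairs with numpy's histogram convention: `len(edges) - 1` counts, one per bin. A quick standalone check:

```python
import numpy as np

values = np.array([1, 1, 2, 3, 3, 3, 5], dtype=float)
counts, edges = np.histogram(values, bins=4)

# One center per bin: the midpoint between consecutive edges.
centers = 0.5 * (edges[1:] + edges[:-1])
assert len(centers) == len(counts) == len(edges) - 1

# Every center sits strictly inside its bin.
assert np.all((centers > edges[:-1]) & (centers < edges[1:]))
```

This is also why the bar width in the helper is `edges[1] - edges[0]`: with a uniform binning, all bins share that width.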
[12]:
import time
start_time = time.time()
face_dist = explorer.create_distribution(key="machining_category", bins=None, group="machining")
print(f"Machining distribution created in {(time.time() - start_time):.2f} seconds\n")
print_distribution_info(face_dist, title="Machining Category")
Machining distribution created in 0.28 seconds

[Figure: Machining Category histogram]

Dataset Visualization with DatasetViewer

The DatasetViewer is a powerful visualization tool that bridges dataset queries and visual analysis. It enables you to quickly visualize query results in two ways:

  1. Image Grids: Generate collages of PNG previews for rapid visual scanning

  2. Interactive 3D Views: Open inline 3D viewers for detailed model inspection

[13]:
# Import the DatasetViewer from the insights module
from hoops_ai.insights import DatasetViewer

# Create a DatasetViewer using the convenience method from_explorer
# This method queries the explorer and builds the file ID to visualization path mappings
dataset_viewer = DatasetViewer.from_explorer(explorer)

[14]:
start_time = time.time()

# Condition: select files whose material_type equals 2
material_is_frequent = lambda ds: ds['material_type'] == 2

filelist = explorer.get_file_list(group="machining", where=material_is_frequent)
print(f"Filtering completed in {(time.time() - start_time):.2f} seconds")
print(filelist)
Filtering completed in 0.02 seconds
[ 1  2  5 11 13 24 35 48 50 51 56 57 59 61 66 73 74 76 78 80 83 85 86 96
 97 99]
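get_file_list applies the condition to the group's dataset and returns the matching file ID codes. The same selection can be sketched on a plain numpy array of material types (illustrative data, not the real dataset):

```python
import numpy as np

# Hypothetical per-part material types (codes 1-5), one entry per file ID.
material_type = np.array([2, 1, 2, 5, 2, 3, 2])

# Same predicate as the lambda above: material_type == 2.
mask = material_type == 2
file_ids = np.flatnonzero(mask)
assert file_ids.tolist() == [0, 2, 4, 6]
```

The returned array of file IDs is exactly what DatasetViewer consumes in the next example.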

Example 1: Visualize Query Results as Image Grid

Now let’s use the query results we obtained earlier and visualize them as a grid of images. This is perfect for quickly scanning through many files to understand patterns or identify specific cases.

[15]:
# Visualize the filtered files as an image grid with file IDs as labels
fig = dataset_viewer.show_preview_as_image(
    filelist,
    k=len(filelist),           # Show every file in the filtered list
    grid_cols=8,               # 8 columns per row
    label_format='id',         # Show file IDs as labels
    figsize=(15, 5)            # Figure size in inches
)

plt.show()
[Figure: image grid of filtered files]