
HOOPS AI - Minimal ETL Demo

This notebook demonstrates the core features of the HOOPS AI data engineering workflow:

Key Components

  • Schema-Based Dataset Organization: Define structured data schemas for consistent data merging

  • Parallel Task Decorators: Simplify CAD processing with type-safe task definitions

  • Generic Flow Orchestration: Automatically handle task dependencies and data flow

  • Automatic Dataset Merging: Process multiple files into a unified dataset structure

  • Integrated Exploration Tools: Analyze and prepare data for ML workflows

The framework automatically generates visualization assets and stream cache data to support downstream applications.

[1]:
import hoops_ai
import os

# Note: License is also set in cad_tasks.py for worker processes
# This is only for the parent process (optional but good practice)
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)
ℹ️ Using TEST LICENSE (expires February 8th, 2026 - 9 days remaining)
   For production use, obtain your own license from Tech Soft 3D
HOOPS AI version :  1.0.0-b2dev12

Import Dependencies

The HOOPS AI framework provides several key modules:

  • flowmanager: Core orchestration engine with task decorators

  • cadaccess: CAD file loading and model access utilities

  • storage: Data persistence and retrieval components

  • dataset: Tools for exploring and preparing merged datasets

[2]:
import os
import pathlib
from typing import Tuple, List

# Import the flow builder framework from the library
import hoops_ai
from hoops_ai.flowmanager import flowtask


from hoops_ai.cadaccess import HOOPSLoader, HOOPSTools
from hoops_ai.cadencoder import BrepEncoder
from hoops_ai.dataset import DatasetExplorer
from hoops_ai.storage import DataStorage, CADFileRetriever, LocalStorageProvider
from hoops_ai.storage.datasetstorage.schema_builder import SchemaBuilder

Configuration Setup

Define input and output paths for CAD processing:

  • Input directory containing source CAD files

  • Output directory for processed results

  • Source directory with specific CAD file formats

The framework will automatically organize outputs into structured directories.

[3]:
# Configuration - Using simpler paths
nb_dir = pathlib.Path.cwd()

datasources_dir = nb_dir.parent.joinpath("packages", "cadfiles", "cadsynth100", "step")

if not datasources_dir.exists():
    raise FileNotFoundError(f"Data source directory does not exist: {datasources_dir}")

flows_outputdir = nb_dir.joinpath("out")

Schema Definition - The Foundation of Dataset Organization

The SchemaBuilder defines a structured blueprint for how CAD data should be organized:

  • Domain & Version: Namespace and versioning for schema tracking

  • Groups: Logical data categories (e.g., “machining”, “faces”, “edges”)

  • Arrays: Typed data containers with defined dimensions

  • Metadata Routing: Rules for routing metadata to appropriate storage

Schemas ensure consistent data organization across all processed files, enabling automatic merging and exploration.
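
For reference, here is a minimal sketch of this demo's schema expressed as a plain Python dict. The actual definition lives in scripts/cad_tasks.py (and may be built through SchemaBuilder instead); the sketch simply mirrors the structure printed by the next cell:

# A sketch of the demo schema as a plain dict, mirroring the structure
# printed below. The real definition in scripts/cad_tasks.py may use
# SchemaBuilder; this is for orientation only.
cad_schema_sketch = {
    "version": "1.0",
    "domain": "Manufacturing_Analysis",
    "description": "Minimal schema for manufacturing classification",
    "groups": {
        "machining": {
            # One value per part/file along the 'part' dimension
            "primary_dimension": "part",
            "description": "Manufacturing and machining classification data",
            "arrays": {
                "machining_category": {"dims": ["part"], "dtype": "int32",
                                       "description": "Machining complexity category (1-5)"},
                "material_type": {"dims": ["part"], "dtype": "int32",
                                  "description": "Material type (1-5)"},
                "estimated_machining_time": {"dims": ["part"], "dtype": "float32",
                                             "description": "Estimated machining time in hours"},
            },
        }
    },
}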

[4]:
# Schema is now defined in cad_tasks.py for ProcessPoolExecutor compatibility
# Import it from there to view or customize
from scripts.cad_tasks import cad_schema
print(cad_schema)
ℹ️ Using TEST LICENSE (expires February 8th, 2026 - 9 days remaining)
   For production use, obtain your own license from Tech Soft 3D
HOOPS AI version :  1.0.0-b2dev12

{'version': '1.0',
 'domain': 'Manufacturing_Analysis',
 'groups': {'machining': {'primary_dimension': 'part',
                          'arrays': {'machining_category': {'dims': ['part'], 'dtype': 'int32',
                                                            'description': 'Machining complexity category (1-5)'},
                                     'material_type': {'dims': ['part'], 'dtype': 'int32',
                                                       'description': 'Material type (1-5)'},
                                     'estimated_machining_time': {'dims': ['part'], 'dtype': 'float32',
                                                                  'description': 'Estimated machining time in hours'}},
                          'description': 'Manufacturing and machining classification data'}},
 'description': 'Minimal schema for manufacturing classification',
 'metadata': {'metadata': {'file_level': {},
                           'categorical': {'material_type_description': {'dtype': 'str', 'required': False,
                                                                         'description': 'Material classification'}},
                           'routing_rules': {'file_level_patterns': [],
                                             'categorical_patterns': ['material_type_description', 'category', 'type'],
                                             'default_numeric': 'file_level',
                                             'default_categorical': 'categorical',
                                             'default_string': 'categorical'}}}}

[5]:
# Import task functions from external module for ProcessPoolExecutor compatibility
from scripts.cad_tasks import gather_files, encode_manufacturing_data

[6]:
from display_utils import display_task_source
display_task_source(gather_files, "gather_files")

gather_files

@flowtask.extract(
    name="gather cad files",
    inputs=["cad_datasources"],
    outputs=["cad_dataset"],
    parallel_execution=True
)
def gather_files(source: str) -> List[str]:
    # Use simple glob pattern matching for ProcessPoolExecutor compatibility
    patterns = ["*.stp", "*.step", "*.iges", "*.igs"]
    source_files = []

    for pattern in patterns:
        search_path = os.path.join(source, pattern)
        files = glob.glob(search_path)
        source_files.extend(files)

    print(f"Found {len(source_files)} CAD files in {source}")
    return source_files
[7]:
display_task_source(encode_manufacturing_data, "encode_manufacturing_data")

encode_manufacturing_data

@flowtask.transform(
    name="Manufacturing data encoding",
    inputs=["cad_dataset"],
    outputs=["cad_files_encoded"],
    parallel_execution=True
)
def encode_manufacturing_data(cad_file: str, cad_loader: HOOPSLoader, storage: DataStorage) -> str:
    # Load CAD model using the process-local HOOPSLoader
    cad_model = cad_loader.create_from_file(cad_file)

    # Set the schema for structured data organization
    # Schema is defined at module level, so it's available in all worker processes
    storage.set_schema(cad_schema)

    # Prepare BREP for feature extraction
    hoopstools = HOOPSTools()
    hoopstools.adapt_brep(cad_model, None)

    # Extract geometric features using BrepEncoder
    brep_encoder = BrepEncoder(cad_model.get_brep(), storage)

    # Topology & Graph
    graph = brep_encoder.push_face_adjacency_graph()
    extended_adj = brep_encoder.push_extended_adjacency()
    neighbors_count = brep_encoder.push_face_neighbors_count()
    edge_paths = brep_encoder.push_face_pair_edges_path(max_allow_edge_length=16)

    # Geometric Indices & Attributes
    face_attrs, face_types_dict = brep_encoder.push_face_attributes()
    face_discretization = brep_encoder.push_face_discretization(pointsamples=100)
    edge_attrs, edge_types_dict = brep_encoder.push_edge_attributes()
    curve_grids = brep_encoder.push_curvegrid(ugrid=10)

    # Face-Pair Histograms
    distance_hists = brep_encoder.push_average_face_pair_distance_histograms(grid=10, num_bins=64)
    angle_hists = brep_encoder.push_average_face_pair_angle_histograms(grid=10, num_bins=64)


    # Generate manufacturing classification data
    file_basename = os.path.basename(cad_file)
    file_name = os.path.splitext(file_basename)[0]

    # Set seed for reproducible results based on filename
    random.seed(hash(file_basename) % 1000)

    # Generate classification values
    machining_category = random.randint(1, 5)
    material_type = random.randint(1, 5)
    estimated_time = random.uniform(0.5, 10.0)

    # Material type descriptions
    material_descriptions = ["Steel", "Aluminum", "Titanium", "Plastic", "Composite"]

    # Save data using the OptStorage API (data_key format: "group/array_name")
    storage.save_data("machining/machining_category", np.array([machining_category], dtype=np.int32))
    storage.save_data("machining/material_type", np.array([material_type], dtype=np.int32))
    storage.save_data("machining/estimated_machining_time", np.array([estimated_time], dtype=np.float32))

    # Save categorical metadata (will be routed to .attribset)
    storage.save_metadata("material_type_description", material_descriptions[material_type - 1])

    # Save file-level metadata (will be routed to .infoset)
    storage.save_metadata("Item", str(cad_file))
    storage.save_metadata("Flow name", "minimal_manufacturing_flow")

    # Compress the storage into a .data file
    storage.compress_store()


    return storage.get_file_path("")

Flow Orchestration and Automatic Dataset Generation

The hoops_ai.create_flow() function orchestrates flow execution. The tasks parameter accepts any user-defined task functions, so the pipeline is fully customizable: you can write your own encoding logic, as illustrated in the sketch below.
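
As an illustration, a custom transform task could look like the sketch below. It reuses only the patterns already shown above (the @flowtask.transform decorator, the (cad_file, cad_loader, storage) signature, and the storage calls); the choice of features is illustrative, and in this notebook such tasks live in scripts/cad_tasks.py for ProcessPoolExecutor compatibility:

# Sketch of a user-defined transform task. Every call below mirrors one
# used in encode_manufacturing_data; only the feature selection differs.
@flowtask.transform(
    name="Custom encoding",
    inputs=["cad_dataset"],
    outputs=["cad_files_encoded"],
    parallel_execution=True
)
def encode_custom(cad_file: str, cad_loader: HOOPSLoader, storage: DataStorage) -> str:
    cad_model = cad_loader.create_from_file(cad_file)
    storage.set_schema(cad_schema)  # same schema as above

    # Prepare the BREP, then push only the features you need
    HOOPSTools().adapt_brep(cad_model, None)
    brep_encoder = BrepEncoder(cad_model.get_brep(), storage)
    brep_encoder.push_face_attributes()

    storage.save_metadata("Item", str(cad_file))
    storage.compress_store()
    return storage.get_file_path("")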

[8]:
# Create and run the Data Flow
flow_name = "minimal_manufacturing_flow"
cad_flow = hoops_ai.create_flow(
    name=flow_name,
    tasks=[gather_files, encode_manufacturing_data],  # Imported from cad_tasks.py
    max_workers=6,  # number of parallel worker processes
    flows_outputdir=str(flows_outputdir),
    ml_task="Manufacturing Classification Demo",
    auto_dataset_export=True,  # Enable automatic dataset merging
    debug=False,  # keep False to enable parallel execution
    export_visualization=True
)

# Run the flow to process all files
print("Starting flow execution with ProcessPoolExecutor...")
print("✓ Schema is defined in cad_tasks.py, available to all worker processes")
flow_output, output_dict, flow_file = cad_flow.process(inputs={'cad_datasources': [str(datasources_dir)]})

# Display results
print("\n" + "="*70)
print("FLOW EXECUTION COMPLETED SUCCESSFULLY")
print("="*70)
print(f"\nDataset files created:")
print(f"  Main dataset: {output_dict.get('flow_data', 'N/A')}")
print(f"  Info dataset: {output_dict.get('flow_info', 'N/A')}")
print(f"  Attributes: {output_dict.get('flow_attributes', 'N/A')}")
print(f"  Flow file: {flow_file}")
print(f"\nTotal processing time: {output_dict.get('Duration [seconds]', {}).get('total', 0):.2f} seconds")
print(f"Files processed: {output_dict.get('file_count', 0)}")
Starting flow execution with ProcessPoolExecutor...
✓ Schema is defined in cad_tasks.py, available to all worker processes
|INFO| FLOW | ######### Flow 'minimal_manufacturing_flow' start #######
|WARNING| FLOW | Cleaning up existing flow directory: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow
|WARNING| FLOW | Removing all previous outputs for flow 'minimal_manufacturing_flow' to avoid build conflicts.
|INFO| FLOW | Flow directory successfully cleaned and recreated: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow
|INFO| FLOW |
Flow Execution Summary
|INFO| FLOW | ==================================================
|INFO| FLOW | Task 1: gather cad files
|INFO| FLOW |     Inputs : cad_datasources
|INFO| FLOW |     Outputs: cad_dataset
|INFO| FLOW | Task 2: Manufacturing data encoding
|INFO| FLOW |     Inputs : cad_dataset
|INFO| FLOW |     Outputs: cad_files_encoded
|INFO| FLOW | Task 3: AutoDatasetExportTask
|INFO| FLOW |     Inputs : cad_files_encoded
|INFO| FLOW |     Outputs: encoded_dataset, encoded_dataset_info, encoded_dataset_attribs
|INFO| FLOW |
Task Dependencies:
|INFO| FLOW | gather cad files has no dependencies.
|INFO| FLOW | gather cad files --> Manufacturing data encoding
|INFO| FLOW | Manufacturing data encoding --> AutoDatasetExportTask
|INFO| FLOW | ==================================================

|INFO| FLOW | Executing ParallelTask 'gather cad files' with 1 items.
|INFO| FLOW | Executing ParallelTask 'Manufacturing data encoding' with 101 items.
|INFO| FLOW | Executing SequentialTask 'AutoDatasetExportTask'.
[DatasetMerger] Saved schema with 5 groups to metadata.json
|INFO| FLOW | Auto dataset export completed in 30.61 seconds
Sequential Task end=====================
|INFO| FLOW | Time taken: 266.06 seconds
|INFO| FLOW | ######### Flow 'minimal_manufacturing_flow' end ######

======================================================================
FLOW EXECUTION COMPLETED SUCCESSFULLY
======================================================================

Dataset files created:
  Main dataset: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow\minimal_manufacturing_flow.dataset
  Info dataset: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow\minimal_manufacturing_flow.infoset
  Attributes: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow\minimal_manufacturing_flow.attribset
  Flow file: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out/flows/minimal_manufacturing_flow/minimal_manufacturing_flow.flow

Total processing time: 266.06 seconds
Files processed: 101

Data Serving: Use the DatasetExplorer to Navigate Your Data

[9]:
# Explore the generated dataset
explorer = DatasetExplorer(flow_output_file=str(flow_file))
explorer.print_table_of_contents()
[DatasetExplorer] Default local cluster started: <Client: 'tcp://127.0.0.1:61555' processes=1 threads=16, memory=7.45 GiB>

--- Dataset Table of Contents ---

EDGES_GROUP:
  EDGE_CONVEXITIES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_DIHEDRAL_ANGLES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_INDICES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_LENGTHS_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_TYPES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGE_U_GRIDS_DATA: Shape: (7269, 10, 6), Dims: ('edge', 'u', 'component'), Size: 436140
  FILE_ID_CODE_EDGES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269

FACEFACE_GROUP:
  A3_DISTANCE_DATA: Shape: (87426, 64), Dims: ('facepair', 'bin'), Size: 5595264
  D2_DISTANCE_DATA: Shape: (87426, 64), Dims: ('facepair', 'bin'), Size: 5595264
  EXTENDED_ADJACENCY_DATA: Shape: (87426,), Dims: ('facepair',), Size: 87426
  FACE_PAIR_EDGES_PATH_DATA: Shape: (87426, 16), Dims: ('facepair', 'dim_path'), Size: 1398816
  FILE_ID_CODE_FACEFACE_DATA: Shape: (87426,), Dims: ('facepair',), Size: 87426

FACES_GROUP:
  FACE_AREAS_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FACE_DISCRETIZATION_DATA: Shape: (2796, 100, 7), Dims: ('face', 'sample', 'component'), Size: 1957200
  FACE_INDICES_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FACE_LOOPS_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FACE_NEIGHBORSCOUNT_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FACE_TYPES_DATA: Shape: (2796,), Dims: ('face',), Size: 2796
  FILE_ID_CODE_FACES_DATA: Shape: (2796,), Dims: ('face',), Size: 2796

GRAPH_GROUP:
  EDGES_DESTINATION_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  EDGES_SOURCE_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  FILE_ID_CODE_GRAPH_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269
  NUM_NODES_DATA: Shape: (7269,), Dims: ('edge',), Size: 7269

MACHINING_GROUP:
  ESTIMATED_MACHINING_TIME_DATA: Shape: (101,), Dims: ('part',), Size: 101
  FILE_ID_CODE_MACHINING_DATA: Shape: (101,), Dims: ('part',), Size: 101
  MACHINING_CATEGORY_DATA: Shape: (101,), Dims: ('part',), Size: 101
  MATERIAL_TYPE_DATA: Shape: (101,), Dims: ('part',), Size: 101
==================================
Columns in file_info:
                                 name   id                             description                   flow_name                        stream_cache_png                         stream_cache_3d subset table_name
0    0147f61879421f5af3260319e25b8f81    0  ...files\cadsynth100\step\00000078.stp  minimal_manufacturing_flow  ...147f61879421f5af3260319e25b8f81.png  ...147f61879421f5af3260319e25b8f81.scs    N/A  file_info
1    080b90e95ee1e0cd77d0baada1de196e    1  ...files\cadsynth100\step\00000039.stp  minimal_manufacturing_flow  ...80b90e95ee1e0cd77d0baada1de196e.png  ...80b90e95ee1e0cd77d0baada1de196e.scs    N/A  file_info
2    110f47e3f2969a885212887ad048b6dd    2  ...files\cadsynth100\step\00000014.stp  minimal_manufacturing_flow  ...10f47e3f2969a885212887ad048b6dd.png  ...10f47e3f2969a885212887ad048b6dd.scs    N/A  file_info
3    17d3f90a8b2a8b899a76f7bef8024935    3  ...files\cadsynth100\step\00000020.stp  minimal_manufacturing_flow  ...7d3f90a8b2a8b899a76f7bef8024935.png  ...7d3f90a8b2a8b899a76f7bef8024935.scs    N/A  file_info
4    1a036e50a7cafe7e2f6e82b3cb1daac5    4  ...files\cadsynth100\step\00000001.stp  minimal_manufacturing_flow  ...a036e50a7cafe7e2f6e82b3cb1daac5.png  ...a036e50a7cafe7e2f6e82b3cb1daac5.scs    N/A  file_info
5    1c089b81681e228074e5d8c735d88d3e    5  ...files\cadsynth100\step\00000044.stp  minimal_manufacturing_flow  ...c089b81681e228074e5d8c735d88d3e.png  ...c089b81681e228074e5d8c735d88d3e.scs    N/A  file_info
6    20b8dedec9901ca9cebf166edbaaccde    6  ...files\cadsynth100\step\00000016.stp  minimal_manufacturing_flow  ...0b8dedec9901ca9cebf166edbaaccde.png  ...0b8dedec9901ca9cebf166edbaaccde.scs    N/A  file_info
7    21224b9e7e65e077bddc133f29aab0e2    7  ...files\cadsynth100\step\00000064.stp  minimal_manufacturing_flow  ...1224b9e7e65e077bddc133f29aab0e2.png  ...1224b9e7e65e077bddc133f29aab0e2.scs    N/A  file_info
8    23016501e5cc50aa62f1b024d72d31d0    8  ...files\cadsynth100\step\00000003.stp  minimal_manufacturing_flow  ...3016501e5cc50aa62f1b024d72d31d0.png  ...3016501e5cc50aa62f1b024d72d31d0.scs    N/A  file_info
9    2468ec4bea7197c2d039495943ed0b08    9  ...files\cadsynth100\step\00000097.stp  minimal_manufacturing_flow  ...468ec4bea7197c2d039495943ed0b08.png  ...468ec4bea7197c2d039495943ed0b08.scs    N/A  file_info
..                                ...  ...                                     ...                         ...                                     ...                                     ...    ...        ...
91   ef1c6bc333a7305b98dff535021260cf   91  ...files\cadsynth100\step\00000076.stp  minimal_manufacturing_flow  ...f1c6bc333a7305b98dff535021260cf.png  ...f1c6bc333a7305b98dff535021260cf.scs    N/A  file_info
92   efb2e1eab37f5ec9aa2c3bf8619c85d6   92  ...files\cadsynth100\step\00000082.stp  minimal_manufacturing_flow  ...fb2e1eab37f5ec9aa2c3bf8619c85d6.png  ...fb2e1eab37f5ec9aa2c3bf8619c85d6.scs    N/A  file_info
93   efba9cf77f492d52104bcb9db2079abe   93  ...files\cadsynth100\step\00000027.stp  minimal_manufacturing_flow  ...fba9cf77f492d52104bcb9db2079abe.png  ...fba9cf77f492d52104bcb9db2079abe.scs    N/A  file_info
94   f16fc0f9246b97edc4e0a73b93222ece   94  ...files\cadsynth100\step\00000065.stp  minimal_manufacturing_flow  ...16fc0f9246b97edc4e0a73b93222ece.png  ...16fc0f9246b97edc4e0a73b93222ece.scs    N/A  file_info
95   f2c9c222d79fb33d84bed38c3f12895f   95  ...files\cadsynth100\step\00000099.stp  minimal_manufacturing_flow  ...2c9c222d79fb33d84bed38c3f12895f.png  ...2c9c222d79fb33d84bed38c3f12895f.scs    N/A  file_info
96   f741e12764b2a9b143cf7be893a1063c   96  ...files\cadsynth100\step\00000007.stp  minimal_manufacturing_flow  ...741e12764b2a9b143cf7be893a1063c.png  ...741e12764b2a9b143cf7be893a1063c.scs    N/A  file_info
97   f968b95ded06cfe8d3df3f4ce9339cb0   97  ...files\cadsynth100\step\00000060.stp  minimal_manufacturing_flow  ...968b95ded06cfe8d3df3f4ce9339cb0.png  ...968b95ded06cfe8d3df3f4ce9339cb0.scs    N/A  file_info
98   fa633f124236d03edbd3ad719a22c72f   98  ...files\cadsynth100\step\00000034.stp  minimal_manufacturing_flow  ...a633f124236d03edbd3ad719a22c72f.png  ...a633f124236d03edbd3ad719a22c72f.scs    N/A  file_info
99   fb6c3cb83cd14b1ed63185aa1b721278   99  ...files\cadsynth100\step\00000062.stp  minimal_manufacturing_flow  ...b6c3cb83cd14b1ed63185aa1b721278.png  ...b6c3cb83cd14b1ed63185aa1b721278.scs    N/A  file_info
100  ff29ad68509c13595479bd8095423eeb  100  ...files\cadsynth100\step\00000050.stp  minimal_manufacturing_flow  ...f29ad68509c13595479bd8095423eeb.png  ...f29ad68509c13595479bd8095423eeb.scs    N/A  file_info

ML-Ready Dataset Preparation

The DatasetLoader provides tools for preparing the merged dataset for machine learning:

Key Capabilities:

  • Stratified Splitting: Create train/validation/test splits while preserving class distributions

  • Subset Tracking: Records file assignments in the dataset metadata

[10]:
# Load and split dataset for machine learning
from hoops_ai.dataset import DatasetLoader

flow_path = pathlib.Path(flow_file)
loader = DatasetLoader(
    merged_store_path=str(flow_path.parent / f"{flow_path.stem}.dataset"),
    parquet_file_path=str(flow_path.parent / f"{flow_path.stem}.infoset")
)

# Split dataset by machining category with explicit group parameter
train_size, val_size, test_size = loader.split(
    key="machining_category",
    group="machining",  # Explicitly specify the group for clarity
    train=0.6,
    validation=0.2,
    test=0.2,
    random_state=42
)

print(f"Dataset split: Train={train_size}, Validation={val_size}, Test={test_size}")

# Access training dataset
train_dataset = loader.get_dataset("train")
print(f"Training dataset ready with {len(train_dataset)} samples")

loader.close_resources()
[DatasetExplorer] Default local cluster started: <Client: 'tcp://127.0.0.1:61570' processes=1 threads=16, memory=7.45 GiB>
DEBUG: Successfully built file lists with 101 files out of 101 original file codes

============================================================
DATASET STRUCTURE OVERVIEW
============================================================

Group: edges
------------------------------
  edge_convexities: (7269,) (int32)
  edge_dihedral_angles: (7269,) (float32)
  edge_indices: (7269,) (int32)
  edge_lengths: (7269,) (float32)
  edge_types: (7269,) (int32)
  edge_u_grids: (7269, 10, 6) (float32)
  file_id_code_edges: (7269,) (int64)

Group: faceface
------------------------------
  a3_distance: (87426, 64) (float32)
  d2_distance: (87426, 64) (float32)
  extended_adjacency: (87426,) (float32)
  face_pair_edges_path: (87426, 16) (int32)
  file_id_code_faceface: (87426,) (int64)

Group: faces
------------------------------
  face_areas: (2796,) (float32)
  face_discretization: (2796, 100, 7) (float32)
  face_indices: (2796,) (int32)
  face_loops: (2796,) (int32)
  face_neighborscount: (2796,) (int32)
  face_types: (2796,) (int32)
  file_id_code_faces: (2796,) (int64)

Group: graph
------------------------------
  edges_destination: (7269,) (int32)
  edges_source: (7269,) (int32)
  file_id_code_graph: (7269,) (int64)
  num_nodes: (7269,) (int32)

Group: machining
------------------------------
  estimated_machining_time: (101,) (float32)
  file_id_code_machining: (101,) (int64)
  machining_category: (101,) (int32)
  material_type: (101,) (int32)

============================================================
Dataset split by machining_category: Train=60, Validation=20, Test=21
Dataset split: Train=60, Validation=20, Test=21
Training dataset ready with 60 samples
[DatasetExplorer] Shutting down this Dask client...
[DatasetExplorer] Closing the LocalCluster...
[DatasetExplorer] All resources closed.
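
After the split, each file's subset assignment is recorded in the dataset metadata: the subset column of the infoset, which held N/A before splitting. A quick verification sketch, assuming the .infoset file is Parquet as the parquet_file_path argument above suggests:

import pandas as pd

# Sketch: read the infoset (assumed to be Parquet) and count how many
# files landed in each subset after the split above.
info = pd.read_parquet(flow_path.parent / f"{flow_path.stem}.infoset")
print(info['subset'].value_counts())  # expected: train/validation/test counts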

Summary: The HOOPS AI Data Flow Advantage

This minimal example demonstrates how HOOPS AI simplifies CAD data processing for ML:

  1. Schema-First Approach: Define your data structure before processing

  2. Decorator-Based Tasks: Easily inject custom processing logic

  3. Automatic Orchestration: Let the framework handle execution complexity

  4. Unified Dataset: Get consistently merged data ready for ML

  5. Built-in Exploration: Analyze and prepare datasets with powerful tools

The framework automatically generates visualization assets and stream caches, making it easy to integrate with downstream visualization tools.

[11]:
# Visualization libraries
import matplotlib.pyplot as plt

def print_distribution_info(dist, title="Distribution"):
    """Visualize distribution data as an annotated histogram."""
    # Count how many files fall into each bin
    dist['file_count'] = [bin_files.size for bin_files in dist['file_id_codes_in_bins']]
    # Visualization with matplotlib
    fig, ax = plt.subplots(figsize=(12, 4))

    bin_centers = 0.5 * (dist['bin_edges'][1:] + dist['bin_edges'][:-1])
    ax.bar(bin_centers, dist['file_count'], width=(dist['bin_edges'][1] - dist['bin_edges'][0]),
           alpha=0.7, color='steelblue', edgecolor='black', linewidth=1)

    # Add file count annotations
    for i, count in enumerate(dist['file_count']):
        if count > 0:  # Only annotate non-empty bins
            ax.text(bin_centers[i], count + 0.5, f"{count}",
                    ha='center', va='bottom', fontsize=8)

    ax.set_xlabel('Value')
    ax.set_ylabel('Count')
    ax.set_title(f'{title} Histogram')
    ax.grid(True, linestyle='--', alpha=0.7)

    plt.tight_layout()
    plt.show()
[12]:
import time
start_time = time.time()
face_dist = explorer.create_distribution(key="machining_category", bins=None, group="machining")
print(f"Machining distribution created in {(time.time() - start_time):.2f} seconds\n")
print_distribution_info(face_dist, title="Machining Category")
Machining distribution created in 2.01 seconds

[Output image: "Machining Category Histogram" bar chart]

Dataset Visualization with DatasetViewer

The DatasetViewer is a powerful visualization tool that bridges dataset queries and visual analysis. It enables you to quickly visualize query results in two ways:

  1. Image Grids: Generate collages of PNG previews for rapid visual scanning

  2. Interactive 3D Views: Open inline 3D viewers for detailed model inspection

[13]:
# Import the DatasetViewer from the insights module
from hoops_ai.insights import DatasetViewer

# Create a DatasetViewer using the convenience method from_explorer
# This method queries the explorer and builds the file ID to visualization path mappings
dataset_viewer = DatasetViewer.from_explorer(explorer)
2026-01-29 15:30:36 | INFO | hoops_ai.insights.dataset_viewer | Initialized process pool with 4 workers
2026-01-29 15:30:36 | INFO | hoops_ai.insights.dataset_viewer | DatasetViewer initialized with reference directory: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\minimal_manufacturing_flow
2026-01-29 15:30:36 | INFO | hoops_ai.insights.dataset_viewer | Built file mapping for 101 files

[14]:
start_time = time.time()

# Filter condition: keep parts whose material_type equals 2 ("Aluminum")
is_aluminum = lambda ds: ds['material_type'] == 2

filelist = explorer.get_file_list(group="machining", where=is_aluminum)
print(f"Filtering completed in {(time.time() - start_time):.2f} seconds")
print(filelist)
Filtering completed in 0.14 seconds
[ 2  9 18 28 30 36 44 48 55 66 76 83 84 85 89 91 94 97 99]

Example 1: Visualize Query Results as Image Grid

Now let’s use the query results we obtained earlier and visualize them as a grid of images. This is perfect for quickly scanning through many files to understand patterns or identify specific cases.

[15]:
# Visualize the filtered files as an image grid with file IDs as labels
fig = dataset_viewer.show_preview_as_image(
    filelist,
    k=len(filelist),           # Show all matching files
    grid_cols=8,               # Arrange previews in 8 columns
    label_format='id',         # Show file IDs as labels
    figsize=(15, 5)            # Figure size in inches
)

plt.show()
[Output image: image grid of the filtered files with file ID labels]