Train a Shape Embedding Model

> Purpose: This document is for data scientists who want to train custom HOOPS embedding models.

> For production use (computing embeddings and similarity search), see the Production workflow.

Overview

The EmbeddingFlowModel is a specialized FlowModel implementation designed for training shape embeddings from CAD data using contrastive learning. This model enables data scientists to train custom embedding models on their own CAD datasets.

Training → Production Workflow

  1. Train a custom model using EmbeddingFlowModel + FlowTrainer (this document)

  2. Register the trained model with HOOPSEmbeddings.register_model()

  3. Deploy for production use via HOOPSEmbeddings API (see production guide)

When to Train Custom Models

  • Your CAD parts have unique geometric characteristics not captured by pre-trained models

  • You need domain-specific embeddings (e.g., specific industry, manufacturing process)

  • You have a large proprietary dataset to learn from

  • You want to optimize embedding dimensions for your use case

Note: HOOPS AI provides a pre-trained model (e.g., ts3d_scl_dual_v1) that can be used directly; see the production guide for how to use it. It was trained on a large dataset of nearly 1M parts from public datasets (ABC, fabwave, etc.).

Key Features

  • Contrastive Learning: Learns shape representations by distinguishing between similar and dissimilar CAD geometries

  • Flexible Architecture: Configurable embedding dimensions, projection layers, and training parameters

  • Unsupervised Training: No labels required per CAD file - learns from geometric structure alone

import hoops_ai
from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

# Set license
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

# Create embedding model with custom parameters
embedding_model = EmbeddingFlowModel(
    result_dir=str(flow_root_dir),
    emb_dim=1024,           # Embedding dimension
    lr=3e-4,                # Learning rate
    use_bn=True,            # Enable batch normalization
    temp_init=0.05,         # Initial temperature
    temp_min=0.01,          # Minimum temperature
    temp_max=0.20,          # Maximum temperature
)

Constructor Parameters

Essential Training Parameters

emb_dim (int, default: 1024)

The dimensionality of the learned embeddings. This determines the size of the vector representation for each CAD shape.

  • Higher dimensions can capture more detailed features but increase computational cost

  • Typical values: 512, 1024, 2048
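To make the cost trade-off concrete, the sketch below estimates the raw storage needed for a collection of embeddings. It assumes each component is stored as a 32-bit float (4 bytes), which is an assumption and not something this document states about EmbeddingFlowModel. The helper name embedding_storage_bytes is hypothetical.

```python
def embedding_storage_bytes(num_parts: int, emb_dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of num_parts embeddings with emb_dim values each (no index overhead)."""
    return num_parts * emb_dim * bytes_per_value


# Compare the typical dimensions for a hypothetical collection of 1,000,000 parts.
# bytes_per_value=4 assumes float32 storage.
for dim in (512, 1024, 2048):
    size_gib = embedding_storage_bytes(1_000_000, dim) / 2**30
    print(f"emb_dim={dim}: {size_gib:.2f} GiB")
```

Under these assumptions, storage doubles with each step up in dimension: roughly 1.9, 3.8 and 7.6 GiB for one million parts.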

lr (float, default: 3e-4)

Learning rate for the optimizer during training.

  • Controls the step size for gradient descent updates

  • May need adjustment based on batch size and dataset characteristics

use_bn (bool, default: True)

Enable batch normalization in the model architecture.

  • Helps stabilize training and can improve convergence

  • Recommended to keep enabled for most use cases

Temperature Parameters

These parameters control the contrastive loss function’s sensitivity to similarities:

temp_init (float, default: 0.05)

Initial temperature value for the contrastive loss.

  • Lower values make the model more discriminative

  • Higher values create softer similarities

temp_min (float, default: 0.01)

Minimum allowed temperature during training.

temp_max (float, default: 0.20)

Maximum allowed temperature during training.
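The effect of temperature can be illustrated with a generic softmax over cosine similarities, as used in InfoNCE-style contrastive losses. The exact loss inside EmbeddingFlowModel is not described here, so treat this as an illustration only. The helper names clamp_temperature and softmax_with_temperature are hypothetical, and the clamp merely mirrors the documented temp_min and temp_max bounds.

```python
import math
from typing import List


def clamp_temperature(temp: float, temp_min: float = 0.01, temp_max: float = 0.20) -> float:
    """Keep a temperature inside [temp_min, temp_max]."""
    return max(temp_min, min(temp, temp_max))


def softmax_with_temperature(similarities: List[float], temp: float) -> List[float]:
    """Softmax over similarities / temp, computed stably by subtracting the max logit."""
    logits = [s / temp for s in similarities]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up cosine similarities of one anchor to a positive (first) and two negatives.
sims = [0.9, 0.7, 0.1]
for temp in (0.01, 0.05, 0.20):
    probs = softmax_with_temperature(sims, clamp_temperature(temp))
    print(temp, [round(p, 3) for p in probs])
```

At the lowest temperature almost all probability mass goes to the most similar item. At the highest, the distribution is visibly softer, which matches the "more discriminative" versus "softer similarities" description above.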

Other Notable Parameters

proj_dim (int, default: 512)

Dimensionality of the projection head used in contrastive learning.

result_dir (str, optional)

Directory where training results, logs, and metrics will be saved.

Data Processing Pipeline

Training an EmbeddingFlowModel requires a preprocessing pipeline that gathers CAD files and extracts the CAD data needed for training. Here’s a complete example using the FlowManager decorators:

Task 1 - Extract: Uses @flowtask.extract to gather CAD files from local storage using CADFileRetriever. Supports multiple CAD formats and parallel processing.

Task 2 - Prepare data for Embeddings Training: Uses the @flowtask.transform decorator, which automatically initializes and provides an optimized data storage and handles the files in parallel.

import hoops_ai
from typing import List
from pathlib import Path
from hoops_ai.flowmanager import flowtask
from hoops_ai.storage.cadfile_retriever import CADFileRetriever, LocalStorageProvider
from hoops_ai.storage.datastorage import DataStorage
from hoops_ai.cadaccess import HOOPSLoader


# TASK 1: Extract - Gather CAD files from source directories
@flowtask.extract(
    name="gather cad files",
    inputs=["cad_datasources"],
    outputs=["cad_dataset"],
    parallel_execution=True
)
def gather_cad_files(source: str) -> List[str]:
    """Gather CAD files from source directory"""
    retriever = CADFileRetriever(
        storage_provider=LocalStorageProvider(directory_path=source),
        formats=[".stp", ".step", ".iges", ".igs", ".cadpart"],
    )
    return retriever.get_file_list()

# TASK 2: Prepare data - Encode each CAD file and convert it to a graph for training
@flowtask.transform(
    name="Preparing data for Embedding Model training",
    inputs=["cad_dataset"],
    outputs=["cad_files_encoded"],
    parallel_execution=True
)
def prepare_data_embeddings_training(cad_file: str, cad_loader: HOOPSLoader, storage: DataStorage) -> str:
    """Encode one CAD file and store the data needed for embedding model training."""

    facecount, edgecount = embedding_model.encode_cad_data(cad_file, cad_loader, storage)

    # DGLGraphStoreHandler, generate_unique_id_from_path, flows_outputdir and flow_name
    # are assumed to be imported or defined elsewhere in your script.
    dgl_storage = DGLGraphStoreHandler()

    # Output path for the DGL graph binary file
    item_no_suffix = Path(cad_file).with_suffix("")  # Remove the suffix to get the base name
    hash_id = generate_unique_id_from_path(str(item_no_suffix))
    dgl_output_path = Path(flows_outputdir).joinpath("flows", flow_name, "dgl", f"{hash_id}.ml")
    dgl_output_path.parent.mkdir(parents=True, exist_ok=True)

    # Use the same model instance that encoded the CAD data above
    embedding_model.convert_encoded_data_to_graph(storage, dgl_storage, str(dgl_output_path))

    # Compress the storage into a .data file
    storage.compress_store()

    # Return the base storage path
    return storage.get_file_path("")
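The path handling in the task above can be tried in isolation. The sketch below uses a hypothetical stand-in, unique_id_from_path, built on hashlib. It is not the library's generate_unique_id_from_path, whose implementation is not shown in this document. The directory layout mirrors the flows/<flow_name>/dgl/<id>.ml pattern used above, and graph_output_path is likewise a name chosen for illustration.

```python
import hashlib
from pathlib import Path


def unique_id_from_path(path: str, length: int = 16) -> str:
    """Hypothetical stand-in: a stable hex id derived from the path string."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()[:length]


def graph_output_path(cad_file: str, flows_outputdir: str, flow_name: str) -> Path:
    """Build <flows_outputdir>/flows/<flow_name>/dgl/<id>.ml for one CAD file."""
    stem_path = Path(cad_file).with_suffix("")  # drop the file extension
    file_id = unique_id_from_path(str(stem_path))
    return Path(flows_outputdir) / "flows" / flow_name / "dgl" / f"{file_id}.ml"


out = graph_output_path("parts/bracket.step", "./etl_embedding_flow", "my_flow")
print(out)
```

Because the extension is stripped before hashing, bracket.step and bracket.stp in the same folder map to the same id in this sketch. If both variants can occur in a dataset, hashing the full file name avoids one graph file overwriting the other.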

Flow Execution: Creates a flow combining both tasks with parallel execution support. The pipeline outputs encoded CAD data ready for training.

import hoops_ai

# Create flow with both tasks
flow = hoops_ai.create_flow(
    name="ETL for embeddings training",
    tasks=[gather_cad_files, prepare_data_embeddings_training],
    max_workers=24,
    flows_outputdir="./etl_embedding_flow",
)

# Execute flow
results = flow.process(inputs={"cad_datasources": [cad_source]})

The flow outputs the .dataset and .ml files, both of which are needed for data exploration and training.

Usage with FlowTrainer

The EmbeddingFlowModel is designed to work seamlessly with the FlowTrainer class for training shape embeddings.

Basic Training Example

import pathlib
import hoops_ai
from hoops_ai.dataset import DatasetLoader
from hoops_ai.ml.EXPERIMENTAL import FlowTrainer


# Set license
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

# Load pre-processed dataset
flow_name = "etl_embedding_flow"
flow_root_dir = pathlib.Path("flows") / flow_name

dataset_path = str(flow_root_dir / f"{flow_name}.dataset")
info_path = str(flow_root_dir / f"{flow_name}.infoset")

# Load and split dataset
cadflowdataset = DatasetLoader(
    merged_store_path=dataset_path,
    parquet_file_path=info_path
)
cadflowdataset.split(
    key='face_types',
    group="faces",
    train=0.8,
    validation=0.1,
    test=0.1
)

# Create and configure trainer
flow_trainer = FlowTrainer(
    flowmodel=embedding_model,
    datasetLoader=cadflowdataset,
    experiment_name="shape_embeddings_training",
    result_dir=str(flow_root_dir),
    accelerator='cuda',     # Use 'cpu' if no GPU available
    max_epochs=50,
    batch_size=64
)

# Train the model
trained_model_path = flow_trainer.train()
print(f"Training complete. Model saved at: {trained_model_path}")
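How DatasetLoader.split assigns files internally is not described here. The sketch below only shows how 0.8/0.1/0.1 ratios translate into subset sizes, using a seeded shuffle with no stratification. The function ratio_split and the file names are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def ratio_split(items: Sequence[str], train: float, validation: float, test: float,
                seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle items reproducibly and cut them into three subsets by ratio."""
    if abs(train + validation + test - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder, absorbs any rounding
    }


# Made-up file names standing in for an encoded dataset of 1000 parts.
files = [f"part_{i:04d}.step" for i in range(1000)]
subsets = ratio_split(files, train=0.8, validation=0.1, test=0.1)
print({name: len(members) for name, members in subsets.items()})
```

Fixing the seed keeps the three subsets identical across runs, which makes validation metrics comparable between experiments.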

Usage with FlowInference

The FlowInference class enables you to use trained embedding models for inference on new CAD files.

Basic Inference Example

import hoops_ai
from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.ml.EXPERIMENTAL.flow_inference import FlowInference
from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

# Create inference handler (embedding_model is the EmbeddingFlowModel instance configured earlier)
inference = FlowInference(
    cad_loader=HOOPSLoader(),
    flowmodel=embedding_model
)

# Load trained model checkpoint
checkpoint_path = "flows/etl_embedding_flow/ml_output/.../best.ckpt"
inference.load_from_checkpoint(checkpoint_path)

# Process a CAD file
cad_file_path = "path/to/your/model.step"
batch = inference.preprocess(cad_file_path)

# Get embeddings
embeddings = inference.predict_and_postprocess(batch)
print(f"Shape embedding: {embeddings.shape}")
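The run-specific part of the checkpoint path above is elided. One way to avoid hard-coding it is to search for checkpoint files below the output folder and take the most recently modified one. The sketch assumes checkpoints are named best.ckpt and live somewhere under an ml_output directory, as in the paths shown in this document. The helper find_latest_checkpoint is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(root: str, filename: str = "best.ckpt") -> Optional[Path]:
    """Return the most recently modified `filename` under root, or None if absent."""
    base = Path(root)
    if not base.is_dir():
        return None
    candidates = list(base.rglob(filename))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# The folder below follows the example paths in this document; adjust it to your flow.
checkpoint = find_latest_checkpoint("flows/etl_embedding_flow/ml_output")
print(checkpoint if checkpoint else "no checkpoint found yet")
```

Selecting by modification time is a convenience for interactive work. For reproducible deployments, recording the exact path returned by the training run is the safer choice.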

Registering Your Trained Model for Production

Once training is complete, register your custom model with HOOPSEmbeddings to use it in production:

from hoops_ai.ml.embeddings import HOOPSEmbeddings

# Register your trained model
HOOPSEmbeddings.register_model(
    model_name="my_custom_embeddings_v1",
    checkpoint_path="flows/my_embedding_flow/ml_output/0107/143022/best.ckpt"
)

# Now use it in production (see embeddings and retrieval guide)
embedder = HOOPSEmbeddings(model="my_custom_embeddings_v1", device="cpu")

# Compute embeddings for new CAD files
embedding = embedder.embed_shape("path/to/new_part.step")

# Or batch process
batch_embeddings = embedder.embed_shape_batch(
    cad_path_list=["part1.step", "part2.step", "part3.step"],
    max_workers=4
)
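As a quick sanity check on a newly registered model, embeddings of parts you know to be similar can be compared directly. The sketch below assumes each embedding can be treated as a flat sequence of floats. The return type of embed_shape is not specified in this document, so a conversion step may be needed. The vectors are made-up toy values, not model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real embeddings.
query = [0.2, 0.1, 0.9, 0.4]
candidates = {"part1": [0.2, 0.1, 0.8, 0.5], "part2": [0.9, 0.3, 0.1, 0.0]}
ranked = sorted(candidates, key=lambda name: cosine_similarity(query, candidates[name]),
                reverse=True)
print(ranked)
```

A pairwise loop like this is enough for spot checks on a handful of parts. Larger collections are better served by the indexing approach covered in the production guide.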

Next Steps: See the Production workflow for:

  • Using your registered model for similarity search

  • Indexing embeddings in vector databases

  • Querying for similar parts in production