Train a Shape Embedding Model

> Purpose: This document is for data scientists who want to train custom HOOPS Embedding models.

> For production use (computing embeddings and similarity search), see the Production workflow.

Overview 

The EmbeddingFlowModel is a specialized FlowModel implementation designed for training shape embeddings from CAD data using contrastive learning. This model enables data scientists to train custom embedding models on their own CAD datasets.

Training → Production Workflow

1. Train a custom model using EmbeddingFlowModel + FlowTrainer (this document)
2. Register the trained model with HOOPSEmbeddings.register_model()
3. Deploy for production use via the HOOPSEmbeddings API (see production guide)

When to Train Custom Models

- Your CAD parts have unique geometric characteristics not captured by pre-trained models
- You need domain-specific embeddings (e.g., specific industry, manufacturing process)
- You have a large proprietary dataset to learn from
- You want to optimize embedding dimensions for your use case

Note: HOOPS AI provides a pre-trained model (e.g., ts3d_scl_dual_v1) that can be used directly; see the production guide for how to use it. It was trained on a large dataset of nearly 1M parts from public datasets (ABC, fabwave, etc.).

Key Features 

- Contrastive Learning: Learns shape representations by distinguishing between similar and dissimilar CAD geometries
- Flexible Architecture: Configurable embedding dimensions, projection layers, and training parameters
- Unsupervised Training: No per-file labels required; the model learns from geometric structure alone

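import pathlib

import hoops_ai
from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

# Set license
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

# Directory for training results (same layout as the training example in this document)
flow_name = "etl_embedding_flow"
flow_root_dir = pathlib.Path("flows") / flow_name

# Create embedding model with custom parameters
embedding_model = EmbeddingFlowModel(
    result_dir=str(flow_root_dir),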
    emb_dim=1024,           # Embedding dimension
    lr=3e-4,                # Learning rate
    use_bn=True,            # Enable batch normalization
    temp_init=0.05,         # Initial temperature
    temp_min=0.01,          # Minimum temperature
    temp_max=0.20,          # Maximum temperature
)
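
To make the contrastive learning idea listed under Key Features concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss. It is an illustration under assumptions, not the loss implemented inside EmbeddingFlowModel: the source does not say which contrastive loss the model uses, and the helper names (l2_normalize, info_nce_loss) and the toy 2-D vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss; positives[i] is the matching view of anchors[i],
    and every other positive in the batch acts as a negative."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive resembles its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive resembles the other anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each shape is closest to its own matching view and large when the pairing is wrong, which is the signal that lets training proceed without labels.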

Constructor Parameters 

Essential Training Parameters

emb_dim (int, default: 1024)

The dimensionality of the learned embeddings. This determines the size of the vector representation for each CAD shape.

- Higher dimensions can capture more detailed features but increase computational cost
- Typical values: 512, 1024, 2048
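
One part of that cost is easy to estimate: the space needed to store the embeddings. The sketch below assumes each value is stored as a 4-byte float32, which the source does not confirm, and the function name embedding_storage_mib is hypothetical.

```python
def embedding_storage_mib(num_shapes, emb_dim, bytes_per_value=4):
    """Raw storage for num_shapes embeddings, in MiB (assumes float32 by default)."""
    return num_shapes * emb_dim * bytes_per_value / (1024 ** 2)


# Compare the typical dimensions for a catalogue of one million parts.
for dim in (512, 1024, 2048):
    print(dim, embedding_storage_mib(1_000_000, dim))
```

Storage grows linearly with emb_dim, so doubling the dimension doubles the footprint of any index built on the embeddings.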

lr (float, default: 3e-4)

Learning rate for the optimizer during training.

- Controls the step size for gradient descent updates
- May need adjustment based on batch size and dataset characteristics

use_bn (bool, default: True)

Enable batch normalization in the model architecture.

- Helps stabilize training and can improve convergence
- Recommended to keep enabled for most use cases

Temperature Parameters

These parameters control the contrastive loss function’s sensitivity to similarities:

temp_init (float, default: 0.05)

Initial temperature value for the contrastive loss.

- Lower values make the model more discriminative
- Higher values create softer similarities

temp_min (float, default: 0.01)

Minimum allowed temperature during training.

temp_max (float, default: 0.20)

Maximum allowed temperature during training.
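
The effect of the temperature can be seen with a small pure-Python sketch. It assumes the temperature divides the similarity scores before a softmax (the usual formulation in contrastive losses) and that the value is kept within [temp_min, temp_max] by clamping. How EmbeddingFlowModel actually updates the temperature is not documented here, and both function names are hypothetical.

```python
import math


def clamp_temperature(temp, temp_min=0.01, temp_max=0.20):
    """Keep the temperature inside the allowed range."""
    return max(temp_min, min(temp, temp_max))


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarity scores divided by the temperature."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


similarities = [0.9, 0.7, 0.1]  # toy cosine similarities to three candidates

sharp = softmax_with_temperature(similarities, clamp_temperature(0.001))  # clamped to temp_min
soft = softmax_with_temperature(similarities, clamp_temperature(5.0))     # clamped to temp_max
print(sharp)
print(soft)
```

At the lower bound almost all probability mass goes to the most similar candidate; at the upper bound the distribution is visibly softer.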

Other Notable Parameters

proj_dim (int, default: 512)

Dimensionality of the projection head used in contrastive learning.
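
In SimCLR-style contrastive setups, a projection head maps each embedding into a smaller space where the loss is computed. The sketch below shows only the shape change from emb_dim to proj_dim, using a single random linear layer followed by L2 normalization. The real head in EmbeddingFlowModel is not described in this document, so the layer structure, the toy dimensions, and the function names are all assumptions.

```python
import math
import random


def make_projection_weights(emb_dim, proj_dim, seed=0):
    """Random weights for one linear layer (proj_dim rows, emb_dim columns)."""
    rng = random.Random(seed)
    bound = 1.0 / math.sqrt(emb_dim)
    return [[rng.uniform(-bound, bound) for _ in range(emb_dim)] for _ in range(proj_dim)]


def project(embedding, weights):
    """Apply the linear layer, then L2-normalize the result."""
    projected = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    norm = math.sqrt(sum(v * v for v in projected))
    return [v / norm for v in projected]


emb_dim, proj_dim = 8, 4  # toy sizes; the documented defaults are 1024 and 512
weights = make_projection_weights(emb_dim, proj_dim)
embedding = [float(i + 1) for i in range(emb_dim)]
projection = project(embedding, weights)
print(len(projection))
```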

result_dir (str, optional)

Directory where training results, logs, and metrics will be saved.

Data Processing Pipeline 

Training an EmbeddingFlowModel requires a preprocessing pipeline that gathers CAD files and extracts the CAD data needed for training. Here’s a complete example using the FlowManager decorators:

Task 1 - Extract: Uses @flowtask.extract to gather CAD files from local storage using CADFileRetriever. Supports multiple CAD formats and parallel processing.

Task 2 - Prepare data for Embeddings Training: Uses the @flowtask.transform decorator, which automatically initializes and provides optimized data storage and handles the files in parallel.

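import pathlib
from typing import List

import hoops_ai
from hoops_ai.flowmanager import flowtask
from hoops_ai.storage.cadfile_retriever import CADFileRetriever, LocalStorageProvider
from hoops_ai.storage.datastorage import DataStorage
from hoops_ai.cadaccess import HOOPSLoader

# NOTE: this snippet also relies on names it does not define: DGLGraphStoreHandler and
# generate_unique_id_from_path (import them from your HOOPS AI installation), the
# embedding_model instance created earlier, and the flows_outputdir and flow_name settings.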


# TASK 1: Extract - Gather CAD files from source directories
@flowtask.extract(
    name="gather cad files",
    inputs=["cad_datasources"],
    outputs=["cad_dataset"],
    parallel_execution=True
)
def gather_cad_files(source: str) -> List[str]:
    """Gather CAD files from source directory"""
    retriever = CADFileRetriever(
        storage_provider=LocalStorageProvider(directory_path=source),
        formats=[".stp", ".step", ".iges", ".igs", ".cadpart"],
    )
    return retriever.get_file_list()

# TASK 2: Prepare data - Encode each CAD file and convert it to a graph for training
@flowtask.transform(
    name="Preparing data for Embedding Model training",
    inputs=["cad_dataset"],
    outputs=["cad_files_encoded"],
    parallel_execution=True
)
def prepare_data_embeddings_training(cad_file: str, cad_loader: HOOPSLoader, storage: DataStorage) -> str:
    """Encode one CAD file and store the data needed for embedding model training."""

    facecount, edgecount = embedding_model.encode_cad_data(cad_file, cad_loader, storage)

    dgl_storage = DGLGraphStoreHandler()

    # DGL graph Bin file
    item_no_suffix = pathlib.Path(cad_file).with_suffix("")  # Remove the suffix to get the base name
    hash_id = generate_unique_id_from_path(str(item_no_suffix))
    dgl_output_path = pathlib.Path(flows_outputdir).joinpath("flows", flow_name, "dgl", f"{hash_id}.ml")
    dgl_output_path.parent.mkdir(parents=True, exist_ok=True)

    # Convert the encoded data to a graph using the same model instance that encoded it
    embedding_model.convert_encoded_data_to_graph(storage, dgl_storage, str(dgl_output_path))

    # Compress the storage into a .data file
    storage.compress_store()

    # Return the base storage path
    return storage.get_file_path("")
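
The task above names each graph file after an ID derived from the CAD file path with its extension removed. The following sketch shows one way such an ID and output path could be built with the standard library. It is a hypothetical stand-in: the algorithm behind generate_unique_id_from_path is not documented here, and stable_id_from_path, graph_output_path, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import pathlib


def stable_id_from_path(cad_file, length=16):
    """Hash the path without its suffix into a short, repeatable hex ID."""
    stem_path = pathlib.PurePosixPath(cad_file).with_suffix("")
    digest = hashlib.sha256(stem_path.as_posix().encode("utf-8")).hexdigest()
    return digest[:length]


def graph_output_path(output_dir, flow_name, cad_file):
    """Mirror the <output_dir>/flows/<flow_name>/dgl/<id>.ml layout used above."""
    file_name = f"{stable_id_from_path(cad_file)}.ml"
    return pathlib.PurePosixPath(output_dir) / "flows" / flow_name / "dgl" / file_name


print(graph_output_path("out", "etl_embedding_flow", "parts/bracket.step"))
```

Because the suffix is stripped before hashing, two files that differ only by extension (for example bracket.step and bracket.stp) receive the same ID in this sketch, so their outputs would overwrite each other.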

Flow Execution: Creates a flow combining both tasks with parallel execution support. The pipeline outputs encoded CAD data ready for training.

import hoops_ai

# Create flow with both tasks
flow = hoops_ai.create_flow(
    name="ETL for embeddings training",
    tasks=[gather_cad_files, prepare_data_embeddings_training],
    max_workers=24,
    flows_outputdir="./etl_embedding_flow",
)

# Execute flow (cad_source is the path to your CAD source directory)
results = flow.process(inputs={"cad_datasources": [cad_source]})

The flow outputs the .dataset and .ml files, both of which are needed for data exploration and training.

Usage with FlowTrainer 

The EmbeddingFlowModel is designed to work seamlessly with the FlowTrainer class for training shape embeddings.

Basic Training Example

import pathlib
import hoops_ai
from hoops_ai.dataset import DatasetLoader
from hoops_ai.ml.EXPERIMENTAL import FlowTrainer


# Set license
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

# Load pre-processed dataset
flow_name = "etl_embedding_flow"
flow_root_dir = pathlib.Path("flows") / flow_name

dataset_path = str(flow_root_dir / f"{flow_name}.dataset")
info_path = str(flow_root_dir / f"{flow_name}.infoset")

# Load and split dataset
cadflowdataset = DatasetLoader(
    merged_store_path=dataset_path,
    parquet_file_path=info_path
)
cadflowdataset.split(
    key='face_types',
    group="faces",
    train=0.8,
    validation=0.1,
    test=0.1
)

# Create and configure trainer
flow_trainer = FlowTrainer(
    flowmodel=embedding_model,
    datasetLoader=cadflowdataset,
    experiment_name="shape_embeddings_training",
    result_dir=str(flow_root_dir),
    accelerator='cuda',     # Use 'cpu' if no GPU available
    max_epochs=50,
    batch_size=64
)

# Train the model
trained_model_path = flow_trainer.train()
print(f"Training complete. Model saved at: {trained_model_path}")
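
The 0.8 / 0.1 / 0.1 split in the example can be pictured with a plain shuffled index split. This sketch is not how DatasetLoader.split works internally: that method also takes a key and a group, which suggests it may balance the split on that key, whereas the hypothetical split_indices helper below ignores such information.

```python
import random


def split_indices(num_items, train=0.8, validation=0.1, test=0.1, seed=42):
    """Shuffle item indices and cut them into train/validation/test parts."""
    if abs(train + validation + test - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_train = int(num_items * train)
    n_val = int(num_items * validation)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],  # the remainder goes to the test set
    )


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```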

Usage with FlowInference 

The FlowInference class enables you to use trained embedding models for inference on new CAD files.

Basic Inference Example

import hoops_ai
from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.ml.EXPERIMENTAL.flow_inference import FlowInference
from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

# Create inference handler
inference = FlowInference(
    cad_loader=HOOPSLoader(),
    flowmodel=embedding_model
)

# Load trained model checkpoint
checkpoint_path = "flows/etl_embedding_flow/ml_output/.../best.ckpt"
inference.load_from_checkpoint(checkpoint_path)

# Process a CAD file
cad_file_path = "path/to/your/model.step"
batch = inference.preprocess(cad_file_path)

# Get embeddings
embeddings = inference.predict_and_postprocess(batch)
print(f"Shape embedding: {embeddings.shape}")
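
Once two shapes have embeddings, they are typically compared with cosine similarity. The sketch below uses plain Python lists as stand-ins for embedding vectors; the type returned by predict_and_postprocess is not specified in this document, so converting it to a flat list of floats is left as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-D vectors standing in for real shape embeddings.
part_a = [1.0, 2.0, 3.0]
part_b = [2.0, 4.0, 6.0]   # same direction as part_a, different magnitude
part_c = [3.0, 0.0, -1.0]  # orthogonal to part_a

print(cosine_similarity(part_a, part_b))
print(cosine_similarity(part_a, part_c))
```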

Registering Your Trained Model for Production 

Once training is complete, register your custom model with HOOPSEmbeddings to use it in production:

from hoops_ai.ml.embeddings import HOOPSEmbeddings

# Register your trained model
HOOPSEmbeddings.register_model(
    model_name="my_custom_embeddings_v1",
    checkpoint_path="flows/my_embedding_flow/ml_output/0107/143022/best.ckpt"
)

# Now use it in production (see embeddings and retrieval guide)
embedder = HOOPSEmbeddings(model="my_custom_embeddings_v1", device="cpu")

# Compute embeddings for new CAD files
embedding = embedder.embed_shape("path/to/new_part.step")

# Or batch process
batch_embeddings = embedder.embed_shape_batch(
    cad_path_list=["part1.step", "part2.step", "part3.step"],
    max_workers=4
)

Next Steps: See the Production workflow for:

- Using your registered model for similarity search
- Indexing embeddings in vector databases
- Querying for similar parts in production
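
As a rough preview of what similarity search does with the embeddings, here is a brute-force top-k lookup in pure Python. It is a sketch only: a production setup would normally use a vector database as described in the production guide, and the part names, toy vectors, and the top_k_similar helper are hypothetical.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k_similar(query, index, k=2):
    """Return the k (name, cosine similarity) pairs most similar to the query."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), name)
        for name, vec in index.items()
    )
    return [(name, score) for score, name in heapq.nlargest(k, scored)]


index = {
    "bracket_a": [1.0, 0.0, 0.0],
    "bracket_b": [0.9, 0.1, 0.0],
    "gear": [0.0, 0.0, 1.0],
}
matches = top_k_similar([1.0, 0.05, 0.0], index, k=2)
print(matches)
```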