# Train a Shape Embedding Model
> Purpose: This document is for data scientists who want to train custom HOOPS Embedding models.
> For production use (computing embeddings and similarity search), see the Production workflow.
## Overview

The `EmbeddingFlowModel` is a specialized `FlowModel` implementation designed for training shape embeddings from CAD data using contrastive learning. This model enables data scientists to train custom embedding models on their own CAD datasets.
## Training → Production Workflow

1. Train a custom model using `EmbeddingFlowModel` + `FlowTrainer` (this document)
2. Register the trained model with `HOOPSEmbeddings.register_model()`
3. Deploy for production use via the `HOOPSEmbeddings` API (see the production guide)
## When to Train Custom Models

- Your CAD parts have unique geometric characteristics not captured by pre-trained models
- You need domain-specific embeddings (e.g., a specific industry or manufacturing process)
- You have a large proprietary dataset to learn from
- You want to optimize embedding dimensions for your use case
> Note: HOOPS AI provides a pre-trained model (e.g., `ts3d_scl_dual_v1`) that can be used directly; see the production guide for how to use it. It was trained on a dataset of nearly 1M parts drawn from public sources (ABC, Fabwave, etc.).
## Key Features

- **Contrastive Learning**: Learns shape representations by distinguishing between similar and dissimilar CAD geometries
- **Flexible Architecture**: Configurable embedding dimensions, projection layers, and training parameters
- **Unsupervised Training**: No per-file labels required - the model learns from geometric structure alone
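For intuition, the contrastive objective can be sketched as an NT-Xent-style loss: embeddings of two views of the same part (`z_a[i]`, `z_b[i]`) are pulled together, while all other pairings act as negatives. This is a standalone NumPy illustration with made-up names, not the library's implementation:

```python
import numpy as np

def info_nce_loss(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.05) -> float:
    """NT-Xent-style contrastive loss over two batches of embeddings.

    z_a[i] and z_b[i] are embeddings of two views of the same CAD part;
    every other pair (i, j != i) is treated as a negative.
    """
    # L2-normalize so dot products become cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature          # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; minimize their negative log-probability
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 32))
# Two nearly identical "views" give a near-zero loss
loss = info_nce_loss(z_a, z_a + 0.01 * rng.normal(size=(8, 32)))
print(f"loss for aligned views: {loss:.4f}")
```

Training drives this loss down, which is what forces geometrically similar parts to land near each other in embedding space.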
```python
import hoops_ai
from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

# Set license
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

# Create embedding model with custom parameters
# (flow_root_dir is the pathlib.Path of your flow's output directory)
embedding_model = EmbeddingFlowModel(
    result_dir=str(flow_root_dir),
    emb_dim=1024,     # Embedding dimension
    lr=3e-4,          # Learning rate
    use_bn=True,      # Enable batch normalization
    temp_init=0.05,   # Initial temperature
    temp_min=0.01,    # Minimum temperature
    temp_max=0.20,    # Maximum temperature
)
```
## Constructor Parameters

### Essential Training Parameters
**emb_dim** (int, default: 1024)

The dimensionality of the learned embeddings. This determines the size of the vector representation for each CAD shape.

- Higher dimensions can capture more detailed features but increase computational cost
- Typical values: 512, 1024, 2048
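To put the cost side of this trade-off in concrete terms: a float32 embedding occupies `emb_dim * 4` bytes, so the size of a stored embedding index grows linearly with the dimension. Illustrative arithmetic only:

```python
# Approximate float32 index size for different embedding dimensions,
# assuming a hypothetical index of one million parts
num_parts = 1_000_000
for emb_dim in (512, 1024, 2048):
    size_gb = num_parts * emb_dim * 4 / 1024**3  # bytes -> GiB
    print(f"emb_dim={emb_dim}: ~{size_gb:.1f} GB")
```

Doubling `emb_dim` doubles both storage and per-query distance-computation cost, so the smallest dimension that retains retrieval quality is usually preferable.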
**lr** (float, default: 3e-4)

Learning rate for the optimizer during training.

- Controls the step size for gradient descent updates
- May need adjustment based on batch size and dataset characteristics
**use_bn** (bool, default: True)

Enable batch normalization in the model architecture.

- Helps stabilize training and can improve convergence
- Recommended to keep enabled for most use cases
### Temperature Parameters
These parameters control the contrastive loss function’s sensitivity to similarities:
**temp_init** (float, default: 0.05)

Initial temperature value for the contrastive loss.

- Lower values make the model more discriminative
- Higher values create softer similarities
**temp_min** (float, default: 0.01)

Minimum allowed temperature during training.

**temp_max** (float, default: 0.20)

Maximum allowed temperature during training.
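To see why lower temperatures are more discriminative, compare the softmax over the same cosine similarities at the two ends of the allowed range. This is a standalone NumPy illustration of the general mechanism, not library code:

```python
import numpy as np

def similarity_softmax(sims: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over cosine similarities, as used inside a contrastive loss."""
    logits = sims / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

sims = np.array([0.9, 0.6, 0.3])  # one positive, two negatives

sharp = similarity_softmax(sims, temperature=0.05)  # temp_min side
soft = similarity_softmax(sims, temperature=0.20)   # temp_max side

# Lower temperature concentrates almost all probability on the closest match
print(sharp.round(3))  # sharp ≈ [0.998, 0.002, 0.000]
print(soft.round(3))
```

At `0.05` nearly all probability mass lands on the best match, while at `0.20` the distribution stays noticeably softer, which is the "more discriminative" vs. "softer similarities" behavior described above.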
### Other Notable Parameters
**proj_dim** (int, default: 512)

Dimensionality of the projection head used in contrastive learning.
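For intuition, a contrastive projection head is typically a small MLP that maps the `emb_dim`-sized embedding down to a `proj_dim`-sized vector used only inside the loss; the `emb_dim` embedding is what you keep for retrieval. The following is a generic NumPy sketch of that pattern with made-up weights, not the model's actual layers:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, proj_dim, hidden = 1024, 512, 1024

# Random weights standing in for a trained two-layer projection head
W1 = rng.normal(scale=emb_dim ** -0.5, size=(emb_dim, hidden))
W2 = rng.normal(scale=hidden ** -0.5, size=(hidden, proj_dim))

def project(embedding: np.ndarray) -> np.ndarray:
    """Map an embedding into the space where the contrastive loss is computed."""
    h = np.maximum(embedding @ W1, 0.0)  # linear + ReLU
    z = h @ W2                           # linear down to proj_dim
    return z / np.linalg.norm(z)         # L2-normalize for cosine similarity

z = project(rng.normal(size=emb_dim))
print(z.shape)  # (512,)
```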
**result_dir** (str, optional)

Directory where training results, logs, and metrics will be saved.
## Data Processing Pipeline

Training an `EmbeddingFlowModel` requires a preprocessing pipeline that gathers CAD files and extracts the CAD data needed for training. Here's a complete example using the `FlowManager` decorators:
- **Task 1 - Extract**: Uses `@flowtask.extract` to gather CAD files from local storage using `CADFileRetriever`. Supports multiple CAD formats and parallel processing.
- **Task 2 - Prepare data for embeddings training**: Uses the `@flowtask.transform` decorator, which automatically initializes and provides optimized data storage and parallel handling of the files.
```python
import pathlib
from typing import List

import hoops_ai
from hoops_ai.flowmanager import flowtask
from hoops_ai.storage.cadfile_retriever import CADFileRetriever, LocalStorageProvider
from hoops_ai.storage.datastorage import DataStorage
from hoops_ai.cadaccess import HOOPSLoader

# Assumes embedding_model (created above), flows_outputdir, flow_name,
# DGLGraphStoreHandler, and generate_unique_id_from_path are defined/imported
# elsewhere in your project.

# TASK 1: Extract - Gather CAD files from source directories
@flowtask.extract(
    name="gather cad files",
    inputs=["cad_datasources"],
    outputs=["cad_dataset"],
    parallel_execution=True
)
def gather_cad_files(source: str) -> List[str]:
    """Gather CAD files from a source directory."""
    retriever = CADFileRetriever(
        storage_provider=LocalStorageProvider(directory_path=source),
        formats=[".stp", ".step", ".iges", ".igs", ".cadpart"],
    )
    return retriever.get_file_list()

# TASK 2: Transform - Encode CAD files into training-ready graph data
@flowtask.transform(
    name="Preparing data for Embedding Model training",
    inputs=["cad_dataset"],
    outputs=["cad_files_encoded"],
    parallel_execution=True
)
def prepare_data_embeddings_training(cad_file: str, cad_loader: HOOPSLoader, storage: DataStorage) -> str:
    """Prepare data for exploration and embedding-model training."""
    facecount, edgecount = embedding_model.encode_cad_data(cad_file, cad_loader, storage)
    dgl_storage = DGLGraphStoreHandler()
    # DGL graph .ml file: strip the suffix to get the base name, then hash it
    item_no_suffix = pathlib.Path(cad_file).with_suffix("")
    hash_id = generate_unique_id_from_path(str(item_no_suffix))
    dgl_output_path = pathlib.Path(flows_outputdir).joinpath("flows", flow_name, "dgl", f"{hash_id}.ml")
    dgl_output_path.parent.mkdir(parents=True, exist_ok=True)
    embedding_model.convert_encoded_data_to_graph(storage, dgl_storage, str(dgl_output_path))
    # Compress the storage into a .data file
    storage.compress_store()
    # Return the base storage path
    return storage.get_file_path("")
```
**Flow Execution**: Creates a flow combining both tasks with parallel execution support. The pipeline outputs encoded CAD data ready for training.
```python
import hoops_ai

# Create flow with both tasks
flow = hoops_ai.create_flow(
    name="ETL for embeddings training",
    tasks=[gather_cad_files, prepare_data_embeddings_training],
    max_workers=24,
    flows_outputdir="./etl_embedding_flow",
)

# Execute flow (cad_source is the directory containing your CAD files)
results = flow.process(inputs={"cad_datasources": [cad_source]})
```
The flow outputs the `.dataset` and `.ml` files, both of which are needed for data exploration and training.
## Usage with FlowTrainer

The `EmbeddingFlowModel` is designed to work seamlessly with the `FlowTrainer` class for training shape embeddings.

### Basic Training Example
```python
import pathlib
import hoops_ai
from hoops_ai.dataset import DatasetLoader
from hoops_ai.ml.EXPERIMENTAL import FlowTrainer

# Set license
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

# Load pre-processed dataset
flow_name = "etl_embedding_flow"
flow_root_dir = pathlib.Path("flows") / flow_name
dataset_path = str(flow_root_dir / f"{flow_name}.dataset")
info_path = str(flow_root_dir / f"{flow_name}.infoset")

# Load and split dataset
cadflowdataset = DatasetLoader(
    merged_store_path=dataset_path,
    parquet_file_path=info_path
)
cadflowdataset.split(
    key='face_types',
    group="faces",
    train=0.8,
    validation=0.1,
    test=0.1
)

# Create and configure trainer
flow_trainer = FlowTrainer(
    flowmodel=embedding_model,
    datasetLoader=cadflowdataset,
    experiment_name="shape_embeddings_training",
    result_dir=str(flow_root_dir),
    accelerator='cuda',  # Use 'cpu' if no GPU available
    max_epochs=50,
    batch_size=64
)

# Train the model
trained_model_path = flow_trainer.train()
print(f"Training complete. Model saved at: {trained_model_path}")
```
## Usage with FlowInference

The `FlowInference` class enables you to use trained embedding models for inference on new CAD files.

### Basic Inference Example
```python
import hoops_ai
from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.ml.EXPERIMENTAL.flow_inference import FlowInference
from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

# Create inference handler
inference = FlowInference(
    cad_loader=HOOPSLoader(),
    flowmodel=embedding_model
)

# Load trained model checkpoint
checkpoint_path = "flows/etl_embedding_flow/ml_output/.../best.ckpt"
inference.load_from_checkpoint(checkpoint_path)

# Process a CAD file
cad_file_path = "path/to/your/model.step"
batch = inference.preprocess(cad_file_path)

# Get embeddings
embeddings = inference.predict_and_postprocess(batch)
print(f"Shape embedding: {embeddings.shape}")
```
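Embeddings produced this way are typically compared with cosine similarity: similar shapes yield vectors that point in nearly the same direction. A minimal sketch, using made-up low-dimensional vectors in place of real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two shape embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for embeddings of three parts (real vectors come from the model)
part_a = np.array([0.2, 0.9, 0.1])
part_b = np.array([0.25, 0.85, 0.05])  # geometrically similar to part_a
part_c = np.array([-0.7, 0.1, 0.9])    # dissimilar

print(cosine_similarity(part_a, part_b))  # close to 1.0
print(cosine_similarity(part_a, part_c))  # much lower
```

In production this comparison is usually delegated to a vector index rather than computed pairwise; see the production guide.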
## Registering Your Trained Model for Production

Once training is complete, register your custom model with `HOOPSEmbeddings` to use it in production:
```python
from hoops_ai.ml.embeddings import HOOPSEmbeddings

# Register your trained model
HOOPSEmbeddings.register_model(
    model_name="my_custom_embeddings_v1",
    checkpoint_path="flows/my_embedding_flow/ml_output/0107/143022/best.ckpt"
)

# Now use it in production (see the embeddings and retrieval guide)
embedder = HOOPSEmbeddings(model="my_custom_embeddings_v1", device="cpu")

# Compute embeddings for new CAD files
embedding = embedder.embed_shape("path/to/new_part.step")

# Or batch process
batch_embeddings = embedder.embed_shape_batch(
    cad_path_list=["part1.step", "part2.step", "part3.step"],
    max_workers=4
)
```
**Next Steps**: See the Production workflow for:

- Using your registered model for similarity search
- Indexing embeddings in vector databases
- Querying for similar parts in production