#############################
Train a Shape Embedding Model
#############################

.. sidebar:: Table of Contents

   .. contents::
      :local:
      :depth: 1

> **Purpose**: This document is for **Data Scientists** who want to **train custom HOOPS Embedding models**.
> **For production use** (computing embeddings and similarity search), see the :doc:`Production workflow `.

Overview
========

The `EmbeddingFlowModel` is a specialized FlowModel implementation designed for **training** shape embeddings from CAD data using contrastive learning. It enables data scientists to train custom embedding models on their own CAD datasets.

Training → Production Workflow
-------------------------------

1. **Train** a custom model using `EmbeddingFlowModel` + `FlowTrainer` (this document)
2. **Register** the trained model with `HOOPSEmbeddings.register_model()`
3. **Deploy** for production use via the `HOOPSEmbeddings` API (:doc:`see production guide `)

When to Train Custom Models
---------------------------

- Your CAD parts have unique geometric characteristics not captured by pre-trained models
- You need domain-specific embeddings (e.g., for a specific industry or manufacturing process)
- You have a large proprietary dataset to learn from
- You want to optimize embedding dimensions for your use case

**Note**: HOOPS AI provides a pre-trained model (`ts3d_scl_dual_v1`) that can be used directly; it was trained on a large dataset of nearly 1M parts from **public datasets (ABC, Fabwave, etc.)**. See the :doc:`production guide ` for how to use it.

Key Features
============

- **Contrastive Learning**: Learns shape representations by distinguishing between similar and dissimilar CAD geometries
- **Flexible Architecture**: Configurable embedding dimensions, projection layers, and training parameters
- **Unsupervised Training**: No labels required per CAD file - learns from geometric structure alone

.. code-block:: python

   import hoops_ai
   from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

   # Set license
   hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

   # Create embedding model with custom parameters
   # (flow_root_dir is defined in the training workflow further below)
   embedding_model = EmbeddingFlowModel(
       result_dir=str(flow_root_dir),
       emb_dim=1024,        # Embedding dimension
       lr=3e-4,             # Learning rate
       use_bn=True,         # Enable batch normalization
       temp_init=0.05,      # Initial temperature
       temp_min=0.01,       # Minimum temperature
       temp_max=0.20,       # Maximum temperature
   )

Constructor Parameters
======================

Essential Training Parameters
-----------------------------

`emb_dim` (int, default: 1024)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The dimensionality of the learned embeddings. This determines the size of the vector representation for each CAD shape.

- Higher dimensions can capture more detailed features but increase computational cost
- Typical values: 512, 1024, 2048

`lr` (float, default: 3e-4)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Learning rate for the optimizer during training.

- Controls the step size for gradient descent updates
- May need adjustment based on batch size and dataset characteristics

`use_bn` (bool, default: True)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Enable batch normalization in the model architecture.

- Helps stabilize training and can improve convergence
- Recommended to keep enabled for most use cases

Temperature Parameters
----------------------

These parameters control the contrastive loss function's sensitivity to similarities; the short sketch below illustrates the effect of temperature before each parameter is described.
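
To build intuition for these settings, the following minimal sketch scales a vector of similarity scores by different temperatures before applying a softmax, which is how temperature enters standard contrastive losses (e.g., InfoNCE/NT-Xent). It is an illustration only, not the HOOPS AI loss implementation, and the similarity values are invented:

.. code-block:: python

   import numpy as np

   def softmax(x):
       """Numerically stable softmax."""
       e = np.exp(x - np.max(x))
       return e / e.sum()

   # Cosine similarities between an anchor shape and four candidate shapes
   # (illustrative values only)
   similarities = np.array([0.92, 0.85, 0.40, 0.10])

   for temperature in (0.05, 0.20):
       # Contrastive losses divide similarities by the temperature before the
       # softmax: a small temperature sharpens the distribution (more
       # discriminative), a large one softens it.
       probs = softmax(similarities / temperature)
       print(f"T={temperature:.2f} -> {np.round(probs, 3)}")

At `T=0.05` almost all of the probability mass lands on the closest candidate, while at `T=0.20` it is spread more evenly; `temp_min` and `temp_max` bound how sharp or soft the learned temperature is allowed to become during training.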

`temp_init` (float, default: 0.05)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initial temperature value for the contrastive loss.

- Lower values make the model more discriminative
- Higher values create softer similarities

`temp_min` (float, default: 0.01)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum allowed temperature during training.

`temp_max` (float, default: 0.20)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Maximum allowed temperature during training.

Other Notable Parameters
------------------------

`proj_dim` (int, default: 512)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dimensionality of the projection head used in contrastive learning.

`result_dir` (str, optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directory where training results, logs, and metrics will be saved.

Data Processing Pipeline
========================

`EmbeddingFlowModel` training requires a preprocessing pipeline that gathers CAD files and extracts the CAD data needed for training. Here is a complete example using the FlowManager decorators:

**Task 1 - Extract**: Uses `@flowtask.extract` to gather CAD files from local storage using `CADFileRetriever`. Supports multiple CAD formats and parallel processing.

**Task 2 - Prepare data for embeddings training**: Uses the `@flowtask.transform` decorator, which automatically provides an optimized data storage object and handles the files in parallel.

.. code-block:: python

   import pathlib
   from typing import List

   import hoops_ai
   from hoops_ai.flowmanager import flowtask
   from hoops_ai.storage.cadfile_retriever import CADFileRetriever, LocalStorageProvider
   from hoops_ai.storage.datastorage import DataStorage
   from hoops_ai.cadaccess import HOOPSLoader

   # NOTE: DGLGraphStoreHandler and generate_unique_id_from_path are assumed to be
   # available from the hoops_ai storage utilities (their imports are omitted here);
   # flows_outputdir and flow_name match the flow configuration shown below.

   # TASK 1: Extract - Gather CAD files from source directories
   @flowtask.extract(
       name="gather cad files",
       inputs=["cad_datasources"],
       outputs=["cad_dataset"],
       parallel_execution=True
   )
   def gather_cad_files(source: str) -> List[str]:
       """Gather CAD files from a source directory."""
       retriever = CADFileRetriever(
           storage_provider=LocalStorageProvider(directory_path=source),
           formats=[".stp", ".step", ".iges", ".igs", ".cadpart"],
       )
       return retriever.get_file_list()

   # TASK 2: Transform - Encode CAD files into training data for the embedding model
   @flowtask.transform(
       name="Preparing data for Embedding Model training",
       inputs=["cad_dataset"],
       outputs=["cad_files_encoded"],
       parallel_execution=True
   )
   def prepare_data_embeddings_training(cad_file: str, cad_loader: HOOPSLoader, storage: DataStorage) -> str:
       """Prepare the data needed for data exploration and embedding model training."""
       facecount, edgecount = embedding_model.encode_cad_data(cad_file, cad_loader, storage)

       # DGL graph bin file
       dgl_storage = DGLGraphStoreHandler()

       # Remove the suffix to get the base name
       item_no_suffix = pathlib.Path(cad_file).with_suffix("")
       hash_id = generate_unique_id_from_path(str(item_no_suffix))
       dgl_output_path = pathlib.Path(flows_outputdir).joinpath("flows", flow_name, "dgl", f"{hash_id}.ml")
       dgl_output_path.parent.mkdir(parents=True, exist_ok=True)

       embedding_model.convert_encoded_data_to_graph(storage, dgl_storage, str(dgl_output_path))

       # Compress the storage into a .data file
       storage.compress_store()

       # Return the base storage path
       return storage.get_file_path("")

**Flow Execution**: Creates a flow combining both tasks with parallel execution support. The pipeline outputs encoded CAD data ready for training.

.. code-block:: python

   import hoops_ai

   # Create flow with both tasks
   flow = hoops_ai.create_flow(
       name="ETL for embeddings training",
       tasks=[gather_cad_files, prepare_data_embeddings_training],
       max_workers=24,
       flows_outputdir="./etl_embedding_flow",
   )

   # Execute flow (cad_source: path to the folder that contains your CAD files)
   results = flow.process(inputs={"cad_datasources": [cad_source]})

The outputs of the flow are the `.dataset` and `.ml` files, both of which are needed for data exploration and training.
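
If you want to confirm what the flow produced before moving on to training, a quick recursive listing is enough. The snippet below is a minimal sketch using only the standard library; the exact directory layout depends on the `flows_outputdir` value and flow name you passed to `hoops_ai.create_flow`, so it simply searches for the relevant file extensions from the project root:

.. code-block:: python

   from pathlib import Path

   # Search from the project root; narrow this down if your flows_outputdir lives elsewhere
   project_root = Path(".")

   # Per-part DGL graph files (*.ml) written by the transform task, plus the
   # merged .dataset / .infoset files consumed by the trainer in the next section
   for pattern in ("*.ml", "*.dataset", "*.infoset"):
       matches = list(project_root.rglob(pattern))
       print(f"{pattern}: {len(matches)} file(s)")

If any of these counts is zero, re-check the flow logs before starting training.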

Usage with FlowTrainer
======================

The `EmbeddingFlowModel` is designed to work seamlessly with the `FlowTrainer` class for training shape embeddings.

Basic Training Example
----------------------

.. code-block:: python

   import pathlib

   import hoops_ai
   from hoops_ai.dataset import DatasetLoader
   from hoops_ai.ml.EXPERIMENTAL import FlowTrainer

   # Set license
   hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

   # Load pre-processed dataset
   flow_name = "etl_embedding_flow"
   flow_root_dir = pathlib.Path("flows") / flow_name
   dataset_path = str(flow_root_dir / f"{flow_name}.dataset")
   info_path = str(flow_root_dir / f"{flow_name}.infoset")

   # Load and split dataset
   cadflowdataset = DatasetLoader(
       merged_store_path=dataset_path,
       parquet_file_path=info_path
   )
   cadflowdataset.split(
       key='face_types',
       group="faces",
       train=0.8,
       validation=0.1,
       test=0.1
   )

   # Create and configure trainer
   flow_trainer = FlowTrainer(
       flowmodel=embedding_model,
       datasetLoader=cadflowdataset,
       experiment_name="shape_embeddings_training",
       result_dir=str(flow_root_dir),
       accelerator='cuda',   # Use 'cpu' if no GPU is available
       max_epochs=50,
       batch_size=64
   )

   # Train the model
   trained_model_path = flow_trainer.train()
   print(f"Training complete. Model saved at: {trained_model_path}")

Usage with FlowInference
========================

The `FlowInference` class enables you to use trained embedding models for inference on new CAD files.

Basic Inference Example
-----------------------

.. code-block:: python

   import hoops_ai
   from hoops_ai.cadaccess import HOOPSLoader
   from hoops_ai.ml.EXPERIMENTAL.flow_inference import FlowInference
   from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

   # Create inference handler (embedding_model is the EmbeddingFlowModel created earlier)
   inference = FlowInference(
       cad_loader=HOOPSLoader(),
       flowmodel=embedding_model
   )

   # Load trained model checkpoint
   checkpoint_path = "flows/etl_embedding_flow/ml_output/.../best.ckpt"
   inference.load_from_checkpoint(checkpoint_path)

   # Process a CAD file
   cad_file_path = "path/to/your/model.step"
   batch = inference.preprocess(cad_file_path)

   # Get embeddings
   embeddings = inference.predict_and_postprocess(batch)
   print(f"Shape embedding: {embeddings.shape}")
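
The `...` in the checkpoint path above stands for the run subfolders written under the flow's `ml_output` directory (such as `0107/143022` in the registration example below). If you prefer not to paste the path by hand, a small sketch like the following picks the most recently written checkpoint; it only assumes that the file keeps the name `best.ckpt` somewhere under `ml_output`:

.. code-block:: python

   from pathlib import Path

   ml_output_dir = Path("flows/etl_embedding_flow/ml_output")

   # Collect every best.ckpt under the run folders and keep the newest one
   checkpoints = sorted(ml_output_dir.rglob("best.ckpt"), key=lambda p: p.stat().st_mtime)
   if not checkpoints:
       raise FileNotFoundError(f"No best.ckpt found under {ml_output_dir}")

   checkpoint_path = str(checkpoints[-1])
   print(f"Using checkpoint: {checkpoint_path}")

The resulting `checkpoint_path` can then be passed to `inference.load_from_checkpoint()` exactly as in the example above.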

Registering Your Trained Model for Production
=============================================

Once training is complete, register your custom model with `HOOPSEmbeddings` to use it in production:

.. code-block:: python

   from hoops_ai.ml.embeddings import HOOPSEmbeddings

   # Register your trained model
   HOOPSEmbeddings.register_model(
       model_name="my_custom_embeddings_v1",
       checkpoint_path="flows/my_embedding_flow/ml_output/0107/143022/best.ckpt"
   )

   # Now use it in production (see the embeddings and retrieval guide)
   embedder = HOOPSEmbeddings(model="my_custom_embeddings_v1", device="cpu")

   # Compute embeddings for new CAD files
   embedding = embedder.embed_shape("path/to/new_part.step")

   # Or batch process
   batch_embeddings = embedder.embed_shape_batch(
       cad_path_list=["part1.step", "part2.step", "part3.step"],
       max_workers=4
   )

**Next Steps**: See the :doc:`Production workflow ` for:

- Using your registered model for similarity search
- Indexing embeddings in vector databases
- Querying for similar parts in production
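
As a quick local sanity check before setting up the full retrieval workflow, two embeddings from the registered model can be compared directly with cosine similarity. This is a minimal sketch, not part of the `HOOPSEmbeddings` API; the file names are placeholders, and it assumes the values returned by `embed_shape` can be converted to 1-D NumPy arrays:

.. code-block:: python

   import numpy as np

   # Embeddings computed with the embedder registered above (placeholder file names)
   emb_a = np.asarray(embedder.embed_shape("part_a.step"), dtype=np.float32).ravel()
   emb_b = np.asarray(embedder.embed_shape("part_b.step"), dtype=np.float32).ravel()

   # Cosine similarity: values close to 1.0 indicate geometrically similar shapes
   cosine = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
   print(f"Cosine similarity: {cosine:.3f}")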