#############################
Train a Shape Embedding Model
#############################

.. sidebar:: Table of Contents

   .. contents::
      :local:
      :depth: 1

> **Purpose**: This document is for **Data Scientists** who want to **train custom HOOPS Embedding models**.
> **For production use** (computing embeddings and similarity search), see the :doc:`Production workflow `.

Overview
========

The `EmbeddingFlowModel` is a specialized FlowModel implementation designed for **training** shape embeddings from CAD data using contrastive learning. It enables data scientists to train custom embedding models on their own CAD datasets.

Training → Production Workflow
-------------------------------

1. **Train** a custom model using `EmbeddingFlowModel` + `FlowTrainer` (this document)
2. **Register** the trained model with `HOOPSEmbeddings.register_model()`
3. **Deploy** for production use via the `HOOPSEmbeddings` API (:doc:`see production guide `)

When to Train Custom Models
---------------------------

- Your CAD parts have unique geometric characteristics not captured by pre-trained models
- You need domain-specific embeddings (e.g., for a specific industry or manufacturing process)
- You have a large proprietary dataset to learn from
- You want to optimize embedding dimensions for your use case

**Note**: HOOPS AI provides a pre-trained model (`ts3d_scl_dual_v1`) that can be used directly; it was trained on a large dataset of nearly 1M parts from **public datasets (ABC, Fabwave, etc.)**. See the :doc:`production guide ` for how to use it.

Key Features
============

- **Contrastive Learning**: Learns shape representations by distinguishing between similar and dissimilar CAD geometries
- **Flexible Architecture**: Configurable embedding dimensions, projection layers, and training parameters
- **Unsupervised Training**: No labels required per CAD file - learns from geometric structure alone

.. code-block:: python

   import hoops_ai
   from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

   # Set license
   hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

   # Create embedding model with custom parameters
   # (flow_root_dir is defined in the training workflow further below)
   embedding_model = EmbeddingFlowModel(
       result_dir=str(flow_root_dir),
       emb_dim=1024,        # Embedding dimension
       lr=3e-4,             # Learning rate
       use_bn=True,         # Enable batch normalization
       temp_init=0.05,      # Initial temperature
       temp_min=0.01,       # Minimum temperature
       temp_max=0.20,       # Maximum temperature
   )

Constructor Parameters
======================

Essential Training Parameters
-----------------------------

`emb_dim` (int, default: 1024)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The dimensionality of the learned embeddings. This determines the size of the vector representation for each CAD shape.

- Higher dimensions can capture more detailed features but increase computational cost
- Typical values: 512, 1024, 2048

`lr` (float, default: 3e-4)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Learning rate for the optimizer during training.

- Controls the step size for gradient descent updates
- May need adjustment based on batch size and dataset characteristics

`use_bn` (bool, default: True)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Enable batch normalization in the model architecture.

- Helps stabilize training and can improve convergence
- Recommended to keep enabled for most use cases

Temperature Parameters
----------------------

These parameters control the contrastive loss function's sensitivity to similarities; the short sketch below illustrates the effect of temperature before each parameter is described.
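
To build intuition for these settings, the following minimal sketch scales a vector of similarity scores by different temperatures before applying a softmax, which is how temperature enters standard contrastive losses (e.g., InfoNCE/NT-Xent). It is an illustration only, not the HOOPS AI loss implementation, and the similarity values are invented:

.. code-block:: python

   import numpy as np

   def softmax(x):
       """Numerically stable softmax."""
       e = np.exp(x - np.max(x))
       return e / e.sum()

   # Cosine similarities between an anchor shape and four candidate shapes
   # (illustrative values only)
   similarities = np.array([0.92, 0.85, 0.40, 0.10])

   for temperature in (0.05, 0.20):
       # Contrastive losses divide similarities by the temperature before the
       # softmax: a small temperature sharpens the distribution (more
       # discriminative), a large one softens it.
       probs = softmax(similarities / temperature)
       print(f"T={temperature:.2f} -> {np.round(probs, 3)}")

At `T=0.05` almost all of the probability mass lands on the closest candidate, while at `T=0.20` it is spread more evenly; `temp_min` and `temp_max` bound how sharp or soft the learned temperature is allowed to become during training.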

`temp_init` (float, default: 0.05)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initial temperature value for the contrastive loss.

- Lower values make the model more discriminative
- Higher values create softer similarities

`temp_min` (float, default: 0.01)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum allowed temperature during training.

`temp_max` (float, default: 0.20)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Maximum allowed temperature during training.

Other Notable Parameters
------------------------

`proj_dim` (int, default: 512)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dimensionality of the projection head used in contrastive learning.

`result_dir` (str, optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directory where training results, logs, and metrics will be saved.

Data Processing Pipeline
========================

`EmbeddingFlowModel` training requires a preprocessing pipeline that gathers CAD files and extracts the CAD data needed for training. Here is a complete example using the FlowManager decorators:

**Task 1 - Extract**: Uses `@flowtask.extract` to gather CAD files from local storage using `CADFileRetriever`. Supports multiple CAD formats and parallel processing.

**Task 2 - Prepare data for embeddings training**: Uses the `@flowtask.transform` decorator, which automatically provides an optimized data storage object and handles the files in parallel.

.. code-block:: python

   import pathlib
   from typing import List

   import hoops_ai
   from hoops_ai.flowmanager import flowtask
   from hoops_ai.storage.cadfile_retriever import CADFileRetriever, LocalStorageProvider
   from hoops_ai.storage.datastorage import DataStorage
   from hoops_ai.cadaccess import HOOPSLoader

   # NOTE: DGLGraphStoreHandler and generate_unique_id_from_path are assumed to be
   # available from the hoops_ai storage utilities (their imports are omitted here);
   # flows_outputdir and flow_name match the flow configuration shown below.

   # TASK 1: Extract - Gather CAD files from source directories
   @flowtask.extract(
       name="gather cad files",
       inputs=["cad_datasources"],
       outputs=["cad_dataset"],
       parallel_execution=True
   )
   def gather_cad_files(source: str) -> List[str]:
       """Gather CAD files from a source directory."""
       retriever = CADFileRetriever(
           storage_provider=LocalStorageProvider(directory_path=source),
           formats=[".stp", ".step", ".iges", ".igs", ".cadpart"],
       )
       return retriever.get_file_list()

   # TASK 2: Transform - Encode CAD files into training data for the embedding model
   @flowtask.transform(
       name="Preparing data for Embedding Model training",
       inputs=["cad_dataset"],
       outputs=["cad_files_encoded"],
       parallel_execution=True
   )
   def prepare_data_embeddings_training(cad_file: str, cad_loader: HOOPSLoader, storage: DataStorage) -> str:
       """Prepare the data needed for data exploration and embedding model training."""
       facecount, edgecount = embedding_model.encode_cad_data(cad_file, cad_loader, storage)

       # DGL graph bin file
       dgl_storage = DGLGraphStoreHandler()

       # Remove the suffix to get the base name
       item_no_suffix = pathlib.Path(cad_file).with_suffix("")
       hash_id = generate_unique_id_from_path(str(item_no_suffix))
       dgl_output_path = pathlib.Path(flows_outputdir).joinpath("flows", flow_name, "dgl", f"{hash_id}.ml")
       dgl_output_path.parent.mkdir(parents=True, exist_ok=True)

       embedding_model.convert_encoded_data_to_graph(storage, dgl_storage, str(dgl_output_path))

       # Compress the storage into a .data file
       storage.compress_store()

       # Return the base storage path
       return storage.get_file_path("")

**Flow Execution**: Creates a flow combining both tasks with parallel execution support. The pipeline outputs encoded CAD data ready for training.

.. code-block:: python

   import hoops_ai

   # Create flow with both tasks
   flow = hoops_ai.create_flow(
       name="ETL for embeddings training",
       tasks=[gather_cad_files, prepare_data_embeddings_training],
       max_workers=24,
       flows_outputdir="./etl_embedding_flow",
   )

   # Execute flow (cad_source: path to the folder that contains your CAD files)
   results = flow.process(inputs={"cad_datasources": [cad_source]})

The outputs of the flow are the `.dataset` and `.ml` files, both of which are needed for data exploration and training.
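
If you want to confirm what the flow produced before moving on to training, a quick recursive listing is enough. The snippet below is a minimal sketch using only the standard library; the exact directory layout depends on the `flows_outputdir` value and flow name you passed to `hoops_ai.create_flow`, so it simply searches for the relevant file extensions from the project root:

.. code-block:: python

   from pathlib import Path

   # Search from the project root; narrow this down if your flows_outputdir lives elsewhere
   project_root = Path(".")

   # Per-part DGL graph files (*.ml) written by the transform task, plus the
   # merged .dataset / .infoset files consumed by the trainer in the next section
   for pattern in ("*.ml", "*.dataset", "*.infoset"):
       matches = list(project_root.rglob(pattern))
       print(f"{pattern}: {len(matches)} file(s)")

If any of these counts is zero, re-check the flow logs before starting training.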

Usage with FlowTrainer
======================

The `EmbeddingFlowModel` is designed to work seamlessly with the `FlowTrainer` class for training shape embeddings.

Basic Training Example
----------------------

.. code-block:: python

   import pathlib

   import hoops_ai
   from hoops_ai.dataset import DatasetLoader
   from hoops_ai.ml.EXPERIMENTAL import FlowTrainer

   # Set license
   hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

   # Load pre-processed dataset
   flow_name = "etl_embedding_flow"
   flow_root_dir = pathlib.Path("flows") / flow_name
   dataset_path = str(flow_root_dir / f"{flow_name}.dataset")
   info_path = str(flow_root_dir / f"{flow_name}.infoset")

   # Load and split dataset
   cadflowdataset = DatasetLoader(
       merged_store_path=dataset_path,
       parquet_file_path=info_path
   )
   cadflowdataset.split(
       key='face_types',
       group="faces",
       train=0.8,
       validation=0.1,
       test=0.1
   )

   # Create and configure trainer
   flow_trainer = FlowTrainer(
       flowmodel=embedding_model,
       datasetLoader=cadflowdataset,
       experiment_name="shape_embeddings_training",
       result_dir=str(flow_root_dir),
       accelerator='cuda',   # Use 'cpu' if no GPU is available
       max_epochs=50,
       batch_size=64
   )

   # Train the model
   trained_model_path = flow_trainer.train()
   print(f"Training complete. Model saved at: {trained_model_path}")

Usage with FlowInference
========================

The `FlowInference` class enables you to use trained embedding models for inference on new CAD files.

Basic Inference Example
-----------------------

.. code-block:: python

   import hoops_ai
   from hoops_ai.cadaccess import HOOPSLoader
   from hoops_ai.ml.EXPERIMENTAL.flow_inference import FlowInference
   from hoops_ai.ml.EXPERIMENTAL.flow_model_embedding import EmbeddingFlowModel

   # Create inference handler (embedding_model is the EmbeddingFlowModel created earlier)
   inference = FlowInference(
       cad_loader=HOOPSLoader(),
       flowmodel=embedding_model
   )

   # Load trained model checkpoint
   checkpoint_path = "flows/etl_embedding_flow/ml_output/.../best.ckpt"
   inference.load_from_checkpoint(checkpoint_path)

   # Process a CAD file
   cad_file_path = "path/to/your/model.step"
   batch = inference.preprocess(cad_file_path)

   # Get embeddings
   embeddings = inference.predict_and_postprocess(batch)
   print(f"Shape embedding: {embeddings.shape}")
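
The `...` in the checkpoint path above stands for the run subfolders written under the flow's `ml_output` directory (such as `0107/143022` in the registration example below). If you prefer not to paste the path by hand, a small sketch like the following picks the most recently written checkpoint; it only assumes that the file keeps the name `best.ckpt` somewhere under `ml_output`:

.. code-block:: python

   from pathlib import Path

   ml_output_dir = Path("flows/etl_embedding_flow/ml_output")

   # Collect every best.ckpt under the run folders and keep the newest one
   checkpoints = sorted(ml_output_dir.rglob("best.ckpt"), key=lambda p: p.stat().st_mtime)
   if not checkpoints:
       raise FileNotFoundError(f"No best.ckpt found under {ml_output_dir}")

   checkpoint_path = str(checkpoints[-1])
   print(f"Using checkpoint: {checkpoint_path}")

The resulting `checkpoint_path` can then be passed to `inference.load_from_checkpoint()` exactly as in the example above.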

Registering Your Trained Model for Production
=============================================

Once training is complete, register your custom model with `HOOPSEmbeddings` to use it in production:

.. code-block:: python

   from hoops_ai.ml.embeddings import HOOPSEmbeddings

   # Register your trained model
   HOOPSEmbeddings.register_model(
       model_name="my_custom_embeddings_v1",
       checkpoint_path="flows/my_embedding_flow/ml_output/0107/143022/best.ckpt"
   )

   # Now use it in production (see the embeddings and retrieval guide)
   embedder = HOOPSEmbeddings(model="my_custom_embeddings_v1", device="cpu")

   # Compute embeddings for new CAD files
   embedding = embedder.embed_shape("path/to/new_part.step")

   # Or batch process
   batch_embeddings = embedder.embed_shape_batch(
       cad_path_list=["part1.step", "part2.step", "part3.step"],
       max_workers=4
   )

**Next Steps**: See the :doc:`Production workflow ` for:

- Using your registered model for similarity search
- Indexing embeddings in vector databases
- Querying for similar parts in production
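
As a quick local sanity check before setting up the full retrieval workflow, two embeddings from the registered model can be compared directly with cosine similarity. This is a minimal sketch, not part of the `HOOPSEmbeddings` API; the file names are placeholders, and it assumes the values returned by `embed_shape` can be converted to 1-D NumPy arrays:

.. code-block:: python

   import numpy as np

   # Embeddings computed with the embedder registered above (placeholder file names)
   emb_a = np.asarray(embedder.embed_shape("part_a.step"), dtype=np.float32).ravel()
   emb_b = np.asarray(embedder.embed_shape("part_b.step"), dtype=np.float32).ravel()

   # Cosine similarity: values close to 1.0 indicate geometrically similar shapes
   cosine = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
   print(f"Cosine similarity: {cosine:.3f}")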