Similarity Search Workflow

Important

Purpose: This document is for developers and engineers using embeddings in production applications.

For an overview (core concepts and how to choose between pre-trained and custom models), start with Embeddings & Similarity Search.

Embeddings Retrieval Workflow

Prerequisites

import hoops_ai
from hoops_ai.ml.embeddings import HOOPSEmbeddings
from hoops_ai.ml import CADSearch
from pathlib import Path

# Set license
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

Step 1: Embed CAD Files

Single File Embedding

# Initialize the embeddings model
embedder = HOOPSEmbeddings(
    model="ts3d_scl_dual_v1",  # Pre-trained model name
    device="cpu"                # or "cuda" for GPU
)

# Embed a single CAD file
embedding = embedder.embed_shape("path/to/part.step")

print(f"Model: {embedding.model}")           # e.g., "HOOPS_AI:ts3d_scl_dual_v1"
print(f"Dimension: {embedding.dim}")         # e.g., 256
print(f"Vector shape: {embedding.values.shape}")  # (256,)

Step 2: Index Embeddings

Create a CADSearch instance with the embedder as its shape_model, then pass an EmbeddingBatch (produced by embedder.embed_shape_batch) to searcher.index_shape. The searcher object used in Step 3 comes from this step; the Complete End-to-End Example below shows both calls in context.

Step 3: Search for Similar Parts

Option A: Search by CAD File (Query-Time Embedding)

Search using a new CAD file; the model embeds it on the fly:

# Search for parts similar to a query CAD file
results = searcher.search_by_shape(
    cad_path="path/to/query_part.step",
    top_k=10,                    # Return top 10 matches
    include_metadata=True        # Include part metadata
)

# Process results
for hit in results:
    print(f"Part ID: {hit.id}")
    print(f"Similarity Score: {hit.score:.4f}")
    print(f"Metadata: {hit.metadata}")
    print("---")

How it works:

  1. The query CAD file is embedded using the shape_model (HOOPSEmbeddings).

  2. The vector store computes the similarity between the query embedding and every indexed embedding.

  3. The top-k most similar parts are returned with their scores.
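
To make the three steps above concrete, here is a minimal pure-Python sketch of a brute-force top-k search. It assumes cosine similarity as the metric, which this document does not confirm for the underlying vector store, and the names cosine_similarity and top_k_similar are hypothetical helpers, not part of hoops_ai.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query, indexed, k):
    """Score every indexed vector against the query and keep the k best."""
    scored = [(part_id, cosine_similarity(query, vec)) for part_id, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors stand in for real embeddings.
indexed = {
    "bracket_a": [1.0, 0.0, 0.0],
    "bracket_b": [0.9, 0.1, 0.0],
    "gear_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

hits = top_k_similar(query, indexed, k=2)
for part_id, score in hits:
    print(f"{part_id}: {score:.4f}")
```

A production vector store would normally replace the linear scan with an index structure; the sketch only illustrates the ranking logic.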

Option B: Search by Pre-Computed Embedding

If you already have an embedding, search directly without re-embedding:

# Compute embedding once
query_embedding = embedder.embed_shape("path/to/query_part.step")

# Search using the embedding
results = searcher.search_by_embedding(
    query_embedding=query_embedding,
    search_space="shape",        # Search in shape embeddings
    top_k=10
)

for hit in results:
    print(f"Match: {hit.id} (score: {hit.score:.4f})")

When to use this:

  • You want to reuse the same query embedding multiple times

  • You’re comparing embeddings from different sources

  • You want to avoid re-computing embeddings

Option C: Search with Metadata Filters

Filter results based on metadata criteria:

# Search with filters (metadata must be provided during indexing)
results = searcher.search_by_shape(
    cad_path="query_part.step",
    top_k=20,
    filters={
        "category": "bracket",
        "material": "steel"
    }
)
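
The document does not state how filters are evaluated. As a rough mental model, the sketch below assumes exact-match semantics with all conditions combined by AND, applied to hits that are already sorted by score. The helpers matches_filters and apply_filters and the dict-based hit layout are hypothetical, not the library's API.

```python
def matches_filters(metadata, filters):
    """True when every filter key is present in metadata with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in filters.items())


def apply_filters(hits, filters, top_k):
    """Keep hits whose metadata satisfies all filters, preserving score order."""
    kept = [hit for hit in hits if matches_filters(hit["metadata"], filters)]
    return kept[:top_k]


# Hits are assumed to arrive sorted by descending score.
hits = [
    {"id": "p1", "score": 0.97, "metadata": {"category": "bracket", "material": "steel"}},
    {"id": "p2", "score": 0.95, "metadata": {"category": "bracket", "material": "aluminum"}},
    {"id": "p3", "score": 0.90, "metadata": {"category": "gear", "material": "steel"}},
    {"id": "p4", "score": 0.85, "metadata": {"category": "bracket", "material": "steel"}},
]

filtered = apply_filters(hits, {"category": "bracket", "material": "steel"}, top_k=20)
print([hit["id"] for hit in filtered])
```

Filtering after ranking can return fewer than top_k results when few parts match, which is one reason to request a larger top_k when filters are strict.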

Complete End-to-End Example

import hoops_ai
from hoops_ai.ml.embeddings import HOOPSEmbeddings
from hoops_ai.ml import CADSearch
from pathlib import Path

# Setup
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

# Step 1: Prepare CAD files
cad_directory = Path("path/to/cad/library")
cad_files = [str(f) for f in cad_directory.glob("*.step")]

print(f"Found {len(cad_files)} CAD files to process")

# Step 2: Batch embed all files
embedder = HOOPSEmbeddings(model="ts3d_scl_dual_v1", device="cpu")

embedding_batch = embedder.embed_shape_batch(
    cad_path_list=cad_files,
    max_workers=8,
    show_progress=True
)

print(f"✓ Embedded {len(embedding_batch.ids)} files")
print(f"✗ Failed: {embedding_batch.metadata['failed_count']}")

# Step 3: Index embeddings
searcher = CADSearch(shape_model=embedder)
searcher.index_shape(embedding_batch)

print("✓ Indexed embeddings in vector store")

# Step 4: Search for similar parts
query_file = "path/to/new_part.step"

results = searcher.search_by_shape(
    cad_path=query_file,
    top_k=5
)

print(f"\nTop 5 similar parts to '{query_file}':")
for i, hit in enumerate(results, 1):
    print(f"{i}. {hit.id} - Similarity: {hit.score:.4f}")

# Clean up
searcher.close()

Persisting Indices for Reuse

When working with large datasets, re-computing embeddings and re-indexing can be time-consuming. The persistence API allows you to save computed indices to disk and reload them in future sessions.

Saving an Index

After indexing embeddings, save the vector store to disk:

# Build index from embeddings
searcher = CADSearch(shape_model=embedder)
searcher.index_shape(embedding_batch)

# Save the shape index to disk (text indices use save_text_index, shown later)
searcher.save_shape_index("parts_library.faiss")

This creates two files:

  • parts_library.faiss: FAISS index with vector data

  • parts_library.meta: Metadata (ID mappings, part metadata)
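
Because the saved index consists of two sibling files, it can help to confirm both are present before copying or loading them. The sketch below derives the .meta path from the .faiss path with pathlib; expected_index_files and missing_index_files are hypothetical helpers, and the naming rule is taken from the two file names listed above.

```python
import tempfile
from pathlib import Path


def expected_index_files(index_path):
    """Return the index file and its sibling metadata file."""
    index_file = Path(index_path)
    return [index_file, index_file.with_suffix(".meta")]


def missing_index_files(index_path):
    """List the names of expected files that do not exist on disk."""
    return [path.name for path in expected_index_files(index_path) if not path.exists()]


# Demo: create only the .faiss file and see that the .meta file is reported.
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "parts_library.faiss"
    index_path.write_bytes(b"")
    missing = missing_index_files(index_path)

print(missing)
```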

Loading a Saved Index

In a new session, skip indexing and load directly:

import hoops_ai
from hoops_ai.ml.embeddings import HOOPSEmbeddings
from hoops_ai.ml import CADSearch

# Setup (license and model)
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)
embedder = HOOPSEmbeddings(model="ts3d_scl_dual_v1", device="cpu")

# Create searcher and load pre-built index
searcher = CADSearch(shape_model=embedder)
searcher.load_shape_index("parts_library.faiss")

# Query immediately without indexing
results = searcher.search_by_shape("new_part.step", top_k=10)

Key Points:

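  • No re-indexing: The library does not need to be embedded or indexed again; only the query file is embedded at search time

  • Faster startup: Loading an index typically takes seconds, whereas re-embedding a large library can take minutes or hours

  • Portable: Share index files with team members

  • ⚠️ Model consistency: Load and query the index with the same shape_model that was used to build it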
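
One possible way to guard against a model mismatch is to record the model identifier and dimension next to the index when saving, and compare them when loading. The sketch below writes its own small JSON sidecar file; the file name, the helper names, and the idea of a separate record are all assumptions for illustration, since the format of the library's own .meta file is not documented here.

```python
import json
import tempfile
from pathlib import Path


def record_path_for(index_path):
    """Sidecar path, e.g. parts_library.faiss -> parts_library.model.json."""
    index_file = Path(index_path)
    return index_file.with_name(index_file.stem + ".model.json")


def write_model_record(index_path, model_id, dim):
    """Store the model identifier and dimension used to build the index."""
    record = {"model": model_id, "dim": dim}
    record_path_for(index_path).write_text(json.dumps(record))


def check_model_record(index_path, model_id, dim):
    """Raise ValueError if the current model differs from the recorded one."""
    record = json.loads(record_path_for(index_path).read_text())
    if (record["model"], record["dim"]) != (model_id, dim):
        raise ValueError(
            f"index built with {record['model']} (dim {record['dim']}), "
            f"but current model is {model_id} (dim {dim})"
        )


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "parts_library.faiss"
    write_model_record(index_path, "HOOPS_AI:ts3d_scl_dual_v1", 256)
    check_model_record(index_path, "HOOPS_AI:ts3d_scl_dual_v1", 256)  # same model: no error
    try:
        check_model_record(index_path, "CUSTOM:my_custom_model_v1", 256)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(mismatch_detected)
```

In real use, the identifier and dimension would come from the model and dim attributes of the embedding objects described in this document.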

Complete Workflow Example

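# ========== Session 1: Build and Save ==========
import hoops_ai
from hoops_ai.ml.embeddings import HOOPSEmbeddings
from hoops_ai.ml import CADSearch
from pathlib import Path

# Set license (see Prerequisites)
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)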

# Embed dataset
embedder = HOOPSEmbeddings(model="ts3d_scl_dual_v1")
cad_files = [str(f) for f in Path("parts_library").glob("*.step")]

embedding_batch = embedder.embed_shape_batch(
    cad_path_list=cad_files,
    max_workers=8,
    show_progress=True
)

# Index and save
searcher = CADSearch(shape_model=embedder)
searcher.index_shape(embedding_batch)
searcher.save_shape_index("production_parts.faiss")

# Count embedded parts, not input files: some files may have failed to embed
print(f"Saved index with {len(embedding_batch.ids)} parts")

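# ========== Session 2: Load and Query ==========
# (On a different machine or after restart)
import hoops_ai
from hoops_ai.ml.embeddings import HOOPSEmbeddings
from hoops_ai.ml import CADSearch

# Set license again in the new session
hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)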

# Load model and index
embedder = HOOPSEmbeddings(model="ts3d_scl_dual_v1")
searcher = CADSearch(shape_model=embedder)
searcher.load_shape_index("production_parts.faiss")

# Query immediately
results = searcher.search_by_shape("query_part.step", top_k=5)
for hit in results:
    print(f"{hit.id}: {hit.score:.4f}")

Text Index Persistence

The same pattern works for text embeddings:

# Save text index
searcher.index_text(text_embedding_batch)
searcher.save_text_index("text_descriptions.faiss")

# Load text index
searcher.load_text_index("text_descriptions.faiss")
results = searcher.search_by_text("steel bracket", top_k=10)

Cloud Vector Stores (Future)

For cloud-based stores (Weaviate, Qdrant, Pinecone), the persistence model differs:

# Example with Qdrant (future implementation)
from hoops_ai.ml.embeddings import QdrantVectorStore

# Data is already persisted on Qdrant server
searcher = CADSearch(shape_model=embedder)

# Load connection to existing collection
searcher.load_shape_index(
    "production_parts",  # Collection name
    vector_store_cls=QdrantVectorStore  # Specify cloud store
)

# Query directly (data already on server)
results = searcher.search_by_shape("query.step", top_k=10)

Persistence Comparison:

Vector Store     save() Behavior            load() Behavior
-------------    -----------------------    -----------------------------
FAISS (local)    Serialize index to disk    Deserialize from file
Weaviate         Save connection config     Reconnect to cloud index
Qdrant           Save connection config     Reconnect to cloud collection
Pinecone         Save connection config     Reconnect to cloud index

Advanced Usage

Reusing the Shape Model for Multiple Queries

The shape_model passed to CADSearch is reused for all queries, avoiding model reloads:

# Initialize once
embedder = HOOPSEmbeddings(model="ts3d_scl_dual_v1")
searcher = CADSearch(shape_model=embedder)

# Index your dataset
searcher.index_shape(embedding_batch)

# Query multiple times - model is already loaded
results1 = searcher.search_by_shape("query1.step", top_k=10)
results2 = searcher.search_by_shape("query2.step", top_k=10)
results3 = searcher.search_by_shape("query3.step", top_k=10)

Using Custom Trained Models

If you’ve trained a custom embedding model using EmbeddingFlowModel, register it for production use:

# Register custom model (trained via EmbeddingFlowModel + FlowTrainer)
HOOPSEmbeddings.register_model(
    model_name="my_custom_model_v1",
    checkpoint_path="flows/my_embedding_flow/ml_output/0107/143022/best.ckpt"
)

# Use it just like a pre-trained model
embedder = HOOPSEmbeddings(model="my_custom_model_v1", device="cpu")
embedding_batch = embedder.embed_shape_batch(cad_files, max_workers=4)

# Model ID will be prefixed with "CUSTOM:"
print(embedding_batch.model)  # "CUSTOM:my_custom_model_v1"
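
The model identifiers shown in this document ("HOOPS_AI:ts3d_scl_dual_v1" and "CUSTOM:my_custom_model_v1") follow a prefix:name pattern. The sketch below splits such a string so that code can tell custom and pre-trained embeddings apart; split_model_id and is_custom_model are hypothetical helpers, and the pattern is inferred from those two examples only.

```python
def split_model_id(model_id):
    """Split 'PREFIX:name' into (prefix, name); prefix is '' if no colon is present."""
    prefix, separator, name = model_id.partition(":")
    if not separator:
        return "", model_id
    return prefix, name


def is_custom_model(model_id):
    """True when the identifier carries the CUSTOM prefix."""
    return split_model_id(model_id)[0] == "CUSTOM"


for model_id in ("HOOPS_AI:ts3d_scl_dual_v1", "CUSTOM:my_custom_model_v1"):
    prefix, name = split_model_id(model_id)
    print(f"{name}: prefix={prefix}, custom={is_custom_model(model_id)}")
```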

How to train custom models: See the Shape Embeddings Model page for:

  • Training workflow with EmbeddingFlowModel + FlowTrainer

  • Configuring model architecture and hyperparameters

  • Evaluating model performance

  • Best practices for custom model training

API Reference Summary

HOOPSEmbeddings

Method                                           Description
---------------------------------------------    -------------------------------------------
embed_shape(cad_path)                            Embed single CAD file → Embedding
embed_shape_batch(cad_path_list, max_workers)    Batch embed multiple files → EmbeddingBatch
register_model(name, checkpoint_path)            Register custom trained model
list_available_models()                          List pre-trained models

CADSearch

Method                                         Description
-------------------------------------------    -----------------------------------------
index_shape(embedding_batch)                   Index shape embeddings for search
index_text(embedding_batch)                    Index text embeddings for search
search_by_shape(cad_path, top_k)               Search by CAD file (on-the-fly embedding)
search_by_text(query_text, top_k)              Search by text description
search_by_embedding(query_embedding, top_k)    Search by pre-computed embedding
save_shape_index(path)                         Save shape index to disk for reuse
save_text_index(path)                          Save text index to disk for reuse
load_shape_index(path)                         Load pre-built shape index from disk
load_text_index(path)                          Load pre-built text index from disk
close()                                        Clean up resources

EmbeddingBatch

Attribute    Description
---------    -----------------------------------
values       NumPy array of shape (n_parts, dim)
model        Model identifier string
dim          Embedding dimensionality
ids          List of part IDs (file paths)
metadata     Dict with batch-level info
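
The attributes above imply a few invariants: one row of values per entry in ids, and each row of length dim. The sketch below checks those invariants on plain Python lists rather than a real EmbeddingBatch; validate_batch is a hypothetical helper, and the uniqueness check on ids is an added assumption.

```python
def validate_batch(values, ids, dim):
    """Return a list of problems found in a batch-like structure (empty if none)."""
    problems = []
    if len(values) != len(ids):
        problems.append(f"{len(values)} vectors but {len(ids)} ids")
    for part_id, row in zip(ids, values):
        if len(row) != dim:
            problems.append(f"{part_id}: expected {dim} values, got {len(row)}")
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")  # assumption: ids should be unique
    return problems


good = validate_batch([[0.1, 0.2], [0.3, 0.4]], ["a.step", "b.step"], dim=2)
bad = validate_batch([[0.1, 0.2], [0.3]], ["a.step", "b.step"], dim=2)
print(good)
print(bad)
```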