hoops_ai.dataset.dataset_loader

Classes

CADDataset(parent_dataset, indices)

A framework-agnostic dataset object that contains training, validation or testing data.

DatasetLoader([merged_store_path, ...])

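A framework-agnostic dataset class that explores a stored dataset, splits it into train/validation/test subsets, and returns CADDataset objects.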

class hoops_ai.dataset.dataset_loader.CADDataset(parent_dataset, indices)

Bases: object

A framework-agnostic dataset object that contains training, validation or testing data. Can be converted to framework-specific formats as needed.

property data_files

Return the list of .bin file paths for this subset only.

get_item(i)

Return item i of this subset in a framework-agnostic form.

get_raw_data(i)

Return the raw file paths for item i without loading the data.

property label_datas

Return the list of label arrays for this subset only.

remove_indices(local_indices_to_remove)

Remove items by local subset index. The items are also removed from the parent dataset, so the parent’s data_files and label_datas arrays and their indexing are updated. self.indices is then adjusted to match.
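The local-to-global translation described above can be sketched as follows. This is a minimal illustration and not the hoops_ai implementation: ParentStore and SubsetView are hypothetical stand-ins, and only data_files is modeled.

```python
class ParentStore:
    """Hypothetical stand-in for the parent dataset (not the hoops_ai class)."""

    def __init__(self, data_files):
        self.data_files = list(data_files)

    def remove_indices(self, global_indices):
        """Drop the given global indices; return {old_index: new_index} for survivors."""
        doomed = set(global_indices)
        survivors = [i for i in range(len(self.data_files)) if i not in doomed]
        mapping = {old: new for new, old in enumerate(survivors)}
        self.data_files = [self.data_files[i] for i in survivors]
        return mapping


class SubsetView:
    """Hypothetical subset that stores only indices into its parent."""

    def __init__(self, parent, indices):
        self.parent = parent
        self.indices = list(indices)

    @property
    def data_files(self):
        # Paths for this subset only, resolved through the parent.
        return [self.parent.data_files[i] for i in self.indices]

    def remove_indices(self, local_indices_to_remove):
        local = set(local_indices_to_remove)
        # Translate local subset positions into global parent indices.
        doomed_global = [self.indices[j] for j in local]
        mapping = self.parent.remove_indices(doomed_global)
        # Keep the surviving entries and shift them to the parent's new numbering.
        self.indices = [mapping[g] for j, g in enumerate(self.indices) if j not in local]


parent = ParentStore(["a.bin", "b.bin", "c.bin", "d.bin", "e.bin"])
subset = SubsetView(parent, [1, 3, 4])
subset.remove_indices([1])  # local index 1 is global index 3 ("d.bin")
```

In this sketch any other subset sharing the same parent would also need its indices remapped. Whether the library does that automatically is not stated here.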

to_torch(collate_fn=None)

Convert to a PyTorch Dataset. PyTorch is imported lazily, when this method is called.

Parameters:

collate_fn – Optional collate function for batching
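A lazy import keeps the module usable when PyTorch is not installed, because the dependency is only resolved when the conversion is requested. The sketch below shows the general pattern under stated assumptions: lazy_import and AgnosticItems are hypothetical names, and the handling of collate_fn is not modeled.

```python
import importlib


def lazy_import(module_name):
    """Import a module only when it is needed; raise a clear error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this conversion but is not installed"
        ) from exc


class AgnosticItems:
    """Hypothetical container standing in for a framework-agnostic dataset."""

    def __init__(self, items):
        self.items = list(items)

    def to_torch(self):
        # torch is imported here, not at module import time.
        torch_data = lazy_import("torch.utils.data")
        items = self.items

        class _TorchView(torch_data.Dataset):
            def __len__(self):
                return len(items)

            def __getitem__(self, i):
                return items[i]

        return _TorchView()
```

Creating AgnosticItems never touches PyTorch. Only a call to to_torch() triggers the import.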

class hoops_ai.dataset.dataset_loader.DatasetLoader(merged_store_path=None, parquet_file_path=None, item_loader_func=None)

Bases: object

A framework-agnostic dataset class that:
  • Internally creates a DatasetExplorer from the .dataset and infoset files, based on any available group/key

  • Builds a membership matrix for multi-label stratification

  • Splits data into train/validation/test subsets

  • Provides get_dataset(…) to get a CADDataset object

  • Offers remove_indices(…) returning a map of old -> new indices
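A membership matrix for multi-label stratification is commonly a binary items-by-categories table. The following is a minimal sketch of that idea, not the library's internal code: build_membership_matrix and the sample labels are hypothetical.

```python
def build_membership_matrix(labels_per_item, categories=None):
    """Return (categories, matrix) where matrix[i][j] is 1 if item i has category j."""
    if categories is None:
        # Infer the categories from the data when none are given.
        categories = sorted({c for labels in labels_per_item for c in labels})
    column = {c: j for j, c in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in labels_per_item]
    for i, labels in enumerate(labels_per_item):
        for c in labels:
            if c in column:  # categories outside the requested list are ignored
                matrix[i][column[c]] = 1
    return categories, matrix


# Each item may carry several labels (hypothetical sample data).
labels_per_item = [[0, 2], [1], [2], [0, 1, 2]]
categories, matrix = build_membership_matrix(labels_per_item)
column_totals = [sum(row[j] for row in matrix) for j in range(len(categories))]
```

The column totals give the number of items per category, which a stratified splitter can try to keep proportional across subsets.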

Parameters:
  • merged_store_path (str | None)

  • parquet_file_path (str | None)

  • item_loader_func (Callable | None)

available_arrays(group_name)

Get all available arrays for a specific group.

Parameters:

group_name (str)

Return type:

set

available_groups()

Get all available groups in the dataset.

Return type:

set

close_resources(clear_split_history=True)

Close and clean up resources, particularly the DatasetExplorer instance.

Parameters:

clear_split_history (bool) – If True, also clears the split history

diagnose_file_codes_mismatch(file_codes=None)

Diagnostic method to help understand file code and ID mismatches. Call this method when experiencing issues with file code mapping.

find_group_for_array(array_name)

Find which group contains a specific array. Returns None if the array is not found in any group.

Parameters:

array_name (str)

Return type:

str | None
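The lookup semantics of available_groups, available_arrays and find_group_for_array can be illustrated with a plain mapping from group names to array names. This is a sketch only: the groups dictionary and its contents are hypothetical, and the function operates on that dictionary, not on a real DatasetLoader.

```python
def find_group_for_array(groups, array_name):
    """Return the name of the group containing array_name, or None if absent."""
    for group_name in sorted(groups):
        if array_name in groups[group_name]:
            return group_name
    return None


# Hypothetical layout: group name -> set of array names.
groups = {
    "faces": {"face_areas", "face_types"},
    "machining": {"label"},
}

available_groups = set(groups)
arrays_in_faces = groups["faces"]
owner = find_group_for_array(groups, "label")
missing = find_group_for_array(groups, "not_an_array")
```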

get_available_stratification_keys()

Get all available keys that can be used for stratification, grouped by their containing group.

Return type:

dict

get_dataset(subset, key=None)

Return a framework-agnostic CADDataset for ‘train’, ‘validation’, or ‘test’.

Parameters:
  • subset (str) – One of ‘train’, ‘validation’, or ‘test’

  • key (str | None) – Optional key to specify which split to use if multiple splits exist. If None, uses the current active split.

Returns:

A framework-agnostic dataset containing the requested subset

Return type:

CADDataset

remove_indices(indices_to_remove)

Removes the given global indices from data_files/label_datas. Returns a dict mapping old_index -> new_index for items that remain.
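The old_index -> new_index map can be sketched with two parallel lists. This is an illustration under assumptions, not the library's code: the standalone function and sample data below are hypothetical.

```python
def remove_indices(data_files, label_datas, indices_to_remove):
    """Return the filtered lists and a {old_index: new_index} map for survivors."""
    doomed = set(indices_to_remove)
    kept_files, kept_labels, old_to_new = [], [], {}
    for old_index, (path, label) in enumerate(zip(data_files, label_datas)):
        if old_index in doomed:
            continue
        # The new index is the position the item takes in the filtered list.
        old_to_new[old_index] = len(kept_files)
        kept_files.append(path)
        kept_labels.append(label)
    return kept_files, kept_labels, old_to_new


files = ["a.bin", "b.bin", "c.bin", "d.bin"]
labels = [[0], [1], [0, 2], [2]]
files, labels, old_to_new = remove_indices(files, labels, [1])
```

Removed indices do not appear as keys in the map, so callers can detect a stale index with a simple membership test.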

reset_split_state()

Reset the split state to allow for a new split with different parameters. This preserves the previous split results in _split_history.
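One way to picture an active split alongside a preserved history is a small registry keyed by the stratification key. The sketch below is an assumption about the general idea only: SplitRegistry and its methods are hypothetical and do not describe how _split_history is actually structured.

```python
class SplitRegistry:
    """Hypothetical bookkeeping for several splits, keyed by stratification key."""

    SUBSETS = ("train", "validation", "test")

    def __init__(self):
        self.active_key = None
        self.history = {}

    def record(self, key, subsets):
        """Store a finished split and make it the active one."""
        self.history[key] = subsets
        self.active_key = key

    def reset(self):
        """Clear the active split but keep earlier results in the history."""
        self.active_key = None

    def get(self, subset, key=None):
        if subset not in self.SUBSETS:
            raise ValueError(f"subset must be one of {self.SUBSETS}")
        chosen = self.active_key if key is None else key
        if chosen not in self.history:
            raise KeyError("no split available for the requested key")
        return self.history[chosen][subset]


registry = SplitRegistry()
registry.record("label", {"train": [0, 1], "validation": [2], "test": [3]})
registry.reset()
registry.record("machining_category", {"train": [3, 2], "validation": [1], "test": [0]})
```

Here get("train") returns the subset of the most recent split, while get("train", key="label") still reaches the earlier one.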

split(key='label', group='machining', categories=None, train=0.8, validation=0.1, test=0.1, random_state=42, force_reset=False)

Perform a stratified split of the dataset based on a key.

Parameters:
  • key (str) – The key to stratify by (e.g. ‘label’, ‘machining_category’)

  • group (str) – The group where the key is stored

  • categories (List[Any] | None) – Optional list of categories. If None, they will be inferred.

  • train (float) – The fraction of data to use for training.

  • validation (float) – The fraction of data to use for validation.

  • test (float) – The fraction of data to use for testing.

  • random_state (int) – Random seed for reproducibility.

  • force_reset (bool) – If True, reset the split state even if the key is unchanged.

Returns:

The number of items in each subset, as (train_count, val_count, test_count)

Return type:

Tuple[int, int, int]
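The effect of the train, validation, test and random_state parameters can be sketched with a single-label stratified split in plain Python. This is a simplified illustration, not the algorithm used by split(): the function below is hypothetical and handles one label per item, whereas the class describes multi-label stratification.

```python
import random
from collections import defaultdict


def stratified_split(labels, train=0.8, validation=0.1, test=0.1, random_state=42):
    """Split item indices so each label keeps roughly the same proportions."""
    if abs(train + validation + test - 1.0) > 1e-9:
        raise ValueError("train, validation and test fractions must sum to 1")
    rng = random.Random(random_state)  # seeded for reproducibility

    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = int(round(train * len(indices)))
        n_val = int(round(validation * len(indices)))
        splits["train"] += indices[:n_train]
        splits["validation"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


labels = ["drill"] * 10 + ["mill"] * 20  # hypothetical label values
splits = stratified_split(labels)
counts = tuple(len(splits[name]) for name in ("train", "validation", "test"))
```

Because the shuffle is seeded, calling the function again with the same random_state yields the same subsets.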

validate_configuration()

Validate the current configuration and return a summary of available data. Useful for debugging and ensuring the dataset is properly configured.

Return type:

dict