hoops_ai.storage.DatasetMerger

class hoops_ai.storage.DatasetMerger(zip_files=None, merged_store_path=None, file_id_codes=None, dask_client_params=None, delete_source_files=True)

Bases: object

Merges multiple zipped Zarr stores (.zip files) into a single store, optionally processing the inputs in partial batches.

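Parameters:
  • zip_files (default: None)

  • merged_store_path (default: None)

  • file_id_codes (default: None)

  • dask_client_params (default: None)

  • delete_source_files (default: True)
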
close_dask_resources(close_global=False)

Close the Dask client and cluster that this class manages. If close_global is True, also close any globally active Dask client.

Parameters:

close_global (bool)

Return type:

None
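
Because the class manages its own Dask client and cluster, it can be useful to release them even when a merge fails. The following is a minimal sketch, assuming `merger` is an already constructed DatasetMerger. The helper name `merge_then_close` is hypothetical and not part of hoops_ai; only the two documented methods are called.

```python
def merge_then_close(merger, close_global=False, **merge_kwargs):
    """Run merge_data and always release the Dask resources afterwards.

    `merger` is assumed to be a DatasetMerger-like object. This helper is
    illustrative only and is not part of hoops_ai.
    """
    try:
        merger.merge_data(**merge_kwargs)
    finally:
        # Executed on success and on failure alike.
        merger.close_dask_resources(close_global=close_global)
```

Leaving close_global at False keeps any Dask client created elsewhere in the process untouched.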

discover_groups_from_files(max_files_to_scan=5)

Dynamically discover the group and array structure from the input files. Uses stored schemas if available; otherwise falls back to naming heuristics for backward compatibility.

Parameters:

max_files_to_scan (int): Maximum number of files to scan for structure discovery.

Returns:

Discovered group structure in DATA_GROUPS format.

Return type:

Dict[str, Dict[str, List[str]]]
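
The return value is a two-level mapping whose leaves are lists of strings. As a rough illustration of how such a structure can be walked, the sketch below flattens it into path-like strings. The helper `flatten_groups` and the sample keys are hypothetical placeholders; the real keys depend on the input files, and what each nesting level means is an assumption here.

```python
from typing import Dict, List


def flatten_groups(groups: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Turn the nested mapping into sorted 'outer/inner/name' strings."""
    paths = []
    for outer, inner_map in groups.items():
        for inner, names in inner_map.items():
            for name in names:
                paths.append(f"{outer}/{inner}/{name}")
    return sorted(paths)


# Placeholder data only; not the actual structure produced by hoops_ai.
sample = {
    "group_b": {"subgroup_x": ["array_2", "array_1"]},
    "group_a": {"subgroup_y": ["array_3"]},
}
print(flatten_groups(sample))
```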

get_discovered_groups()

Get the discovered groups structure.

Returns:

The discovered groups, or None if discovery has not run yet.

Return type:

Dict[str, Dict[str, List[str]]] | None
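
Since this getter may return None, callers typically need a fallback. One possible pattern, sketched under the assumption that `merger` is a DatasetMerger instance, is to trigger discovery only when nothing has been discovered yet. The helper name `ensure_groups` is hypothetical.

```python
def ensure_groups(merger, max_files_to_scan=5):
    """Return the discovered groups, running discovery first if needed.

    `merger` is assumed to expose get_discovered_groups() and
    discover_groups_from_files() as documented above.
    """
    groups = merger.get_discovered_groups()
    if groups is None:
        # Nothing cached yet, so scan the input files now.
        groups = merger.discover_groups_from_files(
            max_files_to_scan=max_files_to_scan
        )
    return groups
```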

merge_data(face_chunk=500000, edge_chunk=500000, faceface_flat_chunk=100000, batch_size=None, consolidate_metadata=True, force_compression=True)

Merges all zip_files into a single Zarr store.

If force_compression is True, the output is written in ZipStore format with a .dataset extension. If batch_size is None, all files are merged in a single pass; otherwise they are merged in batches into partial stores, which are then merged into the final store.

Parameters:
  • face_chunk (int): Chunk size for face arrays.

  • edge_chunk (int): Chunk size for edge arrays.

  • faceface_flat_chunk (int): Chunk size for faceface arrays.

  • batch_size (int | None): If provided, process files in batches of this size.

  • consolidate_metadata (bool): Whether to consolidate metadata after merging.

  • force_compression (bool): If True, ensures the output is a compressed .dataset file instead of a directory.

Return type:

None
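
To make the effect of batch_size concrete, the sketch below shows one way a file list could be split into batches. It illustrates the idea only and is not the library's implementation; the function name `make_batches` and the rejection of non-positive sizes are assumptions.

```python
from typing import List, Optional, Sequence


def make_batches(files: Sequence[str], batch_size: Optional[int]) -> List[List[str]]:
    """Split `files` into consecutive batches of at most `batch_size` items.

    With batch_size=None everything lands in a single batch, mirroring the
    single-pass behavior described above.
    """
    if batch_size is None:
        return [list(files)]
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer or None")
    return [list(files[i:i + batch_size]) for i in range(0, len(files), batch_size)]


# Hypothetical file names, used only to show the grouping.
print(make_batches(["a.zip", "b.zip", "c.zip", "d.zip", "e.zip"], 2))
```

In this sketch the last batch may be smaller than the others, so each partial store would cover at most batch_size input files.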

print_discovered_structure()

Print the discovered group and array structure in a readable format.

Return type:

None

set_schema(schema)

Set a schema that defines the expected structure of the data.

Parameters:

schema (Dict[str, Any]): Schema definition with group and array specifications.

Return type:

None

property DATA_GROUPS: Dict[str, Dict[str, List[str]]]

Returns the data groups structure. If groups have been discovered dynamically, returns those; otherwise returns the CAD-specific defaults.
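
The fallback behavior described for this property can be pictured with a small stand-alone class. This is a sketch of the pattern only, not the actual DatasetMerger code: the class `GroupHolder`, its `set_discovered` method, and the placeholder defaults are all hypothetical.

```python
from typing import Dict, List, Optional

Groups = Dict[str, Dict[str, List[str]]]


class GroupHolder:
    """Toy stand-in showing a property that prefers discovered groups."""

    # Placeholder defaults; the real CAD-specific defaults are not shown here.
    _DEFAULT_GROUPS: Groups = {"default_group": {"default_subgroup": ["default_array"]}}

    def __init__(self) -> None:
        self._discovered: Optional[Groups] = None

    def set_discovered(self, groups: Groups) -> None:
        self._discovered = groups

    @property
    def DATA_GROUPS(self) -> Groups:
        # Discovered structure wins; otherwise fall back to the defaults.
        if self._discovered is not None:
            return self._discovered
        return self._DEFAULT_GROUPS
```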