Working with reference data — KBase SDK 1.2.0 documentation

The Nugget

  • The KBase SDK lets developers keep large reference data collections in a dedicated volume that is mounted read-only at run time, which keeps the Docker image small and results reproducible.

Make it stick

  • 📦 Reference Data is kept in a special volume to reduce Docker image size.
  • 🔄 When updating, increase the data-version in kbase.yaml so that re-registering the app initializes new reference data without affecting older app versions.
  • 🔍 A READY file in /data indicates successful creation of the reference data volume.
  • ⚠️ Reference data is writable only at registration time, when the init mode runs; during app execution it is read-only, which prevents unintended changes.

Key insights

Overview of Reference Data Management

  • Many apps depend on large reference datasets; the reference data feature keeps those datasets out of the Git repository, where they would otherwise bloat it.
  • The data is stored in a special volume rather than in the Docker image, which keeps the image small. The volume is mounted read-only at /data when the app runs.
  • At registration time, the module runs in an init mode that populates the reference data volume.

Implementation Steps for Developers

  1. Update kbase.yaml: Include a data-version tag with a semantic version.
  2. Modify entrypoint.sh: Add download and preparation steps in the init block and place the data in /data.
  3. Create a READY file: This file confirms that the reference data has been correctly initialized.
  4. Conduct sanity tests: Verify that the expected files are present after initialization.
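
Steps 2 to 4 above can be sketched as a shell function meant to be called from the init block of entrypoint.sh. This is a minimal sketch under stated assumptions, not the SDK's own template: the function names, the archive layout, and the expected file path are hypothetical, and the marker file name defaults to READY as written in this summary (confirm the exact name against the SDK documentation).

```shell
# Hypothetical download helper: fetch URL ($1) into FILE ($2).
fetch() {
  curl -fsSL -o "$2" "$1"
}

# init_reference_data DATA_DIR SOURCE_URL EXPECTED_FILE
# Downloads one .tar.gz archive, unpacks it into DATA_DIR, checks that
# EXPECTED_FILE (relative to DATA_DIR) exists and is non-empty, and only
# then writes the marker file. All names here are illustrative.
init_reference_data() {
  local data_dir="$1"
  local source_url="$2"
  local expected_file="$3"
  local marker="${READY_MARKER:-READY}"   # assumed marker file name
  local archive="$data_dir/reference.tar.gz"

  mkdir -p "$data_dir" || return 1
  fetch "$source_url" "$archive" || return 1
  tar -xzf "$archive" -C "$data_dir" || return 1
  rm -f "$archive"

  # Sanity test: refuse to mark the volume ready if the data is incomplete.
  if [ ! -s "$data_dir/$expected_file" ]; then
    echo "init failed: $data_dir/$expected_file is missing or empty" >&2
    return 1
  fi

  touch "$data_dir/$marker"
}
```

Inside the init block, the call might look like `init_reference_data /data "$REF_URL" reference/genomes.fa`, where `REF_URL` and the file path are placeholders. Writing the marker last means a failed download or a failed sanity test never leaves a volume that looks initialized.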

Updating and Maintaining Reference Data

  • Developers can update reference data by:
    • Incrementing the version number in kbase.yaml.
    • Updating the init section of entrypoint.sh.
    • Re-registering the app to initialize the new version.
  • Older versions of the app keep using the reference data version named in their own kbase.yaml, so their results stay consistent.
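
Because new reference data is initialized only when the version changes, it can help to check the bump before re-registering. The sketch below assumes that data-version is a top-level key on its own line in kbase.yaml; the helper names and the two-file comparison are hypothetical and not part of the SDK.

```shell
# get_data_version FILE
# Prints the value of a top-level "data-version:" key, with quotes and
# whitespace stripped. Assumes the key sits on a line of its own.
get_data_version() {
  sed -n 's/^data-version:[[:space:]]*//p' "$1" | head -n 1 | tr -d "\"'[:space:]"
}

# data_version_changed OLD_FILE NEW_FILE
# Succeeds if both files declare a data-version and the two values differ.
data_version_changed() {
  local old_version new_version
  old_version="$(get_data_version "$1")"
  new_version="$(get_data_version "$2")"
  [ -n "$old_version" ] && [ -n "$new_version" ] && [ "$old_version" != "$new_version" ]
}
```

For example, `data_version_changed previous/kbase.yaml kbase.yaml` could guard a release script, with `previous/kbase.yaml` standing in for the last registered copy. The check only detects a difference; it does not verify that the new value is a higher semantic version.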

Important Considerations

  • The /data area is read-only during app execution; it can be changed only by the init block of the entrypoint script, which runs at registration time.
  • Changes made to /data in the Dockerfile are not visible at run time, because the mounted reference data volume replaces the /data directory from the Docker image.
  • If the application requires writable reference data at runtime, developers need to copy it to a writable area before execution.
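
The copy step in the last point can be sketched as follows. This is an illustration only: the function name and the scratch path in the usage note are hypothetical, and the marker file name defaults to READY as written in this summary.

```shell
# stage_writable_copy REF_DIR WORK_DIR
# Copies the read-only reference data into a writable working directory
# and makes the copy writable for the current user. It refuses to run if
# the marker file is missing, since the volume may then be incomplete.
stage_writable_copy() {
  local ref_dir="$1"
  local work_dir="$2"
  local marker="${READY_MARKER:-READY}"   # assumed marker file name

  if [ ! -e "$ref_dir/$marker" ]; then
    echo "reference data not initialized: $ref_dir/$marker is missing" >&2
    return 1
  fi

  mkdir -p "$work_dir" || return 1
  # The trailing "/." copies the directory contents, including dotfiles.
  cp -R "$ref_dir/." "$work_dir/" || return 1
  # Files copied from a read-only source may keep read-only permissions.
  chmod -R u+w "$work_dir"
}
```

A call such as `stage_writable_copy /data /path/to/scratch/ref` would run before the tool that needs to write, with the scratch path chosen by the app. Copying large datasets on every run costs time and disk space, so this is worth doing only for the files that actually need to change.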

Key quotes

  • "KBase App reference data is designed to address scenarios where large datasets would otherwise bloat the Git repo."
  • "This feature works by saving reference data in a special volume space and making that volume available through a read-only mount."
  • "Older versions of the app will continue to use the previous reference data specified in that version's kbase.yaml file."
  • "The reference data area is mounted during initialization and replaces /data from the Docker image."
  • "For reproducibility, reference data should only be writable at registration time."