Skip to content

VanLoo-lab/DNAStream

Repository files navigation

DNAStream

DNAStream is an HDF5-backed, multi-modal data structure for organizing DNA sequencing data and downstream evolutionary analysis. It provides compact on-disk storage, fast partial reads, and a structured way to track entities, links, and changes over time.

Beta scope

This beta release focuses on:

  • Registry: typed registries with built-in schemas for core entities (e.g., samples, variants, SNPs) and activation status
  • Provenance: lightweight event logging for dataset changes (create, append, modify, designate)

Expect the API and on-disk layout to evolve during beta.

overview Beta includes Registry + Provenance. Measurements and Results are planned (marked with * in the diagram).

Key features

  • Efficient storage and access via a chunked HDF5 file with lazy reads for large cohorts
  • Entity registries with schemas to validate fields, manage activation, and support consistent linking across datasets and analyses
  • Provenance logging of change events to support reproducibility and collaboration

Coming soon

  • Measurements linked to registered entities (e.g., variant/total read counts, binned counts)
  • Results storage and retrieval (e.g., copy number calling, clonal trees)
  • Canonical result pointers to mark the active/preferred outputs among multiple runs
  • Custom schemas for specialized registries, measurements, and results
  • Multi-user workflows with a clear concurrency policy for write access

Table of Contents

Dependencies

DNAStream has the following dependencies. These will be automatically installed during installation. Pinned versions to be determined later.

  • h5py
  • numpy
  • pandas

Optional dependencies

To view the documentation locally:

  • mkdocs>=1.5
  • mkdocs-material>=9
  • mkdocstrings[python]>=0.25
pip install -e ".[docs]"

To run the test suite:

  • pytest>=7
pip install -e ".[test]"

For developers, all of the above dependencies plus:

  • black>=24
pip install -e ".[dev]"

Installation

Create a conda/mamba environment (recommended) and install the package from the Github tagged release.

#optional but recommended
conda create -n dnastream python=3.11
conda activate dnastream 

#Just the DNAStream package
pip install "dnastream @ git+https://github.com/VanLoo-lab/DNAStream.git"

With local documentation and tutorials

#optional but recommended
conda create -n dnastream python=3.11
conda activate dnastream 

# Clone the repo
git clone https://github.com/VanLoo-lab/DNAStream.git
cd DNAStream

# Install with docs dependencies
pip install -e ".[docs]"

# Serve the docs
mkdocs serve

Verify the installation.

python -c "import dnastream; from dnastream import DNAStream; print(dnastream.__version__)"

Package is ready to use if no errors occurred!

Quickstart

The beta release is focused on Registry and Provenance. The minimal example below creates a file, appends rows to a registry, iterates decoded rows, and inspects recent provenance events. See the tutorials for more detailed instruction of how to use the dnastream package.

from pathlib import Path
import tempfile
from dnastream import DNAStream

tmpdir = Path(tempfile.mkdtemp(prefix="dnastream_tutorial_"))
myfile = tmpdir / "temp.h5"

#Create a new DNAStream HDF5 file, user warning if file already exists.
ds = DNAStream.create(myfile, patient_id="patientX")
ds.close()

Outside of create it is recommended to connect with a context manager. Here we add to entities to the built-in sample registry.

with DNAStream.open(myfile, mode="r+", verbose=True) as ds:

    #pointer to the built-in sample registry
    reg = ds.sample

    reg.add([
        {"sample_name": "S1", "modality": "bulk"},
        {"sample_name": "S2", "modality": "single-cell"},
    ])

    print(f"Sample registry contains {len(reg)} entities")

We can also add variant entitites to the Variant Registry and iterate through the registry, extracting key information.

with DNAStream.open(myfile, mode="r+", verbose=True) as ds:
       
    ds.variant.add([
        {"chrom": "chr1", "start_pos": 1231, "end_pos": 1232, "ref_allele": "A", "alt_allele": "T"},
    ])

    for snv in ds.variant:
        print(f"SNV id: {snv['id']}, SNV label: {snv['label']}, active {snv['active']}")

Then we can inspect the provenance log to see all modifications to the DNAStream file.

with DNAStream.open(myfile, mode="r+", verbose=True) as ds:
    
    # pointer to the built-in provenance modification log
    log = ds.log


    # #extract the entire registry to a dataframe
    event_log_df = log.to_dataframe()
    print(event_log_df.head())

We can also iterate through a Registry or Provenance object to obtain a dictionary of each entry:

with DNAStream.open(myfile, mode="r", verbose=False) as ds:
    for row in ds.sample:
        print(row)

    for event in ds.log:
      print(event)

CLI Tools

dnastream create -h            
usage: dnastream create [-h] -f FILE [-p PATIENT_ID] [-v]

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path to the DNAStream file to be created.
  -p PATIENT_ID, --patient-id PATIENT_ID
                        Patient identifier to store in file (optional).
  -v, --verbose

Example

dnastream create -f dnastream.h5 -p patientX -v

Documentation

During the beta phase, the documentation is not yet published on GitHub Pages. You can build and serve the docs locally with MkDocs.

Install the optional documentation dependencies:

pip install -e ".[docs]"  # after cloning the repository

Then serve the documentation locally:

mkdocs serve

MkDocs will print a local URL such as:

INFO - [10:31:55] Browser connected: http://127.0.0.1:8000/

Open that address in your browser to view the docs.

Unit tests

If the optional dependences are installed for the test suite, then the package can be tested with:

pytest

About

DNAStream is an HDF5-based data structure for efficient storage, indexing, logging and retrieval of processed DNA sequencing data across multiple sequencing modalities.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages