DNAStream is an HDF5-backed, multi-modal data structure for organizing DNA sequencing data and downstream evolutionary analysis. It provides compact on-disk storage, fast partial reads, and a structured way to track entities, links, and changes over time.
This beta release focuses on:
- Registry: typed registries with built-in schemas for core entities (e.g., samples, variants, SNPs) and activation status
- Provenance: lightweight event logging for dataset changes (create, append, modify, designate)
Expect the API and on-disk layout to evolve during beta.
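To make the two beta components concrete, here is a minimal in-memory sketch of a schema-validated registry that records a provenance event for each change. It is an illustration only: `ToyRegistry`, its fields, and the event keys are hypothetical names chosen for this example. They are not the DNAStream classes or the on-disk layout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ToyRegistry:
    """In-memory stand-in for a typed registry (hypothetical, not the DNAStream class)."""

    name: str
    schema: dict  # field name -> required Python type
    rows: list = field(default_factory=list)
    events: list = field(default_factory=list)  # toy provenance log

    def _log(self, action, count):
        # Record what changed and when, mirroring the idea of a provenance event.
        self.events.append({
            "registry": self.name,
            "action": action,
            "count": count,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def add(self, records):
        # Validate every record before storing any, so a bad batch changes nothing.
        for rec in records:
            if set(rec) != set(self.schema):
                raise ValueError(f"fields {sorted(rec)} do not match schema {sorted(self.schema)}")
            for key, expected in self.schema.items():
                if not isinstance(rec[key], expected):
                    raise TypeError(f"{key!r} must be {expected.__name__}")
        start = len(self.rows)
        for offset, rec in enumerate(records):
            self.rows.append({"id": start + offset, "active": True, **rec})
        self._log("append", len(records))

    def deactivate(self, row_id):
        # Rows are flagged inactive rather than deleted, so ids stay stable.
        self.rows[row_id]["active"] = False
        self._log("modify", 1)


samples = ToyRegistry("sample", {"sample_name": str, "modality": str})
samples.add([
    {"sample_name": "S1", "modality": "bulk"},
    {"sample_name": "S2", "modality": "single-cell"},
])
samples.deactivate(1)
print([row["sample_name"] for row in samples.rows if row["active"]])
print([event["action"] for event in samples.events])
```

Flagging rows as inactive instead of deleting them keeps identifiers stable, which is one plausible reason to track activation status alongside each entity.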
Beta includes Registry + Provenance. Measurements and Results are planned (marked with * in the diagram).
- Efficient storage and access via a chunked HDF5 file with lazy reads for large cohorts
- Entity registries with schemas to validate fields, manage activation, and support consistent linking across datasets and analyses
- Provenance logging of change events to support reproducibility and collaboration
- Measurements linked to registered entities (e.g., variant/total read counts, binned counts)
- Results storage and retrieval (e.g., copy number calling, clonal trees)
- Canonical result pointers to mark the active/preferred outputs among multiple runs
- Custom schemas for specialized registries, measurements, and results
- Multi-user workflows with a clear concurrency policy for write access
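The chunked storage and lazy reads mentioned in the list can be illustrated directly with h5py, one of the listed dependencies. The sketch below is not DNAStream code: the dataset name `read_counts`, the matrix shape, and the chunk size are assumptions made for the example. It only shows the general HDF5 behavior of reading a small slice from a chunked, compressed dataset.

```python
import h5py
import numpy as np

# In-memory HDF5 file (nothing is written to disk); a real file would use a path on disk.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    counts = np.arange(1000 * 50, dtype="int32").reshape(1000, 50)
    # Store the matrix in 100-row chunks with gzip compression.
    dset = f.create_dataset(
        "read_counts", data=counts, chunks=(100, 50), compression="gzip"
    )
    # Slicing the dataset reads only the chunks that overlap the requested rows.
    block = dset[200:205, :3]
    print(dset.chunks, block.shape)
```

Because `block` is a regular NumPy array, it remains usable after the file is closed, while `dset` does not.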
DNAStream has the following dependencies. These will be automatically installed during installation. Pinned versions to be determined later.
h5py
numpy
pandas
To view the documentation locally:
mkdocs>=1.5
mkdocs-material>=9
mkdocstrings[python]>=0.25
pip install -e ".[docs]"
To run the test suite:
pytest>=7
pip install -e ".[test]"
For developers, all of the above dependencies plus:
black>=24
pip install -e ".[dev]"
Create a conda/mamba environment (recommended) and install the package from the GitHub tagged release.
# optional but recommended
conda create -n dnastream python=3.11
conda activate dnastream
# Just the DNAStream package
pip install "dnastream @ git+https://github.com/VanLoo-lab/DNAStream.git"
Alternatively, clone the repository and install it with the documentation dependencies:
# optional but recommended
conda create -n dnastream python=3.11
conda activate dnastream
# Clone the repo
git clone https://github.com/VanLoo-lab/DNAStream.git
cd DNAStream
# Install with docs dependencies
pip install -e ".[docs]"
# Serve the docs
mkdocs serve
Verify the installation.
python -c "import dnastream; from dnastream import DNAStream; print(dnastream.__version__)"
If the command prints a version number without errors, the package is ready to use.
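If the import check fails, it can help to see which of the listed dependencies are missing. The following sketch uses only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names are taken from the dependency list and the install command above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["dnastream", "h5py", "numpy", "pandas"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```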
The beta release is focused on Registry and Provenance. The minimal example below creates a file, appends rows to a registry, iterates over decoded rows, and inspects recent provenance events. See the tutorials for more detailed instructions on how to use the dnastream package.
from pathlib import Path
import tempfile
from dnastream import DNAStream
tmpdir = Path(tempfile.mkdtemp(prefix="dnastream_tutorial_"))
myfile = tmpdir / "temp.h5"
# Create a new DNAStream HDF5 file; a user warning is issued if the file already exists.
ds = DNAStream.create(myfile, patient_id="patientX")
ds.close()
Outside of create, it is recommended to connect with a context manager.
Here we add two entities to the built-in sample registry.
with DNAStream.open(myfile, mode="r+", verbose=True) as ds:
    # pointer to the built-in sample registry
    reg = ds.sample
    reg.add([
        {"sample_name": "S1", "modality": "bulk"},
        {"sample_name": "S2", "modality": "single-cell"},
    ])
    print(f"Sample registry contains {len(reg)} entities")
We can also add variant entities to the variant registry and iterate through the registry, extracting key information.
with DNAStream.open(myfile, mode="r+", verbose=True) as ds:
    ds.variant.add([
        {"chrom": "chr1", "start_pos": 1231, "end_pos": 1232, "ref_allele": "A", "alt_allele": "T"},
    ])
    for snv in ds.variant:
        print(f"SNV id: {snv['id']}, SNV label: {snv['label']}, active {snv['active']}")
Then we can inspect the provenance log to see all modifications to the DNAStream file.
with DNAStream.open(myfile, mode="r+", verbose=True) as ds:
    # pointer to the built-in provenance modification log
    log = ds.log
    # extract the entire log to a dataframe
    event_log_df = log.to_dataframe()
    print(event_log_df.head())
We can also iterate through a Registry or Provenance object to obtain a dictionary of each entry:
with DNAStream.open(myfile, mode="r", verbose=False) as ds:
    for row in ds.sample:
        print(row)
    for event in ds.log:
        print(event)
A file can also be created from the command line. To list the available options:
dnastream create -h
usage: dnastream create [-h] -f FILE [-p PATIENT_ID] [-v]
options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path to the DNAStream file to be created.
  -p PATIENT_ID, --patient-id PATIENT_ID
                        Patient identifier to store in file (optional).
  -v, --verbose
dnastream create -f dnastream.h5 -p patientX -v
During the beta phase, the documentation is not yet published on GitHub Pages. You can build and serve the docs locally with MkDocs.
Install the optional documentation dependencies:
pip install -e ".[docs]"  # after cloning the repository
Then serve the documentation locally:
mkdocs serve
MkDocs will print a local URL such as:
INFO - [10:31:55] Browser connected: http://127.0.0.1:8000/
Open that address in your browser to view the docs.
If the optional test dependencies are installed, the package can be tested with:
pytest