A configurable, reproducible pipeline for mining and curating TOSCA (Topology and Orchestration Specification for Cloud Applications) blueprints from public code repositories.
TOSCAmine discovers, clones, validates, and assembles a curated dataset of TOSCA blueprints from GitHub and Codeberg. The pipeline is organized into four phases, each producing a checkpoint saved in runs/<run_name>/. Large artifacts (cloned repos, raw extracted files) are stored in artifacts/, which is gitignored.
tosca-mining-framework/
├── README.md
├── pyproject.toml                  # uv/hatchling config, entry point
├── uv.lock                         # pinned dependency tree
├── .python-version                 # Python 3.11
├── .env.example                    # Template for API tokens
│
├── scripts/
│   ├── mining/                     # Pipeline source code
│   │   ├── __init__.py
│   │   ├── main.py                 # CLI orchestrator
│   │   ├── config.yaml             # Queries, markers, run settings
│   │   ├── config/
│   │   │   ├── __init__.py
│   │   │   └── loader.py           # Config loading, path derivation
│   │   ├── utils/
│   │   │   ├── __init__.py
│   │   │   ├── api_client.py       # Shared HTTP client (rate-limit, retry)
│   │   │   └── stats_tracker.py    # Per-run stats → mining_stats.md
│   │   └── pipeline/
│   │       ├── __init__.py
│   │       ├── phase1_discover/    # GitHub and Codeberg searchers
│   │       │   ├── __init__.py
│   │       │   ├── github.py
│   │       │   └── codeberg.py
│   │       ├── phase2_clone/       # Full clone + TOSCA extraction + commit history
│   │       ├── phase3_validate/    # Three-tier quality model + metadata
│   │       └── phase4_build/       # Merge → CSV + Parquet + summary
│   └── analysis/
│       └── generate_figures.py     # Generates paper figures from summary.json
│
├── runs/                           # Pushed to git, one directory per run
│   └── <run_name>/
│       ├── config.yaml             # Snapshot of the config used
│       ├── discovered_repos.json   # Phase 1 output
│       ├── clone_progress.json     # Phase 2 progress (resumable)
│       ├── tosca_file_commits.csv  # Phase 2 output: per-file commit history
│       ├── tosca_metadata.json     # Phase 3 output
│       ├── tosca_dataset.csv       # Final dataset (CSV)
│       ├── tosca_dataset.parquet   # Final dataset (Parquet)
│       ├── summary.json            # Run statistics
│       ├── mining_stats.md         # Per-phase statistics
│       └── figures/                # Generated figures (optional)
│
└── artifacts/                      # Gitignored, large files
    └── <run_name>/
        ├── repos/                  # Cloned repositories
        ├── tosca_files/            # Extracted TOSCA files
        └── dataset/                # Organized per-repo file structure (Phase 4)
- Python 3.11
- uv — dependency management and execution
- Git
- GitHub Personal Access Token with public_repo scope (required for discovery)
- Codeberg token: optional; the API is public without one
# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Install dependencies (reads uv.lock for exact reproducibility)
uv sync
# 3. Configure API tokens
cp .env.example .env
# Edit .env and set at minimum:
# GITHUB_TOKEN=ghp_...

The tool is invoked via uv run tosca-mine. The run_name in config.yaml identifies the output directory.
# Run the full pipeline
uv run tosca-mine --phase all
# Run a single phase
uv run tosca-mine --phase discover
uv run tosca-mine --phase clone
uv run tosca-mine --phase validate
uv run tosca-mine --phase build
# Resume from a specific phase (earlier phases untouched)
uv run tosca-mine --from-phase clone # clone → validate → build
uv run tosca-mine --from-phase validate # validate → build
# Redo a phase from scratch
uv run tosca-mine --phase validate --overwrite
# Create a new isolated run
uv run tosca-mine --phase all --new-run # runs/default_20260519_143022/
uv run tosca-mine --phase all --config nfv.yaml # runs/<run_name in nfv.yaml>/
# Test with a small subset
uv run tosca-mine --phase all --max-repos 10 --keep-repos

Run isolation: the tool never silently overwrites or merges existing runs. If a run directory already has output, it stops with an error. Use a different run_name in config.yaml (or --new-run) to keep runs isolated and comparable.
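
The run-isolation rule can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names `timestamped_run_name` and `ensure_fresh_run_dir` are hypothetical, and the timestamp format is inferred from the `runs/default_20260519_143022/` example above.

```python
from datetime import datetime
from pathlib import Path


def timestamped_run_name(base: str, now: datetime) -> str:
    """Build a name such as default_20260519_143022 (format inferred from the README example)."""
    return f"{base}_{now.strftime('%Y%m%d_%H%M%S')}"


def ensure_fresh_run_dir(runs_root: Path, run_name: str) -> Path:
    """Create runs_root/run_name, refusing to reuse a directory that already has output."""
    run_dir = runs_root / run_name
    if run_dir.exists() and any(run_dir.iterdir()):
        # Hypothetical error message; the real tool's wording is not documented here.
        raise FileExistsError(
            f"{run_dir} already contains output; choose another run_name or use --new-run"
        )
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Failing loudly instead of merging keeps every directory under runs/ attributable to exactly one configuration, which is what makes runs comparable.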
Phase 1 (discover) searches two forges for TOSCA repositories:
| Forge | Strategies | Auth |
|---|---|---|
| GitHub | Keyword search, Code search, Topic search | Token required |
| Codeberg | Keyword search | Token optional |
GitHub supports three complementary strategies:
- Keyword search: repository name/description search with TOSCA-related terms
- Code search: searches for known tosca_definitions_version strings in YAML files (date-partitioned to work around the 1,000-result API cap)
- Topic search: repository search by topic tags (e.g. tosca, cloudify, alien4cloud)
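
Date partitioning means splitting one broad query into several narrower ones, each small enough to stay under the result cap. The sketch below only generates the windows. The helper name `date_windows` and the window size are assumptions, and how each window is attached to a search query is not specified in this README, so that step is deliberately left out.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield inclusive, non-overlapping (first_day, last_day) windows covering start..end."""
    if days < 1:
        raise ValueError("days must be >= 1")
    cursor = start
    while cursor <= end:
        # Clamp the final window so it never runs past the requested end date.
        last = min(cursor + timedelta(days=days - 1), end)
        yield cursor, last
        cursor = last + timedelta(days=1)
```

If a single window still hits the cap, it can be split again with a smaller `days` value.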
Results are saved incrementally to discovered_repos.json after each strategy.
Phase 2 (clone) full-clones each discovered repository, walks the file tree for .yaml/.yml files, and copies any file whose first 2 KB contains a TOSCA marker (tosca_definitions_version, cloudify_dsl, alien_dsl, etc.) to artifacts/<run>/tosca_files/. Failed clones are retried with exponential backoff.
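
The marker check described above might look like the following sketch. The function names are hypothetical, the marker tuple lists only the three markers named in this README (the full list lives under `cloner.tosca_markers` in config.yaml), and skipping `.git` by default is an assumption.

```python
from pathlib import Path
from typing import Iterable, List

# Only the markers named in the README; the real list is configured in config.yaml.
EXAMPLE_MARKERS = ("tosca_definitions_version", "cloudify_dsl", "alien_dsl")
HEAD_BYTES = 2048  # "first 2 KB"


def looks_like_tosca(path: Path, markers: Iterable[str] = EXAMPLE_MARKERS) -> bool:
    """Return True if any marker string occurs in the first 2 KB of the file."""
    with path.open("rb") as fh:
        head = fh.read(HEAD_BYTES).decode("utf-8", errors="ignore")
    return any(marker in head for marker in markers)


def find_tosca_files(repo_root: Path, skip_dirs: Iterable[str] = (".git",)) -> List[Path]:
    """Walk repo_root for .yaml/.yml files whose head contains a TOSCA marker."""
    skip = set(skip_dirs)
    hits = []
    for path in sorted(repo_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".yaml", ".yml"}:
            continue
        # Ignore files that live under any skipped directory.
        if skip.intersection(path.relative_to(repo_root).parts[:-1]):
            continue
        if looks_like_tosca(path):
            hits.append(path)
    return hits
```

Reading only the head of each file keeps the scan cheap on large repositories, at the cost of missing files whose version line appears after a long comment header.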
Per-file commit history is extracted via git log --follow and written incrementally to runs/<run>/tosca_file_commits.csv (schema: filename, repo_full_name, original_path, commit_hash, commit_date, author_name, author_email, subject). This enables longitudinal analysis of TOSCA file evolution.
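
A sketch of how such rows could be produced with `git log --follow`. The separator, the helper names `parse_log` and `file_history`, and the use of the author date (`%aI`) for `commit_date` are assumptions; the pipeline may format its log output differently.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

FIELD_SEP = "\x1f"  # ASCII unit separator, unlikely to occur in commit subjects
# %H hash, %aI author date (strict ISO 8601), %an author name, %ae author email, %s subject
GIT_FORMAT = "%H%x1f%aI%x1f%an%x1f%ae%x1f%s"


def parse_log(output: str) -> List[Dict[str, str]]:
    """Turn separator-delimited git log output into one dict per commit."""
    rows = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        commit_hash, commit_date, name, email, subject = line.split(FIELD_SEP, 4)
        rows.append({
            "commit_hash": commit_hash,
            "commit_date": commit_date,
            "author_name": name,
            "author_email": email,
            "subject": subject,
        })
    return rows


def file_history(repo_dir: Path, rel_path: str) -> List[Dict[str, str]]:
    """Run git log --follow for one file inside a cloned repository."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "log", "--follow",
         f"--pretty=format:{GIT_FORMAT}", "--", rel_path],
        capture_output=True, text=True, check=True,
    )
    return parse_log(result.stdout)
```

The `--follow` flag keeps history across renames, which matters for longitudinal analysis because blueprints are often moved between directories.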
Phase 3 (validate) applies a three-tier, dialect-agnostic quality model to each extracted file:
| Tier | Name | Criterion |
|---|---|---|
| 1 | Parseable | Valid YAML + mapping root + non-empty tosca_definitions_version |
| 2 | Structurally meaningful | Passes 6 structural coherence checks |
| 3 | Version-classified | Version string matches one of 25 known spec versions |
Structural metadata (node types, templates, relationships, inputs, outputs, etc.) is extracted from all Tier 1 files and saved to tosca_metadata.json.
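
The Tier 1 criterion can be illustrated on an already-parsed document. This sketch assumes a YAML library has parsed the file successfully (the "valid YAML" part) and only checks the remaining two conditions; the function name `is_tier1_root` is hypothetical.

```python
from typing import Any


def is_tier1_root(doc: Any) -> bool:
    """Check for a mapping root with a non-empty tosca_definitions_version value."""
    if not isinstance(doc, dict):
        return False  # root is a list, scalar, or empty document
    version = doc.get("tosca_definitions_version")
    return version is not None and str(version).strip() != ""
```

For example, `is_tier1_root({"tosca_definitions_version": "tosca_simple_yaml_1_3"})` holds, while a root that is a list or a mapping with an empty version string does not qualify.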
Phase 4 (build) merges repository-level metadata (Phase 1) with file-level metadata (Phase 3). It exports the final dataset as CSV and Parquet to runs/<run_name>/, organizes TOSCA files by repository in artifacts/<run>/dataset/, and writes summary.json to both locations.
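
Conceptually, the merge is a join of file records onto repository records. The sketch below uses plain dictionaries and assumes each file record carries a `repo_full_name` key to join on; that key choice, the function name, and the decision to skip files without a repository record are all assumptions, not documented behaviour.

```python
from typing import Dict, Iterable, List


def merge_file_and_repo_metadata(
    file_rows: Iterable[Dict[str, object]],
    repo_rows: Iterable[Dict[str, object]],
) -> List[Dict[str, object]]:
    """Attach repository-level fields to every file-level record."""
    repos_by_name = {repo["repo_full_name"]: repo for repo in repo_rows}
    merged = []
    for row in file_rows:
        repo = repos_by_name.get(row["repo_full_name"])
        if repo is None:
            continue  # this sketch skips files whose repository record is missing
        # File-level values win if both records define the same key.
        merged.append({**repo, **row})
    return merged
```

The result has one row per file with the repository columns repeated, which matches the one-row-per-file layout of the final dataset.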
# Default output: runs/<run_name>/figures/
uv run python scripts/analysis/generate_figures.py --run <run_name>
# Custom output directory
uv run python scripts/analysis/generate_figures.py --run <run_name> --out path/to/output

Each row in tosca_dataset.csv / tosca_dataset.parquet represents one valid TOSCA file.
| Column | Description |
|---|---|
| filename | Flat filename: owner__repo__path__to__file.yaml |
| original_path | Original path within the repository |
| tosca_version | Version string, e.g. tosca_simple_yaml_1_3 |
| tosca_profile | Dialect classification (e.g. tosca_simple_yaml_1_X, cloudify, alien4cloud) |
| known_version | Whether the version matches a recognised spec version |
| meaningful | Passes all Tier 2 structural coherence checks |
| coherence_warnings | JSON list of Tier 2 issue codes |
| description | File-level description field if present |
| has_topology_template | Contains a topology_template (TOSCA v1.x) or service_template (TOSCA 2.0) block |
| has_imports | Imports other definitions |
| node_types_count | Number of node type definitions |
| node_types | JSON list of node type names |
| relationship_types_count | Number of relationship type definitions |
| relationship_templates_count | Number of relationship template instantiations |
| capability_types_count | Number of capability type definitions |
| data_types_count | Number of data type definitions |
| policy_types_count | Number of policy type definitions |
| group_types_count | Number of group type definitions |
| artifact_types_count | Number of artifact type definitions |
| interface_types_count | Number of interface type definitions |
| node_templates_count | Number of node template instantiations |
| node_templates | JSON list of node template names |
| inputs_count | Number of topology input parameters |
| outputs_count | Number of topology output parameters |
| file_size_bytes | File size in bytes |
| line_count | Number of lines |
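
The flat `filename` pattern (owner__repo__path__to__file.yaml) can be reproduced with a small helper. This is a sketch under assumptions: the function name is hypothetical, and how the pipeline handles path components that already contain a double underscore is not documented here.

```python
from pathlib import PurePosixPath


def flat_filename(repo_full_name: str, original_path: str) -> str:
    """Join owner, repo, and every path component with a double underscore."""
    owner, repo = repo_full_name.split("/", 1)
    parts = [owner, repo, *PurePosixPath(original_path).parts]
    return "__".join(parts)
```

For instance, `flat_filename("acme/blueprints", "path/to/file.yaml")` gives `acme__blueprints__path__to__file.yaml`, so files from different repositories can share one directory without name collisions.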

Repository-level columns (repeated for every file from the same repository):

| Column | Description |
|---|---|
| repo_full_name | owner/repo identifier |
| repo_owner | Repository owner (user or organisation) |
| repo_name | Repository name |
| repo_html_url | URL to the repository |
| repo_description | Repository description |
| repo_stars | Star count |
| repo_forks | Fork count |
| repo_language | Primary programming language |
| repo_topics | Topics/tags (JSON list) |
| repo_last_updated | Last update date (ISO 8601) |
| repo_created_at | Creation date (ISO 8601) |
| repo_size_kb | Repository size in KB |
| repo_fork | Whether this repo is a fork of another |
| repo_license | SPDX license identifier |
| repo_open_issues | Number of open issues |
| repo_watchers | Watcher/star count |
| repo_archived | Whether the repository is archived |
| repo_disabled | Whether the repository is disabled |
| repo_visibility | Visibility (public, private) |
| source_queries | Queries that matched this repo (JSON list) |
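
Several columns are stored as JSON lists, so they need decoding after the CSV is read. The sketch below uses only the standard library; the function name `read_dataset_rows` is hypothetical, and the column tuple simply collects the columns described as JSON lists in the tables above.

```python
import csv
import json
from typing import Dict, List, TextIO

# Columns documented as JSON lists in the dataset schema.
JSON_LIST_COLUMNS = (
    "coherence_warnings", "node_types", "node_templates",
    "repo_topics", "source_queries",
)


def read_dataset_rows(handle: TextIO) -> List[Dict[str, object]]:
    """Read dataset rows from an open CSV file and decode JSON-list columns."""
    rows = []
    for row in csv.DictReader(handle):
        for column in JSON_LIST_COLUMNS:
            if row.get(column):  # leave missing or empty cells untouched
                row[column] = json.loads(row[column])
        rows.append(row)
    return rows
```

A typical call would be `read_dataset_rows(open("runs/<run_name>/tosca_dataset.csv", newline="", encoding="utf-8"))`, with the run name filled in.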
All queries, markers, and run settings live in scripts/mining/config.yaml:
run_name: default                  # identifies the output directory (runs/default/)
github:
  tokens:
    - "${GITHUB_TOKEN}"
  repo_search_queries: [...]       # keyword queries
  code_search_queries: [...]       # version-string code search queries
  topic_queries: [...]             # topic queries
codeberg:
  tokens:
    - "${CODEBERG_TOKEN}"          # optional
  repo_search_queries: [...]
cloner:
  tosca_markers: [...]             # strings checked in first 2KB of each YAML file
  retry_attempts: 3
  retry_delay_seconds: 5
  skip_dirs: [...]
  delete_after: false              # set to true to reclaim disk after extraction

To run with different queries, create a new config file with a different run_name:
uv run tosca-mine --config scripts/mining/nfv.yaml --phase all

- uv.lock pins the exact dependency tree; uv sync installs an identical environment on any machine.
- Each run saves a config.yaml snapshot inside runs/<run_name>/ so the exact queries are always recorded alongside the output.
- runs/ is pushed to this repository; artifacts/ is gitignored.
- The pipeline guarantees process replicability, not result reproducibility: repository churn (repos created/deleted/made private) means repeated discovery runs may yield slightly different results (~0.3% variance in final file count across runs).