TOSCAmine

A configurable, reproducible pipeline for mining and curating TOSCA (Topology and Orchestration Specification for Cloud Applications) blueprints from public code repositories.

Overview

TOSCAmine discovers, clones, validates, and assembles a curated dataset of TOSCA blueprints from GitHub and Codeberg. The pipeline is organized into four phases, each producing a checkpoint saved in runs/<run_name>/. Large artifacts (cloned repos, raw extracted files) are stored in artifacts/, which is gitignored.

Repository Structure

tosca-mining-framework/
├── README.md
├── pyproject.toml                  # uv/hatchling config, entry point
├── uv.lock                         # pinned dependency tree
├── .python-version                 # Python 3.11
├── .env.example                    # Template for API tokens
│
├── scripts/
│   ├── mining/                     # Pipeline source code
│   │   ├── __init__.py
│   │   ├── main.py                 # CLI orchestrator
│   │   ├── config.yaml             # Queries, markers, run settings
│   │   ├── config/
│   │   │   ├── __init__.py
│   │   │   └── loader.py           # Config loading, path derivation
│   │   ├── utils/
│   │   │   ├── __init__.py
│   │   │   ├── api_client.py       # Shared HTTP client (rate-limit, retry)
│   │   │   └── stats_tracker.py    # Per-run stats → mining_stats.md
│   │   └── pipeline/
│   │       ├── __init__.py
│   │       ├── phase1_discover/    # GitHub and Codeberg searchers
│   │       │   ├── __init__.py
│   │       │   ├── github.py
│   │       │   └── codeberg.py
│   │       ├── phase2_clone/       # Full clone + TOSCA extraction + commit history
│   │       ├── phase3_validate/    # Three-tier quality model + metadata
│   │       └── phase4_build/       # Merge → CSV + Parquet + summary
│   └── analysis/
│       └── generate_figures.py     # Generates paper figures from summary.json
│
├── runs/                           # Pushed to git — one directory per run
│   └── <run_name>/
│       ├── config.yaml             # Snapshot of the config used
│       ├── discovered_repos.json   # Phase 1 output
│       ├── clone_progress.json     # Phase 2 progress (resumable)
│       ├── tosca_file_commits.csv  # Phase 2 output — per-file commit history
│       ├── tosca_metadata.json     # Phase 3 output
│       ├── tosca_dataset.csv       # Final dataset (CSV)
│       ├── tosca_dataset.parquet   # Final dataset (Parquet)
│       ├── summary.json            # Run statistics
│       ├── mining_stats.md         # Per-phase statistics
│       └── figures/                # Generated figures (optional)
│
└── artifacts/                      # Gitignored — large files
    └── <run_name>/
        ├── repos/                  # Cloned repositories
        ├── tosca_files/            # Extracted TOSCA files
        └── dataset/                # Organized per-repo file structure (Phase 4)

Prerequisites

Python 3.11
uv — dependency management and execution
Git
GitHub Personal Access Token with public_repo scope (required for discovery)
Codeberg token — optional, the API is public without one

Setup

# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies (reads uv.lock for exact reproducibility)
uv sync

# 3. Configure API tokens
cp .env.example .env
# Edit .env and set at minimum:
#   GITHUB_TOKEN=ghp_...

Running the Pipeline

The tool is invoked via uv run tosca-mine. The run_name in config.yaml identifies the output directory.

# Run the full pipeline
uv run tosca-mine --phase all

# Run a single phase
uv run tosca-mine --phase discover
uv run tosca-mine --phase clone
uv run tosca-mine --phase validate
uv run tosca-mine --phase build

# Resume from a specific phase (earlier phases untouched)
uv run tosca-mine --from-phase clone      # clone → validate → build
uv run tosca-mine --from-phase validate   # validate → build

# Redo a phase from scratch
uv run tosca-mine --phase validate --overwrite

# Create a new isolated run
uv run tosca-mine --phase all --new-run           # runs/default_20260519_143022/
uv run tosca-mine --phase all --config nfv.yaml   # runs/<run_name in nfv.yaml>/

# Test with a small subset
uv run tosca-mine --phase all --max-repos 10 --keep-repos

Run isolation: the tool never silently overwrites or merges existing runs. If a run directory already has output, it stops with an error instead of writing over it. To redo a phase deliberately, pass --overwrite; to keep runs isolated and comparable, use a different run_name in config.yaml (or --new-run).
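
The run-isolation rule can be pictured with a small guard. This is a minimal sketch, not the tool's actual code: the names timestamped_run_name, prepare_run_dir and RunExistsError are hypothetical, and the timestamp format is only inferred from the example directory name default_20260519_143022 shown above.

```python
from datetime import datetime
from pathlib import Path


class RunExistsError(RuntimeError):
    """Raised when a run directory already holds output (hypothetical name)."""


def timestamped_run_name(base: str, now: datetime) -> str:
    # Assumed format, inferred from the example "default_20260519_143022".
    return f"{base}_{now:%Y%m%d_%H%M%S}"


def prepare_run_dir(runs_root: Path, run_name: str, overwrite: bool = False) -> Path:
    """Create runs/<run_name>/ and refuse to reuse a populated directory."""
    run_dir = runs_root / run_name
    if run_dir.exists() and any(run_dir.iterdir()) and not overwrite:
        raise RunExistsError(f"{run_dir} already has output; choose another run_name")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Failing loudly by default keeps two runs from being mixed by accident; reusing a directory has to be an explicit choice.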

Pipeline Phases

Phase 1 — Discovery

Searches two forges for TOSCA repositories:

Forge	Strategies	Auth
GitHub	Keyword search, Code search, Topic search	Token required
Codeberg	Keyword search	Token optional

GitHub supports three complementary strategies:

Keyword search — repository name/description search with TOSCA-related terms
Code search — searches for known tosca_definitions_version strings in YAML files (date-partitioned to work around the 1,000-result API cap)
Topic search — repository search by topic tags (e.g. tosca, cloudify, alien4cloud)

Results are saved incrementally to discovered_repos.json after each strategy.
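
Date partitioning, as mentioned for the code search strategy, amounts to splitting a long time span into windows small enough that each query stays under the result cap. The sketch below shows only that splitting step. The function name date_windows is hypothetical, and the way the pipeline turns a window into a search query is not described here, so it is left out.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield inclusive (first_day, last_day) windows that cover [start, end]."""
    if days < 1:
        raise ValueError("days must be >= 1")
    cursor = start
    while cursor <= end:
        # Clamp the final window so it never runs past the requested end date.
        last = min(cursor + timedelta(days=days - 1), end)
        yield cursor, last
        cursor = last + timedelta(days=1)
```

A window that still returns the maximum number of results would presumably need to be split again with a smaller days value; that refinement is an assumption, not documented behaviour.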

Phase 2 — Clone & Extract

Full-clones each discovered repository, walks the file tree for .yaml/.yml files, and copies any file whose first 2 KB contains a TOSCA marker (tosca_definitions_version, cloudify_dsl, alien_dsl, etc.) to artifacts/<run>/tosca_files/. Failed clones are retried with exponential backoff.
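
The marker check described above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, MARKERS is only the subset of markers named in this README (the full list lives under cloner.tosca_markers), skip_dirs handling is omitted, and the flat filename scheme is taken from the `filename` column described in the Dataset Schema section.

```python
from pathlib import Path, PurePosixPath
from typing import Iterable, Iterator

# Illustrative subset; the real list is configured in cloner.tosca_markers.
MARKERS = ("tosca_definitions_version", "cloudify_dsl", "alien_dsl")
HEAD_BYTES = 2048  # "first 2 KB"


def iter_yaml_files(root: Path) -> Iterator[Path]:
    """Walk the tree and yield .yaml/.yml files (skip_dirs not handled here)."""
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in {".yaml", ".yml"}:
            yield path


def has_tosca_marker(path: Path, markers: Iterable[str] = MARKERS) -> bool:
    """Return True if any marker occurs in the first 2 KB of the file."""
    with path.open("rb") as handle:
        head = handle.read(HEAD_BYTES).decode("utf-8", errors="replace")
    return any(marker in head for marker in markers)


def flat_filename(repo_full_name: str, original_path: str) -> str:
    """owner/repo + path/to/file.yaml -> owner__repo__path__to__file.yaml"""
    owner, repo = repo_full_name.split("/", 1)
    return "__".join([owner, repo, *PurePosixPath(original_path).parts])
```

Reading only the head of each file keeps the scan cheap on large repositories, at the cost of missing files whose marker appears after a long leading comment block.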

Per-file commit history is extracted via git log --follow and written incrementally to runs/<run>/tosca_file_commits.csv (schema: filename, repo_full_name, original_path, commit_hash, commit_date, author_name, author_email, subject). This enables longitudinal analysis of TOSCA file evolution.
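
As a rough illustration of the commit-history step, the sketch below runs git log --follow with a machine-readable format and parses the result into the commit-level fields of the schema above. The function names, the separator choice, and the use of the author date (%aI) rather than the committer date are assumptions; the README does not say which format string the pipeline uses.

```python
import subprocess
from typing import Dict, List

FIELDS = ["commit_hash", "commit_date", "author_name", "author_email", "subject"]
SEP = "\x1f"  # ASCII unit separator, unlikely to appear in a commit subject
GIT_FORMAT = "%H%x1f%aI%x1f%an%x1f%ae%x1f%s"  # assumed format, one commit per line


def parse_git_log(output: str) -> List[Dict[str, str]]:
    """Turn the formatted git log output into one dict per commit."""
    rows = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        parts = line.split(SEP, len(FIELDS) - 1)
        if len(parts) == len(FIELDS):
            rows.append(dict(zip(FIELDS, parts)))
    return rows


def file_history(repo_dir: str, path: str) -> List[Dict[str, str]]:
    """Return the commit history of one file, following renames."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--follow", f"--format={GIT_FORMAT}", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_git_log(result.stdout)
```

The remaining columns (filename, repo_full_name, original_path) would be attached by the caller, since they describe the file rather than the commit.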

Phase 3 — Validation & Metadata Extraction

Applies a three-tier, dialect-agnostic quality model to each extracted file:

Tier	Name	Criterion
1	Parseable	Valid YAML + mapping root + non-empty `tosca_definitions_version`
2	Structurally meaningful	Passes 6 structural coherence checks
3	Version-classified	Version string matches one of 25 known spec versions

Structural metadata (node types, templates, relationships, inputs, outputs, etc.) is extracted from all Tier 1 files and saved to tosca_metadata.json.
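
To make the tiers concrete, here is a minimal sketch of the Tier 1 and Tier 3 checks applied to a document that has already been parsed by a YAML library (for example PyYAML's yaml.safe_load, which is an assumption about the toolchain). The function names are hypothetical, KNOWN_VERSIONS is an illustrative subset rather than the 25 versions the pipeline recognises, and the six Tier 2 checks are not reproduced because they are not listed here.

```python
from typing import Any

# Illustrative subset only; the pipeline's full list of 25 versions is not shown here.
KNOWN_VERSIONS = {
    "tosca_simple_yaml_1_0",
    "tosca_simple_yaml_1_1",
    "tosca_simple_yaml_1_2",
    "tosca_simple_yaml_1_3",
}


def is_tier1(doc: Any) -> bool:
    """Tier 1 on a parsed document: mapping root and non-empty version field."""
    if not isinstance(doc, dict):
        return False
    version = doc.get("tosca_definitions_version")
    return version is not None and str(version).strip() != ""


def is_known_version(doc: dict) -> bool:
    """Tier 3: the version string matches a recognised spec version."""
    version = str(doc.get("tosca_definitions_version", "")).strip()
    return version in KNOWN_VERSIONS
```

A file that fails to parse as YAML never reaches these functions, which is how the "Valid YAML" part of Tier 1 would be enforced in this sketch.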

Phase 4 — Dataset Construction

Merges repository-level metadata (Phase 1) with file-level metadata (Phase 3). Exports the final dataset as CSV and Parquet to runs/<run_name>/, organizes TOSCA files by repository in artifacts/<run>/dataset/, and writes summary.json to both locations.
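
The merge step is essentially a left join of file rows onto repository rows, keyed by repo_full_name. The sketch below uses only the standard library; merge_rows and to_csv are hypothetical names, and the assumption that Phase 1 stores repository fields without the repo_ prefix (for example "stars") is made purely for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List


def merge_rows(files: Iterable[dict], repos: Dict[str, dict]) -> List[dict]:
    """Attach repo-level fields (prefixed with repo_) to each file-level row."""
    merged = []
    for file_row in files:
        row = dict(file_row)
        repo = repos.get(file_row["repo_full_name"], {})
        for key, value in repo.items():
            # Never overwrite a column the file row already provides.
            row.setdefault(f"repo_{key}", value)
        merged.append(row)
    return merged


def to_csv(rows: List[dict]) -> str:
    """Serialise rows to CSV text, using the union of keys as the header."""
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```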

Generating Figures

# Default output: runs/<run_name>/figures/
uv run python scripts/analysis/generate_figures.py --run <run_name>

# Custom output directory
uv run python scripts/analysis/generate_figures.py --run <run_name> --out path/to/output

Dataset Schema

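Each row in tosca_dataset.csv / tosca_dataset.parquet represents one TOSCA file that passed Tier 1 validation; the meaningful and known_version columns record its Tier 2 and Tier 3 results.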

File-level columns

Column	Description
`filename`	Flat filename: `owner__repo__path__to__file.yaml`
`original_path`	Original path within the repository
`tosca_version`	Version string, e.g. `tosca_simple_yaml_1_3`
`tosca_profile`	Dialect classification (e.g. `tosca_simple_yaml_1_X`, `cloudify`, `alien4cloud`)
`known_version`	Whether the version matches a recognised spec version
`meaningful`	Passes all Tier 2 structural coherence checks
`coherence_warnings`	JSON list of Tier 2 issue codes
`description`	File-level description field if present
`has_topology_template`	Contains a `topology_template` (TOSCA v1.x) or `service_template` (TOSCA 2.0) block
`has_imports`	Imports other definitions
`node_types_count`	Number of node type definitions
`node_types`	JSON list of node type names
`relationship_types_count`	Number of relationship type definitions
`relationship_templates_count`	Number of relationship template instantiations
`capability_types_count`	Number of capability type definitions
`data_types_count`	Number of data type definitions
`policy_types_count`	Number of policy type definitions
`group_types_count`	Number of group type definitions
`artifact_types_count`	Number of artifact type definitions
`interface_types_count`	Number of interface type definitions
`node_templates_count`	Number of node template instantiations
`node_templates`	JSON list of node template names
`inputs_count`	Number of topology input parameters
`outputs_count`	Number of topology output parameters
`file_size_bytes`	File size in bytes
`line_count`	Number of lines

Repository-level columns

Column	Description
`repo_full_name`	`owner/repo` identifier
`repo_owner`	Repository owner (user or organisation)
`repo_name`	Repository name
`repo_html_url`	URL to the repository
`repo_description`	Repository description
`repo_stars`	Star count
`repo_forks`	Fork count
`repo_language`	Primary programming language
`repo_topics`	Topics/tags (JSON list)
`repo_last_updated`	Last update date (ISO 8601)
`repo_created_at`	Creation date (ISO 8601)
`repo_size_kb`	Repository size in KB
`repo_fork`	Whether this repo is a fork of another
`repo_license`	SPDX license identifier
`repo_open_issues`	Number of open issues
`repo_watchers`	Watcher/star count
`repo_archived`	Whether the repository is archived
`repo_disabled`	Whether the repository is disabled
`repo_visibility`	Visibility (`public`, `private`)
`source_queries`	Queries that matched this repo (JSON list)
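
Several columns are described as JSON lists, so they arrive as strings when the CSV is read and need decoding before use. The following sketch shows one way to do that with the standard library; load_rows is a hypothetical helper, and the set of JSON-encoded columns is taken from the two tables above.

```python
import csv
import json
from typing import IO, List

# Columns documented above as JSON lists.
JSON_LIST_COLUMNS = (
    "node_types",
    "node_templates",
    "coherence_warnings",
    "repo_topics",
    "source_queries",
)


def load_rows(handle: IO[str]) -> List[dict]:
    """Read dataset rows and decode the JSON-list columns into Python lists."""
    rows = []
    for row in csv.DictReader(handle):
        for column in JSON_LIST_COLUMNS:
            if row.get(column):
                row[column] = json.loads(row[column])
        rows.append(row)
    return rows
```

Typical usage would be to open runs/<run_name>/tosca_dataset.csv with newline="" and pass the file object to load_rows.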

Configuration

All queries, markers, and run settings live in scripts/mining/config.yaml:

run_name: default          # identifies the output directory (runs/default/)

github:
  tokens:
    - "${GITHUB_TOKEN}"
  repo_search_queries: [...] # keyword queries
  code_search_queries: [...] # version-string code search queries
  topic_queries: [...]       # topic queries

codeberg:
  tokens:
    - "${CODEBERG_TOKEN}"  # optional
  repo_search_queries: [...]

cloner:
  tosca_markers: [...]       # strings checked in first 2KB of each YAML file
  retry_attempts: 3
  retry_delay_seconds: 5
  skip_dirs: [...]
  delete_after: false        # set to true to reclaim disk after extraction
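
The "${GITHUB_TOKEN}" placeholders imply that environment variables are substituted when the config is loaded. How loader.py does this is not shown here; the sketch below is one plausible approach, with hypothetical function names, applied to a config that has already been parsed into Python dicts and lists.

```python
import os
import re
from typing import Any, List, Mapping

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: Any, env: Mapping[str, str] = os.environ) -> Any:
    """Recursively replace ${NAME} placeholders; unset variables become ''."""
    if isinstance(value, str):
        return _VAR.sub(lambda match: env.get(match.group(1), ""), value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value


def usable_tokens(tokens: List[str]) -> List[str]:
    """Drop tokens that expanded to an empty string (e.g. unset CODEBERG_TOKEN)."""
    return [token for token in tokens if token]
```

Treating an unset variable as an empty token, then filtering it out, would let the optional Codeberg token be omitted without a separate config file.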

To run with different queries, create a new config file with a different run_name:

uv run tosca-mine --config scripts/mining/nfv.yaml --phase all

Reproducibility

uv.lock pins the exact dependency tree; uv sync installs an identical environment on any machine.
Each run saves a config.yaml snapshot inside runs/<run_name>/ so the exact queries are always recorded alongside the output.
runs/ is pushed to this repository; artifacts/ is gitignored.
The pipeline guarantees process replicability, not result reproducibility: repository churn (repos created/deleted/made private) means repeated discovery runs may yield slightly different results (~0.3% variance in final file count across runs).
