jade-lab/TOSCAmine

TOSCAmine

A configurable, reproducible pipeline for mining and curating TOSCA (Topology and Orchestration Specification for Cloud Applications) blueprints from public code repositories.


Overview

TOSCAmine discovers, clones, validates, and assembles a curated dataset of TOSCA blueprints from GitHub and Codeberg. The pipeline is organized into four phases, each producing a checkpoint saved in runs/<run_name>/. Large artifacts (cloned repos, raw extracted files) are stored in artifacts/, which is gitignored.


Repository Structure

tosca-mining-framework/
├── README.md
├── pyproject.toml                  # uv/hatchling config, entry point
├── uv.lock                         # pinned dependency tree
├── .python-version                 # Python 3.11
├── .env.example                    # Template for API tokens
│
├── scripts/
│   ├── mining/                     # Pipeline source code
│   │   ├── __init__.py
│   │   ├── main.py                 # CLI orchestrator
│   │   ├── config.yaml             # Queries, markers, run settings
│   │   ├── config/
│   │   │   ├── __init__.py
│   │   │   └── loader.py           # Config loading, path derivation
│   │   ├── utils/
│   │   │   ├── __init__.py
│   │   │   ├── api_client.py       # Shared HTTP client (rate-limit, retry)
│   │   │   └── stats_tracker.py    # Per-run stats → mining_stats.md
│   │   └── pipeline/
│   │       ├── __init__.py
│   │       ├── phase1_discover/    # GitHub and Codeberg searchers
│   │       │   ├── __init__.py
│   │       │   ├── github.py
│   │       │   └── codeberg.py
│   │       ├── phase2_clone/       # Full clone + TOSCA extraction + commit history
│   │       ├── phase3_validate/    # Three-tier quality model + metadata
│   │       └── phase4_build/       # Merge → CSV + Parquet + summary
│   └── analysis/
│       └── generate_figures.py     # Generates paper figures from summary.json
│
├── runs/                           # Pushed to git — one directory per run
│   └── <run_name>/
│       ├── config.yaml             # Snapshot of the config used
│       ├── discovered_repos.json   # Phase 1 output
│       ├── clone_progress.json     # Phase 2 progress (resumable)
│       ├── tosca_file_commits.csv  # Phase 2 output — per-file commit history
│       ├── tosca_metadata.json     # Phase 3 output
│       ├── tosca_dataset.csv       # Final dataset (CSV)
│       ├── tosca_dataset.parquet   # Final dataset (Parquet)
│       ├── summary.json            # Run statistics
│       ├── mining_stats.md         # Per-phase statistics
│       └── figures/                # Generated figures (optional)
│
└── artifacts/                      # Gitignored — large files
    └── <run_name>/
        ├── repos/                  # Cloned repositories
        ├── tosca_files/            # Extracted TOSCA files
        └── dataset/                # Organized per-repo file structure (Phase 4)

Prerequisites

  • Python 3.11
  • uv — dependency management and execution
  • Git
  • GitHub Personal Access Token with public_repo scope (required for discovery)
  • Codeberg token — optional, the API is public without one

Setup

# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies (reads uv.lock for exact reproducibility)
uv sync

# 3. Configure API tokens
cp .env.example .env
# Edit .env and set at minimum:
#   GITHUB_TOKEN=ghp_...

Running the Pipeline

The tool is invoked via uv run tosca-mine. The run_name in config.yaml identifies the output directory.

# Run the full pipeline
uv run tosca-mine --phase all

# Run a single phase
uv run tosca-mine --phase discover
uv run tosca-mine --phase clone
uv run tosca-mine --phase validate
uv run tosca-mine --phase build

# Resume from a specific phase (earlier phases untouched)
uv run tosca-mine --from-phase clone      # clone → validate → build
uv run tosca-mine --from-phase validate   # validate → build

# Redo a phase from scratch
uv run tosca-mine --phase validate --overwrite

# Create a new isolated run
uv run tosca-mine --phase all --new-run           # runs/default_20260519_143022/
uv run tosca-mine --phase all --config nfv.yaml   # runs/<run_name in nfv.yaml>/

# Test with a small subset
uv run tosca-mine --phase all --max-repos 10 --keep-repos

Run isolation: the tool never silently overwrites or merges existing runs. If a run directory already has output, it stops with an error. Use a different run_name in config.yaml (or --new-run) to keep runs isolated and comparable.
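
The isolation rule can be pictured with a small sketch. It is not the tool's actual code: the function names are hypothetical, and the timestamp layout is inferred only from the example directory name runs/default_20260519_143022/ shown above.

```python
from datetime import datetime
from pathlib import Path


def timestamped_run_name(base: str, now: datetime) -> str:
    """Append a YYYYMMDD_HHMMSS suffix (layout inferred from the README example)."""
    return f"{base}_{now.strftime('%Y%m%d_%H%M%S')}"


def ensure_fresh_run_dir(run_dir: Path) -> None:
    """Refuse to write into a run directory that already holds output."""
    if run_dir.exists() and any(run_dir.iterdir()):
        raise FileExistsError(
            f"{run_dir} already contains output; use another run_name or --new-run"
        )
    run_dir.mkdir(parents=True, exist_ok=True)
```

Failing loudly rather than merging keeps two runs with different queries from being mixed into one dataset by accident.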


Pipeline Phases

Phase 1 — Discovery

Searches two forges for TOSCA repositories:

| Forge    | Strategies                                | Auth           |
|----------|-------------------------------------------|----------------|
| GitHub   | Keyword search, Code search, Topic search | Token required |
| Codeberg | Keyword search                            | Token optional |

GitHub supports three complementary strategies:

  • Keyword search — repository name/description search with TOSCA-related terms
  • Code search — searches for known tosca_definitions_version strings in YAML files (date-partitioned to work around the 1,000-result API cap)
  • Topic search — repository search by topic tags (e.g. tosca, cloudify, alien4cloud)

Results are saved incrementally to discovered_repos.json after each strategy.
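
A minimal sketch of how results from several strategies could be merged and checkpointed after each one. The function names, the full_name key, and the query strings in the usage note are assumptions for illustration; only discovered_repos.json and the source_queries field come from this README.

```python
import json
from pathlib import Path


def merge_results(known: dict, found: list, query: str) -> dict:
    """Merge one strategy's hits into `known`, keyed by repository full name."""
    for repo in found:
        # First sighting: copy the repo record and start its query list.
        entry = known.setdefault(repo["full_name"], {**repo, "source_queries": []})
        if query not in entry["source_queries"]:
            entry["source_queries"].append(query)
    return known


def save_checkpoint(known: dict, path: Path) -> None:
    """Write to a temp file, then rename, so an interrupted run leaves a valid file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(list(known.values()), indent=2), encoding="utf-8")
    tmp.replace(path)
```

For example, calling merge_results(known, hits, "topic:tosca") and then save_checkpoint(known, Path("discovered_repos.json")) after every strategy means a crash in a later strategy does not lose earlier results.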

Phase 2 — Clone & Extract

Full-clones each discovered repository, walks the file tree for .yaml/.yml files, and copies any file whose first 2 KB contains a TOSCA marker (tosca_definitions_version, cloudify_dsl, alien_dsl, etc.) to artifacts/<run>/tosca_files/. Failed clones are retried with exponential backoff.
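
The marker check described above can be sketched as follows. This is an illustration, not the pipeline's source: the marker tuple is only the subset named in this README (the full list lives under cloner.tosca_markers), and the function names and the default skip_dirs value are hypothetical.

```python
from pathlib import Path

# Illustrative subset; the real list is configured in config.yaml.
TOSCA_MARKERS = ("tosca_definitions_version", "cloudify_dsl", "alien_dsl")
HEAD_BYTES = 2048  # "first 2 KB"


def looks_like_tosca(path: Path, markers=TOSCA_MARKERS) -> bool:
    """Return True if any marker occurs in the first 2 KB of the file."""
    with path.open("rb") as fh:
        head = fh.read(HEAD_BYTES).decode("utf-8", errors="ignore")
    return any(marker in head for marker in markers)


def find_candidates(root: Path, skip_dirs=frozenset({".git"})):
    """Yield .yaml/.yml files under root whose head contains a TOSCA marker."""
    for path in root.rglob("*"):
        if path.suffix.lower() not in {".yaml", ".yml"} or not path.is_file():
            continue
        # Ignore anything located inside a skipped directory.
        if skip_dirs.intersection(path.relative_to(root).parts):
            continue
        if looks_like_tosca(path):
            yield path
```

Reading only the head keeps the scan cheap on large repositories, at the cost of missing files whose marker appears after a long leading comment block.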

Per-file commit history is extracted via git log --follow and written incrementally to runs/<run>/tosca_file_commits.csv (schema: filename, repo_full_name, original_path, commit_hash, commit_date, author_name, author_email, subject). This enables longitudinal analysis of TOSCA file evolution.
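
One way to obtain the commit-level fields of that schema is to give git log a delimiter-separated format and parse each line. The sketch below is an assumption about how this could be done, not the pipeline's code: the format string, the use of the committer date (%cI) for commit_date, and the function names are all illustrative.

```python
import subprocess

FIELDS = ("commit_hash", "commit_date", "author_name", "author_email", "subject")
# %x1f emits an ASCII unit separator, which is unlikely to occur in subjects.
LOG_FORMAT = "%H%x1f%cI%x1f%an%x1f%ae%x1f%s"


def parse_log(output: str) -> list:
    """Turn `git log --format=LOG_FORMAT` output into a list of dicts."""
    rows = []
    for line in output.split("\n"):
        if not line:
            continue
        rows.append(dict(zip(FIELDS, line.split("\x1f", len(FIELDS) - 1))))
    return rows


def file_history(repo_dir: str, path: str) -> list:
    """Run git log --follow for one file inside a cloned repository."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--follow", f"--format={LOG_FORMAT}", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_log(result.stdout)
```

The remaining columns (filename, repo_full_name, original_path) are known to the caller and would be added to each row before it is appended to the CSV.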

Phase 3 — Validation & Metadata Extraction

Applies a three-tier, dialect-agnostic quality model to each extracted file:

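| Tier | Name                    | Criterion                                                        |
|------|-------------------------|------------------------------------------------------------------|
| 1    | Parseable               | Valid YAML + mapping root + non-empty tosca_definitions_version  |
| 2    | Structurally meaningful | Passes 6 structural coherence checks                             |
| 3    | Version-classified      | Version string matches one of 25 known spec versions             |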

Structural metadata (node types, templates, relationships, inputs, outputs, etc.) is extracted from all Tier 1 files and saved to tosca_metadata.json.
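
As a rough illustration of Tiers 1 and 3, the sketch below works on a document that has already been parsed by a YAML library (parsing is left to the caller). The function names are hypothetical, and KNOWN_VERSIONS lists only a few real TOSCA Simple Profile version strings rather than the 25 the pipeline recognises. Tier 2 is omitted because its six checks are not enumerated in this README.

```python
# Illustrative subset of known version strings, not the pipeline's full list.
KNOWN_VERSIONS = frozenset({
    "tosca_simple_yaml_1_0",
    "tosca_simple_yaml_1_1",
    "tosca_simple_yaml_1_2",
    "tosca_simple_yaml_1_3",
})


def is_tier1(doc) -> bool:
    """Tier 1: mapping root with a non-empty tosca_definitions_version."""
    if not isinstance(doc, dict):
        return False
    version = doc.get("tosca_definitions_version")
    return version is not None and str(version).strip() != ""


def has_known_version(doc, known=KNOWN_VERSIONS) -> bool:
    """Tier 3: the version string matches a recognised spec version."""
    return is_tier1(doc) and str(doc["tosca_definitions_version"]).strip() in known
```

A file that fails to parse as YAML never reaches these checks, which corresponds to failing the "Valid YAML" part of Tier 1.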

Phase 4 — Dataset Construction

Merges repository-level metadata (Phase 1) with file-level metadata (Phase 3). Exports the final dataset as CSV and Parquet to runs/<run_name>/, organizes TOSCA files by repository in artifacts/<run>/dataset/, and writes summary.json to both locations.
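
The merge step amounts to a join of file rows against repository rows. The sketch below uses only the standard library and assumes both sides carry a repo_full_name key; that join key, the function names, and the choice to drop files without a matching repository are assumptions, not documented behaviour. Parquet export would need an additional library and is left out.

```python
import csv


def merge_rows(file_rows: list, repo_rows: list) -> list:
    """Attach repository-level fields to each file-level row."""
    repos = {repo["repo_full_name"]: repo for repo in repo_rows}
    merged = []
    for file_row in file_rows:
        repo = repos.get(file_row["repo_full_name"])
        if repo is None:
            continue  # assumption: files without repo metadata are skipped
        merged.append({**file_row, **repo})
    return merged


def write_csv(rows: list, fh) -> None:
    """Write merged rows to an open text file, using the first row's keys as header."""
    if not rows:
        return
    writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```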


Generating Figures

# Default output: runs/<run_name>/figures/
uv run python scripts/analysis/generate_figures.py --run <run_name>

# Custom output directory
uv run python scripts/analysis/generate_figures.py --run <run_name> --out path/to/output

Dataset Schema

Each row in tosca_dataset.csv / tosca_dataset.parquet represents one valid TOSCA file.

File-level columns

| Column                       | Description                                                                      |
|------------------------------|----------------------------------------------------------------------------------|
| filename                     | Flat filename: owner__repo__path__to__file.yaml                                  |
| original_path                | Original path within the repository                                              |
| tosca_version                | Version string, e.g. tosca_simple_yaml_1_3                                       |
| tosca_profile                | Dialect classification (e.g. tosca_simple_yaml_1_X, cloudify, alien4cloud)       |
| known_version                | Whether the version matches a recognised spec version                            |
| meaningful                   | Passes all Tier 2 structural coherence checks                                    |
| coherence_warnings           | JSON list of Tier 2 issue codes                                                  |
| description                  | File-level description field if present                                          |
| has_topology_template        | Contains a topology_template (TOSCA v1.x) or service_template (TOSCA 2.0) block  |
| has_imports                  | Imports other definitions                                                        |
| node_types_count             | Number of node type definitions                                                  |
| node_types                   | JSON list of node type names                                                     |
| relationship_types_count     | Number of relationship type definitions                                          |
| relationship_templates_count | Number of relationship template instantiations                                   |
| capability_types_count       | Number of capability type definitions                                            |
| data_types_count             | Number of data type definitions                                                  |
| policy_types_count           | Number of policy type definitions                                                |
| group_types_count            | Number of group type definitions                                                 |
| artifact_types_count         | Number of artifact type definitions                                              |
| interface_types_count        | Number of interface type definitions                                             |
| node_templates_count         | Number of node template instantiations                                           |
| node_templates               | JSON list of node template names                                                 |
| inputs_count                 | Number of topology input parameters                                              |
| outputs_count                | Number of topology output parameters                                             |
| file_size_bytes              | File size in bytes                                                               |
| line_count                   | Number of lines                                                                  |
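
The flat filename pattern shown for the filename column (owner__repo__path__to__file.yaml) suggests that path separators are replaced with a double underscore. A minimal sketch under that assumption, with a hypothetical function name:

```python
def flat_filename(repo_full_name: str, original_path: str) -> str:
    """Join owner, repo, and path components with a double underscore."""
    owner, repo = repo_full_name.split("/", 1)
    parts = [owner, repo, *original_path.strip("/").split("/")]
    return "__".join(parts)
```

For example, flat_filename("owner/repo", "path/to/file.yaml") gives "owner__repo__path__to__file.yaml". The mapping is not reversible when a name itself contains "__", which is one reason to keep original_path and repo_full_name as separate columns.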

Repository-level columns

| Column            | Description                                 |
|-------------------|---------------------------------------------|
| repo_full_name    | owner/repo identifier                       |
| repo_owner        | Repository owner (user or organisation)     |
| repo_name         | Repository name                             |
| repo_html_url     | URL to the repository                       |
| repo_description  | Repository description                      |
| repo_stars        | Star count                                  |
| repo_forks        | Fork count                                  |
| repo_language     | Primary programming language                |
| repo_topics       | Topics/tags (JSON list)                     |
| repo_last_updated | Last update date (ISO 8601)                 |
| repo_created_at   | Creation date (ISO 8601)                    |
| repo_size_kb      | Repository size in KB                       |
| repo_fork         | Whether this repo is a fork of another      |
| repo_license      | SPDX license identifier                     |
| repo_open_issues  | Number of open issues                       |
| repo_watchers     | Watcher/star count                          |
| repo_archived     | Whether the repository is archived          |
| repo_disabled     | Whether the repository is disabled          |
| repo_visibility   | Visibility (public, private)                |
| source_queries    | Queries that matched this repo (JSON list)  |
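
Several columns are described as JSON lists, so they arrive as strings when the CSV is read. The sketch below shows one way to decode them with the standard library; the function name is hypothetical, and it assumes those cells hold valid JSON text. How booleans and counts are serialized is not stated here, so they are left as strings.

```python
import csv
import json

# Columns documented above as JSON lists.
JSON_COLUMNS = (
    "coherence_warnings",
    "node_types",
    "node_templates",
    "repo_topics",
    "source_queries",
)


def load_rows(fh) -> list:
    """Read dataset rows from an open CSV file and decode JSON-list columns."""
    rows = []
    for row in csv.DictReader(fh):
        for col in JSON_COLUMNS:
            if row.get(col):
                row[col] = json.loads(row[col])
        rows.append(row)
    return rows
```

Typical use would be opening runs/<run_name>/tosca_dataset.csv with newline="" and encoding="utf-8" and passing the file object to load_rows.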

Configuration

All queries, markers, and run settings live in scripts/mining/config.yaml:

run_name: default          # identifies the output directory (runs/default/)

github:
  tokens:
    - "${GITHUB_TOKEN}"
  repo_search_queries: [...] # keyword queries
  code_search_queries: [...] # version-string code search queries
  topic_queries: [...]       # topic queries

codeberg:
  tokens:
    - "${CODEBERG_TOKEN}"  # optional
  repo_search_queries: [...]

cloner:
  tosca_markers: [...]       # strings checked in first 2KB of each YAML file
  retry_attempts: 3
  retry_delay_seconds: 5
  skip_dirs: [...]
  delete_after: false        # set to true to reclaim disk after extraction

To run with different queries, create a new config file with a different run_name:

uv run tosca-mine --config scripts/mining/nfv.yaml --phase all
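
The "${GITHUB_TOKEN}" placeholders in the config imply some environment substitution at load time. How loader.py does this is not described in this README; the sketch below is one plausible approach using os.path.expandvars on an already-parsed config, with hypothetical function names.

```python
import os


def expand_env(value):
    """Recursively expand ${VAR} placeholders in strings inside a parsed config."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value


def usable_tokens(tokens: list) -> list:
    """Drop empty entries and placeholders that were left unexpanded."""
    return [token for token in tokens if token and "${" not in token]
```

os.path.expandvars leaves references to unset variables unchanged, so filtering them out afterwards lets an optional token such as CODEBERG_TOKEN simply be absent.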

Reproducibility

  • uv.lock pins the exact dependency tree; uv sync installs an identical environment on any machine.
  • Each run saves a config.yaml snapshot inside runs/<run_name>/ so the exact queries are always recorded alongside the output.
  • runs/ is pushed to this repository; artifacts/ is gitignored.
  • The pipeline guarantees process replicability, not result reproducibility: repository churn (repos created/deleted/made private) means repeated discovery runs may yield slightly different results (~0.3% variance in final file count across runs).
