LogCopilot

Overview

LogCopilot is an LLM-based log analysis framework. It extracts structured log knowledge from raw logs, retrieves that knowledge for natural-language log analysis questions, generates LogQL queries, executes them against Grafana Loki, and synthesizes final answers.

The implementation is organized around two pipelines:

Offline indexing: log preprocessing, template matching, sequence/workflow mining, parameter extraction, template/parameter/workflow summarization, and embedding generation.
Online analysis: question dispatch, knowledge retrieval, LogQL generation/execution with repair, knowledge refinement, and answer synthesis.
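For orientation, the "execution" step of the online pipeline comes down to an HTTP request against Loki's query_range endpoint. The sketch below only builds such a request URL and is independent of LogCopilot's own executor; the helper name build_query_range_url and the stream selector {application="hdfs"} are assumptions for illustration, and the labels in the packaged Loki image may differ.

```python
from urllib.parse import urlencode


def build_query_range_url(
    base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100
) -> str:
    """Build a URL for Loki's /loki/api/v1/query_range HTTP endpoint."""
    params = {
        "query": logql,
        "start": start_ns,  # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical selector and line filter, used only to show the URL shape.
example_url = build_query_range_url(
    "http://localhost:3100",
    '{application="hdfs"} |= "Exception"',
    start_ns=0,
    end_ns=1_000_000_000,
)
```

Fetching example_url with any HTTP client returns a JSON document with the matching log streams, provided Loki is running and the selector matches real labels.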

Repository Contents

The repository contains the maintained Python package, Q&A sheets, template metadata, Loki configuration, and paper-support scripts.

Large artifacts are distributed separately through Zenodo:

Full raw logs for HDFS, OpenSSH, OpenStack, and TrainTicket.
A packaged Loki image with logs already loaded.
Generated knowledge files under storage/ for direct reproduction.

Requirements

Python 3.12 or newer
uv
Docker, when running the Loki-backed query flow
An OpenAI-compatible chat and embedding endpoint for LLM workflows

Install the package and development dependencies:

uv sync --extra cli --dev

Create a .env file in the repository root before running LLM or Loki-backed workflows:

OPENAI_API_KEY=<openai_api_key>
OPENAI_BASE_URL=<openai_base_url>
LOKI_BASE_URL=http://localhost:3100

If you use the default OpenAI endpoint, OPENAI_BASE_URL can be omitted.
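Before starting a long run, it can help to confirm that the .env file is complete. The following is a minimal standard-library sketch, not part of the package; the function names are hypothetical, and treating OPENAI_API_KEY and LOKI_BASE_URL as the required keys is an assumption based on the example above.

```python
from pathlib import Path

# Assumed required keys; OPENAI_BASE_URL is optional per the note above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "LOKI_BASE_URL")


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def check_env(path: Path = Path(".env")) -> list[str]:
    """Read the given .env file (if present) and list missing required keys."""
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    return missing_keys(parse_env_file(text))
```

Calling check_env() from the repository root returns an empty list when both keys are set.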

Datasets

The benchmark contains 400 questions, 100 for each system.

System	Time Span	# Messages	# Templates	Raw Size	# Questions	Type
HDFS	38.7h	11,167,740	46	1.5GB	100	Distributed system
OpenSSH	682.4h	638,947	38	68MB	100	Server application
OpenStack	64.4h	207,632	48	59MB	100	Distributed system
TrainTicket	652.1h	1,644,848	180	453MB	100	Microservice system
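The figures in the table can be cross-checked with a few lines of Python. This sketch copies the table values verbatim; the DatasetStats class and messages_per_hour helper are illustrative names, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetStats:
    system: str
    hours: float
    messages: int
    templates: int
    questions: int


# Values copied from the dataset table.
DATASETS = [
    DatasetStats("HDFS", 38.7, 11_167_740, 46, 100),
    DatasetStats("OpenSSH", 682.4, 638_947, 38, 100),
    DatasetStats("OpenStack", 64.4, 207_632, 48, 100),
    DatasetStats("TrainTicket", 652.1, 1_644_848, 180, 100),
]


def messages_per_hour(stats: DatasetStats) -> float:
    """Average log volume over the dataset's time span."""
    return stats.messages / stats.hours


total_questions = sum(d.questions for d in DATASETS)
total_messages = sum(d.messages for d in DATASETS)

for d in DATASETS:
    print(f"{d.system}: {messages_per_hour(d):,.0f} messages/hour")
```

The per-hour rate makes the difference in log density visible: HDFS packs most of the benchmark's messages into the shortest time span.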

The Q&A sheets and template files are under datasets/<application>/. Full raw logs and generated knowledge files are available from Zenodo.

Start Loki

For the fastest reproduction path, use the packaged Loki image from Zenodo:

docker load -i logcopilot_loki.tar

Start Loki with the provided configuration:

mkdir -p /home/loki/config
cp setup_loki/loki-config.yaml /home/loki/config/
docker run -d -v /home/loki/config:/mnt/config -p 3100:3100 logcopilot/loki:v3

On Windows, or on any system where /home/loki/config is not a suitable path, substitute an absolute host directory of your choice and mount it at /mnt/config in the container.

Check readiness:

curl localhost:3100/ready

Loki is ready when the command returns ready.
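When scripting the setup, the readiness check can be polled instead of run by hand. This is a minimal sketch using only the standard library; the function names, retry count, and delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from collections.abc import Callable


def fetch_ready(url: str = "http://localhost:3100/ready") -> str:
    """Return the body of Loki's /ready endpoint, or "" if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Covers connection failures and non-2xx responses alike.
        return ""


def wait_until_ready(
    fetch: Callable[[], str] = fetch_ready,
    attempts: int = 30,
    delay_s: float = 2.0,
) -> bool:
    """Poll until the endpoint answers "ready" or the attempts run out."""
    for attempt in range(attempts):
        if fetch().strip() == "ready":
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

The fetch parameter is injectable so the polling logic can be exercised without a running container.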

Prepare Knowledge Files

The online query flow expects generated knowledge files in storage/. Download the generated knowledge archive from Zenodo and place the files like this:

storage/
  hdfs/
    parameters_gpt-4o-2024-08-06.json
    parameters_embeddings_text-embedding-3-large.parquet
    templates_gpt-4o-2024-08-06.json
    templates_embeddings_text-embedding-3-large.parquet
    workflows_gpt-4o-2024-08-06.json
    workflows_embeddings_text-embedding-3-large.parquet
  openssh/
    ...
  openstack/
    ...
  trainticket/
    ...

The workflow files are used by knowledge Q&A. The query flow can run without workflow files, but they should be included for the full LogCopilot pipeline.
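A quick way to confirm the layout is to list which expected files are missing. The sketch below derives the file names from the pattern shown in the tree above; the helper names are hypothetical, and the default model names are simply the ones used in this README.

```python
from pathlib import Path

KINDS = ("parameters", "templates", "workflows")


def expected_files(
    application: str,
    chat_model: str = "gpt-4o-2024-08-06",
    embedding_model: str = "text-embedding-3-large",
    storage_dir: Path = Path("storage"),
) -> list[Path]:
    """List the six knowledge files expected for one application."""
    base = storage_dir / application
    files: list[Path] = []
    for kind in KINDS:
        files.append(base / f"{kind}_{chat_model}.json")
        files.append(base / f"{kind}_embeddings_{embedding_model}.parquet")
    return files


def missing_files(application: str, **kwargs) -> list[Path]:
    """Return the expected files that do not exist on disk."""
    return [path for path in expected_files(application, **kwargs) if not path.is_file()]


if __name__ == "__main__":
    for app in ("hdfs", "openssh", "openstack", "trainticket"):
        for path in missing_files(app):
            print(f"missing: {path}")
```

If only the workflow files are reported as missing, the query flow can still run, as noted above.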

Reproduce Query Flow

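Run a smoke test over the first 10 HDFS questions (judging by the examples in this section, the two values after --loc are the first and last question numbers):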

uv run python -m logcopilot.phases.query_flow --application hdfs --query_chat_model gpt-4o-2024-08-06 --loc 1 10

Outputs are written to results/hdfs/:

query_flow_process_<timestamp>.json: retrieved context, generated LogQL, execution results, and repair attempts.
query_flow_finalized_<timestamp>.json: final answer reports.

To run all questions for one system:

uv run python -m logcopilot.phases.query_flow --application hdfs --query_chat_model gpt-4o-2024-08-06 --loc 1 100

Change --application to openssh, openstack, or trainticket for the other datasets.
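Running all four datasets back to back can be scripted. The sketch below wraps the exact command shown above in a loop; the helper names are hypothetical, and it assumes uv is on the PATH and that the script is started from the repository root.

```python
import subprocess

APPLICATIONS = ("hdfs", "openssh", "openstack", "trainticket")


def query_flow_command(
    application: str,
    first: int = 1,
    last: int = 100,
    chat_model: str = "gpt-4o-2024-08-06",
) -> list[str]:
    """Build the query-flow command line for one application."""
    return [
        "uv", "run", "python", "-m", "logcopilot.phases.query_flow",
        "--application", application,
        "--query_chat_model", chat_model,
        "--loc", str(first), str(last),
    ]


def run_all(first: int = 1, last: int = 100) -> None:
    """Run the query flow for every dataset, stopping at the first failure."""
    for application in APPLICATIONS:
        subprocess.run(query_flow_command(application, first, last), check=True)
```

Calling run_all(1, 10) performs the 10-question smoke run for each system in turn.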

Rebuild Knowledge

The Zenodo storage/ files are the recommended path for artifact review. To rebuild the knowledge files from raw logs instead, first place each system's raw log file, named all.log, in its datasets/<application>/ directory, then run the indexing phases in the order shown below.

Rebuilding knowledge can take a long time and consumes LLM and embedding API tokens. Use the generated Zenodo files unless you specifically need to validate the indexing pipeline itself.

The following example rebuilds HDFS knowledge using gpt-4o-2024-08-06 and text-embedding-3-large:

uv run python -m logcopilot.phases.preprocess --application hdfs --datasets_dir datasets --storage_dir storage
uv run python -m logcopilot.phases.extract_parameters --application hdfs --datasets_dir datasets --storage_dir storage --extract_parameters.chat_model gpt-4o-2024-08-06
uv run python -m logcopilot.phases.summarize_parameter --application hdfs --datasets_dir datasets --storage_dir storage --summarize_parameter.chat_model gpt-4o-2024-08-06
uv run python -m logcopilot.phases.summarize_template --application hdfs --datasets_dir datasets --storage_dir storage --summarize_template.chat_model gpt-4o-2024-08-06
uv run python -m logcopilot.phases.summarize_workflow --application hdfs --datasets_dir datasets --storage_dir storage --summarize_workflow.chat_model gpt-4o-2024-08-06
uv run python -m logcopilot.phases.embed_text --application hdfs --storage_dir storage --embed_text.chat_model gpt-4o-2024-08-06 --embed_text.embedding_model text-embedding-3-large --embed_text.documents TEMPLATES
uv run python -m logcopilot.phases.embed_text --application hdfs --storage_dir storage --embed_text.chat_model gpt-4o-2024-08-06 --embed_text.embedding_model text-embedding-3-large --embed_text.documents PARAMETERS
uv run python -m logcopilot.phases.embed_text --application hdfs --storage_dir storage --embed_text.chat_model gpt-4o-2024-08-06 --embed_text.embedding_model text-embedding-3-large --embed_text.documents WORKFLOWS

Repeat with --application openssh, --application openstack, or --application trainticket for the other datasets.
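The eight indexing commands follow a regular pattern, so they can be generated rather than retyped per dataset. This is a sketch that reproduces the command lines listed above; the function name is hypothetical and the flags are taken verbatim from those commands.

```python
import subprocess

# Phases that take a --<phase>.chat_model flag, in the order shown above.
LLM_PHASES = (
    "extract_parameters",
    "summarize_parameter",
    "summarize_template",
    "summarize_workflow",
)
DOCUMENT_KINDS = ("TEMPLATES", "PARAMETERS", "WORKFLOWS")


def indexing_commands(
    application: str,
    chat_model: str = "gpt-4o-2024-08-06",
    embedding_model: str = "text-embedding-3-large",
    datasets_dir: str = "datasets",
    storage_dir: str = "storage",
) -> list[list[str]]:
    """Build the indexing command lines for one application, in run order."""
    base = ["uv", "run", "python", "-m"]
    app = ["--application", application]
    dirs = ["--datasets_dir", datasets_dir, "--storage_dir", storage_dir]
    commands = [base + ["logcopilot.phases.preprocess"] + app + dirs]
    for phase in LLM_PHASES:
        commands.append(
            base + [f"logcopilot.phases.{phase}"] + app + dirs
            + [f"--{phase}.chat_model", chat_model]
        )
    for kind in DOCUMENT_KINDS:
        commands.append(
            base + ["logcopilot.phases.embed_text"] + app
            + ["--storage_dir", storage_dir,
               "--embed_text.chat_model", chat_model,
               "--embed_text.embedding_model", embedding_model,
               "--embed_text.documents", kind]
        )
    return commands


def rebuild(application: str) -> None:
    """Run every indexing phase in order, stopping at the first failure."""
    for command in indexing_commands(application):
        subprocess.run(command, check=True)
```

Because check=True aborts on the first non-zero exit code, a failed phase does not silently feed stale files into later phases.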

Library Usage

Main modules:

logcopilot.index: parsing, template matching, sequence/workflow mining, parameter extraction, and summarization.
logcopilot.context: embedding-based retrieval and context packing.
logcopilot.query: dispatch, context building, LogQL generation/execution, knowledge refinement, and final answer synthesis.
logcopilot.tokenizer: tokenizer abstractions and factory helpers.

Example LogQL generator setup:

from zoneinfo import ZoneInfo

from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from logcopilot.query.context.builder import LocalContextBuilder
from logcopilot.query.logql.execution import LogQLExecutor
from logcopilot.query.logql.generation import LogQLGenerator
from logcopilot.tokenizer import TokenizerFactory, TokenizerType

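# Throttle chat requests client-side: 8 requests per second, bursts of up to 16.
limiter = InMemoryRateLimiter(requests_per_second=8, max_bucket_size=16)
tokenizer = TokenizerFactory.load_strategy(
    {"type": TokenizerType.TIKTOKEN, "model": "gpt-4o-2024-08-06"}
)

# templates, template_embeddings, and parameters are not defined in this snippet;
# load them beforehand from the knowledge files in storage/<application>/.
context_builder = LocalContextBuilder(
    templates=templates,
    template_embeddings=template_embeddings,
    embedding_llm=OpenAIEmbeddings(model="text-embedding-3-large"),
    parameters=parameters,
    tokenizer=tokenizer,
)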

generator = LogQLGenerator(
    chat_llm=ChatOpenAI(model="gpt-4o-2024-08-06", rate_limiter=limiter),
    context_builder=context_builder,
    tokenizer=tokenizer,
    logql_executor=LogQLExecutor("http://localhost:3100", ZoneInfo("Asia/Shanghai")),
    chat_llm_params={"temperature": 0.0},
    context_builder_params={"top_k_templates": 5},
)

CLI

The package installs a logcopilot command:

uv run logcopilot --version

Full query and indexing CLI subcommands are not exposed in this package version. Use the logcopilot.phases.* modules above for reproduction and indexing.

Paper-Support Scripts

Additional utilities live under scripts/:

scripts/draw_venn.py
scripts/show_peaks.py
scripts/tune_fewshot.py

Install the scripts extra before running the plotting utilities:

uv sync --extra scripts

Development Checks

uv run ruff check
uv run ruff format --check
uv run pyright
uv run mypy src scripts tests
uv run pytest
