LogCopilot is an LLM-based log analysis framework. It extracts structured log knowledge from raw logs, retrieves that knowledge for natural-language log analysis questions, generates LogQL queries, executes them against Grafana Loki, and synthesizes final answers.
The implementation is organized around two pipelines:
- Offline indexing: log preprocessing, template matching, sequence/workflow mining, parameter extraction, template/parameter/workflow summarization, and embedding generation.
- Online analysis: question dispatch, knowledge retrieval, LogQL generation/execution with repair, knowledge refinement, and answer synthesis.
The repository contains the maintained Python package, Q&A sheets, template metadata, Loki configuration, and paper-support scripts.
Large artifacts are distributed separately through Zenodo:
- Full raw logs for HDFS, OpenSSH, OpenStack, and TrainTicket.
- A packaged Loki image with logs already loaded.
- Generated knowledge files under storage/ for direct reproduction.
- Python 3.12 or newer
- uv
- Docker, when running the Loki-backed query flow
- An OpenAI-compatible chat and embedding endpoint for LLM workflows
Install the package and development dependencies:
uv sync --extra cli --dev

Create a .env file in the repository root before running LLM or Loki-backed workflows:
OPENAI_API_KEY=<openai_api_key>
OPENAI_BASE_URL=<openai_base_url>
LOKI_BASE_URL=http://localhost:3100

If you use the default OpenAI endpoint, OPENAI_BASE_URL can be omitted.
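Before starting a long run, it can help to confirm that the .env file no longer contains the placeholder values. The following is a minimal sketch using only the standard library; `read_env_file` and `missing_keys` are hypothetical helpers, not part of the logcopilot package, and treating `LOKI_BASE_URL` as required is an assumption that only holds for the Loki-backed flow. The package itself may load the file differently.

```python
from pathlib import Path

# Assumption: OPENAI_BASE_URL is optional (default endpoint), the others are needed.
REQUIRED_KEYS = ("OPENAI_API_KEY", "LOKI_BASE_URL")


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict) -> list:
    """Return required keys that are absent, empty, or still a <placeholder>."""
    missing = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(key)
    return missing


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.is_file():
        print("Missing or placeholder keys:", missing_keys(read_env_file(env_path)))
    else:
        print("No .env file found in the current directory")
```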
The benchmark contains 400 questions, 100 for each system.
| System | Time Span | # Messages | # Templates | Raw Size | # Questions | Type |
|---|---|---|---|---|---|---|
| HDFS | 38.7h | 11,167,740 | 46 | 1.5GB | 100 | Distributed system |
| OpenSSH | 682.4h | 638,947 | 38 | 68MB | 100 | Server application |
| OpenStack | 64.4h | 207,632 | 48 | 59MB | 100 | Distributed system |
| TrainTicket | 652.1h | 1,644,848 | 180 | 453MB | 100 | Microservice system |
The Q&A sheets and template files are under datasets/<application>/. Full raw logs and generated knowledge files are available from Zenodo.
For the fastest reproduction path, use the packaged Loki image from Zenodo:
docker load -i logcopilot_loki.tar

Start Loki with the provided configuration:
mkdir -p /home/loki/config
cp setup_loki/loki-config.yaml /home/loki/config/
docker run -d -v /home/loki/config:/mnt/config -p 3100:3100 logcopilot/loki:v3

On Windows or other systems, replace /home/loki/config with an absolute host directory and mount that directory to /mnt/config in the container.
Check readiness:
curl localhost:3100/ready

Loki is ready when the command returns ready.
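Loki can take a while to report ready after the container starts, so a script may prefer to poll instead of checking once. The sketch below repeats the same `/ready` request with the standard library. `fetch_ready` and `wait_for_loki` are hypothetical names, and the attempt count and delay are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_ready(base_url: str) -> str:
    """Return the body of GET <base_url>/ready, or "" if the request fails."""
    try:
        url = f"{base_url.rstrip('/')}/ready"
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Covers connection refusals and non-2xx responses while Loki starts up.
        return ""


def wait_for_loki(
    base_url: str = "http://localhost:3100",
    attempts: int = 30,
    delay_seconds: float = 2.0,
    fetch: Callable[[str], str] = fetch_ready,
) -> bool:
    """Poll until the response body is "ready" or the attempts run out."""
    for attempt in range(attempts):
        if fetch(base_url).strip() == "ready":
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


if __name__ == "__main__":
    print("Loki ready:", wait_for_loki())
```

The `fetch` parameter exists only so the polling logic can be exercised without a running container.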
The online query flow expects generated knowledge files in storage/. Download the generated knowledge archive from Zenodo and place the files like this:
storage/
  hdfs/
    parameters_gpt-4o-2024-08-06.json
    parameters_embeddings_text-embedding-3-large.parquet
    templates_gpt-4o-2024-08-06.json
    templates_embeddings_text-embedding-3-large.parquet
    workflows_gpt-4o-2024-08-06.json
    workflows_embeddings_text-embedding-3-large.parquet
  openssh/
    ...
  openstack/
    ...
  trainticket/
    ...
The workflow files are used by knowledge Q&A. The query flow can run without workflow files, but they should be included for the full LogCopilot pipeline.
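A quick way to confirm that the archive was unpacked into the layout above is to list the files that are absent. This is a sketch that infers the naming pattern from the HDFS listing (`<kind>_<chat model>.json` and `<kind>_embeddings_<embedding model>.parquet`); `expected_files` and `missing_files` are hypothetical helpers, not functions from the package.

```python
from pathlib import Path

# Knowledge kinds taken from the storage/hdfs/ listing above.
KNOWLEDGE_KINDS = ("parameters", "templates", "workflows")


def expected_files(chat_model: str, embedding_model: str) -> list:
    """Build the six file names expected for one application."""
    names = []
    for kind in KNOWLEDGE_KINDS:
        names.append(f"{kind}_{chat_model}.json")
        names.append(f"{kind}_embeddings_{embedding_model}.parquet")
    return names


def missing_files(
    storage_dir: Path,
    application: str,
    chat_model: str = "gpt-4o-2024-08-06",
    embedding_model: str = "text-embedding-3-large",
) -> list:
    """Return expected file names that do not exist under storage/<application>/."""
    app_dir = storage_dir / application
    return [
        name
        for name in expected_files(chat_model, embedding_model)
        if not (app_dir / name).is_file()
    ]


if __name__ == "__main__":
    for app in ("hdfs", "openssh", "openstack", "trainticket"):
        print(app, "missing:", missing_files(Path("storage"), app))
```

An empty list means all six files are present. If only the two workflow files are reported, the query flow can still run, as noted above.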
Run a smoke reproduction over the first 10 HDFS questions:
uv run python -m logcopilot.phases.query_flow --application hdfs --query_chat_model gpt-4o-2024-08-06 --loc 1 10

Outputs are written to results/hdfs/:
- query_flow_process_<timestamp>.json: retrieved context, generated LogQL, execution results, and repair attempts.
- query_flow_finalized_<timestamp>.json: final answer reports.
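Because each run writes new timestamped files, a small helper can pick out the most recent one of each kind. The sketch below selects by file modification time, since the timestamp format in the file name is not specified here; `latest_result` is a hypothetical helper, and the JSON schema of the files is not assumed.

```python
from pathlib import Path
from typing import Optional


def latest_result(results_dir: Path, kind: str) -> Optional[Path]:
    """Return the most recently modified query_flow_<kind>_*.json file, if any.

    kind is "process" or "finalized", matching the two output file prefixes.
    """
    candidates = list(results_dir.glob(f"query_flow_{kind}_*.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    for kind in ("process", "finalized"):
        print(kind, "->", latest_result(Path("results/hdfs"), kind))
```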
To run all questions for one system:
uv run python -m logcopilot.phases.query_flow --application hdfs --query_chat_model gpt-4o-2024-08-06 --loc 1 100

Change --application to openssh, openstack, or trainticket for the other datasets.
The Zenodo storage/ files are the recommended path for artifact review. To rebuild the knowledge files from raw logs, first place all.log under each datasets/<application>/ directory, then run the indexing phases.
Rebuilding knowledge can take a long time and consumes LLM and embedding API tokens. Use the generated Zenodo files unless you specifically need to validate the indexing pipeline itself.
The following example rebuilds HDFS knowledge using gpt-4o-2024-08-06 and text-embedding-3-large:
uv run python -m logcopilot.phases.preprocess --application hdfs --datasets_dir datasets --storage_dir storage
uv run python -m logcopilot.phases.extract_parameters --application hdfs --datasets_dir datasets --storage_dir storage --extract_parameters.chat_model gpt-4o-2024-08-06
uv run python -m logcopilot.phases.summarize_parameter --application hdfs --datasets_dir datasets --storage_dir storage --summarize_parameter.chat_model gpt-4o-2024-08-06
uv run python -m logcopilot.phases.summarize_template --application hdfs --datasets_dir datasets --storage_dir storage --summarize_template.chat_model gpt-4o-2024-08-06
uv run python -m logcopilot.phases.summarize_workflow --application hdfs --datasets_dir datasets --storage_dir storage --summarize_workflow.chat_model gpt-4o-2024-08-06
uv run python -m logcopilot.phases.embed_text --application hdfs --storage_dir storage --embed_text.chat_model gpt-4o-2024-08-06 --embed_text.embedding_model text-embedding-3-large --embed_text.documents TEMPLATES
uv run python -m logcopilot.phases.embed_text --application hdfs --storage_dir storage --embed_text.chat_model gpt-4o-2024-08-06 --embed_text.embedding_model text-embedding-3-large --embed_text.documents PARAMETERS
uv run python -m logcopilot.phases.embed_text --application hdfs --storage_dir storage --embed_text.chat_model gpt-4o-2024-08-06 --embed_text.embedding_model text-embedding-3-large --embed_text.documents WORKFLOWS

Repeat with --application openssh, --application openstack, or --application trainticket for the other datasets.
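When rebuilding several datasets, the eight commands above can be generated rather than retyped. The following sketch assembles the same command lines, in the same order, and stops at the first failing phase. `indexing_commands` and `rebuild` are hypothetical wrappers around the documented module invocations; they add no flags beyond those shown above.

```python
import subprocess

# Phases that take a --<phase>.chat_model flag, in the order listed above.
CHAT_PHASES = (
    "extract_parameters",
    "summarize_parameter",
    "summarize_template",
    "summarize_workflow",
)
DOCUMENT_KINDS = ("TEMPLATES", "PARAMETERS", "WORKFLOWS")


def indexing_commands(
    application: str,
    chat_model: str = "gpt-4o-2024-08-06",
    embedding_model: str = "text-embedding-3-large",
    datasets_dir: str = "datasets",
    storage_dir: str = "storage",
) -> list:
    """Return the indexing commands for one application as argument lists."""
    base = ["uv", "run", "python", "-m"]
    app_args = ["--application", application]
    dirs = ["--datasets_dir", datasets_dir, "--storage_dir", storage_dir]

    commands = [base + ["logcopilot.phases.preprocess"] + app_args + dirs]
    for phase in CHAT_PHASES:
        commands.append(
            base + [f"logcopilot.phases.{phase}"] + app_args + dirs
            + [f"--{phase}.chat_model", chat_model]
        )
    for kind in DOCUMENT_KINDS:
        commands.append(
            base + ["logcopilot.phases.embed_text"] + app_args
            + ["--storage_dir", storage_dir]
            + ["--embed_text.chat_model", chat_model]
            + ["--embed_text.embedding_model", embedding_model]
            + ["--embed_text.documents", kind]
        )
    return commands


def rebuild(application: str) -> None:
    """Run every phase in order; check=True aborts on the first failure."""
    for command in indexing_commands(application):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in indexing_commands("hdfs"):
        print(" ".join(command))
```

Printing the commands first, as the `__main__` block does, gives a dry run before any API tokens are spent.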
Main modules:
- logcopilot.index: parsing, template matching, sequence/workflow mining, parameter extraction, and summarization.
- logcopilot.context: embedding-based retrieval and context packing.
- logcopilot.query: dispatch, context building, LogQL generation/execution, knowledge refinement, and final answer synthesis.
- logcopilot.tokenizer: tokenizer abstractions and factory helpers.
Example LogQL generator setup:
from zoneinfo import ZoneInfo
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from logcopilot.query.context.builder import LocalContextBuilder
from logcopilot.query.logql.execution import LogQLExecutor
from logcopilot.query.logql.generation import LogQLGenerator
from logcopilot.tokenizer import TokenizerFactory, TokenizerType
limiter = InMemoryRateLimiter(requests_per_second=8, max_bucket_size=16)
tokenizer = TokenizerFactory.load_strategy(
    {"type": TokenizerType.TIKTOKEN, "model": "gpt-4o-2024-08-06"}
)
# templates, template_embeddings, and parameters are assumed to have been loaded
# beforehand from the generated knowledge files under storage/<application>/.
context_builder = LocalContextBuilder(
    templates=templates,
    template_embeddings=template_embeddings,
    embedding_llm=OpenAIEmbeddings(model="text-embedding-3-large"),
    parameters=parameters,
    tokenizer=tokenizer,
)
generator = LogQLGenerator(
    chat_llm=ChatOpenAI(model="gpt-4o-2024-08-06", rate_limiter=limiter),
    context_builder=context_builder,
    tokenizer=tokenizer,
    logql_executor=LogQLExecutor("http://localhost:3100", ZoneInfo("Asia/Shanghai")),
    chat_llm_params={"temperature": 0.0},
    context_builder_params={"top_k_templates": 5},
)

The package installs a logcopilot command:
uv run logcopilot --version

Full query and indexing CLI subcommands are not exposed in this package version. Use the logcopilot.phases.* modules above for reproduction and indexing.
Additional utilities live under scripts/:
- scripts/draw_venn.py
- scripts/show_peaks.py
- scripts/tune_fewshot.py
Install the scripts extra before running plotting utilities:
uv sync --extra scripts

Run the linting, formatting, type, and test checks:
uv run ruff check
uv run ruff format --check
uv run pyright
uv run mypy src scripts tests
uv run pytest