Skip to content

[WIP] Agentic Training#217

Open
tastelikefeet wants to merge 103 commits into
modelscope:mainfrom
tastelikefeet:feat/agentic2
Open

[WIP] Agentic Training#217
tastelikefeet wants to merge 103 commits into
modelscope:mainfrom
tastelikefeet:feat/agentic2

Conversation

@tastelikefeet
Copy link
Copy Markdown
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detail information belongs to this PR.

Experiment results

Paste your experiment result here(if needed).

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive suite of features to the Twinkle framework, including a streaming OdpsIterableDataset, an InfonceLoss implementation for contrastive embedding training, and a robust set of quality preprocessors such as QualityPreprocessor, HardFilter, DeadLoopFilter, and MessageSanityFilter. It also adds support for embedding tasks in both Transformers and Megatron backends, alongside several training and dataset generation scripts. The review identified several critical issues: an indexing bug in _parse_multi_negative_sentences causing incorrect slicing, silent data loss in parse_tool_call_stream when tool call parsing returns empty, unexpected column removal in Dataset.map when remove_columns is omitted, serialization/pickling issues with the ODPS client, ineffectiveness of the repetition filter for CJK languages, and incorrect rejection of valid terminal tool calls in message_sanity.py.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +65 to +73
split_indices = torch.nonzero(labels, as_tuple=False).squeeze().tolist()
if isinstance(split_indices, int):
split_indices = [split_indices]
split_indices.append(len(labels))
split_indices = np.array(split_indices) + np.array(list(range(len(split_indices))))
split_tensors = []
for i in range(len(split_indices) - 1):
start, end = split_indices[i], split_indices[i + 1]
split_part = sentences[start:end]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

In _parse_multi_negative_sentences, the line split_indices = np.array(split_indices) + np.array(list(range(len(split_indices)))) incorrectly shifts the group start indices. Since labels is a 1-D mask where 1 represents the anchor of each group, the indices of 1 are already the exact start indices of each group in the flat sentences tensor. Adding the range offset shifts the indices further to the right, causing anchors of subsequent groups to be treated as negatives in previous groups, subsequent groups to lose their anchors entirely, and out-of-bounds slicing at the end of the tensor. This line should be removed.

Suggested change
split_indices = torch.nonzero(labels, as_tuple=False).squeeze().tolist()
if isinstance(split_indices, int):
split_indices = [split_indices]
split_indices.append(len(labels))
split_indices = np.array(split_indices) + np.array(list(range(len(split_indices))))
split_tensors = []
for i in range(len(split_indices) - 1):
start, end = split_indices[i], split_indices[i + 1]
split_part = sentences[start:end]
split_indices = torch.nonzero(labels, as_tuple=False).squeeze().tolist()
if isinstance(split_indices, int):
split_indices = [split_indices]
split_indices.append(len(labels))
split_tensors = []
for i in range(len(split_indices) - 1):
start, end = split_indices[i], split_indices[i + 1]
split_part = sentences[start:end]

Comment on lines +185 to +187
for tc in parsed:
events.append({'tool_calls': [self._format_tc_delta(state, tc)]})
state['pending'] = buf[close_idx + len(self._TOOL_CALL_CLOSE):]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In parse_tool_call_stream, if self.parse(block) returns an empty list [] (e.g., due to a regex mismatch on a slightly malformed tool call), the loop silently advances state['pending'] without appending any event. This causes the entire tool call block to be silently dropped and lost from the stream. To prevent silent data loss, the block should be emitted as raw content when parsing yields no tool calls.

Suggested change
for tc in parsed:
events.append({'tool_calls': [self._format_tc_delta(state, tc)]})
state['pending'] = buf[close_idx + len(self._TOOL_CALL_CLOSE):]
if parsed:
for tc in parsed:
events.append({'tool_calls': [self._format_tc_delta(state, tc)]})
else:
events.append({'content': block})
state['pending'] = buf[close_idx + len(self._TOOL_CALL_CLOSE):]

Comment thread src/twinkle/dataset/base.py Outdated
Comment on lines +235 to +240
with processing_lock(key):
if 'remove_columns' not in kwargs:
features = getattr(self.datasets[key], 'features', None)
if features is not None:
kwargs['remove_columns'] = list(features.keys())
self.datasets[key] = self.datasets[key].map(preprocess_func, **kwargs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In Dataset.map, if remove_columns is not specified in kwargs, it automatically sets remove_columns to all features of the dataset. This deviates from standard Hugging Face Dataset.map behavior (where columns are preserved by default) and is inconsistent with the self._mixed branch (lines 225-227) where remove_columns is not automatically set. This inconsistency will cause silent data loss or unexpected behavior depending on whether mix_dataset was called. It is safer to let the user explicitly specify remove_columns when they want to remove columns.

            with processing_lock(key):
                self.datasets[key] = self.datasets[key].map(preprocess_func, **kwargs)

Comment thread src/twinkle/dataset/odps_dataset.py Outdated
Comment on lines +20 to +26
def _make_odps_generator(
odps,
table_name: str,
partition: Optional[str] = None,
columns: Optional[List[str]] = None,
row_converter: Optional[Callable] = None,
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The generators _make_odps_generator and _make_multi_partition_generator capture the active odps client instance in their closures. Since ODPS clients hold active network connections/sessions, they are not picklable. This will cause PicklingError when the dataset is serialized (e.g., when using PyTorch DataLoader with num_workers > 0 or Ray distributed actors). To make the dataset robust and serializable, consider passing the connection parameters (or a factory function) to the generator creators and instantiating the ODPS client inside the generator function _gen itself.

Comment on lines +107 to +113
def _high_repetition_with_threshold(text: str, threshold: float, ngram_size: int = 8, ngram_min_words: int = 30) -> bool:
words = text.split()
if len(words) < ngram_min_words:
return False
ngrams = [' '.join(words[i:i + ngram_size]) for i in range(len(words) - ngram_size + 1)]
unique_ratio = len(set(ngrams)) / len(ngrams)
return (1.0 - unique_ratio) > threshold
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

_high_repetition_with_threshold splits the text using text.split(), which assumes space-separated words. For CJK languages (Chinese, Japanese, Korean) which do not use spaces, text.split() returns a single-element list, making the repetition filter completely ineffective. To support multilingual texts, consider splitting by English words or individual CJK characters.

Suggested change
def _high_repetition_with_threshold(text: str, threshold: float, ngram_size: int = 8, ngram_min_words: int = 30) -> bool:
words = text.split()
if len(words) < ngram_min_words:
return False
ngrams = [' '.join(words[i:i + ngram_size]) for i in range(len(words) - ngram_size + 1)]
unique_ratio = len(set(ngrams)) / len(ngrams)
return (1.0 - unique_ratio) > threshold
def _high_repetition_with_threshold(text: str, threshold: float, ngram_size: int = 8, ngram_min_words: int = 30) -> bool:
import re
words = re.findall(r'\b\w+\b|[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7a3]', text)
if len(words) < ngram_min_words:
return False
ngrams = [' '.join(words[i:i + ngram_size]) for i in range(len(words) - ngram_size + 1)]
unique_ratio = len(set(ngrams)) / len(ngrams)
return (1.0 - unique_ratio) > threshold

Comment on lines +193 to +195
actual_ids.add(tid)
j += 1
# Must have at least one matching response; all responses must reference valid calls
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

_validate_tool_call_matching rejects any conversation where an assistant's tool_calls is not followed by a tool response. However, in SFT datasets, it is common and valid for a conversation to end with the assistant's tool call (terminal tool call). This filter will incorrectly discard these valid training samples. Consider allowing terminal tool calls by only enforcing matching if there are subsequent messages after the assistant's tool call.

Suggested change
actual_ids.add(tid)
j += 1
# Must have at least one matching response; all responses must reference valid calls
# Must have at least one matching response; all responses must reference valid calls
if not actual_ids:
if j < len(messages):
return False
elif not actual_ids.issubset(expected_ids):
return False

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant