[WIP] Agentic Training by tastelikefeet · Pull Request #217 · modelscope/twinkle

tastelikefeet · 2026-06-03T08:00:47Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Write the detail information belongs to this PR.

Experiment results

Paste your experiment result here(if needed).

into feat/agentic2

gemini-code-assist

Code Review

This pull request introduces a comprehensive suite of features to the Twinkle framework, including a streaming OdpsIterableDataset, an InfonceLoss implementation for contrastive embedding training, and a robust set of quality preprocessors such as QualityPreprocessor, HardFilter, DeadLoopFilter, and MessageSanityFilter. It also adds support for embedding tasks in both Transformers and Megatron backends, alongside several training and dataset generation scripts. The review identified several critical issues: an indexing bug in _parse_multi_negative_sentences causing incorrect slicing, silent data loss in parse_tool_call_stream when tool call parsing returns empty, unexpected column removal in Dataset.map when remove_columns is omitted, serialization/pickling issues with the ODPS client, ineffectiveness of the repetition filter for CJK languages, and incorrect rejection of valid terminal tool calls in message_sanity.py.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-03T08:04:47Z

+    split_indices = torch.nonzero(labels, as_tuple=False).squeeze().tolist()
+    if isinstance(split_indices, int):
+        split_indices = [split_indices]
+    split_indices.append(len(labels))
+    split_indices = np.array(split_indices) + np.array(list(range(len(split_indices))))
+    split_tensors = []
+    for i in range(len(split_indices) - 1):
+        start, end = split_indices[i], split_indices[i + 1]
+        split_part = sentences[start:end]


In _parse_multi_negative_sentences, the line split_indices = np.array(split_indices) + np.array(list(range(len(split_indices)))) incorrectly shifts the group start indices. Since labels is a 1-D mask where 1 represents the anchor of each group, the indices of 1 are already the exact start indices of each group in the flat sentences tensor. Adding the range offset shifts the indices further to the right, causing anchors of subsequent groups to be treated as negatives in previous groups, subsequent groups to lose their anchors entirely, and out-of-bounds slicing at the end of the tensor. This line should be removed.

Suggested change

split_indices = torch.nonzero(labels, as_tuple=False).squeeze().tolist()

if isinstance(split_indices, int):

split_indices = [split_indices]

split_indices.append(len(labels))

split_indices = np.array(split_indices) + np.array(list(range(len(split_indices))))

split_tensors = []

for i in range(len(split_indices) - 1):

start, end = split_indices[i], split_indices[i + 1]

split_part = sentences[start:end]

split_indices = torch.nonzero(labels, as_tuple=False).squeeze().tolist()

if isinstance(split_indices, int):

split_indices = [split_indices]

split_indices.append(len(labels))

split_tensors = []

for i in range(len(split_indices) - 1):

start, end = split_indices[i], split_indices[i + 1]

split_part = sentences[start:end]

gemini-code-assist · 2026-06-03T08:04:47Z

+            for tc in parsed:
+                events.append({'tool_calls': [self._format_tc_delta(state, tc)]})
+            state['pending'] = buf[close_idx + len(self._TOOL_CALL_CLOSE):]


In parse_tool_call_stream, if self.parse(block) returns an empty list [] (e.g., due to a regex mismatch on a slightly malformed tool call), the loop silently advances state['pending'] without appending any event. This causes the entire tool call block to be silently dropped and lost from the stream. To prevent silent data loss, the block should be emitted as raw content when parsing yields no tool calls.

Suggested change

for tc in parsed:

events.append({'tool_calls': [self._format_tc_delta(state, tc)]})

state['pending'] = buf[close_idx + len(self._TOOL_CALL_CLOSE):]

if parsed:

for tc in parsed:

events.append({'tool_calls': [self._format_tc_delta(state, tc)]})

else:

events.append({'content': block})

state['pending'] = buf[close_idx + len(self._TOOL_CALL_CLOSE):]

gemini-code-assist · 2026-06-03T08:04:47Z

+            with processing_lock(key):
+                if 'remove_columns' not in kwargs:
+                    features = getattr(self.datasets[key], 'features', None)
+                    if features is not None:
+                        kwargs['remove_columns'] = list(features.keys())
+                self.datasets[key] = self.datasets[key].map(preprocess_func, **kwargs)


In Dataset.map, if remove_columns is not specified in kwargs, it automatically sets remove_columns to all features of the dataset. This deviates from standard Hugging Face Dataset.map behavior (where columns are preserved by default) and is inconsistent with the self._mixed branch (lines 225-227) where remove_columns is not automatically set. This inconsistency will cause silent data loss or unexpected behavior depending on whether mix_dataset was called. It is safer to let the user explicitly specify remove_columns when they want to remove columns.

with processing_lock(key): self.datasets[key] = self.datasets[key].map(preprocess_func, **kwargs)

gemini-code-assist · 2026-06-03T08:04:47Z

+def _make_odps_generator(
+    odps,
+    table_name: str,
+    partition: Optional[str] = None,
+    columns: Optional[List[str]] = None,
+    row_converter: Optional[Callable] = None,
+):


The generators _make_odps_generator and _make_multi_partition_generator capture the active odps client instance in their closures. Since ODPS clients hold active network connections/sessions, they are not picklable. This will cause PicklingError when the dataset is serialized (e.g., when using PyTorch DataLoader with num_workers > 0 or Ray distributed actors). To make the dataset robust and serializable, consider passing the connection parameters (or a factory function) to the generator creators and instantiating the ODPS client inside the generator function _gen itself.

gemini-code-assist · 2026-06-03T08:04:47Z

+def _high_repetition_with_threshold(text: str, threshold: float, ngram_size: int = 8, ngram_min_words: int = 30) -> bool:
+    words = text.split()
+    if len(words) < ngram_min_words:
+        return False
+    ngrams = [' '.join(words[i:i + ngram_size]) for i in range(len(words) - ngram_size + 1)]
+    unique_ratio = len(set(ngrams)) / len(ngrams)
+    return (1.0 - unique_ratio) > threshold


_high_repetition_with_threshold splits the text using text.split(), which assumes space-separated words. For CJK languages (Chinese, Japanese, Korean) which do not use spaces, text.split() returns a single-element list, making the repetition filter completely ineffective. To support multilingual texts, consider splitting by English words or individual CJK characters.

Suggested change

def _high_repetition_with_threshold(text: str, threshold: float, ngram_size: int = 8, ngram_min_words: int = 30) -> bool:

words = text.split()

if len(words) < ngram_min_words:

return False

ngrams = [' '.join(words[i:i + ngram_size]) for i in range(len(words) - ngram_size + 1)]

unique_ratio = len(set(ngrams)) / len(ngrams)

return (1.0 - unique_ratio) > threshold

def _high_repetition_with_threshold(text: str, threshold: float, ngram_size: int = 8, ngram_min_words: int = 30) -> bool:

import re

words = re.findall(r'\b\w+\b|[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7a3]', text)

if len(words) < ngram_min_words:

return False

ngrams = [' '.join(words[i:i + ngram_size]) for i in range(len(words) - ngram_size + 1)]

unique_ratio = len(set(ngrams)) / len(ngrams)

return (1.0 - unique_ratio) > threshold

gemini-code-assist · 2026-06-03T08:04:47Z

+                    actual_ids.add(tid)
+                j += 1
+            # Must have at least one matching response; all responses must reference valid calls


_validate_tool_call_matching rejects any conversation where an assistant's tool_calls is not followed by a tool response. However, in SFT datasets, it is common and valid for a conversation to end with the assistant's tool call (terminal tool call). This filter will incorrectly discard these valid training samples. Consider allowing terminal tool calls by only enforcing matching if there are subsequent messages after the assistant's tool call.

Suggested change

actual_ids.add(tid)

j += 1

# Must have at least one matching response; all responses must reference valid calls

# Must have at least one matching response; all responses must reference valid calls

if not actual_ids:

if j < len(messages):

return False

elif not actual_ids.issubset(expected_ids):

return False

tastelikefeet added 30 commits May 9, 2026 11:52

wip

6aade99

wip

99394a2

wip

27cd090

fix

9e31c07

fix

33b8b32

fix

bbed39d

fix

504cfa0

fix

2393272

fix

5b731ea

fix

7576ef7

fix

eb85331

fix

1c0a093

fix

af4a892

fix

04565b6

wip

95d47f4

fix

88ceb1d

fix

e14e582

fix

56182f3

fix

e4dee4a

fix

f728a8d

fix

1ee5235

fix

2bfda3d

fix

b6f6b8b

fix

73d828b

fix

7cb1845

fix

34e6b44

fix

ce46d94

fix

e0e836e

fix

5ab035b

fix

e265980

tastelikefeet added 11 commits June 1, 2026 10:19

fix

6646ccb

fix

9af9fa5

fix

b4dfb58

fix

9debe32

fix

0b74fb7

fix

bbf20b0

fix

102da4b

fix

06528a6

fix

650a534

Merge branch 'feat/agentic2' of https://github.com/tastelikefeet/twinkle

8044f88

into feat/agentic2

fix

f71a235

gemini-code-assist Bot reviewed Jun 3, 2026

View reviewed changes

tastelikefeet added 18 commits June 3, 2026 18:54

fix

654b4e1

fix

8fa3430

fix

d04d5b5

fix

cfb8bbe

fix

72c2695

fix

c4802d4

fix

c858d8c

fix

4ebc1b2

fix

2bab7f8

fix

8eba2a9

fix

f7ff145

fix

faadadc

fix

4719b5f

fix

4c6ea99

fix

3042538

fix

a2edde6

fix

ac61e1e

fix

9941dc9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Agentic Training#217

[WIP] Agentic Training#217
tastelikefeet wants to merge 103 commits into
modelscope:mainfrom
tastelikefeet:feat/agentic2

tastelikefeet commented Jun 3, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tastelikefeet commented Jun 3, 2026

PR type

PR information

Experiment results

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant