some concerns with sage

Hi authors, thanks for your work.

I have two concerns regarding the SAGE framework.

In Section 3.2.4 (SAGE), during training with sequential rollout, it seems that skills generated in the first task are made directly available to the second task within the same scenario (i.e., an implicit “perfect retrieval” setting). This setup avoids the need for a retrieval mechanism during training and ensures that relevant skills are always accessible. The pipeline looks like `task₁ → generate skill → task₂ → utilize skill`.

Also, throughout the paper I do not see anything about a `persistent cross-scenario skill bank`, which means that the skills learned during training are not applied at the test / inference stage, right?
So the inference stage is a sequential rollout with `on-the-fly skill construction`, which puts the emphasis on `skill transfer across similar tasks under same scenario`.

In any case, there is no `retrieval from a large skill bank` mechanism in the paper. I guess you eliminate the need for a retrieval model: no embedding, similarity search, ranking, filtering, etc. This also means that the evaluation in the experiments does not match real-world settings.
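To make the missing step concrete, here is a minimal sketch of what retrieval from a skill bank could look like. It is not taken from the paper: the bank entries, the `retrieve` helper, and the token-overlap (Jaccard) scoring are all hypothetical stand-ins for a real embedding-based retriever.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings (toy stand-in for embeddings)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(query: str, skill_bank: list[dict], top_k: int = 1) -> list[dict]:
    """Rank every skill in the bank against the task description, keep the top_k."""
    ranked = sorted(
        skill_bank,
        key=lambda skill: jaccard(query, skill["description"]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical cross-scenario bank; names and descriptions are made up.
skill_bank = [
    {"name": "play_playlist", "description": "play a playlist in the music app"},
    {"name": "send_payment", "description": "send money to a contact in the payment app"},
    {"name": "delete_note", "description": "delete a note in the notes app"},
]

hits = retrieve("send money to my roommate", skill_bank)
print(hits[0]["name"])
```

With sequential rollout inside one scenario, this ranking step never happens: the skill from task₁ is simply handed to task₂, so the policy is never exposed to a wrong or missing retrieval result.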



**My key concern is that the tasks grouped within the same scenario are extremely similar, which resembles a form of data leakage. As a result, the evaluation primarily measures performance on near-duplicate tasks, making it difficult to claim meaningful generalization or robustness. The improvements could simply be due to overfitting within the scenario rather than true cross-task transfer.**

During training, the policy is optimized under the assumption that relevant skills are always correctly provided.
At inference time, performance may depend heavily on the quality of the retrieval mechanism, which is not part of the training objective. In other words, in most practical inference cases there are not many similar prior tasks whose skills can be reused for the following tasks, right? So the assumption in the paper seems too clean and perfect, and the performance gain seems to come from leakage across similar tasks.

To be blunt, in real settings most cases do not come with similar tasks that are well aligned and ready to help `(scenario = {task₁, task₂, task₃}, same api, human crafted order)`. Cross-scenario tasks are the most common, and the agent needs cross-scenario capabilities, with skills learned across different tasks that share little similarity, right?
At the very least, since the agent in this paper is not trained to retrieve skills from a large, messy skill bank, I strongly suspect that the improvements in the experiments come purely from the manually crafted `scenario`.
You also show some related findings in 4.4 Ablation Studies: Skill Library Agent with Retrieval in Practice.
Therefore, in the setting where there are only two tasks, the skills come from running task_1, which provides `perfect skill access` without any retrieval training.



Examples from the paper: the tasks are extremely similar, which looks like leakage and shows no skill robustness or generalization.
<img width="1695" height="774" alt="Image" src="https://github.com/user-attachments/assets/69187f78-9509-49d1-b3af-f4ddfa68689f" />

---

Another concern is about the reward assignment in GRPO.
In SAGE, GRPO is applied over task chains constructed from sequential rollouts, where the agent first executes task₁ to generate skills and then executes task₂ conditioned on the updated skill library. The entire chain is treated as a single trajectory, and a chain-level reward is assigned based on both task success and skill usage. This reward is then used to compute group-relative advantages, which are broadcast to all tokens in the trajectory for policy optimization.

chain-level scalar reward: 
```
R_chain = task success + skill generation reward + skill usage reward + penalty
A_i = R_i - mean(R_group) then broadcast to all tokens
```
For example:
case 1: `task_1 --> good skill --> task_2 utilizes it properly --> success : reward=2`
case 2: `task_1 --> trash skill --> task_2 does not utilize it but still works out --> success : reward=2`
The reward is the same in both cases and is distributed back to all tokens, which means the model cannot learn which skill is good or bad.
(The heuristic reward for skill generation also does not seem appropriate.)
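A minimal sketch of the two cases, assuming the mean-baseline advantage `A_i = R_i - mean(R_group)` written above (no standard-deviation normalization). The group composition, the chain names, and the reward values for the failed chains are hypothetical, chosen only to illustrate the point.

```python
from statistics import mean


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """A_i = R_i - mean(R_group): one scalar advantage per chain."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def broadcast(advantage: float, num_tokens: int) -> list[float]:
    """Every token in the chain (task_1 and task_2) receives the same advantage."""
    return [advantage] * num_tokens


# Hypothetical group of four chains sampled for the same scenario.
chain_rewards = {
    "good_skill_used": 2.0,      # case 1
    "trash_skill_ignored": 2.0,  # case 2
    "failed_chain_a": 0.0,
    "failed_chain_b": 0.0,
}

advantages = dict(
    zip(chain_rewards, group_relative_advantages(list(chain_rewards.values())))
)
print(advantages)

# Tokens that wrote the good skill and tokens that wrote the trash skill
# are pushed in the same direction with the same magnitude.
good_tokens = broadcast(advantages["good_skill_used"], num_tokens=5)
trash_tokens = broadcast(advantages["trash_skill_ignored"], num_tokens=5)
print(good_tokens == trash_tokens)
```

In this toy group, both successful chains get the same positive advantage, so the skill-generation tokens of case 2 are reinforced exactly as much as those of case 1.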


So the same scalar reward is assigned to all tokens across both tasks, making it unclear which part of the trajectory (e.g., which generated skill or which skill usage decision) is responsible for success or failure. This may also relate to the observation in your ablation (Section 4.4) where skill usage improves efficiency and SGC but can degrade TGC, possibly due to inappropriate skill application.


Also, since the context/prompt is no longer identical across rollouts, comparing their rewards does not really make sense anymore.
<img width="816" height="375" alt="Image" src="https://github.com/user-attachments/assets/89a08465-8195-4f0c-a137-d7d631db1311" />



Could you clarify how the framework enables the model to learn which skills are beneficial versus detrimental, given the lack of explicit skill-level or step-level credit assignment?
Do you see this as a limitation of the current approach, and do you anticipate that more fine-grained reward signals or hierarchical formulations might be necessary to address this?


---
Looking forward to hearing your views.

