some concerns with sage #13

@yanan1116


Hi authors, thanks for your work.

I have two concerns regarding the SAGE framework.

In Section 3.2.4 (SAGE), during training with sequential rollout, it seems that skills generated in the first task are made directly available to the second task within the same scenario (i.e., an implicit “perfect retrieval” setting). This setup avoids the need for a retrieval mechanism during training and ensures that relevant skills are always accessible. The pipeline is: task₁ → generate skill → task₂ → utilize skill.

Also, throughout the paper, I do not see any mention of a persistent cross-scenario skill bank, which means that the skills learned during training are not applied at the test / inference stage, right?
So at inference, the procedure is a sequential rollout with on-the-fly skill construction, which emphasizes skill transfer across similar tasks within the same scenario.

In any case, the paper has no retrieval from a large skill bank. I guess this eliminates the need for a retrieval model: no embedding, similarity search, ranking, or filtering. It also means that the evaluation in the experiments does not match real-world settings.
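
To make concrete what I mean by the missing retrieval step, here is a minimal sketch of retrieving from a skill bank with similarity search, ranking, and filtering. Everything in it is my own assumption for illustration: the `skill_bank` contents, the `retrieve` function, and the bag-of-words `embed` (a stand-in for a real embedding model). It is not taken from the paper or the repository.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a learned encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, skill_bank, top_k=2, min_score=0.1):
    # Score every stored skill, drop weak matches, return the best top_k names.
    q = embed(query)
    scored = [(cosine(q, embed(desc)), name) for name, desc in skill_bank.items()]
    scored = [(score, name) for score, name in scored if score >= min_score]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Hypothetical cross-scenario skill bank (names and descriptions are made up).
skill_bank = {
    "send_email": "send an email to a contact",
    "play_song": "play a song from a playlist",
    "pay_bill": "pay a bill with a saved card",
}

print(retrieve("play the top song in my playlist", skill_bank))
print(retrieve("translate document", skill_bank))
```

With a bank like this, the agent can receive irrelevant skills or none at all, which is exactly the situation that the same-scenario sequential rollout never exposes the policy to.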

My key concern is that the tasks grouped within the same scenario are extremely similar, which resembles a form of data leakage. As a result, the evaluation primarily measures performance on near-duplicate tasks, making it difficult to claim meaningful generalization or robustness. The improvements could simply be due to overfitting within the scenario rather than true cross-task transfer.

During training, the policy is optimized under the assumption that relevant skills are always correctly provided.
At inference time, performance may depend heavily on the quality of the retrieval mechanism, which is not part of the training objective. In other words, in most practical inference cases there are not many similar preceding tasks that can supply relevant skills for the following tasks, right? So it seems that the assumption in the paper is too clean and perfect, and that the performance gain actually comes from leakage between similar tasks.

To be blunt, in real settings most cases do not have similar tasks that are well aligned and ready to help (scenario = {task₁, task₂, task₃}, same API, human-crafted order). Cross-scenario tasks are the most common, and the agent needs cross-scenario capabilities, with skills learned across different tasks that share little similarity, right?
At least in this paper, since the agent is not trained to retrieve skills from a large, messy skill bank, I strongly suspect that the improvements in the experiments come purely from the manually crafted scenarios.
You also show some related findings in 4.4 Ablation Studies: Skill Library Agent with Retrieval in Practice.
Therefore, in the setting where there are only two tasks, the skills come from running task_1, which provides perfect skill access without any retrieval training.

Below are examples from the paper. The tasks are extremely similar, which looks like leakage and shows no skill robustness or generalization.
Image


Another concern is about the reward assignment in GRPO.
In SAGE, GRPO is applied over task chains constructed from sequential rollouts, where the agent first executes task₁ to generate skills and then executes task₂ conditioned on the updated skill library. The entire chain is treated as a single trajectory, and a chain-level reward is assigned based on both task success and skill usage. This reward is then used to compute group-relative advantages, which are broadcast to all tokens in the trajectory for policy optimization.

chain-level scalar reward:

R_chain = task success + skill generation reward + skill usage reward + penalty
A_i = R_i - mean(R_group) then broadcast to all tokens

For example:
case 1: task_1 --> good skill --> task_2 utilizes it properly --> success : reward=2
case 2: task_1 --> trash skill --> task_2 does not utilize it but still works out --> success : reward=2
The reward is the same in both cases and is distributed back to all tokens, which means that the model cannot learn which skill is good or bad.
(The heuristic reward for skill generation also does not seem appropriate.)
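
A minimal sketch of the two cases, using the simplified advantage A_i = R_i - mean(R_group) written above. The standard-deviation normalization commonly used in GRPO is omitted to match that formula. The third (failed) chain, the token counts, and all function names are my own assumptions, added only so that the group has a non-trivial baseline.

```python
def group_relative_advantages(rewards):
    # A_i = R_i - mean(R_group), the simplified form used in this issue.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def broadcast(advantage, num_tokens):
    # Every token in the chain receives the same scalar advantage.
    return [advantage] * num_tokens

# Chain-level rewards for one group of rollouts (case 3 is hypothetical).
rewards = {
    "case_1_good_skill_used": 2.0,
    "case_2_trash_skill_ignored": 2.0,
    "case_3_failed_chain": 0.0,
}
advantages = dict(zip(rewards, group_relative_advantages(list(rewards.values()))))

# Hypothetical token counts: skill-generation tokens in task_1, then task_2 tokens.
skill_tokens, task2_tokens = 5, 7
per_token = {
    name: broadcast(adv, skill_tokens + task2_tokens)
    for name, adv in advantages.items()
}

for name, adv in advantages.items():
    print(f"{name}: advantage = {adv:+.3f}")
```

Cases 1 and 2 end up with identical advantages on every token, including the tokens that wrote the good skill and the tokens that wrote the trash skill, so the update pushes both skill-generation behaviors in the same direction.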

So the same scalar reward is assigned to all tokens across both tasks, making it unclear which part of the trajectory (e.g., which generated skill or which skill usage decision) is responsible for success or failure. This may also relate to the observation in your ablation (Section 4.4) where skill usage improves efficiency and SGC but can degrade TGC, possibly due to inappropriate skill application.

Also, since the context/prompt is no longer identical, comparing the rewards no longer makes sense.
Image

Could you clarify how the framework enables the model to learn which skills are beneficial versus detrimental, given the lack of explicit skill-level or step-level credit assignment?
Do you see this as a limitation of the current approach, and do you anticipate that more fine-grained reward signals or hierarchical formulations might be necessary to address this?


Looking forward to hearing your views.
