[WIP] OH Versa #8598

Draft: wants to merge 268 commits into base: main
26c4f72
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Jan 30, 2025
818533f
Merge branch 'main' into codeact_browsing
adityasoni9998 Jan 30, 2025
a699a0d
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Jan 30, 2025
d34c412
Merge branch 'main' into codeact_browsing
adityasoni9998 Jan 30, 2025
e66a113
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Feb 1, 2025
ee1173e
Merge branch 'main' into codeact_browsing
adityasoni9998 Feb 1, 2025
83fa5b0
Rename visual browsing flag in agent config.
adityasoni9998 Feb 1, 2025
65bf992
Browser output condenser to condense observation outputs from browser…
adityasoni9998 Feb 1, 2025
3b05b68
Merge branch 'main' into browser_condenser
adityasoni9998 Feb 1, 2025
d35a225
Merge branch 'main' into browser_condenser
adityasoni9998 Feb 2, 2025
0b1ec8a
Merge remote-tracking branch 'upstream/main' into main
adityasoni9998 Feb 5, 2025
f49b7a9
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Feb 6, 2025
0f24032
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Feb 6, 2025
904b2b3
Remove free disk space steps from workflows to test if they are neces…
neubig Feb 5, 2025
6e7aa15
Fix memory leak in JSON encoder (#6620)
neubig Feb 5, 2025
0b361f3
Update and Improve zh-TW Traditional Chinese locale (#6621)
PeterDaveHello Feb 5, 2025
7e08383
chore(deps): bump the version-all group across 1 directory with 15 up…
dependabot[bot] Feb 6, 2025
f1911be
Only show start project button in conversations (#6626)
mamoodi Feb 6, 2025
7c1c19c
chore(frontend): Migrate from NextUI to HeroUI via codemod (#6635)
amanape Feb 6, 2025
945bdd7
Better error logging in posthog (#6346)
neubig Feb 6, 2025
03f4745
Add o1 to verified models (#6642)
mamoodi Feb 6, 2025
3fa1fb7
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Feb 7, 2025
b5391f5
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Feb 20, 2025
3b504e1
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Feb 27, 2025
677488d
Merge branch 'main' into browser_condenser
adityasoni9998 Feb 27, 2025
3b7d86e
Code for evaluation run 2 on GAIA.
adityasoni9998 Mar 1, 2025
e561aa9
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Mar 1, 2025
19e5e5b
Changes to GAIA prompt and browser tool.
adityasoni9998 Mar 1, 2025
e9c32f7
Merge branch 'main' into eval_fixes
adityasoni9998 Mar 1, 2025
840b350
Added search engine tool in openhands. Refine prompt for GAIA. Made b…
adityasoni9998 Mar 3, 2025
edea1e9
Optimize memory usage in FileEditObservation (#6622)
neubig Feb 7, 2025
72a4ece
fix: handle SAAS mode properly in useSettings hook (#6646)
tofarr Feb 7, 2025
8cf58c6
feat: Add LocalRuntime (#5284)
xingyaoww Feb 7, 2025
f2781a5
chore(frontend): Take into account other error message types (#6647)
amanape Feb 7, 2025
102f2e4
fix: set `tool_choice` to none for non-fncall models (#6652)
xingyaoww Feb 7, 2025
04d5691
fix(6223): More properly add 'pyproject.toml' and 'poetry.lock' to th…
zchn Feb 8, 2025
4b23a77
chore(deps): bump the version-all group across 1 directory with 3 upd…
dependabot[bot] Feb 10, 2025
3d2268e
Removed in page callback (#6657)
tofarr Feb 10, 2025
6dce886
[Bug fix]: Standardize SecretStr use (#6660)
malhotra5 Feb 10, 2025
0588311
Add comprehensive OpenHands glossary (#6310)
rbren Feb 10, 2025
477b369
[Enhancement]: Handle GH token refresh inside runtime (#6632)
malhotra5 Feb 10, 2025
01d23ae
chore(deps): bump the version-all group in /frontend with 4 updates (…
dependabot[bot] Feb 10, 2025
7113d38
[Resolver]: Add target branch param (#6668)
malhotra5 Feb 10, 2025
8e55ced
Fix issue #6262: Add success/failure indicators for file read/edit op…
neubig Feb 10, 2025
1a5c252
Fix for issue where temp file is empty (#6669)
tofarr Feb 10, 2025
fcfc807
fix: Normalize whitespace when comparing patch context lines (#6541)
neubig Feb 10, 2025
a0c9451
fix: adding support for environment variables type dict (#6672)
fredysierra Feb 10, 2025
27e39fb
chore(deps): bump docker/setup-qemu-action from 3.3.0 to 3.4.0 (#6666)
dependabot[bot] Feb 10, 2025
b60acd5
chore(deps): bump the version-all group across 1 directory with 9 upd…
dependabot[bot] Feb 10, 2025
fba1f1b
hotfix: Typecheck routes during frontend build (#6676)
amanape Feb 10, 2025
508dea4
Bump OpenHands ACI to 0.2.1 (#6678)
xingyaoww Feb 10, 2025
a98f31f
Clean up global in llm.py (we figured it's not needed) (#6675)
enyst Feb 10, 2025
807b0bf
feat(runtime): use `prlimit` to limit resource usage of command to av…
xingyaoww Feb 11, 2025
25b73f9
fix(frontend): fix public github repo cannot be selected (#6680)
xyeric Feb 11, 2025
b71b723
refactor(runtime): Use openhands-aci file editor directly in runtime …
xingyaoww Feb 11, 2025
26f235f
Fix debug in remote runtime (#6688)
rbren Feb 11, 2025
e2add5c
chore(deps-dev): bump @tanstack/eslint-plugin-query from 5.66.0 to 5.…
dependabot[bot] Feb 11, 2025
e59878a
refactor: do not add DEBUG env var when it is not set (#6690)
xingyaoww Feb 11, 2025
dfbd928
Revert "Only show start project button in conversations" (#6698)
amanape Feb 12, 2025
70d7982
using all available system memory when RUNTIME_MAX_MEMORY_GB is not s…
xingyaoww Feb 12, 2025
1f01edb
Fix log formatting error (#6699)
tofarr Feb 12, 2025
7f5d767
chore: Throw a 404 instead of returning defaults if settings does not…
amanape Feb 12, 2025
bc18869
Feat: Add selected branch param to backend (#6508)
malhotra5 Feb 12, 2025
759a03f
Agent session no longer stuck in starting on raised exception (#6703)
tofarr Feb 13, 2025
6c49c5e
Release 0.24.0 (#6689)
mamoodi Feb 13, 2025
3626e29
More effective remote runtime identification (#6714)
tofarr Feb 13, 2025
dc4ca3a
fix: Filter `AgentCondensationObservation` events from agent state (#…
csmith49 Feb 13, 2025
8710951
chore(deps): bump the version-all group across 1 directory with 5 upd…
dependabot[bot] Feb 13, 2025
a418d0d
Evaluation harness: Add agent config option (#6662)
li-boxuan Feb 13, 2025
72237e4
feat(resolver): implement gitlab resolver (#6458)
wtiger9218 Feb 13, 2025
227f2f7
fix: Simplify nested f-string to fix pydoc-markdown parsing (#6717)
malhotra5 Feb 14, 2025
c50bc33
hotfix(Resolver): Workflow definition is out of sync with released pa…
malhotra5 Feb 14, 2025
f16bab0
feat(frontend): Settings screen (#6550)
amanape Feb 14, 2025
1227e50
[Resolver]: Prep env in expectation of release (#6735)
malhotra5 Feb 14, 2025
3ea122f
chore: upgrade `openhands-aci` to 0.2.2 (#6731)
ryanhoangt Feb 14, 2025
93e5441
chore(deps): bump the version-all group across 1 directory with 12 up…
dependabot[bot] Feb 15, 2025
5e469c5
docs: improve docstrings for CLI and config utils (#5398)
young010101 Feb 15, 2025
42ff2b7
Add a sanity test for load_app_config and get_agent_config_arg (#6723)
li-boxuan Feb 15, 2025
67fbf59
A few fixes for TAC evaluation harness (#6586)
li-boxuan Feb 15, 2025
643664a
Show docker build errors (#6695)
kripper Feb 15, 2025
af2e91d
fix: no interaction when clearing poetry cache (#6752)
arpandaze Feb 17, 2025
c6f4ddd
chore(deps): bump the version-all group in /frontend with 4 updates (…
dependabot[bot] Feb 17, 2025
debd000
docs(runtime): fix broken links of benchmarks (#6744)
nbyidiandian Feb 17, 2025
ee0225c
feat: implement optimistic updates for conversation deletion (#6745)
tofarr Feb 17, 2025
a3e134b
Added iterate method and additional tests for search functions (#6756)
tofarr Feb 17, 2025
e5599ef
Better LLM retry behavior (#6557)
rbren Feb 17, 2025
cb9eeae
Fix caps in status message (#6761)
rbren Feb 17, 2025
849d271
Improve SensitiveDataFilter and add comprehensive tests (#6755)
tofarr Feb 17, 2025
2988b10
fix: disable prlimit since limiting --vm breaks nodejs (#6765)
xingyaoww Feb 17, 2025
a1ffdce
Enable the multi conversation UI for all users (#6374)
tofarr Feb 17, 2025
c42c9cb
hotfix(Secrets): Add event stream filter for refreshed secret (#6764)
malhotra5 Feb 17, 2025
2f3bd3d
[Docs]: Cloud Openhands (#6747)
malhotra5 Feb 17, 2025
7aeaac5
Upgrade tree sitter (#6740)
neubig Feb 17, 2025
e5679f6
Update OpenHands Cloud docs with correct permissions and instructions…
mamoodi Feb 17, 2025
dbe2b33
Add selected branch to convo metadata (#6773)
malhotra5 Feb 17, 2025
1ef0d8f
CSS Fixes (#6770)
tofarr Feb 18, 2025
a195ada
docs: add guide for minimum computing and storage requirements (#6575)
abhiejam Feb 18, 2025
a5ea1db
hotfix: Consistent background color (#6786)
amanape Feb 18, 2025
d2398fe
hotfix(frontend): Input set/unset state and disable runtime input (#6…
amanape Feb 18, 2025
ffa5fbe
chore(deps): bump the version-all group across 1 directory with 9 upd…
dependabot[bot] Feb 18, 2025
5ac5e22
hotfix: Conversation panel toggle should change color given state (#6…
amanape Feb 18, 2025
4f00bf4
fix(frontend): Hide modal when in settings page if first time (#6792)
amanape Feb 18, 2025
a4c60d9
feat(SaaS): Billing settings screen (#6495)
amanape Feb 18, 2025
d5ee338
enh: Refactor `Event` -> `Message` pipeline outside of `CodeActAgent`…
csmith49 Feb 18, 2025
bf70965
Add `sysbox` support to remote runtime for eval; Add memory monitor, …
xingyaoww Feb 18, 2025
843f092
Fix type checking errors in resolver directory (#6738)
neubig Feb 19, 2025
ce218a1
Fix `diskcache` breaking CI & eval intermittently (#6817)
ryanhoangt Feb 19, 2025
6108434
Fix mypy errors in storage directory (#6809)
neubig Feb 19, 2025
d0ca8db
fix: Avoid infinite loop with rolling condensers and history truncati…
csmith49 Feb 19, 2025
bb559a2
Fix download workspace zip file event loop hanging (#6722)
diwu-sf Feb 19, 2025
83ecc20
Update openhands-aci to 0.2.5 (#6834)
ryanhoangt Feb 19, 2025
ec1ce2f
feat: better error logging for remote runtime (#6805)
xingyaoww Feb 19, 2025
578bf14
hotfix azure (#6806)
enyst Feb 19, 2025
659ca02
Release 0.25.0 (#6782)
mamoodi Feb 19, 2025
a253243
Update documentation with new settings page (#6716)
mamoodi Feb 19, 2025
336eb98
Clean up NullObservations from the stream (#6260)
enyst Feb 19, 2025
e5fc1af
Refactor I/O utils; allow 'task' command line parameter in cli.py (#6…
enyst Feb 19, 2025
1ddb68e
fix: LLM summarization prompt handles user messages (#6837)
csmith49 Feb 19, 2025
58ff252
hotfix: Remove external link in billing settings UI (#6841)
amanape Feb 20, 2025
507aa39
hotfix: Set proper minimum and maximum defaults that can be entered i…
amanape Feb 20, 2025
dc045d2
Fix: Less squashed logo (#6853)
tofarr Feb 20, 2025
2984ed7
Add conversation age limit configuration (#6763)
rbren Feb 20, 2025
5a8056b
chore(frontend): Standardize custom colors used throughout the app (#…
amanape Feb 20, 2025
85c5141
[Bug]: Fix workflow definition for installation phase of resolver (#6…
malhotra5 Feb 20, 2025
cf727c9
Fix: Simplify prompt caching for new Anthropic API (#6860)
enyst Feb 20, 2025
18d6dee
chore(deps): bump the version-all group across 1 directory with 10 up…
dependabot[bot] Feb 21, 2025
aaf3e08
Docs: Clarify config.toml usage in evaluation harness (#6828)
xingyaoww Feb 21, 2025
316322a
Add enable_history_truncation option to disable history truncation (#…
li-boxuan Feb 21, 2025
58cfd16
Save complete trajectory in presence of history truncation (#6751)
li-boxuan Feb 21, 2025
f8d468c
Fix jumpy conversation panel (#6874)
tofarr Feb 21, 2025
d26d925
chore(frontend): Remove latest conversation text in home screen (#6851)
amanape Feb 21, 2025
fe42b7e
fix: Add missing type annotations in utils/ directory (#6687)
neubig Feb 21, 2025
4915d20
Fix mypy errors in agenthub directory (#6811)
neubig Feb 21, 2025
25b0192
(feat): Enable memory condensation from settings page (#6868)
csmith49 Feb 21, 2025
d36a477
Add info logs for microagent loading and triggering (#6882)
rbren Feb 21, 2025
b32b4be
Fix: File Descriptor leak (#6883)
tofarr Feb 21, 2025
11c37e0
refactor : Improve frontend setup doc and locale error (#6850)
dai-dao Feb 21, 2025
afc7605
Fix: Increase Entropy Requirement for Secret Redaction to Reduce Fals…
tofarr Feb 22, 2025
aa640af
Revert "Fix: File Descriptor leak" (#6887)
enyst Feb 22, 2025
cf1265d
Fix for regression where conversations are not clickable (#6886)
tofarr Feb 22, 2025
c26c747
Keep the first user message by default in condensers (#6888)
enyst Feb 23, 2025
65524f1
Use LLM APIs responses in token counting (#5604)
enyst Feb 23, 2025
cb3edf2
Display session ID in CLI mode
enyst Feb 24, 2025
8cd6e60
hotfix: Fix switch color regression (#6881)
amanape Feb 24, 2025
30d76c3
Daytona Runtime (#6863)
idagelic Feb 24, 2025
974a46c
Fix mypy errors in security/invariant directory (#6908)
neubig Feb 24, 2025
8ba8990
Fix mypy errors in core directory (#6901)
neubig Feb 24, 2025
5a16979
Fix file descriptor leak (#6897)
tofarr Feb 24, 2025
bae3887
Small rename to long term memory (#6914)
enyst Feb 24, 2025
591e0d9
Handle Docker version string with +dfsg1 (#6732)
kripper Feb 24, 2025
3e455ad
chore: Make remote runtime class default to None (#6919)
xingyaoww Feb 24, 2025
03cdded
Replace shebang with /usr/bin/env bash for improved portability (#6876)
mateuszkwiatkowski Feb 24, 2025
2422d70
Revert "Fix file descriptor leak (#6897)" (#6921)
tofarr Feb 24, 2025
584aa7c
add extended generic section (#5932)
celek Feb 24, 2025
53f8dc1
Release 0.26.0 (#6915)
mamoodi Feb 24, 2025
6d248ee
Add documentation checkbox to PR template (#6924)
mamoodi Feb 24, 2025
aebfa76
chore(frontend): Claude 3.7 is visible in dropdown for selection (#6931)
amanape Feb 25, 2025
facd03a
feat(llm): Add Claude 3.7 backend configurations (#6937)
neubig Feb 25, 2025
693da0a
Add pause_closed_runtimes config to pause instead of stop runtimes (#…
rbren Feb 25, 2025
218dc54
[Feat]: Adding endpoint for suggested tasks Openhands could tackle (#…
malhotra5 Feb 26, 2025
e2fb796
fix: `task_str` validation not required for trajectory replay (#6957)
ryanhoangt Feb 26, 2025
9858c89
Refactor llm config from toml and clean up (#6923)
enyst Feb 26, 2025
7fa0959
Add ability to define custom runtime classes (#6955)
raymyers Feb 26, 2025
04175c7
Fix fd leak (#6950)
tofarr Feb 26, 2025
1a9b284
Add selected_repo to command line (#6949)
enyst Feb 26, 2025
a44fd2c
hotfix(frontend): Consistent buttons and their styles throughout the …
amanape Feb 26, 2025
3815cfc
Fix microagent matching to the user message, not previous enhancement…
enyst Feb 26, 2025
4f5c7d2
Refactor agent_config loading from toml (#6967)
enyst Feb 26, 2025
16c54e0
refactor: codeact tools into separate files (#6978)
xingyaoww Feb 26, 2025
288f46a
chore(deps): bump the version-all group across 1 directory with 11 up…
dependabot[bot] Feb 26, 2025
994f4f8
Azure completion_tokens fix (take two) (#6975)
enyst Feb 27, 2025
855ccb8
Add system event listeners for monitoring (#6929)
raymyers Feb 27, 2025
32fda5f
[agent] System message update (#6787)
xingyaoww Feb 27, 2025
39ae2bd
chore(deps): bump react-icons from 5.4.0 to 5.5.0 in /docs in the ver…
dependabot[bot] Feb 27, 2025
2ebc816
[eval] Upgrade SWE-Bench to use official image and latest harness (#6…
xingyaoww Feb 27, 2025
7d0befc
Refactor sandbox and security configurations (#6973)
enyst Feb 27, 2025
cb8cfe0
Fix for error cleaning stale (#6971)
tofarr Feb 27, 2025
d1546f4
Feat out of credits msg (#6969)
tofarr Feb 27, 2025
c839a5f
hotfix(frontend): Truncate long conversation card titles (#7001)
amanape Feb 27, 2025
17dda08
Fix image tag inconsistency in forked-PR workflows (#6998)
zchn Feb 27, 2025
82abb23
Refactor: Moving ConversationInfo to server module (#6981)
tofarr Feb 27, 2025
57a1768
Release 0.27.0 (#6993)
mamoodi Feb 27, 2025
065ab7b
feat: add sound and browser notifications for agent state changes (#6…
xingyaoww Feb 27, 2025
9c48604
Page Refresh now restarts agent loop if status is STOPPED or ERROR (#…
tofarr Feb 27, 2025
242e5b1
add add_agent.md (#6891)
jaybutera Feb 27, 2025
033f004
chore(deps): bump the version-all group across 1 directory with 7 upd…
dependabot[bot] Feb 27, 2025
fd27d4f
Re-add separators between user messages (#7004)
enyst Feb 27, 2025
f2503fe
[agent] Add "thinking" tool only (#6977)
xingyaoww Feb 27, 2025
a74d972
Add Memory Monitor VSCode Extension (#6951)
xingyaoww Feb 27, 2025
9db439f
Separate additional_info template (#6996)
enyst Feb 27, 2025
7612a56
fix: Remove nested git repositories before adding files in SWE-bench …
magic3007 Feb 28, 2025
f5c7942
Refactor to a helper class for the agent's history (ConversationMemor…
enyst Feb 28, 2025
cdd9f58
chore(deps-dev): bump llama-index from 0.12.20 to 0.12.21 in the llam…
dependabot[bot] Feb 28, 2025
8d0e423
[agent] improve finish tool for sonnet 3.7 (#7002)
xingyaoww Feb 28, 2025
80b9552
feat: Adding sandbox property runtime_binding_address to specify whic…
fredysierra Feb 28, 2025
21ef089
Add diff for edit observation and display in UI (#7014)
ryanhoangt Feb 28, 2025
fae10c8
Fix: Update context window exceeded detection (#7024)
csmith49 Feb 28, 2025
8a9fdbe
Support docker_runtime_kwargs dict (#7025)
kripper Feb 28, 2025
69a1c9a
Keycloak changes (#6986)
chuckbutkus Feb 28, 2025
0dec0fb
Bug fixes (#6460)
kripper Feb 28, 2025
979135b
Add Docker microagent for installation and usage (#7027)
xingyaoww Feb 28, 2025
b69ecc5
Add Kubernetes microagent (#7028)
xingyaoww Feb 28, 2025
de09b36
Fix URL for staging stack (#7030)
chuckbutkus Feb 28, 2025
3e4dd38
Structured logging mode (#7034)
raymyers Mar 1, 2025
ea95639
Remove hard error on session reuse (#7026)
rbren Mar 1, 2025
06e6988
Create CITATION.cff (#7037)
kiwamizamurai Mar 1, 2025
96e4831
Separate microagent template (#7041)
enyst Mar 1, 2025
7295344
Update docker.py to support podman (#6778)
Pepsi1x1 Mar 1, 2025
92cc519
Fix argument in swe-bench grading scripts (#7046)
enyst Mar 2, 2025
d6605de
Updates to the ISSUE TRIAGE (#7043)
mamoodi Mar 2, 2025
78db84f
chore: update daytona readme (#7053)
idagelic Mar 2, 2025
9665406
chore: daytona readme quick start verbosity (#7056)
idagelic Mar 2, 2025
d050f48
Fix GitLab CI environment variable check (issue #7050) (#7052)
enyst Mar 2, 2025
3004073
More explicit feedback message about how to report errors to develope…
neubig Mar 2, 2025
869e291
[Refactor] split runtime initialization (create, connect, init) in cl…
enyst Mar 2, 2025
f73edd7
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Mar 3, 2025
d071acf
Update browser tool description
adityasoni9998 Mar 3, 2025
a578cf3
Merge remote-tracking branch 'upstream/main'
adityasoni9998 Mar 13, 2025
35ab168
Merge branch 'main' into eval_fixes
adityasoni9998 Mar 13, 2025
a880f55
Added fixes for file downloads, goto action timeouts, and some other …
adityasoni9998 Mar 13, 2025
a05b39d
Added fixes for output formatting, scrolling, timeouts for actions ot…
adityasoni9998 Mar 14, 2025
bc47edc
Fix for context window errors for large web-pages using filter visibl…
adityasoni9998 Mar 21, 2025
235d470
Added planning prompt to agent which is added to event stream every k…
adityasoni9998 Mar 23, 2025
d030d77
Minor fixes to prompt and added TODOs.
adityasoni9998 Mar 23, 2025
7efc928
Code for run 5 on GAIA: use planning, forced finish upon reaching max…
adityasoni9998 Mar 28, 2025
d4acd7d
Minor fixes to code and prompt.
adityasoni9998 Mar 28, 2025
555ba35
minor fix to run_infer.sh for path.
adityasoni9998 Apr 3, 2025
416aa73
Initial Code for TAC evaluation
adityasoni9998 Apr 3, 2025
378d5a8
Update eval code for TAC
adityasoni9998 Apr 3, 2025
7fff512
Code for TAC eval (keeping a simple prompt for now)
adityasoni9998 Apr 3, 2025
b139681
Minor changes - increase timeout for init task env and fix result sum…
adityasoni9998 Apr 6, 2025
0492464
Fix for system installation of packages in OpenHands docker runtime
adityasoni9998 Apr 8, 2025
83b176b
Final code for TAC eval
adityasoni9998 Apr 18, 2025
2d78897
Code for swe-bench multimodal eval
adityasoni9998 Apr 20, 2025
49f8146
Added fixes for swe-bench multimodal eval code.
adityasoni9998 Apr 28, 2025
f5846b8
Add Tavily search API to OpenHands
adityasoni9998 May 4, 2025
fa9e4bf
Update run_infer.py to fix patch generation
adityasoni9998 May 5, 2025
2549b7f
More fixes to patch generation in OpenHands
adityasoni9998 May 5, 2025
89c3d3f
Minor changes to prompt and minor fixes
adityasoni9998 May 7, 2025
86c021e
Final code for SWE-Bench Multimodal Eval on test set
adityasoni9998 May 11, 2025
2f999ce
anonymize pyproject
adityasoni9998 May 16, 2025
643adcf
Revert "anonymize pyproject"
adityasoni9998 May 19, 2025
c469110
Merge commit 'd648d249d8c1f7ad8ac7d67b5f0e3249986e9427' into xw/versa…
xingyaoww May 20, 2025
3442a5b
Merge commit '3873d9f0027f0211e464bf0ec104578fea935b8f' into xw/versa…
xingyaoww May 20, 2025
a578de7
revert file permission
xingyaoww May 20, 2025
63fcdfc
cleanup script
xingyaoww May 20, 2025
253af33
revert some unnecessary change
xingyaoww May 20, 2025
10dc84a
revert some unnecessary change
xingyaoww May 20, 2025
29267be
use browsergym fork
xingyaoww May 21, 2025
ed9b8a4
merge main
xingyaoww May 23, 2025
157 changes: 135 additions & 22 deletions evaluation/benchmarks/gaia/run_infer.py
@@ -1,11 +1,17 @@
import asyncio
import base64
import functools
import io
import os
import re
import shutil
import zipfile

import huggingface_hub
import numpy as np
import pandas as pd
from datasets import load_dataset
from PIL import Image

from evaluation.benchmarks.gaia.scorer import question_scorer
from evaluation.utils.shared import (
@@ -28,7 +34,11 @@
from openhands.core.config.utils import get_agent_config_arg
from openhands.core.logger import openhands_logger as logger
from openhands.core.main import create_runtime, run_controller
from openhands.events.action import AgentFinishAction, CmdRunAction, MessageAction
from openhands.events.action import (
AgentFinishAction,
CmdRunAction,
MessageAction,
)
from openhands.events.observation import CmdOutputObservation
from openhands.runtime.base import Runtime
from openhands.utils.async_utils import call_async_from_sync
@@ -40,16 +50,26 @@
'CodeActAgent': functools.partial(codeact_user_response, encapsulate_solution=True),
}

# TODO: change this message as per the finish tool you are using.
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': 'When you think you have solved the question, please first send your answer to user through message and then exit.\n'
'CodeActAgent': 'When you think you have solved the question, please use the finish tool and include your final answer in the message parameter of the finish tool. Your final answer MUST be encapsulated within <solution> and </solution>.\n\n'
}
# AGENT_CLS_TO_INST_SUFFIX = {
# 'CodeActAgent': 'When you think you have solved the question, please first send your answer to user through message and then exit using the finish tool.\n\n'
# }


def get_config(
metadata: EvalMetadata,
) -> AppConfig:
search_api_key = os.environ.get('SEARCH_API_KEY', None)
assert search_api_key is not None, 'Environment variable SEARCH_API_KEY is not set.'

sandbox_config = get_default_sandbox_config_for_eval()
sandbox_config.base_container_image = 'python:3.12-bookworm'
sandbox_config.runtime_startup_env_vars = {
'SEARCH_API_KEY': search_api_key,
}
config = AppConfig(
default_agent=metadata.agent_class,
run_as_openhands=False,
@@ -66,7 +86,10 @@ def get_config(
else:
logger.info('Agent config not provided, using default settings')
agent_config = config.get_agent_config(metadata.agent_class)
agent_config.enable_prompt_extensions = False
print(agent_config)
# agent_config.enable_prompt_extensions = False
# agent_config.enable_som_visual_browsing = True

return config
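
Review note: `get_config` guards `SEARCH_API_KEY` with `assert`, and Python skips `assert` statements when started with `-O`. A minimal sketch of an explicit check follows, assuming a hypothetical helper named `require_env` (it is not part of OpenHands). Unlike the `is not None` check in the diff, this sketch also rejects an empty string.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty.

    `require_env` is a hypothetical helper for illustration, not an OpenHands API.
    """
    value = os.environ.get(name)
    if not value:
        # Unlike `assert`, this check still runs when Python is started with -O.
        raise RuntimeError(f'Environment variable {name} is not set.')
    return value
```

With such a helper, the lookup in `get_config` could read `search_api_key = require_env('SEARCH_API_KEY')`.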


@@ -86,31 +109,91 @@ def initialize_runtime(
obs = runtime.run_action(action)
assert obs.exit_code == 0

action = CmdRunAction(command='mkdir -p /workspace/downloads')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
assert obs.exit_code == 0

if instance['file_name'] != '':
# if this question comes with a file, we need to save it to the workspace
assert metadata.data_split is not None
extension_name = instance['file_name'].split('.')[-1]
src_file = os.path.join(
DATASET_CACHE_DIR, '2023', metadata.data_split, instance['file_name']
)
assert os.path.exists(src_file)
dest_file = os.path.join('/workspace', instance['file_name'])
runtime.copy_to(src_file, dest_file)

# rename to file.extension_name
extension_name = instance['file_name'].split('.')[-1]
action = CmdRunAction(
command=f'mv /workspace/{instance["file_name"]} /workspace/file.{extension_name}'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
assert obs.exit_code == 0
if extension_name == 'zip':
temp_dir = os.path.join(
DATASET_CACHE_DIR, '2023', metadata.data_split, 'tmp_file'
)
os.makedirs(temp_dir, exist_ok=True)
with zipfile.ZipFile(src_file, 'r') as zip_ref:
zip_ref.extractall(temp_dir)
for root, dirs, files in os.walk(temp_dir):
for file in files:
dest_file = '/workspace'
runtime.copy_to(os.path.join(root, file), dest_file)
shutil.rmtree(temp_dir)
elif extension_name not in ['jpg', 'png']:
dest_file = '/workspace'
runtime.copy_to(src_file, dest_file)

# rename to file.extension_name
action = CmdRunAction(
command=f'mv /workspace/{instance["file_name"]} /workspace/file.{extension_name}'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
assert obs.exit_code == 0

action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
assert obs.exit_code == 0

logger.info(f'{"-" * 50} END Runtime Initialization Fn {"-" * 50}')
action = CmdRunAction(
command='apt-get update && apt-get install -y ffmpeg'  # on Debian, ffprobe ships inside the ffmpeg package
)
runtime.run_action(action)
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
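
Review note: the zip branch above extracts into a fixed `tmp_file` directory under the dataset cache. That path might collide if several eval workers handle zip instances at the same time. A minimal sketch of the same extract-and-copy step using a per-call temporary directory follows. `copy_zip_members` and its `copy_fn` parameter are hypothetical names; `copy_fn` stands in for `runtime.copy_to`.

```python
import os
import tempfile
import zipfile
from typing import Callable, List


def copy_zip_members(
    zip_path: str,
    copy_fn: Callable[[str, str], None],
    dest_dir: str = '/workspace',
) -> List[str]:
    """Extract `zip_path` to a private temp dir and call `copy_fn(src, dest_dir)` per file.

    Hypothetical helper: `copy_fn` stands in for `runtime.copy_to`.
    Returns the sorted base names of the files that were copied.
    """
    copied = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(tmp_dir)
        for root, _dirs, files in os.walk(tmp_dir):
            for name in files:
                copy_fn(os.path.join(root, name), dest_dir)
                copied.append(name)
    # The temporary directory is removed on exit from the `with` block,
    # including when `copy_fn` raises.
    return sorted(copied)
```

A call such as `copy_zip_members(src_file, runtime.copy_to)` would mirror the loop in the diff without the explicit `os.makedirs` and `shutil.rmtree` calls.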


def image_to_png_base64_url(
image: np.ndarray | Image.Image, add_data_prefix: bool = True
):
"""Convert a numpy array to a base64 encoded png image url."""
if isinstance(image, np.ndarray):
image = Image.fromarray(image)
if image.mode in ('RGBA', 'LA'):
image = image.convert('RGB')
buffered = io.BytesIO()
image.save(buffered, format='PNG')

image_base64 = base64.b64encode(buffered.getvalue()).decode()
return (
f'data:image/png;base64,{image_base64}'
if add_data_prefix
else f'{image_base64}'
)


def image_to_jpg_base64_url(
image: np.ndarray | Image.Image, add_data_prefix: bool = True
):
"""Convert a numpy array to a base64 encoded jpeg image url."""
if isinstance(image, np.ndarray):
image = Image.fromarray(image)
if image.mode in ('RGBA', 'LA'):
image = image.convert('RGB')
buffered = io.BytesIO()
image.save(buffered, format='JPEG')

image_base64 = base64.b64encode(buffered.getvalue()).decode()
return (
f'data:image/jpeg;base64,{image_base64}'
if add_data_prefix
else f'{image_base64}'
)
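
Review note: `image_to_png_base64_url` and `image_to_jpg_base64_url` differ only in the Pillow save format and the MIME type. The encoding tail could be shared. A minimal stdlib-only sketch of that shared step follows; `bytes_to_data_url` is a hypothetical name, and the bytes are assumed to come from `buffered.getvalue()`.

```python
import base64


def bytes_to_data_url(
    data: bytes, mime_type: str, add_data_prefix: bool = True
) -> str:
    """Base64-encode already-serialized image bytes, optionally as a data URL.

    Hypothetical helper; `mime_type` would be 'image/png' or 'image/jpeg'.
    """
    encoded = base64.b64encode(data).decode('ascii')
    if add_data_prefix:
        return f'data:{mime_type};base64,{encoded}'
    return encoded
```

Each format-specific function would then only pick the save format and pass the matching MIME type.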


def process_instance(
@@ -134,16 +217,43 @@ def process_instance(
dest_file = None

# Prepare instruction
instruction = f'{instance["Question"]}\n'
instruction = f"""You have one question to answer. It is paramount that you provide a correct answer.
Give it all you can: I know for a fact that you have access to all the relevant tools to solve it and find the correct answer (the answer does exist). Failure or 'I cannot answer' or 'None found' will not be tolerated, success will be rewarded.
You must make sure you find the correct answer! You MUST strictly follow the task-specific formatting instructions for your final answer.
Here is the task:\n{instance['Question']}\n\n"""
logger.info(f'Instruction: {instruction}')
image_urls = []
if dest_file:
instruction += f'\n\nThe mentioned file is provided in the workspace at: {dest_file.split("/")[-1]}'

instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
instruction += 'Please encapsulate your final answer (answer ONLY) within <solution> and </solution>.\n'
if extension_name not in ['jpg', 'png', 'zip']:
instruction += f'To solve this task you will have to use the attached file provided in the workspace at location: {dest_file}\n\n'
elif extension_name == 'zip':
filenames = []
src_file = os.path.join(
DATASET_CACHE_DIR, '2023', metadata.data_split, instance['file_name']
)
with zipfile.ZipFile(src_file, 'r') as zip_ref:
filenames = zip_ref.namelist()
filenames = [f'/workspace/{file}' for file in filenames]
filenames = ', '.join(filenames)
instruction += f'To solve this task you will have to use the attached files provided in the workspace at locations: {filenames}\n\n'
else:
src_file = os.path.join(
DATASET_CACHE_DIR, '2023', metadata.data_split, instance['file_name']
)
instruction += 'Image: To solve this task you will have to use the image shown below.\n\n'
image = Image.open(src_file)
if extension_name == 'jpg':
image_urls.append(image_to_jpg_base64_url(image))
else:
image_urls.append(image_to_png_base64_url(image))
instruction += """IMPORTANT: When seeking information from a website, REFRAIN from arbitrary URL navigation. You should utilize the designated search engine tool with precise keywords to obtain relevant URLs or use the specific website's search interface. DO NOT navigate directly to specific URLs as they may not exist.\n\nFor example: if you want to search for a research paper on Arxiv, either use the search engine tool with specific keywords or navigate to arxiv.org and then use its interface.\n"""
instruction += 'IMPORTANT: You should NEVER ask for Human Help.\n'
instruction += 'IMPORTANT: Please encapsulate your final answer (answer ONLY) within <solution> and </solution>. Your answer will be evaluated using string matching approaches so it is important that you STRICTLY adhere to the output formatting instructions specified in the task (e.g., alphabetization, sequencing, units, rounding, decimal places, etc.)\n'
instruction += (
'For example: The answer to the question is <solution> 42 </solution>.\n'
)
instruction += "IMPORTANT: Your final answer should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, express it numerically (i.e., with digits rather than words), do not use commas, and do not include units such as $ or percent signs unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities). If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.\n"

# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(metadata.agent_class, '')
logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
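
Review note: the zip branch builds the file list from `zip_ref.namelist()`, which also returns directory entries and nested paths. `initialize_runtime`, in contrast, copies every extracted file straight into `/workspace`. If an archive contains folders, the advertised paths may not match the copied ones. A minimal sketch that lists flat workspace paths follows, assuming the flat copy behavior is intended; `workspace_paths_for_zip` is a hypothetical name.

```python
import posixpath
import zipfile
from typing import List


def workspace_paths_for_zip(zip_path: str, workspace: str = '/workspace') -> List[str]:
    """List the flat workspace paths for the files in a zip archive.

    Hypothetical helper: skips directory entries and keeps only each member's
    base name, matching a copy step that places every file directly in `workspace`.
    """
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        return [
            posixpath.join(workspace, posixpath.basename(info.filename))
            for info in zip_ref.infolist()
            if not info.is_dir()
        ]
```

Two members with the same base name in different folders would still map to one path, so the flat layout itself may deserve a second look.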
@@ -156,7 +266,9 @@ def process_instance(
state: State | None = asyncio.run(
run_controller(
config=config,
initial_user_action=MessageAction(content=instruction),
initial_user_action=MessageAction(
content=instruction, image_urls=image_urls
),
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
metadata.agent_class
@@ -175,7 +287,7 @@ def process_instance(
for event in reversed(state.history):
if event.source == 'agent':
if isinstance(event, AgentFinishAction):
model_answer_raw = event.thought
model_answer_raw = event.final_thought
break
elif isinstance(event, CmdRunAction):
model_answer_raw = event.thought
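
Review note: the prompt asks the agent to wrap its answer in `<solution>` and `</solution>`, and this hunk now reads the raw answer from `event.final_thought`. The parsing code is outside the visible hunk, so the following is only a sketch of how such a tag could be extracted. `extract_solution` is a hypothetical name, and taking the last match is an assumption rather than the PR's behavior.

```python
import re
from typing import Optional

# Non-greedy match so that several <solution> blocks are captured separately;
# DOTALL lets an answer span multiple lines.
SOLUTION_RE = re.compile(r'<solution>(.*?)</solution>', re.DOTALL)


def extract_solution(text: Optional[str]) -> Optional[str]:
    """Return the stripped content of the last <solution> block, or None if absent."""
    matches = SOLUTION_RE.findall(text or '')
    if not matches:
        return None
    return matches[-1].strip()
```

Stripping whitespace matters here because the prompt's own example is written as `<solution> 42 </solution>`.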
@@ -222,6 +334,7 @@ def process_instance(
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
runtime.close()
return output
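
Review note: `runtime.close()` is called just before `return output`, so it would be skipped if anything earlier in `process_instance` raises. A minimal sketch of closing on every exit path follows. `FakeRuntime` and `run_with_runtime` are hypothetical stand-ins; the only assumption about the real runtime is that it exposes `close()`.

```python
from contextlib import closing


class FakeRuntime:
    """Stand-in for the real runtime; it only models `close()`."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def run_with_runtime(runtime, work):
    """Run `work(runtime)` and close the runtime afterwards, even on error."""
    # contextlib.closing calls runtime.close() when the block exits,
    # whether it returns normally or raises.
    with closing(runtime):
        return work(runtime)
```

A plain `try` / `finally: runtime.close()` around the body of `process_instance` would have the same effect.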


4 changes: 2 additions & 2 deletions evaluation/benchmarks/gaia/scripts/run_infer.sh
@@ -1,4 +1,4 @@
#!/usr/bin/env bash
#!/bin/bash
set -eo pipefail

source "evaluation/utils/version_control.sh"
@@ -39,7 +39,7 @@ echo "LEVELS: $LEVELS"
COMMAND="poetry run python ./evaluation/benchmarks/gaia/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \
--max-iterations 54 \
--level $LEVELS \
--data-split validation \
--eval-num-workers $NUM_WORKERS \