ishidalab

university

https://takashiishida.github.io

Recent Activity

tksii authored a paper 5 days ago

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

skydddoogg updated a dataset 11 days ago

ishidalab/capcode

tksii authored a paper 22 days ago

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Papers

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

tksii

authored a paper 5 days ago

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Paper • 2606.16613 • Published 18 days ago • 9

skydddoogg

updated a dataset 11 days ago

ishidalab/capcode

Viewer • Updated 11 days ago • 378 • 142 • 1

tksii

authored 3 papers 22 days ago

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Paper • 2604.02986 • Published Apr 3 • 3

LLM Routing with Dueling Feedback

Paper • 2510.00841 • Published Oct 1, 2025

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Paper • 2606.07379 • Published 28 days ago • 5

skydddoogg

authored a paper 22 days ago

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Paper • 2606.07379 • Published 28 days ago • 5

skydddoogg

in ishidalab/capcode 22 days ago

Add task category and license metadata

#2 opened 22 days ago by nielsr

skydddoogg

submitted a paper to Daily Papers 22 days ago

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Paper • 2606.07379 • Published 28 days ago • 5

skydddoogg

published a dataset 25 days ago

ishidalab/capcode

Viewer • Updated 11 days ago • 378 • 142 • 1

tksii

updated a dataset about 1 month ago

ishidalab/capbencher

Viewer • Updated May 30 • 15.5k • 36 • 2

skydddoogg

authored a paper 4 months ago

How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Paper • 2505.18102 • Published May 23, 2025 • 2

tksii

authored 2 papers 4 months ago

EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Paper • 2506.08762 • Published Jun 10, 2025 • 1

How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Paper • 2505.18102 • Published May 23, 2025 • 2

skydddoogg

updated a dataset 4 months ago

ishidalab/capbencher

Viewer • Updated May 30 • 15.5k • 36 • 2

tksii

published a dataset 5 months ago

ishidalab/capbencher

Viewer • Updated May 30 • 15.5k • 36 • 2

tksii

updated a dataset 5 months ago

ishidalab/capbencher

Viewer • Updated May 30 • 15.5k • 36 • 2

Team members 4