Publications

Workshop
A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

ICLR Workshop World Models, 2026

We propose a framework for evaluating goal-directedness in LLM agents, integrating behavioural evaluation with interpretability analyses of internal representations.

Paper Code

Conference
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping

ICLR, 2026

LLM dishonesty breaks jailbreak evaluations, but activation probes can catch it.

Paper Code

Conference
ASIDE: Architectural Separation of Instructions and Data in Language Models

Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert

ICLR, 2026

We introduce an architectural change to LLMs that separates instructions from data to improve their security.

Paper Code

Preprint
Towards User-Focused Research in Training Data Attribution for Human-Centered Explainable AI

Elisa Nguyen, Johannes Bertram, Evgenii Kortukov, Jean Y. Song, Seong Joon Oh

ArXiv, 2024

Making XAI more useful by first asking the users about their needs, with a focus on Training Data Attribution.

Paper

Conference
Studying Large Language Model Behaviors Under Context-Memory Conflicts With Real Documents

Evgenii Kortukov, Alexander Rubinstein, Elisa Nguyen, Seong Joon Oh

COLM, 2024

Studying context-memory knowledge conflicts as they appear in practice: how does factual knowledge influence LLM reading behaviors?

Paper Code

Workshop
Exploring Practitioner Perspectives On Training Data Attribution Explanations

Elisa Nguyen, Evgenii Kortukov, Jean Song, Seong Joon Oh

NeurIPS Workshop XAIA, 2023

Interviewing ML practitioners to explore the human factor of training data attribution explanations.

Paper

Journal
Non-Stationary Linear Bandits With Dimensionality Reduction for Large-Scale Recommender Systems

Saeed Ghoorchian, Evgenii Kortukov, Setareh Maghsudi

IEEE Open Journal of Signal Processing, 2023

Using random projections to apply a multi-armed bandit (MAB) algorithm to a recommender system scenario in which the data is high-dimensional and user preferences change over time.

Paper Code

Journal
Contextual Multi-Armed Bandit with Costly Feature Observation in Non-stationary Environments

Saeed Ghoorchian, Evgenii Kortukov, Setareh Maghsudi

IEEE Open Journal of Signal Processing, 2023

We study a contextual multi-armed bandit problem in which features are costly to observe and the agent has to simultaneously learn the reward distributions and the feature importances. The environment undergoes distribution shifts, making the problem more challenging.

Paper Code

Evgenii Kortukov
