10th June 2026, by Aleksandra Bakalova | L
Abstract: We propose a method for extracting the algorithms learned by Transformers. We test it on models trained on algorithmic tasks and formal languages. By translating trained models into (D-)RASP programs and simplifying them with circuit discovery, we find that Transformers that generalize well to longer inputs often rely on small, interpretable (D-)RASP programs. This provides direct evidence that such models internally learn simple algorithmic solutions.
Paper: arXiv | Software: GitHub | Keywords: NLP, interpretability, transformers
16th January 2026, by Sweta Mahajan | V
Abstract: We propose a novel Concept Bottleneck Model (CBM) approach called Discover-then-Name-CBM (DN-CBM) that inverts the typical paradigm of first defining the concepts, then learning them. Instead, we use sparse autoencoders to discover concepts that have been learnt by the model, and then name them accordingly. Our concept extraction strategy is efficient because it is agnostic to the downstream task and uses concepts already known to the model, resulting in performant and interpretable CBMs.
Paper: ECCV Proceedings | Software: GitHub | Keywords: CV, concept bottleneck models
15th December 2025, by Yifan Wang | L
Abstract: We introduce RSA-Control, a novel framework for controllable text generation (CTG) that does not require additional training and is grounded in principles of pragmatics via Rational Speech Acts (RSA). By employing recursive reasoning between imaginary speakers and listeners, RSA-Control steers large language models (LLMs) to produce text where desired attributes can be better perceived by listeners. This framework exemplifies a fused neuroexplicit approach, where neural models are combined with explicit knowledge in a post-hoc manner.
Paper: ACL Anthology | Software: GitHub | Keywords: NLP, controlled generation, rational speech act
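To make the idea of recursive speaker-listener reasoning concrete, here is a minimal sketch of a generic RSA-style reweighting step on toy numbers. It is not the RSA-Control implementation: the function names, the toy probabilities, and the rationality parameter `alpha` are assumptions for illustration only. A literal listener infers the attribute from a candidate word, and a pragmatic speaker reweights the base model's word distribution by how well that listener would recover the target attribute.

```python
def literal_listener(attr_likelihoods):
    """L0(a | w): normalize the per-attribute likelihoods of one word."""
    total = sum(attr_likelihoods.values())
    return {a: p / total for a, p in attr_likelihoods.items()}


def pragmatic_speaker(prior, likelihoods, target, alpha=1.0):
    """S1(w | target) proportional to prior(w) * L0(target | w) ** alpha."""
    scores = {}
    for word, p in prior.items():
        l0 = literal_listener(likelihoods[word])[target]
        scores[word] = p * l0 ** alpha
    z = sum(scores.values())
    return {word: s / z for word, s in scores.items()}


# Toy values (hypothetical): a base distribution over three candidate words,
# and how likely each word is under a "positive" vs. "negative" attribute.
prior = {"great": 0.3, "fine": 0.4, "awful": 0.3}
likelihoods = {
    "great": {"positive": 0.8, "negative": 0.2},
    "fine": {"positive": 0.5, "negative": 0.5},
    "awful": {"positive": 0.1, "negative": 0.9},
}

steered = pragmatic_speaker(prior, likelihoods, "positive", alpha=2.0)
print(steered)
```

In this sketch, a larger `alpha` shifts more probability mass toward words that signal the target attribute, while `alpha = 0` leaves the base distribution unchanged.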
14th November 2025, by Ji-Ung Lee | A F L V
Abstract: Neuroexplicit models are a type of machine learning model that combines deep learning with explicit AI, allowing them to utilize the generalization capabilities of deep neural models while also exploiting human-understandable, explicit components. Neurosymbolic models are the most prominent, but by no means the only, kind of neuroexplicit model. In this blog post, we outline neuroexplicit models and, in doing so, provide a new perspective on taxonomizing the growing number of AI models.