Poster in Workshop on Reasoning and Planning for Large Language Models
TACO: Learning Multi-modal Models to Reason and Act with Synthetic Chains-of-Thought-and-Action
Zixian Ma · Jianguo Zhang · Zhiwei Liu · Jieyu Zhang · Juntao Tan · Manli Shu · Juan Carlos Niebles · Shelby Heinecke · Huan Wang · Caiming Xiong · Ranjay Krishna · Silvio Savarese
While open-source multi-modal language models perform well on simple question-answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction-tuning data with only direct answers. TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.9% improvement on average, with gains of up to 20% on MM-Vet tasks involving OCR, mathematical reasoning, and spatial reasoning.
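To make the CoTA inference procedure concrete, here is a minimal Python sketch of a thought-action-observation loop. All of the names here (`cota_inference`, `ScriptedModel`, the step dictionary format, and the tool stubs) are illustrative assumptions, not TACO's actual API; the paper's real tools (OCR, depth estimation, calculator) are replaced by trivial stand-ins.

```python
# Minimal sketch of a chain-of-thought-and-action (CoTA) inference loop.
# Every name here (ScriptedModel, cota_inference, the tool stubs) is an
# illustrative assumption, not TACO's actual interface.

TOOLS = {
    # Stand-ins for the external tools the abstract mentions.
    "ocr": lambda arg: "price: $12, quantity: 3",        # OCR over an image region
    "depth": lambda arg: "object A is nearer than B",    # depth-estimation summary
    "calculator": lambda arg: str(eval(arg, {"__builtins__": {}})),  # arithmetic only
}


def cota_inference(model, question, max_steps=10):
    """Alternate model-generated thoughts/actions with tool observations
    until the model emits a final answer (or the step budget runs out)."""
    trace = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model.generate("\n".join(trace))  # -> {"thought", "action", "input"}
        trace.append(f"Thought: {step['thought']}")
        if step["action"] == "answer":
            trace.append(f"Answer: {step['input']}")
            return step["input"], trace
        # Execute the chosen tool and feed its output back as an observation.
        observation = TOOLS[step["action"]](step["input"])
        trace.append(f"Action: {step['action']}({step['input']!r})")
        trace.append(f"Observation: {observation}")
    return None, trace  # no answer within the step budget


class ScriptedModel:
    """Toy stand-in for a trained multi-modal model: replays a fixed CoTA."""

    def __init__(self, steps):
        self.steps = iter(steps)

    def generate(self, prompt):
        return next(self.steps)


model = ScriptedModel([
    {"thought": "Read the receipt first.", "action": "ocr", "input": "receipt"},
    {"thought": "Total = price x quantity.", "action": "calculator", "input": "12 * 3"},
    {"thought": "I can answer now.", "action": "answer", "input": "$36"},
])
answer, trace = cota_inference(model, "What is the total cost on the receipt?")
print(answer)            # -> $36
print("\n".join(trace))  # the full thought/action/observation trace
```

Synthetic training traces of this thought-action-observation-answer shape are what the filtered 293K-example dataset would contain, though the exact serialization format shown here is an assumption for illustration.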