We use LLaVA-v1.5, a widely used open-source MLLM, as our base model and train it with our contrastive tuning framework (HALVA). We then evaluate its performance on object hallucination mitigation and general visual question answering (VQA) tasks against fine-tuning–based approaches, HA-DPO and EOS. We consider LLaVA-v1.5 the lower bound and GPT-4V a strong reference point given its performance on standard benchmarks.
We use the AMBER benchmark and the Caption Hallucination Assessment with Image Relevance (CHAIR) metric to evaluate MLLM performance on image description tasks, assessing both the hallucination rate and the level of detail in the generated image descriptions. The latter is quantified as the proportion of ground-truth objects present in the image that are accurately captured in the model's output. Our goal is to mitigate hallucinations while retaining or improving the richness of image descriptions. As shown in the left plot below, HALVA captures more ground-truth objects while hallucinating less than HA-DPO. Moreover, while EOS achieves a slightly lower hallucination rate, it degrades the level of detail in the image descriptions, performing worse than HALVA.
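To make these two measurements concrete, here is a minimal sketch of how a per-image hallucination rate (in the spirit of CHAIR) and object coverage can be computed, assuming the generated caption and the image annotations have already been reduced to sets of object names; the function and variable names are illustrative, not part of any benchmark's reference implementation:

```python
def chair_and_coverage(predicted_objects: set, ground_truth_objects: set):
    """Per-image hallucination rate and object coverage.

    Hallucination rate: fraction of mentioned objects not present in the image.
    Coverage: fraction of ground-truth objects that the caption mentions.
    """
    hallucinated = predicted_objects - ground_truth_objects
    chair = len(hallucinated) / max(len(predicted_objects), 1)
    coverage = len(predicted_objects & ground_truth_objects) / max(len(ground_truth_objects), 1)
    return chair, coverage

# Example: the caption mentions {dog, frisbee, car} while the image contains
# {dog, frisbee, grass}. "car" is hallucinated and "grass" is missed,
# so the hallucination rate is 1/3 and coverage is 2/3.
print(chair_and_coverage({"dog", "frisbee", "car"}, {"dog", "frisbee", "grass"}))
```

A lower hallucination rate at the cost of lower coverage (as with EOS) means the model is simply describing less of the image, which is why we track both quantities together.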
We also use the F1-score to compare the performance of MLLMs on visual question answering tasks, using the AMBER benchmark for object hallucination and the TextVQA benchmark for general vision-language accuracy. As shown in the right plot below, both HA-DPO and EOS underperform HALVA in mitigating object hallucination, and even degrade general vision-language abilities relative to the base model.
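For reference, here is a minimal sketch of the F1 computation as it would apply to a discriminative VQA setting with binary yes/no answers (the positive class and the sample data below are illustrative assumptions, not taken from the benchmarks):

```python
def f1_score(predictions: list, labels: list, positive: str = "yes") -> float:
    """F1 over binary answers, treating `positive` as the positive class."""
    tp = sum(p == positive and l == positive for p, l in zip(predictions, labels))
    fp = sum(p == positive and l != positive for p, l in zip(predictions, labels))
    fn = sum(p != positive and l == positive for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Example: 2 of 3 "yes" predictions are correct and one "yes" case is missed,
# giving precision = 2/3, recall = 2/3, and F1 ≈ 0.667.
preds  = ["yes", "no", "yes", "yes", "no"]
labels = ["yes", "no", "yes", "no", "yes"]
print(f1_score(preds, labels))
```

F1 is a useful summary here because a model that hallucinates objects tends to over-answer "yes", which hurts precision, while an over-conservative model misses present objects, which hurts recall.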