
RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads
Vijayasri Iyer, Maahin Rathinagiriswaran, Jyothikamalesh S
arXiv:2602.12877, 2025
- RoadscapesQA is a novel multitask and multimodal dataset that addresses the scarcity of autonomous-driving benchmarks for unstructured road environments such as Indian roads.
- Collected over 5 hours of driving footage (9,000 final images) and implemented a resource-efficient pipeline using YOLO-World for initial object detection, followed by rule-based heuristics to automatically generate 7 ground-truth QA pairs per image.
- The dataset supports four key tasks: object detection, drivable area segmentation, object counting, and image-level visual question answering (VQA).
- GPT-4o demonstrated the strongest semantic reasoning in Surrounding Description with a similarity score of 0.701.
- Conducted a detailed hallucination analysis, finding that models struggle most with fine-grained Object Description (50.8%–61.6% hallucination rates).
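The rule-based QA generation step described above can be sketched roughly as follows. The paper's exact heuristics and templates are not specified here, so the function name, the question templates, and the detection format are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def generate_counting_qa(detections):
    """Illustrative rule-based heuristic: turn detector output into
    counting QA pairs via fixed question templates (hypothetical,
    not the paper's actual pipeline)."""
    # Tally detected labels, e.g. from a YOLO-World pass over one image.
    counts = Counter(d["label"] for d in detections)
    qa_pairs = []
    for label, n in sorted(counts.items()):
        # Template-fill a counting question with the ground-truth count.
        qa_pairs.append((f"How many instances of '{label}' are in the image?", str(n)))
    return qa_pairs

# Example: mock detections for a single frame (boxes are [x1, y1, x2, y2]).
detections = [
    {"label": "car", "box": [10, 20, 110, 90]},
    {"label": "car", "box": [150, 30, 260, 100]},
    {"label": "auto-rickshaw", "box": [300, 40, 380, 120]},
]
print(generate_counting_qa(detections))
```

A real pipeline would add further templates (presence, spatial relations, drivable-area queries) to reach the dataset's 7 QA pairs per image.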
Computer Vision · VLM · Autonomous Driving · GenAI · YOLO

