MLMM: MULTI-MODAL LANGUAGE MODEL BASED MULTI-AGENT COLLABORATION FOR SOLVING CV PROBLEMS
Keywords:
Software engineering, Artificial intelligence, Deep learning

Abstract
To enhance the system’s ability to interact with and comprehend the visual world, we integrate vision capabilities into a large language model (LLM) framework, specifically the AutoGen framework. By employing a multi-agent conversation methodology, our work aims to mitigate errors in image generation and align the system’s output more closely with user expectations. Our work builds on the Meta Llama 3 family of pretrained and instruction-tuned generative text models, which are optimized for dialogue applications, together with LLaVA for image recognition and the Stable Diffusion model for image generation. We focus on addressing vision-related problems and discuss the potential for further enhancements with the support of more sophisticated models.
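To make the multi-agent conversation methodology concrete, the following is a minimal sketch of a generator-critic refinement loop of the kind the abstract describes. All function bodies here are hypothetical stubs standing in for real model calls (LLaVA for image understanding, Stable Diffusion for generation); the loop structure, not the stubs, is the point.

```python
from typing import Optional

# Hypothetical stand-ins for real model backends. In the actual system these
# would call Stable Diffusion and LLaVA; here they are pure-Python stubs so
# the conversation loop itself can be illustrated and run in isolation.

def generate_image(prompt: str) -> str:
    """Stub for an image-generation call; returns a fake image identifier."""
    return f"image_for::{prompt}"

def describe_image(image: str) -> str:
    """Stub for a vision-language captioning call; echoes the embedded prompt."""
    return image.split("::", 1)[1]

def critic_feedback(user_request: str, caption: str) -> Optional[str]:
    """Stub critic agent: compares the caption of the generated image against
    the user's request and returns a revised prompt, or None to approve."""
    missing = [w for w in user_request.split() if w not in caption.split()]
    if not missing:
        return None  # caption covers every requested concept: approve
    return caption + " " + " ".join(missing)  # fold missing concepts back in

def refine_loop(user_request: str, max_turns: int = 3) -> str:
    """Generator and critic 'agents' converse until the critic approves the
    image or the turn budget is exhausted, mimicking an AutoGen-style chat."""
    image = generate_image(user_request)
    for _ in range(max_turns):
        caption = describe_image(image)
        revision = critic_feedback(user_request, caption)
        if revision is None:
            break  # critic is satisfied; stop the conversation
        image = generate_image(revision)
    return image
```

In the full system each stub is replaced by an agent wrapping a real model, and the conversation is mediated by the framework rather than a hand-written loop, but the error-mitigation mechanism (generate, inspect, critique, regenerate) is the same.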