MLMM: MULTI-MODAL LANGUAGE MODELS BASED MULTI-AGENTS COLLABORATED SOLVING CV PROBLEMS
Volume 6, Issue 3, Pp 59-62, 2024
DOI: 10.61784/jcsee3020
Author(s)
YanQiao Ji
Affiliation(s)
Liaoning Equipment Manufacturing Vocational and Technical College, Shenyang 110161, Liaoning, China.
Corresponding Author
YanQiao Ji
ABSTRACT
To enhance a system’s ability to interact with and comprehend the visual world, we integrate vision capabilities into a large language model (LLM) framework, specifically the AutoGen framework. By employing a multi-agent conversation methodology, our work aims to mitigate errors in image generation and improve the system’s output so that it better meets user expectations. Our work builds on the Meta Llama 3 family of pretrained and instruction-tuned generative text models, which are optimized for dialogue applications, together with LLaVA for image recognition and the Stable Diffusion model for image generation. We focus on addressing vision-related problems and discuss the potential for further enhancements with the support of more sophisticated models.
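The sketch below is illustrative only and is not the paper's released code: it shows one way such an AutoGen multi-agent loop could be wired, in which an LLM-driven critic refines text-to-image prompts and a user proxy executes a Stable Diffusion generation tool. The local Llama 3 endpoint, the checkpoint name, and the generate_image helper are assumptions made for illustration; in the full system a LLaVA-based agent would additionally describe each generated image so the critic can judge it against the user's request.

```python
# Illustrative sketch only (not the paper's code): a minimal AutoGen
# multi-agent loop where a critic agent refines text-to-image prompts and a
# user proxy runs a Stable Diffusion tool. Endpoint, checkpoint, and the
# generate_image helper are assumptions.
import autogen
from diffusers import StableDiffusionPipeline

# Assumed OpenAI-compatible endpoint serving a Llama 3 instruct model.
llm_config = {
    "config_list": [{
        "model": "llama-3-8b-instruct",
        "base_url": "http://localhost:8000/v1",
        "api_key": "NOT_NEEDED",
    }]
}

# Critic agent built on the instruction-tuned LLM: revises the prompt until
# the generated image matches the user's request.
critic = autogen.AssistantAgent(
    name="critic",
    system_message=(
        "You refine text-to-image prompts. Call generate_image, inspect the "
        "result, and revise the prompt until it matches the user's request. "
        "Reply TERMINATE when satisfied."
    ),
    llm_config=llm_config,
)

# User proxy that executes registered tools; no human in the loop.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    is_termination_msg=lambda m: "TERMINATE" in (m.get("content") or ""),
    code_execution_config=False,
)

# Assumed Stable Diffusion checkpoint used as the image-generation backend.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

@user_proxy.register_for_execution()
@critic.register_for_llm(description="Generate an image from a text prompt and return its file path.")
def generate_image(prompt: str) -> str:
    # Hypothetical tool: in the full system a LLaVA-based agent would describe
    # the saved image so the critic can compare it with the user's request.
    image = pipe(prompt).images[0]
    path = "generated.png"
    image.save(path)
    return path

user_proxy.initiate_chat(
    critic,
    message="Generate a photo of a red bicycle leaning against a brick wall.",
)
```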
KEYWORDS
Software engineering; Artificial intelligence; Deep learning
CITE THIS PAPER
YanQiao Ji. MLMM: Multi-modal language models based multi-agents collaborated solving CV problems. Journal of Computer Science and Electrical Engineering. 2024, 6(3): 59-62. DOI: 10.61784/jcsee3020.
REFERENCES
[1] Yao, S, Zhao, J, Yu, D, et al. ReAct: Synergizing Reasoning and Acting in Language Models, 2023. DOI: https://doi.org/10.48550/arXiv.2210.03629.
[2] Wang, L, Ma, C, Feng, X, et al. A Survey on Large Language Model based Autonomous Agents. Frontiers of Computer Science, 2023, 18. DOI: https://doi.org/10.48550/arXiv.2308.11432.
[3] Li, C, Gan, Z, Yang, Z, et al. Multimodal Foundation Models: From Specialists to General-Purpose Assistants. Foundations and Trends in Computer Graphics and Vision, 2023, 16(1-2): 1-214. DOI: https://doi.org/10.48550/arXiv.2309.10020.
[4] Li, C, Liu, H, Li, L, et al. Elevater: A benchmark and toolkit for evaluating language-augmented visual models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh, eds., Advances in Neural Information Processing Systems, 2022, 35, 9287-9301. DOI: https://doi.org/10.48550/arXiv.2204.08790.
[5] OpenAI. GPT-4 Technical Report. 2023.
[6] Yang, H, Yue, S, He, Y. Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions, 2023. DOI: https://doi.org/10.48550/arXiv.2306.02224.
[7] Patil, SG, Zhang, T, Wang, X, et al. Gorilla: Large Language Model Connected with Massive APIs, 2023. DOI: https://doi.org/10.48550/arXiv.2305.15334.
[8] Nakajima, Y. BabyAGI, 2023.
[9] Li, G, Hammoud, H A A K, Itani, H, et al. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. In Thirty-Seventh Conference on Neural Information Processing Systems, 2023, 2264, 51991-52008. DOI: https://doi.org/10.48550/arXiv.2303.17760.
[10] Wu, Q, Bansal, G, Zhang, J, et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, 2023. DOI: https://doi.org/10.48550/arXiv.2308.08155.
[11] Li, LH, Yatskar, M, Yin, D, et al. Visualbert: A simple and performant baseline for vision and language, 2019. DOI: https://doi.org/10.48550/arXiv.1908.03557.
[12] Hu, X, Gan, Z, Wang, J, et al. Scaling up vision-language pretraining for image captioning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, 17959-17968. DOI: 10.1109/CVPR52688.2022.01745.
[13] Zhou, L, Palangi, H, Zhang, L, et al. Unified vision-language pre-training for image captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 13041-13049. DOI: https://doi.org/10.1609/aaai.v34i07.7005.
[14] Gan, Z, Li, L, Li, C, et al. Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends in Computer Graphics and Vision, 2022, 14(3-4): 163-352. DOI: https://doi.org/10.1561/0600000105.
[15] Li, M, Lv, T, Chen, J, et al. Trocr: Transformer-based optical character recognition with pre-trained models. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(11): 13094-13102. DOI: https://doi.org/10.1609/aaai.v37i11.26538.
[16] Chen, T, Saxena, S, Li, L, et al. Pix2seq: A language modeling framework for object detection. 2022. DOI: https://doi.org/10.48550/arXiv.2109.10852.
[17] Alayrac, J-B, Donahue, J, Luc, P, et al. Flamingo: a visual language model for few-shot learning. 2022. DOI: https://doi.org/10.48550/arXiv.2204.14198.
[18] Tsimpoukelli, M, Menick, J, Cabi, S, et al. Multimodal few-shot learning with frozen language models. NIPS'21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021, 16, 200-212.
[19] Liu, H, Li, C, Wu, Q, et al. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine, eds., Advances in Neural Information Processing Systems, 2024, 36, 34892-34916. DOI: https://doi.org/10.48550/arXiv.2304.08485.
[20] Rombach, R, Blattmann, A, Lorenz, D, et al. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, 10684-10695. DOI: https://doi.org/10.48550/arXiv.2112.10752.