ENHANCING NAMED ENTITY RECOGNITION VIA TEST-TIME SCALING MODEL
Keywords:
Named entity recognition, Test-time scaling, Large language model, Zero-shot

Abstract
This paper addresses the challenge of Named Entity Recognition (NER) with large language models (LLMs) in zero-shot and few-shot settings. While LLMs demonstrate promising capabilities, they often generate hallucinations (spurious or inaccurate outputs) that hinder reliable performance. To overcome this limitation, we propose a chain-of-thought scaling approach in which the model explicitly reasons through an inferred thought process before emitting its final entity labels. We evaluate our method on the CoNLL-2003 and Few-NERD benchmarks, demonstrating consistent gains over strong baselines and improving zero-shot NER F1 on Few-NERD from 0.45 to 0.55. Our findings suggest that explicitly structured reasoning substantially mitigates hallucinations and enhances label precision, even without extensive task-specific fine-tuning. This work provides a blueprint for scaling and refining NER in resource-constrained scenarios and paves the way for broader application of reasoning-based LLM strategies to complex information extraction tasks.

References
[1] Zhou ZH, Chawla NV, Jin Y, et al. Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives [Discussion Forum]. IEEE Computational Intelligence Magazine, 2014, 9(4): 62–74. DOI: 10.1109/MCI.2014.2350953.
[2] Jurafsky D, Martin JH. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2000.
[3] Chang Y, Wang X, Wang J, et al. A Survey on Evaluation of Large Language Models. 2023. DOI: 10.48550/arXiv.2307.03109.
[4] Li B, Fang G, Yang Y, et al. Evaluating ChatGPT’s Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness. 2023. DOI: 10.48550/arXiv.2304.11633.
[5] Ma Y, Cao Y, Hong Y, et al. Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples! 2023. DOI: 10.48550/arXiv.2303.08559.
[6] Wan Z, Cheng F, Mao Z, et al. GPT-RE: In-context Learning for Relation Extraction using Large Language Models. 2023. DOI: 10.48550/arXiv.2305.02105.
[7] Sang EF, De Meulder F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003.
[8] Ding N, Xu G, Chen Y, et al. Few-NERD: A Few-shot Named Entity Recognition Dataset. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021(1): 3198–3213. DOI: 10.18653/v1/2021.acl-long.248.
[9] Chiu JP, Nichols E. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 2016, 4: 357–370.
[10] Collobert R, Weston J, Bottou L, et al. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 2011, 12: 2493–2537.
[11] Hammerton J. Named entity recognition with long short-term memory. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003: 172–175.
[12] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019.
[13] Li X, Feng J, Meng Y, et al. A unified MRC framework for named entity recognition. arXiv preprint arXiv:1910.11476, 2019.
[14] Sarzynska-Wawer J, Wawer A, Pawlak A, et al. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Research, 2021, 304: 114135.
[15] Liu AT, Xiao W, Zhu H, et al. QaNER: Prompting Question Answering Models for Few-shot Named Entity Recognition. 2022. DOI: 10.48550/arXiv.2203.01543.
[16] Yan H, Gui T, Dai J, et al. A Unified Generative Framework for Various NER Subtasks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021(1): 5808–5822. DOI: 10.18653/v1/2021.acl-long.451.
[17] Hoffmann J, Borgeaud S, Mensch A, et al. Training Compute-Optimal Large Language Models. 2022.
[18] Kaplan J, McCandlish S, Henighan T, et al. Scaling Laws for Neural Language Models. 2020.
[19] Snell C, Lee J, Xu K, et al. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. 2024.
[20] Welleck S, Bertsch A, Finlayson M, et al. From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models. 2024.
[21] OpenAI. Learning to Reason with LLMs. 2024.
[22] Gao Z, Niu B, He X, et al. Interpretable Contrastive Monte Carlo Tree Search Reasoning. 2024.
[23] Zhang Y, Yang J, Yuan Y, et al. Cumulative Reasoning with Large Language Models. 2024.
[24] Huang Z, Zou H, Li X, et al. O1 Replication Journey – Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? 2024.
[25] Qin Y, Li X, Zou H, et al. O1 Replication Journey: A Strategic Progress Report – Part 1. 2024.
[26] Wang P, Li L, Shao Z, et al. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. 2024.
[27] DeepSeek-AI, Guo D, Yang D, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.
[28] Li C, Xue M, Zhang Z, et al. START: Self-taught Reasoner with Tools. 2025. DOI: 10.48550/arXiv.2503.04625.
[29] Bai J, Bai S, Chu Y, et al. Qwen Technical Report. 2023.