TOWARDS FOUNDATION MODELS FOR LIDAR SEMANTIC SEGMENTATION IN AUTONOMOUS DRIVING

YiFan Zhao; ZiWei Huang

doi:10.61784/ejst3146

Authors

YiFan Zhao (Corresponding Author), School of Information Science and Engineering, Hunan Institute of Engineering, Xiangtan 411100, Hunan, China.
ZiWei Huang, School of Information Science and Engineering, Hunan Institute of Engineering, Xiangtan 411100, Hunan, China.

Keywords:

LiDAR semantic segmentation, Foundation models, Autonomous driving, Point cloud, Self-supervised learning

Abstract

LiDAR semantic segmentation (LSS), which assigns a semantic label to each point in a 3D scan, is a core perception task in autonomous driving. Over the past decade, fully supervised methods have achieved remarkable progress, with benchmark performance on SemanticKITTI improving from 14.6 mIoU in 2017 to over 75 mIoU in recent state-of-the-art models. Despite these advances, conventional supervised paradigms remain constrained by three fundamental limitations: dependence on large-scale dense annotations, restricted closed-set semantic understanding, and limited robustness under domain shifts and adverse environments. Recent advances in foundation models (including vision-language models, self-supervised pretraining frameworks, and segmentation foundation models) have opened a new direction for LiDAR perception by enabling transferable, label-efficient, and open-vocabulary 3D understanding. Motivated by this paradigm shift, this survey provides a systematic review of LiDAR semantic segmentation from supervised learning to foundation-model-driven approaches. We organize existing methods into five representative paradigms: cross-modal 2D-to-3D feature distillation, Segment Anything Model (SAM)-guided segmentation, open-vocabulary vision-language learning, LiDAR-specific self-supervised pretraining, and generalized 3D foundation models. Beyond taxonomy and benchmark comparison on SemanticKITTI and nuScenes, we further examine practical deployment factors (including inference latency, edge-device efficiency, robustness in adverse weather, and multimodal sensor fusion) that remain insufficiently captured by standard evaluation protocols. Finally, we identify six open research challenges and argue that the field is undergoing a fundamental transition: from adapting 2D foundation priors to developing native 3D LiDAR foundation models for autonomous driving.

References

[1] Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. CVPR, 2012: 3354–3361.

[2] Behley J, Garbade M, Milioto A, et al. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. ICCV, 2019.

[3] Caesar H, Bankiti V, Lang A H, et al. nuScenes: A multimodal dataset for autonomous driving. CVPR, 2020: 11618–11628.

[4] Fong W K, Mohan R, Hurtado J V, et al. Panoptic nuScenes: A large-scale benchmark for LiDAR panoptic segmentation and tracking. IEEE Robotics and Automation Letters, 2022, 7(2): 3795–3802.

[5] Sun P, Kretzschmar H, Dotiwalla X, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. CVPR, 2020: 2446–2454.

[6] Mao J, Niu M, Jiang C, et al. One million scenes for autonomous driving: ONCE dataset. NeurIPS, 2021.

[7] Qi C R, Su H, Mo K, et al. PointNet: Deep learning on point sets for 3D classification and segmentation. CVPR, 2017: 652–660.

[8] Qi C R, Yi L, Su H, et al. PointNet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS, 2017.

[9] Thomas H, Qi C R, Deschaud J E, et al. KPConv: Flexible and deformable convolution for point clouds. ICCV, 2019: 6410–6419.

[10] Hu Q, Yang B, Xie L, et al. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. CVPR, 2020: 11108–11117.

[11] Choy C, Gwak J, Savarese S. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. CVPR, 2019: 3075–3084.

[12] Zhu X, Zhou H, Wang T, et al. Cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation. CVPR, 2021: 9939–9948.

[13] Tang H, Liu Z, Zhao S, et al. Searching efficient 3D architectures with sparse point-voxel convolution. ECCV, 2020.

[14] Lai X, Liu J, Jiang L, et al. Stratified Transformer for 3D point cloud segmentation. CVPR, 2022.

[15] Lai X, Chen Y, Lu F, et al. Spherical Transformer for LiDAR-based 3D recognition. CVPR, 2023.

[16] Wu X, Jiang L, Wang P S, et al. Point Transformer V3: Simpler, faster, stronger. CVPR, 2024: 4840–4851.

[17] Zhang W, Song H, Zhang Z, et al. From sparse semantics to rich instances: Empowering label-efficient LiDAR panoptic segmentation via geometric priors. Neural Networks, 2026, 200: 108767.

[18] Xiao A, Huang J, Xuan W, et al. 3D semantic segmentation in the wild: Learning generalized models for adverse-condition point clouds. CVPR, 2023.

[19] Xiao A, Huang J, Guan D, et al. Transfer learning from synthetic to real LiDAR point cloud for semantic segmentation. AAAI, 2022.

[20] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision (CLIP). ICML, 2021.

[21] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.

[22] He K, Chen X, Xie S, et al. Masked autoencoders are scalable vision learners. CVPR, 2022.

[23] Kirillov A, Mintun E, Ravi N, et al. Segment Anything. ICCV, 2023.

[24] Ravi N, Gabeur V, Hu Y T, et al. SAM 2: Segment Anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

[25] Oquab M, Darcet T, Moutakanni T, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.

[26] Sautier C, Puy G, Gidaris S, et al. Image-to-Lidar self-supervised distillation for autonomous driving data. CVPR, 2022: 9891–9901.

[27] Xie S, Gu J, Guo D, et al. PointContrast: Unsupervised pre-training for 3D point cloud understanding. ECCV, 2020.

[28] Liu Y, Kong L, Cen J, et al. Segment any point cloud sequences by distilling vision foundation models. NeurIPS, 2023.

[29] Ošep A, Meinhardt T, Ferroni F, et al. Better call SAL: Towards learning to Segment Anything in LiDAR. ECCV, 2024.

[30] Xu J, Wang S, Ni Z, et al. SAM4D: Segment Anything in camera and LiDAR streams. ICCV, 2025.

[31] Thengane V, Zhu X, Bouzerdoum S, et al. Foundational models for 3D point clouds: A survey and outlook. arXiv preprint arXiv:2501.18594, 2025.

[32] Sathyam R, Li Y. Foundation models for autonomous driving perception: A survey through core capabilities. arXiv preprint arXiv:2509.08302, 2025.

[33] Milioto A, Vizzo I, Behley J, et al. RangeNet++: Fast and accurate LiDAR semantic segmentation. IROS, 2019: 4213–4220.

[34] Cortinhal T, Tzelepis G, Aksoy E E. SalsaNext: Fast, uncertainty-aware semantic segmentation of LiDAR point clouds. ISVC, 2020.

[35] Cheng H, Han X, Xiao G. CENet: Toward concise and efficient LiDAR semantic segmentation for autonomous driving. ICME, 2022.

[36] Sautier C, Puy G, Boulch A, et al. BEVContrast: Self-supervision in BEV space for automotive Lidar data. 3DV, 2024.

[37] Lin Z, Wang Y, Qi S, et al. BEV-MAE: Bird's eye view masked autoencoders for point cloud pre-training in autonomous driving scenarios. AAAI, 2024.

[38] Liu Z, Yang X, Tang H, et al. FlatFormer: Flattened window attention for efficient point cloud transformer. CVPR, 2023.

[39] Unal O, Dai D, Van Gool L. Scribble-supervised LiDAR semantic segmentation. CVPR, 2022.

[40] Liu J, Zhang T, Sun J, et al. You only click once: Single point weakly supervised 3D instance segmentation for autonomous driving. arXiv preprint arXiv:2502.19698, 2025.

[41] Lin L, Yang J, Zhao X, et al. MWSIS: Multimodal weakly supervised instance segmentation with 2D box annotations for autonomous driving. AAAI, 2024.

[42] Hess G, Jaxing J, Svensson E, et al. Masked autoencoder for self-supervised pre-training on LiDAR point clouds. WACV, 2023.

[43] Krispel G, Schinagl D, Fruhwirth-Reisinger C, et al. MAELi: Masked autoencoder for large-scale LiDAR point clouds. WACV, 2024.

[44] Min C, Xu X, Zhao D, et al. Occupancy-MAE: Self-supervised pre-training large-scale LiDAR point clouds with masked occupancy autoencoders. IEEE Transactions on Intelligent Vehicles, 2023.

[45] Wu Y, Zhang M, Cui J, et al. Fine-grained image-to-LiDAR contrastive distillation with visual foundation models. NeurIPS, 2024.

[46] Zhang Y, Wu X, Lao Y, et al. Concerto: Joint 2D-3D self-supervised learning emerges spatial representations. NeurIPS, 2025.

[47] Zhao H, Jiang L, Jia J, et al. Point Transformer. ICCV, 2021.

[48] Wu X, Lao Y, Jiang L, et al. Point Transformer V2: Grouped vector attention and partition-based pooling. NeurIPS, 2022.

[49] Cheng R, Razani R, Taghavi E, et al. (AF)2-S3Net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. CVPR, 2021.

[50] Wang P S. OctFormer: Octree-based transformers for 3D point clouds. ACM Transactions on Graphics, 2023, 42(4).

[51] Yang Y, Yang YQ, Wang X, et al. Swin3D: A pretrained transformer backbone for 3D indoor scene understanding. NeurIPS, 2023.

[52] Li J, Dai H, Han H, et al. MSeg3D: Multi-modal 3D semantic segmentation for autonomous driving. CVPR, 2023.

[53] Liu Y, Chen R, Li X, et al. UniSeg: A unified multi-modal LiDAR segmentation network and the OpenPCSeg codebase. ICCV, 2023.

[54] Zhang Z, Girdhar R, Joulin A, et al. Self-supervised pretraining of 3D features on any point-cloud. ICCV, 2021.

[55] Liu Y C, Huang Y K, Chiang H Y, et al. Learning from 2D: Contrastive pixel-to-point knowledge transfer for 3D pretraining. arXiv preprint arXiv:2104.04687, 2021.

[56] Mahmoud A, Hu J S K, Kuai T, et al. Self-supervised image-to-point distillation via semantically tolerant contrastive loss. CVPR, 2023.

[57] Zhang Y, Hou J. Is contrastive distillation enough for learning comprehensive 3D representations? NeurIPS, 2024.

[58] Xu X, Kong L, Shuai H, et al. 4D contrastive superflows are dense 3D representation learners. ECCV, 2024.

[59] Yang Y, Wu X, He T, et al. SAM3D: Segment Anything in 3D scenes. arXiv preprint arXiv:2306.03908, 2023.

[60] Kühn P J, Nguyen D A, Kuijper A, et al. RangeSAM: On the potential of visual foundation models for range-view represented LiDAR segmentation. arXiv preprint arXiv:2509.15886, 2025.

[61] Yan W, Qian Y, Wang C, et al. SAM4UDASS: When SAM meets unsupervised domain adaptive semantic segmentation in intelligent vehicles. IEEE Transactions on Intelligent Vehicles, 2024.

[62] Peng S, Genova K, Jiang C, et al. OpenScene: 3D scene understanding with open vocabularies. CVPR, 2023.

[63] Chen R, Liu Y, Kong L, et al. CLIP2Scene: Towards label-efficient 3D scene understanding by CLIP. CVPR, 2023.

[64] Ding R, Yang J, Xue C, et al. PLA: Language-driven open-vocabulary 3D scene understanding. CVPR, 2023.

[65] Yang J, Ding R, Deng W, et al. RegionPLC: Regional point-language contrastive learning for open-world 3D scene understanding. CVPR, 2024.

[66] Ding R, Yang J, Xue C, et al. Lowis3D: Language-driven open-world instance-level 3D scene understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

[67] Samet N, Puy G, Marlet R. LOSC: LiDAR Open-vocabulary Segmentation Consolidator. arXiv preprint arXiv:2507.07605, 2025.

[68] Wang P, Wang Y, Li S, et al. Open vocabulary 3D scene understanding via geometry guided self-distillation. ECCV, 2024.

[69] He Q, Peng J, Jiang Z, et al. UniM-OV3D: Uni-modality open-vocabulary 3D scene understanding with fine-grained feature representation. IJCAI, 2024.

[70] Chang F, Li S, Li Y, et al. VLM-3D: End-to-end vision-language models for open-world 3D perception. arXiv preprint arXiv:2508.09061, 2025.

[71] Chen Y, Xu Z, Huang X, et al. Weakly supervised LiDAR semantic segmentation via scatter image annotation. IEEE Transactions on Multimedia, 2025, 27: 4121–4136. DOI: 10.1109/TMM.2025.3535350.

[72] Zhang C, Yan J, Wei Y, et al. OccNeRF: Advancing 3D occupancy prediction in LiDAR-free environments. arXiv preprint arXiv:2312.09243, 2023.

[73] Nunes L, Wiesmann L, Marcuzzi R, et al. Temporal consistent 3D LiDAR representation learning for semantic perception in autonomous driving. CVPR, 2023: 21674–21683.

[74] Yu X, Tang L, Rao Y, et al. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. CVPR, 2022: 19313–19322.

[75] Pang Y, Wang W, Tay F E H, et al. Masked autoencoders for point cloud self-supervised learning. ECCV, 2022.

[76] Zhang R, Guo Z, Fang R, et al. Point-M2AE: Multi-scale masked autoencoders for hierarchical point cloud pre-training. NeurIPS, 2022.

[77] Xue L, Gao M, Xing C, et al. ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding. CVPR, 2023.

[78] Xue L, Yu N, Zhang S, et al. ULIP-2: Towards scalable multimodal pre-training for 3D understanding. CVPR, 2024.

[79] Wu X, DeTone D, Frost D, et al. Sonata: Self-supervised learning of reliable point representations. CVPR, 2025.

[80] Liu Y, Kong L, Wu X, et al. Multi-space alignments towards universal LiDAR segmentation. CVPR, 2024.

[81] Haidar S, Chariot A, Darouich M, et al. Are we ready for real-time LiDAR semantic segmentation in autonomous driving? arXiv preprint arXiv:2410.08365, 2024.

[82] Park J, Kim K, Shim H. Rethinking data augmentation for robust LiDAR semantic segmentation in adverse weather. ECCV, 2024.

[83] Zhao H, Zhang J, Chen Z, et al. UniMix: Towards domain-adaptive and generalizable LiDAR semantic segmentation in adverse weather. CVPR, 2024.

[84] Park J, Lee H, Kang I, et al. No thing, nothing: Highlighting safety-critical classes for robust LiDAR semantic segmentation in adverse weather. arXiv preprint arXiv:2503.15910, 2025.

[85] Yang L, Zhang L, Liu J, et al. Towards generalised range-view LiDAR segmentation in adverse weather. arXiv preprint arXiv:2506.08979, 2025.

[86] Dreißig M, Scheuble D, Piewak F, et al. Survey on LiDAR perception in adverse weather conditions. IEEE Intelligent Vehicles Symposium, 2023.

[87] Saltori C, Galasso F, Fiameni G, et al. CoSMix: Compositional semantic mix for domain adaptation in 3D LiDAR segmentation. ECCV, 2022.

[88] Saltori C, Krivosheev E, Lathuilière S, et al. GIPSO: Geometrically informed propagation for online adaptation in 3D LiDAR segmentation. ECCV, 2022.

[89] Saltori C, Ošep A, Ricci E, et al. Walking your LiDOG: A journey through multiple domains for LiDAR semantic segmentation. ICCV, 2023.

[90] Lyu Y, Jiang G, Liu H, et al. ALISE: Annotation-free LiDAR instance segmentation for autonomous driving. arXiv preprint arXiv:2510.05752, 2025.

[91] Camuffo E, Milani S. Continual learning for LiDAR semantic segmentation: Class-incremental and coarse-to-fine strategies on sparse data. CVPR Workshops, 2023.
