eprintid: 12369
rev_number: 9
eprint_status: archive
userid: 2
dir: disk0/00/01/23/69
datestamp: 2024-05-30 20:51:04
lastmod: 2024-05-30 20:51:05
status_changed: 2024-05-30 20:51:04
type: article
metadata_visibility: show
creators_name: Khan, Hikmat Ullah
creators_name: Anam, Rimsha
creators_name: Anwar, Muhammad Waqas
creators_name: Jamal, Muhammad Hasan
creators_name: Bajwa, Usama Ijaz
creators_name: Diez, Isabel de la Torre
creators_name: Silva Alvarado, Eduardo René
creators_name: Soriano Flores, Emmanuel
creators_name: Ashraf, Imran
creators_id: 
creators_id: 
creators_id: 
creators_id: 
creators_id: 
creators_id: 
creators_id: eduardo.silva@funiber.org
creators_id: emmanuel.soriano@uneatlantico.es
creators_id: 
title: A deep learning approach for Named Entity Recognition in Urdu language
ispublished: pub
subjects: uneat_eng
divisions: uneatlantico_produccion_cientifica
divisions: unincol_revistas_cientificas
divisions: uninimx_produccion_cientifica
divisions: uninipr_produccion_cientifica
divisions: unic_produccion_cientifica
full_text_status: public
abstract: Named Entity Recognition (NER) is a natural language processing task that has been widely explored for many languages over the past decade but remains under-researched for Urdu because of the language's rich morphology and linguistic complexity. Existing state-of-the-art studies on Urdu NER apply various deep learning approaches with automatic feature selection based on word embeddings. This paper presents a deep learning approach for Urdu NER that uses FastText and Floret word embeddings to capture contextual information from each word's surrounding context, improving feature extraction. Publicly available pre-trained FastText and Floret embeddings for Urdu are used to generate feature vectors for four benchmark Urdu datasets. These features are then used as input to train various combinations of Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Conditional Random Field (CRF) models. The results show that the proposed approach significantly outperforms existing state-of-the-art studies on Urdu NER, achieving an F-score of up to 0.98 when using BiLSTM+GRU with Floret embeddings. Error analysis shows a low classification error rate, ranging from 1.24% to 3.63% across the datasets, which indicates the robustness of the proposed approach.
date: 2024-03
publication: PLOS ONE
volume: 19
number: 3
pagerange: e0300725
id_number: doi:10.1371/journal.pone.0300725
refereed: TRUE
issn: 1932-6203
official_url: http://doi.org/10.1371/journal.pone.0300725
access: open
language: en
citation: Khan, Hikmat Ullah; Anam, Rimsha; Anwar, Muhammad Waqas; Jamal, Muhammad Hasan; Bajwa, Usama Ijaz; Diez, Isabel de la Torre; Silva Alvarado, Eduardo René; Soriano Flores, Emmanuel and Ashraf, Imran (2024) A deep learning approach for Named Entity Recognition in Urdu language. PLOS ONE, 19 (3). e0300725. ISSN 1932-6203 <http://repositorio.uneatlantico.es/id/eprint/12369/1/journal.pone.0300725.pdf>
document_url: http://repositorio.uneatlantico.es/id/eprint/12369/1/journal.pone.0300725.pdf