Article abstract, STUDIA UNIVERSITATIS BABEŞ-BOLYAI
STUDIA INFORMATICA - Issue no. 2, 2013
Article: TEXT REPRESENTATION AND GENERAL TOPIC ANNOTATION BASED ON LATENT DIRICHLET ALLOCATION. Authors: DIANA INKPEN.
Abstract: We propose a low-dimensional text representation method for topic classification. A Latent Dirichlet Allocation (LDA) model is built on a large amount of unlabelled data in order to extract potential topic clusters. Each document is represented as a distribution over these clusters. We experiment with two datasets. We collected the first dataset from the FriendFeed social network and manually annotated part of it with 10 general classes. The second dataset is a standard text classification benchmark, the R8 subset of Reuters 21578 (annotated with 8 classes). We show that classification based on the LDA representation leads to acceptable results, while combining a bag-of-words representation with the LDA representation leads to further improvements. We also propose a multi-level LDA representation that captures topic cluster distributions ranging from generic to more specific ones.

2010 Mathematics Subject Classification. 62Fxx Parametric inference, 62Pxx Applications.

1998 CR Categories and Descriptors. I.2.7 [Natural Language Processing]: Text analysis; H.3.1 [Content Analysis and Indexing]: Linguistic processing.

Key words and phrases. automatic text classification, topic detection, latent Dirichlet allocation.

This paper was presented at the International Conference KEPT2013: Knowledge Engineering Principles and Techniques, organized by Babeș-Bolyai University, Cluj-Napoca, July 5-7, 2013.
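The representation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the LDA model is estimated, so the collapsed Gibbs sampler below is an assumption chosen only because it fits in a few lines of standard-library Python. The function names (`fit_lda`, `bag_of_words`, `combined_features`), the hyperparameters `alpha`, `beta` and `iters`, and the toy documents are all hypothetical. For brevity the sketch fits the model on the same documents it represents, whereas the abstract describes building the model on separate unlabelled data.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, vocab, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (an assumed estimator, not the paper's).

    Returns one topic distribution per document.
    """
    rng = random.Random(seed)
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is in topic k
    topic_total = [0] * num_topics                             # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions (each row sums to 1).
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return theta


def bag_of_words(doc, vocab):
    """Raw term counts over a fixed vocabulary."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]


def combined_features(doc, theta_d, vocab):
    """Concatenate the bag-of-words vector with the LDA topic distribution."""
    return bag_of_words(doc, vocab) + theta_d


# Toy corpus (hypothetical, for illustration only).
docs = [
    "stock market shares price".split(),
    "market price trade shares".split(),
    "football match goal team".split(),
    "team goal league match".split(),
]
vocab = sorted({w for doc in docs for w in doc})
theta = fit_lda(docs, num_topics=2, vocab=vocab)
features = [combined_features(doc, t, vocab) for doc, t in zip(docs, theta)]
```

Each row of `theta` is the low-dimensional representation of one document, and each row of `features` is the combined bag-of-words plus LDA vector that could be passed to any standard classifier. The classifier itself is left out because the abstract does not name one.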