A hybrid LDA/n-gram approach for Persian text classification
Paper code: 1242-CFIS (R1)
Mir Mohsen Pedram *1, Mahnaz Kamran2
1 Kharazmi University
2 Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University
Abstract:
Classification is one of the most important problems in machine learning, and many classification algorithms have been applied to text. Most of these algorithms use a Bag of Words representation of the document collection. The main problems with Bag of Words models are their inability to recognize semantic relationships and the huge size of the feature space. In this paper, we use a hybrid model that combines topic modeling and n-grams to build the feature space, and we apply classification on this merged feature space, which reduces the size of the feature space and improves performance in comparison to earlier works. Naïve Bayes (NB), Support Vector Machines (SVM), Bayes Net, and Random Forest are used for evaluation.
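The merged feature space described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the topic model is fitted, whether the n-grams are word or character n-grams, or how they are selected. The sketch assumes a collapsed Gibbs sampler for LDA, word bigrams, and a frequency-based cut-off; the function names `word_ngrams`, `lda_gibbs`, and `hybrid_features` and all parameter values are hypothetical. Documents are assumed to be already tokenized (Persian text would need its own tokenizer and normalizer first).

```python
import random
from collections import Counter


def word_ngrams(tokens, n=2):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]              # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                            # total tokens per topic
    assignments = []                                          # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed topic proportions; each row sums to 1.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return theta


def hybrid_features(docs, num_topics=2, top_ngrams=5, n=2):
    """Concatenate LDA topic proportions with counts of the most frequent n-grams."""
    theta = lda_gibbs(docs, num_topics)
    corpus_counts = Counter(g for doc in docs for g in word_ngrams(doc, n))
    kept = [g for g, _ in corpus_counts.most_common(top_ngrams)]  # assumed selection rule
    rows = []
    for doc, topics in zip(docs, theta):
        doc_counts = Counter(word_ngrams(doc, n))
        rows.append(topics + [doc_counts[g] for g in kept])
    return rows, kept


if __name__ == "__main__":
    # Toy, already-tokenized documents (illustrative only).
    docs = [
        ["stock", "market", "price", "rise"],
        ["football", "match", "goal", "score"],
        ["stock", "market", "price", "fall"],
        ["football", "match", "goal", "win"],
    ]
    rows, kept = hybrid_features(docs, num_topics=2, top_ngrams=3)
    print(kept)
    for row in rows:
        print(row)
```

Each row has `num_topics + len(kept)` columns, which is far smaller than a full Bag of Words vocabulary. Rows of this form could then be passed to any of the classifiers named in the abstract.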
Keywords:
text classification; Latent Dirichlet Allocation; Web page classification; Topic modeling
Status: The paper has been accepted for oral presentation