7th Iranian Joint Congress on Fuzzy and Intelligent Systems

A hybrid LDA/n-gram approach for Persian Text classification

Paper code: 1242-CFIS (R1)

Authors

Mir Mohsen Pedram (میر محسن پدرام) *¹, Mahnaz Kamran (مهناز کامران)²

¹Kharazmi University

²Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University

Abstract

Classification is one of the most important problems in machine learning, and a variety of algorithms exist for text classification. Most of these algorithms use the bag-of-words model to represent the document collection. The main problems with bag-of-words models are their inability to capture semantic relationships and the very large size of the feature space. In this paper, we use a hybrid model that combines topic modeling and n-grams to build the feature space, and we perform classification on this merged feature space, which reduces the size of the feature space and improves performance compared with earlier work. Naïve Bayes (NB), Support Vector Machines (SVM), Bayes Net, and Random Forest are used for evaluation.
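
The merged feature space described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the LDA settings, the n-gram order, or the selection rule, so the collapsed Gibbs sampler, the word bigrams, the `min_count` threshold, the function names, and the English toy corpus below are all assumptions made for illustration.

```python
import random
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams (joined by a space) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lda_doc_topics(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hyperparameters are assumptions).

    Returns one smoothed topic distribution per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # total words per topic
    assign = []                                         # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    # Resample each token's topic from its conditional distribution.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def hybrid_features(docs, n_topics=2, n=2, min_count=2):
    """Concatenate topic proportions with counts of frequent word n-grams."""
    doc_ngrams = [Counter(ngrams(doc, n)) for doc in docs]
    totals = Counter()
    for counts in doc_ngrams:
        totals.update(counts)
    # Keep only n-grams seen at least min_count times (assumed selection rule).
    kept = sorted(g for g, c in totals.items() if c >= min_count)
    topics = lda_doc_topics(docs, n_topics)
    rows = [
        theta + [float(counts[g]) for g in kept]
        for theta, counts in zip(topics, doc_ngrams)
    ]
    return rows, kept


# Hypothetical toy corpus, already tokenized.
docs = [
    "stock market price rises".split(),
    "stock market price falls".split(),
    "football team wins match".split(),
    "football team loses match".split(),
]
rows, kept = hybrid_features(docs)
```

Each row holds `n_topics` topic proportions followed by one count per retained n-gram, so its length is far smaller than a full bag-of-words vocabulary on a realistic corpus. Rows of this form could then be passed to any of the classifiers named in the abstract.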

Keywords

text classification; Latent Dirichlet Allocation; Web page classification; Topic modeling

Status: Accepted for oral presentation