8, mutual detection cannot detect the estimated number of spam blogs (denoted by S) in the updated blog data, because the condition R is too narrow.
Parameter Adjustment for Change in Number of Spam Blogs
To estimate the number of spam blogs in one day's data (August 20th, 2008), 100 blogs were sampled from the data and 14 of them were found to be spam, which gives an estimated spam rate of 14%.
To illustrate the advantages and disadvantages of machine learning, SVM is employed to filter spam blogs in the data set of August 20th, 2008, which was also processed in Section 3.
Hence, the estimated number of spam blogs is approximately 4175.
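The sample-based estimate (14 spam blogs among 100 sampled blogs, scaled up to the whole data set) can be sketched as follows. This is only an illustration: the size of the one-day data set is not restated here, so `total_blogs` is a hypothetical placeholder, and the normal-approximation interval is an added assumption that shows how coarse an estimate from 100 samples is.

```python
import math


def estimate_spam(sample_size, spam_in_sample, total_blogs):
    """Scale a sample spam rate up to a whole data set.

    total_blogs is a hypothetical input; the source reports only the
    sample counts (14 of 100) and the resulting estimate.
    """
    rate = spam_in_sample / sample_size
    # Multiply before dividing so the scaled count avoids float drift.
    estimated_count = round(total_blogs * spam_in_sample / sample_size)
    return rate, estimated_count


def rate_interval(sample_size, spam_in_sample, z=1.96):
    """Normal-approximation (Wald) interval for the sample spam rate.

    Assumption for illustration only; the source gives no interval.
    """
    p = spam_in_sample / sample_size
    half_width = z * math.sqrt(p * (1.0 - p) / sample_size)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Sample counts are from the text; 30000 is a made-up data set size.
rate, count = estimate_spam(100, 14, total_blogs=30000)
low, high = rate_interval(100, 14)
print(f"rate={rate:.2f}, estimated spam blogs={count}")
print(f"approximate 95% interval for the rate: {low:.3f} to {high:.3f}")
```

With only 100 sampled blogs the interval around the 14% rate is wide, so the estimated count is best read as an order-of-magnitude figure.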
To select the words used as features for SVM, document frequency (DF) is employed to find which words are effective for filtering spam blogs.
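A minimal sketch of DF-based word selection is shown below. The document frequency of a word is the number of documents that contain it at least once. The whitespace tokenizer, the `min_df` threshold, and the function names are assumptions made for illustration; the exact selection criterion is not specified here.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents containing it."""
    df = Counter()
    for doc in docs:
        # set() ensures a word is counted once per document.
        df.update(set(doc.lower().split()))
    return df


def select_features(docs, min_df=2, max_features=None):
    """Keep words whose DF is at least min_df, highest DF first.

    min_df and max_features are hypothetical knobs, not values
    taken from the source.
    """
    df = document_frequency(docs)
    kept = [(word, count) for word, count in df.items() if count >= min_df]
    # Sort by descending DF, then alphabetically for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in kept[:max_features]]


docs = [
    "buy cheap pills",
    "cheap cheap pills online",
    "my trip to kyoto",
]
features = select_features(docs, min_df=2)
print(features)
```

Each blog can then be represented as a vector over the selected words before it is passed to the SVM.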
The training data set for SVM learning contains 2643 spam and 2643 non-spam blogs. The 2643 non-spam blogs are randomly selected from the unknown data set of 205,702 blogs, although some of the selected blogs may in fact be spam.
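The construction of this balanced training set can be sketched as follows, assuming the blogs are held in plain Python lists. The function name, the +1/-1 label convention, and the fixed seed are illustrative assumptions. The negative examples are only presumed non-spam, so the labels carry some noise, as noted above.

```python
import random


def build_training_set(spam_blogs, unknown_blogs, seed=0):
    """Pair known spam blogs with an equal-sized random sample of
    unlabeled blogs that are treated as non-spam.

    Returns a shuffled list of (blog, label) pairs, with label +1 for
    spam and -1 for presumed non-spam.
    """
    rng = random.Random(seed)
    # Sample without replacement, as many blogs as there are spam blogs
    # (2643 from 205,702 in the setting described in the text).
    presumed_non_spam = rng.sample(unknown_blogs, k=len(spam_blogs))
    examples = [(blog, 1) for blog in spam_blogs]
    examples += [(blog, -1) for blog in presumed_non_spam]
    rng.shuffle(examples)
    return examples


# Toy stand-ins for the real data.
spam = [f"spam-{i}" for i in range(5)]
unknown = [f"blog-{i}" for i in range(50)]
training = build_training_set(spam, unknown)
print(len(training), sum(label for _, label in training))
```

Because the negative class is drawn from unlabeled data, any spam blogs that fall into the sample act as mislabeled examples during SVM training.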