Because parameter S is the estimated rate of spam blogs based on an observation of a data set and the estimated number of spam blogs in step 5, i.
S: rate of spam blogs in all blogs (the rate is defined by the results of sampling from all blogs)
To investigate filtering precision, 20 samples from the blogs determined to be spam (selected from the first day of each month) were checked manually and judged to be spam.
8, mutual detection cannot detect the estimated number of spam blogs (defined by S) from updated blog data, due to the narrow condition of R.
Parameter Adjustment for Change in Number of Spam Blogs
To estimate the number of spam blogs in one day's data, that of August 20th, 2008, 100 blogs were sampled from the data and 14 of them were found to be spam, which gives an estimated spam rate of 14%.
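The sampling estimate above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name is hypothetical, and the standard error is an addition that assumes simple random sampling (the source reports only the point estimate).

```python
import math


def estimate_spam_rate(spam_in_sample: int, sample_size: int) -> tuple[float, float]:
    """Return (estimated rate, standard error) for a manually checked sample.

    The standard error uses the binomial formula and assumes the sample
    was drawn uniformly at random (an assumption, not stated in the text).
    """
    rate = spam_in_sample / sample_size
    std_err = math.sqrt(rate * (1.0 - rate) / sample_size)
    return rate, std_err


# Figures from the text: 14 spam blogs among 100 sampled blogs.
rate, std_err = estimate_spam_rate(14, 100)
print(f"estimated spam rate: {rate:.0%}, standard error: {std_err:.1%}")
```

With a sample of only 100 blogs the estimate is coarse, so any count derived from it should be read as approximate.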
To illustrate the advantages and disadvantages of machine learning, SVM is employed to filter spam blogs in the data set of August 20th, 2008, which was also processed in section 3.
Hence, the estimated number of spam blogs is approximately 4175.
To select words as features of the data for SVM, document frequency (DF) is employed to find which words are effective for filtering spam blogs.
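A minimal sketch of DF-based word selection is given below. The exact selection criterion is not stated in the text, so the `min_df` threshold, the `top_k` cut-off, the function names, and the toy documents are all hypothetical; blogs are assumed to be already tokenized into word lists.

```python
from collections import Counter


def document_frequency(docs: list[list[str]]) -> Counter:
    """Count, for each word, the number of documents that contain it."""
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): a word counts once per document
    return df


def select_features(docs: list[list[str]], min_df: int = 2, top_k=None) -> list[str]:
    """Keep words with DF >= min_df, ordered by DF (ties broken alphabetically)."""
    df = document_frequency(docs)
    kept = [(word, n) for word, n in df.items() if n >= min_df]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in kept[:top_k]]


# Toy, made-up documents for illustration only.
docs = [
    ["cheap", "pills", "buy", "now"],
    ["buy", "cheap", "watches"],
    ["my", "trip", "to", "kyoto"],
    ["buy", "tickets", "to", "kyoto"],
]
features = select_features(docs, min_df=2)
print(features)
```

The selected words would then define the dimensions of the feature vectors passed to the SVM.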
To prepare a training data set for SVM learning that contains 2643 spam and 2643 non-spam blogs, 2643 blogs are randomly selected from the unknown data set of 205,702 blogs and treated as non-spam, although some of the selected blogs may in fact be spam.
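The construction of this balanced training set can be sketched as follows. The blog identifiers, the function name, and the fixed seed are placeholders for illustration; only the set sizes (2643 and 205,702) come from the text.

```python
import random


def build_training_set(spam_ids: list[str], unknown_ids: list[str], seed: int = 0):
    """Pair each known spam blog with a randomly drawn 'non-spam' blog.

    Blogs drawn from the unknown pool are labelled 0 without inspection,
    so a fraction of them may actually be spam (label noise).
    """
    rng = random.Random(seed)
    pseudo_negatives = rng.sample(unknown_ids, k=len(spam_ids))
    labelled = [(b, 1) for b in spam_ids] + [(b, 0) for b in pseudo_negatives]
    rng.shuffle(labelled)
    return labelled


# Placeholder identifiers; the sizes follow the text.
spam_ids = [f"spam-{i}" for i in range(2643)]
unknown_ids = [f"blog-{i}" for i in range(205702)]
training = build_training_set(spam_ids, unknown_ids)
print(len(training))
```

Sampling without replacement keeps the two classes the same size, at the cost of some mislabelled negatives in the training data.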
The estimated number of spam blogs in the test data set is 2089 and that of non-spam blogs is 555, based on the spam distribution in fig.
The results show that 13 of the 20 blogs are spam blogs.