YouTube Spam Collection v. 1
The YouTube Spam Collection v. 1 is a public set of YouTube comments collected for spam research. It comprises five datasets containing a total of 1,956 real, non-encoded messages, each labeled as legitimate (ham) or spam.
The corpus was collected using the YouTube Data API v3.
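As a rough illustration of what such a collection step can look like, the sketch below builds a request URL for the `commentThreads` endpoint of the YouTube Data API v3 and pulls the comment ID, author, date and text out of one page of the JSON response. This is not necessarily how we gathered the corpus: the function names, the output keys and the placeholder video ID and API key are all hypothetical, and the actual network call and paging loop are left out.

```python
from urllib.parse import urlencode

# Endpoint of the YouTube Data API v3 that lists top-level comment threads.
API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def build_request_url(video_id, api_key, page_token=None):
    """Build the URL for one page of comment threads (hypothetical helper)."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,          # largest page size the endpoint accepts
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        # Token returned as "nextPageToken" in the previous response page.
        params["pageToken"] = page_token
    return API_URL + "?" + urlencode(params)


def extract_comments(response):
    """Turn one decoded JSON response page into a list of flat records.

    The output keys (comment_id, author, date, content) are assumptions
    chosen for this sketch, not the column names of the corpus files.
    """
    records = []
    for item in response.get("items", []):
        top = item["snippet"]["topLevelComment"]
        snippet = top["snippet"]
        records.append({
            "comment_id": top["id"],
            "author": snippet["authorDisplayName"],
            "date": snippet["publishedAt"],
            "content": snippet["textDisplay"],
        })
    return records


# "VIDEO_ID" and "API_KEY" are placeholders, not real values.
url = build_request_url("VIDEO_ID", "API_KEY")
```

To fetch every comment of a video, one would request the URL, decode the JSON body, pass it to `extract_comments`, and repeat with the `nextPageToken` value until the response no longer contains one.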
The samples were extracted from the comments sections of five videos that were among the 10 most viewed on YouTube during the collection period. The table below lists, for each dataset, the YouTube video ID, the number of samples in each class and the total number of samples.
| Dataset | YouTube ID | # Spam | # Ham | Total | Link |
Note: the chronological order of the comments was kept.
The collection consists of one CSV file per dataset, where each line has the following attributes:
We offer one example below:
Nora,2013-11-28T19:52:35,please like :D
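As a minimal sketch, a line such as the one above can be read with Python's standard `csv` module. The field names `author`, `date` and `content` are an assumption based on the three values visible in the example; the actual files may contain additional columns, in which case `FIELDS` would need to be adjusted.

```python
import csv
from datetime import datetime
from io import StringIO

# Assumed field names, inferred from the three values in the example line.
FIELDS = ["author", "date", "content"]


def parse_line(line):
    """Parse one CSV line into a dict and convert the date to a datetime."""
    # csv.reader handles quoted fields, so commas inside a comment are safe.
    values = next(csv.reader(StringIO(line)))
    record = dict(zip(FIELDS, values))
    record["date"] = datetime.fromisoformat(record["date"])
    return record


record = parse_line("Nora,2013-11-28T19:52:35,please like :D")
print(record["author"], record["date"].year, record["content"])
```

Using `csv.reader` rather than splitting on commas matters here because comment text can itself contain commas, which the CSV format protects with quotes.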
We would appreciate it if you:
- Cite the publications listed below and the web page http://dcomp.sor.ufscar.br/talmeida/youtubespamcollection/ if you find this collection useful.
- Send us a message at either talmeida < AT > ufscar.br or tuliocasagrande < AT > acm.org if you make use of the corpus.
Publication and More Information
We offer a comprehensive study of this corpus in the following papers, which present statistics on the collection, several studies and baseline results for many traditional machine learning methods.
Alberto, T.C., Lochter, J.V., Almeida, T.A. Filtragem Automática de Spam nos Comentários do YouTube [Automatic Spam Filtering in YouTube Comments]. Anais do XII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC'15), Natal, RN, Brazil, 2015. (preprint)
Alberto, T.C., Lochter, J.V., Almeida, T.A. TubeSpam: Comment Spam Filtering on YouTube. Proceedings of the 14th IEEE International Conference on Machine Learning and Applications (ICMLA'15), 1-6, Miami, FL, USA, December, 2015. (preprint)
The YouTube Spam Collection was created by Tiago A. Almeida and Tulio C. Alberto.
© Tiago A. Almeida and Tulio C. Alberto, 2015.