3.2 YouTube Spam Comments (Text Classification)

As an example for text classification, we work with 1956 comments from 5 different YouTube videos. Thankfully, the authors who used this dataset in an article on spam classification made the data freely available (Alberto, Lochter, and Almeida 2015).

The comments were collected via the YouTube API from five of the ten most viewed videos on YouTube in the first half of 2015. All 5 are music videos. One of them is “Gangnam Style” by Korean artist Psy. The other artists were Katy Perry, LMFAO, Eminem, and Shakira.

Check out some of the comments below. The comments were manually labeled as spam or legitimate: spam was coded as “1” and legitimate comments as “0”.

CONTENT | CLASS
Huh, anyway check out this you[tube] channel: kobyoshi02 | 1
Hey guys check out my new channel and our first vid THIS IS US THE MONKEYS!!! I’m the monkey in the white shirt,please leave a like comment and please subscribe!!!! | 1
just for test I have to say murdev.com | 1
me shaking my sexy ass on my channel enjoy ^_^ | 1
watch?v=vtaRGgvGtWQ Check this out . | 1
Hey, check out my new website!! This site is about kids stuff. kidsmediausa . com | 1
Subscribe to my channel | 1
i turned it on mute as soon is i came on i just wanted to check the views… | 0
You should check my channel for Funny VIDEOS!! | 1
and u should.d check my channel and tell me what I should do next! | 1
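The label coding can be made concrete with a short sketch. The Python snippet below is only an illustration, not the book's own code (the book ships an R script): it copies a few rows from the table above into a list of (text, label) pairs and counts the classes. The names `comments`, `LABEL_NAMES`, and `class_counts` are hypothetical and chosen only for this example.

```python
from collections import Counter

# A handful of rows copied from the table above: (comment text, class label).
# Label coding follows the text: 1 = spam, 0 = legitimate.
comments = [
    ("Subscribe to my channel", 1),
    ("just for test I have to say murdev.com", 1),
    ("watch?v=vtaRGgvGtWQ Check this out .", 1),
    ("i turned it on mute as soon is i came on i just wanted to check the views…", 0),
    ("You should check my channel for Funny VIDEOS!!", 1),
]

# Hypothetical helper mapping: numeric code -> readable class name.
LABEL_NAMES = {0: "legitimate", 1: "spam"}


def class_counts(rows):
    """Count how many comments fall into each class, keyed by readable name."""
    return Counter(LABEL_NAMES[label] for _, label in rows)


counts = class_counts(comments)
spam_share = counts["spam"] / len(comments)
print(counts, spam_share)
```

The five rows here are just the excerpt, so the resulting share says nothing about the class balance of the full dataset of 1956 comments.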

You can also go to YouTube and take a look at the comment sections yourself. But please do not get caught in YouTube hell and end up watching videos of monkeys stealing and drinking cocktails from tourists on the beach. Google's spam detector has also probably changed a lot since 2015.

The record-breaking video “Gangnam Style” can be watched on YouTube.

If you want to play around with the data, you can find the RData file, along with an R script containing some convenience functions, in the book’s GitHub repository.


  1. Alberto, Túlio C., Johannes V. Lochter, and Tiago A. Almeida. “TubeSpam: Comment Spam Filtering on YouTube.” In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 138–43. IEEE, 2015.