3.2 YouTube Spam Comments (Text Classification)
As an example of text classification, we will use 1956 comments from 5 different YouTube videos. Thankfully, the authors who used this dataset in an article about spam classification made the data freely available (Alberto, Lochter, and Almeida 2015).
The comments were collected through the YouTube API from five of the ten most viewed videos on YouTube in the first half of 2015. All five are music videos. One of them is “Gangnam Style” by Korean artist Psy. The other artists are Katy Perry, LMFAO, Eminem, and Shakira.
Below are some of the comments. They were manually labeled as spam or legitimate. Spam is coded as ‘1’ and legitimate comments as ‘0’.
|Comment|Class|
|---|---|
|Huh, anyway check out this you[tube] channel: kobyoshi02|1|
|Hey guys check out my new channel and our first vid THIS IS US THE MONKEYS!!! I’m the monkey in the white shirt,please leave a like comment and please subscribe!!!!|1|
|just for test I have to say murdev.com|1|
|me shaking my sexy ass on my channel enjoy ^_^|1|
|watch?v=vtaRGgvGtWQ Check this out .|1|
|Hey, check out my new website!! This site is about kids stuff. kidsmediausa . com|1|
|Subscribe to my channel|1|
|i turned it on mute as soon is i came on i just wanted to check the views…|0|
|You should check my channel for Funny VIDEOS!!|1|
|and u should.d check my channel and tell me what I should do next!|1|
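To make the label coding concrete, here is a minimal sketch of how comments labeled this way could be loaded and tallied. It assumes a CSV layout with a `CONTENT` column for the comment text and a `CLASS` column for the label. Those column names, the helper functions `load_comments` and `class_counts`, and the small inline sample (a few of the comments shown above, not the full dataset) are assumptions for illustration only.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV layout: CONTENT holds the comment text, CLASS holds the
# label (1 = spam, 0 = legitimate). The rows are a handful of the example
# comments from this section, not the full dataset.
SAMPLE_CSV = """CONTENT,CLASS
Subscribe to my channel,1
"Hey, check out my new website!! This site is about kids stuff. kidsmediausa . com",1
i turned it on mute as soon is i came on i just wanted to check the views…,0
You should check my channel for Funny VIDEOS!!,1
"""


def load_comments(csv_text):
    """Parse CSV text into a list of (comment, label) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["CONTENT"], int(row["CLASS"])) for row in reader]


def class_counts(comments):
    """Count how many comments carry each label."""
    return Counter(label for _, label in comments)


comments = load_comments(SAMPLE_CSV)
counts = class_counts(comments)
print(f"spam: {counts[1]}, legitimate: {counts[0]}")
```

For real data, the same parsing could be pointed at a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory string, provided the file actually uses these column names.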
You can also go over to YouTube and have a look at the comment section. But please don’t get trapped in YouTube hell, ending up watching videos about monkeys stealing and drinking cocktails from tourists on the beach. Also, Google’s spam detector has probably changed a lot since 2015.
Alberto, Túlio C., Johannes V. Lochter, and Tiago A. Almeida. 2015. “TubeSpam: Comment Spam Filtering on YouTube.” In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 138–43. IEEE.