What is spam?
Spam is unsolicited bulk messaging, usually sent through an electronic messaging system. The most widely recognized form is email spam, although the term also covers abuse of other media such as instant messaging, web search, and Internet forums. The primary objective of spam is to advertise indiscriminately to a large audience. [1]
How does it work?
Spam is usually sent by a “spammer”, a third party that provides unsolicited email distribution services to companies that want to advertise through email.[2] Spammers usually find their targets in the following ways:
- Newsgroups and chat rooms: Some users use their email addresses as screen names.
- The Web: Specialty search engines can be built to capture email addresses from the web simply by looking for the telltale “@” sign.
- Online subscriptions: Some online newsletters or promotions may sell subscribers' email addresses to spammers.[3]
What is spam filtering?
Spam filtering is the processing of messages according to specific criteria. When spam filtering is enabled, each message is evaluated based on its headers and content. The message may pass through unchanged for delivery to the user’s mailbox, be redirected for delivery elsewhere, or even be discarded.[4]
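The three outcomes described above (deliver, redirect, discard) can be sketched in a few lines of Python. This is only an illustration: the blocked sender, the suspicious phrases, and the function name `filter_message` are hypothetical and do not come from any real filter.

```python
from email import message_from_string

# Hypothetical rule set, for illustration only.
BLOCKED_SENDERS = {"promo@bulk-mailer.example"}
SUSPICIOUS_PHRASES = ("free money", "act now")


def filter_message(raw: str) -> str:
    """Return 'discard', 'redirect', or 'deliver' for one raw message."""
    msg = message_from_string(raw)
    sender = (msg.get("From") or "").strip().lower()
    subject = (msg.get("Subject") or "").lower()
    body = msg.get_payload()
    # Multipart payloads are lists; this sketch only inspects plain-text bodies.
    text = subject + " " + (body.lower() if isinstance(body, str) else "")

    if sender in BLOCKED_SENDERS:                    # header-based rule
        return "discard"
    if any(p in text for p in SUSPICIOUS_PHRASES):   # content-based rule
        return "redirect"                            # e.g. to a junk folder
    return "deliver"


ham = "From: alice@example.org\nSubject: Lunch?\n\nAre you free at noon?\n"
spam = "From: bob@example.org\nSubject: Hi\n\nFREE MONEY inside, act now!\n"
junk = "From: promo@bulk-mailer.example\nSubject: Sale\n\nBig sale today.\n"

for raw in (ham, spam, junk):
    print(filter_message(raw))
```

Real filters combine many more signals, but the control flow is the same: evaluate headers and content, then choose one of the three actions.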
How does spam filtering work?
- Personalized filters: You set your own rules: the filter asks you to identify whether emails from a specified source or with certain content are unsolicited.
- Header filters: Filters assess the contents of the email header. Emails with empty or unusually long headers might be spam.
- Language filters: Filters identify any email that is not in your language and filter it out.[5]
- Bayesian analysis filters: Filters assess the email content, looking for keywords and phrases that indicate spam.
- Collaborative filters: Companies such as Google and Yahoo build databases from the collective data of a group of users to identify spam.
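As a rough illustration of the collaborative idea in the list above, the sketch below pools spam reports from several users and flags a message once enough distinct users have reported the same content. The class name, the fingerprinting scheme, and the threshold are assumptions made for this example. They do not describe how Google or Yahoo actually implement their filters.

```python
import hashlib
from collections import defaultdict


class CollaborativeFilter:
    """Toy shared database of spam reports (hypothetical design)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.reports = defaultdict(set)  # fingerprint -> set of reporting user ids

    @staticmethod
    def fingerprint(body: str) -> str:
        # Normalize case and whitespace so trivial variations hash identically.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, user_id: str, body: str) -> None:
        # Using a set means repeated reports from one user count only once.
        self.reports[self.fingerprint(body)].add(user_id)

    def is_spam(self, body: str) -> bool:
        reporters = self.reports.get(self.fingerprint(body), ())
        return len(reporters) >= self.threshold


cf = CollaborativeFilter(threshold=2)
cf.report("user1", "Buy   NOW")
cf.report("user1", "buy now")       # same user again: still one reporter
first_check = cf.is_spam("BUY now")
cf.report("user2", "buy now")
second_check = cf.is_spam("BUY now")
print(first_check, second_check)
```

An exact hash only matches near-identical messages. A production system would need a fuzzier fingerprint to catch reworded copies.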
What is the relationship with Data Science?
Several machine learning and information retrieval techniques have been used to build spam filtering classifiers. Some of the underlying theories have already been mentioned in our Data Mining class:
- Bayesian Classifier: The idea is to use Bayes’ rule to compute the probability that a message is spam given that it contains certain words.
- K Nearest Neighbor Classifier: The idea is to classify a message according to the classes of its nearest neighbors in the training set.
- Support Vector Machine (SVM) Classification: Since email spam filtering is essentially a binary classification problem, we can also use SVM, which is now one of the most widely used machine learning techniques. The idea is to find a linear separation boundary that classifies the training examples correctly with the largest possible margin.
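To make the Bayesian classifier above concrete, here is a minimal naive Bayes sketch written from scratch. It assumes words are conditionally independent given the class, uses add-one (Laplace) smoothing, and needs at least one training message per class. The class name, the tokenizer, and the tiny training set are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayesSpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """P(spam | words) via Bayes' rule, computed in log space."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Log prior: fraction of training messages with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the product.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            log_score[label] = score
        # Normalize the two joint scores into a posterior probability.
        top = max(log_score.values())
        weights = {k: math.exp(v - top) for k, v in log_score.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


nb = NaiveBayesSpamFilter()
nb.train("win free money now", "spam")
nb.train("free prize click now", "spam")
nb.train("meeting schedule for monday", "ham")
nb.train("lunch on monday", "ham")

print(nb.spam_probability("free money"))
print(nb.spam_probability("monday meeting"))
```

With this toy data the first message scores above 0.5 and the second below it. The K nearest neighbor and SVM classifiers are usually taken from an existing machine learning library rather than written by hand.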
Why should we care?
- Spam generates huge economic costs: For users exposed to spam, the costs include time wasted wading through irrelevant information and important messages missed because they landed in the junk mail folder. Server hardware also needs more capacity than would be required in the absence of spam. Overall, according to Justin Rao and David Reiley's estimates[6], the economic cost of email spam to end users worldwide is nearly $14 billion per year, while the revenue generated by email spam is only $160-$360 million per year. From the following table, we can see that the externality cost ratio is even higher than that of stealing cars.
- Spam contributes to energy consumption: Spam email also contributes to greenhouse gas emissions. A report released by McAfee in 2008 showed that the annual energy used globally to transmit, process, and filter spam totals 33 billion kilowatt-hours (kWh), equivalent to the electricity used in 2.4 million homes.
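As a quick check on the cost figures quoted above, the snippet below divides the estimated cost to end users by the estimated spam revenue. It assumes the externality cost ratio is simply cost imposed on others divided by revenue earned. That definition is a simplification made for this example.

```python
# Figures quoted in the text above (US dollars per year).
cost_to_users = 14e9
revenue_low, revenue_high = 160e6, 360e6

# Assumed definition: external cost divided by private revenue.
ratio_high = cost_to_users / revenue_low    # smallest revenue gives the largest ratio
ratio_low = cost_to_users / revenue_high

print(f"{ratio_low:.1f} to {ratio_high:.1f}")  # prints "38.9 to 87.5"
```

Under that assumption, each dollar of spam revenue corresponds to roughly $39 to $88 of cost borne by end users.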
[1] http://en.wikipedia.org/wiki/Spamming
[2] http://www.nolo.com/legal-encyclopedia/how-does-spam-work-30013.html
[3] http://computer.howstuffworks.com/internet/basics/spam1.html
[4] http://help.websiteos.com/websiteos/introduction_to_spam_filter.htm
[5] http://www.bullguard.com/bullguard-security-center/internet-security/internet-threats/how-does-a-spamfilter-work.aspx
[6] http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.26.3.87