Learning to Remove Internet Advertisements

Abstract

How often have you opened a web page, only to have annoying advertisements draw your attention away from its content? The objective of this project was to devise a system that recognizes which images on a web page are advertisements and removes them automatically. We chose a learning approach: we implemented an SVM classifier that decides whether or not an image is an advertisement.

Background
Some previous ad-removal systems relied on hand-crafted rules. Learning systems improved the success rate, but that success clearly depends strongly on the examples used to train the learning machine. Reports commonly state that the training input was sampled randomly from the web, but several tests we made show that sites that are difficult to classify are rarely chosen that way.
In this project we gathered a large (1300 samples), diverse database that represents a wide range of web sites, including categories that may be difficult to classify, such as sites that deal with advertising, money, and finance.
Our goal was to train the SVM classifier on these examples and to use that classifier in an on-line demo capable of removing a high percentage of Internet advertisements.

The solution
1. We developed a system for collecting samples from the Internet, with a GUI that lets the user label samples quickly. This is the user screen of that system:

[Figure 1: user screen of the sample-collection system]

For each sample we collected the relevant information from the HTML file, so that numeric and textual features could be extracted from it later on.
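As a rough illustration of this step, the Python sketch below collects one record per image from an HTML page and derives a few numeric and textual features from it. The project's own collector was written in Java and is not shown in this summary; the class and function names, the choice of features (width, height, aspect ratio, words from the image URL, alt text and link target), and the example page are all assumptions made for illustration.

```python
from html.parser import HTMLParser
import re


class ImageSampleParser(HTMLParser):
    """Collects one raw record per <img> tag, plus the link that wraps it."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the enclosing <a>, if any
        self.samples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.current_href = attrs.get("href")
        elif tag == "img":
            self.samples.append({
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "width": attrs.get("width"),
                "height": attrs.get("height"),
                "href": self.current_href or "",
            })

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


def to_int(value):
    """Parse a pixel attribute such as '468'; None when missing or invalid."""
    return int(value) if value and value.isdigit() else None


def extract_features(sample):
    """Turn one raw record into numeric and textual features (assumed set)."""
    width, height = to_int(sample["width"]), to_int(sample["height"])
    aspect = width / height if width and height else None
    text = " ".join([sample["src"], sample["alt"], sample["href"]]).lower()
    words = sorted(set(re.findall(r"[a-z]+", text)))
    return {"width": width, "height": height, "aspect": aspect, "words": words}


# Hypothetical page fragment containing one linked banner image.
page = ('<a href="http://ads.example.com/click">'
        '<img src="/banners/promo.gif" width="468" height="60" alt="Buy now"></a>')
parser = ImageSampleParser()
parser.feed(page)
features = [extract_features(s) for s in parser.samples]
```

Here `features[0]` holds the banner's dimensions and aspect ratio together with the words found in its URL, alt text and link target. A real collector would also have to cope with missing attributes and relative URLs.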

2. We built an off-line system for training the SVM classifier. Three classifiers were tested: a linear classifier, a polynomial-kernel classifier, and a Gaussian-kernel classifier. The classifier's hyper-parameters and the textual threshold parameters were optimized for each classifier, and the best classifier was exported to the on-line system. The following statistics describe the results of classifying 977 learning samples and 323 testing samples:

Classifier    Learning samples                  Testing samples
              Normal     Ad         Overall     Normal     Ad         Overall
Linear        96.9%      76.2%      92.7%       93.3%      89%        92.3%
Polynomial    96.8%      64.7%      92.2%       94.8%      78.1%      91.1%
Gaussian      96.9%      65.2%      92.3%       95.2%      46.6%      84.3%

Normal = success with normal pictures; Ad = success with ad pictures; Overall = overall success.
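
The three success measures in the table can be computed from true and predicted labels along the following lines. This is a minimal Python sketch, not the project's Matlab code; the function name and the toy data are assumptions and do not reproduce the project's samples.

```python
def success_rates(true_labels, predicted_labels):
    """Return (normal, ad, overall) success rates in percent.

    Labels are booleans: True for an advertisement, False for a normal picture.
    """
    pairs = list(zip(true_labels, predicted_labels))
    normal = [t == p for t, p in pairs if not t]  # hits among normal pictures
    ads = [t == p for t, p in pairs if t]         # hits among ad pictures

    def percent(hits):
        return 100.0 * sum(hits) / len(hits) if hits else float("nan")

    return percent(normal), percent(ads), percent(normal + ads)


# Hypothetical toy data: 8 normal pictures followed by 2 ads.
truth = [False] * 8 + [True] * 2
guess = [False] * 7 + [True] + [True, False]
normal_rate, ad_rate, overall_rate = success_rates(truth, guess)
```

Because the overall rate is weighted by class size, it is dominated by the normal pictures whenever they outnumber the ads, which is why the per-class columns are reported separately.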

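The linear classifier achieves the best overall result on the testing samples (92.3%). Its most significant advantage is its success with advertisement images: 89% on the testing samples, compared with 78.1% for the polynomial classifier and 46.6% for the Gaussian classifier.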
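
For reference, the three classifiers compared in step 2 differ only in the kernel function used inside the SVM decision rule. The Python sketch below gives the standard textbook forms of these kernels. The hyper-parameter names and default values (`degree`, `coef0`, `sigma`) are assumptions, since this summary does not state which values were tuned, and the support vectors in the usage lines are made up.

```python
import math


def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree; degree and coef0 are assumed hyper-parameters."""
    return (linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed hyper-parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decision_value(x, support_vectors, coefficients, bias, kernel):
    """SVM decision function: sum_i coef_i * K(sv_i, x) + bias.

    Each coefficient is the product of a support vector's weight and its label.
    """
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, coefficients)) + bias


# Made-up example: two support vectors with opposite labels.
svs = [[1.0, 0.0], [0.0, 1.0]]
coefs = [1.0, -1.0]
score = decision_value([2.0, 0.5], svs, coefs, 0.0, linear_kernel)
is_ad = score > 0  # a positive value is read here as "advertisement"
```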

3. We developed an on-line demo that uses the classifier to remove advertisements. The application gets a site address from the user and then opens two browser windows: one displays the original site, and the other displays the filtered one. Every advertisement image is replaced with this image:
[Figure 2: replacement image shown in place of each advertisement]
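
The replacement step can be sketched in Python as follows. The demo itself was written in Java and its code is not shown here, so everything in this sketch is an assumption: the placeholder file name, the regular expressions (which only handle quoted `src` attributes), and the `is_ad` callback, which stands in for the trained classifier.

```python
import re

PLACEHOLDER_SRC = "removed_ad.gif"  # hypothetical name of the replacement image

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(r'(\bsrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def filter_page(html, is_ad):
    """Return html with the src of every <img> flagged by is_ad() replaced."""

    def swap_src(match):
        # Keep the attribute name and quote style, replace only the URL.
        return match.group(1) + match.group(2) + PLACEHOLDER_SRC + match.group(2)

    def replace_tag(match):
        tag = match.group(0)
        if SRC_ATTR.search(tag) and is_ad(tag):
            return SRC_ATTR.sub(swap_src, tag, count=1)
        return tag

    return IMG_TAG.sub(replace_tag, html)


# Stand-in rule used instead of the trained classifier, for illustration only.
page = '<p><img src="/photos/cat.jpg"> <img src="/banners/promo.gif" width="468"></p>'
filtered = filter_page(page, lambda tag: "/banners/" in tag)
```

Only the `src` attribute is rewritten, so the image keeps its original size and position and the page layout stays unchanged.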

Example of the system’s output:
Original site:

[Figure 3: screenshot of the original site]

After ad removal:

[Figure 4: screenshot of the site after ad removal]

Tools
Our work environment was Java (in Eclipse) and Matlab 6.5.
The data gathering system and the on-line application were implemented in Java.
The off-line learning phase was done in Matlab.

Conclusions and suggestions for future work:

  • A learning machine is clearly capable of removing a high percentage of Internet advertisements.
  • A next step could be to integrate the classifier with a browser or a proxy server.
  • Future improvements could extract information from the HTML page that the image links to, or from the image itself by means of image processing.

Acknowledgment
We are grateful to our project supervisor, Dori Peleg, for his help and guidance throughout this work. We are also grateful to the lab's staff and to the Ollendorff Minerva Center Fund, which supported this project.