Multi-agent decision support system for search engines marketing

KAMILA GRZĄŚKO

West Pomeranian University of Technology

Summary

The article presents the main areas of search engine marketing and an algorithmic approach to optimizing online marketing campaigns, applying decision support methods to the analysis of search engine data. It proposes a methodological basis for building a multi-agent decision support system that uses text mining methods, presents a model of such a system, and evaluates it.

Key words: Internet, search engines marketing, knowledge extraction

1. Introduction

The internet is a rich collection of hypertext documents that gives access to all kinds of knowledge and services; however, finding specific information can be difficult. To make the process more effective, many search applications and search engine websites are available. A search application is usually integrated with a website through direct tools or an open API, which allows external applications to be used and data to be browsed online. These tools consist of a data repository connected to databases, indexing robots and query processors. Initially, search engines could process only text documents; now it is possible to look up pictures, sound files, e-mail addresses and newsletters. They are often used to find more specific data, including trade and business information. According to research conducted by Megapanel PBI/Gemius, more than 8 million people in Poland currently use internet search engines. The market is constantly changing and new ways of exploiting it keep appearing. As these tools grow in popularity, their role in online marketing increases.

2. Search engines marketing

The popularity of a web search engine is directly related to the results it generates. The most popular search engines in Poland are Google, Netsprint and Onet. A search engine's algorithm is a set of rules that lists search results in a certain order, and these algorithms are constantly upgraded. Their basis is formed by indexing algorithms, which take many factors into account when determining the order of results. Usually these factors are the number and quality of links leading to a site, and the popularity and key words of the site. Search portals rarely disclose their algorithms, to prevent webmasters from optimizing their websites for a particular search engine. Only dedicated research reveals what influences the search results. Google, the most popular search engine in Poland, takes several factors into account:

$$\mathrm{Rank} = 0.3\,K + 0.25\,D + 0.25\,L + 0.1\,U + 0.1\,J + R \qquad (1)$$

where: K – key words, D – the strength of the domain, L – the incoming links, U – the influence of users' behavior, J – the quality test result, R – penalty points assigned automatically or manually.

An important element of the search algorithm used by Google is the PageRank algorithm, which evaluates individual pages. A page scores higher the more external pages link to it rather than to other pages. The PageRank value ranges from 0 to 1. It applies to a single document, not to the entire portal. PageRank is calculated as follows:

$$PR(A) = (1 - d) + d\left(\frac{PR(t_1)}{C(t_1)} + \dots + \frac{PR(t_n)}{C(t_n)}\right) \qquad (2)$$

In equation (2), d is the damping factor (typically 0.85), t_1…t_n are the pages containing links to page A, and C(t_i) is the number of links on page t_i [2]. This algorithm is constantly modified, and many other search engines adopt similar solutions.

Another algorithm, called Traffic Index, is used by NetSprint. To evaluate websites, a number of factors have been introduced to make the result more objective. Link-based importance matters less here than the measured traffic on the website. The main goal is to make it impossible to modify search results without real changes in the website's content. NetSprint takes into consideration research conducted by Megapanel PBI/Gemius. Data gathered from internet users allows an accurate evaluation: users submit their answers, which form the basis of the system [4].

The growing popularity of search portals has led many companies to use them to reach potential customers with their offer. In the United States, 40% of online marketing expenditure goes to promotion in search engines. Companies can achieve various benefits in this way. Search Engine Marketing (SEM) improves a website's visibility and thus makes it more accessible. The main goals of companies introducing SEM are to boost sales of goods and services, increase brand awareness, decrease the overall cost of advertising, gain an advantage over competitors, and gain recognition among customers who value innovation [3]. Internet users do not reach such a website by accident, but because they are searching for specific information or a specific service. With this kind of marketing, the website's visitors are therefore a well-selected group [6].

One of the fields in which marketers must be active is positioning their websites to obtain the best possible place on search result lists. Companies make efforts to optimize their websites for the most popular search engines, mostly using the SEO (Search Engine Optimization) technique [7]. Website positioning should take many factors into account. Success depends on: content (key words, title, body text, description), technical components (domain, names of links and files, site structure, scripts), and popularity (link popularity, PageRank, traffic).
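To make equations (1) and (2) concrete, the following minimal Python sketch evaluates the weighted score and iterates the PageRank recurrence on a toy link graph. The function names, the example graph, the starting value of 1.0 per page and the fixed number of iterations are illustrative assumptions, not details taken from the cited sources.

```python
def rank_score(k, d, l, u, j, r=0.0):
    """Weighted score from equation (1).

    The inputs are assumed to be expressed on a common scale;
    r holds the penalty points (zero when no penalty applies).
    """
    return 0.3 * k + 0.25 * d + 0.25 * l + 0.1 * u + 0.1 * j + r


def pagerank(links, d=0.85, iterations=50):
    """Iterate equation (2): PR(A) = (1 - d) + d * sum(PR(t) / C(t)).

    `links` maps each page to the list of pages it links to, so
    C(t) is len(links[t]). Every page starts with a score of 1.0.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Contributions PR(t) / C(t) from every page t that links here.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Toy graph (hypothetical): A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy_links)
```

In this un-normalised form the scores sum to the number of pages whenever every page has at least one outgoing link; in the toy graph, page C, which receives links from both A and B, ends up with the highest score.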

3. Data extraction methods in search engine applications

Rapid internet development brings an increasing number of indexed websites and growing information noise, which make useful data harder to find. Contemporary search engines use intelligent algorithms, characterized by their ability to collect data from the environment and use it to perform their tasks as well as possible [8]. The KDD (knowledge discovery in databases) process is complex; its most important feature is data exploration, which discovers relations between pieces of information. The discovered knowledge takes the form of regularities, rules, tendencies or correlations, together with a hierarchy of importance of the attributes involved. One of the most widely used methods of data exploration is grouping, which means assigning objects to groups (clusters) so that similarity is maximal among objects within one group and minimal between objects of different groups. The purpose of the algorithm is to find a rule that allows the division into groups. As input it receives objects described by a number of attributes, together with similarity criteria inside and outside a group [9]. Grouping is an unsupervised learning process: there are no predefined classes. The method is designed for large amounts of data and is time and memory consuming [10]. Grouping algorithms have found broad application in search engines. They are used to automate the grouping of text documents, eliminate identical sites, eliminate words considered unimportant, extract key words, create thesauruses, and group websites in response to a user's query. The algorithms may take the form of decision trees (graphs), Bayesian networks or neural networks, and should be characterized by high efficiency of data processing and cluster generation [8].
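Two of the uses listed above, removing unimportant words to extract key words and eliminating identical documents, can be illustrated with a short Python sketch. The stop-word list, the tokenisation rule and the function names are hypothetical choices made only for this illustration; a production search engine would use far richer resources.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for"}


def bag_of_words(text):
    """Count lower-cased alphanumeric tokens, skipping stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def keywords(text, top_n=3):
    """Return the top_n most frequent remaining words of a document."""
    return [word for word, _ in bag_of_words(text).most_common(top_n)]


def drop_duplicates(docs):
    """Keep only the first document for each distinct bag of words."""
    seen, unique = set(), []
    for doc in docs:
        signature = frozenset(bag_of_words(doc).items())
        if signature not in seen:
            seen.add(signature)
            unique.append(doc)
    return unique
```

For example, `drop_duplicates(["The search engine", "search engine", "engine marketing"])` keeps the first and third strings, because the first two differ only by a stop word.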

Another important element of search engines is text analysis. Data exploration aimed at text analysis uses text mining algorithms. These methods extract fragments of text for further processing. K. Węcel defines text mining as "discovering hidden patterns and rules within large text collections". The analyzed data is stored in text documents and text databases [11]. This approach generally means finding key phrases and converting them into numerical data on the basis of predefined patterns. Relations between the pieces of text are then found, and the pieces are grouped to find useful information. The results are analyzed and form the basis for the final decisions. Information retrieval systems often use two evaluation measures. The first is precision (accuracy), the fraction of retrieved documents that match the submitted query:

$$P = \frac{|\text{[Relevant]} \cap \text{[Retrieved]}|}{|\text{[Retrieved]}|} \qquad (3)$$

The second is recall (completeness), which defines what part of the documents matching the query has been found:

$$R = \frac{|\text{[Relevant]} \cap \text{[Retrieved]}|}{|\text{[Relevant]}|} \qquad (4)$$
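As a small worked illustration of the two measures, the Python sketch below computes them for sets of document identifiers. The function name and the example identifiers are hypothetical; returning 0.0 for an empty set is a convention chosen here, not something the source specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids.

    precision = |relevant and retrieved| / |retrieved|
    recall    = |relevant and retrieved| / |relevant|
    An empty denominator yields 0.0 (a convention assumed here).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents retrieved, three actually relevant.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
```

Here two of the four retrieved documents are relevant, so precision is 0.5, and two of the three relevant documents were found, so recall is 2/3.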

In equations (3) and (4), [Retrieved] is the set of retrieved documents and [Relevant] is the set of documents matching the query. Text mining is most often used to search for and segment similar documents (including pictures and multimedia files), to gather personal information, and to identify acronyms. Its major advantage is the ability to work with unstructured data. It makes it possible to gather and use information held in text files more effectively, without the need to update obsolete data [12].

4. Model of a multi-agent decision support system and test research

The high complexity of the algorithms and the broad range of marketing campaigns make optimization efforts difficult without supporting tools. The article proposes the construction of a decision support system (DSS) whose goal is to support marketing activities and the process of positioning in search engines. The general idea is to use a set of agents whose task is to analyze search results and the code of websites found at various positions in the ranking. The retrieved data is stored in a database; the agents can then extract information from it using text mining methods, assign documents to clusters, and return information useful for optimizing a website. For clustering, the agents use the k-means algorithm. To calculate the discrepancy between a given point and its corresponding center, a distance function is used (usually the L2 norm). For this norm, the sum of squared distances between points and their corresponding centers of gravity equals the total within-cluster variation [13]:

$$E(C) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2 \qquad (5)$$

where c_j is the center of gravity of cluster C_j.
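The objective in equation (5) can be illustrated with a minimal pure-Python sketch of the standard k-means iteration (Lloyd's algorithm). This is not the Text Garden implementation used in the study; the initialisation from the first k points, the fixed iteration count and the sample points are simplifying assumptions.

```python
def sq_dist(a, b):
    """Squared L2 distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean (center of gravity) of a non-empty list."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=20):
    """Return (centers, clusters) after a fixed number of iterations."""
    centers = [tuple(p) for p in points[:k]]  # naive initialisation (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers, clusters


def objective(centers, clusters):
    """E(C) from equation (5): total within-cluster sum of squares."""
    return sum(sq_dist(p, c) for c, pts in zip(centers, clusters) for p in pts)


# Hypothetical two-dimensional sample with two obvious groups.
sample = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centers, clusters = kmeans(sample, k=2)
error = objective(centers, clusters)
```

For this sample the algorithm settles on two clusters of three points each, and the objective equals 8/3.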

Four kinds of agents have been created for the needs of the system. Each has specific functions and is responsible for a different part of the system. The MA (management agent) supervises the work of the platform and manages and coordinates the other agents. The SA (search agent) analyzes the code of websites listed in the results returned by search engines. The CA (clusterisation agent) deals with the clusterisation of documents. The IA (interface agent) is responsible for communication with the user and receives the user's queries.

The system is composed of four subsystems, shown in Figure 1. The IP (information processing) module gathers information from the data sources; after interpretation, the data is passed to the data storage subsystem. The search and clusterisation agents (AP and AK) are implemented inside this module. The IS (information storage) module stores information concerning users, key words, agents and their functions. The IM (information management) module manages the integrity and relevance of the data; when the system determines that the data needs to be updated, it starts the update procedure. The ID (information delivery) module processes users' queries and displays the search results.

[Figure 1. Structure of the decision support system. Diagram labels: Internet, Web users, Documents retrieval, Filtering, Calculations, Clustering, Visualisation, Results processing, Data storage, Information management, Information delivery, Agent parameters, Tasks list, Task request, Information flow.]
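The division of responsibilities among the four agents can be sketched in Python as follows. This is only an illustrative skeleton under stated assumptions: the class and method names are hypothetical, the search agent receives a stub callable instead of querying a real search engine, and the grouping rule merely stands in for the k-means step.

```python
class Storage:
    """Stand-in for the IS module: keeps retrieved documents per query."""
    def __init__(self):
        self.documents = {}


class SearchAgent:
    """SA: obtains page code for ranked results via a supplied callable."""
    def __init__(self, fetch):
        self.fetch = fetch  # hypothetical: query -> {rank: page code}

    def run(self, query, storage):
        storage.documents[query] = self.fetch(query)


class ClusterisationAgent:
    """CA: groups stored documents with a supplied clustering function."""
    def __init__(self, cluster_fn):
        self.cluster_fn = cluster_fn

    def run(self, query, storage):
        return self.cluster_fn(storage.documents[query])


class ManagementAgent:
    """MA: coordinates the other agents for one query."""
    def __init__(self, search_agent, cluster_agent, storage):
        self.search_agent = search_agent
        self.cluster_agent = cluster_agent
        self.storage = storage

    def handle(self, query):
        self.search_agent.run(query, self.storage)
        return self.cluster_agent.run(query, self.storage)


class InterfaceAgent:
    """IA: accepts the user's query and formats the answer."""
    def __init__(self, manager):
        self.manager = manager

    def ask(self, query):
        clusters = self.manager.handle(query)
        return {f"K{i}": sorted(ranks) for i, ranks in enumerate(clusters)}


def fake_fetch(query):
    """Stub data source (assumption): rank -> page code."""
    return {1: "<html><h1>a</h1></html>", 5: "<html><h1>b</h1></html>", 90: "<div>c</div>"}


def by_prefix(documents):
    """Placeholder grouping rule: pages sharing their first six characters."""
    groups = {}
    for rank, code in documents.items():
        groups.setdefault(code[:6], []).append(rank)
    return list(groups.values())


manager = ManagementAgent(SearchAgent(fake_fetch), ClusterisationAgent(by_prefix), Storage())
answer = InterfaceAgent(manager).ask("example query")
```

Passing the data source and the clustering rule in as callables keeps the coordination logic separate from the retrieval and text mining details, which is one plausible way to reflect the modular structure described above.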

Text mining is performed by the clusterisation agent (AK). It divides documents into groups so that it is possible to determine whether the structure of a document has any influence on the search results. The agent uses the Text Garden application: it extracts documents from the database, converts the .html files into .txt text files, and converts the files of the search engine into a .bow (bag of words) file. The .bow file is used to divide the documents into groups with the k-means algorithm, and the results are then displayed.

The agent first clusters documents from Google. The test examines a set of 12 documents listed 1st, 5th, 6th, 9th, 10th, 15th, 20th, 25th, 50th, 90th, 95th and 100th in the search results. The websites are analyzed to discover whether there are similarities in their structures that influence their order on the list.

With division into two clusters, the quality of processing for the 12 documents is Q = 6.25262 and the mean similarity is MS = 0.521. Cluster K0 received 9 documents with MS = 0.469; the values for the individual documents were: 90G: 0.603913, 25G: 0.568047, 10G: 0.565841, 20G: 0.494054, 50G: 0.452739, 15G: 0.435805, 6G: 0.421556, 100G: 0.359408, 95G: 0.317053. Cluster K1 contains three documents with MS = 0.678 and the following values: 1G: 0.725187, 9G: 0.678475, 5G: 0.630543.

With division into five clusters, the quality measure is Q = 8.67606 and MS = 0.723, with the following division of documents: K0 contains one document (MS = 1.000; 5G: 1); K1 contains four documents (MS = 0.628; 25G: 0.779, 10G: 0.678, 6G: 0.600, 95G: 0.456); K2 and K3 contain one document each; K4 contains five documents (MS = 0.632; 90G: 0.781, 1G: 0.703, 15G: 0.659, 20G: 0.532, 50G: 0.487).

With division into two clusters, a strong relation between the structure of a website and the search results is clearly visible. Eight documents from further places on the list and one from its beginning belong to the same cluster, while most of the documents from the beginning of the list belong to the other cluster. A similar relation, though not as clearly noticeable, is present in the division into five clusters, where websites with similar positions on the list are classified in the same cluster. In the Onet search engine one can also observe a clear relation between the structure of a document and its position on the list. Each pair or triple of documents, (4, 5), (9, 10, 11), (20, 35, 40) and (100, 105), is classified within the same cluster. With division into five clusters this relation is also visible.
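The summary figures reported for the two-cluster Google test can be re-derived from the per-document values. The Python sketch below assumes that Q is the sum of the per-document similarity values and MS is their mean; this reading is an assumption, since the text does not define the two measures, but it reproduces the reported numbers.

```python
# Per-document similarity values reported for the two-cluster division.
two_clusters = {
    "K0": {
        "90G": 0.603913, "25G": 0.568047, "10G": 0.565841,
        "20G": 0.494054, "50G": 0.452739, "15G": 0.435805,
        "6G": 0.421556, "100G": 0.359408, "95G": 0.317053,
    },
    "K1": {"1G": 0.725187, "9G": 0.678475, "5G": 0.630543},
}


def summarise(clusters):
    """Return (per-cluster mean similarity, Q, overall mean similarity).

    Q is taken to be the sum of all similarity values and MS their
    mean; both definitions are assumptions made for this check.
    """
    per_cluster = {
        name: sum(values.values()) / len(values)
        for name, values in clusters.items()
    }
    q = sum(sum(values.values()) for values in clusters.values())
    n = sum(len(values) for values in clusters.values())
    return per_cluster, q, q / n


per_cluster_ms, quality, mean_similarity = summarise(two_clusters)
```

Under these assumptions the computed values round to MS = 0.469 for K0, MS = 0.678 for K1, Q = 6.25262 and an overall MS of 0.521, matching the figures given in the text.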

5. Afterword

The fast development of civilization brings the need to accept new challenges and to develop new solutions to them with the use of new technologies. An important issue for many companies is gaining an advantage over their competitors, including in search results. Structuring the growing amount of information is becoming more difficult. An important part of a DSS is the correct choice of data exploration tools, which make it possible to extract the data essential for generating accurate rules and conclusions. Decision support systems based on text mining methods make it easier to take proper decisions when planning the structure of a website and choosing key words. Data exploration techniques are another important element of decision support. Knowledge of the structure of highly rated and well-positioned websites helps to achieve a similarly high position in the search results, which in turn brings more visits to the website. Ideas for further research on decision support systems include upgrading the data extraction algorithms to ensure even better results. This is certainly a time-consuming process that requires high expenditure; however, it may eventually lead to very good results.

Bibliography

1. Benicewicz A., Nowakowski M. (2005): Webpositioning. Skuteczne pozycjonowanie w Internecie, czyli jak efektywnie wypromować witrynę, Wydawnictwo MIKOM, Warszawa.

2. Berry M.W., Browne M. (2005): Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial and Applied Mathematics.

3. Brown B.C. (2007): The Ultimate Guide to Search Engine Marketing: Pay Per Click Advertising.

4. Dokumentacja systemu Netsprint (2000), http://netsprint.pl

5. Dunford T. (2008): Advanced Search Engine Optimization: A Logical Approach, American Creations of Maui.

6. Frąckiewicz E. (2006): Marketing Internetowy, Wydawnictwo Naukowe PWN, Warszawa.

7. Gemius/PBI (2008): Megapanel, pbi.og.pl

8. Jones K.B. (2008): Search Engine Optimization: Your visual blueprint for effective Internet marketing, Wiley Publishing.

9. Frontczak T. (2006): Marketing Internetowy w Wyszukiwarkach, Wydawnictwo Helion, Gliwice.

10. Hand D., Mannila H., Smyth P. (2005): Eksploracja danych, WNT, Warszawa.

11. Kłopotek M.A. (2001): Inteligentne wyszukiwarki internetowe, Akademicka Oficyna Wydawnicza EXIT, Warszawa.

12. Langville A.N., Meyer C.D. (2006): Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press.

13. Netsprint (2008): Traffic index, Materiały informacyjne firmy NetSprint Sp. z o.o.

14. Mcleod A. (2007): Marketing Internetowy w praktyce, Internetowe Wydawnictwo Złote Myśli, Gliwice.

15. SoeMoz.org (2008): A little piece of google algorithm revealed, http://SoeMoz.org

16. Zanasi A., Brebbia C.A., Ebecken N.F.F. (2006): Data Mining VII: Data, Text And Web Mining And Their Business Applications, WIT Press.

Jarosław Jankowski, Kamila Grząśko

Computer Science Faculty

West Pomeranian University of Technology

71-210 Szczecin, ul. Żołnierska 49

e-mail: kgrzasko@wi.ps.pl, jjankowski@wi.ps.pl