
Application of Text Mining on the analysis of the publisher of the Studies and Proceedings of the Polish Association for Knowledge Management



6.1.1 Overview of the Studies and Proceedings of the Polish Association for Knowledge Management

The Polish Association for Knowledge Management was established on January 14, 2003 by representatives of 9 leading Polish universities and the Systems Research Institute of the Polish Academy of Sciences. The other members include representatives of 26 academic centers from all over Poland as well as entrepreneurs. The Polish Association for Knowledge Management was registered under KRS No. 0000163221 at the District Court in Bydgoszcz, 13th Commercial Division of the National Court Register, on May 30, 2003.

- The primary objective of the Association is “to support development and disseminate achievements in organizational, legal and scientific principles of the creation, transfer and processing of knowledge, and – in particular: to support scientific research in all disciplines participating in knowledge management processes, including IT methods and resources, and to exchange experiences, scientific and educational achievements within the scope of knowledge management systems.”

The objectives are pursued mainly through the organization of seminars, symposiums, scientific conferences, general educational activities, the publishing of the volumes of the Studies and Proceedings, the practical promotion of information and knowledge, the development of expert opinions and testimonials, and acting as an intermediary in domestic and international scientific exchange.

The objects of the Association’s economic activity are primarily the following:

- Consultancy services with regards to computer hardware and software, databases and problems connected with the application of Information Technology;

- Scientific and research work ensuring development in all disciplines of interest to the Association;

- Other activities connected with professional organizations.

It is particularly noteworthy that income from the Association’s economic activity may not be distributed among its members, but it has to be used for statutory objectives and to cover the operational costs of the Association.

Since 2004, the Polish Association for Knowledge Management has been operating as the publisher of the periodic journal Studies and Proceedings of the Polish Association for Knowledge Management (Pol. Zeszyty Studia i Materiały Polskiego Stowarzyszenia Zarządzania Wiedzą). It is a collection of articles compiled according to the specific subject of a given volume. The publisher aims to produce at least four volumes per year. To date, 45 volumes have been published, 9 of which contain only English-language articles.

All volumes published before 2011 are available for download in PDF format from the Association's website. In this way, it is possible to access 36 complete volumes of the Studies and Proceedings of the Polish Association for Knowledge Management.


Each research paper published in the Studies is 11 to 13 pages long (approximately 20,000 characters with spaces). The articles are subject to detailed review, and those that receive negative reviews are not published. Nearly 20 percent of the submitted papers fail the review, which reflects the high quality level of the publications.

6.2 The object of the study – 9 volumes published in English

The 9 volumes of the Studies and Proceedings published in English provided the material for analysis. The choice was by no means random: it was determined by the linguistic limitations of the open-source software, which does not include ISO codes for the Polish language and, as a result, cannot identify Polish characters correctly, leaving them out or replacing them with random symbols. In addition, certain processes included in the software rely on spell-checking tools for a given language, so the lack of Polish dictionaries makes it impossible to analyze Polish-language texts.

The English-language volumes contain 159 articles with a total of 1,478 pages. Thus, they constitute an extensive database that allows the development of knowledge maps and may lead to interesting results when analyzed with the text mining method.

6.3 Research methodology

6.3.1 Data standardization stages

The first stage in any text mining process is the acquisition and standardization of the data. Acquisition mechanisms consist in matching portions of text against models that define the kind of content requested. In text analytics it is therefore especially important to prepare the documents to suit the software used: they should be in a digital format and written in English, as this is a precondition for the software to interpret the text accurately.

For this purpose, the 9 volumes of the Studies and Proceedings were exported from their original PDF format (Adobe Acrobat portable file) to MS Word format and then converted into regular text files (TXT). This step is very important, because the transformation of a source text into a format accepted by the simplest word processor removes any style attributes or formatting markers, such as HTML tags.

The outcome of the conversion was a set of text files stripped of any formatting characters and uniform with regard to character encoding (UTF-8). The course of the subsequent processes depended on how the information contained in the texts was to be represented. The data were divided into separate files, one per article, resulting in a database of 159 files corresponding to the articles published in the 9 volumes.
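To make this preparation step concrete, the sketch below shows how such a normalization could be scripted in Python. It is an illustrative reconstruction only: the study performed the conversion manually via MS Word, and the folder names, the cp1250 fallback code page and the cleanup rules here are assumptions.

```python
# Illustrative sketch (not part of the original workflow): normalise a folder
# of exported text files to plain UTF-8 and strip stray control characters.
# Folder names and the cp1250 fallback are assumptions.
import pathlib
import unicodedata

SRC = pathlib.Path("exported_txt")   # files exported from PDF via MS Word
DST = pathlib.Path("corpus_utf8")    # one clean UTF-8 file per article
DST.mkdir(exist_ok=True)

for src_file in sorted(SRC.glob("*.txt")):
    raw = src_file.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("cp1250", errors="replace")  # legacy Central European code page
    # Drop control characters left over from the conversion, but keep line breaks.
    text = "".join(ch for ch in text
                   if ch == "\n" or unicodedata.category(ch)[0] != "C")
    (DST / src_file.name).write_text(text, encoding="utf-8")
```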

6.3.2 RapidMiner

Choosing the right program to perform text mining is very important, as the technical environment for text analysis must ensure reliability, stability and high calculation speed. There is a wide choice of text mining tools. Commercial software is expensive, but it offers many functionalities, combining ease of operation with the means to overcome most obstacles. For the purpose of this analysis, however, a decision was made to use open-source solutions.


The leading free text mining environment is the RapidMiner software, developed at the University of Dortmund, whose popularity has been growing since 2004. The statistics published on the developer's website (rapid-i.com) show that RapidMiner is used by nearly half of all users who apply text mining in their projects. It is also worth mentioning that the software is used by companies such as Cisco Systems, HP, IBM, Ford, Honda and Philips.

The popularity of RapidMiner is not surprising: the program is a full data mining tool, of which text mining is only one part (it offers more than 500 operators).

The properties of the software include:

- Integration with all available operating systems;

- Reading individual files in such formats as .txt, .xls, .xml, or .csv, and all documents from folders or professional databases;

- Access to more than 500 processes for data processing, transformation, integration, data mining, text mining, filtering, aggregation, modeling, evaluation of results and visualization;

- Graphical presentation of processes;

- Integration with the Weka library;

- Extensive support system and tutorials (as well as online training);

- Access to free online updates.

The aforementioned features of RapidMiner were the decisive factors in choosing it as the tool for this analysis.

The program has a very clear and user-friendly interface (Fig. 6.1) with a particularly useful graphical representation of programmed processes and the function enabling the user to find a specific task in the program.

RapidMiner works very fast and is remarkably stable. Naturally, the speed of calculation depends on the hardware platform on which the software is installed. When performing the analyses, the program processed huge text data sets without any disruptions.


Figure 6.1 The main screen and the design window of RapidMiner Source: Own research.

6.3.3 Operations on documents in RapidMiner

Analyzing documents is generally associated with NLP (Natural Language Processing), since NLP usually concentrates on a single document. However, building relationships and creating statistics for the occurrence of specific words and phrases from a particular subject area require larger-scale efforts and the analysis of large volumes of background documents (a so-called corpus).

The basic, automatic processing of documents used in this analysis consists of the following phases:

- Importing a set of documents into the program. During the process, it is possible to set a few significant parameters, such as the elimination of words which occur in the text fewer than x times. It is a useful option if we want the results to apply only to the most common words and to reduce the size of the data set.

- Breaking the input text into words, the so-called tokenisation, a process in which the text is stripped of any punctuation marks, hyphens, bullets, double spaces, or numbers. The system removes any characters that are not letters. As a result, a string of all the words from the document, separated by single spaces, is obtained (Fig. 6.2);

Figure 6.2 The result of tokenisation Source: Own research.

- Filtering out non-significant words on the basis of a so-called stop list. In this process, all elements that are irrelevant for the dictionary, e.g. certain words (prepositions, verbs etc.), characters and phrases, and even articles in English, are removed. Additionally, the filtering can be set to exclude words of a given length, e.g. leaving only words that are at least two letters long; therefore, single letters, if they somehow survived the stop-list tool, will be removed as well. Thanks to this process, the set of words to undergo subsequent analysis becomes much smaller (Fig. 6.3);

Figure 6.3 A result of stop-list filtering Source: Own research.

- An important aspect of the correct interpretation of words by an application is also a uniform character case. For this purpose, the Transform Cases operator is used, which has two selectable attributes: “lower case” and “upper case”. If the lower case option is chosen, the process changes all words written in upper case, or beginning with an upper-case letter, into lower-case words; e.g. Bydgoszcz is transformed into bydgoszcz, PROCESS into process, etc.;

- In some cases, the process called stemming has to be applied, in which Porter's algorithm helps eliminate similar words with suffixes and reduce them to their root form; e.g. where the words work, works and working occur in the text, they are treated as three different words, but the stemming process reduces them to one base word (work) with the occurrence count recorded as three (Fig. 6.4);

Figure 6.4 An effect of stemming (before and after) Source: Own research.

- The last substantial function is the generation of phrases (n-grams). This process allows the creation of phrases with 2, 3 or more segments, as required. The program searches the text to find neighbouring words which, according to the dictionary, may form a valid phrase, e.g. business and intelligence, European and union, access and information. The resulting phrases are joined with an underscore (business_intelligence, etc.) (Fig. 6.5).

Figure 6.5 The result of an n-gram generation process Source: Own research.


Passing a text through all these phases ensures that the data to be further processed or interpreted are properly standardized and organized.
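The study performed these phases with RapidMiner operators; purely as an illustration, the following Python/NLTK sketch approximates the same sequence (tokenisation, case transformation, stop-list filtering, Porter stemming and n-gram generation). The NLTK stop list and the three-character threshold stand in for the RapidMiner settings and are assumptions.

```python
# Illustrative Python/NLTK approximation of the phases described above;
# RapidMiner, not this script, was the tool actually used in the study.
# Requires: pip install nltk; nltk.download("stopwords")
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text, min_len=3):
    # Tokenisation: keep only alphabetic tokens (punctuation and numbers removed).
    tokens = re.findall(r"[A-Za-z]+", text)
    # Case transformation.
    tokens = [t.lower() for t in tokens]
    # Stop-list filtering and removal of very short words.
    tokens = [t for t in tokens if t not in STOP and len(t) >= min_len]
    # Stemming with Porter's algorithm (work, works, working -> work).
    stems = [STEMMER.stem(t) for t in tokens]
    # n-gram generation: neighbouring words joined with an underscore.
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return stems, bigrams

stems, bigrams = preprocess("Knowledge management systems support business intelligence.")
print(stems)    # e.g. ['knowledg', 'manag', 'system', 'support', 'busi', 'intellig']
print(bigrams)  # e.g. ['knowledge_management', 'management_systems', ...]
```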

6.4 Text mining analysis of the Studies and Proceedings

6.4.1 Article statistics

The text mining analysis was performed on the 9 volumes of the Studies and Proceedings, published in English. The volumes were numbered from 1 to 9 and their characteristics are presented in Table 6.1 below.

Table 6.1 Characteristics of the volumes of the ‘Studies and Proceedings’

Source: Own research.

There are 159 articles in all the volumes, published on a total of 1,478 pages. The total number of words in the entire set is 493,176, made up of as many as 2,795,240 characters without spaces. These numbers give an average of 9.3 pages per article, or 54,797 words per volume.

The largest volume in terms of the number of pages is No. 35 (Pos. 8 in Table 6.1) with 281 pages. Nevertheless, it contains approximately 76,000 words, which is about 13,000 fewer than No. 42 (Pos. 9) with 89,000 words.

Figure 6.6 shows the structure of the number of words per volume, i.e. the ratio of the word count in each volume to the total number of words, expressed as a percentage. Volume No. 42 has the greatest share in the total number of words, with 18.23% of all words, followed by Volume No. 35 with a share of 15.43%. The next one is Volume No. 11 with 12.72% of the total number of words, and each of the remaining 5 volumes contains approximately 9% of the words. Volume No. 15 contains the smallest share of words, 7.36%, and the smallest number of pages as well.

The next analyzed aspect was the number of articles per volume. It was established that Volume No. 15, the one with the fewest pages, consists of 13 articles, which is not the smallest number in the set. Volume No. 34 has the fewest publications of all the analyzed volumes (9), but its share in the word count amounts to nearly 9%.


Figure 6.6 The word count structure in individual volumes Source: Own research.

6.4.2 Author statistics

The next step was the development of statistical information concerning the authors of the publications. The total number of authors is 169; thus, in a dozen or so cases the articles were published by 2, 3 or more authors. Another analytical aspect consisted in the determination of the number of articles published by the same author.

Figure 6.7 The number of authors and their articles Source: Own research.


The chart in Fig. 6.7 shows that a vast majority – 128 authors (nearly 76% of all the authors) – published only one article in English, and 26 authors wrote 2 articles. Only 7 authors wrote 3 articles, while 4 and 5 articles were written by 3 and 2 authors, respectively. The most prolific authors published 8, 9 and even 20 articles. An analysis of the source materials revealed that they were outstanding professors from the academic centers in Bydgoszcz.

In order to carry out further statistical analysis on the data relating to the authors, certain additional work had to be done. All articles were sorted by author and reduced to a simplified electronic form in two subsequent processes:

- Exportation of the text from PDF to DOC, and then to TXT.

- Tokenisation of the text files.

In this way, the data were ready to undergo text mining. After the text mining analysis, two parameters were specified to allow a summation of the results, their interpretation and the formulation of conclusions. These parameters were:

- The number of words.

- The number of characters, including spaces (according to the current standards in publishing).

In order to make the analysis more accurate, it was decided that, where an article had two or more authors, all of them should be assigned the total number of words/characters with spaces in the relevant publication.
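A minimal sketch of this counting rule is given below, with invented example data; the author names and figures are purely illustrative and are not drawn from the study.

```python
# Hypothetical illustration of the counting rule: every co-author is credited
# with the article's full word and character counts, and averages are taken
# per author. All names and figures below are invented.
from collections import defaultdict
from statistics import mean

articles = [
    # (authors, words, characters with spaces)
    (["A. Nowak"], 3100, 20500),
    (["A. Nowak", "B. Kowalska"], 4200, 27800),
    (["C. Wisniewski"], 1800, 12100),
]

per_author = defaultdict(list)
for authors, words, chars in articles:
    for author in authors:              # full counts credited to each co-author
        per_author[author].append((words, chars))

for author, counts in sorted(per_author.items()):
    avg_words = mean(w for w, _ in counts)
    avg_chars = mean(c for _, c in counts)
    print(f"{author}: {avg_words:.0f} words, {avg_chars:.0f} cws per article")
```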

The results show that the average number of words in one article per author is approximately 3,272, which is equivalent to 21,628 characters with spaces. However, this high average is driven by the bulky articles published by the most active authors in the English-language volumes. This is evident in the asymmetrical distribution of the data, as the authors whose word counts were lower than the average amounted to 62% of all the authors (104 people). Quantitative data, grouped into word-count brackets per author, are presented in the chart below (Fig. 6.8).

Figure 6.8 Number of authors and their average word count per article Source: Own research.


The largest group of authors, 67 of them, used, on average, from 2,000 to 2,999 words in each of their articles. The second largest group, represented by 51 authors, used from 3,000 to 3,999 words in their publications. Sixteen authors wrote the smallest number of words per article (i.e. 1,000 to 1,999).

The longest articles, falling in the brackets from 6,000 to 7,999 words, were written by only 7 authors, who represent 4.14% of all the authors.

The shortest article of all the nine volumes had 1,100 words or 6,936 characters with spaces, whereas the longest one consisted of 7,591 words, i.e. 43,630 characters with spaces.

At the subsequent stage of the analysis, stem-and-leaf plots were developed for the number of articles written by individual authors. Class intervals were determined on the basis of the number of articles published in the Studies and Proceedings, in the following manner:

1 article; 2 articles; 3 to 5 articles; 6 and more articles.

The total number of words and characters with spaces in these groups is presented in the chart below (Fig. 6.9).

Figure 6.9 The total character and word count in class intervals Source: Own research.

The authors with 1 published article generated 431,806 words, or 2,842,088 characters with spaces. It is interesting to note that authors of 2 articles used more words than the authors of 3 to 5 articles. However, considering the number of characters with spaces, the correlation is the opposite. The last group consists of authors of 6 and more articles, whose output amounted to 11,585 words (751,476 characters with spaces).


A look into the results for individual authors can reveal a correlation between the number of articles published and the number of words and characters used. This relationship can be seen in the share of individual authors' articles in the entire collection of English-language volumes of the Studies and Proceedings. Table 6.2 below shows the top ten authors with the biggest share in the publications.

Table 6.2 The share of articles of the top ten authors in all volumes

Source: Own research.

Author No. 1, whose 20 articles were published in the Studies and Proceedings, contributed 10.10% of the content. The next two authors, with shares of over 6%, wrote 9 and 8 articles, respectively. Authors with 3 to 5 articles have shares exceeding 2.5%. The top ten is closed by two authors who, despite having produced more (4 articles each), contributed less than 2.5% of the total content.

6.4.3 Territorial distribution and academic centres

This chapter presents the results of the clustered classification of authors and their articles according to their affiliation with specific academic centers and their territorial distribution. The publishing house of the Polish Association for Knowledge Management sets no restrictions as to the city or country where an academic institution operates. The analysis below shows that most of the authors are located in Poland; however, foreign academic centres have a substantial share in the works.


Figure 6.10 Territorial distribution of academic centers represented in the articles Source: Own research.

The pie chart above (Fig. 6.10) shows the percentage share of cities with academic centers and institutions whose authors published their articles in the Studies and Proceedings. Foreign authors account for 10% of the articles published in the English-language volumes. They come from countries such as Ukraine (2%), Sweden (2%), Slovakia (1%), Scotland (1%), Japan (1%), France (1%) and China (1%).

As regards Polish institutions, the academic centers in Bydgoszcz (31%), Szczecin (26%) and Warsaw (20%) have the greatest percentage share in the volumes. Other Polish cities are represented in 1 to 3% of all articles.

The next stage is a summary of academic centers and institutions which shows the profiles of their research interests and scientific knowledge shared in the volumes published by the Polish Association for Knowledge Management.

(Figure 6.10 legend: Bydgoszcz 31%, Częstochowa 1%, Gdańsk 3%, Koszalin 1%, Kraków 1%, Lublin 2%, Łódź 2%, Poznań 1%, Rzeszów 1%)


Table 6.3 The academic centers represented in the Studies and Proceedings

Source: Own research.

Table 6.3 shows all academic centers and institutions represented in the English-language volumes published by the Polish Association for Knowledge Management. On the one hand, the table presents the diversity of universities, schools and institutions that have taken part in the English-language volumes; on the other, it shows the contribution of their researchers who wrote the articles.


An absolute majority of the authors were affiliated with the University of Technology and Life Sciences in Bydgoszcz (49), and their articles represent the greatest share of the total number of words in all volumes published in English, reaching nearly a third of the content (32.55%). The West Pomeranian University of Technology in Szczecin takes second place with 36 authors and a 17.08% share in the volumes. The third largest contributor is the Warsaw University of Life Sciences (SGGW) with 12 authors, who wrote 9.57% of the articles. The next two institutions, from which 8 authors published in the Studies and Proceedings, are the Szczecin University of Technology and the Warsaw School of Economics. Their shares amounted to 6.04% and 4.39%, respectively.

As a result of the above analysis, it was also established that more than 70% of the authors were engaged in scientific work at universities (Fig. 6.11), whereas 9.47% of all the authors were research staff of a technical university or academy. The smallest group of authors represented private schools (3.55%).

Figure 6.11 Main types of academic centers with which the authors of the ‘Studies and Proceedings’ publications are affiliated


6.4.4 Standards of the articles

Articles published in the Studies and Proceedings of the Polish Association for Knowledge Management are subject to strict editorial rules. The Association's website (http://www.pszw.edu.pl/) contains a template with guidelines concerning the structure of documents, the number of pages, the size of individual sections and other elements, such as annotations.

At this analytical stage, the articles were checked for compliance with the publisher’s guidelines with regards to the following:

- number of characters with spaces (cws) – 20,000 cws minimum;

- document structure – identification of the author, introduction, keywords and summary.

In the first case, i.e. the number of characters with spaces, the analysis showed that as many as 51 articles, or 32% of the total, failed to meet this requirement. The lowest score was achieved by an article with just 11,036 cws. However, this calculation is biased by the fact that the analytical software did not count characters included in drawings, graphics and diagrams, which are treated as blank spots. Furthermore, the publisher's guidelines clearly state that the characters are counted “…including drawings, tables, diagrams and other substantive graphics – where 3,000 cm2 is the equivalent of 40,000 characters, and e.g. a full-page figure corresponds to approximately 3,200 characters.” [2] Therefore, further verification of this compliance by means of RapidMiner proved unfeasible.

The second aspect concerned the structure of the documents. To perform this analysis, besides applying the processes described in the previous chapters, the words in all the articles had to be matched against the required elements: the texts were searched for the above-listed words identifying the specific compulsory elements (introduction, keywords etc.) in the documents. The results are presented in Figure 6.12.

Figure 6.12 Similarity of documents in terms of their structural compliance Source: Own research.

The results of this process proved positive, which means that all 159 articles were correctly organized and edited (in line with the publisher’s standards).
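For illustration only, a compliance check of this kind could be scripted as sketched below; the marker words, the folder name and the way characters are counted are assumptions, and the study itself used RapidMiner processes rather than this script.

```python
# Rough, hypothetical sketch of the compliance check: count characters with
# spaces and look for the compulsory structural elements in each article.
import pathlib

REQUIRED = ("introduction", "keywords", "summary")   # assumed marker words
MIN_CWS = 20_000

for path in sorted(pathlib.Path("corpus_utf8").glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    cws = len(text)                        # characters including spaces
    missing = [w for w in REQUIRED if w not in text.lower()]
    if cws < MIN_CWS or missing:
        print(f"{path.name}: {cws} cws, missing elements: {missing or 'none'}")
```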


6.4.5 Similarity of articles

The volumes of the Studies and Proceedings contain articles on a variety of subjects. In this chapter, only the similarities in the content of the articles will be discussed. In order to obtain the most reliable results possible, the analysis covered all 159 articles.

Besides the standard processes described in the previous chapters, such as tokenisation, lower-case transformation, or filtering, a new process checking similarities among the articles was applied. The Data to Similarity operator scans each document and then compares the documents in terms of the occurrence of specific words, phrases and their variations, as well as their location. The calculating mechanism is based on the notion of cosine similarity, and its task is to measure the angles between the similarity vectors obtained for individual articles. The final result is a SIMILARITY parameter whose value falls within the range of 0 to 1, where 1 means a duplicate data set (in this particular case, a duplicated article) and 0 indicates that there is no correlation whatsoever between the compared articles. The results can also be interpreted as percentages, where 100% means an absolute match and 0% its absolute absence; e.g. if the value of similarity between Article No. 2 and Article No. 4 is 0.403, the two articles are 40.3% similar.
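The idea behind the operator can be illustrated with a short scikit-learn sketch: each article becomes a word-frequency vector and the cosine of the angle between two vectors is their similarity. The TF-IDF weighting and the file layout below are assumptions, and RapidMiner's exact computation may differ.

```python
# Illustrative cosine-similarity computation with scikit-learn; this only
# approximates the Data to Similarity operator used in the study.
import pathlib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paths = sorted(pathlib.Path("corpus_utf8").glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in paths]

vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
sim = cosine_similarity(vectors)          # symmetric matrix with values in [0, 1]

# Non-redundant pairs, sorted from the most to the least similar (cf. Fig. 6.13).
pairs = [(sim[i, j], paths[i].name, paths[j].name)
         for i in range(len(paths)) for j in range(i + 1, len(paths))]
for value, a, b in sorted(pairs, reverse=True)[:10]:
    print(f"{a} vs {b}: {value:.3f}")
```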

This process was applied to all the articles, joined in non-redundant pairs, so that all possible configurations of comparison could be considered. The results were then sorted by similarity, from the highest value (the greatest similarity) to the lowest. The table below (Fig. 6.13) shows the first 57 pairs with the highest values of the similarity parameter.

Figure 6.13 Similarity of articles (all volumes) Source: Own research.

The analysis yielded some interesting results. Of all 159 publications, the articles identified as Article No. 6 and Article No. 24 were found to have a similarity value of 1, which may indicate that their contents are identical. The articles were published in Volume No. 6 (Article No. 6) and Volume No. 9 (Article No. 8) under the same title of “Computer networks and multimedia techniques in the development of virtual organizations and e-commerce”. The two articles show some minor differences in editing and syntax, which – thanks to the good standardization of the data – had no effect on the correctness of the obtained result. This points to an editorial error in the publishing procedures of the editors of the Studies and Proceedings of the Polish Association for Knowledge Management.

Other articles whose similarities turned out to be particularly high (80.6%) were No. 49 and No. 93. The former is entitled “Multi-version model for enterprise's data warehouse” and the latter “Method of conforming a data warehouse to the enterprise's variable information needs”; the titles themselves suggest some resemblance of content, as both concern the idea of a data warehouse.

Articles No. 48 and 82, entitled “Application of the traceability concept into food supply chains and networks design. Results of international benchmark study” and “Information technology in formulation of transparency strategies for food chain and supply management in Poland”, respectively, are 70% similar in their content. In this case, the correlation is already evident in the titles, which indicate the common subject, i.e. food supply chains.

Figure 6.14 A histogram presenting the values of similarity and its frequency of occurrence Source: Own research.

The histogram in Fig. 6.14 shows the general distribution of the number of occurrences of individual similarity values among the analyzed articles. More than 1,600 pairs revealed no common content. In most cases (nearly 4,500), similarities covered only 2% to 3% of the text. The values then decrease: only 80 compared pairs of articles reached a similarity level of 20%, and only one pair was found to be practically identical in terms of content, with a similarity of 100%.

This process is very useful for checking whether comparable articles are similar or not. Articles which are very similar to an already-published work can thus be eliminated before they are released, which improves the quality of the publication.


6.4.6 Links between articles

With the results concerning similarities among the articles in hand, it is possible to find linear connections between them. For this purpose, the results are presented using RapidMiner's graphic functionality.

Figure 6.15 Linear connections among articles Source: Own research.

Although most of the analyzed articles are not strongly interconnected (Fig. 6.15), some of them have strong links with single other articles, which is not enough to form strong relationships across the whole set of publications. However, the mapping shows a cluster of articles whose mutual relationships are quite strong.

Figure 6.16 Linear connections among articles (scaled up) Source: Own research.


The map in Figure 6.16 shows a scaled-up cluster of mutually related articles, where each article number is linked with the other articles in the series. These relationships should not be mistaken for similarity, because here the articles are connected with all the other articles in the group.

6.4.7 Clusterisation of articles

The process of clusterisation of articles is a sophisticated operation based on the association of documents into specific groups (clusters). The number of clusters is user-defined. The association of articles makes use of their similarities determined by word vectors. In this particular case, 15 clusters were selected.
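As an illustration of the idea, a comparable grouping can be obtained with k-means on TF-IDF word vectors, as sketched below; the study used RapidMiner's clustering operators, so the algorithm choice, its parameters and the file layout here are assumptions.

```python
# Illustrative k-means clustering of article word vectors into 15 groups;
# an approximation of the RapidMiner clustering used in the study.
import pathlib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

paths = sorted(pathlib.Path("corpus_utf8").glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in paths]

vectors = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(texts)
labels = KMeans(n_clusters=15, n_init=10, random_state=0).fit_predict(vectors)

clusters = {}
for path, label in zip(paths, labels):
    clusters.setdefault(label, []).append(path.name)
for label in sorted(clusters):
    print(f"Cluster {label}: {len(clusters[label])} articles")
```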

Figure 6.17 Clustering Source: Own research.

Figure 6.17 shows the outcome of document clustering. Each cluster represents a group of similar articles. The location of the clusters is not coincidental either; for example, the second cluster (the first in line) shows the least similarity to the sixth cluster (the final one). The numbers in each cluster represent the corresponding article titles, and descriptions of the individual clusters are presented below.


CLUSTER NO. 2

Tools, methods and trends of communication development, logistics and diagnostic applications in small and medium enterprises. The five articles in this cluster are linked mainly by a dominant methodological relation, in addition to the similarity of their substantive content.

CLUSTER NO. 3

Artificial intelligence systems in the operation of machinery and methods of financial engineering. The third and seventh clusters contain the smallest numbers of articles, and their content and methodology differ from the other groups.

CLUSTER NO. 4

This cluster contains articles in the field of Internet technologies, the use of decision support systems and artificial intelligence methods in the operation of machines, models, diagnostics and water supply flow calculations; diagnostic systems, hydraulics and the technical condition of machines in the process of their exploitation; monitoring of mechanical damage in diagnosis, critical conditions and the renewal of worn components, machinery and equipment; and modeling the costs of sustainable management and the use of economic instruments for environmental protection in EU policy. The cluster comprises 19 articles, but the substantive relation of decision support systems, artificial intelligence methods and models for diagnosing network flows in the operation of machines is concentrated in 14 of them. The remaining five articles concern different areas of application strongly related through the applied methodological solutions.

CLUSTER NO. 5

Cluster 5 groups the analysis of the prices of agricultural products with the use of XML in remote applications, the management of GPW (Polish stock exchange) investments with the use of neural networks, the construction of trading strategies and policy services, support strategies for stockbrokers, diagnostic systems interference, and obtaining information for the location of wind farms. It is a universal, multi-thematic cluster consisting of 11 articles with different application areas but a common informative and utilitarian approach.

CLUSTER NO. 0

Cluster 0 covers the use of spatial databases for sustainable development and the use of information technology in public administration: the computerization of rural areas, the impact of social networks on social activity, the shaping of local innovation policy through e-government, and local web bases in different countries, together with an assessment of the efficiency of modeling of the Yangtze River Delta in China. This cluster consists of 8 articles and is mostly concerned with regional policy issues and applications of information technology within public administration.

CLUSTER NO. 1

An automated computational framework for efficient knowledge utilization in the data mining process and an approach to efficient parallel computing organization; application analysis of PRINCE2 and Oracle AIM as tools stabilizing the process of integrated ERP II system implementation, with a study of distributed e-commerce system effectiveness; Waternet modeling and model calibration for waterworks management, and transposition rearrangement in a weighted length cost model; important indications for builders of newly-manufactured machines using FEM; and an XML document change detection algorithm based on a linear programming method. This cluster groups 8 articles, mainly focused on different aspects of modeling the operation of machinery and the methodology of implementing information technology in enterprises.

CLUSTER NO. 14

Vehicle repair cost calculation, market value estimation and exploitation analysis systems; chosen aspects of the efficiency and investigation of the fitness of engines; vehicle nodal routing with traffic classification; diagnostic systems for trains; and a vehicle parameter data recorder. This cluster contains 7 articles on different substantive problems of use and on diagnostic models of machine fitness.

CLUSTER NO. 13

Monitoring of agri-food producers as an example of market information, an international comparison of management morality in the economic context, and the consequences of electronic public procurement; information technology in the formulation of transparency strategies for food chain and supply management in Poland and the improvement of the efficiency of its practical usage (on the example of the Zachodnio-Pomorskie province).

Analysis of information needs supporting ecological agriculture development and the ecological food market; an operational-decisional model of the food economy with logistic management; innovations as a factor of competitive advantage in catering (a case study); application of the traceability concept to food supply chain and network design, with the results of an international benchmark study; an e-learning platform model strengthening manufacturing and distribution processes in agri-food SMEs in the region; progress assessment of the adaptation processes of agribusiness companies as a derivative of their network development; knowledge support and a synthesis of the enrollment and evaluation of agri-food network activities in the SME sector in Poland; and methods of managing the branch organization of agricultural manufacturers, shown on the example of the Polish federation of cattle breeders and dairy farmers and the employers-sharecroppers and land owners' association.

Cluster 13 groups 14 articles focused on different aspects of computer-aided logistics management and agribusiness companies. Important aspects relate to the analysis of the sources of competitive advantage of agribusiness companies and regions, which reflects the problems surveyed within the Framework Programme 6 'Towards' project, whose results were published in the volumes of the Studies and Proceedings of PSZW.


CLUSTER NO. 12

The possibilities of adopting an automatic analysis of the competitiveness of Polish and German enterprises; a business plan application in agribusiness using a data warehouse application; the multi-version model for an enterprise's data warehouse and the method of conforming a data warehouse to the enterprise's variable information needs; and the organizational assumptions of the functioning of RFID radio identification technology.

This cluster brings together 5 articles on diverse subjects. Their similarity results from the analysis methods applied in their content.

CLUSTER NO. 9

Business process modeling using simulation software in re-engineering; modeling processes in organizations to meet the challenges of their strategies; modeling the computerization strategy of an organization; multi-criteria methods in the evaluation of the quality of management information systems; computer networks and multimedia techniques in the development of virtual organizations and e-commerce; and customization of the RUP methodology for service-oriented programming.

Evaluation of a decision support system for the ICT area in selected problems of enterprise structures in public administration; knowledge management in market intelligence and the importance of know-how in sales and marketing information systems; an overview of business performance solutions; application of the AHP method in selecting software components and chosen frameworks supporting the selection and evaluation of software products; and data mining and knowledge discovery for the implementation of knowledge management.

Intangible resources developed nowadays within the enterprise; computerized maintenance management ERP systems with BI modules as a new perspective in knowledge management; an expert system for building technological pathways for the initial processing of furniture elements; and an information technology application in the fisheries management system.

Cluster 9 groups 20 articles and is one of the 3 clusters with the largest numbers of articles; together, these 3 clusters contain 57 articles, which makes up 35% of the total. The publications grouped in this cluster cover a wide range of knowledge management issues in different types of organizations.

CLUSTER NO. 8

The meta-model approach to human resource management; a DSS-class system for the efficient choice of savings and checking accounts with the use of discrimination analysis; business process analysis based on fuzzy set theory; dealing with textual documents in corporate applications through text categorization using fuzzy linguistic summaries; linguistic database summaries using fuzzy logic as a step towards a human-consistent data mining tool; determinants of access to the Internet in households analyzed with a Probit model; the application of fuzzy spread regression to the analysis of biotechnological process efficiency; fuzzy sets as one of the elements of artificial intelligence in machine condition testing; using fuzzy logic in the estimation of a Polish enterprise's condition; fuzzy expert systems supporting statistical process control in a manufacturing automotive company; and fuzzy parameters for interactive advertising control systems as an algorithm for improving the quality of a fuzzy rule base. This cluster contains 12 articles on different areas of the application of fuzzy sets and on human resource management issues.

CLUSTER NO. 11

The structure of the required staff potential and knowledge resources for the comprehensive implementation of the relevant e-business technologies; net knowledge in the management of processes in Polish agriculture; an Internet dictionary of agricultural terms as a practical example of extreme programming; knowledge management according to the maxime methodology; knowledge management through innovations in Poland; and the application of Internet techniques in the promotion of agro-tourist farms in Bory Tucholskie National Park.

Ontologies as a technique of knowledge management in open e-learning systems for operators of dynamic processes; goals and methods for searching and gathering information for international marketing research; marketing communication in Web 2.0 applications; utilization of the Internet in the promotion of small and medium enterprises; knowledge creation and the car in British culture; the efficient use of multimedia methods in training; and the relationship between the features of traditional and multimedia training as the basis of the conducted research.

Staff education and research problems for building a knowledge-based economy; alternative forms of European Social Fund resource allocation for financing lifelong learning; application of the Electre Tri multi-criteria decision-making method for the classification of voivodeships; data analysis methods in social networking platforms; the Polish labor market during the recession; the impact of happiness capital on economic growth; problems of standardization in virtual organizations; the perception of global collaborative knowledge systems in Polish SMEs; and the analysis of web questionnaire use in mobile telephony market research in Poland.

The impact of an emission-limiting policy on technological conversion and macroeconomic structure; the identification of simple monetary policy rules with the use of heuristic methods of data analysis; and using the rollout methodology during ERP system implementations in foreign countries, within an integrated access control system.

Cluster 11 groups the largest number of publications (25), covering the most substantive scope of surveys, to a large extent because of the e-business issues, with specific research applications of Internet technologies in agro-tourism as well as standards of knowledge management in various sections of industry.

CLUSTER NO. 7

Determining the efficiency of the drive motion arrangement, together with traffic accident reconstruction.

This cluster contains 2 articles that concern specific issues of the research and modeling of traffic and accident reconstruction.

CLUSTER NO. 10

The coherence function in the evaluation of a combustion engine's technical condition; the vibration measure as information on a machine's technical condition; the impact of the cubic non-linearity of spring force on the vibrations of the Duffing oscillator; and environmental protection against communication noise and vibrations.

Cluster 10 includes 7 articles that are focused on the analysis of various research techniques and the modeling of machine vibrations. It is a collection of articles with a high level of homogeneity of substantive research issues.

CLUSTER NO. 6

Knowledge management in municipalities and companies on the Polish-German border; selected problems of knowledge visualization; the information risk of the enterprise and the knowledge base as an important condition of its management; organizational learning capability as the innovation trigger in the new economy; and trust and knowledge sharing as a critical combination.

Building a knowledge-sharing organization within information systems; knowledge management in businesses by means of an e-business strategy; empirical studies on the influence of information technology on knowledge sharing; data mining in knowledge management systems in the age of the new economy; useless knowledge elimination methods in GCKS; gaining competitive advantage through knowledge management in small enterprises; ontology as a descriptive method for the classification of knowledge management systems; and a process based on knowledge management in projects – a tutorial outline.

Cluster 6 comprises 16 articles that are focused on different aspects of the methodology and the utilitarian use of knowledge in the management of organizations. Especially important are the graphic presentation of knowledge and its role in building a competitive advantage.

The analysis of the clustering process was an extremely interesting opportunity to assess the results of grouping items into homogeneous classes, implemented by identifying similarities in the content of the works. Identification of the similarity of the articles in terms of their content is often made possible by the uniformity of the substantive research issues. An example of this is Cluster No. 13, which includes the analysis of the sources of competitive advantage of agribusiness firms and regions, reflecting the problems of research carried out within the Framework Programme 6 'Towards' project. A similar dependence occurred in the analysis of Cluster No. 4, grouping 19 articles with substantive links concerning the application of decision support systems, artificial intelligence methods and models of network flow diagnostics in machine operation; 14 of these articles report on a research project conducted at the OPIE Department of Mechanical Engineering, UTP in Bydgoszcz.

Other clusters combine similar content through the convergence resulting from the homogeneity of their methodological ties, as in Cluster No. 6, whose 16 articles focus on different aspects of methodical and utilitarian knowledge applications in business management. A similar phenomenon also occurs in the case of Cluster No. 8, which groups 12 articles on different areas of the application of fuzzy sets and on problems of human resource management.


6.4.8 Word statistics

Full word statistics can be produced once all the described processes have been executed: tokenisation, filtering out stop words, lower-case transformation, generation of n-grams (3 segments), filtering out insignificant words (leaving those with a minimum of 3 characters) and, finally, filtering the text on the basis of a custom dictionary (the removal of words that are typically found in scientific articles, such as introduction, keywords, or summary).

Table 6.4 Word statistics of all articles from all volumes


The results presented above concern the 42 most common words in the entire set of 159 articles (Table 6.4). The words are in their original form, as used in the documents; they did not undergo the stemming process. A corresponding list of words after that process is shown in Table 6.5.

For each word, besides the total number of its occurrences, the number of documents in which it occurs is shown: for example, 104 of the 159 articles. The subsequent columns to the right show the total number of occurrences of a word in the specific volumes of the Studies and Proceedings.

The most common word in all the English-language volumes published by the Polish Association for Knowledge Management was knowledge: 2,173 occurrences in 104 documents. It was least represented in Volumes No. 24, 34 and 35, whereas in Volume No. 40 it was found 569 times.

The next three places are taken by words which occurred with a similar frequency, namely about 1,850 times each: data, information and system.

The last of the highly-represented words was management, which occurred 1,336 times in all articles and appeared in 109 of 159 documents.

As far as the number of document occurrences is concerned, the leader of this ranking is the word time, which occurred in 142 of all the documents and was identified 925 times in total. The second most popular word was use (as both verb and noun), with 141 articles and 976 occurrences in the entire data set, and third place was shared by two words, data and process, both occurring in 139 documents.

When analyzing the frequency of phrases (two or three words in a sequence, which form a commonly comprehensible phrase) in all the publications, it was observed that the most common pair of words was knowledge management. This phrase occurred 333 times in 36 documents. The next most common phrase was knowledge sharing with 129 occurrences in 12 articles, and the third most frequent phrase, decision making, was found 117 times in 48 documents.

The subsequent step, meant to provide even more detailed characteristics of the knowledge contained in the PSZW publications, was carried out using the stemming process. This basically involved replacing the various forms of words occurring in the texts with their basic form (stem) and then summing the counts up again.
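The two quantities reported in Tables 6.4 and 6.5 (the total number of occurrences of a term and the number of documents containing it) can be approximated with the short sketch below. It is a Python stand-in for the RapidMiner word statistics, and the folder layout is an assumption.

```python
# Illustrative computation of total term frequency and document frequency for
# stemmed words, approximating Tables 6.4/6.5; not the RapidMiner process itself.
import pathlib
import re
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
total_freq, doc_freq = Counter(), Counter()

for path in pathlib.Path("corpus_utf8").glob("*.txt"):
    words = re.findall(r"[A-Za-z]{3,}", path.read_text(encoding="utf-8").lower())
    stems = [stemmer.stem(w) for w in words]
    total_freq.update(stems)
    doc_freq.update(set(stems))           # each stem counted once per document

for stem, count in total_freq.most_common(20):
    print(f"{stem}: {count} occurrences in {doc_freq[stem]} documents")
```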


Table 6.5 Word statistics of all articles after STEMMING

Source: Own research.

Table 6.5 above shows the 21 most common terms that occurred in all the analyzed articles. The process indicates that words from the system family were the most frequent, occurring 2,916 times in 146 publications. The second most common word stem, knowledge, occurred nearly 800 times less frequently, while the various word forms of inform and process occurred with a similar frequency, more than 1,900 times each.

6.4.9 Knowledge mapping

The knowledge contained in the publications of the Polish Association for Knowledge Management is dispersed across many places and formats: texts, databases, ideas, schemas and diagrams. Both explicit knowledge (documents and databases) and tacit knowledge (implied, undiscovered ideas present in data and thoughts) are acquired through text mining and data mining processes. Knowledge maps represent not only knowledge but also streams (transfers) of knowledge, and they provide possibilities to develop knowledge warehouses (on the semantic level).

The purposes of the development of knowledge maps are the following:

- To improve the perception of knowledge, and – more importantly – of its sources, thus enhancing and accelerating processes of finding required skills and experiences;

- To improve the possibility of the evaluation and construction of an intellectual capital of the publishing house;

- To indicate further steps to develop knowledge and skills within a given discipline.

This chapter presents knowledge maps developed on the basis of all the articles in the volumes of the Studies and Proceedings, built around the words that are crucial in terms of occurrence. For this purpose, a special process of word vectoring was used, based on associations from the contexts in which the words were found. The significance level for the words was set at 95%, which means that only the most significant links and relationships were presented. All results were modified and reduced to a minimum due to the difficulty of representing them graphically in legible form.

Figure 6.18 A general knowledge map for all of the articles Source: Own research.

Figure 6.18 shows a general knowledge map for all of the publications. The map represents connections between words, while the distances represent the strength of the relationships and the words' mutual roles. Reading the map is a way to discover new knowledge. In this case, one can start with the links coming from the word analysis: its nearest neighbours are research and number, which may mean, e.g., the number of studies in an analysis. The subsequent connections join the words results and group (a group of results), which make up data that a user can process with a relevant method in order to obtain, e.g., the necessary information.
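One simple way to approximate such a map outside RapidMiner is to treat words that co-occur in the same sentence as linked and to keep only the strongest links around a chosen keyword, as in the sketch below. This is only an analogy for the word-vectoring process used in the study; the sentence-level window and the frequency ranking are assumptions.

```python
# Hypothetical co-occurrence sketch of a knowledge map: words appearing in the
# same sentence are linked, and the strongest links around a keyword are listed.
import pathlib
import re
from collections import Counter
from itertools import combinations

co_occurrence = Counter()
for path in pathlib.Path("corpus_utf8").glob("*.txt"):
    for sentence in re.split(r"[.!?]", path.read_text(encoding="utf-8")):
        words = sorted(set(re.findall(r"[a-z]{4,}", sentence.lower())))
        co_occurrence.update(combinations(words, 2))

keyword = "information"                   # cf. the map in Fig. 6.19
edges = [(pair, n) for pair, n in co_occurrence.items() if keyword in pair]
for (a, b), n in sorted(edges, key=lambda e: e[1], reverse=True)[:10]:
    print(f"{a} -- {b}: {n} co-occurrences")
```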

Another knowledge map was designed using the keyword information (acquisition of information) (Fig. 6.19).

Figure 6.19 A knowledge map based on the keyword ‘information’ Source: Own research.


In this case, a significant role of the word user in the knowledge acquisition process can be seen as well. The algorithm also shows the likelihood of using software at a specific time to obtain a specific structure and, through the structure, the required information.

It is possible to limit the number of connections shown in a map to just a few (Fig. 6.20), which largely facilitates analysis and yields a very clear knowledge map.

Figure 6.20 A small knowledge map, focused on the keyword ‘results’ Source: Own research.

The map shows relationships which result from the use of a specific keyword (in this case the keyword used is results).

This chapter has presented a wide scope of interpretation of the calculation results obtained with the RapidMiner software, applied to a comparative analysis of the contents of the scientific papers published in the 9 English-language volumes of the Studies and Proceedings of the PAfKM. The experience gained can help in research that applies text mining tools to analyze the activity of scientific publishers, or in knowledge synthesis based on the analysis of technical, structured databases, allowing one to generate a synthetic characterization of their contents.

6.5 Internet sources

[1] Statute of the Polish Association for Knowledge Management (in Polish) (http://www.pszw.edu.pl/index.php/stowarzyszenie/cele-i-statut) [accessed as of 25.08.2011]

[2] PSZW guidelines for article writers (in Polish)
