XML documents change detection algorithm based on linear programming method


MYKOLA ALIEKSIEIEV ET AL.

National Technical University of Ukraine „Kyiv Polytechnic Institute”

Summary

This paper presents a new algorithm for detecting changes in XML documents. It uses the current and the previous version of a monitored XML document to produce their difference. The technique can be used effectively to discover changes between the text parts of the initial and new documents.

The approach presented in this paper differs from previously proposed ones. The main idea of the proposed algorithm is to pay attention only to quantitative changes in the tracked documents, instead of searching for the exact sequence of changes that produces the new document. The proposed technique represents the document as a tree and considers only those parts of XML documents that are meaningful for end users.

Keywords: XML, change detection, publish/subscribe system

1. Introduction

To detect changes effectively in a certain part of a web page stored as an old XML document, it is first necessary to find the corresponding part in the new document. Since changes may be made in any part of the new document version, the location of the text the subscriber is interested in may change. Thus, the part of the new document that matches the text in the old version well must be found first.

In previous works [1, 3, 4] this problem was solved using the minimum cost edit distance. This approach is not optimal for page monitoring, because the result of the algorithm requires additional processing to locate the changed parts. It also presupposes that a “good matching” between the nodes of the considered documents is available, and searching for such a “good matching” is a difficult, resource-intensive problem. Another method of finding matching parts in the old and new documents is to compute a similarity measure between parts of HTML documents. In [5] a set of parameters for determining such a similarity measure was proposed. Based on the approach proposed in [5], matching parameters between parts of XML documents are used in this work. These parameters are then used to solve the “good matching” search problem.


2. Preliminaries

In accordance with the formulation of the change detection problem between XML parts, the following conditions were adopted:

The XML document is considered as an ordered tree, in which nodes are ordered from left to right. Matching is searched only for the leaves of the XML document, whose values contain the text information most important for the subscriber.

When XML tags are compared, attributes are considered matched only if both their names and their values match.

During the matching search, the order of the child nodes of every parent node in the XML tree is taken into account, as well as the position of the nodes in the XML tree hierarchy, based on the XML tag indexing introduced in this work.

Because the XML document is considered as an ordered tree in which the left-to-right order of nodes is taken into account, a change in the order of child nodes within a parent node is important for the subscriber and should be detected by the developed publish/subscribe system. To support this, XML document indexing based on numerical values was implemented. This indexing makes it possible to take the left-to-right order of nodes into account and to track the position of a node in the XML tree hierarchy.

3. Applying matching parameters to XML document nodes

XML document indexing means that an attribute ‘index’ is added to each tag in the analyzed document. The index value is chosen according to the following rules:

1. index=1 for the root tag.

2. index=1.i for a child node of the root tag, where i is the number of the child node in left-to-right order.

3. index=1.i.j, accordingly, for the j-th child node of the i-th parent node.

Thus, under the proposed rules, the XML document is correctly transformed into an XML tree that preserves the XML hierarchy and the order of child nodes in the initial XML document.
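The indexing rules above can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` parser; the function name `add_index` and the sample document are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
import xml.etree.ElementTree as ET


def add_index(element, index="1"):
    """Recursively set the 'index' attribute on an element and its descendants."""
    element.set("index", index)
    # Children are numbered from 1 in left-to-right (document) order.
    for position, child in enumerate(element, start=1):
        add_index(child, f"{index}.{position}")


# Hypothetical sample document, used only for illustration.
SAMPLE = "<page><title>News</title><body><p>First</p><p>Second</p></body></page>"

root = ET.fromstring(SAMPLE)
add_index(root)

# Map every assigned index to its tag name to inspect the result.
indexes = {el.get("index"): el.tag for el in root.iter()}
print(indexes)
```

In this sketch the root receives index 1, its children 1.1 and 1.2, and the two paragraphs inside the second child 1.2.1 and 1.2.2, so both the depth and the left-to-right order can be read from the index.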

Let a version of the XML document be represented by tree T₁. Then tree T₁ is characterized by the following parameters.

N – the number of nodes in tree T₁, which corresponds to the number of tags in the XML document;

$R = \{r_i \mid i = 1 \ldots m\}$ – the set of parent nodes in tree $T_1$, where $m$ is the number of parent nodes and $r_i$ is the parent node of node $i$.

$A = \{a_i \mid i = 1 \ldots N\}$ – the set of node attributes in tree $T_1$, where $a_i$ is the attribute of node $i$.

$con(x_i)$ – the content of node $i$, where $x_i$ is node $i$ of tree $T_1$.

Thus the document tree is an ordered tree whose elements are characterized by their positions and their associated sets of attributes. The parts of the text that are displayed on the web page are the leaves of the document tree.

Let T(e_n) be the subtree of tree T rooted at node e_n, for a given node e_n of the document tree T. Let us introduce the content matching parameter of nodes x₁ and x₂ as follows:


$$P_{con}(x_1, x_2) = \frac{|con(x_1) \cap con(x_2)|}{|con(x_1) \cup con(x_2)|}$$

The parameter P_con(x₁,x₂) gives the fraction of words that appear in both nodes x₁ and x₂ among all words that appear in at least one of them. The attribute matching parameter between nodes x₁ and x₂ can be obtained as follows:

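$$P_{att}(x_1, x_2) = \frac{|\{a_i : a_i \in a(r_1) \cap a(r_2)\}|}{|\{a_i : a_i \in a(r_1) \cup a(r_2)\}|}$$

Here $a(r_1)$ and $a(r_2)$ are read as the sets of attributes of the compared nodes $x_1$ and $x_2$.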

The parameter P_att(x₁,x₂) measures the relative weight of the attributes that have the same value in x₁ and x₂. In XML the same attribute may take different values in different XML documents, since the syntax of the language does not define default attribute values. The attributes used in a particular document have unique values. Therefore the weight functions proposed in [5] cannot be applied to calculate the attribute matching parameter in an XML tree. Thus, all attributes are treated as equivalent in the proposed formula, and only identical attributes of the two nodes are taken into consideration.

Since old and new versions of the same XML document are compared, we assume that the attribute names match in both documents. Accordingly, identical attributes are those that have identical names and values.
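The content and attribute matching parameters can be sketched as two set ratios. The sketch rests on several assumptions that are not stated in the text: words are obtained by whitespace splitting and compared case-sensitively, attributes are passed as a name-to-value dictionary (with the auxiliary ‘index’ attribute left out by the caller), and two empty sets are treated as a full match. The names `p_con` and `p_att` are hypothetical.

```python
def p_con(text1, text2):
    """Content matching: shared words divided by all distinct words."""
    words1, words2 = set(text1.split()), set(text2.split())
    union = words1 | words2
    # Assumption: two empty contents are treated as a perfect match.
    return len(words1 & words2) / len(union) if union else 1.0


def p_att(attrs1, attrs2):
    """Attribute matching: identical (name, value) pairs over all pairs."""
    pairs1, pairs2 = set(attrs1.items()), set(attrs2.items())
    union = pairs1 | pairs2
    # Assumption: two nodes without attributes are treated as a perfect match.
    return len(pairs1 & pairs2) / len(union) if union else 1.0


# Three of five distinct words are shared.
con_score = p_con("breaking news from Kyiv", "latest news from Kyiv")

# Only the pair ("class", "lead") is identical; "lang" differs in value,
# so it contributes two different pairs to the union.
att_score = p_att({"class": "lead", "lang": "en"}, {"class": "lead", "lang": "uk"})
print(con_score, att_score)
```

Comparing (name, value) pairs rather than names alone reflects the condition that attributes match only if both their names and their values match.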

The position matching parameter can be obtained from the following expression:

$$P_{dist}(x_1, x_2) = \frac{suf(index(x_1), index(x_2))}{\max(index(x_1), index(x_2))}$$

In this expression the function suf gives the length of the common suffix of the index attributes index(x₁) and index(x₂) of nodes x₁ and x₂, which define the positions of the nodes in the XML tree hierarchy. The function max gives the greater of the lengths of index(x₁) and index(x₂).
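A minimal sketch of the position matching parameter follows. It takes the description literally (common suffix of the two index values) and assumes that length is counted in dot-separated index components rather than in characters; this unit and the name `p_dist` are assumptions.

```python
def p_dist(index1, index2):
    """Position matching: common-suffix length over the longer index length."""
    parts1, parts2 = index1.split("."), index2.split(".")
    common = 0
    # Walk both indexes from their last component backwards.
    for a, b in zip(reversed(parts1), reversed(parts2)):
        if a != b:
            break
        common += 1
    return common / max(len(parts1), len(parts2))


same = p_dist("1.2.1", "1.2.1")    # identical positions
moved = p_dist("1.2.1", "1.3.1")   # same place among siblings, other parent
print(same, moved)
```

With this reading, two nodes at the same position score 1, and the score drops as soon as the trailing index components differ.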

It is necessary to obtain an expression for the integral matching criterion that combines the content, attribute and position matching parameters of two nodes. These parameters should be weighted differently in the expression for the integral matching criterion, using weight factors, because some parameters may be considered more relevant than others.

Let α, β, γ be the weight factors for P_con(x₁,x₂), P_att(x₁,x₂) and P_dist(x₁,x₂) respectively. Then $\alpha + \beta + \gamma = 1$, and the integral matching criterion can be obtained as follows:

$$CS(x_1, x_2) = -1 + 2 \cdot (\alpha \cdot P_{con}(x_1, x_2) + \beta \cdot P_{att}(x_1, x_2) + \gamma \cdot P_{dist}(x_1, x_2))$$
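As a worked illustration of the integral matching criterion, the sketch below combines three parameter values with weight factors. The weights 0.5, 0.25, 0.25 and the parameter values are hypothetical; the function name `integral_criterion` is an assumption.

```python
def integral_criterion(p_con, p_att, p_dist, alpha, beta, gamma):
    """CS = -1 + 2 * (alpha * P_con + beta * P_att + gamma * P_dist)."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weight factors must sum to 1")
    return -1 + 2 * (alpha * p_con + beta * p_att + gamma * p_dist)


# Hypothetical values: P_con = 0.6, P_att = 0.5, P_dist = 1.0.
cs = integral_criterion(0.6, 0.5, 1.0, alpha=0.5, beta=0.25, gamma=0.25)
print(round(cs, 6))  # -1 + 2 * (0.3 + 0.125 + 0.25) = 0.35
```

Because each parameter lies between 0 and 1 and the weights sum to 1, the criterion ranges from -1 (no similarity in any parameter) to 1 (full match in all three).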

4. Matrix of integral matching criteria

From the end user’s point of view, the text parts of web pages are of interest. If XML is treated as a tree, tags with text content are represented by leaves and carry the most important information for the user. Thus, the good matching search problem between XML document versions becomes a good matching search problem between the text parts of these versions.

Previous works [2, 5] consider the good matching problem for simplified XML document models that are not used in Internet publishing. Real web pages include extra parts such as JavaScript code, HTML formatting, <meta> tags, etc. Web pages can also be malformed, for example because of missing closing tags. All these features of real web pages reduce the quality and speed of the good matching search, and they are unimportant for the end user.

Thus, a web page must be purged of the mentioned features before the good matching search between XML document parts.

The text parts of the old and new versions are selected during the preparation of the XML documents. These parts are then compared with each other using the integral matching criterion. Once the integral matching criteria have been calculated, the matrix of integral matching criteria between the nodes of the old and new versions of the XML document can be created.
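The preparation step described above can be sketched as follows, assuming the page is already well-formed XML (a malformed page would first need to be repaired, which this sketch does not attempt). The set of ignored tag names and the function names `purge` and `text_leaves` are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: tags treated as irrelevant to the end user.
IGNORED_TAGS = {"script", "style", "meta"}


def purge(element):
    """Remove ignored elements from the tree in place."""
    for child in list(element):  # copy the child list, since we mutate it
        if child.tag.lower() in IGNORED_TAGS:
            element.remove(child)
        else:
            purge(child)


def text_leaves(root):
    """Return the leaves that carry non-empty text."""
    return [el for el in root.iter() if len(el) == 0 and (el.text or "").strip()]


# Hypothetical page, used only for illustration.
PAGE = (
    "<html><head><meta name='x'/><script>var a = 1;</script></head>"
    "<body><h1>Title</h1><div><p>Some text</p></div></body></html>"
)

root = ET.fromstring(PAGE)
purge(root)
leaves = [(el.tag, el.text) for el in text_leaves(root)]
print(leaves)
```

Only the heading and the paragraph remain as text leaves; the emptied `head` element has no text and is therefore not selected for comparison.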

Let us consider a simplified case of creating the matrix of integral matching criteria.

Let x₁, x₂, x₃, x₄ be the text parts of the old XML document version (tree T₁), and y₁, y₂ the text parts of the new XML document version (tree T₂). Accordingly, we can assume that two text parts of the old document version were deleted and the two other text parts were changed. It is necessary to find which parts were deleted, which were changed, and what the changes were. For this, a matching between nodes x₁, x₂, x₃, x₄ and nodes y₁, y₂ must be found.

First of all, we need to find the integral matching criterion for each pair of nodes. For nodes x₁ and y₁:

$$CS(x_1, y_1) = -1 + 2 \cdot (\alpha \cdot P_{con}(x_1, y_1) + \beta \cdot P_{att}(x_1, y_1) + \gamma \cdot P_{dist}(x_1, y_1))$$

Similarly, we can obtain the integral matching criteria for all pairs of nodes. In this case the matrix of integral matching criteria is the following (Table 1).

Table 1. Matrix of integral matching criteria

      x₁          x₂          x₃          x₄
y₁    CS(x₁,y₁)   CS(x₂,y₁)   CS(x₃,y₁)   CS(x₄,y₁)
y₂    CS(x₁,y₂)   CS(x₂,y₂)   CS(x₃,y₂)   CS(x₄,y₂)

5. Mathematical model of the good matching search for XML document versions

When comparing the old and new document versions, it was agreed that one node in the old version can match at most one node in the new version and vice versa. Thus the good matching search problem turns into an optimal path search problem, namely a transportation problem, which is a linear programming task. The optimal matching is the solution for which the sum of the integral matching criteria is maximal.

To formalize this task, a connectedness matrix between the nodes of tree T₁ and tree T₂ is introduced. Trees T₁ and T₂ correspond to the old and new document versions respectively.


Table 2. The connectedness matrix of new and old XML document versions

      x₁     x₂     x₃     x₄
y₁    a₁₁    a₁₂    a₁₃    a₁₄
y₂    a₂₁    a₂₂    a₂₃    a₂₄

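Let a_ij be the connectivity of nodes y_i and x_j, so that the first subscript refers to the node of the new version and the second to the node of the old version, as in Table 2.

If x_j corresponds to y_i, then a_ij = 1. If x_j does not correspond to y_i, then a_ij = 0. Let us state the conditions imposed by the good matching search problem.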

Since exactly one of the nodes x_i, i = 1..4, should correspond to node y₁, this can be stated as follows:

$$a_{11} + a_{12} + a_{13} + a_{14} = 1$$

Similarly, for node y₂:

$$a_{21} + a_{22} + a_{23} + a_{24} = 1$$

And since at most one y_i, i = 1, 2, corresponds to every x_i, i = 1..4, the following conditions hold for x₁, x₂, x₃, x₄:

$$\begin{aligned}
a_{11} + a_{21} &\le 1 \\
a_{12} + a_{22} &\le 1 \\
a_{13} + a_{23} &\le 1 \\
a_{14} + a_{24} &\le 1
\end{aligned}$$

The efficiency (objective) function of this task, which takes the matrix of integral matching criteria into account, is the following:

$$\begin{aligned}
& CS(x_1,y_1) \cdot a_{11} + CS(x_2,y_1) \cdot a_{12} + CS(x_3,y_1) \cdot a_{13} + CS(x_4,y_1) \cdot a_{14} \\
& + CS(x_1,y_2) \cdot a_{21} + CS(x_2,y_2) \cdot a_{22} + CS(x_3,y_2) \cdot a_{23} + CS(x_4,y_2) \cdot a_{24} \to \max
\end{aligned}$$

Thus the linear programming task can be formalized as follows:

$$\begin{aligned}
& CS(x_1,y_1) \cdot a_{11} + CS(x_2,y_1) \cdot a_{12} + CS(x_3,y_1) \cdot a_{13} + CS(x_4,y_1) \cdot a_{14} \\
& + CS(x_1,y_2) \cdot a_{21} + CS(x_2,y_2) \cdot a_{22} + CS(x_3,y_2) \cdot a_{23} + CS(x_4,y_2) \cdot a_{24} \to \max
\end{aligned}$$

$$\begin{aligned}
a_{11} + a_{12} + a_{13} + a_{14} &= 1 && (1) \\
a_{21} + a_{22} + a_{23} + a_{24} &= 1 && (2) \\
a_{11} + a_{21} &\le 1 && (3) \\
a_{12} + a_{22} &\le 1 && (4) \\
a_{13} + a_{23} &\le 1 && (5) \\
a_{14} + a_{24} &\le 1 && (6) \\
a_{11}, a_{12}, a_{13}, a_{14}, a_{21}, a_{22}, a_{23}, a_{24} &\ge 0 && (7)
\end{aligned}$$


It is necessary to find an integer solution that meets all the specified conditions. This solution is the good matching between the leaves of trees T₁ and T₂.
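For a case as small as the one considered here, the integer solution can be checked by exhaustive enumeration. The sketch below is not the simplex method used in this paper; it simply tries every admissible 0/1 assignment and keeps the one with the largest sum of criteria. The criteria values and the name `best_matching` are hypothetical, and the sketch assumes that the new version has no more text parts than the old one.

```python
from itertools import permutations


def best_matching(cs):
    """cs[i][j] is the criterion between new node y_(i+1) and old node x_(j+1)."""
    n_new, n_old = len(cs), len(cs[0])
    best_value, best_assignment = float("-inf"), None
    # Each permutation picks a distinct old node for every new node: every
    # y gets exactly one x, and every x is used at most once.
    for chosen in permutations(range(n_old), n_new):
        value = sum(cs[i][j] for i, j in enumerate(chosen))
        if value > best_value:
            best_value, best_assignment = value, chosen
    return best_value, best_assignment


# Hypothetical matrix of integral matching criteria (rows y1, y2; columns x1..x4).
CS = [
    [0.9, -0.2, 0.1, -0.8],
    [0.3, 0.4, 0.2, 0.7],
]

value, assignment = best_matching(CS)
print(value, assignment)
```

Here y₁ is matched with x₁ and y₂ with x₄, which corresponds to a₁₁ = a₂₄ = 1 and all other a_ij = 0; the unmatched nodes x₂ and x₃ are the candidates for deleted parts. Enumeration grows factorially with the number of nodes, so it is suitable only for verifying small examples.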

Let us consider the set of constraints (1–7). The inequalities in constraints 3, 4, 5, 6 can be eliminated by introducing nonnegative slack variables s₁, s₂, s₃, s₄:

$$\begin{aligned}
a_{11} + a_{12} + a_{13} + a_{14} &= 1 && (1) \\
a_{21} + a_{22} + a_{23} + a_{24} &= 1 && (2) \\
a_{11} + a_{21} + s_1 &= 1 && (3) \\
a_{12} + a_{22} + s_2 &= 1 && (4) \\
a_{13} + a_{23} + s_3 &= 1 && (5) \\
a_{14} + a_{24} + s_4 &= 1 && (6) \\
a_{11}, a_{12}, a_{13}, a_{14}, a_{21}, a_{22}, a_{23}, a_{24}, s_1, s_2, s_3, s_4 &\ge 0 && (7)
\end{aligned}$$

To solve the task with the simplex method, basic variables must be identified. The slack variables s₁, s₂, s₃, s₄ from constraints (3–6) can be taken as basic variables.

Not all equations contain basic variables: equations (1) and (2) have none. This means that the initial task does not have a readily available feasible basic solution. Thus the problem should be solved using the artificial basis method, in which an auxiliary task is solved first.

Let us introduce the artificial nonnegative variables r₁, r₂ into equations 1, 2:

$$\begin{aligned}
a_{11} + a_{12} + a_{13} + a_{14} + r_1 &= 1 && (1) \\
a_{21} + a_{22} + a_{23} + a_{24} + r_2 &= 1 && (2) \\
a_{11} + a_{21} + s_1 &= 1 && (3) \\
a_{12} + a_{22} + s_2 &= 1 && (4) \\
a_{13} + a_{23} + s_3 &= 1 && (5) \\
a_{14} + a_{24} + s_4 &= 1 && (6) \\
a_{11}, a_{12}, a_{13}, a_{14}, a_{21}, a_{22}, a_{23}, a_{24}, s_1, s_2, s_3, s_4, r_1, r_2 &\ge 0 && (7)
\end{aligned}$$

The basic variables are: s₁, s₂, s₃, s₄, r₁, r₂.

The aim of solving the auxiliary task is to obtain a feasible basic solution that does not contain the artificial variables r₁, r₂. To this end, let us formulate the auxiliary efficiency (objective) function

$$G = r_1 + r_2$$

and minimize it subject to the given set of constraints.

To solve the auxiliary task with the simplex method, the function G is expressed in terms of the free variables:

$$G = 2 - a_{11} - a_{12} - a_{13} - a_{14} - a_{21} - a_{22} - a_{23} - a_{24} \to \min$$
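The expression of G in terms of the free variables follows directly from equations (1) and (2) with the artificial variables added. A short derivation, using only the symbols already introduced:

```latex
\begin{aligned}
r_1 &= 1 - a_{11} - a_{12} - a_{13} - a_{14}, \\
r_2 &= 1 - a_{21} - a_{22} - a_{23} - a_{24}, \\
G &= r_1 + r_2 = 2 - \sum_{i=1}^{2} \sum_{j=1}^{4} a_{ij}.
\end{aligned}
```

Since r₁ and r₂ are nonnegative, G cannot be smaller than 0, and G = 0 holds exactly when r₁ = r₂ = 0, that is, when the original equations (1) and (2) are satisfied without artificial variables.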

The next step is to form the initial simplex table and solve the task with the simplex method. Since solving the task with the simplex method requires numerical values of the parameters, the further course of the solution is not presented here.

If, after minimization, the optimal value of G equals 0 and all artificial variables have left the basis, the obtained basic solution is a feasible initial basic solution of the original task.


6. Change detection algorithm between XML document versions

Based on the approach presented above, the complete change detection algorithm between XML document versions can now be described.

It consists of the following steps:

1. Forming the input data: the new and old versions of the XML document.

2. XML document indexing, which is necessary to take the left-to-right order of the resulting XML tree nodes into account and to track node positions in the XML tree hierarchy.

3. Transformation of the old and new versions of the XML document into XML trees. Matching criteria determination and the search for changes are performed on the nodes of the XML trees.

4. Identification of the tags that are meaningful for change detection.

5. Determination of the matching parameters for the tags selected in the previous steps.

6. Creation of the matrix of integral matching criteria for all selected tags in both versions of the XML tree.

7. Formation of the linear programming task based on the integral matching criteria.

8. Solution of the linear programming task with the simplex method.

9. Execution of the content comparison function for the matching nodes detected at the linear programming solution step.

10. Presentation of the results of the XML document comparison.

The proposed algorithm can be presented as the flow chart shown in Fig. 1.

Step 4 identifies the tags important for the subscriber, which will be used for comparison. Usually these are XML tags with text content that is of interest to the subscriber. They are usually represented as leaf nodes in the XML tree.


Figure 1. Algorithm flowchart

The parameters at step 5 are calculated for all pairs of chosen nodes in the old and new versions of the XML tree. They are then used to create the matrix of integral matching criteria. The integer solution obtained at step 8 represents the best match between the given documents under the accepted conditions.
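The steps of the algorithm can be tied together in a compact end-to-end sketch. It is an illustration under stated assumptions rather than the authors' implementation: the weight factors are hypothetical, words are split on whitespace, index length is counted in components, two empty sets count as a full match, the new version is assumed to have no more text leaves than the old one, and step 8 is replaced by exhaustive enumeration instead of the simplex method.

```python
import xml.etree.ElementTree as ET
from itertools import permutations

# Hypothetical weight factors alpha, beta, gamma; they must sum to 1.
WEIGHTS = (0.6, 0.2, 0.2)


def leaves(xml_text):
    """Steps 2-4: parse, index every tag, return (index, attrs, words) per text leaf."""
    found = []

    def walk(el, index):
        if len(el) == 0 and (el.text or "").strip():
            found.append((index.split("."), set(el.attrib.items()), set(el.text.split())))
        for pos, child in enumerate(el, start=1):
            walk(child, f"{index}.{pos}")

    walk(ET.fromstring(xml_text), "1")
    return found


def ratio(a, b):
    """|a & b| / |a | b|; two empty sets count as a full match (assumption)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def cs(x, y):
    """Step 5: integral matching criterion of two leaves."""
    (ix, ax, wx), (iy, ay, wy) = x, y
    suffix = 0
    for p, q in zip(reversed(ix), reversed(iy)):
        if p != q:
            break
        suffix += 1
    p_dist = suffix / max(len(ix), len(iy))
    alpha, beta, gamma = WEIGHTS
    return -1 + 2 * (alpha * ratio(wx, wy) + beta * ratio(ax, ay) + gamma * p_dist)


def compare(old_xml, new_xml):
    """Steps 6-9; requires len(new leaves) <= len(old leaves)."""
    old, new = leaves(old_xml), leaves(new_xml)
    matrix = [[cs(x, y) for x in old] for y in new]  # step 6
    # Steps 7-8, by enumeration: one distinct old leaf per new leaf.
    best = max(
        permutations(range(len(old)), len(new)),
        key=lambda pick: sum(matrix[i][j] for i, j in enumerate(pick)),
    )
    changed = {}
    for i, j in enumerate(best):  # step 9: compare contents of matched leaves
        removed, added = old[j][2] - new[i][2], new[i][2] - old[j][2]
        if removed or added:
            changed[".".join(old[j][0])] = (sorted(removed), sorted(added))
    deleted = [".".join(old[j][0]) for j in range(len(old)) if j not in best]
    return changed, deleted


# Hypothetical document versions, used only for illustration.
OLD = "<doc><a>alpha beta</a><b>gamma delta</b><c>one two three</c></doc>"
NEW = "<doc><a>alpha beta</a><c>one two four</c></doc>"

changed, deleted = compare(OLD, NEW)
print(changed, deleted)  # step 10
```

In this example the leaf at old index 1.3 is reported as changed (the word "three" replaced by "four") and the leaf at old index 1.2 as deleted, while the unchanged leaf 1.1 produces no entry.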


7. Summary

The need for systems that provide change detection in XML documents has become essential because of the fast rate of information change on the Web and the widespread use of the XML format. This paper proposed a new algorithm that allows the efficient detection of differences between XML documents in a quantitative way. Rather than computing the edit sequence that produces the updated version of the whole document, the algorithm focuses on detecting changes between those text parts of XML documents that are meaningful for the end user. The algorithm introduces a new approach that includes determining the similarity between nodes and solving the good matching search problem as a linear programming task.

Bibliography

[1] Abiteboul S., Chawathe S., Widom J., Representing and querying changes in semistructured data, Proceedings of the International Conference on Data Engineering, Orlando, Florida, February 1998: p. 4–13.

[2] Abiteboul S., Cobena M., Marian A., Detecting Changes in XML Documents Gregory, SIGMOD, 25(2): 2002, p. 493–504.

[3] Alieksieiev M.O., Alekseyev O.M., Molchanov Y.M., Publish/Subscribe System for R&D Information Resources, “Visnyk SumDU”, #2, Sumy 2009: p. 22–30.

[4] Chawathe S., Garcia-Molina H., Meaningful change detection in structured data, Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997: p. 26–37.

[5] Flesca S., Masciari E., Efficient and effective Web change detection, Data & Knowledge Engineering 46, 2003: p. 203–224.


XML DOCUMENT CHANGE DETECTION ALGORITHM BASED ON THE LINEAR PROGRAMMING METHOD

Abstract

The article describes a new algorithm for detecting changes in XML documents using the current and previous versions of the monitored XML document. The technique can be used effectively to detect changes between parts of the text.

The approach presented in the article differs from those used previously. The main idea of the algorithm is to pay attention to quantitative changes in the tracked documents instead of searching for specific sequences of changes. The proposed technique represents the document as a tree and considers only those parts of the XML document that are important for the end user.

Keywords: XML, change detection, publish/subscribe system

Iurii Molchanov Larisa Globa

Mykola Alieksieiev et al.

National Technical University of Ukraine „Kyiv Polytechnic Institute” 03056, Ukraine, Kiev, per. Industrialny, 2

e-mail: lgloba@its.kpi.ua alexeyev@its.kpi.ua molchanov_y@ukr.net