MINING SEMI STRUCTURED DATA

M.Balakrishnan Lecturer –Computer science SRM Engineering college SRM Nagar Kattankulathur – 603 203 Kancheepuram Dist., Tamilnadu, India. mailtombala@yahoo.com

M.V.S.S.Nagaendranath Student – M.Tech(C.S.E) SRM Engineering college SRM Nagar Kattankulathur – 603 203 Kancheepuram Dist., Tamilnadu, India. Shivamaganti_@yahoo.co.in

Abstract:

In recent years XML has become very popular for representing semi structured data and a standard for data exchange over the web. Mining XML data from the web is becoming increasingly important. Several encouraging attempts at developing methods for mining XML data have been proposed. Normally, pre-processing or post-processing are required for mining XML data, such as transforming the data from XML format to relational format. However, efficiency and simplicity is still a barrier for further development. In this paper, we show that extracting association rules from XML documents without any preprocessing or post processing using Xquery is possible and analyze the Xquery implementation of the well known Apriori algorithm. In addition, we suggest features that need to be added into Xquery in order to make the implementation of the Apriori algorithm more efficient.

Keywords : Xquery , XML ,Association rule mining ,Semi structured data

1.0 Introduction :

Building a database system accommodates semi structured data has required us to rethink nearly every aspect of database management.

The web is rich with information. However the data contained in the web is not well organized which makes obtaining useful information from the web [1] a difficult task.

Knowledge discovery in databases (KDD) deals with the process of extracting interesting knowledge from large amount of data, usually stored in large(relational) databases [10].

Knowledge can be represented in many different ways such as clusters, decision trees, decision rules, etc., Among the others, association rules [2] proved effective in discovering interesting relations in massive amounts of relational data. The recent years have seen the dramatic development of the extensible Markup Language(XML) [3].

Extensible Markup Language was developed as a standard to represent semi structured data. With developments like xyleme [4,5] which is a huge warehouse integrating XML data from the web. To data mining XML documents requires mapping the data to be relational data model and using techniques designed for relational databases to do the mining. The XMINE operator has been introduced by Braga et.al for extracting association rules from XML documents, where mapping the XML data to relational database is required before mining is preformed.

The Query language Xquery [3] was proposed by the W3C [6] in order to provide a flexible way to extract XML data and provide the necessary interaction document can be mined for association rules using only the query language Xquery without any preprocessing or post processing.

1.1 Semi structured data :

The data is self-describing in the sense that it does not come with a separate schema, and the structure of the data, when it exists, has to be inferred from the data itself, such data is called semi structured. Semi-structured data is also useful when integrating several databases, some of which may be structured.

In such an integration process the data may come from several different sources and thus it may be difficult to constrain the integrated database to a single unifying schema.

Semi-structured data is naturally modeled in terms of graphs which contain labels that give semantics to the underlying structure.

Table1, which is labeled EMPS, models an entity set of employees ,where each entity in EMPS is represented by an isolated node.Table2,which is labeled ED2,models the subset of employees in EMPS working in the Maths department. Table 3 models the information pertaining to EMP1.

DEPTS, shown in table 4 models an entity set of departments and nodes labeled DEPT1 and DEPT2 shown in Table 5 and 6 respectively, model the information pertaining to DEPT1 and DEPT2. Note that in DEPT2 the actual address of the department is missing and also that it has the additional attribute faculty which is missing from DEPT1.Finally,the node labeled works shown in Table7, models the relationship of an employee working in a department, where each employee and their department is represented by an arc in the node.

We observe that cyclic references are achieved by two nodes referencing each other, for example, a cyclic reference exists between EMP1 and DEPT1,since EMP1 references DEPT 1 and DEPT1 references EMP1

In general, the problem of information wise equivalence of node databases and thus semi-structure databases is un decidable.

2.0 Related work :

One data mining technique that has proved popular is association rule mining. It finds association between items in a data base. Application of association rules extend to finding useful patterns in consumer behavior , target marketing and electronic commerce.

XML documents can be mined for association rules by implementing the well known Apriori algorithm using only Xquery.

An association rule [7] is a rule which implies certain association relationships among a set of objects(such as “occur together” or “one implies the other”)in a database.

Given a set of transactions, where each transaction is a set of items ,an association rule is an expression of the form X à Y, where X and Y are set of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y.

Let I = { i₁,i₂,…….,i_m} be a set of literals, called items[8]. Let D be a set of transactions, where each transaction T is a set of items such that T Є I. Associated with each transaction is a unique identifier, called its TID, we say that a transaction T contains X ,a set of some items in I, if X Є T.

An association rule is an implication of the form XàY, where X C I, Y C I and X ∩ Y = Φ. The rule X à Y holds in the transaction set D with confidence C if C% of transactions in D that contain X also contain Y. The rule Xà Y has support S in the transaction set D if S% of transactions in D contain XUY.

Y, where X C I,Y C I and X∩Y=Φ. The rule X Y holds in the transaction set D with confidence C if C% of transactions in D that contain X also contain Y. the rule X Y has support Sin the transaction set D if S% of transactions in D contain XU Y.

Confidence α is the ratio of number of transactions that contain XUY to the number of transactions that contain X [9].The accuracy of confidence or strength for an association rule X à Y is given by

α (XàY)=P(XUY /X)

Support S for an algorithm rule XàY is the %of transactions in the database that contain X U Y.

Given a set of transaction D ,the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum (called min sup) and minimum confidence(called min conf )respectively.

Following table 8 gives the rule for support and confidence,

Table 8

High

Support S

Low

Confidence α

Low High

Event occurs very often; correct only few times

Event occurs very often; mostly correct

Event occurs less often; correct only few times

Event occurs less often. Mostly correct

If low support and low confidence then rules can be ignored and rules having high support and high confidence though desirable may not be available.

3.0 Association rules :

Consider the database in table 9,a database ‘D’ contains a set of transaction T and each transactions consist of one or more items.

Table 9.

An example Database

Tid	Items
1	{bread, butter, milk}
2	{bread, butter, milk, ice cream}
3	{ice cream, coke}
4	{battery , bread, butter, milk}
5	{bread, butter, milk}
6	{battery, ice cream, bread, butter}

When people buy bread and butter, may also buy milk in 66% of the cases and 80% of the transaction with bread and butter also contain milk. Such a rule can represented as

“bread, butter àmilk | support =0.66 , confidence=0.8”

The pattern of discovering all association rules can be decomposed into two subprograms [2].

1. Find all sets of items (item sets) that have transaction support above minimum support. The support for an item set is the number of transactions that contain the item set Item sets with minimum support are large item sets and all others small item sets.

2.Use the large item sets to generate the desired rules. For every large item set L, find all non-empty subsets of L For every such subset a, output a rule of the form a à (L-a) if the ratio of support(L)to support(a) is at least min conf.

The performance of mining association rules is mainly dependent on the large item sets discovery process. Therefore, it is important to have an efficient algorithm for large item sets discovery.

In order to find frequent item sets (Patterns),the apriori algorithm, which is the more efficient and basic algorithm for finding frequent item sets , is used.

Algorithms for discovery large item sets make multiple passes over the data. In the first pass, we count the support of individual items and determine which of them are large, i.e. have minimum support .In each subsequent pass, we start with a seed set of item sets found to be large in the previous pass. we use this seed set for generating new potentially large item sets called candidate item sets and count the actual support for these candidate item sets during the pass over the data. At the end of the pass, we determine which of the candidate item sets are actually large, and they become the seed for the next pass. This process continues until no new large item sets are found.

Apriori is an influential algorithm for mining frequent item sets for association rules. The name of the algorithm is based on the fact that the algorithms uses prior knowledge of frequent item set properties. Apriori employs an iterative approach known as a level wise search, where K item sets are used to explore (K+1) item sets. First the set of frequent 1 item sets is found. This set is denoted as L1.L1 is used to find L2,the set of frequent 2 item sets, which is used to find L3 and so on, until no more frequent K item sets can be found. The finding of each L_kone full scans of the database.

Algorithm : Apriori [8]

L1={large 1-itemsets}

For(K=2; L _K-1
≠_Φ_;K++) do begin

C_k= apriori-gen (L_K-1); //New candidate

For all transaction t ε D do begin

C_t=subset (Ck, t); //candidates contained in t

For all candidate C ε C_tdo

C. Count++;

End.

L_K={C ε C_K| C. Count ≥ min sup}

End.

Table 10 summarizes the notation used in the algorithm

Table 10

Notation	Definition
K-item set	An item set having K items
C_k	Set of candidate K-item sets
L_k	Set Of large K – item sets

The Apriori-gen function takes as arrangement L_K-1,the set of all large (K-L) items sets. It returns a superset of the set of all large K item sets. The function works as follows.

First, in the join step, we join L _K-1 with L _K-1

Insert into C_K

Select P.item1,P.item2,…, P.item _K-1,q.item_K-1

From L _K-1P,L _K-1q

Where P.item ₁=q.item_1,….., P.item _K-2=q.item_K-2, P.item _K-1<q.item_K-1

Next, in the prune step , we delete all item sets C Є C_KSuch that some (K-1) subset of C is not in L _K-1;

For all item sets C Є C_K do

For all (K-1) subsets s of C do

If (S € L _K-1 ) then

Delete C from C_K.

3.1 Finding frequent item sets : mining

From the input table presented above ,with the constraints of min_support = 0.49 and confidence threshold=80%. for an illustration purpose , description field only taken for first level mining to discover description based knowledge as below.

Initial candidate set ,

Candidate set

{coke butter, ice cream}

{bread, butter, milk}

{coke, bread, butter, milk}

{bread, milk}

Frequent-1item set

Candidate(C1)

Support

{coke}

{bread}

{butter}

{ice cream}

{milk}

0.5

0.75

0.25

0.75

Frequent –2 item set

Candidate(C2)

Support

{coke, bread}

{coke, butter}

{coke, milk}

{bread, butter}

{bread, milk}

{butter, milk}

0.25

0.5

0.25

0.5

0.75

0.5

Frequent –3item set

Candidate(C3)

Support

{coke, butter, bread}

{coke, bread, butter, milk}

{coke, bread, milk}

{bread, butter, milk}

0.25

0.5

Frequent –4 item set

Candidate(C4)

Support

{bread, butter, milk}

0.5

The data base was scanned 4 times to find the frequent item sets of at all search levels with different combinations of candidate item as mentioned in the algorithm. The candidate for which the support is not satisfactory i.e. support < 49% are considered to be infrequent and they have been removed from the next iteration candidate list.

3.2 Generation of association rules from frequent Item sets :

Once the frequent item sets from records in a database have been found, it is straight forward to generate strong association rules from them .

Method :

1.List all frequent item sets

2.Form all non-empty subset of frequent item set

3. Generate rules by relating the subset with one another as like Aà B, and measure confidence

4.The rule that support the minimum confidence threshold are only the really interesting and they will be confirmed as strong

Confidence ( A à B) = P(B/A) = Support-count(AU B)

----------------------------

Support –count (A)

In this illustration, the frequent item set found is,

L ={ bread, butter, milk }

Non-empty subsets are, {bread , butter} , {bread, milk }, {butter, milk},{bread},{butter},{milk}

The Association rules are,

{bread ,butter} à{milk}confidence = 2/3 = 66%

{bread, milk}à{butter} confidence = 2/3 = 66%

{butter,milk}à{bread}confidence = 2/2 = 100%

{bread} à {butter, milk}confidence= 2/3 = 66%

{butter} à {bread, milk }confidence =2/3=66%

{milk} à {bread butter} confidence = 2/3=66%

Though the minimum confidence threshold is 80%,the rule {butter, milk}à {bread} only is the more interesting and strong to generate knowledge.

4.0 Xquery expression for mining association rules from XML data :

We refer to the sample XML document, depicted in fig 1. where information about the items purchased in each transaction are represented. For example, the set of transactions are identified by the tag <transactions> and each transaction in the transactions set is identified by the tag <transaction>.The set of items in each transaction are identified by the tag <items> and an item is identified by the tag <item>. Fig 1. An example data base we can move on to discuss how to compute the association rules document as shown in fig1. Inside the association rules document, an association rule is identified by the tag <rule>,with two attributes support and confidence which specify the support and confidence values.

Also, the antecedent and the consequent of the rule are identified by the tag <antecedent> and <consequent> respectively, each contains one or more items. We present the following Xquery expression that computes the rule document.

Let $min conf:=1.00

Let $src:=document(“/large.XML”) //large item set

For $itemset1 in $src

Let $item1:=$itemset1/items

/* For $itemset2 in $src

Let $item2:=$itemset2/items */

Where count($items1)>count($items2) and

Count(common Its($items1,$items2)=

Count($items2)and $itemset1/support div $itemset2/support ≥$min conf

return < rule support = “{$itemset1/support}”

confidence=”{$itemset1/support * 1.0) div

($itemset2/support * 1.0)” >

<antecendent> {$items2}</antecendent>

{removeItems($items1,$items2)}

</consequent>

</rule>

The above expression can be explained as follows. For each large item set X in the large item sets document, we look for other item sets Y in the large item sets document such that Y C X and support (X U Y )/support(Y) ≥min conf .

However, if the document to be mined has an irregular structure like that shown in fig2.

Transactions

Transaction Transaction Transaction

Item A item B items number of items

Price item C item D category A

Fig2.

Then an extra query can be added inside the let clause where we specify the data source to take the structure into account, to restructure the input to a more regular structure at runtime. This is possible since an Xquery query returns an XML document of any structure. Therefore, any XML documents can be mined using Xquery.

The Xquery implementation requires that for each item set in the candidate set it will read the database once to obtain the support value. Therefore, the number of times needed, to scan the data base to obtain the support count for finding large item sets is O(2^l),where l is the length of the maximal large item set. This is very inefficient compared with the implementation in other languages.

Xquery cannot store the support value for the item sets but instead return it in a query. Therefore, it is difficult to read a record in the database and update the support count for the item sets using Xquery. This means that the whole database needs to be read for each candidate itemset. The results of our experiments shown that xquery is more suitable for mining data from small datasets.

5.0 Conclusion

In this paper, we presented the Xquery implementation of apriori method and demonstrating the process of mining association rules from native XML data.

The significance of this work is that with implementation, the corpuses of XML data that exist can now be mined for association rules directly without mapping the XML data to another form at. Many issues remain open, one of the issues concerns the structure of the XML data. Since , the structure of the XML data can be very complex and irregular, identifying the mining context on such XML data becomes difficult .Therefore, to simplify the task of identifying the context, a set of transformations of the XML data might be required.

Our current and future research in this area focuses on investigating what other standard data mining algorithm can be expressed in Xquery. We are currently testing large XML data sets to study the performance of our Xquery implementation of the algorithm.

This research provides a good starting point, there are still many open issues for future research- even within the work presented in this paper.

6.0 References:

1. Jacky W.W.Wan,Gillian Dobble,”Mining association rules from XML data using Xquery”

2.R.Agrawal,T.Imielinski and A.Swami, ”Mining association rules between sets of items in large data bases”,may’1993

3.Worldwide web consortium, eXtensible markup language version 1.0.

http://www.w3org/XML

4.Xyleme. http://www.Xyleme.com

5.Luice Xyleme, A dynamic warehouse for XML data of the web ,IEEE 2001

6.World wide webconsortium.http://www.w3.org

7.Uma maheswari, association analysis on generalized object relational data model of video annotation of cricket match,Dec’2004.

8.Rakesh Agarwal, rama Krishnan srikanth “Fast algorithms for mining association rules in large data base.Sep’1994

9.A.S.Al.Amean,Dr.R.Srinivasan,An application of association rule mining ,DEC’2004

10. D.Brage, A.Campi “A Tool for extracting XML Association rules from XML documents,2002

tid	Items
1	{a, d, e}
2	{b, c, d}
3	{a, c}
4	{b, c, d}
5	{a, b}