  • Published on: 7 October 2010
Web Reverse Engineering
Fabrice Estiévenart (1), Aurore François (1), Jean Henrard (1,2), Jean-Luc Hainaut (2)
(1) CETIC, rue Clément Ader, 8 - B6041 Gosselies - Belgium
(2) Institut d’Informatique, University of Namur, rue Grandgagnage, 21 - B5000 Namur - Belgium
{fe, af, jh}@cetic.be, jlh@info.fundp.ac.be

Abstract
Modern technologies allow web sites to be dynamically managed by building pages on the fly through scripts that get data from a database. Dissociating data from layout directives makes data easy to update and keeps the presentation homogeneous. However, many web sites are still made of static HTML pages in which data and layout information are interleaved. This leads to out-of-date information, inconsistent style, and tricky and expensive maintenance. This paper presents a tool-supported methodology to reengineer web sites, that is, to extract the page contents as XML documents structured by expressive DTDs or XML Schemas. All the pages that are recognized to express the same application (sub)domain are analyzed in order to derive their common structure. This structure is formalized by an XML document, called META, which is then used to extract an XML document that contains the data of the pages and an XML Schema validating these data. The META document can describe various structures such as alternative layouts and data structures for the same concept, structure multiplicity, and separation between layout and informational content. XML Schemas extracted from different page types are integrated and conceptualised into a unique schema describing the domain covered by the whole web site. Finally, the data are converted according to this new schema so that they can be used to produce the renovated web site. These principles will be illustrated through a case study using the tools that create the META document and extract the data and the XML Schema.

Keywords: reengineering, web site, XML, data extraction.

Nowadays, large web sites are dynamically managed. The pages are built on the fly through scripts (programs) that get data from a database. Dissociating the data from the layout can overcome or simplify web site maintenance problems [2] such as out-of-date data, inconsistent information or inconsistent style.
Web sites publish a large amount of data that change frequently, and the same data are present on many different pages (redundancy). If the data are stored in a well-structured database, keeping them up to date is easier than if they were scattered across many pages. Pages that are generated through a script or a style sheet all have the same style, which increases the user’s comfort and gives the site a more professional look. If the style of the web site must be changed, only the script or the style sheet needs to be changed for all the pages to follow the new style. Another advantage of separating the data from their layout is that the same data can be presented according to different layouts depending on the intended audience. For example, the same data (stored in a database) can be used to produce the intranet, the extranet and some paper brochures.

Many sites are made up of static pages because such pages can be easily created using a variety of tools that can be used by people with little or no formal knowledge of software engineering. When the size of these sites increases and when they require frequent updates, the webmasters can no longer manage these HTML pages. Hence the need to migrate static web sites to dynamic ones that are easier to maintain.

This paper presents a methodology to extract the data from static HTML pages in order to migrate them into a database. If we want to separate the data of static pages from their layout, these pages need to be analysed in order to retrieve the encapsulated data, and their semantic structure has to be elicited. For example, we need to know that a page describing a customer contains its name, one or two phone number(s), an address (itself comprising street,...
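
To make the customer example more concrete, here is a minimal Python sketch of what separating data from layout can look like for a single static page. It is not the tool or the META format described in this paper: the sample page, the "Phone:" and "Street:" labels, the function name extract_customer and the output element names (customer, name, phone, address, street) are all assumptions made for illustration. The sketch only uses the standard library (html.parser and xml.etree.ElementTree).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical static page: data and layout directives are interleaved.
PAGE = """
<html><body>
<h1><font color="blue">Alice Martin</font></h1>
<p><b>Phone:</b> 081 11 11 11</p>
<p><b>Phone:</b> 081 22 22 22</p>
<p><i>Street:</i> rue Exemple 1</p>
</body></html>
"""


class BlockText(HTMLParser):
    """Collect the text of each <h1>/<p> block, ignoring inline layout tags."""

    BLOCKS = {"h1", "p"}

    def __init__(self):
        super().__init__()
        self.blocks = []   # list of (tag, text) pairs, in page order
        self._tag = None   # block element currently open, if any
        self._buf = []     # text fragments seen inside that block

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Normalise whitespace so layout-only spacing is dropped.
            text = " ".join("".join(self._buf).split())
            self.blocks.append((tag, text))
            self._tag = None


def extract_customer(html):
    """Build a <customer> element from the page (assumed labels and names)."""
    parser = BlockText()
    parser.feed(html)
    customer = ET.Element("customer")
    for tag, text in parser.blocks:
        if tag == "h1":
            ET.SubElement(customer, "name").text = text
        elif text.startswith("Phone:"):
            # The phone number may occur more than once on a page.
            ET.SubElement(customer, "phone").text = text[len("Phone:"):].strip()
        elif text.startswith("Street:"):
            address = customer.find("address")
            if address is None:
                address = ET.SubElement(customer, "address")
            ET.SubElement(address, "street").text = text[len("Street:"):].strip()
    return customer


xml_text = ET.tostring(extract_customer(PAGE), encoding="unicode")
print(xml_text)
```

Run on the sample page, the sketch yields a customer element with one name, two phone elements and a nested address, and no trace of the font, bold or italic directives. The extraction rules are hard-coded here for one page; deriving them from the common structure of a whole family of pages is the harder problem the methodology addresses.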