Thursday 22 December 2011

Hadoop: MapReduce Introduction:
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.


A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.


Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.
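To make the model concrete, below is a sketch of the canonical word-count job, written in the style of the Hadoop MapReduce tutorial (not code from this post). The map emits a (word, 1) pair for every word in its input split; the framework sorts and groups the pairs by word; the reduce sums the counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word; the framework has already
    // sorted and grouped the map outputs by key.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job client submits this jar and its configuration to the JobTracker, which distributes the work to the TaskTrackers as described above.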


More Research Topics:


MapReduce: A software framework introduced by Google to support distributed computing on large datasets.

Answer Set Programming (ASP): Declarative programming oriented towards difficult (primarily NP-hard) search problems. ASP includes all applications of answer sets to knowledge representation. Answer Set Solvers: smodels, assat, clasp and dlv.

Using these two concepts, I am going to start reading about and trying to understand MapReduce and Answer Set Programming, along with one of its solvers, dlv. The challenge is to produce an implementation that uses ASP and MapReduce together.

Monday 19 December 2011

Research Topics:
After talking to Dr. Kewen Wang, he suggested some research topics to me. He talked about a new model of the World Wide Web in which Artificial Intelligence and Knowledge Representation are applied. He said that, nowadays, the Internet is a web of linked documents, but not of linked data. Efforts like DBpedia are trying to build a new Internet. The solution to the myriad of data formats could be Linking Open Data (LOD). He also mentioned the evolution of Wikipedia, from WordNet to Wikipedia, and the question is: now LOD? Based on what has been done in WordNet and Wikipedia, what could be an innovative approach?


Research Topics:
1. Resolving conflicts
2. Ranking candidate solutions
3. Linked Open Data

Thursday 15 December 2011

Firefox extensions and On-line tools for the Semantic Web


Firefox Extensions
Semantic Radar: Displays a status bar icon to indicate presence of Semantic Web (RDF) data in the web page.
More extensions

On-Line Tools
Semantic Query End-Point

Thursday 8 December 2011

Installing a Semantic Web Environment

On Ubuntu 10.04
I'll be doing some tests with the Jena framework for the Semantic Web. First things first, we need to install Jena. As long as Java is correctly configured, to install Jena we just need to:
1. Download Jena
2. Unzip it in any folder
3. Run the test as described in this tutorial

We will also need a Java IDE, as recommended by this book: Semantic Web Programming. I have installed Eclipse on my Ubuntu 10.04. It is very easy:

$ sudo apt-get install eclipse
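To check that everything works, a minimal Jena "hello world" can be compiled and run. This is a sketch assuming the Jena 2.x jars are on the classpath; the resource URI is made up for the test:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.VCARD;

public class JenaHelloWorld {
    public static void main(String[] args) {
        // Create an empty in-memory RDF model.
        Model model = ModelFactory.createDefaultModel();

        // Describe a (hypothetical) resource with a vCard full name.
        Resource alice = model.createResource("http://example.org/people/alice");
        alice.addProperty(VCARD.FN, "Alice Smith");

        // Print the model as RDF/XML; if this runs, Jena is installed correctly.
        model.write(System.out, "RDF/XML");
    }
}

If it prints a small RDF/XML document, the installation is fine.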

On Windows 7


1. Setting up the environment variable:
http://introcs.cs.princeton.edu/java/15inout/windows-cmd.html

2. Java editor: Eclipse

3. Ontology editor: Protégé

4. Semantic web programming framework: Jena

5. Pellet reasoner

That's it!

This is a basic environment installation. I'll be doing more interesting things by following the above book. The book has a website where all the source code they use is available.

Semantic Web Real World Examples:


Example 1:
Try googling for all cars advertised on the web with engines smaller than 2.0 litres that run on unleaded petrol, have an mp3 connection, and can be seen in a showroom conveniently accessible by public transport from your house. Google is unable to help you. You have to make several searches and correlate the results yourself. On the Semantic Web, you can express an interest in products for sale that are cars, and add the constraints. Every result would be useful.

Example 2:
You want to correlate data that is not obviously related: for example, country walks taken by a population versus the levels of clinical obesity in the same population. This kind of information can be explored in Gapminder.

An example is a site that gives the weather for any city in the world, in HTML form. Even though the site offers dynamic, database-driven information, it is presented in a purely syntactic way. One could imagine a computer program that tried to retrieve this weather information through text parsing or "web scraping". Though it would be possible to do, if the creators of the site ever decide to change around the layout or HTML of the site, the computer program would most likely need to be rewritten in some way. In contrast, if the weather site published its data semantically, the program could retrieve that semantic data, and the site's creators could change the look and feel of the site without affecting that retrieval ability.
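To make the contrast concrete, here is a sketch of what the consuming program could look like if the site published RDF. The URL and the temperature property below are hypothetical; a real weather site would publish its own vocabulary. The point is that the program reads data, not page layout:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;

public class WeatherClient {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // Hypothetical URL: the weather site would publish its data as RDF here.
        model.read("http://weather.example.org/city/brisbane.rdf");

        // Hypothetical property: the site's own vocabulary would define it.
        Property temperature = model.createProperty("http://weather.example.org/ns#temperature");

        // Iterate over every temperature statement, regardless of page layout.
        StmtIterator it = model.listStatements(null, temperature, (RDFNode) null);
        while (it.hasNext()) {
            Statement s = it.nextStatement();
            System.out.println(s.getSubject() + " -> " + s.getObject());
        }
    }
}

The site's creators could redesign the HTML at will without breaking this client.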

Technologies for the Semantic Web:
SPARQL: a query language for RDF data
RDF: a language to organize information and represent resources
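A small Jena sketch combining the two, using the ARQ query API (the data URL is hypothetical; the FOAF vocabulary is real):

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class SparqlDemo {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("http://example.org/data.rdf"); // hypothetical RDF document

        // Find the names of all resources typed as foaf:Person.
        String queryString =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
            "SELECT ?name WHERE { ?person a foaf:Person . ?person foaf:name ?name }";

        Query query = QueryFactory.create(queryString);
        QueryExecution qe = QueryExecutionFactory.create(query, model);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution sol = results.nextSolution();
                System.out.println(sol.getLiteral("name").getString());
            }
        } finally {
            qe.close();
        }
    }
}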

Monday 5 December 2011

Open Rules. Business Rules Management Methodology and Supporting Tools
  • Offers a methodology and open source tools for business analysts to create a Business Rules Repository
  • Repositories are used across enterprises as a foundation for rules-based applications with complex business, processing, and presentation logic
  • It uses familiar graphical interfaces provided by MS Excel, OpenOffice and Google Docs
  • OpenRules supports collaborative rules management.
  • OpenRules® also helps subject matter experts to work in concert with software developers to integrate their decision models into existing infrastructures for Java and .NET.
  • OpenRules makes rules-based systems less expensive, easier to develop and manage, and more sustainable.

Reference:

Visual Rules Execution Platform
Visual Rules Execution Platform provides a centralized rule deployment and execution environment that allows rules to be easily integrated into many applications running on any platform. Hot deployment capabilities ensure that new rule versions are made available with zero downtime.

Providing Rules as Web Services
Any rule models deployed to Visual Rules Execution Platform automatically become available as standard web services. These services can be consumed by a wide variety of clients, not limited to Java architectures.


Reference:

W3C Workshop on Rule Languages for Interoperability

Overview:
  • Rule languages and rule systems are widely used in:
    • Database integration
    • Service provisioning
    • Business process management 
  • General purpose rule languages remain relatively non-standardized
  • Rule systems from different suppliers are rarely interoperable
  • The Web has achieved remarkable success in allowing documents to be shared and linked 
  • Semantic Web languages like RDF and OWL are beginning to support data/knowledge sharing
  • Having a language for sharing rules is often seen as the next step in promoting data exchange on the Web
Summary:
  • In April 2005, the W3C held a workshop to explore options for establishing a standard web-based language for expressing rules.
  • A half-dozen candidate technologies were presented and discussed.
  • The workshop confirmed the differences among types of rules: "if condition then action" rules and "if condition then condition" rules.
Introductory Sessions:
  • During the first session the participants introduced themselves. Three backgrounds were represented: business rules, logic programming and the Semantic Web.
  • The second session had two presentations proposing scopes for a standard, and one on the W3C approach to standardization. 
  • Both scope/requirements presentations suggested that no single rule language would cover all the requirements but that there could be a common core to a family of languages.
Candidate Technologies:
  • WSML, RuleML, SWSL, N3, SWRL, Common Logic, TRIPLE. Primarily academic efforts.
  • The discussion revolved largely around formal issues and semantic features.
  • What constitutes a candidate technology for standardization, and what kind of specification is needed for a rule language?
  • The RuleML presentation staked out slightly different ground, focusing more on the exchange format and interoperability.
  • The other main line of discussion was thus about what is feasible in a short time and what should be the scope of the standard.
  • Some of the participants argued for a simple set of features to start with instead of a very rich and complex language.
  • The candidate technologies have not been tested on commercial rule bases.
Related Standards:
  • The Production Rule Representation (PRR)
  • A standard Java API for rule engines
  • The Semantics of Business Vocabulary and Business Rules meta-model (SBVR, aka the "semantic beaver").
  • PRR is limited to forward-chaining and sequential rule processing
  • The lightweight JSR-94 API does not specify the behavior of the engine
  • SBVR is for business modeling by business users, in their own terms.  It provides structured English for business rules from which the meaning can be extracted as formal logic.
Issues:

Negation as failure:
  • Many features of the Web (including search engines) report failure for inscrutable and unpredictable reasons.
  • In a database, if a record is not found, we can assume the corresponding fact is false. On the Web, if a book isn't found by a search engine, it might just mean the engine failed to crawl the appropriate part of the Web.
Relationship to Description Logics (OWL)
Users want a language where they can represent both rules and ontologies. This topic came up in nearly every session.

Syntax Options
People want rules in many different styles of syntax, driven by who (or what) they expect will be reading and writing rules.
  • XML is convenient for machine interchange, and appears to be widely deployed and understood by rule users and implementers. 
  • English-like syntaxes are often good for people who are not experts in the language, especially if they need to read and understand a rule set.
  • Programmer-oriented syntaxes, on the other hand, are designed for people who know the language well.
  • An abstract syntax is, by definition, not directly usable; rather, it is mapped to one or more concrete syntaxes, each of which will be in one of the above styles. It is possible to have an abstract syntax, several normative concrete syntaxes, and several non-normative concrete syntaxes.
  • An RDF syntax (where the syntactic structures are described in RDF) has some of the appeal of an abstract syntax while being directly usable by machines. However, there is significant doubt whether a rule language can be defined with an RDF syntax and still have consistent semantics.
Conclusions:
The most obvious conclusion from the workshop is that there was significant interest in establishing a standard language for expressing rules.

  • Customers are demanding standards to protect their rule assets. They want portability across vendors, platforms, and applications. They want to be able to repurpose, reuse, and redistribute rule sets.
  • The standard should be simple. This field tends towards complexity, and we will seriously endanger deployment if we go that route. It should be simple to use and relatively simple to implement.
  • Compatibility with deployed and emerging technologies. In particular, compatibility with RDF, OWL, OMG PRR, and ISO Common Logic, along with common programming and rule methodologies will allow people to understand and adopt the work much more quickly.
  • A Working Group in this field should be given a narrow and well-defined scope. People should be able to see, early on, if the work is relevant to their uses for rules, instead of having the Working Group trying to prioritize from an overwhelming sea of features.
Reference:

Cruzar: An application of semantic matchmaking for eTourism in the city of Zaragoza

General Description
The web is a big showcase for cities that want to build their tourism industry. Nowadays, many tourists plan their trips in advance using the information that is available on web pages. That information often lives on information-bloated, multimedia-rich web sites which are little more than digital versions of printed brochures. Everyone receives the same information, regardless of their interests. This is unlike visiting a tourism office, where visitors receive customized information and recommendations based on their profile and desires.

CRUZAR is a web application that uses expert knowledge (in the form of rules and ontologies) and a comprehensive repository of relevant data (instances) to build a custom route for each visitor profile. CRUZAR can potentially generate an infinite number of custom routes and it offers a much closer fit for each visitor's profile.

There are a number of reasons that make this city an excellent test bed for such a project. In the first place, Zaragoza has a high density of Points of Interest (POIs). Zaragoza is one of the biggest cities in Spain, and it enjoys a very dynamic cultural agenda, as well as frequent top-level sport events. Finally, the city council has extensive databases with all the aforesaid information, including content in five languages.

Technical details of the solution
The first challenge was to collect the required data from existing relational databases which are used to feed the content of the Official Website of Zaragoza. This data was split across the following four information silos:



  • The CMS database which feeds the city council web site with pertinent information for tourists visiting Zaragoza: monuments, historical buildings of the city, restaurants, accommodation, green spaces, shopping areas and other relevant points of interest.
  • A database which contains up-to-date information about upcoming cultural events and leisure activities.
  • The city council web site which mainly displays photographs of the area.
  • IDEZar, a Geographic Information System hosted by the University of Zaragoza. It exposes REST web services to fetch maps as raster images and to compute the shortest path between two geo-referenced points of the city.

The information contained in these databases is transformed into RDF data using specific adapters. This process takes place every time the databases are updated.
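A hedged sketch of what such a database-to-RDF adapter might look like. The JDBC URL, the table and column names (pois, id, name) and the predicate URI are all hypothetical; CRUZAR's actual adapters and schema are not described in the source:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class PoiAdapter {
    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        // Hypothetical predicate URI; CRUZAR's real ontology is DOLCE-based.
        Property name = model.createProperty("http://example.org/cruzar#name");

        // Hypothetical JDBC connection, table and columns.
        Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/tourism", "user", "password");
        java.sql.Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, name FROM pois");
        while (rs.next()) {
            // Mint one RDF resource per database row.
            Resource poi = model.createResource("http://example.org/poi/" + rs.getInt("id"));
            poi.addProperty(name, rs.getString("name"));
        }
        conn.close();

        // Serialize the extracted triples.
        model.write(System.out, "N-TRIPLE");
    }
}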

Representing knowledge of the domain
An ontology is used to organize the RDF data. The CRUZAR ontology captures information about three types of domain entities: 1) Zaragoza’s tourism resources, mainly events and POIs, 2) user profiles to capture the visitors' preferences and their context, and 3) the route configuration. The conceptual structure of CRUZAR is based on the upper-ontology DOLCE.

Events and POIs are defined in terms of their intrinsic features: position, artistic style or date. Conversely, visitors’ profiles contain information on their preferences and their trip: arrival date, composition of the group, preferred activities, etc. In order to match the local information with the preferences, a shared vocabulary is needed. The central concept of this intermediate vocabulary is “interest”. Visitors’ preferences are translated to a set of “interests”, and POIs and events can attract people with certain “interests”. This translation is captured as production rules, which are executed using the Jena rule engine. These rules are simple enough to be easily understood by the domain experts.
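Since the rules run on the Jena rule engine, a minimal sketch of what one such profile-to-interest rule could look like is shown below. The URIs and the rule itself are invented for illustration; the source does not list CRUZAR's actual rules:

import java.util.List;

import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;
import com.hp.hpl.jena.reasoner.rulesys.Rule;

public class ProfileRules {
    public static void main(String[] args) {
        // A tiny visitor profile (hypothetical vocabulary).
        Model profile = ModelFactory.createDefaultModel();
        Property withChildren = profile.createProperty("http://example.org/profile#travelsWithChildren");
        Resource visitor = profile.createResource("http://example.org/visitor/1");
        visitor.addProperty(withChildren, "true");

        // Hypothetical rule: a visitor travelling with children acquires the
        // "family activities" interest. Full URIs are used in the rule text.
        String rule =
            "[family: (?v http://example.org/profile#travelsWithChildren 'true') " +
            "-> (?v http://example.org/match#hasInterest http://example.org/match#FamilyActivities)]";

        List<Rule> rules = Rule.parseRules(rule);
        InfModel inf = ModelFactory.createInfModel(new GenericRuleReasoner(rules), profile);

        // The inferred model now contains the derived interest triple.
        Property hasInterest = inf.getProperty("http://example.org/match#hasInterest");
        System.out.println(visitor.getURI() + " interests: "
                + inf.listObjectsOfProperty(visitor, hasInterest).toList());
    }
}

Rules in this plain-triple style are indeed simple enough for domain experts to read.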

POI ranking
All the POIs in Zaragoza are dynamically ranked to reflect their “subjective interest” according to the profile of each visitor. At the end of the matchmaking process, a numerical score is assigned to all POIs to quantify their anticipated level of interest. Initially every POI has a static score or relevance which was decided by the experts of the domain (“objective interest”). The semantic matchmaking process is executed individually for each POI, and its output is a calculated score for the resource (“subjective interest”). The value of this score depends on how many of the visitor’s interests (derived from their profile) are fulfilled by each POI.
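A hedged sketch of this two-stage scoring idea follows. The exact formula CRUZAR uses to combine the expert score with the interest matches is not given in the source, so the combination below is an assumption:

public class PoiRanking {
    // objectiveScore: static relevance assigned by the domain experts.
    // matchedInterests: how many of the visitor's interests the POI fulfils.
    // totalInterests: how many interests the visitor's profile yields.
    static double subjectiveScore(double objectiveScore, int matchedInterests, int totalInterests) {
        if (totalInterests == 0) {
            return objectiveScore; // no profile information: fall back to the expert score
        }
        double matchRatio = (double) matchedInterests / totalInterests;
        return objectiveScore * (1.0 + matchRatio); // hypothetical combination
    }

    public static void main(String[] args) {
        // A POI with expert score 0.6 that fulfils 3 of the visitor's 4 interests.
        System.out.println(subjectiveScore(0.6, 3, 4)); // prints 1.05
    }
}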

Route Planning
After all the candidate POIs have been sorted by their subjective interest, a planner algorithm is run in order to create the route. The main driving force of the algorithm is to balance the quantity and quality (interestingness) of the selected POIs against the distance travelled.


Route customization
The route proposed by the system is offered to the user using an accessible, information-rich interface that includes: the sequence of selected POIs, a tentative timetable for each day, a map highlighting the POIs, suggestions of other interesting places near the route, and two sets of recommended restaurants near the last POI of the route. Complementary activities, such as events (music concerts, sport events, etc.) and shopping, are also suggested. Users can interact with the generated route in a number of ways. 

Key Benefits of Using Semantic Web Technology
Semantic web technologies are put into practice:
  • to integrate and to organize data from different sources
  • to represent and to transform user profiles and tourism resources
  • to capture all the information about the generated routes and their constraints.

CRUZAR implements a matchmaking algorithm between objects that are described in RDF, and it pipes the results to a planner algorithm. At the same time, it offers an innovative service for visitors to plan their trip in advance, exploiting expert knowledge. These features are often used as important examples to illustrate the promise of the Semantic Web.

References:
http://www.w3.org/2001/sw/sweo/public/UseCases/Zaragoza-2/

Thursday 1 December 2011

IBM FileNet P8 Platform
FileNet is a company that developed software to help enterprises manage their content and business processes. The FileNet P8 platform is a framework for developing custom enterprise systems. FileNet combines enterprise content management reference architecture with comprehensive business process management and compliance capabilities. The FileNet P8 platform is a key element in creating an agile, adaptable Enterprise Content Management (ECM) environment necessary to support a dynamic organization that must respond quickly to change.

You can use the workflow software to create, modify, manage, analyze, and simulate workflows (also referred to as business processes) that are performed by applications, enterprise users, and external users such as partners and customers. The functionality to define your workflows extends from the integrated expression builder, which provides a means of associating complex business rules with the various routes between workflow steps.

Examples of automated processes:
Use FileNet Process software to automate the flow of work to complete a structured business process. Examples of automated processes include:

- Circulating documents for a systematic review and approval process
- Processing new employee paperwork
- Submitting travel expense reports for approvals and payment
- Handling customer queries

Starting with FileNet Process Applications:
Multi-step business processes centre on the systematic routing of documents and information, with each step completed by the appropriate participant or an automated program. An individual workflow automates the routing and processing of a particular type of document, or set of documents, for a specific business process.

In a process system, different users perform different activities:
Participant: participates in and launches workflows
Workflow administrator: manages work in progress
Workflow author: designs workflows
System administrator: sets up and maintains a Process system
Developer: develops custom applications

Integrating business rules
Workflow authors and business analysts can create and add business rules to individual steps of a workflow definition. You can use third party rules software to separate the business rules from the process, making it easier for a business analyst to independently manage the process and the rules behind the process, rather than modifying a workflow definition.

To implement rules functionality in a workflow, the workflow author and the business analyst work together to determine how rules will be used in the workflow, what decisions will be controlled by rules, what workflow data will be required, appropriate names for the rule sets, and the steps in a workflow where the rules will execute.

  Rules integration using web services
  A business rules management system leverages industry standard web services as a communication 
  mechanism for invoking business rules. FileNet P8 Business Process Manager provides the ability to 
  configure and invoke web services from within a workflow. Steps:
  - Author a rule
  - Deploy it to the business rules management system
  - Generate a Web Services Description Language (WSDL) file
  - Import the WSDL into the Process Engine
  - The business rule is then available to use within a workflow
  - The final step is to configure calls to the rules engine to execute the business rules as part of a workflow

  Rules Integration using the Rules Connectivity Framework
  The Process Engine server uses TCP/IP to communicate with the Rules Listener. The Rules Listener is 
  implemented in Java as a multi-threaded process. It hosts the rules vendor JAR file that implements the 
  rules vendor functionality. The rules vendor must provide a JAR file that contains an implementation of the 
  IFNRule listener interface in order for it to be invoked from the RCF. The IFNRule listener Java interface is 
  defined by IBM FileNet. The Rules Listener looks for the rules vendor JAR file and, if it is present, loads it 
  and enables the rules functionality.

Enterprise Content Management: a formalized means of organizing and storing an organization's documents, and other content, that relate to the organization's processes.

An Overview of W3C Semantic Web Activity:
The Semantic Web is an extension of the current Web in which the meaning of information is clearly and explicitly linked from the information itself. Through the World Wide Web Consortium (W3C) Semantic Web Activity, researchers and industrial partners want to enable standards and technologies that allow data on the Web to be defined and linked in such a way that it can be used for more effective discovery, automation, integration and reuse across various applications. The Internet will reach its full potential when data can be processed and shared by automated tools as well as by people.

The Semantic Web fosters greater data reuse by making data available for purposes not planned or conceived by the data provider. For example, suppose you want to locate news articles published in the previous month about companies headquartered in cities with populations under 500,000. The information may be there on the Web, but currently only in a form that requires intensive human processing.

The Semantic Web will allow:
- Information to surface in the form of data, so that a program doesn't have to strip the formatting, pictures and ads off a Web page and guess at how the remaining page markup denotes the relevant bits of information.
- People to generate files that explain, to a machine, the relationship between different sets of data. For example, one will be able to make a "semantic link" between a database with a "zipcode" column and a form with a "zip" field to tell the machines that they do actually mean the same thing. This will allow machines to follow links and facilitate the integration of data from many different sources (a small sketch follows below).
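A minimal Jena sketch of such a "semantic link", assuming hypothetical vocabulary URIs for the two data sources. A single owl:equivalentProperty triple is enough for an OWL-aware tool to merge data described with either property:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.vocabulary.OWL;

public class SemanticLink {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Hypothetical property URIs for the two independent data sources.
        Property zipcode = model.createProperty("http://db.example.org/schema#zipcode");
        Property zip = model.createProperty("http://forms.example.org/schema#zip");

        // One triple states that the two properties mean the same thing;
        // an OWL reasoner can then treat data using either one as equivalent.
        model.add(zipcode, OWL.equivalentProperty, zip);

        model.write(System.out, "N-TRIPLE");
    }
}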

Being "semantically linked" means that the Semantic Web will allow people to make relations with the data. Relationships such as hasLocation, worksFor, isAuthorOf, hasSubjectOf, dependsOn, etc., will allow web machines to find related information in a more natural way. At the moment these kind of relationships are there but only they can be understood by people.

The development of the Semantic Web is underway in at least two very important areas: (1) from the infrastructural and architectural position defined by W3C and (2) in a more directed application-specific fashion by those leveraging Semantic Web technologies in various demonstrations, applications and products.


Enabling standards:
Uniform Resource Identifiers (URIs) are fundamental for the current Web and are in turn a foundation for the Semantic Web. URIs provide the ability to uniquely identify resources of all types – not just Web documents – as well as relationships among resources. Besides URIs, the Extensible Markup Language (XML) and the Resource Description Framework (RDF) help to represent relationships and to convey meaning.

The W3C Semantic Web Activity plays a leadership role in both the design of specifications and the open, collaborative development of technologies focused on representing relationships and meaning and the automation, integration and reuse of data.

The base level standards for supporting the Semantic Web are currently being refined by the RDF Core Working Group. 

The Web Ontology Working Group (http://www.w3.org/2001/sw/WebOnt/) standards effort is designed to build upon the RDF core work a language, OWL (http://www.w3.org/TR/owl-ref/), for defining structured, Web-based ontologies. Ontologies can be used by automated tools to power advanced services such as more accurate Web search, intelligent software agents and knowledge management. Web portals, corporate website management, intelligent agents and ubiquitous computing are just some of the identified scenarios that helped shape the requirements for this work.

Semantic Web Advanced Development (SWAD):
SWAD-Europe aims to highlight practical examples of where real value can be added to the Web through Semantic Web technologies. Their focus is on providing practical demonstrations of (1) how the Semantic Web can address problems in areas such as sitemaps, news channel syndication, thesauri, classification, topic maps, calendaring, scheduling, collaboration, annotations, quality ratings, shared bookmarks, Dublin Core for simple resource discovery, Web service description and discovery, trust and rights management, and (2) how to integrate them effectively and efficiently.

The W3C is running some other projects, such as SWAD-Simile and SWAD-Oxygen.

Conclusion:
- The Semantic Web is an extension of the current Web.
- It is based on the idea of having data on the Web defined and linked such that it can be used for more effective discovery, automation, integration and reuse across various applications.
- It provides an infrastructure that enables not just Web pages, but databases, services, programs, sensors, personal devices and even household appliances to both consume and produce data on the Web.
- Software agents can use this information to search, filter and prepare information in new and exciting ways to assist Web users.
- New languages make significantly more of the information on the Web machine-readable.

Notes:
Both authors are part of the W3C and can be contacted by email Eric Miller em@w3.org and Ralph Swick swick@w3.org

References:
An Overview of W3C Semantic Web Activity: http://onlinelibrary.wiley.com/doi/10.1002/bult.280/full