Planning a Semantic Web site
mirror of http://www.ibm.com/developerworks/xml/library/x-plansemantic/index.html?ca=drs-
... it just follows its internal rules for displaying the page. It's up to you to
understand the information on the page.
Structuring data adds value to that data. With consistent structure, it can be used in
more ways. You can see the demand for structured data today in the proliferation of APIs
that have sprung up around Web sites as a part of the Web 2.0 trend--an _API_ is
structured data, and structured data from a variety of sources is what powers mashups. The
idea behind _mashups_ is that data is pulled from various sources on the Web and, when
combined and displayed in a unified manner, this combination of elements adds value over
and above the source information alone.
The individual APIs that everyone is busy building are meant to solve the exact same problem
that the Semantic Web is intended to address: Expose the content of the Web as data and
then combine disparate data sources in different ways to build new value. Rather than
build and maintain your own API, you can build your Web site to take full advantage of the
Semantic Web infrastructure that is already in place. If your Web site is your API, you
can reduce the overall development and maintenance effort. Similarly, rather than build custom
solutions for every Web site you want to pull data from, you can implement one solution
based on Semantic Web technologies and have it work interchangeably across many Web
sites--including Web sites you weren't even aware of before you began development.
Semantic Web technology overview
Semantic Web technologies can be considered in terms of layers, each layer resting on and
extending the functionality of the layers beneath it. Although the Semantic Web is often
talked about as if it were a separate entity, it is an extension and enhancement of the
existing Web rather than a replacement of it.
FIGURE 1. THE SEMANTIC WEB TECHNOLOGY STACK
As shown in Figure 1, the base layer of the Semantic Web is HTTP and URIs. These are
commonly considered 'Web' rather than 'Semantic Web', but every proposed Semantic Web
technology rests upon these Web fundamentals. URIs are the nouns of the Semantic Web; HTTP
supplies the verbs (GET, PUT, and POST), as well as a number of thoroughly tested solutions
in the fields of authentication and encryption.
The Resource Description Framework (RDF) is the workhorse of the Semantic Web. It is a
grammar for encoding relationships. An RDF triple has three components: a subject, a
predicate (or verb), and an object. Each can be expressed as a resource on the Web, that is,
a URI. This is far less ambiguous than encoding data in arbitrary XML documents. Compare the
different ways of expressing a simple relationship in XML given in Listing 1 with the RDF
triple in Listing 2.
LISTING 1. AMBIGUOUS RELATIONSHIPS IN XML
page Rob page
Listing 2 shows the RDF triple.
LISTING 2. EXPRESSING RELATIONSHIPS IN RDF
.
The relationship expressed in all the examples shown in Listing 1 is 'Rob is the
author of page'--a fairly simple statement--yet it is expressed in several ways in XML. It
would be very difficult to build software that can derive that relationship from all the
possible ways to express it in XML. But RDF expresses that relationship in only one
way, so it becomes feasible to build generic parsers.
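To make the contrast concrete, here is a minimal Python sketch. The three XML fragments are hypothetical shapes invented for illustration (they are not the markup of Listing 1), and the example.org URIs are placeholders; the Dublin Core creator predicate is a real vocabulary term. Each XML shape needs its own hand-written extraction branch, while the triple needs only one generic parser.

```python
import re
import xml.etree.ElementTree as ET

# Three hypothetical XML encodings of "Rob is the author of page".
XML_VARIANTS = [
    "<author><uri>page</uri><name>Rob</name></author>",
    "<person name='Rob'><work>page</work></person>",
    "<document href='page' author='Rob'/>",
]

def author_from_xml(fragment):
    """Return (page, author); one hand-written branch per document shape."""
    root = ET.fromstring(fragment)
    if root.tag == "author":
        return root.findtext("uri"), root.findtext("name")
    if root.tag == "person":
        return root.findtext("work"), root.get("name")
    if root.tag == "document":
        return root.get("href"), root.get("author")
    raise ValueError("unknown shape: " + root.tag)

# One statement in N-Triples style: <subject> <predicate> <object> .
TRIPLE = ("<http://example.org/page> "
          "<http://purl.org/dc/elements/1.1/creator> "
          "<http://example.org/staff/rob> .")

TRIPLE_RE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+\.$")

def parse_triple(line):
    """Parse a URI-only triple; the same code works for any predicate."""
    match = TRIPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a URI-only triple: " + line)
    return match.groups()

xml_results = [author_from_xml(x) for x in XML_VARIANTS]
subject, predicate, obj = parse_triple(TRIPLE)
```

Every new XML shape would need another branch in author_from_xml, whereas parse_triple stays the same no matter what relationship the predicate names.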
In the early days of the Semantic Web, it was hoped that content producers would make all
their content available in RDF and soon make a plethora of data available. Unfortunately,
perhaps because the main XML expression of RDF looked unnecessarily complex, uptake was
slow. More succinct RDF representations, like Notation3 (N3) and Terse RDF Triple Language
(Turtle), are now available but have been unable to overcome the inertia. (For more on N3
and Turtle, see Resources.) A solution to the problem was inspired by the
Microformats approach. With Microformats, semantic value is added to existing HTML content
by using consistent patterns of standard HTML elements and attributes. Microformats exist
for narrow but common items of data such as contact information and calendar items. The
W3C equivalent is RDFa, which is RDF data embedded in XHTML. The implementation is slightly
more complex than Microformats but it is far more generic--anything that you can express in
RDF, you can add to XHTML documents using RDFa. Through this technique the Semantic Web
can be bootstrapped by existing Web content.
Of course, RDF embedded in XHTML documents as RDFa is of no use to Semantic Web tools
that require plain RDF as input. There needs to be an automatic method to recognize
the presence of RDFa content and extract the RDF from it. The W3C solution for this is
Gleaning Resource Descriptions from Dialects of Languages (GRDDL). The idea is that you
run an existing XHTML document through an XSL transform to generate RDF. You associate
the GRDDL transform with the document either through direct inclusion of references or
indirectly through profile and namespace documents.
Unambiguously expressed semantics with RDF are good, but even if everyone published RDF, it
would be of little use if you had no idea how the RDF from different sites was related. The
RDF triple in Listing 2 expressed an author relationship in the predicate, and while the
meaning might seem obvious to you, computers still need some help. If I expressed an
author relationship in an RDF file on my site and you did the same on yours, could the
computer assume they were the same thing? What if you instead had a writer relationship in
your RDF triple? What you need is a way to express a common vocabulary, to be able to say
that my author and your author are the same thing, or that 'author' and 'writer' are
analogous. On the Semantic Web this problem is solved by _ontologies_, and the W3C standard
for expressing ontologies is the Web Ontology Language (OWL). OWL is a large subject in its
own right, and since this article is concerned only with its applications, see Resources for
more information about it.
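As a rough illustration of what a shared vocabulary buys you, the sketch below normalizes predicates using a hand-written equivalence table. In practice such a mapping would be declared in an OWL ontology (for example, with an owl:equivalentProperty statement) rather than in a Python dict; the example.org and example.com URIs and the writer property are hypothetical.

```python
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
# Hypothetical predicate another site might use for the same relationship.
EX_WRITER = "http://example.com/terms/writer"

# Stand-in for what an ontology would declare: writer means the same as creator.
EQUIVALENT_TO = {EX_WRITER: DC_CREATOR}

def normalize(triples):
    """Rewrite each predicate to its canonical equivalent, if one is declared."""
    return {(s, EQUIVALENT_TO.get(p, p), o) for (s, p, o) in triples}

ROB = "http://example.org/staff/rob"
my_site = {("http://example.org/page", DC_CREATOR, ROB)}
your_site = {("http://example.com/post/1", EX_WRITER, ROB)}

merged = normalize(my_site | your_site)
# After normalization, one lookup on DC_CREATOR finds the work from both sites.
works_by_rob = sorted(s for (s, p, o) in merged if p == DC_CREATOR and o == ROB)
```

Without the equivalence table, the lookup on DC_CREATOR would silently miss the post that was described with the writer predicate.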
Once you have some sources of data in RDF, and you have ontologies to let you determine
the relationships between them, you need a way to get useful information out of them. The
SPARQL Protocol and RDF Query Language (SPARQL, pronounced 'sparkle') provides an SQL-like
syntax for expressing queries against RDF data, and the patterns in a query look much like
RDF triples themselves. The fundamental paradigm for SPARQL is pattern matching, and it is
designed to be flexible and to work across the Web on data combined from disparate sources.
For example, matches can be described as optional, which makes it much better than SQL at
querying ragged data. _Ragged data_ has an unpredictable and unreliable structure, which
is what you might expect to find if your data is combined from various sources on the Web
rather than from a single well-contained SQL database.
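The following sketch shows what an optional match means on ragged data. The QUERY string is a SPARQL query of the kind described above, included for reference only; it is not executed. The Python function is a toy stand-in for a SPARQL engine, and the example.org data is made up for illustration.

```python
# The kind of query a SPARQL engine would be given (for reference; not executed here).
QUERY = """
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?page ?name
WHERE {
  ?page dc:creator ?author .
  OPTIONAL { ?author foaf:name ?name }
}
"""

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Ragged data: the second author has no foaf:name.
TRIPLES = [
    ("http://example.org/page1", DC_CREATOR, "http://example.org/staff/rob"),
    ("http://example.org/page2", DC_CREATOR, "http://example.org/staff/sam"),
    ("http://example.org/staff/rob", FOAF_NAME, "Rob"),
]

def pages_with_optional_names(triples):
    """Toy equivalent of the query: required creator match, optional name."""
    names = {s: o for (s, p, o) in triples if p == FOAF_NAME}
    rows = []
    for (page, p, author) in triples:
        if p == DC_CREATOR:
            # OPTIONAL: keep the row even when no name is known.
            rows.append((page, names.get(author)))
    return rows

rows = pages_with_optional_names(TRIPLES)
```

The page whose author has no name still appears in the result, with the name left unbound (None here). An SQL inner join over the same data would drop that row.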
Things you need to know when planning a Semantic Web site
As you've already seen, if you build the next great Web 2.0 site, you can save time if
you plan from the start to embrace Semantic Web technologies and turn your Web site into
an API, rather than create a separate API for your Web site. A Semantic Web approach gives
you free API-like functionality. Usually an API is a way to get structured data, in XML or
JSON format, out of an otherwise unstructured Web site. This leads to a dual approach: You
have Web pages for human consumption and you have an API where computers can pull out
structured information for automatic processing. However, this creates extra work for you;
if you expect people to make use of your API, then you have to document it and support it
and keep it synchronized with new features on your Web site. With a Semantic Web approach,
your Web site is the structured data. You don't need a separate implementation. You and
your users can take advantage of other Semantic Web tools to do automatic processing.
This does raise some issues for planning. With an API you are free to define your own
data format for each item of information you want to deliver, and in the Semantic Web this
is analogous to defining your own ontology. Ontology design can be difficult to
get right with little experience, so you should consider whether any of the large array of
existing ontologies is suitable for the types of data you plan to use; the next section
discusses some of them. When you design an API, you also usually consider an
object model for conceptual organization so developers can understand when they get
collections of items or just items, and which collections their items belong in. On a
Semantic Web site this will be partly determined by your ontology choices, but also by
your URI scheme. Next, you'll look at approaches to making your URIs usable as part of
your API.
Finally, on an existing Web site, you and your users can still benefit from the Semantic
Web, if you update your content to take advantage of GRDDL, RDFa and Microformats.
Evaluate your data in the context of existing ontologies
A more complex part of the Semantic Web is to design an ontology that matches up to your
data. Arriving at the right ontology is usually a critical element of successful
implementation of Semantic Web projects. Fortunately, many ontologies already exist. Table
1 lists some of them.
TABLE 1. SOME ONTOLOGIES IN USE ON THE WEB TODAY
Dublin Core
This metadata element standard for cross-domain information resource description
provides a simple and standardised set of conventions for describing things online in ways
that make them easier to find.
SIOC
The Semantically-Interlinked Online Communities (SIOC) Project provides an ontology that
expresses the information contained both explicitly and implicitly in Internet discussion
methods, such as blogs, forums, and mailing lists.
FOAF
The Friend of a Friend ontology describes individuals, their activities and their
relations to other people and objects. FOAF allows the description of social networks in a
distributed fashion.
DOAP
Description Of A Project is an ontology to describe open-source projects.
ResumeRDF
This ontology expresses a Resume or Curriculum Vitae (CV), including information such
as work and academic experience or skills.
In addition, many ontologies are domain specific in fields such as technology,
environmental science, chemistry and linguistics. These will apply to fewer Web sites than
those listed above, however. A lot of your data is likely to fit into at least one of the
areas covered by the ontologies in Table 1, in which case you can incorporate them in
your planning.
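One way to start that evaluation is to list the properties your data needs and check which of them an existing vocabulary already names. The sketch below assumes the commonly published namespace URIs for four of the ontologies in Table 1; the shop prefix and its price term are hypothetical stand-ins for site-specific data.

```python
# Namespace URIs for four of the vocabularies in Table 1.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "sioc": "http://rdfs.org/sioc/ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "doap": "http://usefulinc.com/ns/doap#",
}

def expand(term):
    """Expand a prefixed name such as 'foaf:name' into a full URI."""
    prefix, _, local = term.partition(":")
    if prefix not in NAMESPACES or not local:
        raise KeyError("unknown prefix or missing local name: " + term)
    return NAMESPACES[prefix] + local

def covered(terms):
    """Split terms into those with a known vocabulary prefix and the rest."""
    known, unknown = [], []
    for term in terms:
        prefix = term.partition(":")[0]
        (known if prefix in NAMESPACES else unknown).append(term)
    return known, unknown

# Properties a hypothetical product page might need.
known, unknown = covered(["dc:title", "dc:creator", "foaf:name", "shop:price"])
```

Whatever ends up in unknown is the part of your data for which you would still need to find, or design, an ontology.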
Choose a Semantic URI scheme
If your Web site is your API, then your URIs are the methods that programmers will access
to get data. A sensible, succinct and consistent structure is therefore very important,
and you need to think about it in advance because frequent changes after everything is
launched will cost the goodwill of your target audience. You should also remember that the
components of an RDF triple are usually URIs. Changing them will invalidate most existing
RDF that refers to your Web site.
In the early days of the Web, the structure of the URI usually reflected the organization
of the files on a Web server. If you sold a particular type of widget among a collection
of products, its URI might be similar to:
http://www.mysite.com/products/gadgets/widget.html.
The advantage of this approach is that it is relatively semantically clear; if you also
sold a doodad, then an obvious URI where you might expect to find the product details is:
http://www.mysite.com/products/gadgets/doodad.html.
The relationship between the widget and the doodad is fairly clear. The main problem is
that this approach is inflexible; the categorization hierarchy is fixed.
As the Web advanced, dynamically generated sites became the norm. But while the sites
became more flexible, with structure no longer tied to a particular layout of files, the
amount of semantic information in the URI decreased. The page you are shown is determined
by some rather cryptic information in the query string. For instance, the URI of the
widget might be: http://www.mysite.com/inventory.cgi?pid=12345 and the URI of the doodad
might be: http://www.mysite.com/inventory.cgi?pid=67890.
Suddenly the URI gives you very little semantic value. It's certainly not clear that
these two products might be in the same category. More recently, content management
systems and Web development frameworks have started to address this issue. Now it's much
easier to have semantically structured URIs yet retain the flexibility of dynamic pages.
This is achieved through URIs that refer not to a physical file on the server, but to
content which can be delivered from a script or page in a different location. In the
trend-setting Ruby on Rails framework, this is achieved through _routes_ (rules that map
matching URLs to specific controllers and actions). In CMS packages, the feature usually
depends on Apache's mod_rewrite (or an equivalent on other Web servers) and is often referred
to as "Search Engine Friendly URIs" or something similar. When you choose a CMS or
development framework for your site, be sure to investigate what it is capable of in this
regard.
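The idea behind such routing rules can be sketched in a few lines of framework-neutral Python. This is not the Rails or mod_rewrite syntax; the route table, handler names, and path layout are hypothetical and only illustrate how a structured URI can be mapped to dynamic code instead of a file on disk.

```python
import re

# Hypothetical route table: URI pattern -> handler name.
ROUTES = [
    (re.compile(r"^/products/(?P<category>[^/]+)/(?P<product>[^/]+)$"), "show_product"),
    (re.compile(r"^/products/(?P<category>[^/]+)$"), "list_category"),
]

def resolve(path):
    """Return (handler, params) for the first matching route, else None."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler, match.groupdict()
    return None

product = resolve("/products/gadgets/widget")
category = resolve("/products/gadgets")
```

The URI keeps its semantic structure (category, then product) while the handler is free to build the page from a database, so recategorizing a product does not require moving files.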
One final note: If possible, consider removing file name extensions from your URIs. The
file name extensions (.html and .cgi) provide no semantic information that is relevant to
the user and actually cause problems in the long run. If you change your Web site to use
PHP instead of CGI scripts, you suddenly have different URIs that serve exactly the same
content. This is bad for the semantic value of your URIs, as well as your Google ranking!
A more semantically elegant method is to take advantage of the HTTP headers to do content
negotiation. Consider the following URI: http://www.mysite.com/products/gadgets/widget.
A Web browser will generally indicate its preferred content type using the Accept HTTP
header. When asked for this resource, the Web server can check that header, note that
text/html is one of the options, and serve an HTML page. If you have a mashup application
that wants RDF, then the Accept header in the HTTP request should contain
application/rdf+xml and the Web server, from the same URI, can serve an RDF version of the
page.
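The server-side decision described above can be sketched as follows. This is a simplified illustration, not a complete implementation of HTTP content negotiation: it honors q-values and the */* wildcard but ignores partial wildcards such as text/*, and the list of available representations is an assumption.

```python
# Representations this hypothetical server can produce for one resource.
AVAILABLE = ["text/html", "application/rdf+xml"]

def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        prefs.append((fields[0].lower(), q))
    # sorted() is stable, so ties keep the order the client listed them in.
    return sorted(prefs, key=lambda pair: -pair[1])

def choose(header, available=AVAILABLE):
    """Pick the representation to serve, or None if nothing is acceptable."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None

browser = choose("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
mashup = choose("application/rdf+xml")
```

Both requests target the same URI; only the Accept header differs, so the URI scheme never has to encode the format.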
At present this content negotiation functionality is not available in many off-the-shelf
CMS solutions, but in the short term it should be possible for a lot of them to use URIs
without file extensions, which means you can add this functionality in the future without
upsetting your URI scheme.
Take advantage of existing semantic add tools
Whether you fully embrace the Semantic Web in your Web site infrastructure, or just want
to make your existing content more useful, there are probably several opportunities to add
structure to existing content on your Web site. This is the domain of Microformats, RDFa
and GRDDL. Table 2 lists the more common information types that you can easily mark up as
structured data.
TABLE 2. OPPORTUNITIES FOR STRUCTURED MARKUP AND AUTOMATIC TRANSFORMATION
INFORMATION TYPE                 STRUCTURED MARKUP
People and Organizations         hCard, RDF vCard
Calendars and Events             hCalendar, RDF Calendar
Opinions, Ratings and Reviews    VoteLinks, hReview
Social Networks                  XFN, FOAF
Licenses                         rel-license
Tags, Keywords, Categories       rel-tag
Lists and Outlines               XOXO
Adding the structured markup to your page is fairly simple. Listings 3 and 4
below show a fragment of HTML containing contact information without, and then with, the
additional markup required for the RDF vCard, respectively.
LISTING 3. UNSTRUCTURED CONTACT INFORMATION
Rob Crowther. Web hacker at Example.org . You can contact me via e-mail or on my
work phone at 0123 456789.
Listing 4 shows the contact information with additional markup required for the RDF
vCard.
LISTING 4. CONTACT INFORMATION USING VCARD
Rob Crowther. Web hacker at Example.org . You can contact me via e-mail or on my
work phone at 0123 456789 .
In Listing 4, you can see span elements added to delimit the semantically
significant bits of text, and attributes that indicate what they mean. You added the
namespace "contact" linked to the RDF vCard vocabulary. Next, you indicated that this
element is _about_ the resource represented by the URI http://example.org/staff/robertc.
Then, you added metadata using the rel attribute for link relationships and the property
attribute on non-links. The only slightly complex part is the telephone because you need
to specify a type as well as the number. To achieve this, you nest the type and value
elements inside the tel element. Adding this structure allows users to add the contact
details to their address book with a single click of the mouse.
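To show how software might consume such markup, here is a small Python sketch. The HTML fragment is an illustrative reconstruction based on the description above, not the exact markup of Listing 4: the property names (contact:fn, contact:title, contact:org, contact:email), the e-mail address, and the namespace URI are assumptions, and the nested telephone structure is left out for brevity. The extractor is a toy, not a conforming RDFa processor.

```python
from html.parser import HTMLParser

# Illustrative reconstruction of RDFa-style markup; names and values are assumed.
FRAGMENT = """
<p xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#"
   about="http://example.org/staff/robertc">
  <span property="contact:fn">Rob Crowther</span>.
  <span property="contact:title">Web hacker</span> at
  <a rel="contact:org" href="http://example.org">Example.org</a>.
  You can contact me
  <a rel="contact:email" href="mailto:rob@example.org">via e-mail</a>.
</p>
"""

class ContactExtractor(HTMLParser):
    """Collect (subject, predicate, object) tuples from about/rel/property."""

    def __init__(self):
        super().__init__()
        self.subject = None
        self.triples = []
        self._property = None  # property whose text content is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "rel" in attrs and "href" in attrs:
            # Links: the object is the link target.
            self.triples.append((self.subject, attrs["rel"], attrs["href"]))
        if "property" in attrs:
            self._property = attrs["property"]

    def handle_data(self, data):
        if self._property and data.strip():
            # Non-links: the object is the element's text.
            self.triples.append((self.subject, self._property, data.strip()))
            self._property = None

    def handle_endtag(self, tag):
        self._property = None

extractor = ContactExtractor()
extractor.feed(FRAGMENT)
extractor.close()
triples = extractor.triples
```

Each tuple pairs the resource named in the about attribute with one piece of contact data, which is the information an address-book tool would need.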
Other automatic processing is possible with the other structured forms; for example,
Technorati makes use of the rel-tag microformat to categorize its vast aggregation of blog
posts. A rel-tag is shown in Listing 5, and as you can see, it is simply a link that makes
use of the rel attribute. The significant part is the last bit of the URI, after the final
/. This is the tag (using the normal URI encoding conventions where a space is represented
by the plus sign).
LISTING 5. REL-TAG FOR TECHNORATI FOR THE TAG 'SEMANTIC WEB'
Semantic Web
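The rule for reading a tag out of a rel-tag link (take the last URI segment and decode it) can be sketched in Python. The Technorati-style URL used here is a hypothetical example that follows the pattern described above; it is not copied from the listing.

```python
from urllib.parse import unquote_plus, urlsplit

def tag_from_href(href):
    """Return the tag named by a rel-tag link: the last path segment, decoded."""
    path = urlsplit(href).path
    last_segment = path.rsplit("/", 1)[-1]
    # unquote_plus turns '+' into a space and decodes %XX escapes.
    return unquote_plus(last_segment)

# Hypothetical tag URL, following the pattern the text describes.
EXAMPLE_HREF = "http://technorati.com/tag/Semantic+Web"
tag = tag_from_href(EXAMPLE_HREF)
```

Because the tag lives in the URI rather than in the link text, the visible text of the link can be anything without changing how the post is categorized.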
If you write a blog post related to the Semantic Web that includes the code from Listing
5 and then ping Technorati to let them know you made a new post (a lot of blog
software can be configured to do this automatically), then their crawler will index your
post and add a summary of it to the page that your tag element links to, along with any
other posts with the same tag (see Figure 2).
FIGURE 2. THE 'SEMANTIC WEB' PAGE ON TECHNORATI, GENERATED FROM REL-TAG
Conclusion
In this article, you saw how Semantic Web technologies address the need for structured
data on the Web in a standard and consistent manner, in contrast to the currently popular
method of each Web site defining its own API. You looked at how the Semantic Web
technologies add value in layers on top of the HTTP and URIs of the existing Web, first
allowing the unambiguous expression of relationships with RDF, then allowing for
shared meaning with OWL-based ontologies, and finally querying the distributed Web of
knowledge using SPARQL. The article also looked at how you can take advantage of existing
ontologies to define what your data is and use a semantic URI scheme to enable your Web
site to also be your API. Finally, the article looked at how you can upgrade the content of
your existing Web site using RDFa and Microformats so that GRDDL services can
automatically extract RDF from your pages.
Although the promise of Tim Berners-Lee's Semantic Web is yet to be fully realized, the
years of thinking and research that have gone into it are starting to bear fruit in terms
of solutions to practical problems that people face today. The strong collaboration trends
in Web 2.0 will only lead to more requirements for structured and semantically encoded
data being available on the Web. With some planning, you can be in a position to take
advantage of the Semantic Web tools that help meet that need.
Resources
LEARN
The ultimate mashup--Web services and the semantic Web (Nicholas Chase,
developerWorks, August 2006): Practice using Semantic Web techniques with this six-part
tutorial series.
Introduction to Jena: Use RDF models in your Java applications with the Jena Semantic
Web Framework (Philip McCarthy, developerWorks, June 2004): Find out how to use the
Jena Semantic Web Toolkit to exploit RDF data models in your Java applications.
Programmable Web: Stay up to date with the latest on mashups and the new Web 2.0
APIs.
The Structured Web - A Primer: Read a general introduction to the value of
structured data.
The W3C's RDF Primer: Learn the basics of RDF and how to use it effectively.
A Semantic Web Primer for Object-Oriented Software Developers: Read how to use
ontology languages, such as RDF Schema and OWL, in the context of OOP.
The W3C's OWL Overview: Get an understanding of what OWL can do for apps that
process information content instead of just presenting it to humans.
The SPARQL Query Language for RDF specification: Explore the syntax and semantics
of this query language for RDF.
Notation3: Read about N3, a compact and readable alternative to RDF's XML syntax.
Terse RDF Triple Language: Check out Turtle, a textual syntax for RDF that writes
RDF graphs in a compact and natural text form, with abbreviations for common usage
patterns and datatypes. Turtle is compatible with existing N-Triples and Notation 3
formats and the triple pattern syntax of SPARQL.
Cool URIs for the Semantic Web: Read guidelines for effective URIs as the link
between RDF and the Semantic Web.
University of Southampton Department of Electronics and Computer Science: See a
Semantic Web site in action.
RDFa or Microformats: Embed semantic information in your Web pages.
IBM XML certification: Find out how you can become an IBM-Certified Developer in
XML and related technologies.
XML technical library: See the developerWorks XML Zone for a wide range of
technical articles and tips, tutorials, standards, and IBM Redbooks.
developerWorks technical events and webcasts: Stay current with technology in
these sessions.
The IBM developerWorks XML zone: Learn more about XML and the Semantic Web.
The technology bookstore: Browse for books on these and other technical topics.
GET PRODUCTS AND TECHNOLOGIES
IBM trial software: Build your next development project with trial software
available for download directly from developerWorks.
DISCUSS
Participate in the discussion forum.
XML zone discussion forums: Participate in any of several XML-related discussions.
developerWorks XML zone: Share your thoughts: After you read this article, post
your comments and thoughts in this forum. The XML zone editors moderate the forum and
welcome your input.
developerWorks blogs: Check out these blogs and get involved in the developerWorks
community.
About the author
Rob Crowther is a Web Developer from London. He has a keen interest in Web Standards and
blogs sporadically at http://www.boogdesign.com/b2evo/.