FEATURED | Semantic Content Generation for News Domain
IndraStra Global

By Rahul Guhathakurta

News in its true form was never meant to be just a piece of information; it has always been shaped by the reader’s affiliation with one or more particular causes and their consequent effects. Looking back, news content distributors like Thomson Reuters, Associated Press and the former ITAR-TASS were always in the business of creating information that delivered a country’s objectives and, eventually, portrayed the ideological make-up of its people. Governments and revolutionaries soon understood the power of news in its various forms, and it gradually became an integral part of domestic and international propaganda systems, directly serving a nation’s primary objectives through various controlled information-release formats.

With the start of the digital revolution through the medium of the Internet, the penetration and distribution of controlled as well as uncontrolled content among the masses became more prominent. Plainly speaking, we are at the cusp of a very smart content revolution. Earlier, the news syndicates had the upper hand, and we used to pay a hefty price to syndicate or source content. Now we have news search engines armed with crawler bots, which control the very distribution of our content in particular regions of the world through embedded geo-targeting codes and through search-engine-optimized keywords, which largely determine the web hits an article page receives from a particular city, region or country. With the rise of mobile Internet penetration, raw news is easily available over the web in the form of images and videos; the only cost we incur lies in understanding that piece of information, analyzing it, authenticating it and finally structuring it for the end readers. The delivery of news and its allied content now happens subconsciously through various social media networks and news aggregators under the concept of “Virality Factors”. Currently, the most important question concerns the survivability of news syndicates as organizations, which depends completely on their capacity to deal with this heterogeneous data in a proficient way. It is, of course, time to re-invent the whole business wheel.

The Rise of Semantic Web:

The next-generation digital news industry is based on the assumption that consumers, in practice mainly viewers, will become participants [1]. This assumption implies the need for interactive devices, content adaptation, and management of new distribution channels. The digital news industry therefore needs a new architecture and new structures for content management, with the aim of improving knowledge management and information retrieval.

Content generation routines, a creative process whose digitalization started a few years ago, involve people, and it is very important to track the impact that this change (to digital news) will have on them. In addition, the archive system can be used for search and retrieval functions.

The Concept of Obsoledge:

The shelf-life of knowledge was always short, and now it has got shorter. Most endeavors are focused on developing a learning curve by converting the knowledge of experts into a structured web of information. The outcome is innately exhausting and not at all enticing. More accessible technology platforms, like smartphones on 3G/4G networks, are causing an explosion of information, which makes the shelf-life of knowledge shorter and shorter. In his book Revolutionary Wealth, Alvin Toffler coined the term Obsoledge to refer to this increase of obsolete knowledge.

Algorithm Based Content Curation, Just an Auxiliary Support:

OpenCalais, a Thomson Reuters tool launched in 2008, is based upon Natural Language Processing (NLP) and machine learning algorithms that can examine any news article, understand what it is about, and connect it to related media. This is more than a simple keyword search. OpenCalais extracts “named entities” and analyzes sentence structure to determine the topic of the article. It is also able to understand facts and events. For example, when fed a short article about a refugee crisis forming near the Hungarian border, an OpenCalais demo tool recognized locations, facilities like refugee camps and even an occupation like “border police”. It also understood facts, synthesizing a subject-verb-object phrase to express that a UN refugee monitoring cell had released various press releases on Syrian refugees and their transit path throughout the European continent. OpenCalais has already been put to work at a wide range of news organizations, including The Nation, The New Republic, Slate, GulfNews and Aljazeera. Each site’s implementation is unique; for example, DailyMe uses semantic data to monitor each user’s reading habits, presenting the user with personalized reading suggestions. Both The Nation and The New Republic saw immediate benefits from OpenCalais. According to Thomson Reuters, its adoption coincided with significant gains in time-on-site, and the tool automatically generates pages dedicated to a single topic, which had been a labor-intensive process for editors.
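To make the idea of typed "named entities" concrete, here is a minimal sketch in Python. It is not OpenCalais and does not call its API: it is a plain gazetteer lookup, far simpler than the statistical NLP described above, and the `GAZETTEER` contents, the `tag_entities` function and the sample sentence are all assumptions made up for illustration.

```python
import re

# Hypothetical gazetteer: entity type -> known surface forms (illustrative only).
GAZETTEER = {
    "Location": ["Hungary", "Syria", "Europe"],
    "Facility": ["refugee camp"],
    "Position": ["border police"],
}


def tag_entities(text):
    """Return sorted (entity_type, matched_text) pairs found in the text."""
    found = set()
    for entity_type, names in GAZETTEER.items():
        for name in names:
            # Whole-word, case-insensitive match; an optional trailing "s"
            # covers simple plurals such as "refugee camps".
            pattern = r"\b" + re.escape(name) + r"s?\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.add((entity_type, match.group(0).lower()))
    return sorted(found)


# Made-up sample sentence in the spirit of the refugee-crisis example.
article = ("Border police reported that refugee camps near the border "
           "of Hungary are filling with families who fled Syria.")
entities = tag_entities(article)
for entity_type, surface in entities:
    print(entity_type, "->", surface)
```

A real entity extractor would also disambiguate entities and derive facts and events from sentence structure, which a lookup table like this cannot do.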

Defining the Ontological Infrastructure: 

As a result of applying the XML Semantics Reuse methodology, we have obtained a set of ontologies that reuse the semantics of the underlying standards, as they are formalised through the corresponding XML Schemas. All the ontologies related to journalism standards, i.e. NewsCodes, NITF and NewsML, are available from the Semantic Newspaper site. The MPEG-7 Ontology is available from the MPEG-7 Ontology site.

The ontologies that are going to be used as the basis for the info-structure of the semantic newspaper are: 
  • NewsCodes Subjects Ontology: an OWL ontology for the subjects’ part of the IPTC NewsCodes. It is a simple taxonomy of subjects but it is implemented with OWL in order to facilitate the integration of the subjects’ taxonomy in the global ontological framework. 
  • NITF 3.3 Ontology: an OWL ontology that captures the semantics of the XML Schema specification of the NITF standard. It contains some classes and many properties dealing with document structure, i.e. paragraphs, subheadlines, etc., but also some metadata properties about copyright, authorship, issue dates, etc.
  • NewsML 1.2 Ontology: the OWL ontology resulting from mapping the NewsML 1.2 XML Schema. Basically, it includes a set of properties useful to define the news structure as a multimedia package, i.e. news envelope, components, items, etc. 
  • MPEG-7 Ontology: The XSD2OWL mapping has been applied to the MPEG-7 XML Schemas producing an ontology that has 2372 classes and 975 properties, which are targeted towards describing multimedia at all detail levels, from content based descriptors to semantic ones. 
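A subjects taxonomy expressed in OWL boils down to a set of subclass statements that software can traverse. The following Python sketch shows this with plain tuples standing in for RDF triples. The `subj:` identifiers are hypothetical (real IPTC NewsCodes use their own codes), and the sketch assumes each subject has at most one direct parent.

```python
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical subject identifiers, for illustration only.
subject_triples = [
    ("subj:Soccer", SUBCLASS_OF, "subj:Sport"),
    ("subj:Sport", SUBCLASS_OF, "subj:Subject"),
    ("subj:Elections", SUBCLASS_OF, "subj:Politics"),
    ("subj:Politics", SUBCLASS_OF, "subj:Subject"),
]


def ancestors(subject, graph):
    """Follow subclass links upward and return the broader subjects in order."""
    parents = {s: o for s, p, o in graph if p == SUBCLASS_OF}
    chain = []
    seen = {subject}  # guards against accidental cycles in the data
    current = subject
    while current in parents and parents[current] not in seen:
        current = parents[current]
        seen.add(current)
        chain.append(current)
    return chain


print(ancestors("subj:Soccer", subject_triples))
```

Because the taxonomy lives in the same triple form as the other ontologies, an article tagged with a narrow subject can be retrieved by a query for any of its broader subjects.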

Semantic Web Architecture

In addition to content-based metadata, there is context-based metadata. This kind of metadata is higher level and, in this context, is usually related to journalism. It is generated by the system users (journalists, photographers, cameramen, etc.). For instance, there are issue dates, news subjects, titles, authors, etc.

This kind of metadata can come directly from semantic sources but, usually, it is going to come from legacy XML sources based on the standards’ XML Schemas. Therefore, in order to integrate them, they will pass through the XML2RDF component. This component, in conjunction with the ontologies previously mapped from the corresponding XML Schemas, generates the RDF metadata that can be then integrated in the common RDF framework.
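The XML-to-RDF step described above can be pictured with a small Python sketch. This is not the XML2RDF component itself: the XML fragment is a simplified, hypothetical stand-in (not a valid NITF or NewsML document), the `ex:` prefix and the `xml_to_triples` function are assumptions, and the mapping is the naive rule "one child element becomes one property", whereas the real component is driven by the ontologies mapped from the XML Schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified news fragment (not valid NITF or NewsML).
XML_SOURCE = """
<news id="item-1">
  <title>Refugee crisis near the border</title>
  <author>Jane Doe</author>
  <issued>2015-10-01</issued>
</news>
"""

# Hypothetical namespace prefix for generated resources and properties.
PREFIX = "ex:"


def xml_to_triples(xml_text):
    """Map each child element of the root to one (subject, property, literal) triple."""
    root = ET.fromstring(xml_text.strip())
    subject = PREFIX + root.get("id")
    # The root element name becomes the resource type.
    triples = [(subject, "rdf:type", PREFIX + root.tag)]
    for child in root:
        triples.append((subject, PREFIX + child.tag, (child.text or "").strip()))
    return triples


news_triples = xml_to_triples(XML_SOURCE)
for triple in news_triples:
    print(triple)
```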

This framework has the persistence support of an RDF store, where metadata and ontologies reside. Once all metadata has been put together, the semantic integration can take place.
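As a rough illustration of what integration in a common store buys, the Python sketch below keeps triples from two sources in one in-memory set and answers a pattern query across both. The `TripleStore` class is a hypothetical stand-in for a real RDF store, and all identifiers are invented for the example.

```python
class TripleStore:
    """Minimal in-memory stand-in for an RDF store (illustration only)."""

    def __init__(self):
        self._triples = set()

    def add_all(self, triples):
        self._triples.update(triples)

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


store = TripleStore()
# Metadata converted from a legacy XML source (hypothetical identifiers).
store.add_all([
    ("ex:item-1", "ex:subject", "subj:Soccer"),
    ("ex:item-1", "ex:author", "Jane Doe"),
])
# Metadata from a native semantic source describing a related video.
store.add_all([
    ("ex:video-7", "ex:subject", "subj:Soccer"),
])

# One query now spans both sources.
about_soccer = [s for s, _, _ in store.match(p="ex:subject", o="subj:Soccer")]
print(about_soccer)
```

Once everything shares one triple model, the origin of a statement no longer matters to the query, which is the point of passing legacy XML through a conversion step first.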


In early 2010, Evri.com started as a search engine and eventually migrated into content curation services through its patented semantic technologies. Backed by Microsoft co-founder Paul Allen, the company did not survive for more than two years and wound up its commercial activities by the end of 2012. Evri.com itself incorporated the semantic indexing technology from its acquisition of Radar Networks into its product line. Unfortunately, Radar’s existing product, Twine, was shut down in May 2010. End readers in the market are like multiple noses: the question is not how you hold these noses, but how long you can hold them.

Understanding The Copyright Clauses with respect to Semantic Contents:

Semantic Copyright
User-generated semantic content can be represented through a tag ontology, as discussed above, which can be used to represent tagging data at a semantic level using Semantic Web technologies like the OpenCalais tool.

A Social Semantic Cloud of Tags can improve expressive knowledge representation, and multiple ontologies can aid in describing copyright metadata through extended properties. Such properties enable the republishing, exchange, and reuse of tagging data, and provide a way to reduce the risk of copyright infringement when tags are shared across multiple information layers over the Internet on multiple devices.

The copyright domain is a complex one, and conceptualizing it is a very challenging task. The conceptualization process, as shown in the pattern description, is divided into two phases. The first one concentrates on the static aspects of the domain. Due to their complexity, the static aspects are divided into two different sub models.

First, there is the creation sub model. This model is the basis for building the conceptual models of the remaining parts. It defines the different forms a creation can take, which are classified following the three main points of view proposed by many of the ontologies discussed earlier, e.g. the Suggested Upper Merged Ontology:

• Abstract: Work.
• Object: Manifestation, Fixation and Instance.
• Process: Performance and Communication.

Apart from identifying the key concepts, the creation sub model also includes some relations among them and a set of constraints on how they are interrelated. More details for this point and the following steps in the conceptualization process are available from

Second, there is the rights sub model, which is also part of the static model. The rights sub model follows the World Intellectual Property Organization (WIPO) recommendations in order to define the rights hierarchy. The most relevant rights in the Digital Rights Management (DRM) context are economic rights, as they are related to the productive and commercial aspects of copyright. All the specific rights in copyright law are modeled as concepts. For the economic aspects of copyright there are the following rights: Reproduction, Distribution, Public Performance, Fixation, Communication and Transformation Right.

Each right governs a set of actions, i.e. things that the actors participating in the copyright life cycle can perform on the entities in the creation model. Therefore, it is time to move to the dynamic aspects of the domain. The model for the dynamic part is called the Action Model and it is built on the roots of the two previous ones. Actions correspond to the primitive actions that can be performed on the concepts defined in the creation sub model and which are regulated by the rights in the rights sub-model. For the economic rights, these are the actions:
  • Reproduction Right: reproduce, commonly speaking copy.
  • Distribution Right: distribute. More specifically sell, rent and lend.
  • Public Performance Right: perform; it is regulated by copyright when it is a public performance and not a private one.
  • Fixation Right: fix, or record.
  • Communication Right: communicate when the subject is an object or retransmit when communicating a performance or previous communication, e.g. a re-broadcast. Other related actions, which depend on the intended audience, are broadcast or make available.
  • Transformation Right: derive. Some specializations can be adapted or translated. 
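The relationship between rights and actions listed above can be written down as a simple lookup structure. The Python sketch below does so using the right and action names from the list. The `governing_rights` and `is_permitted` helpers, and the rule that an action is allowed when at least one governing right has been granted, are assumptions added for illustration and are not part of the model as described.

```python
# Economic rights and the actions they govern, taken from the list above.
RIGHT_ACTIONS = {
    "ReproductionRight": {"reproduce", "copy"},
    "DistributionRight": {"distribute", "sell", "rent", "lend"},
    "PublicPerformanceRight": {"perform"},
    "FixationRight": {"fix", "record"},
    "CommunicationRight": {"communicate", "retransmit", "broadcast", "make available"},
    "TransformationRight": {"derive", "adapt", "translate"},
}


def governing_rights(action):
    """Return the rights that regulate the given action."""
    return sorted(right for right, actions in RIGHT_ACTIONS.items() if action in actions)


def is_permitted(action, granted_rights):
    """Assumed rule: allowed if at least one governing right has been granted."""
    return any(right in granted_rights for right in governing_rights(action))


# Hypothetical license that grants only reproduction and distribution.
license_rights = {"ReproductionRight", "DistributionRight"}
print(governing_rights("translate"))
print(is_permitted("sell", license_rights))
print(is_permitted("broadcast", license_rights))
```

Keeping the mapping as data rather than code mirrors the ontology approach: adding a new action means adding an entry, not changing the checking logic.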

At IndraStra, we find this area an alluring one in which to explore “Smart News Generation and Distribution”. We expect that applying upcoming advances in the Semantic Web field to news content curation will deliver clear advantages for content process automation as well as for search and retrieval. But, as a publisher, we believe in deploying these new-school methods without sacrificing old-school ideologies. In the end, content is both king and queen, and a fierce battle will be fought over who publishes and who curates the best. The final outcome will be a bloodbath if there is no synergy between the multiple players, with a primary focus on the evolution of readership habits based upon various affinity factors.

About The Author:

Rahul Guhathakurta  is the Founder and Curator of IndraStra.com and can be reached at his LinkedIn Profile. Thomson Reuters ResearcherID : K-4094-2015


1. Ontological Infrastructure for a Semantic Newspaper - Roberto García, Ferran Perdrix, Rosa Gil, Departament d'Informàtica i Enginyeria Industrial, Universitat de Lleida, Jaume II 69, E-25001 Lleida, Spain - http://image.ntua.gr/swamm2006/resources/paper07.pdf

2. An experience with Semantic Web technologies in the news domain - Luis Sánchez-Fernández, Norberto Fernández-García, Ansgar Bernardi, Lars Zapf, Anselmo Peñas, Manuel Fuentes - http://ceur-ws.org/Vol-155/paper2.pdf

3. Contracting and Copyright issues in Composite Semantic Services - Christian Baumann, SAP AG , SAP Research / Email: ch.baumann@sap.com - http://link.springer.com/chapter/10.1007%2F978-3-540-88564-1_59#page-2

AIDN: 001-10-2015-0348

Image Attributes: Cover Art - Simplistic example of the sort of semantic net used in Semantic Web technology, February 3, 2010, Wikimedia Commons