---

Deficiencies of Manual Tagging

An impressive array of professional cataloging systems has been developed in order to sort information through the inclusion of ‘metadata’. This metadata relates the nature of a document’s content in a short label or ‘tag’. An entire scientific field has emerged surrounding cataloging, categorizing and classifying data according to complex rules and systems, using large controlled vocabularies for describing document contents.

Metadata is used to mark and identify video and audio content to facilitate the management and usage of data. In a video archive or library, the metadata frequently includes the titles stocked and a description of the content such as the director, the date of production and the physical location where it is stored.

Most critics argue that manually created metadata is:

  • Time-consuming
  • Expensive
  • Subjective
  • Often incorrect
  • Dependent on expertise
  • Static: tags reflect the interests and vocabulary of the time they were applied

 

Subjective

Each person will categorize or tag a given file differently. In order to retrieve it, a user must guess the category under which it was tagged. Two authors of similar information may view it very differently from one another, and differently again from those searching for the information. This often results in a complex searching process that misses the targeted video content.

Another problem inherent in this approach is that people grow lax in their tagging, so that the large majority of content ends up carrying default tags. This makes specific content difficult to find and renders the whole taxonomy system useless.

Further complications arise when subjects incorporate multiple themes. For instance, a report about “technology development in Russia within the context of changing foreign policy” could easily be classified as:

  • Russian technology
  • Russian foreign policy
  • Russian economics

The decision process is both complex and time-consuming, and it introduces yet more inconsistency, particularly when the sheer number of options available to a user is considered. For example, with over eight hundred tags for general newspaper subjects, choosing even a basic subject description in a reasonable time becomes a challenge.

To complicate the issue further, the same author may tag the same information in different ways on different occasions: yesterday a document was to do with “defense”, today it is related to “war”. Such inconsistency wastes valuable time and can frustrate any long-term use of the information, because how a document is classified comes to depend on when it was tagged.

 

Language Dependent

Keywords are inherently dependent on the local language. Across a global enterprise, closely related documents from different regional departments might never be collocated, simply because of the language used to tag them. Additionally, folksonomies invite deliberately idiosyncratic tagging. This is not necessarily a problem in the consumer sphere, but in a professional environment it can mislead the user and dramatically decrease retrieval rates.

Furthermore, there is no synonym control in a fully manual system, leading to multiple tags that have identical intended meanings but return different results. Multiple forms of the same word, such as verb inflections or plural and singular forms, also lead to ambiguities: searching for “car” and searching for “cars” will yield different results. Ambiguity can also arise when users apply similar tags to describe vastly different things. Homonyms cause problems too; for instance, if a user enters the query “fly”, the system will return documents that relate to the anatomy of insects as well as details of airplane tickets.
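The effect can be illustrated with a minimal sketch. It assumes a hypothetical in-memory tag index; the document names and tags are invented for illustration and do not describe any particular product. Exact-match lookup treats “car” and “cars” as unrelated tags, while the single tag “fly” retrieves two unrelated documents.

```python
from collections import defaultdict

# Hypothetical tagged documents, invented purely for illustration.
TAGGED_DOCS = {
    "fleet_report.doc": ["car"],
    "dealer_survey.doc": ["cars"],
    "insect_anatomy.doc": ["fly"],
    "airline_tickets.doc": ["fly"],
}


def build_tag_index(tagged_docs):
    """Map each tag, exactly as typed, to the documents carrying it."""
    index = defaultdict(set)
    for doc, tags in tagged_docs.items():
        for tag in tags:
            index[tag].add(doc)
    return index


def lookup(index, tag):
    """Exact-match lookup: no stemming and no synonym control."""
    return sorted(index.get(tag, set()))


index = build_tag_index(TAGGED_DOCS)
print(lookup(index, "car"))   # ['fleet_report.doc']
print(lookup(index, "cars"))  # ['dealer_survey.doc']
print(lookup(index, "fly"))   # ['airline_tickets.doc', 'insect_anatomy.doc']
```

Singular and plural forms split the result set, and the homonym merges two unrelated ones; neither problem is visible to the person applying the tag.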

 

Apathy

One of the most stubborn issues is that of employee apathy. While professional taxonomists may engage in organizing and metatagging with quasi-religious zeal, many employees are less enthused by the process. This will often lead to tags being too general, missing, or simply incorrect.

 

Incorrect Data

Metatagging is not verified by the system and, as such, a tag can carry information that is unrelated to the file. For instance, a video may be tagged so that it appears to contain authorized content when in fact it does not. This is frequently seen with online video clips where the tag is changed to avoid copyright detection.

 

Limited Search

The most common information retrieval techniques, keyword and Boolean search, require users to enter the exact words they are looking for into a text field. Upon submission, the search returns a list of files that contain the search terms. This will only be successful if the user running the search uses exactly the same word that was applied to the content at the time of indexing.
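As a hedged illustration, the sketch below implements a simple Boolean AND search over a hypothetical tag index. The index contents and the function name are assumptions made for this example. A query succeeds only when every term matches a tag exactly as it was applied.

```python
def boolean_and_search(tag_index, query_terms):
    """Return the documents tagged with every query term (exact match only)."""
    result = None
    for term in query_terms:
        docs = tag_index.get(term, set())
        # Intersect with the documents matched so far.
        result = docs if result is None else result & docs
    return sorted(result or [])


# Hypothetical index (tag -> documents), invented for illustration.
TAG_INDEX = {
    "defense": {"briefing_monday.doc"},
    "war": {"briefing_tuesday.doc"},
    "russia": {"briefing_monday.doc", "briefing_tuesday.doc"},
}

print(boolean_and_search(TAG_INDEX, ["russia", "war"]))       # ['briefing_tuesday.doc']
print(boolean_and_search(TAG_INDEX, ["russia", "defense"]))   # ['briefing_monday.doc']
print(boolean_and_search(TAG_INDEX, ["russia", "conflict"]))  # []
```

Both briefings may cover the same subject, yet a user who searches for “war” never sees the one tagged “defense”, and a user who searches for “conflict” sees nothing at all.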

 

Not Scalable

In order to be very specific in the retrieval and processing of tagged files, the number of tags will need to be very high. For example, tag numbers in a company such as Reuters run into the tens of thousands. However, as the number of tags increases, so does the effort required and the likelihood of misclassification.

 

High Labor Costs

Taxonomy creation and tagging are still predominantly manual tasks requiring input from librarians, users and IT staff. This means that large labor costs are involved in making sense of information.

 

Interoperability of Tagging

XML is not a set of standard tag definitions; it is a set of definitions that allow users to define tags. This means that if two organizations are going to interoperate and apply the same meaning to the same tags, they have to explicitly agree upon their definitions in advance. While this may prove possible for small groups of cooperating agents working over public networks, doubts remain as to whether this will scale to support an extended network of industry trading partners.
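A minimal sketch of the problem, using Python's standard xml.etree.ElementTree module: the two XML fragments, their tag names, and the mapping table are hypothetical, invented for illustration. Two organizations describe the same video with different self-defined tags, and a consumer can only reconcile them once a mapping has been agreed in advance.

```python
import xml.etree.ElementTree as ET

# Two hypothetical organizations describing the same video with their own tags.
ORG_A_XML = "<video><title>Launch Event</title><director>J. Smith</director></video>"
ORG_B_XML = "<clip><name>Launch Event</name><madeBy>J. Smith</madeBy></clip>"

# A mapping both parties would have to agree on explicitly (assumed here).
AGREED_MAPPING = {"name": "title", "madeBy": "director"}


def extract(xml_text, wanted, mapping=None):
    """Collect the wanted fields, renaming tags through an optional mapping."""
    mapping = mapping or {}
    root = ET.fromstring(xml_text)
    record = {}
    for child in root:
        field = mapping.get(child.tag, child.tag)
        if field in wanted:
            record[field] = child.text
    return record


wanted = {"title", "director"}
print(extract(ORG_A_XML, wanted))                  # {'title': 'Launch Event', 'director': 'J. Smith'}
print(extract(ORG_B_XML, wanted))                  # {}
print(extract(ORG_B_XML, wanted, AGREED_MAPPING))  # {'title': 'Launch Event', 'director': 'J. Smith'}
```

Both fragments are well-formed XML, but without the agreed mapping the second yields nothing. Every additional partner adds another mapping to negotiate and maintain.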

As with other forms of structured information, current methods of indexing video resort to arranging information into a set of pre-defined fields. These methods reduce the richness of the information contained within the video and so undermine the value of the asset.

To derive maximum benefit from video assets, the IT industry needs to take a new approach to harness the power of rich media assets. Manual tagging is not tenable: it does not scale, is subjective, and is inconsistent when trying to categorize large volumes of information.

This new approach is what Autonomy Virage offers: it sets the bar at a level that allows rich media to become a reusable corporate asset with greater business value.

 

Idea Distancing

Tags also fail to highlight the relationships between subjects. There are often vital relationships between seemingly separately tagged subjects such as wing design/low drag and aerofoil/efficiency, a concept known as “idea distancing.” Obviously, there will be a degree of overlap between these categories, and because of this a user may be interested in the contents of both. However, without understanding the meanings of the category names there is no clear correlation between the two.
