What is a Video Search Engine? Part I - Searching Speech

Of all formats, videos are the most difficult to search. Typically, current search engines can only search the "Title" and "Metadata" of a video, both of which are manually keyed in by a human. There is no way to search the content inside the video. For example, how do you find a specific piece of news in a news clip? Or specific words that appear inside a video? How can you find them without actually watching the videos yourself?

Before we even get into the question of what a video search engine is, we need to understand what we can search inside a video. Elements can include Speech, Words (or Text), Motion, Emotions, Faces and Objects.

Video Search as a Service

To kick off this “What is a Video Search Engine?” series, let’s tackle the most obvious of the elements – Speech.

In an hour, a person can say up to 9,000 words. Given the rate at which videos are being produced today, that’s a lot of words. According to the Ethnologue catalogue of world languages, there are currently 7,099 living languages. Obviously, Speech Recognition technology has not been able to keep up with this vast number of languages. However, the good news is (depending on how you see things) that just 23 languages account for more than half of the world’s population.

 Languages in the World (Source: www.ethnologue.com)

On the technical side, searching speech in videos requires the following process:

  1. Transcribe (Speech-to-Text) – transcribe the speech in the video
  2. Index – make the transcribed speech searchable
  3. Search – bring users to exactly where the search terms are spoken in the video
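
To make the three steps concrete, here is a minimal Python sketch of the Index and Search steps. It assumes the Transcribe step has already produced a list of timestamped text segments; the sample transcript, the function names (`tokenize`, `build_index`, `search`) and the simple inverted-index approach are illustrative assumptions, not a description of how any particular video search engine works.

```python
import re
from collections import defaultdict

# Hypothetical output of the Transcribe step: (start time in seconds, text).
SEGMENTS = [
    (0.0, "Good evening and welcome to the news"),
    (4.5, "The election results were announced today"),
    (9.0, "Weather tomorrow will be sunny"),
    (13.2, "More on the election after the break"),
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(segments):
    """Index step: map each word to the start times of segments containing it."""
    index = defaultdict(list)
    for start, text in segments:
        for word in set(tokenize(text)):
            index[word].append(start)
    return index

def search(index, query):
    """Search step: return start times of segments containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)

index = build_index(SEGMENTS)
print(search(index, "election"))
```

Each returned value is a time offset, so a player could jump straight to the moment the words are spoken. A real system would also need to handle transcription errors, word stemming and ranking, which this sketch ignores.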

These steps might sound simple, but transcribing speech is filled with problems. Several factors can affect the accuracy of speech recognition. For example:

  • heavy localized accent
  • low speech volume
  • bad diction
  • heavy background noise
  • multiple voices speaking at the same time

With the above in mind, a lot of videos are “not suitable” for machine transcription: movies, TV shows, anything with mixed audio and sound effects, and poorly recorded content with background noise (hiss).

