In a white paper I recently released, I cited an article quoting a 2008 report on the need for storage solutions. That report said the volume of unstructured data was growing at a 61.7% compound annual growth rate, which is a pretty huge increase. Quick question: what types of content are feeding this growth?
Of course, the explosion of picture, audio, and video artifacts streaming into sites like YouTube and Flickr (for example) contributes significantly. While many of these items are pirated copies of previously broadcast material, video resumes of prospective Hollywood superstars, or the country’s funniest home videos, there are also some serious pieces carrying semantically worthy content (from numerous broadcast channels, as well as *ahem* “experts” in their respective fields).
OK, then there are the other social media channels: blogs, Twitter, Facebook, et al., which spawn streams of largely unstructured text. The traditional online media channels also provide a steady stream of articles, white papers, and other reports carrying interesting information.
So, here is the real issue: if the amount of content available is essentially doubling about every 18 months, how does one keep up with it? Actually, here is a different spin on that question: clearly, not all of that content is relevant to any individual, so how does one separate the signal from the noise?
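As a quick sanity check on that "doubling about every 18 months" figure, here is a minimal sketch that converts the 61.7% compound annual growth rate into a doubling time. It assumes the rate stays constant from year to year, which the report may not have claimed, and the function name is just one I made up for illustration.

```python
import math

def doubling_time_years(cagr: float) -> float:
    """Years for a quantity to double at a constant compound annual growth rate.

    Solves (1 + cagr) ** t == 2 for t.
    """
    return math.log(2) / math.log(1 + cagr)

years = doubling_time_years(0.617)  # 61.7% CAGR from the 2008 report
months = years * 12
print(f"{years:.2f} years, about {months:.1f} months")
# prints "1.44 years, about 17.3 months"
```

So under that constant-rate assumption, the volume doubles a little faster than every 18 months.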
This is not really a new question in the BI arena, considering that the objective of the data warehouse was to consolidate and re-present information as it is synthesized from mounds of data. The idea is that by organizing and collating pieces of information, the environment takes over much of the end consumer’s work of plucking out what is relevant and then tallying it up. Presumably, the same concepts will be applied to unstructured content as well, so there is some hope.
Practically, this suggests the need for more sophisticated methods for analyzing the semantics of various types of content, then meta-tagging and quantifying that content so it can be organized. Text-mining tools are emerging in the mainstream, and hopefully they will help with the filtering.
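To make the filtering idea a bit more concrete, here is a deliberately crude sketch: score each piece of text by the share of its words that match a reader's terms of interest, and keep only the pieces above a threshold. Real text-mining tools do far more than keyword matching; the function names, the sample items, and the 0.1 threshold are all assumptions of mine for illustration.

```python
import re
from collections import Counter

def relevance(text: str, interests: set[str]) -> float:
    """Fraction of the words in `text` that appear in `interests`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    hits = sum(counts[term] for term in interests)
    return hits / len(words)

def filter_signal(items: list[str], interests: set[str],
                  threshold: float = 0.1) -> list[str]:
    """Keep only the items whose relevance score meets the threshold."""
    return [text for text in items if relevance(text, interests) >= threshold]

# Hypothetical sample content and interests.
items = [
    "Data warehouse design for BI reporting",
    "My cat video went viral again",
    "Text mining tools for unstructured data",
]
interests = {"data", "warehouse", "bi", "mining"}

for text in filter_signal(items, interests):
    print(text)
```

With these made-up inputs, the first and third items pass the filter and the cat video does not.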
For the content provider (which means almost everybody today), though, there are two paths to take. The first is to jump on the bandwagon, churn out ever more chunks of content, and pump them into the stream. Seems like a lot of folks are taking this path.
The second is to concentrate on quality and differentiation when generating content, and to keep refining the message for the specific audience you are trying to reach. In turn, search engine ranking algorithms are likely to pick out that content and bubble it up to the top precisely when someone is looking for it, and that may reflect more of the business intelligence approach…


