Everyone searching for data has run up against the limitations of keywords. Even the best search engines require clever word choice to pull up the data you really want, and untold piles of data never get found because they're missing the right word or phrase. This problem is particularly bad for those who attempt to cross discipline boundaries, such as engineers who seek solutions in biology, or policy wonks who are looking for geographic statistics.
Most attempts to change this have relied on the top-down creation of ontologies to make tables of contents (such as Yahoo, Amazon, or the British mSpace research project). However, these methods require extensive skilled labor, and rely on the creators' ontologies being complete, consistent, and uncontroversial; in other words, they are expensive and only marginally useful. Tools like Vivisimo automate the ontology-building process, but are bound by the limits of keywords, so they fail to overcome the jargon barrier.
Improvements in computing power and algorithms have begun to allow the creation of bottom-up associations of related concepts. This means grouping semantic peers, even those that do not share specific words. (It could potentially also lead to ontologies that self-assemble.) So far, Latent Semantic Indexing appears to be the most promising technology. As researchers at Middlebury College described it:
"In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent."At least one company is commercializing the technology. According toTelcordia, "LSI can retrieve relevant documents even when they do not share any words with a query. LSI uses these statistically derived 'concepts' to improve search performance by up to 30%."
A somewhat related technology is Language Weaver, designed to be a translation engine. As New Scientist writes, "The key[s] to their 'statistical machine translation software' are the translation dictionaries, patterns and rules - translation parameters - that the program develops. It does this by creating many different possible translation parameters based on previously translated documents and then ranking them probabilistically." And they quote the company founders as saying, "Before long a machine will discover something about linguistics that only a machine could, by crunching through billions of words."
Language Weaver is intended as a spoken-language translator, but what if it were used as a discipline-language translator? It could cross the jargon barrier between doctors and engineers, programmers and biologists, designers and psychologists, etc., in ways that currently require experts skilled in both fields.
The dearth of such cross-discipline people has always been a stumbling block for innovation and science. Perhaps tools like Latent Semantic Indexing and Language Weaver could help change all that.
Language Weaver and similar statistical machine translation systems could conceivably be used to translate between registers in the way you describe, but there's a problem.
The way that statistical machine translation works is that you train it on existing translations: you take a whole bunch of text in English, say, and all of the translations of that text into French (which, we should remember, are done by hardworking human translators), and you train your system to recognize the correspondences between the English and French text.
Then the system applies the rules it's learned to new English text, and hopefully spits out something looking more or less like readable French.
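The classic version of this training step can be sketched concretely. The snippet below is an IBM Model 1-style expectation-maximization loop (a textbook simplification, not Language Weaver's actual system; the three English-French sentence pairs are invented toy data). It starts with uniform guesses about which words correspond, then repeatedly re-estimates the correspondences from the parallel text:

```python
from collections import defaultdict

# Toy parallel corpus (assumed data): English/French sentence pairs.
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]

e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}

# Start with uniform translation probabilities t(f|e): no knowledge at all.
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(20):  # EM iterations
    count = defaultdict(float)  # expected co-occurrence counts
    total = defaultdict(float)  # normalizers per English word
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)  # normalize over possible alignments
            for e in es:
                c = t[(f, e)] / z  # expected count of e aligning to f
                count[(f, e)] += c
                total[e] += c
    for (f, e) in t:  # M-step: re-estimate t(f|e) from expected counts
        if total[e] > 0:
            t[(f, e)] = count[(f, e)] / total[e]

# "book" co-occurs with "livre" in two sentence pairs, so EM settles on it.
best = max(f_vocab, key=lambda f: t[(f, "book")])
print("book ->", best)  # prints: book -> livre
```

Nobody told the program that "book" means "livre"; it inferred that from co-occurrence statistics across the sentence pairs, which is the sense in which such systems "develop" their own translation parameters.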
The problem with translating between jargon-laden text and a more everyday version is that currently the training data doesn't exist in the quantity one would need to build a useful translation system. Perhaps that will change in the future.
Folksonomy - what I used to call "speculative categorization" or "distributed classification" - is a very promising approach.