I didn’t realize this until a couple of weeks ago, but Oxford Dictionaries announces an annual “Word of the Year”.  For 2016, the Word of the Year is “Post-Truth”, which is:

“An adjective defined as ‘relating to or denoting circumstances in which objective facts are less influential in shaping public opinion than appeals to emotion and personal belief.’”

Back in the old days, long before the World Wide Web, we had “Propaganda,” which Webster’s defines as:

 “The spreading of ideas, information, or rumor for the purpose of helping or injuring an institution, a cause, or a person.” 


Every election, every battle or war, and every policy has had its share of feeding by the propaganda machines of nations.  So this concept of Post-Truth is not new.  What is far different today is that anyone with a device, a connection, some social media savvy, and a reasonable network of followers can create propaganda.  An email or a tweet goes out with a statement or a fact, and boom: we now have content that may be interesting, or relevant, or important, or, maybe, a Post-Truth.

We have seen and heard all of the implications of this content being spread in broadcasts, print, and social media that are driven by agendas, personal vendettas, and, of course, the prized goal of controlling, owning, and manipulating information to serve a specific agenda.  Just look at how ISIL and other terrorist organizations have spread their agenda to recruit and spread fear.  Creating Post-Truths has been part of humanity forever, but the volume and velocity of it in the past few years has been staggering, especially in the context of two global events of 2016: the UK referendum on the EU, and the US presidential election.  Many of us saw it, read it, heard it, maybe lived it… and most had to cope with content flooding our screens.  There was so much information being sent, in so many languages, with such a broad range of sentiment and emotion associated with just these two events on a global level.  A multilingual friend in Europe showed me comparisons of Facebook posts that he had been reviewing in five different languages, and though they each tried to convey a similar thought, each could have been interpreted in a different manner depending on the translation and the reader’s political or personal perspective.

It is astonishing that all of this multilingual content is being created, yet very little of it is being used for purposes beyond sharing a mood or a feeling, or retweeting to keep your feed going.

This is Big Data, but with a twist.  This is Big Language Data.  It applies to every market, every business, every nation’s security and defense posture, every parent, and every child.   Content is created on the internet at an astounding rate.  From gwava.com:1

  • 500 million tweets sent each day!
  • More than 4 million hours of content uploaded to YouTube every day!
  • 3 BILLION Facebook messages posted daily!
  • 4 million text messages sent each minute in the US alone!

You don’t know where it is coming from, and none of us have the time or resources to look at it all, do the fact-checking ourselves, and draw conclusions.  Now consider this problem when you expand it to a global level.  English, Chinese, Spanish, and Arabic are among the most used languages on the Web.2  If you were an analyst responsible for understanding global reactions to an event, or predicting how a certain population may react to a new foreign policy, or responsible for the forensics associated with a terror attack and the search for suspects, your big data problem is really a Big Language Data problem.  And it is imperative to realize that the end goal is to truly UNDERSTAND what is being said, why it is being said, and any emotion that is being shared.

Add one more twist to this Big Language Data problem: language itself is evolving on the internet.  Staying up to date on the continuously evolving terms and idioms created in English is challenging enough for many of us, beginning with me.  But this phenomenon is happening in all languages.  Professional translators who are tasked with translating content in a second or third language are finding that they have to rely on others who are closer to the evolving terminology of social media; unfortunately, most organizations do not have a strategy for continuously learning and sharing what they have learned across groups and across languages.  And as we have seen over the past decades, new languages are appearing on the internet as more people gain access, and even those begin evolving almost immediately.  Just look at the use of Amharic on social media today compared to 10 years ago.

Within every industry, companies are investing in social media collection, analysis, and forensics, and then using that information to drive decisions.  Companies need to know the emotions and reactions being generated about new products, or information, or price increases.  Governmental and political organizations are all over social media trying to figure out how, what, and why people are reacting to policies, scandals, corruption, and… of course… Post-Truths.

In the industry I serve, Defense and Security, the problem is far closer to home.  We have law enforcement analysts, intelligence analysts, and soldiers all using social media to stay one step ahead of our adversaries.  Those adversaries write in dialects and slang. They write in 40+ languages, often mixing languages together.  They create words and emojis to confuse and redirect those tracking them.  In a time when it is acceptable to post broken sentences, misspell words, ignore grammar, and create new concepts with a mix of letters and pictures, the challenge is daunting.  Though we may see it every day in English, this phenomenon is happening in Chinese, Russian, Arabic, Hindi, and every other language.

So the Big Language Data is coming in without rules, without grammatical structure, without predictability.  One could try to translate all of this data, but that is impractical, if not impossible.  So we use technologies to filter: we can apply native-language processing to extract entities and associate persons, places, and things.  We can pull out priority content based on other metadata and translate that priority content through machine translation.  But we still have to get to the point of contextual understanding: being able to learn a new phrase or idiom, and then share it with others, on a global level if needed, so we aren’t trying to learn it again in some other situation or location.
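The triage flow described above (filter by metadata, extract entities in the native language, then machine-translate only the priority content) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the metadata values, the regex-based "entity extraction", and the placeholder `translate` function are all hypothetical stand-ins, not any real product's pipeline or API.

```python
import re

# Assumed metadata values that mark a message as priority (hypothetical).
PRIORITY_SOURCES = {"field_report", "watchlist_account"}

def filter_by_metadata(messages):
    """Step 1: keep only messages whose metadata marks them as priority."""
    return [m for m in messages if m["source"] in PRIORITY_SOURCES]

def extract_entities(text):
    """Step 2: crude stand-in for native-language entity extraction --
    here, just capitalized tokens as candidate names of persons/places."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)

def translate(text, target="en"):
    """Step 3: placeholder for a machine-translation call."""
    return f"[{target}] {text}"

def triage(messages):
    """Run the full filter -> extract -> translate pipeline."""
    results = []
    for m in filter_by_metadata(messages):
        results.append({
            "entities": extract_entities(m["text"]),
            "translation": translate(m["text"]),
        })
    return results
```

In a real system each step would be a serious multilingual component; the point of the sketch is only the shape of the pipeline, and the fact that translation is applied last, to the smallest possible slice of the data.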

Translation is a key step, but understanding should be the goal.  Especially in the Defense and Security domain, where that deep understanding of incoming multilingual content could mean the difference between defeating a potential threat and having to pick up the pieces after an attack.

Finally, we are seeing the maturation and convergence of artificial intelligence, deep/machine learning, neural machine translation, and affordable processing power and GPU-based hardware. The breakthroughs are coming, and they will let humans use machines to ingest, translate, analyze, and understand all of this content: in all of these languages, with spelling errors and emojis, with acronyms and terminology, with dialects and language switching.  There have been some amazing breakthroughs in neural machine translation, which will not only improve the quality of translation but also accelerate our learning of new words created on the internet.

When we are able to truly understand multilingual content, we will then be able to decipher it, separating fact from fiction, threat from safety, news from opinion, and actors from watchers.  Combine these machine learning breakthroughs with shared tools such as terminology management, shared translation memory, and mature, repeatable processes to ingest and understand content in a consistent manner, and we will be able to deal with Big Language Data and the myriad impacts it has on daily lives, national security and safety, international relationships, and the 2016 Word of the Year: Post-Truth.


1 https://www.gwava.com/blog/internet-data-created-daily

2 http://www.internetworldstats.com/stats7.htm

About the Author


S Danny Rajan is CEO of SDL Government in Herndon, VA.

SDL Government provides innovative Language Technology Solutions for the US Defense and Intelligence community, built around its suite of COTS products and its technical solutions team.

Learn More About SDL Government’s Machine Translation Solutions