AI Sells, Data Delivers!


The world is advancing at an ever faster pace, and new technologies are reshaping business and society. One of these emerging technologies, Artificial Intelligence (AI), has gained immense momentum, alongside others such as the IoT, blockchain, 3D printing and robotics.

AI is all around us these days, from applications that border on science fiction, like self-driving cars, to more ordinary tasks, like recommending which show to watch or which product to purchase next. AI is influencing almost every walk of life, from business to society. With these technologies, businesses can gain the agility needed to tackle problems that humans alone could not solve.

Three Components of AI

Apart from the context of the domain where AI is being applied, there are three main components of AI:

The first one is the AI algorithm itself. Open-source machine learning libraries like Keras, Theano and TensorFlow have hidden much of the low-level complexity involved in designing and building AI applications. These tools are free, well documented and supported by vibrant communities, which has made building AI applications far more accessible to developers.

The second one is computing power, both in the form of raw processing power and large-scale data storage. Cloud services like Amazon Web Services, Google Cloud, Microsoft Azure and others make renting servers, virtual machines and big data tools as simple as pushing a few buttons.

The third, and the most important, is data. Before you can contemplate hiring data scientists, renting servers or installing open-source machine learning libraries, you must have data. The quality and depth of that data will determine the level of AI applications you can achieve.

Data & AI


While AI is a reasonably wide area of study in computer science, most of the excitement these days is centred on an area of AI called machine learning and, in particular, deep learning. In machine learning, algorithms are trained on data so that they learn to make predictions on their own, rather than following rules written by hand.

As discussed above, the promise of AI depends on certain ingredients; one of the most critical for a successful AI implementation is a large amount of clean, representative historical data. However, unless you are Google or Facebook, with vast amounts of representative data, you will struggle to gather enough historical data for machine learning techniques to draw reliable inferences and become an effective enabler for AI initiatives. Therefore, to make AI work, data preparation, engineering, enrichment and contextualization all need to improve.

As with most reports about groundbreaking technology, this discussion is way ahead of current industry practices. The vision serves a useful purpose in suggesting what’s possible. But with many businesses lacking the data infrastructure necessary to obtain real AI and ML capabilities, the journey towards perfect production can also be so abstract that it perplexes the very people looking to achieve it.

AI algorithms learn from data. It is critical that you feed them the right data for the problem you want to solve. Even if you have good data, you need to make sure that it is on a useful scale, in a useful format and that meaningful features are included. Understand the key capabilities you need for collaborative, operational data preparation pipelines to serve the needs of both your data analysts and your business analysts.
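To make "a useful scale" concrete, here is a minimal sketch of min-max scaling, one common way (among several) of bringing features with very different ranges onto the same 0-to-1 scale. The function name and the sample values are hypothetical and chosen only for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so they fall between 0.0 and 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical features measured on very different scales.
ages = [18, 30, 42, 66]
incomes = [20_000, 35_000, 50_000, 140_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
```

After scaling, both features span the same range, so neither dominates simply because its raw numbers are larger.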

During the global launch of the Outside Insight book, author and Meltwater CEO Jorn Lyseggen, alongside AI experts, discussed the importance of the data fueling AI, and the need for executives using AI outputs for decision-making to both understand the data informing those outputs and ensure it’s as comprehensive and unbiased as possible.

“Artificial Intelligence is your rocket, but data is the fuel. You can have the best algorithms in the world, an amazing rocket, but you’re only going to get as far as your data gets you. Data is fundamental - data is AI,” said Gerardo Salandra, Chairman of the AI Society of Hong Kong and CEO at Rocketbots, at the Hong Kong launch event.

Similarly, Monica Rogati’s Data Science Hierarchy of Needs is a pyramid showing what’s necessary to add intelligence to the production system. At the bottom is the need to gather the right data, in the right formats and systems, and in the right quantity. Any application of AI and ML will only be as good as the quality of data collected.


International Data Corporation (IDC) predicts that global spending in the growing AI market will increase by 50% per year, to a total of $57.6 billion by 2021. Business leaders around the world are catching on to the importance of implementing an AI strategy. However, it’s not enough just to introduce AI-driven tools; you need the right data inputs to find valuable insights.

Importance of Data

A lot has been written about AI recently, but one element that is often not stressed is the role data plays in allowing AI to function. Take self-driving cars - probably the most recognized application of AI. Building a self-driving car requires an enormous amount of data, ranging from infrared sensor signals and digital camera images to high-resolution maps. NVIDIA estimates that one self-driving car generates 1 TB of raw data per hour. All that data is then used in the development of the AI models that actually drive the car.

While we have seen recent advances in other AI techniques that use less data, such as reinforcement learning (for example, the success of DeepMind’s AlphaGo Zero in the game of Go), data is still critical for developing AI applications.

Enterprises are overwhelmed with siloed IT systems built over the years, each holding data designed for a very specific ‘System of Record’ task. Unfortunately, these records are duplicated across multiple ‘Systems of Record’, resulting in massive data proliferation without a complete representation of any entity in a single system. This reality has given rise to a fragmented and often duplicated data landscape that requires expensive and often inefficient means of establishing ‘Source of Truth’ data sets.

Data provides ‘Intelligence’ to AI

AI applications improve as they gain more experience (that is, more data), but present AI applications have an unhealthy infatuation with gaining this experience exclusively through Machine Learning techniques. While there is nothing inherently wrong with Machine Learning, the main caveat for a successful Machine Learning outcome is sufficient and representative historical data for the machine to learn from. For example, if an AI model is learning to recognize chairs and has only seen standard dining chairs that have four legs, the model may believe that chairs are defined by having four legs. This means that if the model is shown, say, a desk chair that has one pillar, it will not recognize it as a chair.

Preparing data for AI

While your organisation may not be at the stage where you are able to start building AI applications, at the least you should be preparing for a future where your data will be utilised to power smart solutions. Treat every new initiative or project as an opportunity to build a foundation for future data models.

Data Collection

This aspect has become crucial in light of the GDPR legislation. When a new feature or product is being developed, are there clear guidelines, actually followed, about what data is collected and why? Does that data have a purpose, or is it being collected for its own sake?

Data Format

When you are collecting data, is it being saved in a usable format across all your data collection touchpoints? Are the field names the same? Is the same level of validation and error checking applied across products?
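The questions above can be turned into an automated check. The sketch below assumes a single canonical schema shared by every touchpoint; the field names, validation rules and sample records are all hypothetical and only illustrate the idea.

```python
# Assumed canonical schema: field name -> validation function.
SCHEMA = {
    "email": lambda v: isinstance(v, str) and "@" in v,
    "country": lambda v: isinstance(v, str) and len(v) == 2,
    "signup_date": lambda v: isinstance(v, str) and len(v) == 10,
}


def check_record(record, schema=SCHEMA):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Every field in the schema must be present and pass its validator.
    for field, is_valid in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not is_valid(record[field]):
            problems.append(f"invalid value for {field}: {record[field]!r}")
    # Fields outside the schema usually signal inconsistent naming.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


# Hypothetical records from two data collection touchpoints.
web_signup = {"email": "a@example.com", "country": "DE", "signup_date": "2019-03-01"}
app_signup = {"Email": "b@example.com", "country_code": "FR", "signup_date": "2019-03-02"}

web_problems = check_record(web_signup)
app_problems = check_record(app_signup)
```

Here the web record passes, while the app record is flagged because it uses different field names for the same information, exactly the kind of mismatch that makes data hard to combine later.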

Data Storage

Data needs to flow into data stores and be available in real time to all areas of the business. Given that AI applications usually become more reliable the more they can correlate different sources of information, siloed data sets that are hard to access become an obstacle to discovering value in an organisation’s data.

Data Literacy

AI is inherently biased by how it was created, trained and programmed. One of the most important conditions for AI to succeed is that executives and decision-makers have the data literacy to challenge the model, stress-test it and fully understand its underlying assumptions, so that the answers it produces actually match the terrain in which they want to operate. To get the best predictive models and forward-looking insights, it is vital that the data informing them comes from a variety of external sources.

More Data

Just like humans, AI applications improve with more experience. Data provides examples essential to train models that can perform predictions and classifications. A good example of this is in image recognition. The availability of data through ImageNet transformed the pace of change in image understanding and led to computers reaching human-level performance. A general rule of thumb is that you need 10 times as much data as the number of parameters (i.e., degrees of freedom) in the model being built. The more complicated the task, the more data needed.
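As a quick worked example of that rule of thumb, the sketch below multiplies a model's parameter count by ten. The function name and the example model are hypothetical, and the result is a rough starting estimate rather than a guarantee.

```python
def min_training_examples(num_parameters, factor=10):
    """Rule-of-thumb estimate: `factor` training examples per model parameter."""
    return factor * num_parameters


# A linear model with 20 input features plus one bias term has 21 parameters.
linear_params = 20 + 1
needed = min_training_examples(linear_params)
print(needed)  # 210
```

By the same arithmetic, a model with a million parameters would call for on the order of ten million examples, which is why larger models put such pressure on data collection.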

Data Understanding

To cook the perfect meal, it helps to know the tastes of your diners. Similarly, data is essential to tailoring an AI model to the wants of specific users. We need to learn how users use their applications and search for content in order to generate meaningful personalized recommendations.

By knowing what content users read, download and collect, we can give them advice on potential content of interest. Furthermore, techniques such as collaborative filtering, which make suggestions based on the similarity between users, improve with access to more data; the more user data one has, the more likely it is that the algorithm can find a similar user.
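A minimal sketch of user-based collaborative filtering is shown below. It assumes each user is represented only by the set of items they have interacted with, and it recommends whatever the single most similar user has that the target user lacks. The user names, item names and helper functions are hypothetical; production systems use far richer signals.

```python
import math


def similarity(items_a, items_b):
    """Cosine similarity between two sets of items (binary interactions)."""
    if not items_a or not items_b:
        return 0.0
    return len(items_a & items_b) / math.sqrt(len(items_a) * len(items_b))


def recommend(user, data):
    """Suggest items that the most similar other user has but `user` lacks."""
    others = [u for u in data if u != user]
    if not others:
        return set()
    best = max(others, key=lambda u: similarity(data[user], data[u]))
    if similarity(data[user], data[best]) == 0.0:
        # Nobody overlaps with this user, so there is nothing to base advice on.
        return set()
    return data[best] - data[user]


# Hypothetical interaction data: user -> items they read or downloaded.
interactions = {
    "alice": {"paper_a", "paper_b", "paper_c"},
    "bob": {"paper_a", "paper_b", "paper_d"},
    "carol": {"paper_e"},
}

alice_suggestions = recommend("alice", interactions)
carol_suggestions = recommend("carol", interactions)
```

Alice overlaps with Bob, so she is offered the one item he has that she has not seen. Carol overlaps with nobody and gets no suggestions, which illustrates why more user data makes it more likely that a similar user exists.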

Diverse Data

A key problem in building AI models is overfitting - that is, where the model focuses too specifically on the examples given. For instance, if a model is trying to learn to recognize chairs and has only been shown standard dining chairs with four legs, it may learn that chairs are defined by having four legs. If the model is then shown a desk chair with just one pillar, it wouldn’t recognize it. Having diverse data helps combat this problem.
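The chair example can be reproduced with a deliberately naive toy learner that simply memorizes the feature combinations it has seen labelled as chairs. This is not how real models work; it is only a sketch, with hypothetical features (number of supports, whether there is a seat back), of how narrow training data leads to narrow conclusions.

```python
def train(examples):
    """Naive learner: remember every feature combination labelled as a chair."""
    return {features for features, is_chair in examples if is_chair}


def predict(model, features):
    """Recognize something as a chair only if it was seen during training."""
    return features in model


# Features are (number_of_supports, has_seat_back); both are assumptions.
narrow_data = [
    ((4, True), True),    # dining chair
    ((4, True), True),    # another dining chair
    ((4, False), False),  # table
]
diverse_data = narrow_data + [
    ((1, True), True),    # desk chair on a single pillar
    ((3, True), True),    # three-legged chair
]

desk_chair = (1, True)
narrow_model = train(narrow_data)
diverse_model = train(diverse_data)

recognized_by_narrow = predict(narrow_model, desk_chair)
recognized_by_diverse = predict(diverse_model, desk_chair)
```

The learner trained only on dining chairs rejects the desk chair, while the one trained on more varied examples accepts it.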

During training, the AI model can view more examples of different types of things. This is particularly valuable in working with data about people, where there can be the potential for algorithmic bias against people from diverse backgrounds. This point was made by Prof. Dame Wendy Hall in her interview at the World Summit AI. Prof. Hall focused on the need to make sure that AI was trained on diverse datasets. A good example of combating this through data is the lengths that Apple went to in training their new Face ID recognition algorithm.

External Data

As the race to implement AI tools at an enterprise-level reaches new heights, it’s important to note that the data informing those tools is of paramount importance. Relying only on internal information to inform algorithms will produce insights gleaned only from the information you already have. Rather, it’s vital that decision-makers also look to insights from external data for a much more comprehensive and unbiased view of their customers and industry landscape.

Hypothesis Testing

Even in cases where techniques can be used that require less training data, more data makes it easier to test AI systems. An example of this is A/B testing. This is where a developer takes a small amount of traffic to a site and tests to see whether a new recommendation engine or search algorithm performs better on that small set of traffic.

The more traffic (that is, the more data), the easier it is to test multiple algorithms or variants. At the World Summit AI, Netflix explained how they use A/B testing to select artwork that maximizes the engagement with films and TV series on Netflix.
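To see why more traffic makes testing easier, the sketch below runs a standard two-proportion z-test on the same pair of conversion rates at two traffic levels. The function name and the visitor counts are hypothetical, and this is one common way to evaluate an A/B test, not a description of any particular company's method.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value).

    Assumes the pooled conversion rate is strictly between 0 and 1.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Same rates (10% vs 12%), first with 1,000 visitors per variant, then 10,000.
z_small, p_small = ab_test(100, 1_000, 120, 1_000)
z_large, p_large = ab_test(1_000, 10_000, 1_200, 10_000)
```

With the smaller sample, the difference between variants cannot be distinguished from chance at the usual 5% level; with ten times the traffic, the same difference becomes clearly significant.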

Data Reusability

Finally, it is usually the case that data can be reused for different applications. For example, a technique called transfer learning allows what a model has learned from data in one domain to be applied to another domain. Moreover, recent work has shown that background knowledge can further improve performance on tasks like object detection in images. Recent work from Google has shown how training on data intended for one task, such as image recognition, can improve performance on a completely different task, such as language translation.


In summary, data is the pivotal element in developing any AI system today.

In the near future, what we are now calling AI will be embedded in our culture and we won’t call it AI any longer. It will just be how things work. What you have in your control today is your data. It’s crucial that you start preparing for a future where AI applications can start using your data and that starts with the quantity and quality of the data itself.

Embracing AI in business and society is a journey, not a silver bullet that will solve challenges instantly. It begins with gathering data into simple visualizations and statistical processes that allow you to better understand your data and get your processes under control. From there, you’ll progress through increasingly advanced analytical capabilities, until you achieve that utopian aim of perfect production, where you have AI helping you make products as efficiently and reliably as possible.

Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

