Data and AI Concepts

Table of Contents

I Preface

Topics Covered: Why This Book?, Who Should Read This Book?, Scope of This Book, Outline of This Book

You have probably heard quotes like ‘Data is the new oil’ and ‘AI is the new electricity’. There is no doubt that data and AI have become the most valuable assets of the digital ecosystem. Applications of data and AI are helping businesses, governments, and society in general. With this unprecedented adoption of data and AI techniques, the demand for data professionals has also skyrocketed.

Why This Book?

I have been thinking about covering data and AI concepts in the form of a book for quite some time. But as Simon Sinek suggests, ‘Start with Why’: it gives you purpose. The very first question I asked myself was: why do I need to write a book on ‘Data and AI’? I pondered this question for several weeks and came up with several reasons:

First, data and AI is a vast field; in fact, it is an amalgamation of multiple fields. A great deal of excellent literature already exists on its various aspects, and those books cover their specific areas at great length and depth. But to my knowledge, no literature covers the overall landscape holistically.

Second, data and AI projects can’t be delivered by a single person; they are a team effort. Members with different niche skill sets collaborate to deliver the business value of data. Data professionals are often said to be most effective with a T-shaped skill set: they have their niche, but they also have enough knowledge of and exposure to the horizontal layer. With this book, I try to cover the horizontal layer of the data and AI space end-to-end, in just enough depth.

Third, after working in the data and AI field for more than a decade, I have developed my own perspective on it, which I would like to share with other learners and practitioners. I truly believe that if I really understand my stuff, I should be able to teach it to a duck. Moreover, this exercise will help me immensely in structuring my own knowledge.

Fourth, my focus in this book is on concepts. Why? Because concepts are abstractions of real-world phenomena: if you know a concept, you can explore it and build on it as you wish. I intend to cover the data and AI concepts in an intuitive way.

Finally, I plan to cover all aspects of the data and AI field holistically, which will help you connect the dots and build your own perspective. I am confident that, as a result, you will be able to contribute to your data and AI projects more effectively.

Who Should Read This Book?

This book is for anyone who considers themselves a data professional or wants to become one.

I define a data professional as anyone who is a stakeholder in data and AI projects. Whether technical or business-focused, everyone needs a minimum level of understanding to be effective in data and AI teams. I have used parts of this material with people from business and technology alike.

A data professional can be a data analyst, data scientist, data engineer, machine learning (ML) engineer, business intelligence (BI) engineer, cloud engineer, DevOps/MLOps engineer, data architect, or head of analytics.

Don’t worry if you are completely new to the data and AI field; you have even more reason to be excited. There are no prerequisites for reading and grasping the concepts in this book. I promise that learning data and AI concepts will change the way you think about the problems you want to solve and show you how to tackle them by unlocking the power of data.

Scope of This Book

In this section, I describe what this book covers and what it does not.

As you can imagine, data and AI is a vast and complex field, and it is nearly impossible to cover every topic from concept to implementation in a single book. Because I am covering a breadth of topics, I limit the scope of this book to concepts only. As mentioned in the previous section, concepts are the starting point: they abstract real-world phenomena and expose you to a topic well enough that you can explore it further as needed.

  • Covered: the breadth of the data and AI field in just enough depth, sticking to the concepts
  • Not covered: specialization in individual sub-fields or implementation details

Outline of This Book

Before I present the outline, have a look at the figure above. What do you understand from it? Does it look too complicated? Are you aware of the layers and terms mentioned?

Now let’s have a look at this figure. Does this figure look simpler? What do you understand from it?

As it turns out, before building or working on any data and AI platform, we need to understand its underlying layers and what they are made of. Even before that, we need to make sense of the building blocks.

The book is divided into three parts: the first builds the foundation, the second covers the components, and the third discusses various data and AI platforms.

  • Data and AI Foundation: This part covers basic concepts such as Data, Mathematics, IT/Programming, Business Domain, and AI
  • Data and AI Components: This part exposes you to different layers of a data and AI platform and its components like Data Ingestion, Data Storage, Data Engineering, Data Science, Data Visualization, Data Operationalization, Data Architecture, Data Governance, Data Management
  • Data and AI Platforms: This part builds your understanding around various platforms like Open Source, AWS, Azure, GCP, Databricks, Snowflake

II Data and AI Foundation

Topics Covered: Data and AI Introduction, Mathematics, IT/Programming, Business Domain

In this part, I build the foundation you need before looking at the components of a data and AI platform.

We will start from scratch: first we cover the basic concepts of data and AI and how these fields are connected, then we focus on the core concepts of mathematics, IT/programming, and the business domain.

1 Data and AI Introduction

1.1 Data Concepts

Topics Covered: Data, Data Vs Information, DIKW Pyramid, Different Aspects of Data (Formats, Scope, Biases), Structured, Semi-structured and Unstructured Data, Data Usage (Scientific Research, Business Management, Finance, Governance), Data Analysis

1.1.1 Data

Data is the backbone of data-driven AI. So let’s first understand what data is.

Data is raw fact without any context, e.g. a number, symbol, character, word, code, or graph.

The word ‘data’ originated as the plural form of the Latin word ‘datum’, which literally means ‘something given’.

Broadly speaking, data can be any information in digital form, or the output of a sensing device or organ.

The terms data and information are often used interchangeably, which is not correct; we will cover the difference in an upcoming section.

Data, information, knowledge, and wisdom are closely related concepts, but each has its own meaning and its own role in relation to the others. We will touch on this soon as well.

References:

  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf
  • https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/2-identify-need-data-solutions
  • https://en.wikipedia.org/wiki/Data

1.1.2 Datum, Data and Dataset

We mostly talk about data, but occasionally you may hear terms like datum or dataset. Let’s understand the difference. A datum is a single piece of information, which can be treated as one observation. Data is the plural of datum, that is, multiple observations. A dataset is a homogeneous collection of data (each datum must have the same focus).
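
To make the distinction concrete, here is a minimal Python sketch. The temperature readings and the field names (day, temp_c) are made up for illustration; the check at the end is just one possible way to express "homogeneous", namely that every record has the same fields.

```python
# Hypothetical example: daily temperature readings (values are made up).
datum = 21.5                  # a datum: a single observation
data = [21.5, 22.1, 19.8]     # data: several observations

# A dataset: a homogeneous collection in which every record has the same fields.
dataset = [
    {"day": "Mon", "temp_c": 21.5},
    {"day": "Tue", "temp_c": 22.1},
    {"day": "Wed", "temp_c": 19.8},
]

# One way to check homogeneity: all records expose the same set of keys.
is_homogeneous = len({frozenset(row) for row in dataset}) == 1
print(is_homogeneous)
```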

References:

  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

1.1.3 Information

When data is processed and put into context, it becomes information, which humans can use in meaningful ways, e.g. for making decisions or forecasting.

References:

  • https://en.wikipedia.org/wiki/Information

1.1.4 Knowledge and Wisdom

When we put relevant information to work in a specific domain, it becomes knowledge. And when that knowledge is enhanced with first-hand experience, it becomes wisdom.

Let’s relate this to an example:

  • ‘100’ is data
  • ‘100 miles’ is information
  • ‘100 miles is quite a long distance’ is knowledge
  • ‘100 miles is very difficult to walk’ is wisdom

References:

  • https://en.wikipedia.org/wiki/DIKW_pyramid

1.1.5 Different Aspects of Data

Formats of Data

We can classify data formats into three categories: structured, semi-structured, and unstructured.

  • Structured data has a definite structure, such as a table with rows and columns.
  • Semi-structured data has some structure, such as JSON documents, key-value pairs, or data in a graph database.
  • Unstructured data has no specific structure, such as photos, audio, and video files.
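
The three formats can be sketched with Python's standard library. The names, values, and byte content below are invented for illustration, and the byte string is only a stand-in for a real photo or audio file.

```python
import csv
import io
import json

# Structured: tabular text with a fixed set of columns.
csv_text = "name,age\nAsha,31\nBen,45\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON records share some keys but may nest or omit fields.
json_text = '[{"name": "Asha", "tags": ["a", "b"]}, {"name": "Ben"}]'
records = json.loads(json_text)

# Unstructured: raw bytes with no schema (stand-in for an image or audio file).
blob = bytes([137, 80, 78, 71])

print(rows[0]["age"])          # every row has an "age" column
print(records[1].get("tags"))  # the second record has no "tags" field
print(len(blob))               # all we know up front is the size
```

Note that the structured rows can be addressed by column name with no checks, while the semi-structured records need a lookup that tolerates missing fields.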

References:

  • https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/2-identify-need-data-solutions
  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

Scope of Data

Data can be classified into two categories based on scope:

  • Population: we have access to all the data
  • Sample: only a portion of the data is available or feasible to collect
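
As a rough illustration of the difference, the sketch below builds a hypothetical population of 1,000 simulated height measurements and then draws a sample of 50 from it. The numbers (mean 170, spread 10, sample size 50) are arbitrary assumptions; the point is only that a sample statistic approximates the population statistic without matching it exactly.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: heights (cm) of all 1,000 members of a group.
population = [random.gauss(170, 10) for _ in range(1000)]

# Sample: only 50 of them are actually measured.
sample = random.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(round(population_mean, 1), round(sample_mean, 1))
```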

References:

  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

Biases in Data

Bias in data means the over- or under-representation of a sub-population; it may not be intentional. Common forms include:

  • Omission: using arguments from only one side
  • Source selection: including more authoritative sources from one side
  • Story selection: sharing stories that agree with one side
  • Placement: unimportant stories get prominent placement on reputed media platforms
  • Labelling: labels applied to one side or missing on the other
  • Spin: stories providing only one interpretation of an event
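
The definition at the start of this subsection (over- or under-representation of a sub-population) can be checked numerically. The sketch below is one simple way to do so, assuming we know each group's true share of the population; the group names and all numbers are hypothetical.

```python
from collections import Counter

# Hypothetical population shares and a hypothetical sample of group labels.
population_share = {"A": 0.5, "B": 0.5}
sample_labels = ["A"] * 80 + ["B"] * 20

counts = Counter(sample_labels)
total = sum(counts.values())

# Positive gap = over-represented in the sample; negative = under-represented.
representation_gap = {
    group: counts[group] / total - share
    for group, share in population_share.items()
}
print(representation_gap)
```

Here group A makes up 80% of the sample but only 50% of the population, so it is over-represented by roughly 30 percentage points, and group B is under-represented by the same amount.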

References:

  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

1.1.6 Data Usage

Data is used in the following fields:

  • Scientific research
  • Business Management
  • Finance
  • Governance

References:

  • https://en.wikipedia.org/wiki/Data

1.1.7 Data Analysis

Data analysis typically involves the following phases:

  • Data requirements
  • Data collection
  • Data processing
  • Data cleaning
  • Exploratory data analysis
  • Data product
  • Communication
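
The phases listed above can be walked through on a toy scale. The following is only a sketch under stated assumptions: the order records, the field names (customer, amount), and the helper to_float are hypothetical, and a real project would involve far more work at each phase.

```python
import statistics

# Data requirements (assumed): we want the average order value.
# Data collection: hypothetical raw records, as they might arrive from a source.
raw = [
    {"customer": "c1", "amount": "20.0"},
    {"customer": "c2", "amount": "n/a"},   # unusable value
    {"customer": "c1", "amount": "30.0"},
]

# Data processing: convert text fields to numbers where possible.
def to_float(text):
    """Hypothetical helper: return a float, or None if the text is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

processed = [{**r, "amount": to_float(r["amount"])} for r in raw]

# Data cleaning: drop records with missing amounts.
clean = [r for r in processed if r["amount"] is not None]

# Exploratory data analysis: a simple summary statistic.
mean_amount = statistics.mean(r["amount"] for r in clean)

# Data product and communication: a one-line report for stakeholders.
report = f"Average order value: {mean_amount:.2f} over {len(clean)} orders"
print(report)
```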

References:

  • https://en.wikipedia.org/wiki/Data_analysis

1.2 IT/Programming Concepts

Topics Covered: Technology, Information Technology, Data Structures and Algorithms, Data Processing and Storage, Data Models, Operational & Analytical Data, Databases, Data Warehouses, Streaming and Batch Data, ETL/ELT

1.3 AI Concepts

Topics Covered: Intelligence, Intelligent Agents, Applications (Web Search, Recommendation Systems, Self-driving Cars, Strategic Games), Aspects of AI (Search, Knowledge, Uncertainty, Optimization, Learning, Neural Networks, Language), Strong and Weak AI

1.4 From Data To AI

Topics Covered: Business Intelligence, Data Science, Machine Learning, Deep Learning, Artificial Intelligence

2 Mathematics

2.1 Linear Algebra

Topics Covered: Scalars, Vectors, Matrices and Tensors, Multiplying Matrices and Vectors, Identity and Inverse Matrices, Linear Dependence and Span, Norms, Special Kinds of Matrices and Vectors, Eigendecomposition, Singular Value Decomposition (SVD), The Moore Penrose Pseudoinverse, The Trace Operator, The Determinant, Principal Component Analysis

2.2 Multivariate Calculus

Topics Covered: Functions, Derivatives, Product Rule, Chain Rule, Integrals, Partial Derivatives, The Gradient, The Jacobian, The Hessian, Multivariate Chain Rule, Approximate Functions, Power Series, Linearization, Multivariate Taylor

2.3 Probability and Statistics

2.3.1 Probability

Topics Covered: Probability, Conditional Probability, Random Variables, Probability Distributions

2.3.2 Statistics

Topics Covered: Statistics, Descriptive Statistics (Univariate, Bivariate, Multivariate Analysis, Function Models), Inferential Statistics (Sampling Distributions & Estimation, Hypothesis Testing, Correlation, Causation & Regression), Bayesian Statistics (Frequentist Vs Bayesian Statistics, Bayesian Inference, Test for Significance), Statistical Learning (Prediction & Inference, Parametric & Non-parametric methods, Prediction Accuracy and Model Interpretability, Bias-Variance Trade-Off)

3 IT/Programming

3.1 Operating System Basics

Topics Covered:

3.2 Data Structures and Algorithms Basics

Topics Covered: Data Structures (Array, Linked List, Stack, Queue, Heap, Hashing, Binary Tree, Binary Search Tree, Graph, Matrix), Algorithms (Asymptotic Analysis, Searching and Sorting, Greedy Algorithms, Recursion, Dynamic Programming)

3.3 Programming Basics

Topics Covered:

3.4 Database Systems Basics

Topics Covered:

3.5 Cloud Computing

Topics Covered: Introduction, Public, Private and Hybrid Clouds, IaaS, PaaS and SaaS, Data and AI on Cloud, AWS, Azure and GCP

4 Business Domain

Topics Covered: Problem Solving, Problem Identification, Problem Definition, Prioritization, Root-Cause Analysis, Possible Solutions, Solution Evaluation, Cost-Benefit Analysis, Planning and Implementation

III Data and AI Components

Topics Covered: Data Governance, Data Architecture, Data Ingestion, Data Storage, Data Engineering, Data Science, Data Visualization, Data Operationalization

5 Data Governance 

Topics Covered: Data Governance Basics, Why Data Governance is Important?, Aspects of Data Governance, How to do Data Governance?

6 Data Architecture

Topics Covered: Data Architecture Basics, Why Data Architecture is Required?, How to build Data Architecture?

7 Data Ingestion

Topics Covered: Data Ingestion Basics, Types of Data Ingestion, Tools for Data Ingestion

8 Data Storage

Topics Covered: Data Storage Basics, Types of Data Storage, Tools for Data Storage

9 Data Engineering

Topics Covered: Data Engineering Basics, Tools for Data Engineering, Building Data Pipelines

10 Data Science

Topics Covered: Data Science Basics, Overall Process, Algorithms, Tools for Data Science

11 Data Visualization

Topics Covered: Data Visualization Basics, Why Data Visualization is Important?, Tools for Data Visualization

12 Data Operationalization

Topics Covered: Operationalization Basics, Why Operationalization is Required?, Tools for Data and AI Operationalization

IV Data and AI Platforms

Topics Covered: Open Source, AWS, Azure, GCP, Databricks, Snowflake

13 Open Source

Topics Covered: Building Data and AI Platform in Open Source

14 AWS

Topics Covered: Building Data and AI Platform in AWS

15 Azure

Topics Covered: Building Data and AI Platform in Azure

16 GCP

Topics Covered: Building Data and AI Platform in GCP

17 Databricks

Topics Covered: Building Data and AI Platform in Databricks

18 Snowflake

Topics Covered: Building Data and AI Platform in Snowflake

V Appendix

Topics Covered: SQL, Python, UNIX and Shell Scripting, Data Structure and Algorithms

19 SQL

Topics Covered: SQL, Data Models, ER Diagrams, Tables, Temporary Tables, Selecting (SELECT, FROM, DISTINCT), Filtering (WHERE, AND, OR, IN, NOT, BETWEEN, NULLs, Wildcards), Ordering (ORDER BY, DESC), Aggregating (GROUP BY, HAVING, AVG, COUNT, MAX, MIN), Subqueries, Joins (Cartesian, Inner, Outer <Left/Right>, Self), Sets (UNION, UNION ALL, INTERSECT), Aliases, Views, Common Table Expressions (WITH ... AS)

20 Python

Topics Covered: Programming, Installation, Basic Syntax & Variable Types, Data Types and Conversion, Basic Operators and Loops, Functions, Exceptions and Modules, Data Science Specific Modules (NumPy, SciPy, pandas, Matplotlib, scikit-learn)

21 UNIX and Shell Scripting

Topics Covered: Operating System, Architecture, Basic UNIX Commands, Shell Scripting