1 Data Concepts

Topics Covered: Data, Data Vs Information, DIKW Pyramid, Different Aspects of Data (Formats, Scope, Biases), Structured, Semi-structured and Unstructured Data, Data Usage (Scientific Research, Business Management, Finance, Governance), Data Analysis

1.1 Data

Data is the backbone of data-driven AI, so let's first understand what data is.

Data is a raw fact without any context, e.g. a number, symbol, character, word, code or graph.

The word ‘data’ originated as the plural form of the Latin word ‘datum’, which means ‘something given’.

Broadly speaking, data can be any information in digital form, or the output of a sensing device or organ.

The terms data and information are often used interchangeably, which is not correct. We will cover the difference in an upcoming section.

Data, information, knowledge and wisdom are closely related concepts, but each term has its own meaning and its own role in relation to the others. We will also touch on this soon.

References:

  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf
  • https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/2-identify-need-data-solutions
  • https://en.wikipedia.org/wiki/Data

1.2 Datum, Data and Dataset

Mostly we talk about data, but occasionally you may hear terms like datum or dataset, so let's understand the difference. A datum is a single piece of information, which can be treated as one observation. Data is the plural of datum, i.e. multiple observations. A dataset is a homogeneous collection of data (each datum must have the same focus).
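
As a minimal sketch of the distinction, the three terms can be written as Python values. The temperature readings, city names and field names are invented for illustration and do not come from the referenced slides.

```python
# Hypothetical temperature readings, invented for illustration.
datum = 21.5                    # a datum: one observation
data = [21.5, 22.1, 19.8]       # data: several observations

# A dataset: a homogeneous collection in which every record
# describes the same kind of thing with the same fields.
dataset = [
    {"city": "Pune", "temp_c": 21.5},
    {"city": "Delhi", "temp_c": 22.1},
    {"city": "Shimla", "temp_c": 19.8},
]

# Check the "same focus" property: all records share the same keys.
same_focus = all(record.keys() == dataset[0].keys() for record in dataset)
print(same_focus)
```

The key check is only one simple way to express homogeneity; a real dataset would usually also agree on units and data types.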

References:

  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

1.3 Information

When data is processed and put into context, it becomes information, which humans can use in meaningful ways, e.g. for making decisions or forecasting.

References:

  • https://en.wikipedia.org/wiki/Information

1.4 Knowledge and Wisdom

When we put relevant information to work in a specific domain, it becomes knowledge. And when that knowledge is enhanced with first-hand experience, it becomes wisdom.

Let's relate it to an example:

  • The number ‘100’ is data
  • ‘100 miles’ is information
  • ‘100 miles is quite a long distance’ is knowledge
  • ‘100 miles is very difficult to walk’ is wisdom

References:

  • https://en.wikipedia.org/wiki/DIKW_pyramid

1.5 Different Aspects of Data

Formats of Data

We can classify data formats into three categories: structured, semi-structured and unstructured.

  • Structured data has a definite structure, like a table with rows and columns.
  • Semi-structured data has some structure, like JSON documents, key-value pairs or graph databases.
  • Unstructured data has no specific structure; examples are photos, audio and video files.
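
The three formats above can be sketched with Python's standard library. This is an illustration under assumptions: the CSV text, the JSON text and the placeholder bytes are all made up, and the variable names are hypothetical.

```python
import csv
import io
import json

# Structured: a table with fixed columns (hypothetical CSV text).
csv_text = "name,age\nAsha,31\nRavi,27\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON records have keys, but each record
# may carry a different set of them.
json_text = '[{"name": "Asha", "age": 31}, {"name": "Ravi", "tags": ["new"]}]'
records = json.loads(json_text)

# Unstructured: raw bytes with no fields at all, as in an image file.
blob = bytes([137, 80, 78, 71])  # placeholder bytes, not a real image

print(rows[0]["age"])             # every row has the same columns
print(sorted(records[1].keys()))  # this record has no "age" key
print(len(blob))                  # only the size is directly available
```

Note that the CSV reader returns every value as text, so the structure of a table fixes its columns but not necessarily the type of each value.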

References:

  • https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/2-identify-need-data-solutions
  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

Scope of Data

Data can be classified into two categories based on scope:

  • Population, which means we have access to all the data
  • Sample, which means only a portion is available or feasible

In most cases we don’t have access to all the data. In these cases we collect the sample in a way that it captures most of the information in the population, so that we can estimate the patterns in the population from that sample.
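
A rough sketch of estimating a population pattern from a sample is shown below. The population is simulated (heights in centimetres drawn from a normal distribution), and the sizes, seed and variable names are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated population: 10,000 hypothetical heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Simple random sample: every member is equally likely to be picked.
sample = random.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# The sample mean should land close to the population mean.
print(round(population_mean, 1), round(sample_mean, 1))
```

Simple random sampling is only one possible design. In practice the population mean is unknown, which is exactly why the sample estimate is needed.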

References:

  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

Biases in Data

Bias in data means over- or under-representation of a sub-population. It may not be intentional.

These are the types of biases that exist in data:

  • Omission: using arguments from only one side
  • Source selection: including more authoritative sources from one side
  • Story selection: sharing stories that agree with one side
  • Placement: unimportant stories get prominent placement on reputed media platforms
  • Labelling: labels applied to one side or missing from the other side
  • Spin: stories providing only one interpretation of an event
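
Returning to the definition of bias as over- or under-representation, one simple way to check for it is to compare each group's share in the sample with its share in the population. The sketch below assumes the population shares are known; the group names, the numbers and the helper `representation_gap` are hypothetical.

```python
from collections import Counter

def representation_gap(sample_groups, population_shares):
    """Return sample share minus population share for each group.

    Positive values suggest over-representation, negative values
    suggest under-representation.
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }

# Hypothetical example: the population is split evenly,
# but the sample is mostly urban.
population_shares = {"urban": 0.5, "rural": 0.5}
sample_groups = ["urban"] * 80 + ["rural"] * 20

gaps = representation_gap(sample_groups, population_shares)
for group, gap in gaps.items():
    print(group, round(gap, 2))
```

This check only covers representation across groups. The media-style biases listed above, such as spin or placement, need a qualitative review and cannot be detected this way.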

References:

  • https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

1.6 Data Usage

Data is used in the following fields:

  • Scientific Research: Factual data is both an essential resource and a valuable output
  • Business Management: Data helps understand and improve processes
  • Finance: Whoever has the best and the fastest information gains the edge
  • Governance: Open data platforms help promote data-driven governance

References:

  • https://en.wikipedia.org/wiki/Data

1.7 Data Analysis

Data analysis is the process of obtaining raw data and converting it into information useful for decision-making by users.

Data is collected and analyzed to answer questions, test hypotheses, or disprove theories.

These are the steps of a typical data analysis process:

  • Data requirements: to understand what input would be required for analysis
  • Data collection: to collect those inputs from various sources
  • Data processing: to process or organize data for analysis
  • Data cleaning: to deal with incomplete, inaccurate, redundant elements
  • Exploratory data analysis: to explore data and understand the patterns
  • Data product: to convert data into actionable inputs
  • Communication: to convey the results of analysis to users
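
The steps above can be sketched end to end on a toy example. Everything here is assumed for illustration: the daily sales figures, the restocking threshold of 130 and the helper `to_number` are hypothetical, and a real analysis would be far more involved at each step.

```python
import statistics

# Data requirements: we decide we need daily sales figures.
# Data collection: raw values as they might arrive from a source.
raw = ["120", "135", "", "135", "n/a", "150"]

# Data processing: convert text into numbers where possible.
def to_number(text):
    try:
        return float(text)
    except ValueError:
        return None  # mark values that cannot be parsed

processed = [to_number(value) for value in raw]

# Data cleaning: drop the incomplete or unparseable entries.
cleaned = [value for value in processed if value is not None]

# Exploratory data analysis: summarise to see the overall pattern.
summary = {
    "count": len(cleaned),
    "mean": statistics.mean(cleaned),
    "max": max(cleaned),
}

# Data product: turn the summary into an actionable signal.
needs_restock = summary["mean"] > 130  # assumed threshold

# Communication: report the result in plain language.
report = (
    f"Average daily sales: {summary['mean']:.1f} "
    f"over {summary['count']} valid days. "
    f"Restock recommended: {needs_restock}."
)
print(report)
```

In this sketch the repeated value is kept during cleaning, since two days can legitimately have the same sales; whether duplicates count as redundant depends on the data.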

References:

  • https://en.wikipedia.org/wiki/Data_analysis