26 Mar 2020

5 Tips for DS/AI Beginners & Enthusiasts

Photo by Sam Truong Dan


While I believe that the more you struggle, the more you learn in the process, you can still learn from experienced professionals to avoid common roadblocks and challenges.

DS/AI starters and enthusiasts often reach out to me, and I try to help them. Apart from answering their common questions, I give them tips on how to become effective in their DS/AI journey, i.e. what to focus on, what to avoid, etc.

Here I am sharing 5 of those tips:

Master The Basics

Before you can work on any DS/AI problem, you need to master the basics. Understand the mathematics required for DS/AI: Linear Algebra, Multivariate Calculus, and Statistics & Probability.

Then learn about frequently used algorithms like Logistic Regression, Naive Bayes, Decision Trees, Random Forest etc.: where to apply which algorithm, and what the advantages and limitations of each are.

Understand the overall process of working in DS/AI projects, from defining the problem to deploying the solution.

Improve your coding skills to execute & validate your hypotheses efficiently. Python & R are the two major languages preferred by data scientists.

Learn Just Enough

DS/AI is a vast field; one cannot become an expert overnight. The biggest mistake newbies make is getting stuck in an endless learning loop.

You are learning DS/AI to solve real-world problems, so learn just enough concepts to get started. Participate in hackathons/competitions and practice on public datasets.

You need not read every book or attend every course on the planet before feeling confident enough to start with DS/AI. Just follow one decent book or course to build your basics and start working on problems. Keep the material (books & courses) for reference in case you get stuck on a problem.

Handle Data Like a Pro

Data is the heart of DS/AI initiatives & projects. If you can’t handle data, you can’t make progress with your use case. Data handling can make or break any DS/AI project.

Build your skills in data handling: learn the frequently used tips & tricks in SQL, Pandas (in Python) or data.table (in R) to handle & manipulate data efficiently.
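To make that concrete, here is a minimal Pandas sketch of the bread-and-butter moves worth having at your fingertips: inspecting, cleaning and aggregating a dataset. All the column names and values below are made up purely for illustration.

```python
import pandas as pd

# Toy data standing in for a real dataset (all values made up)
df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "units":  [10, None, 7, 3, 10],
    "price":  [9.99, 9.99, 19.99, 4.99, 9.99],
})

# Inspect: spot missing values early
print(df.isna().sum())

# Clean: drop exact duplicate rows, fill missing units with 0
df = df.drop_duplicates()
df["units"] = df["units"].fillna(0)

# Transform: derive a revenue column and aggregate by region
df["revenue"] = df["units"] * df["price"]
print(df.groupby("region")["revenue"].sum().sort_values(ascending=False))
```

The same workflow maps almost one-to-one onto SQL or data.table; the point is to make these moves second nature, whatever the tool.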

Learn from Kaggle kernels and popular GitHub repos how experts handle data. Gather and clean public datasets on your own, and refer to these resources whenever you get stuck.

Understand The Context

As a newbie data scientist, you may be tempted to chase the highest accuracy you can get on any DS/AI problem you are working on. But the real world is quite different: it’s not about accuracy every time. Many times robustness is more important, and most of the time the business may need models to be more interpretable.

Every framework and every feature has its own use case. If a particular approach worked in your previous project, it may not work in the current one. As we say, there is no silver bullet in the DS/AI field.

Hence understanding the context of the problem or the use case is important.

Improve Communication

Understanding the problem or use case, getting the overall context from the business, and explaining the outcomes of your complex models all require excellent communication skills.

Presenting your approaches and findings to a non-technical audience, such as the marketing team or the CXOs, is a crucial part of being a data scientist.

You need the ability to interpret data, tell the stories contained therein, and in general communicate, write and present well. Presentation, storytelling, data visualization, writing/publishing and business insights are all part of communication skills.

You may have to work hard to develop these skills, just as you would with any technical skill. But with time and practice, you can get really good at them.

Bonus Tip: Develop T-shaped Skill-set

DS/AI is an interdisciplinary field, and you can only gain expertise in one area at a time. Still, I would suggest you develop a T-shaped skill-set to collaborate better with the experts from other related fields working with you on a project.

T-shaped skills describe specific attributes of desirable workers. The vertical bar of the T refers to expert knowledge and experience in a particular area, while the top of the T refers to the ability to collaborate with experts in other disciplines and a willingness to use the knowledge gained from this collaboration.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

01 Mar 2020

Useful Resources to Learn Artificial Intelligence

Photo by Markus Spiske

While you are learning Artificial Intelligence (AI) and looking to get into this exciting field, you may wonder which resources to refer to, and how to know whether a book or course is worth your time and/or money.

Due to close association between Data Science (DS), Machine Learning (ML) and Artificial Intelligence (AI), these terms have been used in this article interchangeably.

This is not an exhaustive list by any means, but it is good enough to keep as your reference. You can build your own list of references once you get more awareness of the field.

This post is my attempt to make your task easier. I am listing major quality resources (mostly free) here, along with my view of each, which will help you make an informed decision.

You need not go through each and every resource mentioned here; I would suggest you build your foundation first using one course or book, and keep the other resources for reference.

Books to refer

Machine Learning with R

This is an excellent book for an R starter who wants to apply ML to any kind of project. All the main ML models are presented, as well as different performance metrics, bagging, pruning, tuning, ensembling etc. It is easy to scan through, with many tips and fully-solved textbook problems. Certainly a very good starting point if you plan to compete on Kaggle. If you already master both R and ML, this book is obviously not for you.

Machine Learning with R

Python Machine Learning

This is a fantastic introductory book on machine learning with Python. It provides enough background about the theory of each (covered) technique, followed by its Python code. One nice thing about the book is that it implements Neural Networks from scratch, giving the reader the chance to truly understand key underlying techniques such as back-propagation. Even further, the book presents an efficient (and professional) way of coding in Python, which is key to AI/ML.

Python Machine Learning

ISLR

The book explains the concepts of Statistical Learning from the very beginning. The core ideas, such as the bias-variance trade-off, are deeply discussed and revisited in many problems. The included R examples are particularly helpful for beginners learning R. The book also provides a brief but concise description of the function parameters of many related R packages. Compared to The Elements of Statistical Learning, it is easy for the reader to understand, and it does a wonderful job of breaking down complex concepts. If one wishes to learn more about a particular topic, I’d recommend The Elements of Statistical Learning. These two pair nicely together.

An Introduction to Statistical Learning: with Applications in R

Deep Learning

This is the book to read on deep learning. It is written by luminaries in the field (if you’ve read any papers on deep learning, you must have heard of Goodfellow and Bengio before) and cuts through much of the BS surrounding the topic: like ‘big data’ before it, ‘deep learning’ is not something new and is not deserving of a special name. Networks with more hidden layers detect higher-order features; networks of different types are chained together to play to their strengths; graphs of networks represent probabilistic models.

This is a theoretical book, but it can be read in tandem with Hands-On Machine Learning with Scikit-Learn and TensorFlow, almost chapter-for-chapter. The Scikit-Learn and Tensorflow example code, while only moderately interesting on its own, helps to clarify the purpose of many of the topics in the Goodfellow book.

Deep Learning

Hands-On Machine Learning with Scikit-Learn and TensorFlow

This book provides a great introduction to machine learning for both developers and non-developers. The authors suggest reading on even if you don’t understand all the math details. Highlights of this book are:

  • Extracting field-expert knowledge is very important; you should know which model will serve a given solution best. Luckily, a lot of models are already available from other scientists.
  • Training data is the most important part: the more you have, the better. So you should accumulate as much data as you can, preferably categorized; you may not yet know how you will use it, but you will need it.
  • Labelling training data is very important too; to train a neural network you need at least thousands of labelled data samples, and the more the better.
  • Machine learning algorithms and neural networks have been around for years, but the latest breakthrough is possible because of new optimizations and new autoencoders (which may help to artificially generate training data), allowing training to run faster and with less data.
  • Machine learning is still a time- and resource-consuming process. To train a machine learning model you need to know how to tweak parameters and how to use the different training approaches that fit the particular model.

The book demonstrates (including the code) different approaches using the Scikit-Learn Python package and also TensorFlow.

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

Data Science for Business

This is probably the most practical book to read if you are looking for an overview of data science. By the end, you know when terms like k-means and ROC curves should be used, and you have some context for when you start digging deeper into how some of these algorithms are implemented. It is pitched at the right level: there is just enough math to explain the fundamental concepts and make them stick in your head.

This isn’t a book on implementing these concepts or a catalogue of algorithms. That gives it the advantage of being something you can hand to an intelligent manager or an interested developer, and both will get a lot out of it. And if they are interested in the next level of learning, there are plenty of pointers. The chapter on presenting results through ROC curves, lift curves, etc. is also pretty interesting. It would be nice if the book had more hands-on material, but you can go to Kaggle and browse the current and past competitions to apply what you learn here.

Data Science for Business

Courses to attend

Machine Learning

Machine Learning is one of the first programming MOOCs Coursera put online, created by Coursera founder and Stanford Professor Andrew Ng. This course assumes that you have basic programming skills and some understanding of Linear Algebra. Knowledge of Statistics & Probability is not required, though.

Andrew Ng does a good job explaining dense material and slides. The course gives you a lot of structure and direction for each homework, so it is generally pretty clear what you are supposed to do and how you are supposed to do it.

Machine Learning by Andrew Ng

Deep Learning

When you are rather new to the topic, you can learn a lot by doing the deeplearning.ai specialization. First and foremost, you learn the basic concepts of neural networks: what a forward pass in simple sequential models looks like, what backpropagation is, and so on. I experienced this set of courses as a very time-effective way to learn the basics, worth more than all the tutorials, blog posts and talks I went through beforehand.

Doing this specialization is more than just a first step into DL. I would say each course is a single step in the right direction, so you end up with five steps in total. It builds a fundamental understanding of the field. But going further, you have to practice a lot, and eventually it might also be useful to read more about the methodological background of DL variants. Still, the course work gets you started in a structured manner, which is worth a lot, especially in a field with so much buzz around it.

Deep Learning Specialization

Fast AI

If your goal is to be able to learn about deep learning and apply what you’ve learned, the fast.ai course is a better bet. If you have the time, interleaving the deeplearning.ai and fast.ai courses is ideal — you get the practical experience, applicability, and audience interaction of fast.ai, along with the organised material and theoretical explanations of deeplearning.ai.

Fast AI Course

Kaggle Learn

Practical data skills you can apply immediately: that’s what you’ll learn in these free micro-courses. They’re the fastest (and most fun) way to become a data scientist or improve your current skills.

Kaggle Learn

Blogs to follow

KD Nuggets

KDnuggets is a leading site on AI, Analytics, Big Data, Data Mining, Data Science, and Machine Learning and is edited by Gregory Piatetsky-Shapiro and Matthew Mayo. KDnuggets was founded in February of 1997. Before that, Gregory maintained an earlier version of this site, called Knowledge Discovery Mine, at GTE Labs (1994 to 1997).

KD Nuggets

Analytics Vidhya

Analytics Vidhya provides a community-based knowledge portal for Analytics and AI professionals. The aim of the platform is to become a complete portal serving all knowledge and career needs of Data Science Professionals.

Analytics Vidhya

Towards Data Science

TDS joined Medium’s vibrant community in October 2016. In the beginning, their goal was simply to gather good posts and distribute them to a broader audience. Just a few months later, they were pleased to see that they had a very fast-growing audience and many new contributors.

Today they work with more than 10 Editorial Associates to prepare the most exciting content for their audience. They provide customized feedback to their contributors using Medium’s private notes. This allows contributors to promote their latest articles across social media without the added complexity they might encounter on another platform.

Towards Data Science

Podcasts to listen

Data Hack

This is Analytics Vidhya’s exclusive podcast series which will feature top leaders and practitioners in the artificial intelligence and machine learning industry.

So in every episode of DataHack Radio, they bring you a discussion with one such thought leader in the industry, covering their journey, their learnings and plenty of other AI-related things.

Data Hack

Super Data Science

Kirill Eremenko is a Data Science coach and lifestyle entrepreneur. The goal of the Super Data Science podcast is to bring you the most inspiring Data Scientists and Analysts from around the world to help you build a successful career in Data Science.

Data is growing exponentially, and so are the salaries of those who work in analytics. This podcast can help you learn how to skyrocket your analytics career, covering Big Data, visualization, predictive modelling, forecasting, analysis, business processes, statistics, R, Python, SQL programming, Tableau, machine learning, Hadoop, databases, data science MBAs, and all the analytics tools and skills that will help you better understand how to crush it in Data Science.

Super Data Science

The O’Reilly Data Show Podcast

Known as the father of all other data shows, “the O’Reilly Data Show” features Ben Lorica, O’Reilly Media’s chief data scientist. Lorica conducts interviews with other experts about big data and data science current affairs. While it does get technical and may not be the best place for a beginner to start, it provides interesting insights into the future of the AI/ML industry.

Data Show

YouTube Channels

DeepLearning.TV

DeepLearning.TV is all about Deep Learning, the field of study that teaches machines to perceive the world. Starting with a series that simplifies Deep Learning, the channel features topics such as How To’s, reviews of software libraries and applications, and interviews with key individuals in the field. Through a series of concept videos showcasing the intuition behind every Deep Learning method, they show you that Deep Learning is actually simpler than you think. Their goal is to improve your understanding of the topic so that you can better utilize Deep Learning in your own projects. They provide a window into the cutting edge of Deep Learning and bring you up to speed on what’s currently happening in the field.

DeepLearning.TV

Data School

Are you trying to learn data science so that you can get your first data science job? You’re probably confused about what you’re “supposed” to learn, and then you have the hardest time actually finding lessons you can understand! Data School focuses you on the topics you need to master first, and offers in-depth tutorials that you can understand regardless of your educational background.

Your host here is Kevin Markham, the founder of Data School. He has taught data science using the Python programming language to hundreds of students in the classroom, and to hundreds of thousands of students (like you) online. Finding the right teacher was very important to his own data science education, and he sincerely hopes he can be the right data science teacher for you.

Data School

Caltech Machine Learning

This is an introductory course by Caltech Professor Yaser Abu-Mostafa on machine learning that covers the basic theory, algorithms, and applications. Machine learning (ML) enables computational systems to adaptively improve their performance with experience accumulated from the observed data. ML techniques are widely applied in engineering, science, finance, and commerce to build systems for which we do not have a full mathematical specification (and that covers a lot of systems). The course balances theory and practice and covers the mathematical as well as the heuristic aspects.

Caltech Machine Learning

GitHub Repos

Awesome Data Science

This repo answers the questions, “What is Data Science, and what should you study to learn Data Science?” It is an awesome Data Science repository to learn from and apply to real-world problems.

As the aggregator says, “Our favourite data scientist is Clare Corthell. She is an expert in data-related systems and a hacker, and has been working at a company as a data scientist. Her blog helps you to understand the exact way to study as a professional data scientist.”

“Secondly, our favourite programming language for Data Science nowadays is Python. Python’s Pandas library has full functionality for collecting and analyzing data. We use Anaconda to play with data and to create applications.”

Awesome Data Science

Essential Cheat Sheets for Machine Learning and Deep Learning Engineers

Machine learning is complex. For newbies, starting to learn machine learning can be painful if they don’t have the right resources to learn from. Most machine learning libraries are difficult to understand, and the learning curve can be a bit frustrating. Kailash Ahirwar has created a repository on GitHub (cheatsheets-ai) containing cheat sheets for different machine learning frameworks, gathered from different sources. Have a look at the GitHub repository, and contribute cheat sheets of your own if you have any.

Essential Cheat Sheets

HackerMath for Machine Learning

Math literacy, including proficiency in Linear Algebra and Statistics, is a must for anyone pursuing a career in data science. The goal of this workshop is to introduce some key concepts from these domains that get used repeatedly in data science/AI applications.

As outlined by Amit Kapoor, “Our approach is what we call the ‘Hacker’s way’. Instead of going back to formulae and proofs, we teach the concepts by writing code and through practical applications. Concepts don’t remain sticky if their usage is never taught.”

The focus here is on depth rather than breadth. Three areas are chosen — Hypothesis Testing, Supervised Learning and Unsupervised Learning. They are covered to sufficient depth — 50% of the time on the concepts and 50% of the time spent coding them.

HackerMath for Machine Learning

So this is all I needed to cover in this article. Now you know which useful resources to keep handy in order to get proficient in the field of AI.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

23 Jan 2020

Artificial Intelligence from Novice to Professional

Photo by [Ben White](https://unsplash.com/@benwhitephotography?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)/[Hunters Race](https://unsplash.com/@huntersrace?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/faceless-person?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Why this post?

There is a general notion that Artificial Intelligence is all about algorithms. If you look at most courses, books, blogs, etc., you will find that they mostly cover how to apply certain algorithms and how to optimize them further.

After working on a few real-world AI/ML projects, you will realize that while algorithms are at the core, AI is much more than algorithms. From identifying & assessing AI opportunities in your organization/department to deploying and maintaining the solutions in production, a lot goes into AI/ML projects.

In this post, I am going to briefly touch on every aspect of the AI/ML project lifecycle. The good news is that as a data scientist you need not master every aspect, but knowing those areas to some extent will help you contribute better to real-world projects.

A single post can’t make you an AI professional, but it can ask the relevant questions & increase awareness; seeking answers to these questions can help you build your own AI roadmap.

We can segregate all these aspects of AI into 5 parts: AI Introduction, AI Pre-requisites, AI Concepts & Algorithms, Enterprise AI & Peripherals of AI.

I. AI Introduction

AI is a team game; every aspiring AI professional needs to have a common understanding of the AI/ML field. This part covers a simple What, Why & How of AI:

— Why should you learn AI?

— Why is AI important?

— What is Artificial Intelligence?

— What does an end-to-end AI project look like?

— What are the roles in AI projects, and who does what?

II. AI Pre-requisites

AI/ML has a steep learning curve because it is an amalgamation of many fields (Statistics, IT, Domain etc.), and some knowledge of these fields is required before you can start grasping AI/ML concepts. This part covers the pre-requisites of AI:

— Which topics of maths should you cover?

— Which programming languages, libraries & frameworks should you be aware of?

— How does data move in an organization?

— How is it stored and processed?

III. AI Concepts & Algorithms

AI/ML in itself is quite a vast field and has different techniques and frameworks to deal with different kinds of real-world problems. This part covers the major AI/ML techniques and what to use when:

— What are the major types of AI techniques?

— When to use what? What are the main concepts & algorithms of ML, DL & RL?

IV. Enterprise AI

Learning AI/ML concepts & algorithms is not enough; you will be solving business problems with what you have learned. You need to be aware of what happens when these concepts & algorithms are applied within an enterprise. This part covers the aspects of enterprise AI:

- What is the difference between Hackathons & Real-world projects?

- How to operationalize AI?

- How to build AI Strategy?

- Why are ethics & explainability important, & how can we make AI explainable?

V. Peripherals of AI

Apart from learning concepts and their applications, you may have specific needs at the moment: perhaps you want to get into the AI field, or you need to lead an AI initiative in your organization. This part covers those specific needs:

- How can you get into the AI field?

- How to lead AI initiatives in your organization?

- How to future-proof your AI career?

Conclusion

I hope by now you have an idea of what it takes to work on AI projects in the real world. Because of the different but overlapping fields involved, it is really hard for an AI aspirant to get in-depth knowledge of every aspect.

But what you can do is build a T-shaped skill-set in the AI field, where you go in-depth in the aspect of your choice and keep handy knowledge of the other related aspects.

I will cover the above-mentioned aspects of AI thoroughly in upcoming posts, stay tuned.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

27 Dec 2019

Building Blocks of Artificial Intelligence

There are a few core skills in every job. To perform a job, you need to be aware of its core concepts, you need to be aware of the end-to-end process, and you need to learn how to use the related tools. Artificial Intelligence is no different; it has its own core concepts, processes and tools.

This post covers the core concepts you need to learn, the end-to-end process you need to be aware of & important tools you need to master to work as a data scientist.

Please note that this post only outlines the concepts, processes and tools used by data scientists. I will publish the resources (mostly free) for these topics in an upcoming post.

Concepts to learn

Mathematics

Artificial Intelligence involves math; no avoiding that! This section covers the basic math learners need in order to be successful in almost any AI project/problem. So let’s start:

Multivariate Calculus

Calculus is a set of tools for analyzing the relationship between functions and their inputs. In Multivariate Calculus, we can take a function with multiple inputs and determine the influence of each of them separately.

In AI, we try to find the inputs which enable a function to best match the data. The slope or gradient describes the rate of change of the output with respect to an input. Determining the influence of each input on the output is also one of the critical tasks. All this requires a solid understanding of Multivariate Calculus.
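To see why, here is a tiny, hedged sketch of gradient descent in Python: we fit the slope of a line to noisy data by repeatedly stepping against the derivative of the error. All numbers are made up for illustration.

```python
import numpy as np

# Toy data: y is roughly 3*x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3 * x + rng.normal(0, 0.1, 100)

w = 0.0    # initial guess for the slope
lr = 0.1   # learning rate (step size)

for _ in range(500):
    # Loss L(w) = mean((w*x - y)^2); its derivative w.r.t. w:
    grad = np.mean(2 * (w * x - y) * x)
    w -= lr * grad  # step against the gradient

print(round(w, 2))  # converges close to the true slope, 3
```

This one-parameter example is exactly what calculus lets you scale up: with millions of parameters, the same "step against the derivative" idea trains a neural network.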

Linear Algebra

The word algebra comes from the Arabic word “al-jabr”, which means “the reunion of broken parts”. It is the collection of methods for deriving unknowns from knowns in mathematics. Linear Algebra is the branch that deals with linear equations and linear functions, which are represented through matrices and vectors. In simpler words, it helps us understand geometric objects such as planes in higher dimensions, and perform mathematical operations on them. By definition, algebra deals primarily with scalars (one-dimensional entities), but Linear Algebra uses vectors and matrices (entities with two or more dimensional components) to deal with linear equations and functions.

Linear Algebra is central to almost all areas of mathematics like geometry and functional analysis. Its concepts are a crucial prerequisite for understanding the theory behind AI. You don’t need to understand Linear Algebra before getting started in AI, but at some point, you may want to gain a better understanding of how the different algorithms really work under the hood. So if you really want to be a professional in this field, you will have to master the parts of Linear Algebra that are important for AI.
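A quick NumPy sketch of the vector and matrix operations that show up everywhere in AI (the values are purely illustrative):

```python
import numpy as np

v = np.array([1.0, 2.0])
M = np.array([[2.0, 0.0],
              [1.0, 3.0]])

# Matrix-vector product: how a neural-network layer transforms its input
print(M @ v)                  # [2. 7.]

# Dot product: the core of linear models and similarity measures
print(v @ v)                  # 5.0

# Solving the linear system M x = v
print(np.linalg.solve(M, v))  # [0.5 0.5]
```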

Statistics & Probability

Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data. Probability is the chance that something will happen — how likely it is that some event will happen.

Statistics helps you to understand your data and is an initial & very important step of AI. This is because AI is all about making predictions, and you can’t predict if you can’t understand the patterns in existing data.

Uncertainty and randomness occur in many aspects of our daily life and having a good knowledge of probability helps us make sense of these uncertainties. Learning about probability helps us make informed judgments on what is likely to happen, based on a pattern of data collected previously or an estimate.

AI often uses statistical inference to predict or analyze trends from data, while statistical inference uses probability distributions of data. Hence knowing probability & statistics and their applications is important for working effectively on AI problems.

Programming

To execute the AI pipeline, you need to learn algorithm design as well as fundamental programming concepts such as data selection, iteration, functional decomposition, data abstraction and organisation. In addition, you need to learn how to perform simple data visualizations using programming and embed your learning through problem-based assignments.

Machine Learning Algorithms

Machine learning algorithms can be divided into 3 broad categories —

  • Supervised learning,
  • Unsupervised learning
  • Reinforcement learning.

Supervised learning is useful in cases where a property (label) is available for a certain dataset (the training set) but is missing and needs to be predicted for other instances. Unsupervised learning is useful in cases where the challenge is to discover implicit relationships in a given unlabeled dataset (items are not pre-assigned). Reinforcement learning falls between these two extremes: there is some form of feedback available for each predictive step or action, but no precise label or error message.

The intrinsic details of the various algorithms are not in the scope of this series; you can refer to the resources mentioned in the next post to learn them.

Supervised learning can be further divided into Regression (Linear, Non-linear, etc.) & Classification (Logistic Regression, Decision Tree, Naïve Bayes etc.) algorithms. Some algorithms, e.g. Random Forests and Support Vector Machines, can be used for both regression and classification.
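As a minimal sketch of supervised classification in practice, here is a Random Forest trained and evaluated with scikit-learn on its bundled iris dataset; nothing here is specific to any real project.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labelled data: features X and known labels y
X, y = load_iris(return_X_y=True)

# Hold out a test set to measure generalization on unseen instances
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Random Forest works for both classification and regression
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```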

Unsupervised learning can be further divided into Clustering, Anomaly Detection & Association Mining.

Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Deep Learning Frameworks

Deep learning is a more advanced form of ML that solves specific problems where data is either unstructured or huge, or both. Neural Nets, CNNs, RNNs & LSTMs, and GANs are the frameworks one needs to be aware of.

Domain Knowledge

A lack of domain knowledge, while perfectly understandable, can be a major barrier for data scientists. For one thing, it’s difficult to come up with project ideas in a domain you don’t know much about. It can also be difficult to determine the type of data that may be helpful for a project: if you want to build a model to predict an outcome, you need to know what types of variables might be related to that outcome so you can make sure to gather the right data.

Knowing the domain is useful not only for figuring out projects and how to approach them, but also for having rules of thumb for sanity checks on the data. Knowing how data is captured (is it hand-entered? is it from machines that can give false readings for any number of reasons?) can help a data scientist with data cleaning and keep them from going too far down the wrong path. It can also inform which values are true outliers and which might just be due to measurement error.

Often the most challenging part of building a machine learning model is feature engineering. Understanding variables and how they relate to an outcome is extremely important for this. Knowing the domain can help direct the data exploration and greatly speed (and enhance) the feature engineering process.

Once features are generated, knowing which relationships between variables are plausible helps with basic sanity checks. Being able to glance at the output of a model and determine whether it makes sense goes a long way for quality assurance of any analytical work.

Finally, one of the biggest reasons a strong understanding of the data is important is that you have to interpret the results of analyses and modeling work.

Knowing which results are important and which are trivial matters for the presentation and communication of results. It’s also important to know which results are actionable.

Process to follow

Problem Definition

The first thing you have to do before you solve a problem is to define exactly what it is. You need to be able to translate data questions into something actionable.

You’ll often get ambiguous inputs from the people who have problems. You’ll have to develop the intuition to turn scarce inputs into actionable outputs, and to ask the questions that nobody else is asking.

Data Collection

Once you’ve defined the problem, you’ll need data to give you the insights needed to turn the problem around with a solution. This part of the process involves thinking through what data you’ll need and finding ways to get it, whether by querying internal databases or purchasing external datasets.

Data Understanding

The difficulty here isn’t coming up with ideas to test; it’s coming up with ideas that are likely to turn into insights. You’ll have a fixed deadline for your AI project, so you’ll have to prioritize your questions.

Suppose, for example, that sales are down for a particular group of customers. You’ll have to look at some of the most interesting patterns that can help explain why. You might notice that they don’t tend to be very active on social media, with few of them having Twitter or Facebook accounts. You might also notice that most of them are older than your general audience. From that you can begin to trace patterns you can analyze more deeply.

Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Done correctly, it increases the predictive power of machine learning algorithms by creating features from raw data that facilitate the learning process. Feature engineering is, in fact, an art.
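A small hedged example of what this looks like in Pandas. The column names and values are hypothetical; the point is that domain knowledge (time of day matters, monetary amounts are skewed) is what suggests the derived features.

```python
import numpy as np
import pandas as pd

# Hypothetical raw transaction data
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2019-12-24 09:10", "2019-12-25 23:45", "2019-12-28 14:30"]),
    "amount": [120.0, 15.5, 64.0],
})

# Domain knowledge suggests these derived features may carry signal
df["hour"] = df["timestamp"].dt.hour                  # time of day
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5  # weekend flag
df["log_amount"] = np.log1p(df["amount"])             # tame skewed amounts

print(df)
```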

Modelling

Depending on the type of question that you’re trying to answer, there are many modelling algorithms available. You run the selected algorithm/s on the training data to build the models.

Validation

Validation is the step used to evaluate the trained model on validation data. You use a series of competing machine-learning algorithms, along with the various associated tuning parameters, geared toward answering the question of interest with the current data.
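A minimal sketch of that idea with scikit-learn: each competing algorithm is scored by cross-validation, i.e. on folds of data it was not trained on (iris data again, purely for illustration).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate model on held-out folds
for model in [LogisticRegression(max_iter=1000), DecisionTreeClassifier()]:
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```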

Tuning

Tuning an algorithm or machine learning technique can simply be thought of as the process of optimizing the parameters that impact the model, in order to enable the algorithm to perform at its best.
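One common way to do this is a grid search: try every combination of candidate parameter values and keep the one that scores best on validation folds. A hedged scikit-learn sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate values for the parameters that impact the model most
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```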

Deployment

After you have a set of models that perform well, you can operationalize them for other applications to consume. Depending on the business requirements, predictions are made either in real time or on a batch basis. To deploy models, you expose them through an open API interface, which enables the model to be easily consumed from various applications.
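As a hedged sketch of the idea, here is what exposing a trained model behind an HTTP endpoint might look like with Flask (any web framework would do; the model file name is hypothetical and assumed to come from the modelling step):

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained model (hypothetical file from the modelling step)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify({"predictions": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=5000)  # real deployments sit behind a production server
```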

Tools to master

The list mentioned here is not exhaustive; it depends on what kind of problem you are solving and what tech stack you are working in.

SQL

Structured Query Language (SQL) is a standard computer language for relational database management and data manipulation. SQL is used to query, insert, update and modify data. Most relational databases support SQL.

As data collection has increased exponentially, so has the need for people skilled at using and interacting with data: people able to think critically and provide insights to make better decisions and optimize their businesses. The skills necessary to be a good data scientist include being able to retrieve and work with data, and to do that you need to be well versed in SQL, the standard language for communicating with database systems.
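Since the code examples in this post are in Python, here is a self-contained sketch of those bread-and-butter SQL operations using the standard-library sqlite3 module; the table and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Create and populate a toy table
cur.execute("CREATE TABLE orders (region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [("north", 120.0), ("south", 80.0), ("north", 50.0)])

# Query: filter, aggregate and group -- the core of data retrieval
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
print(cur.fetchall())

conn.close()
```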

R

R is a programming language and software environment for statistical analysis, graphics representation and reporting. In the world of AI, R is an increasingly popular language for a reason. It was built with statistical manipulation in mind, and there’s an incredible ecosystem of packages for R that let you do amazing things — particularly in data visualization.

Python

Python is a general-purpose, interpreted, interactive, object-oriented, high-level programming language. Python is no doubt the best-suited language for a data scientist. It is a free, flexible and powerful open-source language. Python cuts development time in half with its simple, easy-to-read syntax. With Python, you can perform data manipulation, analysis and visualization. Python provides powerful libraries for machine learning applications and other scientific computations.

Tensorflow

Currently, the most famous deep learning library in the world is Google’s TensorFlow. Google uses machine learning in all of its products to improve search, translation, image captioning and recommendations.

TensorFlow is the best library of all because it is built to be accessible to everyone. The TensorFlow library incorporates different APIs to build deep learning architectures at scale, like CNNs or RNNs. TensorFlow is based on graph computation; it allows the developer to visualize the construction of the neural network with TensorBoard, a tool that is also helpful for debugging. Finally, TensorFlow is built to be deployed at scale. It runs on both CPU and GPU.

Keras

Keras is a high-level neural networks API, capable of running on top of TensorFlow, Theano and CNTK. It enables fast experimentation through a high-level, user-friendly, modular and extensible API.

Keras allows for easy and fast prototyping (through user-friendliness, modularity, and extensibility). It supports both convolutional networks and recurrent networks, as well as combinations of the two. It runs seamlessly on CPU and GPU.
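A minimal hedged sketch of the Keras workflow (define, compile, fit) on random placeholder data, assuming a recent TensorFlow installation:

```python
import numpy as np
from tensorflow import keras

# Placeholder data: 100 samples, 8 features, binary labels
X = np.random.rand(100, 8)
y = np.random.randint(0, 2, size=100)

# Define a small fully-connected network
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Compile: pick optimizer, loss and metrics
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Train for a few epochs and evaluate
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
```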

So this is all I needed to cover in this article. Now you know which concepts to learn, which process to follow and which tools to master in order to get proficient in the field of AI.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

07 Dec 2019

How to Navigate Artificial Intelligence Landscape?

Photo by [Rob Bates](https://unsplash.com/photos/0eLg8OTuCw0?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/landscape?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Artificial Intelligence (AI) is a complex and evolving field. The first challenge an AI aspirant faces is understanding the landscape and how to navigate through it. Consider this: if you are travelling to a new city and you don’t have a map, you will have trouble navigating the city, and you will need to ask a lot of random people for directions without knowing how much they know about the place. All newcomers to AI have this trouble, and there are two ways to deal with it: arrange a map (or a guide), or travel on your own and learn from experience.

Note: the terms AI & DS/AI are used interchangeably in this article, where DS means Data Science.

This post intends to serve as a map of the Artificial Intelligence field.

You might have heard the terms data science, machine learning, deep learning, artificial intelligence etc., but might not be fully aware of what they mean, what to use when, and how these topics are interconnected. After going through this post, you should be able to understand what is where in the AI field.

Multi-disciplinary field

AI is a multidisciplinary field with sub-fields of study in Math/Statistics, CS/IT & Business/Domain knowledge.

Math/Statistics is required to understand the data and the relationships between data elements. CS/IT skills are required to process the data to generate insights. And business or domain knowledge is required to apply the above two skills in the context of a business problem.

Computer Science/IT

Programming is an essential skill for becoming a data scientist, but one need not be a hard-core programmer to learn AI. Familiarity with the basic concepts of programming will ease the process of learning AI tools like Python/R. These basics should take a candidate a long way on the journey to an AI career, as the work is all about writing efficient code to analyse big data, not about being a master programmer. Individuals should learn the basics of programming in Python/R (or any relevant language) before they begin to work on AI problems/projects.

Maths & Statistics

AI teams have people from diverse backgrounds like chemical engineering, physics, economics, statistics, mathematics, operations research, computer science, etc. You will find many data scientists with a bachelor’s degree in statistics and machine learning, but it is not a requirement to learn AI. However, familiarity with the basic concepts of Math and Statistics, like Linear Algebra, Calculus, Probability, etc., is important for learning AI.

Domain/Business Knowledge

Subsequently, the business knowledge a data scientist needs relates to the domain the project/analysis is in. For instance, a data scientist working for a credit card department in a bank will need to understand the specific business definitions, regulations, accounting policies & international standards, processes etc. This is the part that is more specific to the organization the data scientist is deployed in.

In my view, one thing to take care of while hiring data scientists is not to give too much preference to domain knowledge. That may severely limit the supply of AI talent to the organization. You would have a better chance of getting more value from AI by looking for those who are strong in math & programming and able to convert business objectives into mathematical models. Based on my observation, this is a much more difficult skill to find or train than domain knowledge.

Various Terminologies

As an AI starter, you will come across many similar-sounding terms. The first thing you need to do is understand what each term means and where it fits in the bigger picture. Data Science, Business Intelligence, Data Mining, Machine Learning, Deep Learning, Artificial Intelligence: let’s have a look at the Wikipedia definition of each term & later see how they are interconnected.

Data Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

Business Intelligence

Business intelligence comprises the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current and predictive views of business operations.

Data Mining

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Machine Learning

Machine learning is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.

Deep Learning

Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Artificial Intelligence

Artificial intelligence, sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.

Interconnection

Data mining uses statistics and other programming languages to find hidden patterns in the data to explain a certain phenomenon. It helps in building a perception about the data using both math and programming.

Machine Learning deploys data mining techniques as well as other algorithms to develop models of what is happening behind some data to forecast future outcomes.

Artificial Intelligence uses models developed by Machine Learning and other algorithms to lead to intelligent behaviour. AI is very much programming based.

  • Data Mining demonstrates patterns
  • Machine Learning forecasts with models
  • Artificial Intelligence shapes behaviours

So you see that these terms are different but still inter-connected.

Roles in Artificial Intelligence

Before looking into the skill-set of a data scientist, let’s have a look at the various roles required to work on and deliver an AI project; after all, it’s teamwork.

Every role has its own skills that are critical to AI projects at various stages.

Data Scientist

A data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning. She spends a lot of time in the process of collecting, cleaning and munging data. Domain knowledge is also an integral part of the skill-set.

Machine Learning/AI Engineer

Machine learning/AI engineers are sophisticated programmers who develop machines and systems that can learn and apply knowledge without being explicitly programmed for a specific domain.

Data Analyst

Data analysts translate numbers into plain English. Every business collects data, whether it’s sales figures, market research, logistics, or transportation costs. A data analyst’s job is to take that data and use it to help companies make better business decisions. There are many different types of data analysts in the field, including operations analysts, marketing analysts, financial analysts, etc.

Data Engineer

Data Engineers are responsible for the creation and maintenance of analytics infrastructure that enables almost every other function in the data world. They are responsible for the development, construction, maintenance and testing of architectures, such as databases and large-scale processing systems.

Data Architect

Data architects build complex computer database systems for companies, either for the general public or for individual companies. They work with a team that looks at the needs of the database and the data that is available, and creates a blueprint for creating, testing and maintaining that data architecture.

Analytics Manager

The analytics manager coordinates the different tasks that must be completed by their team for an AI project. Tasks may include researching and creating effective methods to collect data, analyzing information, and recommending solutions to the business.

Business Analyst

An AI business analyst converts the business problem statement into an AI problem statement, i.e. what data needs to be analyzed to arrive at the insights. The data is then reviewed with the technology team, and the results are delivered to the business team in the form of insights and data patterns. The business analyst should also be knowledgeable enough to apply various predictive modelling techniques and select the right model for generating insights for the problem at hand.

Quality Analyst

The job of a quality analyst includes checking the quality of the training dataset, preparing datasets for testing, running statistics on human-labelled datasets, evaluating precision and recall of the resulting ML model, reporting on unexpected patterns in outputs, and implementing tools to automate repetitive parts of the work. Experience in software testing with a data quality or DS/ML focus, an understanding of statistics, exposure to AI/Machine Learning techniques and coding proficiency in Python are some of the skills required for the job.

To work on AI projects in any of the above-mentioned roles, one needs a high-level understanding of the core concepts, but depth is required in the specific area you will be working in.

Academia Vs Industry

Academia and industry are different worlds, with different people and cultures. People who work in academia for a long tenure may find it difficult to adjust to industry culture, and vice versa.

There is also an academic trap: a career trajectory so specialized for academia that you’re unprepared for a job outside of it.

The academic trap happens in all areas of study, but for this post, we focus only on AI students who want to leave academia for AI positions.

Further, companies are often hesitant to hire people coming straight from academia, for various reasons:

  • In academia, individuals prefer writing papers over internships and winning grants over learning programming languages, so they skip the things that would help them in industry but not in academia. The things that matter for academic hiring, such as papers, talks and grants, are not as important in industry.
  • Working as a data scientist within a corporation requires an understanding of how the business world works, including how quickly deliverables need to be made, how to craft a good presentation, and how to word an email to make a request.
  • In academia, you are encouraged to find the most innovative and elegant solution. In industry, you are encouraged to spend as little time as possible finding an analytical solution that just fits the need.
  • Salary expectations for advanced degree holders are higher than for someone with only an undergraduate degree. This also pushes recruiters away, as industry works in a different way and its culture is simply different from the academic one. People coming from academia have to learn these lessons at their first job, which means a lot of risk for the hiring company.

Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on Twitter, LinkedIn or Instagram?

16 Nov 2019

AI For Everyone Course Review & Key Takeaways

Courtesy: <https://www.deeplearning.ai/ai-for-everyone/>

I have been working in the ML/AI field for 6 years now, and apart from the technical skills I acquired while working on projects, I have also discussed various aspects of ML/AI with my non-technical colleagues, who have mostly been senior managers, VPs or CXOs.

When I heard about the “AI For Everyone” course, I was a bit reluctant to attend it, as I thought I knew most of the generic stuff that might be covered in the course.

Recently, one of my colleagues discussed a few topics from this course with me, which intrigued me enough to seek a fresh perspective on them. So I attended the course on Coursera.

My motivation for writing this blog is to make sure that I have understood the key aspects of the course, and that I am able to help my non-technical colleagues and project stakeholders understand the benefits & limitations of using AI.

This blog will also serve as a refresher for professionals who have already attended the course, or who work in the AI field but couldn’t attend it due to time constraints.

Review of the Course

“AI For Everyone” is a short, non-technical course which helps bridge the gap for corporate professionals who are going to be impacted by AI transformation now or in the near future. Even professionals without a programming background come away understanding the capabilities and limitations of today’s AI, what it takes to incorporate AI into a company’s strategy, how some of the fear regarding AI is overhyped, and serious questions like how AI will impact repetitive, automation-prone jobs.

Key aspects of this course are:

  • A non-technical overview of AI field
  • Mainly for non-technical staff including executives (VPs & CXOs)
  • Refresher for AI starters & enthusiasts
  • Brings all stakeholders for AI projects on the same page
  • Mostly covers benefits & limitations of AI

In my view, the most important benefit of this course is that it brings all professionals onto the same page about what AI is and what its capabilities are. That fills a lot of gaps for people joining AI projects from different backgrounds and reduces friction when working together.

Outline of the Course

Let’s have a look at what we are going to learn in this course:

  • First week: AI technology. What is AI, and what is machine learning? What is supervised learning, i.e. learning input-to-output (A to B) mappings? What is data science, and how does data feed into all of these technologies? What can and can’t AI do?
  • Second week: What does it feel like to build an AI project? What is the workflow of machine learning projects (collecting data, building a system and deploying it), and of data science projects? How do you carry out technical diligence, to make sure a project is feasible, together with business diligence, to make sure the project is valuable, before you commit to a specific AI project?
  • Third week: How could such AI projects fit into the context of your company? Examples of complex AI products, such as a smart speaker or a self-driving car. What are the roles and responsibilities of large AI teams? The AI Transformation Playbook: what are the five steps that help a company become a great AI company?
  • Last week: AI and society. What are the limitations of AI beyond the technical ones? How is AI affecting developing economies and jobs worldwide?

Week 1: What is AI?

Courtesy: <https://www.deeplearning.ai/ai-for-everyone/>

This week’s content contains all the essentials for understanding AI, its capabilities, limitations, and how to implement it in a company’s automated processes to become an AI company.

Introduction

  • AI is creating value in almost all the sectors
  • Artificial Intelligence can be categorized broadly into two streams: ANI & AGI
  • Artificial General Intelligence (AGI): Machine can do whatever a human can do
  • Artificial Narrow Intelligence (ANI): a machine can do very specific tasks, like driving a car, playing requested music, searching the web etc.
  • What we know how to develop today are ANIs, each specialized in one task
  • AGI doesn’t exist yet (and likely won’t for a long time)

Machine Learning

  • AI has really taken off recently due to the rise of neural networks and deep learning
  • In the majority of the cases: AI → Machine Learning → Supervised Learning
  • Supervised Learning learns the mapping between input A to output B

What is data

  • We know that Supervised Learning learns the mapping between input A to output B
  • Examples of input A & output B from which ML algorithms learn is called data
  • Data is often unique to your business; it can be structured (tables, hierarchies etc.), unstructured (text, voice and images) or semi-structured
  • Data can be obtained by: manual labeling, observing behaviours or download from websites/partnerships
  • Pre-process the data before using it for AI models: garbage in, garbage out
  • Most data problems are related to incorrect labels & missing values (see the sketch below)
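A tiny Pandas sketch of that clean-up step on toy data (all values are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 29], "label": ["yes", "no", None]})

print(df.isna().sum())                            # missing values per column
df["age"] = df["age"].fillna(df["age"].median())  # impute numeric gaps
df = df.dropna(subset=["label"])                  # drop rows missing a label
print(df)
```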

The terminology of AI

  • Business Intelligence (BI) helps interpret past data, BI is mainly used for reporting or Descriptive Analytics.
  • Data Science is the science of extracting knowledge and insights from data which often ends up being captured in a deck.
  • Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.
  • Machine Learning, itself a subset of AI, contains subsets of its own, of which the most powerful is Deep Learning (DL)
  • Originally inspired by the neural networks of the brain, DL models are composed of several layers of computational units (artificial neurons), each capable of detecting more and more complicated characteristics of a training dataset

What makes an AI company?

  • Any company + deep learning != AI company
  • Strategic data acquisition
  • Unified data warehouse
  • Pervasive automation
  • New roles & division of labour

5 steps to becoming an AI company

  • Execute pilot projects to gain momentum
  • Build an in-house AI team
  • Provide broad AI training
  • Develop an AI strategy
  • Develop internal & external communications

What Machine Learning can and cannot do

  • ML/AI can’t do everything
  • Expectations are inflated due to media, academia & one-sidedly positive reports about AI
  • Imperfect rule of thumb: whatever a person can do with about a second of thought can probably be automated using AI
  • Technical diligence is required to assess the feasibility of the project
  • It is still hard to contextualize the responses of AI models
  • ML tends to work well when learning a simple concept with lots of available data
  • ML tends to work poorly when learning a complex concept, especially from small amounts of data or when performing on a new type of data

Intuitive explanation of deep learning
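
The course keeps this intuition non-technical, but the layered idea described above can be sketched in a few lines of NumPy; the weights below are random placeholders (training would adjust them), so this is illustrative only:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)  # simple nonlinearity applied by each neuron

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # input features

# Layer 1: 4 inputs -> 3 neurons; Layer 2: 3 neurons -> 1 output.
# Weights are random placeholders; training would adjust them.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

h = relu(W1 @ x + b1)             # first layer detects simple characteristics
y = W2 @ h + b2                   # next layer combines them into something more complex
print(y)
```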

Week 2: Building AI Projects

Courtesy: <https://www.deeplearning.ai/ai-for-everyone/>

The content of this week contains all the essentials for developing an AI project (large or small) and understanding how all jobs will be impacted by AI.

Workflow of a Machine Learning project

  • Collect data
  • Train model (iterate until it’s good enough)
  • Deploy model (get data back, maintain/update model)
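
A compressed sketch of this loop with scikit-learn, using synthetic data; the 0.90 “good enough” bar is an arbitrary stand-in for real acceptance criteria:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Collect data (a synthetic stand-in here)
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 2. Train model: iterate until it is good enough
for n_trees in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_tr, y_tr)
    if model.score(X_te, y_te) >= 0.90:   # arbitrary 'good enough' bar
        break

# 3. Deploy model (here: just persist it; real deployments also get
#    fresh data back and maintain/update the model over time)
joblib.dump(model, "model.joblib")
```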

Workflow of a Data Science project

  • Collect data
  • Analyse data (iterate until it’s good enough)
  • Suggest hypotheses/actions (deploy changes, re-analyse periodically)
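
A correspondingly small sketch of the data-science loop, where the output is an insight or suggested action rather than a deployed model; the checkout data and column names are invented:

```python
import pandas as pd

# 1. Collect data (invented checkout logs for two page versions)
df = pd.DataFrame({
    "page_version": ["A", "A", "B", "B", "B", "A"],
    "purchased":    [0, 1, 1, 1, 0, 0],
})

# 2. Analyse data: compare the conversion rate per page version
print(df.groupby("page_version")["purchased"].mean())

# 3. Suggest hypotheses/actions, e.g. "version B converts better:
#    roll it out, then re-analyse periodically with fresh data"
```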

Every job function needs to learn to use data

  • Increasing sales: optimizing sales funnel & automated lead sorting
  • Defect detection: optimizing the manufacturing line & automated visual inspection
  • Recruitment assistance: optimizing recruitment funnel & automated resume screening
  • Product recommendation: A/B testing & customized product recommendation
  • Precision agriculture: crop analytics & precise weed killing

How to choose an AI project

  • Organize cross-functional brainstorming with both the AI and domain teams
  • Look for the intersection of ‘What AI can do’ & ‘What is valuable for your business’
  • Pick projects which either increase revenue or reduce costs, or both
  • Use/buy standard AI models and develop in-house only those which don’t exist or offer extra business value

Method I

  • Automation of processes/tasks rather than jobs
  • Selection of activities with the greatest impact on the company’s business
  • Selection of the main pain points in the company’s business

Method II

  • Technical diligence in terms of state-of-the-art AI capability, data requirements & the time and people required to create, train & deploy AI models
  • Business diligence in terms of cost reduction, revenue increment & new product/business opportunity
  • Ethical considerations

Working with an AI team

  • Instead of only hiring AI specialists, train existing IT engineers in AI
  • The domain team must define the performance level to be achieved by the AI model
  • It is not realistic to ask the AI team for a 100% performance level
  • This is due to the limitations of ML, insufficient data, and mislabelled data

Technical tools for AI teams

  • AI evolves largely in the Open Source world
  • Free, high-quality ML/DL frameworks
  • Academic research papers in ML/DL
  • Refer to arXiv for research publications
  • Refer to GitHub for Open Source repositories
  • The AI team must also have significant computational capacity (GPUs)
  • This can be local or provided by prominent cloud providers (AWS, Azure & GCP)

Big data is not always required

  • Having more data almost never hurts
  • Data makes some businesses defensible
  • But with small data, you can still make progress

Week 3: Building AI in your Company

Courtesy: <https://www.deeplearning.ai/ai-for-everyone/>

This week’s content contains all the essential elements for developing AI in an organization (association, public organization, company).

Case study: Smart speaker

  • Typical examples: Amazon Echo/Alexa, Google Home, Apple Siri, Baidu DuerOS
  • Steps to process the voice command:

— Trigger word/wakeword detection

— Speech recognition

— Intent recognition

— Execute task

  • Usually, there is one AI team per stage
  • Execution of the task may be complex enough to be broken down into further sub-tasks
  • Example tasks: play music, volume up/down, make a call, tell the current time etc (the sketch below chains these stages together)
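
Here is a hypothetical sketch of how the four stages chain together; the stub functions are placeholders, not real wakeword, speech or intent models:

```python
# Hypothetical four-stage pipeline; each stage is usually owned by its own AI team.
def detect_wakeword(audio: bytes) -> bool:
    return True                      # placeholder: pretend the trigger word was heard

def recognize_speech(audio: bytes) -> str:
    return "play some music"         # placeholder transcription

def recognize_intent(text: str) -> str:
    return "PlayMusic" if "music" in text else "Unknown"

def execute_task(intent: str) -> None:
    print(f"executing task: {intent}")

def handle_voice_command(audio: bytes) -> None:
    if detect_wakeword(audio):               # 1. trigger word / wakeword detection
        text = recognize_speech(audio)       # 2. speech recognition
        intent = recognize_intent(text)      # 3. intent recognition
        execute_task(intent)                 # 4. execute task

handle_voice_command(b"")
```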

Case study: Self-driving car

  • There are 3 main steps for a self-driving car to decide on its route and speed
  • But many other perception and localization processes are also necessary for the decision
  • Key steps: Car detection → Pedestrian detection → Motion planning
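
And a similarly hypothetical sketch of the detection-to-planning flow; real systems fuse many sensors and planners, so the stubs below only show the structure:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    kind: str          # "car" or "pedestrian"
    distance_m: float

def detect_objects(camera_frame) -> list[Detection]:
    # Placeholder detections; a real system runs car and pedestrian
    # detectors over camera, radar and lidar inputs.
    return [Detection("car", 30.0), Detection("pedestrian", 12.0)]

def plan_motion(detections: list[Detection]) -> dict:
    nearest = min(d.distance_m for d in detections)
    # Slow down when anything is close; real planners also use maps,
    # localization and predicted trajectories of other road users.
    return {"speed_kmh": 20 if nearest < 15 else 50, "steering": "keep_lane"}

print(plan_motion(detect_objects(camera_frame=None)))
```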

Example roles of an AI team

  • Software Engineer: 50% or more of the team, in charge of developing software
  • ML Engineer: responsible for creating and training ML models
  • ML Researcher: in charge of following the evolution of the state of the art, doing research and possibly publishing the results of their research
  • Applied ML Scientist: in charge of adapting already published models to the specific projects of their company
  • Data Scientist: responsible for examining the data and providing insights to the AI team/executives
  • Data Engineer: in charge of organizing and saving data in an accessible, secure and cost-effective way
  • AI Product Manager: helps the AI team decide what to build, i.e. what’s feasible & valuable
  • Start with a small team and expand based on the progress

AI Transformation Playbook

  • Execute pilot projects to gain momentum
  • Build an in-house AI team
  • Provide broad AI training
  • Develop an AI strategy
  • Develop internal & external communications

AI pitfalls to avoid

  • Don’t expect AI to solve everything
  • Don’t depend solely on the technical team to come up with AI use-cases
  • Don’t expect the AI project to work the first time
  • Don’t expect traditional planning & processes to apply without any changes
  • Don’t think you need superstar AI engineers to succeed

Taking your first step in AI

  • Get colleagues to learn about AI
  • Start brainstorming about projects
  • Hire a few ML/AI people to help
  • Hire or appoint an AI leader
  • Discuss with CXOs/Board about possibilities of AI transformations

Survey of major AI applications

  • Computer Vision: Image classification, Object recognition, Object detection, Image segmentation, Tracking
  • Natural Language Processing: Text classification, Information retrieval, Named entity recognition, Machine translation
  • Speech: Speech recognition, Trigger word/wakeword detection, Speaker ID, Speech synthesis
  • Robotics: Perception, Motion planning, Control
  • General Machine Learning: Unstructured data (image, audio, text), Structured data

Survey of major AI techniques

  • Unsupervised learning
  • Transfer learning
  • Reinforcement learning
  • Generative Adversarial Networks (GANs)
  • Knowledge Graph

Week 4: AI and Society

Courtesy: <https://www.deeplearning.ai/ai-for-everyone/>

This week’s content contains all the essentials to understand the impacts of AI on our society.

A realistic view of AI

  • Too optimistic: sentient/super-intelligent AI killer robots coming soon
  • Too pessimistic: AI cannot do everything, so an AI winter is coming
  • Just right: AI can’t do everything, but it will transform industries
  • Performance limitations: unavailable, scarce or irrelevant data, or the randomness of patterns, can limit performance
  • Model explainability: explainable models are more accepted in business & society

Discrimination / Bias

  • AI model may develop gender and ethnic biases
  • The problem often comes from training data which contains these biases
  • Examples such as:

— Hiring tool discriminating against women

— Facial recognition working better for light-skinned individuals

— Bank loan approvals

— Toxic effect of reinforcing unhealthy stereotypes

  • Combating bias:

— Technical solutions: removing bias in training data, using more inclusive data

— Transparency and/or auditing processes (see the sketch below)

— Having a diverse workforce
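
One concrete form an auditing process can take is simply measuring model behaviour per group; a minimal sketch with invented audit data:

```python
import pandas as pd

# Invented audit data: model decisions alongside a protected attribute.
audit = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B"],
    "approved": [1,   1,   0,   0,   0,   1],
})

# Compare approval rates across groups; a large gap is a signal to
# inspect the training data and features for encoded bias.
print(audit.groupby("group")["approved"].mean())
```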

Adversarial attacks

  • AI models are sensitive to adversarial attacks
  • These are deliberate actions to fool an AI model
  • It is possible to fool a classifier by changing the values of just a few pixels of an image
  • While we (humans) cannot visually detect the changes
  • Adversarial defences do exist, but they incur some cost
  • Endless race between forgers and authorities
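
To make the pixel-level idea concrete, here is a sketch in the style of the fast gradient sign method (FGSM); the gradient function below is a random stand-in for what a deep learning framework would actually compute for a trained model:

```python
import numpy as np

def loss_gradient_wrt_input(image: np.ndarray) -> np.ndarray:
    # Stand-in: a real attack takes this gradient from the model
    # (e.g. via autodiff in a deep learning framework).
    rng = np.random.default_rng(0)
    return rng.normal(size=image.shape)

image = np.zeros((28, 28)) + 0.5        # placeholder grey image, pixels in [0, 1]
epsilon = 0.01                          # tiny, visually imperceptible step

# FGSM-style perturbation: nudge every pixel slightly in the
# direction that increases the classifier's loss.
adversarial = np.clip(image + epsilon * np.sign(loss_gradient_wrt_input(image)), 0, 1)
print(np.abs(adversarial - image).max())   # each pixel changes by at most epsilon
```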

Adverse uses

  • DeepFakes: Synthesize video of people doing things they never did
  • Undermining of democracy & privacy
  • Generating fake comments

AI and developing nations

  • Developing economies can ‘leapfrog’ by using technology developed by others
  • Examples: mobile phones, mobile payments, online education
  • Use applied AI in specific industries
  • Form public-private partnerships to accelerate development
  • Invest in the AI education of citizens

AI and jobs

  • Automation of activities by AI already has an impact on employment
  • But the number of jobs created by AI should be much higher than the number displaced by 2030
  • We should assess which tasks in our day-to-day jobs can be automated
  • Possibility of a conditional basic income
  • Building a lifelong learning society
  • Political solutions to support the AI transformation of society
  • We should all complement our current knowledge with knowledge of AI

Conclusion

Let’s recap what we have learnt in this course:

  • First week: AI technology: what is AI and what is machine learning? What is supervised learning, that is, learning input-to-output (A to B) mappings? What is data science, and how does data feed into all of these technologies? What can AI do and not do?
  • Second week: What does it feel like to build an AI project? What is the workflow of machine learning projects (collecting data, building a system, deploying it), and what is the workflow of data science projects? How do you carry out technical diligence to make sure a project is feasible, together with business diligence to make sure it is valuable, before committing to a specific AI project?
  • Third week: How could such AI projects fit in the context of your company? Examples of complex AI products, such as a smart speaker and a self-driving car. What are the roles and responsibilities of large AI teams? The AI Transformation Playbook: the five steps for helping a company become a great AI company.
  • Last week: AI and Society. What are the limitations of AI beyond the purely technical ones? How is AI affecting developing economies and jobs worldwide?

We have learned a lot in these four weeks, but AI is a complex topic, so the key is to keep learning, keep evolving.

References

AI For Everyone


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

05 Oct 2019

How to build your career in Artificial Intelligence?

When you start learning a new skill, the first thing is to look at the big picture and see where your would-be skills fit in the field. It gives you a context of what role you can play or are expected to play. And when the skill and the field are evolving and overwhelmingly large, you get so engrossed in the details that you most probably miss the purpose.

In my view, to understand the big picture, ask yourself ‘why’ more often and start with the end in your mind.

The following analogy is not specific to artificial intelligence; it applies in general.

As you can see above, to land the job you want, you need to crack the interview. To get the call for the interview, you need a network that can refer you for opportunities. To have a relevant & helpful network, you need impressive credentials or a portfolio. To have a solid portfolio, you need to work on the core skills of the job role that you want. To build core skills, you need to have the right resources. To gather the right resources, you need awareness of the field in which you want to build your career.

Now, reverse the order and assess at what stage you are and fill the gap in the coming months to build your career.

Based on the above approach, this post gives you a holistic view of what you need to know and do to build your road-map in the artificial intelligence field.

Now let us see what you would be doing in each of the steps in the above framework to build your career in artificial intelligence.

Before learning artificial intelligence, it is very much required that you understand the overall landscape and where all the buzzwords like artificial intelligence, machine learning, deep learning fit in. In this step, you need to learn different terminologies, their meaning and how these are interconnected.

Artificial intelligence is a vast field; no single person has all the required knowledge or does all the day-to-day tasks. Due to the variety of skills required in artificial intelligence projects, specific roles are evolving so that individuals can contribute according to their abilities. As part of this step, you also need to know almost all the artificial intelligence roles, which will give you a better idea of which one suits you.

Work on building blocks

This is the core part of artificial intelligence; knowing the concepts, processes & tools is the most important part of any job. In this step, you need to be aware of what is in the scope of the artificial intelligence field and which role needs which kind of skills.

After completing this step, you should know what is in the scope of the artificial intelligence field. You can relate the concepts, processes & tools to the role you think you can fit in. Think of this step as defining the scope of the artificial intelligence field, with electives based on your role.

Utilize the resources

There is no dearth of learning resources, but that has only confused starters in the artificial intelligence field. I get many questions about what books to read, what courses to enrol in, what blogs/portals to follow, and what data-sets to work on.

As part of this step, you need to make a list of all the prominent books, courses, blogs, portals, data-sets & podcasts etc. Please note that this list is not final; as you get more exposure to the field, you will find additional and better resources, but I think you will get enough exposure after going through this step to evaluate other resources comparatively.

Build your portfolio

Refining & honing the skills required for a job is one part; showcasing what you can offer is another. What if you have all the required skills but nobody looking for the skill-set is aware that you do? Having a credible portfolio is an effective way to showcase your skill-set and talent.

In this step, you should know what is needed to build an impressive portfolio even before entering the market looking for a relevant job.

Network & land the job

So you have filtered the best-suited role, honed your skill-set & built the portfolio as well. Still, your mission is not accomplished unless you have like-minded professionals in your network who are aware of your capabilities and can refer you for the jobs that match your skills. Interview preparation is another task in itself.

As part of this step, you will focus on what is required to network & land the job in the artificial intelligence field: how to network with like-minded professionals, how to search for relevant job openings, and how to prepare for & crack artificial intelligence interviews.

Make your career future-proof

Future-proofing your career is simply taking the extra steps to prepare yourself for constant technology disruption, a future that’s going to rely heavily on adaptability.

So rather than waiting for someone or some technology to replace your labour, you need to take a proactive approach and put yourself in a position where potential employers can’t afford not to work with you.

Conclusion

Artificial intelligence is an evolving field due to ongoing technology disruption. While the core concepts remain the same, tools, technologies & frameworks keep changing. Any roadmap prepared for artificial intelligence has to be reviewed at regular intervals, so keep a tab on what’s happening in the artificial intelligence field and revise your roadmap accordingly.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

05 Sep 2019

Artificial Intelligence for Business Leaders

Photo by [Jehyun Sung](https://unsplash.com/@jaysung?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/leaders?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

It is no secret that data is the new oil and AI is the new electricity. Due to the difference AI can make in every walk of life, it is the red hot subject among recent graduates/undergraduates and working professionals in different industries.

It is not only up to the technical team to make AI initiatives a success in the organization. Business value creation is driven top-down so business leaders are equally responsible for the success or failure of AI initiatives.

If you look at the failure rate of AI projects and the challenges faced by AI teams, you can easily figure out that most of the challenges stem from the absence of a data culture in the organization. And data culture can’t be inculcated in a day; it takes effort, and leadership needs to drive it as part of their change management program.

Apart from infusing data culture in the organization, leaders make most of the decisions in business, and AI is providing huge opportunities across many areas of business and domains. Hence, as a leader or manager, understanding AI and applying it effectively in our organization can make all the difference between us and our competitors.

Recently I published a book for AI starters and enthusiasts, where I shared a learning framework for getting into AI. During my discussions with managers and business leaders, they asked me whether a similar framework could be built for business leaders.

Framework

If we look at the day-to-day work of managers and leaders, they need not know every detail of the work & resources they are managing; just enough awareness can make a huge impact on AI initiatives. It helps in setting the right expectations with the customer or business, defining a better strategy, anticipating challenges and understanding the concerns of the team.

So here is the framework that I came up with for business leaders & managers: Prepare, Strategize, Execute, Reflect.

Let us have a look at each phase of the above framework:

Prepare

AI is a new & evolving field, so the first phase is to prepare ourselves to run these AI initiatives.

The very first step we need to take is to navigate the overall landscape. What do the terms AI, ML, DL, DS, DM & BI mean? How are they interconnected? What is the typical lifecycle of an AI project? What are the different roles and skill-sets required across the lifecycle?

The second step would be to build the capability to execute AI projects. The more we work on our concepts and skill-set, the more comfortable we will be running such projects and handling teams.

Strategize

After working on our concepts & skill-set, we are ready to work with a team on AI projects, but hold on. How do we know what the opportunities in our organization are? Which projects would be more beneficial to work on than others? To answer these questions, we need to define an AI strategy based on our organization & industry.

Apart from defining the strategy, we need to have an operational framework in place to get a clear understanding of what is to be done and who will be doing what.

Another area for leaders/managers is to anticipate what challenges their teams will face while working on AI projects. Having a view beforehand will help you handle the challenges better.

Execute

This phase covers managing the on-field execution of AI projects. As AI projects are different from typical IT projects, we need to emphasize a quick POC (proof of concept) to get a better view of the problem & to provide optimized estimates of the overall effort.

From managing stakeholders’ expectations to understanding the functional aspects of the use case to deployment & monitoring of the final solution, you need to manage everything, and you also need to tackle the day-to-day challenges your team faces.

Reflect

The last phase is to reflect on how the whole project went and what we can take away from it to do better in upcoming AI projects.

Firstly, we need to analyse the results of our deployed solution and compare them with the initially estimated output. We also need to take the customer’s or business’s feedback on the results and agree on a monitoring and maintenance framework.

Obviously, being an AI project, not everything will go smoothly. Every organization has a culture and way of doing things, which may or may not be in line with what we require for our AI initiatives. So, based on the lessons learnt from the current AI project, we may need to update our AI strategy & operational framework to suit our opportunities & constraints in a better way.

As you see, the above approach & framework captures what a business leader or manager needs to know and work out to make AI initiatives successful for their organization. It takes vision, strategy, effort & some self-reflection at the end.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

21 Aug 2019

4 Types of Challenges in DS/AI Projects & Initiatives

Photo by [Eila Lifflander](https://unsplash.com/@3ilaliff?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/challenges?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

In this post, I have analyzed frequent challenges encountered by DS/AI leaders & experts in their projects/initiatives. I have re-organized these challenges into 4 areas of concern; my idea is to build a framework around these challenges to suggest possible solutions in upcoming posts.


DS/AI promises considerable economic & social benefits, even as it disrupts the way we work. Almost everyone agrees that it is a field which can change the world, from healthcare to countering terrorism and even arts & sports.

DS/AI leaders & experts face many different challenges from ideation to productization; it is estimated that between 70% and 85% of DS/AI projects fail.

To get a holistic view of the challenges & failures, I looked at various articles available on the internet about the challenges in DS/AI projects & initiatives. You can find all the articles I referred to at the end of this post.

In the past few weeks, I have also spoken to many DS/AI leaders & experts on this topic and they have also given some useful insights. I will cover these specific insights into the challenges in the upcoming blog-posts.

In order to shed some light on the reasons why we observe such a high failure rate, I also analyzed the results of a survey conducted by Kaggle in 2017. You can see the full report here. The part of the survey relevant to this article is about the challenges companies face as far as their DS/AI efforts are concerned; the survey’s chart of the top fifteen challenges informs the categorization below.

While I am still documenting the challenges & possible solutions, I can categorize these challenges into four broad segments:

  • Cultural challenges (right talent, data literacy, realistic expectations etc)
  • Data-related challenges (data access, data quality, lifecycle management etc)
  • Operational challenges (suitable operating model with specific roles & responsibilities)
  • Technology-related challenges (appropriate tech-stack, right infrastructure etc)

Challenges in DS/AI Projects

Based on the above-mentioned sources (surveys, articles & experts), I have collected all the challenges in this post and tried to categorize them into four areas of concern: Culture, Operations, Data, Technology:

  • Data Quality — Data
  • Talent Gap — Operations
  • Company Politics — Culture
  • Data Access — Operations
  • Data Literacy — Culture
  • Data Privacy — Data
  • SME Gap — Operations
  • Heterogeneous Tech-stack — Technology
  • Unrealistic Expectations — Culture
  • Coordination with IT & other departments — Operations
  • Stakeholders’ Buy-in — Culture
  • Sponsorship — Culture
  • Deployment — Operations
  • Algorithm Limitations — Technology
  • Data Consolidation — Data
  • Opportunity Assessment — Operations
  • Model Explainability — Technology
  • Agility — Operations
  • Data Security — Operations
  • Solving the Wrong Problem — Operations
  • Organizational Maturity — Culture
  • Storytelling — Culture

Areas of Concern in DS/AI

Now, let’s consolidate the above challenges into their respective areas of concern:

Cultural Challenges

Company Politics, Data Literacy, Unrealistic Expectations, Stakeholders’ Buy-in, Organizational Maturity, Storytelling

Operational Challenges

Talent Gap, Data Access, SME Gap, Coordination with IT and other departments, Deployment, Opportunity Assessment, Data Security, Solving the Wrong Problem

Data Challenges

Data Quality, Data Privacy, Data Consolidation

Technology Challenges

Heterogeneous Tech-stack, Algorithm Limitations, Model Explainability

What next?

I believe that if we categorize these challenges into their respective areas of concern, we can build a framework around them which can address these challenges and suggest possible solutions.

I intend to explore these possibilities in upcoming blog-posts; if this looks interesting to you, stay tuned.


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also, feel free to visit my webpage https://ankitrathi.com

04 Jul 2019

Getting into Machine Learning & Artificial Intelligence

Photo by [Javier Allegue Barros](https://unsplash.com/@soymeraki?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/together?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Now you have covered everything from the content perspective. Let’s reflect on what you have learnt; after that, I will give you a few tips to overcome some learning obstacles.

You can notice that all these activities can be completed in 12 weeks.

As I mentioned earlier, you cannot become an expert in ML/AI in such a short time, but I can assure you that you will have all the knowledge, concepts, processes, tools and techniques available to tackle any challenge in ML/AI.

After going through the ‘Navigate’ step, you need to focus as much as you can on building your skills in the ‘Build’ step. You can start some of the activities of the ‘Launch’ step in parallel, as this will help you grasp and apply what you learnt in the ‘Build’ step.

On your path to learning and building your skill-set in ML/AI, you will find many obstacles. I am covering some of them here, along with what you need to do to overcome them.

Avoid Being a Junkie

Photo by [Mark Zamora](https://unsplash.com/@mmm_mark?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/junkyard?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Online learning is like chocolate. You can’t take just one online course or subscribe to just one newsletter/blog. Once you start, you want to have it all!

But the trouble with chocolate is that even if you eat the whole damn packet, it is not enough. The only thing that stops you is that the packet is empty, your stomach is full, and you feel kind of sick. But your appetite doesn’t really go away.

Similarly, having unlimited access to all the online courses, MOOCs, webinars, workshops, challenges, e-books, blogs, TED talks, podcasts, and all the other free (or affordable) resources can get you in trouble.

The key to avoiding the trouble is to prioritize, plan, schedule and log what you do. And analyse at regular intervals how you are progressing, what is working and what needs to be improved.

When you subscribe to a webinar or an online course, put it in your calendar. Set up reminders; make a commitment to attend it as if it was compulsory. In case of live events, if you can’t catch them live, schedule the time when you’re going to watch the replay. When you watch a webinar or an online course, don’t multitask. Close the door of your room, turn off your phone, and don’t open other tabs to check Twitter. After the webinar/course video ends, take several minutes to debrief. Find something you can put into practice right away and put it on your to-do list.

At the end of the month, evaluate. What have you learned? What worked? What didn’t?

Make a list of your favourite bloggers, podcasts, and other resources. Read them, listen to them or watch them as a part of your daily schedule. If you set aside time to do this, you will be able to concentrate on learning when you learn and to create when you create.

Learn to Solve

Photo by [Olav Ahrens Røtne](https://unsplash.com/@olav_ahrens?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/solution?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Learning is 20% information and 80% action.

If you aren’t taking action, you aren’t learning; you’re just wasting time. The key to making learning more effective is Active Learning.

According to a study of learning-centred approaches to education, students learn more when they participate in the process of learning. Active learning is discussion, practice, review, or application: problem-solving, exploring new concepts in groups, working out a problem on a piece of paper.

Active learning is any learning activity in which you participate or interact with the learning process, as opposed to passively taking in the information.

When given the opportunity to actively engage with the information you are learning, you perform better. It nurtures the brain, giving it an extended opportunity to connect new and old information, correct previous misconceptions, and reconsider existing thoughts or opinions.

Active learning encourages your brain to activate cognitive and sensory networks, which helps process and store new information. Similar research at Cornell University found that learner attention starts to wane every 10–20 minutes during lectures, which means instructors are continuously fighting to keep attention. Incorporating regular, varied active-learning moments is a great way to recapture an audience.

Similarly, when you are learning ML/AI theory, try to apply what you have learnt as soon as possible. This will keep you engaged and motivated for a longer period of time.

Just Enough Approach

Photo by [Scott Webb](https://unsplash.com/@scottwebb?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/minimal?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

The Just Enough Approach means learning just enough to perform a specific task. As we saw in the sections above, it does not help to be a course junkie. You need to learn in order to solve a problem. So don’t get into an endless learning loop; learn just enough to solve the problem at hand.

Unless you apply what you have learnt, there is no benefit in learning. And the depth of DS/AI theory is practically endless; without a problem to solve, you may go as deep as you want and still have no idea how to apply what you have learnt.

I always suggest that my students apply the Just Enough Approach while learning DS/AI. This is the path of optimal learning: you learn the concept and solve the problem as well.

As we learnt in this post, there are a few obstacles on the path to learning and applying ML/AI, and you can overcome them just by being aware that they exist.

Conclusion

We have looked at the overall landscape, the ML/AI terminologies and the roles in ML/AI projects. Then we worked on the building blocks by listing out the concepts, processes, and tools you need to learn. After that, we listed the resources to refer to during the learning stage.

Post that, we learned how to build an impressive portfolio, how to start building a network and how to prepare for interviews in order to search for and land the job. Then we looked at ways to make our career future-proof.

Now you have all the knowledge, concepts, processes & tools with you to tackle any challenge you face in ML/AI field.

I believe all the content provided in this series is helpful to you. Whether you liked the content or not, I would ask you to share your feedback with me.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

03 Jul 2019

Making Your ML/AI Career Future-Proof

Photo by [Linus Ekenstam](https://unsplash.com/@linusekenstam?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/future-proof?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

This blog-post talks about how you can make your DS/AI career future-proof. Let’s first understand why this is required: what exactly does it mean to future-proof your career?

Future-proofing your career is simply taking the extra steps to prepare yourself for constant technology disruption, a future that’s going to rely heavily on adaptability.

So rather than waiting for someone or some technology to replace your labour, you’ll take a proactive approach and put yourself in a position where potential employers can’t afford not to work with you.

Follow these six steps and you will secure your place in the workforce alongside technology disruption instead of getting edged out:

Build ‘Evolve’ Mindset

Technology is only going to keep evolving, and it’s always going to get better. While you and I may not know exactly how, we do know this change is inevitable. So as technology in the work environment evolves, so should the workforce.

People who are more adaptable and resilient will be the ones who will make the cut. They will also be the employees who are not threatened by technology disruption.

But how to become more resilient and adaptable?

First, you can prepare for the future as you are doing today and be ready to change course at short notice. Second, when your environment begins changing, have an open mind about what the transition may bring, and be ready to take it head-on instead of resisting it and sticking to your old habits.

One of the best ways to build the confidence necessary for this new technology-driven world is to level-up your digital skill-set.

Hone Core Skills

Photo by [Jo Szczepanska](https://unsplash.com/@joszczepanska?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/collections/1445889/essential-skills?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Technology disruption will also create a need for higher skill levels. You may have already witnessed the fact that having a college degree doesn’t make you stand out anymore. In fact, it is going to become redundant in the near future.

Organizations are placing more emphasis on the core skills needed to do the job; the trick is to never stop learning and to keep honing your core skills.

You should continue to acquire new relevant skills as well, especially ones which will be in demand.

The best way professionals can do this is by enrolling in online courses, or you can also learn on your own. As with degrees, don’t focus on collecting training certificates; try to gain as much hands-on learning as you can.

Develop Soft Skills

Photo by [Tim Gouw](https://unsplash.com/@punttim?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/collections/1276243/soft-skills?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

You may be able to train a robot to automate the technical skills of your job, but soft skills such as leadership, communication, collaboration, and time management are still tasks only humans do well.

Since technology is not at the point where robots have the same emotional intelligence as humans, these soft skills are, and will continue to be, in high demand.

The key is to consciously try to improve on your soft-skills. Fortunately, you can also hone your soft skills with online classes so you can be more proactive in this department too.

Maintain Digital Portfolio

Photo by [Joanna Kosinska](https://unsplash.com/@joannakosinska?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/photos?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

In one of the earlier lessons, we learnt how to build our portfolio. Most people wait until they are ready to find a new job to update the portfolio. But this is a huge mistake, since you may forget what you’ve been up to or, worse, forget to mention a major milestone or achievement.

A better approach is to always update your projects and accomplishments as you work through them.

This guarantees that you never forget to highlight something and you’ll always have a list of your achievements on hand.

It is also a great idea to keep your performance reviews here, both the good ones and the bad. You can always refer back to these anytime you need a pick-me-up or if you want to narrow down your specific areas needing improvement.

Expand Network Globally

Photo by [MD Duran](https://unsplash.com/@mdesign85?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/networking?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Couple the evolution of technology with the rise of telecommuting, remote work, and networking sites like LinkedIn, and you will quickly find connections outside of your local network.

To stay in touch with these global members of your team, you will want to become a pro at virtual project management tools like Trello and messaging platforms like Slack.

Apart from your team members, become more familiar with the geographic regions that pertain to your job by reaching out to professionals in those countries too.

If you get an opportunity to relocate or take on an international project, go for it; it will give you a leg up on your competition and go a long way towards future-proofing your career too.

Monitor Industry Trends

Photo by [Stephen Dawson](https://unsplash.com/@srd844?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/trends?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Understanding the future of your industry is a giant factor in how well you can truly future-proof your career. Essentially, there’s no point specializing in a language or tool that may be completely redundant in a few years.

The jobs most at risk are those consisting of low-level repetitive tasks each day. So if that description matches your current position, it is time to add more skills to your repertoire and prepare for the redundancy to come.

But remember, an increased use of technology in the workplace doesn’t always mean your job is at risk.

Rather, it could mean you will just need to know how to use upcoming technology/framework as a way to potentially do your job better or more efficiently.

Stay abreast of industry news to see whether a trend impacts only your company or your entire industry as a whole. If you see an industry-wide trend, that’s a good sign you should learn those new skills and consider switching career paths.

Either way, follow these steps and you’ll ensure your career — both now and in the future — is set up for success, even as technology disruption and automation move in.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

02 Jul 2019

Networking & Landing the Job

Photo by [MD Duran](https://unsplash.com/@mdesign85?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/networking?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

So you have done all your homework: you have traversed the DS/AI landscape, built the required concepts, learnt the relevant tools, and started working on your portfolio as well.

Now it’s time to network with people, work on your CV/resume, start looking for a job and prepare for interviews.

Networking on LinkedIn

When it comes to networking on LinkedIn, there are two primary functions, just like in real-world networking: building your network and nurturing relationships. I’ll cover both in this post.

Enrich Profile

Before using LinkedIn for networking, you need to update your profile and complete all sections to make it authentic, relevant, and compelling.

I would suggest you avoid reaching out to new contacts and accepting connection requests until your profile is in tip-top shape. You need to assume that people will check out your profile when you are connecting, and for many connections, that’s your first impression.

Deny it all you want, but first impressions matter a lot.

Make sure your photo, headline, and summary all tell a compelling story of who you are and what you have to offer. Show that you’re worth connecting with! You can also request a recommendation from someone you’ve worked with so new connections have an impression of how you have helped others.

Interact

Being on LinkedIn in name only is not enough; you need to interact with people to expand your network and gain visibility. Provide status updates on a regular basis. It keeps you visible to the people in your brand community.

Like and comment on LinkedIn posts that you think are valuable, and share the posts with your connections and the groups you belong to. Share content you find on other sites, like Fast Company, Forbes or Huffington Post, that you think would be valuable.

When sharing, remember to add a note saying why you think the content is valuable and expressing your point of view.

Publish

If you don’t have a blog, you can publish your articles on LinkedIn. As with a blog, you need to keep your articles engaging: they should entertain, educate or inspire. If an article does all three, nothing is better than that.

Write articles to solve problems. Divide your article into logical sections, like heading, introduction, main points and conclusion.

Use images, and embed content if required, to tell the story in your own words.

According to correlation data, more post likes should also get you more LinkedIn shares, post views, and comments. You can encourage people to like your post with a call to action.

Reach Out

When you’re reaching out, remember to customize the request instead of using the default “I’d like to add you to my network.” You can even customize requests when using the LinkedIn mobile app.

Accept Requests

Know your criteria for accepting requests (and remember what you lose if you are a closed networker). LinkedIn alerts you when you receive a request; get in the habit of accepting them soon after receiving them.

Guide/Help Others

LinkedIn does the heavy lifting when it comes to staying on top of people in your network. They provide notifications when someone you know has a birthday, work anniversary, or new job.

You have the option of “liking” the notification or sending a message. I suggest choosing “send a message” and writing a personal note. Determine a time of day you will check-in and get in the habit of doing it daily so you don’t miss any of your connections’ important dates.

Now you have the keys for unlocking the power of networking on LinkedIn.

In all of these interactions, remember that successful personal branding is the result of being authentic and being consistent.

Preparing the CV

Keep it Short

You don’t have to list everything you’ve ever done. Try making a master resume which has all of your work history, and then pull what’s most relevant for each job.

Don’t put Objective

An objective serves no purpose on your resume: it doesn’t help you distinguish yourself, while taking up space that can be better utilized to showcase more useful information about you, such as projects, experience, etc.

Picture

Based on the region in which you are applying, decide whether to include your picture. While this is common in some countries in Europe and South America, it is not appropriate in the US and Asia.

Always Proofread

Having typos or grammar errors in your resume can be the quickest way to have your application eliminated. Use spell-check and have a friend check it over.

Mention Accomplishments

Quantify your accomplishments where possible, and use verbs to start the bullet points.

Add Cover Letter

If there is a place to submit a cover letter, do so. Just like your resume, you can have a master cover letter that you pull paragraphs from. Tailor at least the first and closing paragraph to the company and make sure you get their name correct.

Include the results of, and links to, your projects. For instance, if you have entered a competition on Kaggle, it is worth mentioning your ranking, and you can put your solution on GitHub.

Mention Portfolio

Mention your LinkedIn, GitHub and Kaggle profile links on your resume. They can reveal more useful information about you, and it is also good to showcase the relevant work you have done.

Tailor for Job Description

It is crucial that you tailor the description of your experience towards the job’s requirements, because it can make your skills and experience seem more relevant and a better fit for the position.

Highlight Coursework

You should list coursework that is relevant to the position you are applying to. It is a quick way to show hiring managers your background, which can influence their decision to invite you for interviews.

Don’t put Common Projects

Common projects and homework are highly guided problems that will not help you stand out. You should instead include projects that demonstrate your ability to solve real-world problems and that reflect your interests.

Don’t Rate your Skills

Numerical ratings of your skills are not meaningful, since they are not standardized or calibrated.

Searching the Job

Have a Growth Mindset

A growth mindset is believing that your most basic abilities can be developed through dedication and hard work; brains and talent are just the starting point. Don’t use “talent” in others as an excuse for your own laziness. What you need is to learn the right way and practice many times until you are good at it.

Take Notes

Take note of all the interview questions you have been asked, especially those you failed to answer. You can fail again, but do not fail on the same question. Most of the time, you learn more about the topic when you reflect on the notes you have taken.

Browse widely

Jobs in DS/AI go by many names besides data scientist. These include product analyst, data analyst, research scientist, quantitative analyst, and machine learning engineer. Try searching for all of these terms to find positions and then use the description to evaluate the fit.

Do Self-reflection

Rather than applying to every type of data science job you find, think about your strengths and where you want to specialize. Some data scientists have strong statistics skills plus the ability to work with messy data and communicate results, while others have very strong coding skills, perhaps with a background in software engineering, and focus on putting machine learning models into production.

Don’t demand perfection

Your first job in the field probably won’t be your dream one. You may need to start out by moving to a position where you can leverage your other skills. That doesn’t mean you shouldn’t have certain requirements and preferences, but it does mean you’ll want to have some flexibility. The most important criterion for your first job may be that it has a supportive environment, with lots of other analysts, where you can learn a lot.

Don’t undersell yourself

Job descriptions are generally wish-lists with some flexibility. If you meet 80% of the requirements but are otherwise a good fit, you should still apply. With that said, be wary of job descriptions that describe a unicorn where they need every skill on earth. It usually means they don’t know what they’re looking for and they expect a data scientist to come and solve all their problems without any support.

Look on LinkedIn

Check if you know anyone at the company you’re interested in. If you don’t know anyone, see if there’s anyone in your alumni networks. You can also check for second connections and see if the person who bridges you can introduce you. Many jobs get hundreds if not thousands of applications for each position and having someone refer you or give you feedback on what the team is looking for is enormously helpful.

Check out Meetups/Conferences

Sometimes hiring managers will come to meetups or conferences to recruit. You may also meet someone from the company or sub-industry you’re interested in. If you ask whether their company has an opening or if they can refer you, though, you’ll probably be directed to the company’s career page. This is why it’s important to build your network before you need it; starting off with a strong ask is not a great way to build a mutually fulfilling relationship.

Cracking the interview

Prepare & Practice

Most of the questions (at least 60–70%) are based on your background and the skills required for the job. This means you can prepare a list of questions you are likely to be asked and keep your responses comprehensive yet short.

You can also ask the recruiter about the number of rounds and what is expected at each stage; get as much information as you can and keep yourself prepared.

The Resume is a Fair Game

Whatever you have mentioned in your resume is fair game, so keep everything at your fingertips. If you have mentioned a skill or a course, be prepared to be grilled about it.

Research about Company

You may have done some research when writing your cover letter, but once you get the interview, dig a little deeper. Find out about your interviewers’ professional accomplishments. I was very impressed when a candidate I was interviewing asked some technical questions about a presentation I had given.

Have Questions Ready

Each interviewer should leave time for questions at the end. If they don’t, that’s a bad sign. Interviewing is a two-way process: you’re evaluating them as much as they’re evaluating you. If you don’t know what to ask, you can prepare them based on what is important to you like the quality of life, culture, and management practices. You can ask each interviewer different questions to maximize how many you can get answered, but you could also try to ask multiple people the same questions to see if and how their answers differ.

Never Mention the Numbers

While it is a cultural thing, in some countries it is OK to mention your current salary, while in others the law prohibits the recruiter from even asking for it. Whatever the case, never volunteer your expected salary.

If you name a number, you risk them giving you a lower offer than they would have otherwise. Their offer should not depend on your current salary or expectations; it should reflect your worth in the market and be similar to the salaries of your peers there.

Handle Rejection Gracefully

You will almost inevitably get rejected in some job interviews, maybe dozens of them. DS/AI is a competitive field and this is a very normal process that everyone goes through. If they reject you or you don’t hear back from them, you can express your disappointment politely and thank them for their consideration.

You can ask for feedback, but know that many hiring managers would not be able to give you any because they want to avoid the possibility of being sued. While it is okay to take a little time to wallow, just don’t lash out in public or to the hiring manager. It won’t help the situation, but it will hurt your professional reputation.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

01 Jul 2019

Building Your Portfolio in Data & AI

Photo by [Joanna Kosinska](https://unsplash.com/@joannakosinska?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/photos?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

This blog-post talks about how you can build your DS/AI portfolio. Let’s first understand why a portfolio is important in the DS/AI field.

Besides the benefit of learning by making a portfolio, a portfolio is important as it can help get you employment.

For the purpose of this article, let’s define a portfolio as public evidence of your DS/AI skills.

People often forget that software engineers and data scientists also google their issues. If these same people have their problems solved by reading your public work, they might think better of you and reach out to you.

Working on Public data-sets

You can gain more DS/AI skills by working on prediction problems rather than getting stuck in an endless learning loop.

But you will not get a project to work on from day one of your learning. Still, there are platforms where you can apply and learn DS/AI.

UCI ML

https://archive.ics.uci.edu/ml/index.php

The UCI Machine Learning Repository is a collection of data-sets that are used by the machine learning community for the analysis of machine learning algorithms. The archive was created as an FTP archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data-sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited “papers” in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged.
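
Getting started with a UCI dataset usually takes one line of pandas; here is a sketch using the classic Iris file (the URL below was correct at the time of writing, though UCI occasionally reorganizes its archive):

```python
import pandas as pd

# Classic Iris dataset from the UCI repository (a CSV with no header row).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

iris = pd.read_csv(url, header=None, names=cols)
print(iris["species"].value_counts())
```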

Kaggle Datasets

https://www.kaggle.com/datasets

Kaggle is where many data scientists spend their nights and weekends. It’s a crowd-sourced platform to attract, nurture, train and challenge data scientists from all around the world to solve data science, machine learning and predictive analytics problems. It has over half a million active members from 190+ countries and receives close to 150K submissions per month. Started in Melbourne, Australia, Kaggle moved to Silicon Valley in 2011 and was ultimately acquired by Google in March 2017. Kaggle is the number one stop for data science enthusiasts all around the world, who compete for prizes and boost their Kaggle rankings. There are only a handful of Kaggle Grandmasters in the world to this date.

Did you know that most data scientists are largely theorists and rarely get a chance to practice before being employed in the real world? Kaggle solves this problem by giving data science enthusiasts a platform to interact and compete in solving real-life problems. The experience you get on Kaggle is invaluable in preparing you to understand what goes into finding feasible solutions for big data.

Data.Gov

https://www.data.gov/

This is the home of the U.S. Government’s open data. Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. Data.gov is managed and hosted by the U.S. General Services Administration, Technology Transformation Service. Data.gov is powered by two open source applications: CKAN and WordPress, and it is developed publicly on GitHub. Learn how you can contribute to Data.gov and these larger open source projects here.

Amazon Data-sets

https://registry.opendata.aws/

This source contains many datasets in different fields such as Public Transport, Ecological Resources, Satellite Images, etc. It also has a search box to help you find the dataset you are looking for, along with descriptions and usage examples for each dataset, which are very informative and easy to use.

The datasets are stored in Amazon Web Services (AWS) resources such as Amazon S3 — A highly scalable object storage service in the Cloud. If you are using AWS for machine learning experimentation and development, that will be handy as the transfer of the datasets will be very quick because it is local to the AWS network.

Google’s Datasets Search Engine

https://toolbox.google.com/datasetsearch

In late 2018, Google did what they do best and launched another great service. It is a toolbox that can search for datasets by name. Their aim is to unify tens of thousands of different repositories for datasets and make that data discoverable.

Microsoft Datasets

https://msropendata.com/

In July 2018, Microsoft along with the external research community announced the launch of “Microsoft Research Open Data”. It contains a data repository in the cloud dedicated to facilitating collaboration across the global research community. It offers a bunch of curated datasets that were used in published research studies.

FiveThirtyEight

https://fivethirtyeight.com/

FiveThirtyEight, sometimes rendered as 538, is a website that focuses on opinion poll analysis, politics, economics and sports blogging. The website, which takes its name from the number of electors in the United States electoral college, was founded on March 7, 2008, as a polling aggregation website with a blog created by analyst Nate Silver.

You can find the data and code behind some of the popular articles and graphics here. You can use it to check others’ work and to create stories and visualizations of your own.

Participating in Competitions

Participating in DS/AI competitions is one of the most frequent paths taken by data scientists. While it doesn’t expose you to all the challenges of real-world projects, it can help you build your exploratory, modelling & cross-validation skills. You can also learn from fellow competitors about their approaches once the competition is over.

Kaggle Competitions

https://www.kaggle.com/competitions

Kaggle runs a variety of different kinds of competitions, each featuring problems from different domains and at different difficulty levels. Before you start, navigate to the Competitions listing. It lists all of the currently active competitions.

If you click on a specific Competition in the listing, you will go to the Competition’s homepage.

DataHack by AnalyticsVidhya

https://datahack.analyticsvidhya.com/

AnalyticsVidhya’s DataHack is also a platform where you can compete with the best in the world on real-life data science problems and learn by working on them. You can showcase your expertise and get hired in top firms, and if you finish at the top of a competition, you can also win lucrative prizes.

Machine Hack by AIM

https://www.machinehack.com/

COMPETE. CODE. COLLABORATE.

MachineHack is an online platform for Machine Learning competitions. It hosts tough business problems that can now be solved using Machine Learning & Data Science techniques, and it helps companies discover, evaluate and hire talented data scientists.

Just like Kaggle & DataHack, you can enrol in competitions here and help the host solve their business problem. In return, you get near real-world project experience, and you can learn from fellow competitors once the competition is over.

Publishing on GitHub

GitHub is a powerful platform for software development, but at its heart, it’s about empowering people like you by helping you learn from other developers, build the software that matters to you, and propel yourself to the next stage of your life as a software developer.

Understand GitHub

https://github.com/

GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere.

In order to work on GitHub, you need to learn essentials like repositories, branches, commits, and Pull Requests. You’ll create your own Hello World repository and learn GitHub’s Pull Request workflow, a popular way to create and review code.

Publish on GitHub

https://pages.github.com/

GitHub Pages are public webpages hosted and easily published through GitHub. The quickest way to get up and running is by using the Jekyll Theme Chooser to load a pre-made theme. You can then modify your GitHub Pages’ content and style remotely via the web or locally on your computer.

Writing a Blog

Photo by [Anete Lūsiņa](https://unsplash.com/@anete_lusina?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/blogging?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Writing blogs is an effective way to showcase your expertise and skills. You can write what you have learnt recently, any interesting problem you have solved or worked on any project.

Writing an engaging blog-post is an art in itself; here are a few tips for writing and promoting your blog posts.

Take notes for ideas

Start by writing down ideas as they occur to you. Make it a habit and keep doing it consistently by installing a note-taking app (like Keep, Evernote, etc.) on your mobile device.

Ideas occur to us all the time. You need a way to capture them when they do so that you can turn them into a great blog post in the future.

Build a simple outline

It is an essential step to develop an easy-to-follow outline before you sit down to write a blog post.

Once you’ve picked a topic to write about, from the list of ideas that you’ve written down, create an outline. The outline contains a heading, introduction, major points you want to write about and conclusion.

To get the juices flowing, you should actually write the introduction and the conclusion first, then add a list of things that you’ll cover in the body.

Start with a story

Entertainment is the biggest factor in engaging your audience. If you’re just about to start a blog, keep this in mind.

Stories draw people in and help clear up doubts. You are able to set a scene that people can relate to.

Become a memorable writer by integrating stories into your blog posts. It doesn’t have to be your own story, you can tell interesting stories about others.

Solve common problems

Consistent writing is one of the easiest ways to become a better writer. The question is, what should you write about? As a beginner, write blog posts that answer questions.

Look for the problems that are common in your field, what most of the people are struggling with. Research about that topic, try to explore the problem and its possible solution.

Learn & Share

When I write a blog post, I read a lot about the subject. On the web and in real life, there are too many questions with too few answers.

Many a time, you will end up learning something yourself in the attempt to write a post on a certain topic.

Read other great writers

The truth is that if you don’t read great writers, you won’t really know how to write well, and the successful blog you dream of will elude you.

I’ve learned that I get a better education from studying authors’ best work than I do from waiting for a piece of advice from them.

Mentor Others

As part of being a successful and well-rounded data scientist, giving back can be a rewarding and beneficial aspect. Becoming a mentor, or mentoring those who want to follow in your steps of being a data scientist can sharpen your expertise and credentials.

Build a Personal Brand

Building a brand is about giving yourself more opportunities to help and connect with people in your industry. And one of the best ways to build a brand is through blogging.

A blog is a hub for your advice. It also has the added benefit of helping you rank on search engines.

I hope that reading this inspires at least a few of you who want to become a data scientist and want to get better day by day by following the above-mentioned approach.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

06 May 2019

Utilizing the Resources

Photo by [Markus Spiske](https://unsplash.com/photos/Q0mDOn9gWk8?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/resources?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

After getting to know the DS/AI landscape and its building blocks in previous posts, you may ask: which resources should I refer to, and how do I know if a book or course is worth spending time and/or money on?

This is not an exhaustive list by any means, but it is good enough to keep as your reference. You can build your own list of references once you get more awareness of the field.

This post is my attempt to make your task easier. I am listing down major quality resources (mostly free) here and also going to provide you with my view of these resources, which will help you to make an informed decision.

You need not go through each and every resource mentioned here. I would suggest you build your foundation first using one course or book and keep the other resources for reference.

Books to refer

Machine Learning with R

This is an excellent book for the R starter who wants to apply ML to any kind of project. All the main ML models are presented, as well as different performance metrics, bagging, pruning, tuning, ensembling etc. It is easy to scan through, with many tips and fully-solved textbook problems. Certainly a very good starting point if you plan to compete on Kaggle. If you already master both R and ML, this book is obviously not for you.

[embed]https://www.amazon.com/Machine-Learning-R-Brett-Lantz-ebook/dp/B00G9581JM[/embed]

Python Machine Learning

This is a fantastic introductory book in machine learning with python. It provides enough background about the theory of each (covered) technique followed by its python code. One nice thing about the book is that it starts implementing Neural Networks from scratch, providing the reader with the chance of truly understanding the key underlying techniques such as back-propagation. Even further, the book presents an efficient (and professional) way of coding in python, the key to data science.


ISLR

The book explains the concepts of Statistical Learning from the very beginning. The core ideas, such as the bias-variance trade-off, are deeply discussed and revisited across many problems. The included R examples are particularly helpful for beginners learning R. The book also provides brief but concise descriptions of function parameters for many related R packages. Compared to The Elements of Statistical Learning, it is easier for the reader to understand, and it does a wonderful job of breaking down complex concepts. If one wishes to learn more about a particular topic, I’d recommend The Elements of Statistical Learning; these two pair nicely together.

[embed]https://www.statlearning.com/[/embed]

Deep Learning

This is the book to read on deep learning. Written by luminaries in the field — if you’ve read any papers on deep learning, you must have heard about Goodfellow and Bengio before — and cutting through much of the BS surrounding the topic: like ‘big data’ before it, ‘deep learning’ is not something new and is not deserving of a special name. Networks with more hidden layers to detect higher-order features, networks of different types chained together in order to play to their strengths, graphs of networks to represent a probabilistic model.

This is a theoretical book, but it can be read in tandem with Hands-On Machine Learning with Scikit-Learn and TensorFlow, almost chapter-for-chapter. The Scikit-Learn and Tensorflow example code, while only moderately interesting on its own, helps to clarify the purpose of many of the topics in the Goodfellow book.

[embed]https://www.deeplearningbook.org/[/embed]

Hands-On Machine Learning with Scikit-Learn and TensorFlow

This book provides a great introduction to machine learning for both developers and non-developers. The authors suggest working through it even if you don’t understand all the math details. Highlights of this book are:

  • Extracting field-expert knowledge is very important: you should know which model will serve best for a given solution. Luckily, a lot of models are already available from other scientists.
  • Training data is the most important part; the more you have, the better. So accumulate as much data as you can, preferably categorized. You may not yet know how you will use the accumulated data, but you will need it.
  • Labelling training data is very important too: to train a neural network you need at least thousands of labelled samples, and the more the better.
  • Machine learning algorithms and neural networks have been common for years, but the latest breakthroughs are possible because of new optimization techniques and new autoencoders (which may help generate training data artificially), allowing training to run faster and with less data.
  • Machine learning is still a time- and resource-consuming process. To train a model you need to know how to tweak parameters and how to use the training approaches that fit the particular model.

The book demonstrates (including the code) different approaches using Scikit-Learn python package and also the TensorFlow.


Data Science for Business

This is probably the most practical book to read if you are looking for an overview of data science. You come away knowing when terms like k-means and ROC curves should be used, and you have some context when you start digging deeper into how some of these algorithms are implemented. You will find it pitched at the right level: there is just enough math to explain the fundamental concepts and make them stick in your head.

This isn’t a book about implementing a bunch of algorithms. That gives it the advantage of being something you can recommend to an intelligent manager or an interested developer, and both can get a lot out of it. If they are interested in the next level of learning, there are plenty of pointers. You will also find the chapter on presenting results through ROC curves, lift curves, etc. pretty interesting. It would be nice if the book were more hands-on, but you can go to Kaggle and browse current and past competitions to apply what you learn here.

https://www.amazon.com/dp/1449361323/

Courses to attend

Machine Learning

https://www.coursera.org/learn/machine-learning

Machine Learning is one of the first programming MOOCs Coursera put online, created by Coursera co-founder and Stanford Professor Andrew Ng. This course assumes that you have basic programming skills and some understanding of Linear Algebra; knowledge of Statistics & Probability is not required, though.

Andrew Ng does a good job explaining dense material and slides. The course gives you a lot of structure and direction for each homework, so it is generally pretty clear what you are supposed to do and how you are supposed to do it.

Deep Learning

When you are rather new to the topic, you can learn a lot by doing the deeplearning.ai specialization. First and foremost, you learn the basic concepts of neural networks: what a forward pass in a simple sequential model looks like, what backpropagation is, and so on. I found this set of courses a very time-effective way to learn the basics, worth more than all the tutorials, blog posts and talks I went through beforehand.

Doing this specialization is probably more than the first step into DL. I would say, each course is a single step in the right direction, so you end up with five steps in total. I think it builds a fundamental understanding of the field. But going further, you have to practice a lot and eventually it might be useful also to read more about the methodological background of DL variants. But doing the course work gets you started in a structured manner — which is worth a lot, especially in a field with so much buzz around it.

[embed]https://www.coursera.org/specializations/deep-learning[/embed]

Fast AI

If your goal is to be able to learn about deep learning and apply what you’ve learned, the fast.ai course is a better bet. If you have the time, interleaving the deeplearning.ai and fast.ai courses is ideal — you get the practical experience, applicability, and audience interaction of fast.ai, along with the organised material and theoretical explanations of deeplearning.ai.

[embed]https://www.fast.ai/[/embed]

Kaggle Learn

Practical data skills you can apply immediately: that’s what you’ll learn in these free micro-courses. They’re the fastest (and most fun) way to become a data scientist or improve your current skills.

[embed]https://www.kaggle.com/learn[/embed]

Blogs to follow

KD Nuggets

KDnuggets is a leading site on AI, Analytics, Big Data, Data Mining, Data Science, and Machine Learning and is edited by Gregory Piatetsky-Shapiro and Matthew Mayo. KDnuggets was founded in February of 1997. Before that, Gregory maintained an earlier version of this site, called Knowledge Discovery Mine, at GTE Labs (1994 to 1997).

[embed]https://www.kdnuggets.com/[/embed]

Analytics Vidhya

Analytics Vidhya provides a community-based knowledge portal for Analytics and Data Science professionals. The aim of the platform is to become a complete portal serving all knowledge and career needs of Data Science Professionals.

[embed]https://www.analyticsvidhya.com/[/embed]

Towards Data Science

TDS joined Medium’s vibrant community in October 2016. In the beginning, their goal was simply to gather good posts and distribute them to a broader audience. Just a few months later, they were pleased to see that they had a very fast-growing audience and many new contributors.

Today they are working with more than 10 Editorial Associates to prepare the most exciting content for their audience. They provide customized feedback to contributors using Medium’s private notes. This allows them to promote the latest articles across social media without the added complexity that contributors might encounter using another platform.

[embed]https://towardsdatascience.com/[/embed]

Podcasts to listen

Data Hack

This is Analytics Vidhya’s exclusive podcast series which will feature top leaders and practitioners in the data science and machine learning industry.

So in every episode of DataHack Radio, they bring you discussions with one such thought leader in the industry. They have discussions about their journey, their learnings and plenty of other data science-related things.

[embed]https://soundcloud.com/datahack-radio[/embed]

Super Data Science

Kirill Eremenko is a Data Science coach and lifestyle entrepreneur. The goal of the Super Data Science podcast is to bring you the most inspiring Data Scientists and Analysts from around the World to help you build your successful career in Data Science.

Data is growing exponentially and so are salaries of those who work in analytics. This podcast can help you learn how to skyrocket your analytics career. Big Data, visualization, predictive modelling, forecasting, analysis, business processes, statistics, R, Python, SQL programming, tableau, machine learning, Hadoop, databases, data science MBAs, and all the analytics tools and skills that will help you better understand how to crush it in Data Science.

[embed]https://soundcloud.com/superdatascience[/embed]

The O’Reilly Data Show Podcast

Known as the father of all other data shows, “the O’Reilly Data Show” features Ben Lorica, O’Reilly Media’s chief data scientist. Lorica conducts interviews with other experts about big data and data science current affairs. While it does get technical and may not be the best place for a beginner to start, it provides interesting insights into the future of the data science industry.

[embed]https://soundcloud.com/oreilly-radar/sets/the-oreilly-data-show-podcast[/embed]

YouTube Channels

DeepLearning.TV

DeepLearning.TV is all about Deep Learning, the field of study that teaches machines to perceive the world. Starting with a series that simplifies Deep Learning, the channel features topics such as How To’s, reviews of software libraries and applications, and interviews with key individuals in the field. Through a series of concept videos showcasing the intuition behind every Deep Learning method, they show you that Deep Learning is actually simpler than you think. Their goal is to improve your understanding of the topic so that you can better utilize Deep Learning in your own projects. They provide a window into the cutting edge of Deep Learning and bring you up to speed on what’s currently happening in the field.


Sirajology

Your host here is Siraj. He is on a warpath to inspire and educate developers to build Artificial Intelligence. Games, music, chatbots, art: he teaches you how to make it all yourself. This is the fastest-growing AI community in the world. Their mission: solve AI and use it to benefit humanity.

He is an AI Researcher, his latest paper is here — https://drive.google.com/file/d/0BwUv84lNDk72Q1gzaXgwR2U3U2NWVlZSOFk4amZIRmV1QXI0/view

He is also a Data Scientist, AI Educator, Rapper, Author, and Director of the School of AI (www.theschool.ai)


Data School

Are you trying to learn data science so that you can get your first data science job? You’re probably confused about what you’re “supposed” to learn, and then you have the hardest time actually finding lessons you can understand! Data School focuses you on the topics you need to master first, and offers in-depth tutorials that you can understand regardless of your educational background.

Your host here is Kevin Markham, and he is the founder of Data School. He has taught data science using the Python programming language to hundreds of students in the classroom, and hundreds of thousands of students (like you) online. Finding the right teacher was so important to his data science education, and so he sincerely hopes that he can be the right data science teacher for you.

https://www.youtube.com/user/dataschool

Caltech Machine Learning

This is an introductory course by Caltech Professor Yaser Abu-Mostafa on machine learning that covers the basic theory, algorithms, and applications. Machine learning (ML) enables computational systems to adaptively improve their performance with experience accumulated from the observed data. ML techniques are widely applied in engineering, science, finance, and commerce to build systems for which we do not have a full mathematical specification (and that covers a lot of systems). The course balances theory and practice and covers the mathematical as well as the heuristic aspects.

[embed]https://www.youtube.com/playlist?list=PLD63A284B7615313A[/embed]

GitHub Repos

Awesome Data Science

This repo answers the question, “What is Data Science and what should you study to learn Data Science?” It is an awesome Data Science repository to learn from and apply to real-world problems.

As the aggregator says, “Our favourite data scientist is Clare Corthell. She is an expert in data-related systems and a hacker and has been working on a company as a data scientist. Clare’s blog. This website helps you to understand the exact way to study as a professional data scientist.”

“Secondly, Our favourite programming language is Python nowadays for Data Science. Python’s — Pandas library has full functionality for collecting and analyzing data. We use Anaconda to play with data and to create applications.”


Essential Cheat Sheets for Machine Learning and Deep Learning Engineers

Machine learning is complex. For newbies, starting to learn machine learning can be painful if they don’t have the right resources to learn from. Most machine learning libraries are difficult to understand, and the learning curve can be a bit frustrating. Kailash Ahirwar has created a repository on GitHub (cheatsheets-ai) containing cheatsheets for different machine learning frameworks, gathered from different sources. Have a look at the GitHub repository, and contribute cheat sheets if you have any.

[embed]https://github.com/kailashahirwar/cheatsheets-ai[/embed]

HackerMath for Machine Learning

Math literacy, including proficiency in Linear Algebra and Statistics, is a must for anyone pursuing a career in data science. The goal of this workshop is to introduce some key concepts from these domains that get used repeatedly in data science applications.

As outlined by Amit Kapoor, “Our approach is what we call the ‘Hacker’s way’. Instead of going back to formulae and proofs, we teach the concepts by writing code. And in practical applications. Concepts don’t remain sticky if the usage is never taught.”

The focus here is on depth rather than breadth. Three areas are chosen — Hypothesis Testing, Supervised Learning and Unsupervised Learning. They are covered to sufficient depth — 50% of the time on the concepts and 50% of the time spent coding them.

[embed]https://github.com/amitkaps/hackermath[/embed]


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

23 Apr 2019

Becoming Data-Driven with Data Catalog

Photo by [Clark Street Mercantile](https://unsplash.com/photos/S042liZk3A8?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/@mercantile?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

In this post, you will learn about the data catalog: what a data catalog is, why you need one, and how it can help your organization become data-driven.


Let’s start with some context first:

The data landscape is growing and changing rapidly.

Data explosion in volume & variety

As we all know, our data landscapes are getting more complex day by day; the volume and variety of data coming from inside and outside your organization are increasing exponentially.

Self-service analytics

Your business users are requesting an easy way to consume data for their business needs.

Risk of non-compliance

At the same time, there is the concern of data security and safety, and a need to stick to data compliance standards.

Cloud migrations

Many organizations are moving to cloud-based infrastructure, which drives many applications to be deployed as services, which in turn leads to more fragmentation and spread of data.

How to turn your dark data into a valuable asset?

So, to bring value to your operational and analytical systems and to your data consumers, you need to:

Classify your data

You need a way to categorize and classify all your data automatically, at scale, without any tedious manual work.

Know more about your data

You need to develop a good understanding of your data and its relationships; basically, you need to get to know your data as you would know the people within your social network.

Share your data knowledge

You should be able to share this knowledge, in a compliant manner, with everyone in your organization who needs it. To do all of this effectively, you need an intelligent data cataloging system.

How can a data catalog help your organization?

So how does the data catalog help your organization?

Self-service discovery for Analytics

A catalog promotes self-service by helping users find the right data required for their analysis.

Data Governance

For data governance, a catalog can provide the ground truth: it reflects the presence, use and quality of the physical data in your landscape in a way that’s understandable to your business users.

IT Impact Analysis

For IT operations, a catalog can show all data dependencies and help IT users understand the impact of any changes they are planning.

What features does an Enterprise Data Catalog provide?

So now let’s talk about the typical features of an enterprise data catalog. An enterprise data catalog is built from the ground up for scale, to support even the most complex data environments. It has built-in machine learning to automate and simplify the collection and classification of metadata, along with some unique capabilities:

Search & Discovery

Most of the tools offer an intuitive interface which makes it easy for non-technical users to search, discover and explore data assets across the enterprise.

Broad Connectivity

These tools offer broad, universal connectivity to all the systems/applications, BI tools and databases across your environment.

Open APIs

These tools also have open REST APIs, which make it easy for users to access the catalog content from any application of their choice. As you can tell, this offers a comprehensive metadata solution for your enterprise.

What is the need for a Data Catalog?

Now let’s discuss how it helps various users in your organization:

Data Governance Office

If you’re part of the data governance office, with a data catalog you can validate and enforce data governance policies and definitions.

Data Consumer

As a data consumer, you can discover, understand and trust the data required for your analysis.

Data Steward

As a data steward, you can manage metadata for key enterprise data assets and manage data quality throughout the data life-cycle.

Data Owner

As a data owner, you can ensure that the data managed within applications and processes delivers value to the business.

Data Architect

As a data architect, you can make sure IT enables the business to discover data assets, verify data quality and trace lineage. As you can tell, a data catalog can benefit both business and IT users.

References:

[embed]https://www.dataversity.net/build-a-data-driven-culture-with-a-data-catalog/[/embed]


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also feel free to listen to me on SoundCloud.

14 Apr 2019

Working on Building Blocks

Every job has a few core skills. To perform a job, you need to be aware of its core concepts, understand the end-to-end process, and learn how to use the related tools. Data science is no different: it has its own core concepts, processes and tools.

This post covers the core concepts you need to learn, end-to-end process you need to be aware of & important tools you need to master to work as a data scientist.

Please note that this post only outlines the concepts, processes and tools used by data scientists. I will publish the resources (mostly free) for these topics in an upcoming post.

Still, if you want to build a quick understanding, you can refer to the following post:

[embed]https://medium.com/data-deft/data-science-the-complete-reference-series-3fb35077fc5a[/embed]

Concepts to learn

Mathematics

Data science contains math — no avoiding that! This section covers the basic math learners need in order to be successful in almost any data science project/problem. So let’s start:

Multivariate Calculus

Calculus is a set of tools for analyzing the relationship between functions and their inputs. In Multivariate Calculus, we can take a function with multiple inputs and determine the influence of each of them separately.

In data science, we try to find the inputs which enable a function to best match the data. The slope, or gradient, describes the rate of change of the output with respect to an input. Determining the influence of each input on the output is also one of the critical tasks. All this requires a solid understanding of Multivariate Calculus.
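To make this concrete, here is a minimal gradient-descent sketch on a toy function; the function, starting point and learning rate are arbitrary choices for illustration:

```python
import numpy as np

# f(w) = w0**2 + w1**2 has the gradient [2*w0, 2*w1];
# stepping against the gradient walks toward the minimum at (0, 0).
def grad(w):
    return np.array([2 * w[0], 2 * w[1]])

w = np.array([3.0, -4.0])   # arbitrary starting point
lr = 0.1                    # learning rate (step size)
for _ in range(100):
    w = w - lr * grad(w)

print(w)  # very close to [0, 0]
```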

Linear Algebra

The word algebra comes from the Arabic word “al-jabr”, which means “the reunion of broken parts”. It is the collection of methods for deriving unknowns from knowns in mathematics. Linear Algebra is the branch that deals with linear equations and linear functions, which are represented through matrices and vectors. In simpler words, it helps us understand geometric objects such as planes in higher dimensions and perform mathematical operations on them. By definition, algebra deals primarily with scalars (one-dimensional entities), but Linear Algebra has vectors and matrices (entities with two or more dimensional components) to deal with linear equations and functions.

Linear Algebra is central to almost all areas of mathematics like geometry and functional analysis. Its concepts are a crucial prerequisite for understanding the theory behind Data Science. You don’t need to understand Linear Algebra before getting started in Data Science, but at some point, you may want to gain a better understanding of how the different algorithms really work under the hood. So if you really want to be a professional in this field, you will have to master the parts of Linear Algebra that are important for Data Science.
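As a small taste of what this looks like in practice, here is a sketch using NumPy to represent a system of linear equations as a matrix and solve it; the numbers are made up for illustration:

```python
import numpy as np

# The system 2x + y = 3, x + 3y = 5 written as A @ x = b
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)  # derive the unknowns from the knowns
print(x)                   # [0.8 1.4]
print(A @ x)               # reproduces b
```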

Statistics & Probability

Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data. Probability is the chance that something will happen — how likely it is that some event will happen.

Statistics helps you understand your data and is an initial & very important step of Data Science. This is due to the fact that Data Science is all about making predictions, and you can’t predict if you can’t understand the patterns in existing data.

Uncertainty and randomness occur in many aspects of our daily life, and having a good knowledge of probability helps us make sense of these uncertainties. Learning about probability helps us make informed judgments on what is likely to happen, based on a pattern of data collected previously or an estimate.

Data science often uses statistical inference to predict or analyze trends from data, while statistical inference uses probability distributions of data. Hence, knowing probability & statistics and their applications is important for working effectively on data science problems.
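For instance, a quick sketch with NumPy and SciPy: simulate some measurements, summarize them, and run a simple hypothesis test. The distribution parameters here are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=2.0, size=200)  # simulated measurements

print(sample.mean(), sample.std())  # descriptive statistics

# One-sample t-test: is the true mean plausibly 5.0?
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(p_value)  # a large p-value means no evidence the mean differs from 5
```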

Programming

To execute the DS/AI pipeline, you need to learn algorithm design as well as fundamental programming concepts such as data selection, iteration and functional decomposition, data abstraction and organisation. In addition to this, you need to learn how to perform simple data visualizations using programming and embed your learning using problem-based assignments.
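For example, a simple visualization takes only a few lines of Python with matplotlib; the numbers below are invented:

```python
import matplotlib.pyplot as plt

# Monthly revenue for a hypothetical product line
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [12, 15, 14, 18, 21, 19]

plt.plot(months, revenue, marker="o")
plt.title("Monthly revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (k$)")
plt.show()
```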

Machine Learning Algorithms

Machine learning algorithms can be divided into 3 broad categories —

  • Supervised learning,
  • Unsupervised learning
  • Reinforcement learning.

Supervised learning is useful in cases where a property (label) is available for a certain dataset (training set) but is missing and needs to be predicted for other instances. Unsupervised learning is useful in cases where the challenge is to discover implicit relationships in a given unlabeled dataset (items are not pre-assigned). Reinforcement learning falls between these 2 extremes — there is some form of feedback available for each predictive step or action, but no precise label or error message.

The intrinsic details of the various algorithms are not in the scope of this series; you can refer to the resources mentioned in the next post to learn them.

Supervised learning can be further divided into Regression (Linear, Non-linear etc) & Classification (Logistic Regression, Decision Trees, Naïve Bayes etc) algorithms. Some algorithms, e.g. Random Forests and Support Vector Machines, can be used for both regression and classification.

Unsupervised learning can be further divided into Clustering, Anomaly Detection and Association Mining.

Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
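To see the supervised/unsupervised distinction in code, here is a minimal scikit-learn sketch on the built-in Iris data; the model choices are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: labels are available, so we learn to predict them
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))

# Unsupervised learning: ignore the labels and discover structure instead
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [(km.labels_ == k).sum() for k in range(3)])
```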

Deep Learning Frameworks

Deep learning is a more advanced form of ML that solves specific problems where the data is unstructured or huge, or both. Neural Nets, CNNs, RNNs & LSTMs, and GANs are the architectures one needs to be aware of.


Domain Knowledge

A lack of domain knowledge, while perfectly understandable, can be a major barrier for data scientists. For one thing, it’s difficult to come up with project ideas in a domain that you don’t know much about. It can also be difficult to determine the type of data that may be helpful for a project — if you want to build a model to predict an outcome, you need to know what types of variables might be related to this outcome so you can make sure to gather the right data.

Knowing the domain is useful not only for figuring out projects and how to approach them, but also for having rules of thumb for sanity checks on the data. Knowing how data is captured (is it hand-entered? Is it from machines that can give false readings for any number of reasons?) can help a data scientist with data cleaning and keep them from going too far down the wrong path. It can also inform what true outliers are and which values might just be due to measurement error.

Often the most challenging part of building a machine learning model is feature engineering. Understanding variables and how they relate to an outcome is extremely important for this. Knowing the domain can help direct the data exploration and greatly speed (and enhance) the feature engineering process.

Once features are generated, knowing which relationships between variables are plausible helps with basic sanity checks. Being able to glance at the output of a model and determine whether it makes sense goes a long way for quality assurance of any analytical work.

Finally, one of the biggest reasons a strong understanding of the data is important is because you have to interpret the results of analyses and modeling work.

Knowing what results are important and which are trivial is important for the presentation and communication of results. It’s also important to know what results are actionable.

Process to follow

Problem Definition

The first thing you have to do before you solve a problem is to define exactly what it is. You need to be able to translate data questions into something actionable.

You’ll often get ambiguous inputs from the people who have problems. You’ll have to develop the intuition to turn scarce inputs into actionable outputs, and to ask the questions that nobody else is asking.

Data Collection

Once you’ve defined the problem, you’ll need data to give you the insights needed to turn the problem around with a solution. This part of the process involves thinking through what data you’ll need and finding ways to get that data, whether it’s querying internal databases, or purchasing external data-sets.

Data Understanding

The difficulty here isn’t coming up with ideas to test, it’s coming up with ideas that are likely to turn into insights. You’ll have a fixed deadline for your data science project, so you’ll have to prioritize your questions.

Suppose, for example, sales are down for a particular customer group. You’ll have to look at some of the most interesting patterns that can help explain why. You might notice that they don’t tend to be very active on social media, with few of them having Twitter or Facebook accounts. You might also notice that most of them are older than your general audience. From that, you can begin to trace patterns you can analyze more deeply.

Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. If feature engineering is done correctly, it increases the predictive power of machine learning algorithms by creating features from raw data that help facilitate the machine learning process. Feature Engineering is in fact an art.
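A tiny pandas sketch of what this can look like; the table and derived features are invented for illustration:

```python
import pandas as pd

# Toy customer table; the raw columns alone are not very predictive
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2019-01-05", "2019-02-11", "2019-03-20"]),
    "last_purchase": pd.to_datetime(["2019-03-01", "2019-03-15", "2019-04-02"]),
    "total_spend": [120.0, 40.0, 310.0],
    "n_orders": [4, 1, 9],
})

# Domain knowledge suggests derived features that carry more signal
df["days_active"] = (df["last_purchase"] - df["signup_date"]).dt.days
df["avg_order_value"] = df["total_spend"] / df["n_orders"]
print(df)
```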

Modelling

Depending on the type of question that you’re trying to answer, there are many modelling algorithms available. You run the selected algorithm/s on the training data to build the models.

Validation

Validation is the step where you evaluate the trained model on validation data. You use a series of competing machine-learning algorithms, along with their associated tuning parameters, geared toward answering the question of interest with the current data.
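In scikit-learn, k-fold cross-validation is one common way to do this; a minimal sketch on a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```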

Tuning

Tuning an algorithm or machine learning technique can simply be thought of as the process of optimizing the parameters that impact the model, in order to enable the algorithm to perform at its best.
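Grid search over cross-validation is one of the simplest tuning strategies; here is a sketch with scikit-learn’s GridSearchCV, where the grid values are arbitrary examples:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Try every parameter combination with cross-validation and keep the best
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```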

Deployment

After you have a set of models that perform well, you can operationalize them for other applications to consume. Depending on the business requirements, predictions are made either in real-time or on a batch basis. To deploy models, you expose them with an open API interface. The interface enables the model to be easily consumed from various applications.
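One common pattern is wrapping the model in a small web service. A minimal sketch with Flask, assuming a model was saved beforehand to a hypothetical "model.pkl" with joblib:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical path to a trained model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```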

Tools to master

The list mentioned here is not exhaustive, it depends more on what kind of problem you are solving and in what tech stack you are working.

SQL

Structured Query Language (SQL) is a standard computer language for relational database management and data manipulation. SQL is used to query, insert, update and modify data. Most relational databases support SQL.

As data collection has increased exponentially, so has the need for people skilled at using and interacting with data; to be able to think critically, and provide insights to make better decisions and optimize their businesses. The skills necessary to be a good data scientist include being able to retrieve and work with data and to do that you need to be well versed in SQL, the standard language for communicating with database systems.
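You can practice SQL without any server at all using Python’s built-in sqlite3 module; a minimal sketch with an invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 200.0)])

# Aggregate query: total sales per region, highest first
query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY 2 DESC"
for row in conn.execute(query):
    print(row)
```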


R

R is a programming language and software environment for statistical analysis, graphics representation and reporting. In the world of data science, R is an increasingly popular language for a reason. It was built with statistical manipulation in mind, and there’s an incredible ecosystem of packages for R that let you do amazing things — particularly in data visualization.


Python

Python is a general-purpose interpreted, interactive, object-oriented, high-level programming language. Python is no doubt the best-suited language for a Data Scientist. It is a free, flexible and powerful open-source language. Python cuts development time in half with its simple, easy-to-read syntax. With Python, you can perform data manipulation, analysis and visualization, and it provides powerful libraries for machine learning applications and other scientific computations.


Tensorflow

Currently, the most famous deep learning library in the world is Google’s TensorFlow. Google uses machine learning in all of its products to improve search, translation, image captioning and recommendations.

TensorFlow is built to be accessible to everyone. The library incorporates different APIs to build deep learning architectures like CNNs or RNNs at scale. TensorFlow is based on graph computation; it allows the developer to visualize the construction of the neural network with TensorBoard, which is helpful for debugging. Finally, TensorFlow is built to be deployed at scale and runs on both CPU and GPU.
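A minimal taste of the library, assuming TensorFlow 2.x, where operations run eagerly by default:

```python
import tensorflow as tf

# Two constant tensors and a matrix multiplication
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0], [2.0]])
print(tf.matmul(a, b))  # a 2x1 tensor: [[5.], [11.]]
```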


Keras

Keras is a high-level neural networks API, capable of running on top of Tensorflow, Theano, and CNTK. It enables fast experimentation through a high level, user-friendly, modular and extensible API.

Keras allows for easy and fast prototyping (through user-friendliness, modularity, and extensibility). It supports both convolutional networks and recurrent networks, as well as combinations of the two. It runs seamlessly on CPU and GPU.
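For illustration, here is a minimal Keras sketch of a small fully-connected network; the layer sizes and input shape are arbitrary:

```python
from tensorflow import keras

# A small fully-connected network for 10-class classification
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```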



Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

05 Apr 2019

Navigating the Landscape

Photo by [Rob Bates](https://unsplash.com/photos/0eLg8OTuCw0?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/landscape?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

DS/AI is a complex and evolving field. The first challenge a DS/AI aspirant faces is understanding the landscape and how to navigate through it. Consider this: if you are travelling to a new city and you don’t have a map, you will have trouble navigating the city and will need to ask a lot of random people along the way, without knowing how much they know about the place. Similarly, all newcomers to data science have this trouble, and there are two ways to deal with it: arrange the map (or a guide), or travel yourself and learn from experience.

This post intends to serve as a map of the DS/AI field.

You might have heard terms like data science, machine learning, deep learning and artificial intelligence, but might not be fully aware of what they mean, when to use which, and how these topics are interconnected. After going through this post, you should be able to understand what is where in the DS/AI field.

Multi-disciplinary field

DS/AI is a multidisciplinary field with sub-fields of study in Math/Statistics, CS/IT & Business/Domain knowledge.

Math/Statistics is required to understand the data and the relationships between data elements. CS/IT skills are required to process the data to generate insights. And business or domain knowledge is required to apply the above two skills in the context of a business problem.

Computer Science/IT

Programming is an essential skill for becoming a data scientist, but one need not be a hard-core programmer to learn DS/AI. Having familiarity with basic programming concepts will ease the process of learning data science tools like Python/R. These basics should take a candidate a long way on the journey to a DS/AI career, since the work is all about writing efficient code to analyse big data, not about being a master programmer. Individuals should learn the basics of programming in Python/R (or any relevant language) before they begin to work on DS/AI problems/projects.

Maths & Statistics

Data science teams have people from diverse backgrounds like chemical engineering, physics, economics, statistics, mathematics, operations research, computer science, etc. You will find many data scientists with a bachelor’s degree in statistics and machine learning but it is not a requirement to learn DS/AI. However, having familiarity with the basic concepts of Math and Statistics like Linear Algebra, Calculus, Probability, etc. is important to learn DS/AI.

Domain/Business Knowledge

Subsequently, the business knowledge that a data scientist needs is related to the domain of the project/analysis. For instance, if the data scientist is working for a credit card department in a bank, they will need to understand the specific business definitions, regulations, accounting policies & international standards, processes etc. This is the part that is more specific to the organization the data scientist is deployed in.

In my view, one thing to take care of while hiring data scientists is not to give huge preference to domain knowledge. This may severely limit the supply of data science talent to the organization. You have a better chance of getting more value from data science by looking for those who are strong in math & programming and able to convert business objectives into mathematical models. Based on my observation, this is a much more difficult skill to find or train, compared to domain knowledge.

Various Terminologies

As a DS/AI starter, you will come across many similar terminologies. The first thing you need to do is understand what each term means and where it fits in the bigger picture. Data Science, Business Intelligence, Data Mining, Machine Learning, Deep Learning, Artificial Intelligence: let’s look at the Wikipedia definition of each term and later see how they are interconnected.

Data Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, like data mining.

Business Intelligence

Business intelligence comprises the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current and predictive views of business operations.

Data Mining

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Machine Learning

Machine learning is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.

Deep Learning

Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Artificial Intelligence

Artificial intelligence, sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.

Interconnection

Data mining uses statistics and other programming languages to find hidden patterns in the data to explain a certain phenomenon. It helps in building a perception about the data using both math and programming.

Machine Learning deploys data mining techniques as well as other algorithms to develop models of what is happening behind some data to forecast future outcomes.

Artificial Intelligence uses models developed by Machine Learning and other algorithms to lead to intelligent behaviour. AI is very much programming based.

  • Data Mining demonstrates patterns
  • Machine Learning forecasts with models
  • Artificial Intelligence shapes behaviours

So you see that these terms are different but still inter-connected.

Data Science Roles

Before looking into the skill-set of a data scientist, let’s have a look at various roles required to work and deliver a data science project, after all, it’s a teamwork.

Every role has its own skills that are critical to data science projects at various stages.

Data Scientist

A data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning. She spends a lot of time in the process of collecting, cleaning, and munging data. Domain knowledge is also an integral part of the skill.

Machine Learning Engineer

Machine learning engineers are sophisticated programmers who develop machines and systems that can learn and apply knowledge without specific domain requirement.

Data Analyst

Data analysts translate numbers into plain English. Every business collects data, whether it’s sales figures, market research, logistics, or transportation costs. A data analyst’s job is to take that data and use it to help companies make better business decisions. There are many different types of data analysts in the field, including operations analysts, marketing analysts, financial analysts, etc.

Data Engineer

Data Engineers are responsible for the creation and maintenance of analytics infrastructure that enables almost every other function in the data world. They are responsible for the development, construction, maintenance and testing of architectures, such as databases and large-scale processing systems.

Data Architect

Data architects build complex computer database systems for companies, either for the general public or for individual companies. They work with a team that looks at the needs of the database, the data that is available and creates a blueprint for creating, testing and maintaining that data architecture.

Analytics Manager

The data science manager coordinates the different tasks that must be completed by their team for a DS/AI project. Tasks may include researching and creating effective methods to collect data, analyzing information, and recommending solutions to business.

Business Analyst

A data science business analyst converts the business problem statement into a DS/AI problem statement, i.e. what data needs to be analyzed to arrive at the insights. The data would then be reviewed with the technology team, and results would be delivered to the business team in the form of insights and data patterns. The business analyst should also be knowledgeable enough to apply various predictive modelling techniques and select the right model for generating insights for the problem at hand.

Quality Analyst

The job of a quality analyst includes checking the quality of the training data-set, preparing data-sets for testing, running statistics on human-labelled data-sets, evaluating precision and recall on the resulting ML model, reporting on unexpected patterns in outputs, and implementing the tools necessary to automate repetitive parts of the work. Experience in software testing with a data quality or DS/ML focus, an understanding of statistics, exposure to Data Science / Machine Learning techniques, and coding proficiency in Python are some of the skills required for the job.

To work on DS/AI projects in any of the above mentioned roles, one needs to have an understanding of the core concepts at a high level but depth is required in the specific area you would be working in.

Academia Vs Industry

Academia and Industry are different fields with different people and culture. People working in Academia for longer tenure may find it difficult to adjust to industry culture and vice versa.

There is also an “academic trap”: when your career trajectory is so specialized for academia that you’re unprepared for a job outside of it.

The academic trap happens in all areas of study, but for this post, we focus only on DS/AI students who want to leave academia for data science positions.

Further, companies are often hesitant to hire people coming straight from academia for various reasons like:

  • In academia, individuals prefer writing papers over internships and winning grants over learning programming languages, skipping the things that could help them in industry. The things that are important for academic hiring, such as papers, talks, and grants, are not as important in industry.
  • Working as a data scientist within a corporation requires an understanding of how the business world works, including how quickly deliverables need to be made, how to craft a good presentation, and how to word an email to make a request.
  • In academia, you are encouraged to find the most innovative and elegant solution. In industry, you are encouraged to spend as little time as possible to find an analytical solution that just fits the need.
  • Salary expectations for advanced-degree holders are higher than for someone with only an undergraduate degree. This can also push recruiters away, as industry works differently and its culture is simply different from the academic one. People coming from academia have to learn these lessons at their first job, which means a lot of risk for the hiring company.

Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

21 Mar 2019

Understanding the Big Picture

Photo by [Joshua Ness](https://unsplash.com/photos/Fd1YZE641t8?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/big-picture?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

When you start learning a new skill, the first thing to do is to look at the big picture and see where your would-be skills fit in the field. It gives you the context of what role you can play or are expected to play. When the skill and the field are evolving and overwhelmingly large, it is easy to get so engrossed in the details that you miss the purpose.

In my view, to understand the big picture, ask yourself ‘why’ more often and start with the end in your mind.

The following analogy is not specific to DS/AI; it applies in general.

Job — Interview — Network — Portfolio — Core skills — Resources — Awareness

As you can see above, to land the job you want, you need to crack the interview. To get the call for the interview, you need a network that can refer you for opportunities. To have a relevant & helpful network, you need to have impressive credentials or a portfolio. To have a solid portfolio, you need to work on the core skills of the job role that you want. To build core skills, you need to have the right resources. To gather the right resources, you need to have awareness of the field in which you want to build your career.

Now, reverse the order and assess at what stage you are and fill the gap in the coming months to build your career.

Based on the above approach, this post gives you a holistic view of what you need to know and do to build your road-map in DS/AI field.

In the upcoming sections, you will get to know these steps at a high level; their details are covered in subsequent posts of the blog-post series.

Each section represents an upcoming blog-post of this series.

Before learning DS/AI, it is very much required that you understand the overall landscape and where all the buzzwords like data science, machine learning, deep learning fit in. In this post, you will learn different terminologies, their meaning and how these are interconnected.

DS/AI is a vast field; no single person has all the required knowledge or does all the day-to-day tasks. Due to the variety of skills required in data science projects, specific roles are evolving so that individuals can contribute according to their abilities. In this post, you will get to know almost all the DS/AI roles, which will give you a better idea of which one suits you.

Working on building blocks

This is the core part of DS/AI; knowing the concepts, process & tools is the most important part of any job. Intrinsic details of the building blocks are not in the scope of this series. This post will make you aware of what is in the scope of the DS/AI field and which role needs what kind of skills.

After completing this section, you can relate the concepts, process & tools to the role you think you can fit in. Think of this post as the syllabus of the DS/AI field, with electives based on your role.

Utilizing the resources

There is no dearth of learning resources, but that has only confused starters in the data science field. I get many questions related to what books to read, what courses to enrol in, what blogs/portals to follow, and what data-sets to work on.

In this post, I will provide critical reviews of all the prominent books, courses, blogs, portals, data-sets & podcasts etc. Please note that this list is based on my exposure to the field, there may be additional and better resources but I think you will get enough exposure after going through this post to evaluate other resources comparatively.

Building your portfolio

Refining & honing the skills required for a job is one part; showcasing what you can offer is another. What if you have all the required skills, but nobody looking for that skill-set is aware that you do? Having a credible portfolio is an effective way to showcase your skill-set and talent.

After going through this post you will get to know what is needed to build an impressive portfolio even before entering the market looking for a relevant job.

Networking & landing the job

Say you have identified the best-suited role, honed your skill-set and built your portfolio as well. Your mission is still not accomplished unless you have like-minded professionals in your network who are aware of your capabilities and can refer you for the jobs that match your skills. Interview preparation is another task in itself.

This post will focus on the steps required to network & land a job in the DS/AI field: how to network with like-minded professionals, how to search for relevant job openings, and how to prepare for & crack DS/AI interviews.

Putting it all together

Before diving into the details, in the current post, you are looking at the big picture of what will be covered in this series and how that fits into your learning curve as a DS/AI starter/enthusiast.

After going through each step in detail, just like the current post, this post will revisit the high-level steps once again so that we can make sure that you have covered what you needed to.

Appendix

DS/AI is an evolving field due to ongoing technology disruption. While the core concepts remain the same, tools/technology/frameworks keep changing. Any literature published on DS/AI has to be reviewed at regular intervals.

Apart from the above, there are a few additional topics that I would like to cover in this section, like Beginners' FAQs and Data Scientist interviews.

In this section, you will get to know the questions frequently asked by DS/AI starters, what the current crop of data scientists did to start their careers, and how you can continue from here and stay future-ready in this ever-evolving field.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

30 Jan 2019

Data Science Self-Starter Handbook (Preview)

This is the pilot post of the blog post series 'DS/AI: Self-Starter Handbook'; it covers the context, table of contents & links to the upcoming posts of this series, topic-wise.

  • Why this series?
  • What does this series cover?
  • Who is this series for?

Table of Contents


Want to learn more? visit www.ankitrathi.com

It is probably not a secret that data science/artificial intelligence (DS/AI) has become one of the most exciting fields of this age. Although it may seem the buzzword of our time, it is certainly not just hype. This exciting field opens the way to new possibilities and is becoming indispensable to our daily lives.

Organizations, big or small, are heavily investing in DS/AI research and applications these days. And hence, it has become the hottest career. If you want to become a DS/AI practitioner, there is no better time than this.

Aspirants are taking different approaches to get into the field; some are fortunate enough to be put on projects as freshers, but most aspirants are building their capabilities by learning theory and applying it on public data-sets.

While there is no dearth of free & paid material, too much information has only confused the current crop of DS/AI aspirants.

Based on the questions that I am asked by them on a day-to-day basis, I can see how perplexed they are. Needless to say, taking advantage of the situation, many training institutes are minting money, bundling even irrelevant courses into data science ones.

Why this series?

From around the time the DS/AI field started picking up, every other day I get at least 8–10 messages from DS/AI starters & enthusiasts asking 'How can I get into the DS/AI field?'. Over time, I have refined my response based on the follow-up questions they ask, like:

  1. What is the difference between DS, ML, DL, AI, DM?
  2. What are the roles in DS/AI, who does what?
  3. What concepts, processes & tools they need to learn?
  4. Which books, courses, etc they need to refer to?
  5. How to build a DS/AI portfolio?
  6. How to write a resume for DS/AI?
  7. How to build a helpful network?
  8. How to search for the job?
  9. How to prepare for the interview?
  10. How to stay up to date in this still-evolving field?

You can notice that these questions are not conceptual ones and there is no dedicated material to address these roadblocks.

How about a book or course that gives you enough exposure to the DS/AI field that you can yourself analyze what is needed for you and build your own roadmap?

My answer to the above question is this series.

What does this series cover?

This series is organized into 9 posts:

Post 1 is the one you are reading right now, it captures the motivation to write this blog series.

Post 2 covers the high-level approach to how you can build your road-map to learn data science & build a career in it.

Posts 3–8 cover all the steps mentioned in post 2 in greater depth, to give you enough knowledge to build your road-map.

Post 9 guides on how to continue from here, keep yourself updated and remain ahead of the curve in this ever-evolving field.

Who is this series for?

This series can be useful for a variety of readers, but I wrote it with two main target audiences in mind. One of these target audiences is university students learning about DS/AI field, including those who are starting a career in data science and artificial intelligence. The other target audience is professionals working in other fields, who do not have a data science, statistics or programming background but want to rapidly acquire one and expand their career.

There are no prerequisites for a reader to cover before reading this series.

Looks interesting? Stay tuned to read the upcoming post as soon as it is published.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

27 Dec 2018

Data Science Introduction

This is the 2nd post of blog post series ‘Data Science: The Complete Reference’, this post covers these topics related to data science introduction.

  • What is Data Science?
  • Why Data Science is important?
  • How to do Data Science?

Want to learn more? visit www.ankitrathi.com

What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining. ~Wikipedia

Data Science is a field where we apply 'science' to available 'data' in order to find the 'patterns' or 'insights' which can help a business optimize operations or improve decisions.

Different aspects of Data Science

[embed]https://www.edureka.co/blog/what-is-data-science/[/embed]

Why Data Science is important?

Every business has data but its business value depends on how much they know about the data they have.

Data Science has gained importance in recent times because it can help businesses increase the business value of their available data, which in turn can give them a competitive advantage.

It can help us know our customers better, optimize our processes, and make better decisions. Because of data science, data has become a strategic asset.

In the following chart, you can have a look at the business use cases where data science is being used in the industry.

<https://www.edureka.co/blog/what-is-data-science/>

How to do Data Science?

A typical data science process looks like this and can be adapted for a specific use case (a minimal code sketch follows the list):

  • Understand the business
  • Collect & explore the data
  • Prepare & process the data
  • Build & validate the models
  • Deploy & monitor the performance

<https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview>
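As a rough illustration, here is a minimal sketch of these steps in code. It is only a skeleton, assuming a hypothetical churn.csv file and a 'churn' target column; each stage involves far more work in a real project.

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Understand the business: say we want to predict customer churn.
# 2. Collect & explore the data (hypothetical file and columns).
df = pd.read_csv("churn.csv")
print(df.describe(), df.isna().sum())

# 3. Prepare & process the data.
df = df.dropna()
X, y = df.drop(columns=["churn"]), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 4. Build & validate the models.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5. Deploy & monitor: persist the model for a serving layer to load.
joblib.dump(model, "churn_model.joblib")
```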

We will discuss each step in detail in upcoming posts.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

06 Nov 2018

AMA Session with AR on Data Science

While I keep answering questions asked by DS beginners and keep discussing interesting topics with DS practitioners on LinkedIn, recently we had a kind of 'Ask Me Anything' (AMA) session where many beginners/practitioners asked me some interesting questions. This post is the collection of the questions asked and my responses; I believe you will find it useful.


Latish Khubnani: Is it wiser to start with a Data Analyst position than a Data Scientist one, especially when you don't have a PhD?

Ankit Rathi: Having a PhD is not a requirement to become a Data Scientist unless the role involves sophisticated research work, where a PhD particularly helps. However, if building the skill-set for a Data Scientist looks intimidating, one can start with a Data Analyst job.

Eshan Bhatt: When it comes to DS, it's important not only to possess technical knowledge and an analytical approach, but also to have domain knowledge of the specific industry. Well, it is definitely not easy to have good knowledge of all the industries. In this case, how would you tackle such a situation while looking for jobs in DS/DA?

Ankit Rathi: As the DS field is evolving, while hiring a Data Scientist we focus on DS know-how (Stats/Maths, IT/Programming) and have a domain expert separately on the project, as it is very rare to find the unicorn. Data Scientists learn the domain over a period of time working with domain experts, so it's a nice-to-have skill, at least in the projects I have worked on till date.

Devendra Kumar: Since I believe it will take much time to grab a data science job, can I meanwhile work as a data analyst, prepare for DS and then switch to DS? Is it a safe approach? Would learning Excel, advanced Excel and SQL, a scripting tool e.g. Python, and a reporting tool such as Power BI or Tableau be enough to get a job as a Data Analyst?

Ankit Rathi: This looks like a good approach; if you are finding it difficult to land a job as a DS, you can start with any data-related job/project (data analysis, visualization, SQL-related etc.) and keep honing your skills for the right opportunity.

Every job requires some generic & some specific skills; have a look at the various job descriptions for data analysts on job portals and you will get to know which skills are general and frequently required.

Gayatri Iyer: How to think like a data scientist? In our day-to-day life, we may come across many instances where data science can be applied. How do we identify these potential data science problems hidden in the mundane world?

Ankit Rathi: To me, thinking like a data scientist means analyzing a use case, backtracking it to its data sources, and identifying the patterns in the data that help meet the business objective. In other words, identifying actionable insights by analyzing the relevant data of a business function. Identifying data science use cases requires both knowledge of data science and domain expertise.

So as a data scientist, if we don't have expertise in the domain, it's critical for us to educate business leaders and operators about data science in an intuitive way, so that they can identify the use cases for us and we can validate them.

We can also correlate the use cases of our current domain with past projects in other domains: for that kind of business function we built that use-case, so is there a similar business function in this domain?

Lakshmipathi G: What are the main components a non-engineering or non-technical person should focus on to become a Data Scientist?

Ankit Rathi: There are 3 main aspects of Data Science: Maths/Stats, IT/Programming & Business/Domain. Non-engineering candidates can focus on the Programming & Maths parts in parallel, while the domain is learnt while working on real projects. In my view, the quickest way to start with DS these days is 'Kaggle Learn'. Please refer to this podcast:

[embed]https://link.medium.com/zZyNd99nBR[/embed]

Vishal Tiwari: What are the techniques to get features in text classification?

Ankit Rathi: Basically, you need to convert text into some kind of numerical representation; some methods are common, some are contextual. Some common methods based on vectorization are mentioned in these articles:

[embed]https://link.medium.com/zZyNd99nBR[/embed]
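As a quick illustration, a minimal TF-IDF vectorization sketch with scikit-learn (the toy documents and spam labels below are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["free offer, claim your prize", "meeting at noon tomorrow"]
labels = [1, 0]  # toy labels: spam vs. not spam

# Turn raw text into a sparse document-term matrix of TF-IDF weights.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)

# Any standard classifier can then consume these numeric features.
clf = LogisticRegression().fit(X, labels)
```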

Sohini Aich: How can a postgraduate in control and instrumentation break into the data science field and land a job?

Ankit Rathi: As I responded above, there are 3 main aspects to learn in Data Science: Maths/Stats, IT/Programming & Business/Domain. Please refer to the following podcast:

[embed]https://link.medium.com/zZyNd99nBR[/embed]

Mukul Sharma: Can we use bootstrapping where the statistical test is accuracy, putting a significance level on model testing? I think this will give better results than cross-validation.

Ankit Rathi: Statistical tests are subjective to the problem space we are working in; there is no silver bullet as such. Generally, I don't rely on a specific method and try different methods in context; if most of the results from these methods say the same thing, I rely on that.

Manish Shinde: What things should I learn in order to build a small data science project which mimics an actual production-level scenario? I am trying to learn AWS and Django to build something in data science, e.g. find players within budget with the Moneyball dataset.

Ankit Rathi: In my view, it's difficult for a self-initiated project to mimic a production-level project, as many things are missing, like business stakeholders' expectations, technology ecosystem limitations etc. But at the same time, I believe it's a good exercise to get that kind of exposure, so please carry on.

Purnasai Gudikandula: Everything about Kaggle: tips and tricks, and how long does it take for a beginner to become a Kaggler, and then from Kaggler to Grandmaster like you, Ankit Rathi sir?

Ankit Rathi: First of all, I am just a Kaggle Competitions Expert; these are tags you get based on how you perform in Kaggle competitions, and there are other tags related to kernels & discussions.

In my view, start with ‘Kaggle Learn’ section to learn the basics, then learn from ‘Kaggle Kernels’ about how other DS have solved common problems using DS techniques, then you can participate in basic competitions for learning like Titanic, Housing Prices, Digit Recognizer etc.

Once you are comfortable with the above, you can participate in any live competitions. How long it takes to become a Kaggle expert depends: I took 2 years as I was active only intermittently, but many brilliant people have achieved it in 6 months. So it depends on how active and engaged you are.


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also feel free to listen to me on SoundCloud & visit my website https://ankitrathi.com.

17 Sep 2018

Dealing with Noisy Data in Data Science

This article discusses the types of noise you encounter while working on (tabular) data in data science projects, and possible approaches you can take to deal with such noise. For a detailed explanation of the methods mentioned in this post, please refer to the links in the 'References' section or explore them yourself.

  • Noise in data
  • Noise as an item (Noise1)
  • Noise as a feature (Noise2)
  • Noise as a record (Noise3)
  • Unsupervised methods

Visit ankitrathi.com now:

— to read my blog posts on various topics of AI/ML

— to keep a tab on the latest & relevant news/articles from the AI/ML world daily

— to refer to free & useful AI/ML resources

— to buy my books at a discounted price

— to know more about me and what I am up to these days

We were working on a dataset for our data science project, where we saw that our model was not performing up to the mark. While performance is a subjective term and there can be many reasons for an under-performing model, our hunch was that this is because of the noise in the dataset.

We tried many approaches to identify and reduce this noise. Some of them worked, and some of them didn’t, because of the specific nature of the problem and the patterns in the data.

Based on that experience, I am going to discuss the various types of noise in data, and the approaches and methods to identify & reduce noise in a given dataset.

Understanding Noise in Data

Noise (in the data science space) is unwanted data items, features or records which don't help in explaining the feature itself, or the relationship between feature & target. Noise often causes algorithms to miss out on patterns in the data.

Noise in tabular data can be of three types:

  1. Anomalies in certain data items (Noise 1: certain anomalies in features & target)
  2. Features that don’t help in explaining the target (Noise 2: irrelevant/weak features)
  3. Records which don’t follow the form or relation which rest of the records do (Noise 3: noisy records)

Benefits of identifying & treating noise in data:

  • enables the DS algorithm to train faster.
  • reduces the complexity of a model and makes it easier to interpret
  • improves the accuracy of a model if the right subset is chosen
  • reduces overfitting

These are the ways of dealing with noise in data, based on the type of noise:

Noise as an item

We can analyse the features & target and identify the noise in terms of outliers.

Outlier detection & treatment: either remove the offending records, or apply an upper and lower ceiling (capping).
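For instance, a minimal sketch of the capping approach with Pandas (the 1.5×IQR fences are the usual convention; the column is whatever numeric feature you are treating):

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Cap values outside the 1.5*IQR fences instead of dropping records."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.clip(lower=lower, upper=upper)
```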

Noise as a feature

This type of noise is introduced when there are features in the data which are not related to the target or don't help in explaining the target.

Feature Selection or Elimination

Not all features are important, so we can use various methods to find the best subset of features:

Filter method

We can perform various statistical tests between feature & response to identify which features are more relevant than others.

Please note that the above methods don't identify or deal with multicollinearity; we need to figure that out separately.
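A minimal sketch of a filter method using scikit-learn's univariate ANOVA F-test (a sklearn demo dataset stands in for your own data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature independently against the target with an F-test.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the 10 retained features
```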

Wrapper method

Here we add/remove features to a baseline model and compare the model's performance (a minimal sketch follows the list):

  • Forward selection
  • Backward elimination
  • Recursive elimination
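For example, recursive feature elimination (RFE) in scikit-learn repeatedly drops the weakest features according to a base estimator; the estimator and feature count below are assumed choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so coefficients are comparable

rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
rfe.fit(X, y)
print(rfe.ranking_)  # rank 1 marks the selected features
```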

Embedded Methods (Regularization)

This method combines the qualities of filter & wrapper methods; it is implemented using algorithms which have their own built-in feature selection mechanisms.
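Lasso (L1 regularization) is a common example of such an algorithm: it zeroes out the coefficients of weak features during training itself. A minimal sketch (the alpha value is an assumed choice to tune):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=0.5).fit(X, y)
print(np.flatnonzero(lasso.coef_))  # features that survived regularization
```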

Noise as a record

In these methods, we can try to find the set of records which have noise.

K-fold validation

In this method, we look at the cross-validation score of each fold and analyse the folds with poor CV scores: what are the common attributes of the records in those folds, and so on.
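A minimal sketch of inspecting per-fold scores (again using a sklearn demo dataset as a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
for fold, score in enumerate(scores):
    # Unusually poor folds deserve a closer look at their records.
    print(f"fold {fold}: {score:.3f}")
```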

Manual method

Here we can evaluate the error of each record (predicted vs. actual) and filter/analyse the records with poor scores. This helps us analyze why this is happening in the first place.


If you are loving this post, check my post on ‘How to launch your DS/AI Career in 12 weeks?’

[embed]https://medium.com/\@rathi.ankit/how-to-launch-your-ds-ai-career-in-12-weeks-8a7e6950ffe6[/embed]


Unsupervised Methods (Anomaly Detection)

We can also use unsupervised learning algorithms to identify anomalies in data; these are mostly categorized as anomaly detection techniques.

Density-based anomaly detection

This method assumes normal data points occur in dense neighborhoods while abnormalities are far away, e.g. kNN & LOF-based methods.
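A minimal LOF sketch (n_neighbors and contamination are assumed values to tune):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import LocalOutlierFactor

X, _ = load_breast_cancer(return_X_y=True)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)  # -1 marks low-density (anomalous) records
print((labels == -1).sum(), "suspected noisy records")
```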

Clustering-based anomaly detection

Using a clustering technique, we can identify which clusters contain noise; data instances falling outside the clusters can be marked as anomalies, e.g. k-Means clustering.
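A minimal sketch: flag the records farthest from their k-Means centroid (the cluster count and the 95th-percentile cutoff are assumed choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance of each record to its own cluster centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = np.percentile(dist, 95)
print(np.flatnonzero(dist > threshold))  # candidate noisy records
```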

SVM-based anomaly detection

This technique uses an SVM to learn a soft boundary on the training set, tuned on a validation set to identify anomalies. It reduces the large-sample requirement of the previous approach while maintaining the quality of clustering-based anomaly detection, e.g. One-class SVM.
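A minimal One-class SVM sketch (nu, the assumed fraction of anomalies, is a parameter to tune):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # SVMs are scale-sensitive

ocsvm = OneClassSVM(nu=0.05, kernel="rbf").fit(X)
labels = ocsvm.predict(X)  # -1 marks suspected anomalies
print((labels == -1).sum(), "suspected anomalies")
```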

Autoencoder-based anomaly detection

Auto-encoders are used in deep learning for unsupervised learning; we can use them for anomaly detection to identify noisy data-sets. These methods are advanced and outperform traditional anomaly detection methods, e.g. Variational Autoencoder based Anomaly Detection using Reconstruction Probability.
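A minimal sketch, assuming TensorFlow/Keras is available (a plain autoencoder rather than the variational one cited above, with random data standing in for your feature matrix):

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20).astype("float32")  # stand-in feature matrix

# Compress to a bottleneck, then reconstruct the input.
ae = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(20,)),
    keras.layers.Dense(20),
])
ae.compile(optimizer="adam", loss="mse")
ae.fit(X, X, epochs=20, batch_size=32, verbose=0)

# Records the model reconstructs worst are the most suspicious.
errors = np.mean((X - ae.predict(X)) ** 2, axis=1)
print(np.argsort(errors)[-10:])
```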

Conclusion

Not every method mentioned above suits every situation or problem. We need to analyse what kind of noise we have in our data and try the corresponding methods to remove or minimize it. In our project, some of the methods we tried worked, depending on the specific patterns in our data-set.

References

[embed]https://medium.com/\@rathi.ankit/how-to-launch-your-ds-ai-career-in-12-weeks-8a7e6950ffe6[/embed]
http://dm.snu.ac.kr/static/docs/TR/SNUDM-TR-2015-03.pdf


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also, feel free to visit my webpage https://ankitrathi.com.

25 Aug 2018

To Kaggle Or Not To Kaggle?


This is my 2nd podcast on SoundCloud; it covers one of the topics most frequently discussed by beginners as well as experts—To Kaggle Or Not To Kaggle?

I have attached my podcast to this blog post to share the transcript and the links of the resources mentioned in the podcast.

Want to learn more? visit www.ankitrathi.com


Listen to my podcast here:

[embed]https://soundcloud.com/ankitrathi/e02-to-kaggle-or-not-to-kaggle[/embed]


Podcast Transcript:

Hello everyone, my name is Ankit Rathi & welcome to ‘Data Deft’, a data science podcast.

When I got introduced to Data Science in 2012, the second platform that helped me shape my data science skills was Kaggle; the first one was (of course) Coursera's Machine Learning course by Andrew Ng. At that time, Kaggle was only about competitions; other useful sections like Kernels, Datasets & Learn were not there.

I have participated in 6 competitions till now, learnt a lot, and won medals in 3 of them (1 silver & 2 bronze). I am a Kaggle Expert, with a highest rank of around 1K out of 80K participants back then. These days I don't get much time to participate, but I look at the winning solutions of recently concluded competitions whenever I get some time, to keep myself updated.

[embed]https://www.kaggle.com/ankitrathi/competitions[/embed]

In this post, I am going to put forward my views on what Kaggle competitions are good for & what they are not. In short, Kaggle is great in many aspects, but it is not everything in data science. When you work on a real-world data science project, you deal with many more challenges, and a different set of skills is required for that.

Note: By Kaggle competition here, I mean any data science hackathon or competition you intend to compete.

The major difference between Kaggle competitions & real-world data science projects is that Kaggle competitions are based on supervised learning while data science projects can be anything, supervised or unsupervised.

To make the difference clearer, I will elaborate the gap for each step in the data science framework. Let's take the CRISP-DM methodology as a baseline here:

1. Business Understanding

In Kaggle competitions, you get the business problem formulated for you, while in data science projects, you have to identify and frame the problem statement yourself. Most of the time, a stakeholder or customer doesn't know what problems can be solved by data science; sometimes, even if they know, they have a vague requirement like 'we need to increase our sales', 'we need to improve our operations' or 'we need to optimize our business decisions'. Data scientists need to sit with stakeholders to formulate the problem statement & translate it into a data science problem.

2. Data Understanding

Another point which differentiates Kaggle competitions from real-world problems is that in Kaggle competitions, you get data which is mostly processed & already split into train & test, while in data science projects, you identify what data qualifies for your problem statement. Most of the time, you have to identify what qualifies as a feature and what is a suitable target variable. Sometimes it's not straightforward to identify the target variable; you define it by working with domain experts. You also have to define the methodology for splitting data into train, validation & test sets.

3. Data Preparation

This step doesn’t have much difference in Kaggle competitions & real world projects but real-world data is more complex & dirty so more cleaning & preparation is required. Overall, participating in Kaggle competitions will help you to improvise your data cleaning & data preparation skills.


If you liked this post, you may also like this post where I talk about how to start in DS/AI field:

[embed]https://www.kaggle.com/ankitrathi/competitions[/embed]


4. Modeling

Again, this step doesn’t have much difference in Kaggle competitions & data science projects. In fact, participating in Kaggle competition is beneficial for this particular step as you get to know which model works better for what kind of problem.

5. Evaluation

Another step where Kaggle competitions differ from real-world problems is that the evaluation metric is defined for you, while in data science projects, you choose which evaluation metric is suitable for your project. Still, participating in Kaggle competitions will give you exposure to evaluation metrics and which metric to use where. You will also get to know how not to overfit your model on train data.

6. Deployment

In Kaggle competitions, you get a submission format in which you submit your predictions. While in data science projects, you have to deploy the models in live environment for business to use. You also have to understand tech-ecosystem of the customer, how you will integrate your solution and how you will monitor the performance of your model.

Another major difference between Kaggle competitions and data science projects is that participants build way too many models and keep ensembling them to gain an advantage on the leaderboard; ultimately, these complex models are not fit to be deployed in production.

But over the years, Kaggle has recognized that gap and now has other sections like Learn, Kernels & Datasets; do check them out to improve your skills further.

Thanks for listening to the podcast, I will be waiting for your feedback.

References:

[embed]https://www.kaggle.com/ankitrathi/competitions[/embed]


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also feel free to listen to me on SoundCloud.

15 Aug 2018

Bayesian Statistics for Data Science


This is the 5th post of blog post series ‘Probability & Statistics for Data Science’, this post covers these topics related to Bayesian statistics and their significance in data science.

  • Frequentist Vs Bayesian Statistics
  • Bayesian Inference
  • Test for Significance
  • Significance in Data Science

Visit ankitrathi.com now:

— to read my blog posts on various topics of AI/ML

— to keep a tab on the latest & relevant news/articles from the AI/ML world daily

— to refer to free & useful AI/ML resources

— to buy my books at a discounted price

— to know more about me and what I am up to these days

Frequentist Vs Bayesian Statistics

Frequentist statistics tests whether an event (hypothesis) occurs or not: it calculates the probability of an event in the long run of the experiment. A very common flaw in the frequentist approach is the dependence of the result of an experiment on the number of times the experiment is repeated.

Frequentist statistics suffers from some serious flaws in its design and interpretation, which pose concerns in real-life problems:

  1. p-value & Confidence Interval (C.I) depend heavily on the sample size.
  2. Confidence Intervals (C.I) are not probability distributions

Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides the tools to update one's beliefs in light of new evidence.

Frequentist Vs Bayesian Statistics

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]

Bayesian Inference

To understand Bayesian Inference, you need to understand Conditional Probability & Bayes Theorem, if you want to review these concepts, please refer my earlier post in this series.

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]

Bayesian inference is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

An important part of Bayesian inference is the establishment of parameters and models. Models are the mathematical formulation of the observed events; parameters are the factors in the models affecting the observed data. To define our model correctly, we need two mathematical objects beforehand: one to represent the likelihood function, and the other to represent the distribution of prior beliefs. The product of these two gives the posterior belief distribution.

Courtesy: <http://jason-doll.com/wordpress/?page_id=127>

[embed]https://www.youtube.com/watch?v=5NMxiOGL39M[/embed]

Likelihood Function

A likelihood function is a function of the parameters of a statistical model, given specific observed data. Probability describes the plausibility of a random outcome, without reference to any observed data, while likelihood describes the plausibility of a model parameter value, given specific observed data.

Likelihood function

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]

Prior & Posterior Belief distribution

The prior belief distribution represents the strength of our beliefs about the parameters based on previous experience. The posterior belief distribution is derived by multiplying the likelihood function and the prior belief distribution.

As we collect more data, our posterior belief moves away from the prior belief and towards the likelihood (the data):

Courtesy: <https://jimgrange.wordpress.com/2016/01/18/pesky-priors/>

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]
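As a small worked example (the prior parameters and coin-flip counts below are made up), here is a conjugate beta-binomial update, where the posterior is just the prior's Beta parameters incremented by the observed successes and failures:

```python
from scipy import stats

prior_a, prior_b = 2, 2  # assumed prior belief about a coin's bias (mean 0.5)
heads, tails = 30, 10    # assumed new evidence (data proportion 0.75)

# Conjugacy: Beta prior + binomial likelihood -> Beta posterior.
post = stats.beta(prior_a + heads, prior_b + tails)
print("posterior mean:", post.mean())           # pulled towards the data
print("95% credible interval:", post.interval(0.95))
```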

Test for Significance

Bayes factor

The Bayes factor is the equivalent of the p-value in the Bayesian framework. The null hypothesis in the Bayesian framework places a point mass (infinite density) at a particular value of a parameter (say θ = 0.5) and zero probability elsewhere. The alternative hypothesis allows all values of θ, hence a flat curve representing the distribution.

Courtesy: <http://areshenk-research-notes.com/bayes-factors-and-stopping-rules/>

Using Bayes factors instead of p-values is more beneficial in many cases, since they are independent of intentions and sample size.

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]
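A minimal sketch of this idea (the flip counts are made up): compare the marginal likelihood of the data under a point null θ = 0.5 against a flat Beta(1, 1) prior.

```python
from scipy import integrate, stats

n, k = 40, 30  # assumed data: 30 heads in 40 flips

# Marginal likelihood under H0 (point mass at theta = 0.5).
m0 = stats.binom.pmf(k, n, 0.5)

# Marginal likelihood under H1: average the likelihood over a flat prior.
m1, _ = integrate.quad(lambda t: stats.binom.pmf(k, n, t), 0, 1)

print("BF10 =", m1 / m0)  # >1 favours H1, <1 favours H0
```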

High Density Interval (HDI)

The High Density Interval (HDI), or credibility interval, is the Bayesian framework's equivalent of the Confidence Interval (CI). The HDI is formed from the posterior distribution after observing the new data.

Courtesy: <https://www.slideshare.net/ASQwebinars/bayesian-methods-in-reliability-engineering-15204318>

Using the High Density Interval (HDI) instead of the Confidence Interval (CI) is more beneficial, since it is independent of intentions and sample size.

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]

Moreover, there is a nice article published on AnalyticsVidhya which elaborates on these concepts with examples:

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]

Significance in Data Science

Bayesian statistics encompasses a specific class of models that could be used for Data Science. Typically, one draws on Bayesian models for one or more of a variety of reasons, such as:

  • having relatively few data points
  • having strong prior intuitions
  • having high levels of uncertainty

And there are scenarios where Bayesian statistics performs drastically better; please read the following discussion for details:

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]

References:

[embed]https://www.coursera.org/lecture/bayesian/frequentist-vs-bayesian-inference-q5CTh[/embed]


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

04 Aug 2018

Descriptive Statistics for Data Science


This is the 3rd post of blog post series ‘Probability & Statistics for Data Science’, this post covers these topics related to descriptive statistics and their significance in data science.

  • Introduction to Statistics
  • Descriptive Statistics
  • Uni-variate Analysis
  • Bi-variate Analysis
  • Multivariate Analysis
  • Function Models
  • Significance in Data Science

Visit ankitrathi.com now:

— to read my blog posts on various topics of AI/ML

— to keep a tab on the latest & relevant news/articles from the AI/ML world daily

— to refer to free & useful AI/ML resources

— to buy my books at a discounted price

— to know more about me and what I am up to these days

Statistics Introduction

Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data.

Informal definition :D

Statistics, in short, is the study of data. It includes descriptive statistics (the study of methods and tools for collecting data, and mathematical models to describe and interpret data) and inferential statistics (the systems and techniques for making probability-based decisions and accurate predictions).

Descriptive Vs Inferential Statistics

Population vs Sample

A population is the aggregate of all elements under study having one or more common characteristics, while a sample is a part of the population chosen at random for participation in the study.

Population Vs Sample

Descriptive Statistics

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information. Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at hand.

Types of Variable

Dependent and Independent Variables: An independent variable (experimental or predictor) is a variable that is being manipulated in an experiment in order to observe the effect on a dependent variable (outcome).

Other names of variables

Categorical and Continuous Variables: Categorical variables (qualitative) represent types of data which may be divided into groups. Categorical variables can be further categorized as either nominal, ordinal or dichotomous. Continuous variables (quantitative) can take any value. Continuous variables can be further categorized as either interval or ratio variables.

*Categorical Vs continuous variables*

Central Tendency

Central tendency is a central or typical value for a distribution. It may also be called the center or location of the distribution. The most common measures of central tendency are the arithmetic mean, the median and the mode.

Mean, median & mode as Central Tendency

The mean is the numerical average of all values, the median is directly in the middle of the data set, and the mode is the most frequent value in the data set.

Spread or Variance

Spread (dispersion or variability) is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and inter-quartile range (IQR).

Spread or Variance

Inter-quartile range (IQR) is the distance between the 1st quartile and 3rd quartile and gives us the range of the middle 50% of our data. Variance is the average of the squared differences from the mean while standard deviation is the square root of the variance.

Upper outlier fence: Q3 + 1.5·IQR
Lower outlier fence: Q1 − 1.5·IQR

Standard Score or Z score: For an observed value x, the Z score finds the number of standard deviations x is away from the mean.

Standard deviation & Z-score
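A minimal sketch computing the measures above on a toy sample (the numbers are made up, with one deliberate outlier):

```python
import numpy as np
from scipy import stats

x = np.array([2, 4, 4, 4, 5, 5, 7, 9, 25])  # toy data; 25 is an outlier

# Central tendency: mean, median, mode.
vals, counts = np.unique(x, return_counts=True)
print("mean:", x.mean(), "median:", np.median(x), "mode:", vals[counts.argmax()])

# Spread: IQR with outlier fences, standard deviation, z-scores.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print("IQR:", iqr, "fences:", q1 - 1.5 * iqr, "to", q3 + 1.5 * iqr)  # 25 is outside
print("std:", x.std(ddof=1), "z-scores:", stats.zscore(x, ddof=1))
```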

The Central Limit Theorem helps us understand the following facts, regardless of whether the population distribution is normal (a quick simulation follows the list):

  1. the mean of the sample means is the same as the population mean
  2. the standard deviation of the sample means is always equal to the standard error
  3. the distribution of sample means becomes increasingly more normal as the sample size increases.
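A minimal simulation of these facts, assuming a skewed (exponential) population as the example:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # decidedly non-normal

# Draw 10,000 samples of size 50 and take each sample's mean.
sample_means = rng.choice(population, size=(10_000, 50)).mean(axis=1)

print("population mean:", population.mean())
print("mean of sample means:", sample_means.mean())       # fact 1
print("std of sample means:", sample_means.std(ddof=1))   # fact 2 ...
print("standard error:", population.std() / np.sqrt(50))  # ... matches this
```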

Uni-variate Analysis

In uni-variate analysis, the appropriate statistic depends on the level of measurement. For nominal variables, a frequency table and a listing of the mode(s) is sufficient. For ordinal variables, the median can be calculated as a measure of central tendency and the range (and variations of it) as a measure of dispersion. For interval-level variables, the arithmetic mean (average) and standard deviation are added to the toolbox and, for ratio-level variables, we add the geometric mean and harmonic mean as measures of central tendency and the coefficient of variation as a measure of dispersion.

For interval and ratio level data, further descriptors include the variable’s skewness and kurtosis. Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.

Skewness & Kurtosis
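A minimal sketch measuring both with SciPy, on synthetic skewed and symmetric samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
right_skewed = rng.exponential(size=10_000)
symmetric = rng.normal(size=10_000)

print("skewness:", stats.skew(right_skewed), stats.skew(symmetric))  # >0 vs ~0
# SciPy reports excess kurtosis by default (0 for a normal distribution).
print("kurtosis:", stats.kurtosis(right_skewed), stats.kurtosis(symmetric))
```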

Mainly, bar graphs, pie charts and histograms are used for uni-variate analysis.

Histogram, Pie-chart & Bar-graph

Bi-variate Analysis

Bivariate analysis involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.

For two continuous variables, a scatter-plot is a common graph. When one variable is categorical and the other continuous, a box-plot or violin-plot is common (along with Z-tests and t-tests), and when both are categorical, a mosaic plot is common (along with the chi-square test). A quick sketch of these tests follows below.

Box-plot, Scatter-plot, Mosaic-plot & Violin-plot

z & t-Tests
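A minimal sketch of the tests mentioned above: a t-test for a continuous variable split by a two-level category, and a chi-square test for two categorical variables (all the data is made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, size=200)   # e.g. spend of segment A
group_b = rng.normal(loc=5.5, size=200)   # e.g. spend of segment B
print(stats.ttest_ind(group_a, group_b))  # tests the difference in means

table = np.array([[30, 10],               # toy contingency table of two
                  [20, 40]])              # categorical variables
chi2, p, dof, expected = stats.chi2_contingency(table)
print("chi-square p-value:", p)           # tests independence
```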

Multi-variate Analysis

Multi-variate analysis involves observation and analysis of more than one statistical outcome variable at a time. Multi-variate scatter plot, grouped box-plot (or grouped violin-plot), heat-map are used for multi-variate analysis.

Multi-variate scatter-plot, Grouped box-plot, Grouped violin-plot, Heat-map

Function Models

A function can be expressed as an equation, as shown below; in the equation, f represents the function name, x the independent variable and y the dependent variable.

A typical function

A linear function has the same average rate of change on every interval. When a linear model is used to describe data, it assumes a constant rate of change.

Linear function

In exponential functions, the variable appears as the exponent (or power) instead of the base.

Exponential functions

The logistic function has a limiting upper bound: a curve that grows exponentially at first, then slows down and hardly grows at all.

Logistic functions
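A minimal sketch of the three function models in NumPy (the default parameter values are arbitrary):

```python
import numpy as np

def linear(x, m=2.0, b=1.0):
    return m * x + b  # constant rate of change

def exponential(x, a=1.0, r=0.5):
    return a * np.exp(r * x)  # variable sits in the exponent

def logistic(x, L=1.0, k=1.0, x0=0.0):
    return L / (1 + np.exp(-k * (x - x0)))  # bounded above by L

x = np.linspace(-5, 5, 11)
print(logistic(x))  # grows fast near x0, then flattens towards L
```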

Significance in Data Science

Descriptive statistics helps you understand your data and is the initial & very important step of Data Science. This is because Data Science is all about making predictions, and you can't predict if you don't understand the patterns in the existing data.

References:

[embed]https://classroom.udacity.com/courses/ud827-india[/embed]


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

24 Jul 2018

Data Science Beginners’ FAQs


On a regular working day, I receive at least 4–5 messages from beginners on LinkedIn asking me different questions about Machine Learning (ML)/Data Science (DS). Initially, I used to respond to each question; later, I started copying my previous responses and referring to my relevant blog-posts. But this week I realized that I could write a blog-post on DS/ML beginners' FAQs and refer anyone to this post if they ask me any of these questions.

I have not reinvented the wheel as such, for each question I have given my perspective and provided relevant posts, links or sources which you can go through to gain an understanding.

  • Who all are fit to learn ML/DS? What are the pre-requisites to learn ML/DS?
  • How to start with ML/DS? What are the good sources to learn different topics in ML/DS?

Almost anybody can learn DS/ML by following a structured approach from scratch. There are no prerequisites as such; you just need a curious & logical mind.

I have written several blog-posts from different viewpoints, please have a look. Kaggle also has a Learn section, which is perhaps the fastest way to get started with ML/DS.

[embed]https://medium.com/\@rathi.ankit/the-machine-learning-curve-d1ef041fff78[/embed]

  • Which MOOCs are worth spending time?

Yes, these days scarcity of information is not a problem; a plethora of courses is available. The problem for beginners is deciding which course to follow. In my view, every course has its own pros & cons: some courses are theoretical, some are practical, some are for beginners and some are for the experienced. One needs to research what they are looking for.

[embed]https://medium.com/\@rathi.ankit/the-machine-learning-curve-d1ef041fff78[/embed]

  • I have done XYZ course on ML/DS, how to get hands-on experience?

Public data-sets are the best way to get hands-on experience. First, start with a data-set on which most people have worked, so that you can compare how well or badly you are doing. Later, you can move to different data-sets or problem spaces that you care about, to get exposure and showcase the skills you have built.

[embed]https://medium.com/\@rathi.ankit/the-machine-learning-curve-d1ef041fff78[/embed]

  • How can I get job in ML/DS area as a fresher?

Yes, it is difficult to get a job in the ML/DS area as a fresher, and the reasons are many: some organizations don't have the funds or time to invest in training freshers, some projects have tight delivery timelines, some teams don't have ML/DS experts to lead, etc. But there are ways: if you build your skills and show your prospective employer that you can be productive from day one, you increase your chances of getting hired by leaps & bounds.

[embed]https://medium.com/\@rathi.ankit/the-machine-learning-curve-d1ef041fff78[/embed]

  • I am working in IT for X years, how can I start with ML/DS?
  • I am a working professional and done XYZ course on ML/DS, how can I use these skills in my current job?

If you are already working and have learnt ML/DS skills as mentioned above, there is a fair chance that you can identify opportunities in your current organization, as every business has data which can give it a competitive advantage. Convincing your manager or the business can be challenging, but it depends on how good you have become at convincing them.

[embed]https://medium.com/\@rathi.ankit/the-machine-learning-curve-d1ef041fff78[/embed]

  • What kind of problems do data scientists solve? What does their day-to-day job look like?

Every organization’s culture is different, its business is different, its data is different, its challenges are different. Yet you can find similarities in how data scientists work in those organizations on DS/ML projects.

[embed]https://medium.com/\@rathi.ankit/the-machine-learning-curve-d1ef041fff78[/embed]

So, I have covered almost all the questions I have been asked till date, I may expand this post if someone asks me a different question which I feel can benefit larger pool of beginners.

Happy learning.


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also feel free to listen to me on SoundCloud.

18 Jul 2018


Data Science: Learn, Apply, Compete, Ask, Connect & Stay Updated

Want to learn more? visit www.ankitrathi.com

In your data science journey, you will be doing multiple activities and using many platforms to gather the required knowledge and skills. In this post, I am going to share the platforms that I have used over a period of time to build my skill-set and stay updated with the latest developments in this ever-evolving field.

*By no means is this a comprehensive list; it is more a list of platforms that I have used or use often.

  • Kaggle

Kaggle is a community of data scientists and data enthusiasts. This platform enables you to learn from and mentor each other on your personal, academic, and professional data science journeys. To get involved, you can learn about data science, enter a machine learning competition, publish an open dataset or share code in their reproducible data science environment.

  • Analytics Vidhya

Analytics Vidhya provides a community based knowledge portal for Analytics and Data Science professionals. The aim of the platform is to become a complete portal serving all knowledge and career needs of Data Science Professionals.

  • Medium

Medium taps into the brains of the world's most insightful writers, thinkers, and storytellers to bring you the smartest takes on topics that matter. So whatever your interest, you can always find fresh thinking and unique perspectives. While it is a general platform, you can read or write articles related to all data science topics. You can read my latest articles on data science here.

  • Quora

Quora is a question-and-answer site where questions are asked, answered, edited, and organized by its community of users in the form of opinions. While it is a general platform, you can find answers related to all data science topics and if you have any specific query, you can ask here (more conceptual in nature).

  • Stack Overflow

Stack Overflow is the largest, most trusted online community for developers to learn, share their knowledge, and build their careers. More than 50 million professional and aspiring programmers visit Stack Overflow each month to help solve coding problems, develop new skills, and find job opportunities. While it is a general platform, you can ask or find answers related to all data science topics (more technical in nature).

  • KDNuggets

KDnuggets is a leading site on Business Analytics, Big Data, Data Mining, Data Science, and Machine Learning. It is currently reaching over 500K unique monthly visitors, and over 200K subscribers via email, Twitter, LinkedIn, Facebook, feedly/RSS, and Google+.

  • Data Science Central

Data Science Central is the industry’s online resource for big data practitioners. From Analytics to Data Integration to Visualization, it provides a community experience that includes a robust editorial platform, social interaction, forum-based technical support, the latest in technology, tools and trends and industry job opportunities.

  • Twitter

From breaking news and entertainment to sports and politics, from big events to everyday interests. If it’s happening anywhere, it’s happening on Twitter. While it is a general platform, you can follow the people & events related to data science and stay updated about latest happenings in the field. You can follow me for latest news and articles on Twitter here.

  • LinkedIn

LinkedIn is the world’s largest professional network with more than 562 million users in more than 200 countries and territories worldwide. While it is a general platform, you can connect with data science professionals or enthusiasts, read and write articles about data science and stay updated about the latest happenings in the field. You can connect or follow me on LinkedIn here.

Summary

Platforms to develop Data Science skills


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

27 Jun 2018

Who needs a Data Engineer?


Visit ankitrathi.com now:

— to read my blog posts on various topics of AI/ML

— to keep a tab on the latest & relevant news/articles from the AI/ML world daily

— to refer to free & useful AI/ML resources

— to buy my books at a discounted price

— to know more about me and what I am up to these days

Today, who needs a Data Engineer when everyone else wants to hire a Data Scientist?

Let me start with a real-world situation: a new, enthusiastic data scientist joins a firm. He knows how to analyse data, how to build models around it, and how to create data stories. Now the business wants him to work on a use-case; the data scientist understands the use-case and starts looking around for data to work on. And he keeps on waiting, because there is no ready-made data available; the data is hidden across various data stores. The data scientist needs help, and here comes the data engineer to his rescue.

“A Data Engineer is responsible for the creation, processing and maintenance of data pipelines, which deliver the processed data that enables data scientists to work on their use-cases.”

So I would like to call ‘data science’ ‘data science & engineering’, which gives a better idea of the engineering skills required in this field.

But not all organizations realize that they require both roles, and as a result data scientists often end up spending most of their time on data engineering tasks.

Skills of a Data Engineer

An article from DataQuest mentions the following skills that a data engineer should have:

  • Architecting distributed systems
  • Creating reliable pipelines (see the sketch after this list)
  • Combining data sources
  • Architecting data stores
  • Collaborating with data science teams and building the right solutions for them
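
To make ‘creating reliable pipelines’ concrete, here is a minimal, hedged extract-transform-load sketch in Python; the CSV source, the ‘amount’ column and the SQLite target are all hypothetical stand-ins, not a prescribed stack:

```python
import csv
import sqlite3

def extract(path):
    # Read raw rows from a hypothetical CSV source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Drop incomplete rows and normalize a hypothetical 'amount' column
    return [float(r["amount"]) for r in rows if r.get("amount")]

def load(amounts, db="warehouse.db"):
    # Write processed values to a hypothetical target store
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (amount REAL)")
    con.executemany("INSERT INTO sales (amount) VALUES (?)", [(a,) for a in amounts])
    con.commit()
    con.close()

load(transform(extract("sales.csv")))  # the pipeline, end to end
```

Real pipelines add scheduling, retries and monitoring on top of this skeleton; the point is that the data scientist consumes the output table, not the raw files.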

[embed]https://www.dataquest.io/blog/what-is-a-data-engineer/[/embed]

Panoply has also published a decent article, ‘How to Become A Data Engineer’, which highlights the skills required for the role.


Data Scientists Vs Data Engineers:

In general, data scientists are great at advanced analytics, while data engineers are strong on the programming front.

An article by DataCamp explains the differences between data engineers and data scientists across responsibilities, tools, languages, job outlook, salary, etc.


An article on O’Reilly coins the term ‘Machine Learning Engineer’ for a role that fills the gap between a Data Scientist and a Data Engineer.


Ratios of data engineers to data scientists

Even if an organization/department realizes that it needs both roles, a common issue is figuring out the ratio of data engineers to data scientists. Considering that building data pipelines requires more effort, a common starting point is 2–3 data engineers for every data scientist.

References:

[embed]https://www.dataquest.io/blog/what-is-a-data-engineer/[/embed]


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on Twitter, LinkedIn or Instagram?

24 Jun 2018

Having only Data Scientists is not Enough

Data Science is not Enough

Visit ankitrathi.com now to:

— read my blog posts on various topics of AI/ML

— keep a tab on the latest & relevant news/articles from the AI/ML world, daily

— refer to free & useful AI/ML resources

— buy my books at a discounted price

— know more about me and what I am up to these days

A data scientist is probably one of the hottest job titles these days. But building a useful data science solution/product requires many different skills, and it is really challenging to find a single data scientist with that kind of unicorn skill-set.

Successful organizations view data science as a team sport. They assemble individuals with different skill sets and assign them different responsibilities to support each step of the data science process.

While the demand for various data science roles is increasing by the day, people in the industry have used the designations and descriptions a bit loosely. Hence, there is a lot of confusion around who does what in the industry.

The AI Hierarchy of Needs by Monica Rogati (https://hackernoon.com/@mrogati)

Below are the roles and the contributions they should be making to ensure you’re producing quality outputs in the most efficient way possible:

Business (Data) Analyst: The first task from business is to frame the business problem & define its scope; a Business Analyst with data-oriented skills helps with that.

Data Engineer: Once your team is aligned on the problem you’re trying to solve, the next step is to collect the raw data that will act as the foundation of your data model, basically Extract, Transform & Load (ETL). The Data Engineer builds these pipelines.

Data Scientist: The Data Scientist applies algorithms and builds models specifically chosen based on the use case your team has defined and the data that’s available. Apart from this, they productionize their findings by integrating them into your decision makers’ workflows.

Data Architect: When you are working on multiple data science use-cases, there will be situations where the same data is consumed by many use-cases & the same tech-stack is needed by many projects. The Data Architect builds the platform to optimize the use of data & tech-stack.

Data Steward: If you are working on data science projects, data quality & data security are major concerns to be addressed. The Data Steward helps in managing & governing data sources & pipelines.

Analytics Manager: When multiple stakeholders/resources work on projects/programmes, it becomes important to manage expectations, priorities & conflicts. The Analytics Manager manages the analytics or data science team.

References:

[embed]https://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html[/embed]


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also feel free to listen to me on SoundCloud.

13 May 2018

Deep Reinforcement Learning

Deep Reinforcement Learning

While neural networks are responsible for recent breakthroughs in problems like computer vision, machine translation and time series prediction — they can also combine with reinforcement learning algorithms to create something astounding like AlphaGo.

What is Deep Reinforcement Learning?

To understand deep reinforcement learning, let’s first look at some definitions from Wikipedia:

Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Deep learning is loosely related to information processing and communication patterns in a biological nervous system, such as neural coding that attempts to define a relationship between various stimuli and associated neuronal responses in the brain.

Deep reinforcement learning (DRL) is a machine learning method that extends reinforcement learning approach using deep learning techniques.

So from the above definitions we can infer that traditional reinforcement learning aims to solve the problem of how agents can learn to take the best actions in an environment to get the maximum cumulative reward over time. A major part of this process has traditionally been the careful engineering of feature representations. Advances in deep learning algorithms have brought a new wave of successful applications in reinforcement learning, because they offer the opportunity to work efficiently with high-dimensional input data (like images). In this context, the trained deep neural network can be seen as a kind of deep reinforcement learning approach, where the agent learns a state abstraction and a policy approximation directly from its input data.

Why is Deep Reinforcement Learning required?

In situations where you use supervised or unsupervised learning, you already have a pretty good idea of the data you have, what’s going on and how to solve the problem. You’re using machine learning to find interesting patterns in that data to get to a better solution, accelerate the process and reach your answer faster. But what about those situations or problem spaces where you have partial data or no data, where an agent can only learn by trial and error? This is where reinforcement learning comes in handy: domain experts and organizations typically know what they want a system to do, and they want to automate or optimize a specific process. Recent advances in deep learning have also fueled reinforcement learning, because hand-engineered features are no longer needed; after sufficiently many backpropagation updates, the deep neural network learns which information is important for the task.

How to use Deep Reinforcement Learning?

Reinforcement learning is inspired by behavioral psychology.

Instead of providing the model with ‘correct’ actions, we provide it with rewards and punishments. The model receives information about the current state of the environment (e.g. the computer game screen). It then outputs an action, like a joystick movement. The environment reacts to this action and provides the next state, along with any rewards.

The model then learns to find actions that lead to maximum rewards.
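
A minimal sketch of this loop, assuming a hypothetical DummyEnv class in place of a real game environment (its states, actions and rewards are invented for illustration):

```python
import random

class DummyEnv:
    """A hypothetical toy environment: walk right along 0..5 to reach a reward."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):  # action is -1 or +1
        self.state = max(0, min(5, self.state + action))
        reward = 1.0 if self.state == 5 else 0.0
        done = self.state == 5
        return self.state, reward, done  # next state, reward, episode over?

env = DummyEnv()
state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = random.choice([-1, 1])          # the model would output this
    state, reward, done = env.step(action)   # environment reacts with state & reward
    total_reward += reward
print("episode reward:", total_reward)
```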

Q-learning intuition:

Most modern RL algorithms are some adaptation of Q-Learning. A good way to understand Q-learning is to compare it with playing chess.

Q(S, A) = R + γ * max_A' Q(S', A')

The expected future reward Q(S, A) for a given state S and action A is calculated as the immediate reward R, plus the expected future reward thereafter, Q(S', A'). We assume the next action A' is optimal.
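
As a sketch, here is that update for a tabular Q-function in Python; in practice the table is nudged toward the target with a learning rate, so alpha, gamma and the action set below are assumed hyperparameters:

```python
from collections import defaultdict

Q = defaultdict(float)    # Q[(state, action)] -> expected future reward
alpha, gamma = 0.1, 0.99  # assumed learning rate and discount factor
actions = [-1, 1]         # assumed action set

def q_update(s, a, r, s_next):
    # Move Q(S, A) toward the target R + gamma * max_A' Q(S', A')
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```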

As a regression problem:

When playing a game, we generate lots of experiences. These experiences are our training data. We can frame the problem of estimating Q(S,A) as a regression problem. To solve this, we can use a neural network.

Training the experiences:

In the training process, a batch of experiences is used to train the neural network with a loss function that measures how far the predicted outcome is from the actual (target) outcome.
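
As a hedged sketch of that training step, scikit-learn’s MLPRegressor stands in for the Q-network here; the (state, action) feature encoding and the target values are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Each experience: features encoding (state, action) -> a target Q-value
X_batch = np.array([[0, 1], [1, 1], [2, -1]], dtype=float)  # assumed encoding
y_batch = np.array([0.5, 0.8, 0.1])  # assumed targets R + gamma * max Q(S', A')

q_net = MLPRegressor(hidden_layer_sizes=(32,), learning_rate_init=1e-3)
q_net.partial_fit(X_batch, y_batch)  # one training pass on this batch

# The squared-error loss measures how far predictions are from targets
loss = np.mean((q_net.predict(X_batch) - y_batch) ** 2)
print("batch MSE:", loss)
```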

Building the model:

In the next step, we build a model that will learn a Q-function for the game.

Exploration:

This is the final piece of Q-learning: the agent sometimes chooses a random action to explore, which will not necessarily be the best one.
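
A minimal epsilon-greedy sketch of that choice, reusing the Q-table from the tabular sketch above (the exploration rate epsilon is an assumed hyperparameter):

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # the Q-table from the tabular sketch above
actions = [-1, 1]
epsilon = 0.1           # assumed exploration rate

def choose_action(state):
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: random action
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: best known action
```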

References:

What is deep reinforcement learning, and how does it work?

Welcome to Deep Reinforcement Learning

Deep reinforcement learning: where to start


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

14 Apr 2018

Data Science Digest

Data Science Digest

Data Science is an amalgamation of many other fields like mathematics, technology & domain; it has its own concepts, process & tools. It’s really tough to know each and everything related to the subject unless you have worked on complex data science problems in industry for a couple of years.

In this post, I have tried to aggregate & organize all the data-science-related topics from Quora (generic definitions), Medium (in-depth working) & GitHub (code). This post is organized into the following sections:

  1. Introduction
  2. Prerequisites
  3. Concepts
  4. Algorithms
  5. Process
  6. Tools

Data Science Introduction

In this section, you can get introduced to the data science world. What is data science? Why is it important? What is the difference between Artificial Intelligence, Data Science, Machine Learning & Deep Learning?


Data Science Prerequisites

Before diving deep into data science, one needs to cover a lot of ground, like a decent understanding of linear algebra, statistics, probability & data engineering.


Data Science Concepts

In this section, you can learn data science concepts like the types of learning and when to use which kind of learning algorithm.


Data Science Algorithms

This section covers the most commonly used data science algorithms in detail: which kinds of problems these algorithms solve & what the pros & cons of using them are.


Data Science Process

In this section, you will get to know data science as a process: once you have a problem, what approach will you take? How will you collect & clean the data? Which evaluation and tuning techniques will you use to optimize your data science algorithm?


Data Science Tools

This section covers the tools being used in the data science field, like R, Python, SQL or the machine learning platforms provided by Azure & Amazon.


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also feel free to connect on Slideshare.

28 Jan 2018

Data Literacy for Professionals (Series)

Data Literacy for Professionals

This is the pilot post of the blog post series ‘Data Literacy for Professionals’; it covers the context, table of contents & links to related posts topic-wise. I would like to mention that most topics mentioned here are fields of study in themselves (with further sub-fields); what I am trying to do here is give you an overview and an approach for every topic. I encourage you to explore these topics further on your own and build an understanding for yourself.


In the previous series, ‘Becoming Data-Driven’, I talked/posted about building a Data Strategy, exploiting Emerging Technologies, applying Data Governance & building a Data-Driven Culture for a business. While all these aspects are important, I feel building a Data-Driven Culture is the most challenging yet the most rewarding aspect. And to create a Data-Driven Culture, the first & foremost thing is to make every employee, every professional, data literate.

“In an organization, data in the hands of a few data experts can be powerful but data at the fingertips of almost every professional can be truly transformational.”

I have interacted with many professionals who went through my posts, but apart from data professionals/executives, not many were able to fully comprehend what I wanted to convey, because of their different backgrounds. While I am talking & posting about something that has started to impact our daily lives & businesses, not everyone is yet data literate enough to understand it, so there is clearly a data literacy gap.

Hence these posts, where I want to explain everything about data from scratch. What is data? How to define data from different viewpoints? How to do basic data analysis using Excel & SQL? What, why & how of Data Models, Data Architecture, Data Science? What are the tools in Data Technology & what to use when? How to apply Data Governance & build a Data Strategy? And finally, how does every aspect mentioned above fit together in the business & technology ecosystem?

I plan to organize my blog posts under the following titles:

In the next post we will discuss ‘Data Basics’; please stay tuned for upcoming posts in this blog post series.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

22 Dec 2017

Becoming Data-Driven (Series)

This is the pilot post of the blog post series ‘Becoming Data-Driven’; it covers the table of contents & links to related posts topic-wise.

I would like to mention that each topic mentioned here is a field of study in itself (with further sub-fields); what I am trying to do here is give you an overview and an approach for every topic related to data-driven business. I encourage you to explore these topics further on your own and build an understanding for yourself.


There has not been a more exciting time than this to talk about data. Data is everywhere; it is being called the new oil; it has become a strategic asset. An organization’s success or failure now depends to a great extent on whether it is able to exploit the business value of the data available to it.

In this blog post series I am going to cover an approach for an organization or business to become data-driven. Why is a data strategy important? How can we align it to the business strategy? Which areas should we focus on to become data-driven? Which emerging technologies are available to enable us in our data-driven journey? When to use which technology? Why is data science just a part of the whole puzzle? Why is data governance so important? How does it enhance the business value of data? Which aspects of data governance are critical? Why is data literacy important for business? How can it help build a data-driven culture? I will explore all these areas and provide answers to these questions.

I plan to organize my blog posts under the following titles:

  1. Data-Driven, What & Why?
  2. Building Data Strategy
  3. Exploiting Emerging Technologies
  4. Applying Data Governance
  5. Building Data-Driven Culture

In the next post we will discuss ‘Data-Driven, What & Why?’; please stay tuned for upcoming posts in this blog post series.


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?

14 Aug 2017

Data Science Framework

A lot of material is available on ‘how to learn machine learning (ML)/data science (DS)’, but when we work on an actual ML/DS project, we realize that the core aspects we learnt (modeling & evaluation) are actually just a small part of the overall solution. When working as a data scientist, nobody tells us what ML/DS problem we need to solve or what prediction we need to make; we need to understand the business process first, identify the problem and qualify it as suitable for an ML/DS solution.

Then we need to collect the underlying data being used by the business and assess whether it is enough & useful to convert this business problem into an ML/DS problem. Further, we explore the data, prepare it to be consumed by prediction algorithms/models, & evaluate model performance before deploying the model in production. In between, we also need to identify a suitable evaluation methodology & agree on monitoring & support activities with the business.

In this article, I will cover these aspects to give you a holistic view of a Data Science Framework built on the CRISP-DM methodology:

  • Business understanding
  • Data understanding
  • Data preparation
  • Modeling
  • Evaluation
  • Deployment

Three of these activities are performed in an iterative manner to reach the most optimized & generalized model, avoiding under-fitting or over-fitting:

Data preparation <-> Modeling <-> Evaluation

Business understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data science problem definition and a preliminary plan designed to achieve the objectives.

-Define the business problem

-Set the criteria for success

-Convert business problem to DS problem

-Categorize the DS problem (Classification/Regression/Anomaly Detection etc)

-Prepare a high-level plan to achieve results

-Visualize the DS pipeline in the context of the objective (Evaluation Criteria/Algorithms/Transformations)

Data understanding

The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, and/or detect interesting subsets to form hypotheses regarding hidden information.

-Collect & integrate initial data

-Understand the attributes & their relationships

-Identify data quality issues

-Perform EDA (Exploratory Data Analysis)
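
As a hedged sketch of a first EDA pass in pandas (the data.csv file and its columns are hypothetical):

```python
import pandas as pd

df = pd.read_csv("data.csv")       # hypothetical initial data

print(df.shape)                    # size of the data
print(df.dtypes)                   # attribute types
print(df.isna().sum())             # data quality: missing values per column
print(df.describe())               # summary statistics per attribute
print(df.corr(numeric_only=True))  # relationships between numeric attributes
```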

Data preparation

The data preparation phase covers all activities needed to construct the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools.

-Integrate data sources

-Handle missing values, outliers

-Apply Feature Engineering
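
A hedged pandas sketch of these steps; the columns, imputation choices and clipping thresholds are illustrative assumptions, not prescriptions:

```python
import pandas as pd

# Hypothetical raw dataset with a numeric and a categorical column
df = pd.DataFrame({"age": [25, None, 41, 230], "city": ["NY", "SF", None, "NY"]})

df["age"] = df["age"].fillna(df["age"].median())  # impute missing numeric values
low, high = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(low, high)             # cap outliers at the tails
df["city"] = df["city"].fillna("unknown")         # impute missing categories
df = pd.get_dummies(df, columns=["city"])         # simple feature engineering
```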

Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, going back to the data preparation phase is often necessary.

-Apply multiple models

-Choose most optimal model

-Create a feedback pipeline

-Ensemble/Stack different models
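
A minimal scikit-learn sketch of ‘apply multiple models, choose the most optimal’; the two candidate models and the stand-in dataset are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Score each candidate with cross-validation and keep the best
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```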

Evaluation

Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data science results should be reached.

-Refine evaluation criteria

-Evaluate the models

-Handle Overfitting/Underfitting
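
One simple, hedged way to spot over-fitting or under-fitting is to compare train and validation scores; the model and dataset below are stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
train_score, val_score = model.score(X_train, y_train), model.score(X_val, y_val)
# A large train/validation gap suggests over-fitting; two low scores, under-fitting
print(f"train={train_score:.3f} val={val_score:.3f} gap={train_score - val_score:.3f}")
```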

Deployment

Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable DS process across the enterprise. In many cases, it is the customer, not the data analyst, who carries out the deployment steps. However, even if the analyst will carry out the deployment effort, it is important for the customer to understand up front what actions need to be carried out in order to actually make use of the created models.

-Prepare a detailed deployment plan

-Agree on post-deployment monitoring & support

-Monitor & support

Reference

CRISP-DM Guide

Data science framework overview

EDISON Data Science Framework to define the Data Science Profession


Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also feel free to connect on Slideshare.

Originally published at https://www.linkedin.com/today/posts/ankitrathi on August 14, 2017.

22 Feb 2017

Data Lakes in Modern Data Architecture

Data Lake (DL): a storage repository that holds a vast amount of raw data in its native format until it is needed

DWH Vs DL

Data Lakes will be central to the modern data architecture because of these features:

  • Agility: ability to convert data >> information >> action
  • Insight: ability to give business insights
  • Scalability: ability to accommodate data growth

All data is welcome:

  • Stores all types of data: structured, semi-structured & unstructured
  • Stores raw data in its original form for an extended period of time
  • Uses various tools to correlate, enrich & query the data for insights
  • Provides democratized access via a single unified view across the Enterprise

Traditional Data Architecture

Sources >> ETL >> EDW >> Data Discovery/Analytics/BI

Modern Data Architecture

Streaming/Unstructured/Various Sources >> Data Lake (Derived/Discovery Sandbox) >> EDW >> Data Science/Data Discovery/Analytics/BI

Data Lake Challenges & Complications

In Building:

  • Rate of change in data sources
  • Skill gap in the industry
  • Complexity involved in accommodating different data sources

In Managing:

  • Ingestion of different data sources
  • Lack of visibility for future requirements
  • Privacy & compliance-related concerns

In Delivering:

  • Quality issues with data
  • Reliance on IT
  • Reusability of data

Approach for Data lakes

Enable the Data Lake

  • Ingest the data
  • Organize the data
  • Register in Catalog

Govern the data in the lake

  • Cleanse the data
  • Secure the data
  • Operationalize the data

Engage with business

  • Discover the data
  • Enrich the platform
  • Provision the data sources

Data Lake Reference Architecture

Data Lake Management Platform

  • Unified Data Management
  • Managed Ingestion
  • Data Reliability
  • Data Visibility
  • Data Privacy & Security

Getting Started

References

Building a Modern Data Architecture

DWH Vs DL


Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don’t you connect with Ankit on YouTube, Twitter, LinkedIn or Instagram?