Sunday, July 7, 2019

Machine Learning Interview Questions and Answers

1. What are the different variants of Machine Learning algorithms?

Well, quite a simple and expected question on this subject. 

Machine Learning is a set of algorithms that let machines learn from data, predict trends and make decisions that are used in many walks of life. 

Machine Learning models can be divided into three types based on the way they learn.

i.) Supervised Learning - In supervised learning algorithms, labelled data (also called tagged data) is fed to the algorithm. Each input comes with its expected output, so the algorithm is taught what the output should be. 

Here, humans supply the labels that tell the algorithm what each piece of data means, and the algorithm is corrected against those labels again and again. 

These algorithms are useful when all the data is labelled. 

Supervised Learning problems can be further categorized into:

i.) Classification problems - when the output variable is a category

ii.) Regression problems - when the output variable is a real value. 

Examples of such models are: Logistic Regression, Random Forest, Nearest Neighbors, Support Vector Machines etc. 

ii.) Unsupervised Learning - In unsupervised learning algorithms, the inputs are fed to the algorithm without telling it the expected output. The algorithm has to find structure in the data on its own.

Here, the algorithm observes the patterns and structures in the given data and groups or relates the data points without any guidance. 

These algorithms are useful in cases when the data is not tagged or labelled or divided into categories. 

Unsupervised Learning Problems can be categorized into: 

i.) Clustering

ii.) Association

Some examples of such models are: k-means, GMM, PCA etc. (k-means and GMM cluster the data, while PCA reduces its dimensionality.) 
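
To make the idea of clustering concrete, here is a minimal sketch of the k-means loop on one-dimensional data. The function name kmeans_1d, the starting centers and the toy values are assumptions made up for this illustration. A real project would normally use a library implementation.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to the
    mean of its cluster. Repeat for a fixed number of iterations."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]  # note: no labels are given
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)   # [1.5, 10.0]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

The algorithm is never told which group a value belongs to. The two groups come purely from the distances between the values.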

When only some of the data is labelled while the major portion of it is unlabelled, the algorithms used are called semi-supervised. 

iii.) Reinforcement Learning - In reinforcement learning, the algorithm (called the agent) learns continuously from its environment in an iterative process, trying out actions and exploring the range of possibilities as it goes. 

The agent just needs reward feedback, which is referred to as the Reinforcement Signal, to learn and fine-tune its behavior.
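
As a rough illustration of learning from a reward signal, the sketch below uses an epsilon-greedy strategy on a two-armed bandit. This is only one simple reinforcement learning setting, chosen here as an example. The function name, the reward probabilities and the parameter values are all assumptions.

```python
import random


def epsilon_greedy(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn the average reward of each action from reward feedback alone."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # current reward estimate per action
    pulls = [0] * len(reward_probs)        # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(reward_probs))  # explore a random action
        else:
            arm = max(range(len(estimates)), key=estimates.__getitem__)  # exploit
        # The environment answers with a reward of 1 or 0.
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental mean: move the estimate towards the observed reward.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls


estimates, pulls = epsilon_greedy([0.2, 0.8])
print(estimates, pulls)
```

The agent is never told which action is better. It only sees rewards, and its estimates drift towards the true reward rates as it keeps trying.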

2. What are the qualities of good Machine Learning code?

A machine learning program is expected to work with a large amount of data, time and again, even in complex situations. 

For this, it needs the following important qualities: 

i.) Scalability - Good machine learning code can be scaled up as the data and the complexity grow with time. 

ii.) No manual checks should be required - Since machines are expected to run this code, it should not require any human intervention to check which functions and parameters were run together. 

iii.) Data should get auto saved - The data, both used and generated, should get saved automatically at the right place. Humans should not be required to keep a record of that. 

iv.) Easy to understand for others - Other members of your team may need to work on your code to scale it up, fix it or optimize it. It should be easy for them to read and work on it.

3. What is the difference between Train Data and Test Data?

When you build a supervised learning model, you use separate Train Data and Test Data sets. 

The Training Dataset contains both the input and the expected output. It is used to train the algorithm. 

The Testing Dataset is kept aside during training. Only its inputs are shown to the trained model, and the predictions are then compared with the expected outputs to examine how well the algorithm was trained. 

During the process you have to be careful that your Test Data doesn't leak into the Train Data. Otherwise, the algorithm might perform very well during training and testing and still fail miserably in real life situations.
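
A minimal sketch of such a split in plain Python is shown below. The function name train_test_split, the 25% test fraction and the toy data are assumptions for illustration only.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into disjoint train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: (input, expected output) pairs.
data = [(x, 2 * x + 1) for x in range(20)]
train, test = train_test_split(data)

# No test row may also appear in the training set.
assert not set(train) & set(test)
print(len(train), len(test))  # 15 5
```

Because the two lists are cut from one shuffled copy, a row can never land on both sides of the split.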

4. What are the most popular Regression algorithms used in Machine Learning?

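Some of the most popular Regression algorithms used in Machine Learning are: 

i.) Linear Regression

ii.) Support Vector Machines (in the form of Support Vector Regression)

iii.) Decision Trees (in the form of Regression Trees)

A few names that often get mentioned here are not regression algorithms in the sense of predicting a real value: Logistic Regression is used for classification despite its name, Naïve Bayes is a classifier, and Clustering is an unsupervised technique.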

5. What are the advantages and disadvantages of K-Nearest Neighbors Algorithm?

While it is important to know the technical details and working of each algorithm, it is just as important to know the advantages and limitations of each model, so that you can decide whether it suits a particular case. 

Talking about the KNN algorithm, its main advantages are: 

i.) It is very easy to understand and implement. Works well with basic recognition problems. 

ii.) It is non-parametric, so unlike parametric models it doesn't require the data to meet any assumptions about its distribution. 

iii.) Since it tags new data simply from the stored historical data and the labels of the nearest neighbors, it needs almost no training time. 

iv.) It keeps evolving as new data enters the system, because new points simply become additional neighbors. 

v.) It works well with both Regression and Classification problems.

Some of the disadvantages of this model are: 

i.) Declining Speed - As the data grows, prediction gets slower, because every new entry has to be compared with the stored data. 

ii.) Not effective with a large number of input variables. 

iii.) Choosing the optimal number of neighbors while trying to classify a new entry is another problem. 

iv.) If your data is skewed towards a particular class, there's a high possibility of a new entry getting classified wrongly by the KNN algorithm.

v.) The outliers may also affect the performance of the model as the classification is based on the distance. 

vi.) The model doesn't learn anything from the training data; it just uses the training data to classify new data in actual situations. 

vii.) Changing the value of K can change the predicted class variable. 

So, these are some of the advantages and disadvantages of KNN. Prepare yourself to answer this type of question for other models as well.
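
To see several of these points in action, here is a minimal KNN sketch in plain Python. It uses Euclidean distance and a simple majority vote. The function names and the toy data are assumptions made up for this example.

```python
from collections import Counter
import math


def nearest(train, query, k):
    """Return the k training rows closest to the query (Euclidean distance)."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]


def knn_classify(train, query, k):
    """Classification: majority vote over the labels of the k nearest rows."""
    votes = Counter(label for _, label in nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k):
    """Regression: mean of the target values of the k nearest rows."""
    neighbors = nearest(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)


# Hypothetical toy data: ((feature_1, feature_2), label), skewed towards "B".
points = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.0), "B"),
          ((3.2, 2.9), "B"), ((2.9, 3.3), "B")]
query = (1.5, 1.5)
print(knn_classify(points, query, k=1))  # A
print(knn_classify(points, query, k=5))  # B

# Hypothetical regression data: ((feature,), target value)
values = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0)]
print(knn_regress(values, (2.1,), k=2))  # 25.0
```

There is no training step at all. The query is simply compared with every stored row, which is why prediction slows down as the data grows. With k=5 the larger class "B" outvotes the two closest "A" points, which shows how both the choice of K and a skewed class can flip the prediction.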

6. What are the important stages in Machine Learning Life Cycle?

The process of Machine Learning goes through a set of stages in its life cycle. They include: 

i.) Gathering Data

ii.) Preparing Data - i.e. making it usable for our machine learning algorithm. Cleaning and normalization of the data are a part of this step, and the data is also divided into training data and test data here. 

iii.) Choosing the right model - Depending on your data type. 

iv.) Training the model - This stage consumes the maximum time, because you need to keep readjusting the model until you are happy with its performance and it makes correct predictions.

v.) Testing the model - At this stage you test it with the "Test Data Set" that you kept aside at the time of splitting the data initially. This data is different from what you used for training the model.

vi.) Tuning the parameters - This is done to further improve the performance of the model. Here, it is also very important to make sure you don't end up over-tuning the model or its parameters, because that wastes a lot of time and can lead to inaccurate predictions. 

vii.) Making predictions - This is the outcome of all the hard work you put in to make your machines learn to predict in the real world.

7. Difference between Classification and Regression in Machine Learning.

To be able to solve a prediction problem correctly, it is very important to clearly understand if the problem is that of Classification or Regression.

The biggest difference between the two is - 

The output variable in case of Regression is numerical (or continuous) while in case of Classification, it is categorical (or discrete). 

So, Regression is the task of predicting a continuous quantity while Classification is the task of predicting a discrete class label.

8. The data files used for Machine Learning Algorithms can be very large at times. How would you handle them so that the algorithm doesn't crash or run out of memory?

The datasets in Machine Learning can get really large and may lead to problems like the algorithm crashing or the system running out of memory. 

Following are some ways to deal with large data files: 

i.) Increase the memory of your computer - This is one of the easiest ways to deal with the problem. If your requirement is heavier, you can even consider renting computer time on cloud services. 

ii.) Re-configure your tools and libraries - Check the tool or the library you are using and try to re-configure it to allocate more memory. Some of them limit memory by default. 

iii.) Decrease the dataset size - to something that you really require. If the system doesn't need that much data, why use it?

iv.) Use a memory-saving data format - Try converting your dataset to a format that loads faster, uses less memory, or doesn't need the complete dataset to be loaded into memory at one time. Progressive loading can save a tremendous amount of memory. 

v.) Use an RDBMS - This will need you to use algorithms that can work with these databases. 

vi.) Use Big Data platforms if the dataset is really huge and nothing else gives you a good performance.
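
Progressive loading, mentioned in point iv.), can be sketched with Python's standard csv module, which reads one row at a time instead of loading the whole file. The function name running_mean and the column names are assumptions, and an in-memory string stands in for a large file here.

```python
import csv
import io


def running_mean(lines, column):
    """Stream CSV rows one at a time and return the mean of one column."""
    total, count = 0.0, 0
    for row in csv.DictReader(lines):
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# io.StringIO stands in for a large file. With a real file you would pass
# the object returned by open(path, newline="") instead.
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n")
print(running_mean(sample, "value"))  # 20.0
```

Only one row is held in memory at any moment, so memory use stays flat however long the file is.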

9. What is Classification Accuracy?

Classification Accuracy is the ratio of correct predictions made by your model to the total predictions. Usually it is presented as a percentage. 

The complement of the classification accuracy rate is the error rate, i.e. error rate = 1 - accuracy. 

The main limitation of Classification Accuracy is that at times it doesn't give you a clear picture of the performance of your model, especially when your data contains 3 or more classes or when the classes do not have an even number of observations. 

In these cases, you cannot tell whether the model is working equally well on all the classes or ignoring some particular classes. So, while your accuracy percentage may be high, you still cannot be sure about the performance of your model.
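
The small sketch below shows this limitation on a made-up, heavily imbalanced dataset. The class names, the counts and the helper recall_per_class are assumptions for illustration. The "model" here simply predicts the majority class every time.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to total predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each class, the share of its observations that were predicted correctly."""
    result = {}
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in indices)
        result[cls] = hits / len(indices)
    return result


y_true = ["ok"] * 95 + ["fraud"] * 5
y_pred = ["ok"] * 100  # a model that ignores the "fraud" class entirely

print(accuracy(y_true, y_pred))          # 0.95
print(recall_per_class(y_true, y_pred))  # {'fraud': 0.0, 'ok': 1.0}
```

An accuracy of 95% looks good, yet the per-class view shows that not a single "fraud" observation was caught.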

10. What is Data Leakage? How would you prevent it?

Data Leakage is when information from outside the training data is used while the model is being created. This usually happens when validation or test data leaks into the training data. 

Following are certain things you can do to prevent Data Leakage: 

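i.) Split your dataset into train, validation and test data and keep everything other than your train data away. Use the held-out data deliberately, only once you are fully done with training the model. 

ii.) Avoid over-preparing the data. Preparation steps such as normalization that are computed on the whole dataset let the held-out data influence training, and this may lead to over-fitting. 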

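iii.) Use a temporal cutoff - Remove the data recorded just prior to the event of your interest (and anything after it), so that the model only sees what would really have been known at prediction time. 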

iv.) If you suspect some variables to be leaking into the model, consider removing them.
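
One common way for test data to leak is through preprocessing statistics. The sketch below standardizes the data with the mean and standard deviation learned from the training split only and reuses them for the test split. The helper names fit_scaler and transform and the numbers are assumptions for this example.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and standard deviation from the given values."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [100.0]  # an extreme value that exists only in the test split

mean, std = fit_scaler(train)              # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)   # test reuses the train statistics

print(mean)  # 3.0
```

Had the scaler been fitted on train and test together, the extreme test value would have shifted the mean and the spread used for training, which is exactly the kind of leak to avoid.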

11. What do you know about Bagging?

Bagging or Bootstrap Aggregation is an ensemble method. 

Ensemble methods combine the predictions of several machine learning models to make predictions which are expected to be more accurate than those of any single model. 

Bagging is used to reduce the variance of high variance algorithms like Decision Trees and can be used for both classification and regression problems.
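
In bagging, each model is trained on a bootstrap sample (rows drawn from the training data with replacement) and the predictions are averaged for regression or put to a vote for classification. Below is a minimal regression sketch. A 1-nearest-neighbor predictor is used as the high variance base model only to keep the code short, where the text above mentions Decision Trees. The function names and the noisy toy data are assumptions.

```python
import random


def one_nn_predict(sample, x):
    """A deliberately high variance base model: copy the nearest row's target."""
    return min(sample, key=lambda row: abs(row[0] - x))[1]


def bagged_predict(data, x, n_models=50, seed=0):
    """Average the base model's predictions over bootstrap resamples of the data."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        bootstrap = rng.choices(data, k=len(data))  # sample with replacement
        predictions.append(one_nn_predict(bootstrap, x))
    return sum(predictions) / n_models


# Hypothetical noisy data scattered around the line y = x.
noise = random.Random(42)
data = [(i / 10, i / 10 + noise.gauss(0, 0.5)) for i in range(100)]

single = one_nn_predict(data, 5.0)   # follows the noise of one single row
bagged = bagged_predict(data, 5.0)   # averages over many resampled models
print(single, bagged)
```

The single model repeats whatever noise its nearest row carries, while the bagged prediction is an average over many resampled models and so tends to be less sensitive to any one noisy row.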
