Sunday, July 7, 2019

Data Science Interview Questions and Answers

1. What are the primary skills that a Data Scientist must possess?

This is a simple question to begin the interview with, but your answer can give the interviewer deep insight into your understanding of the role.

We have divided this answer into three segments based on the seniority of the role. See which one suits you and use it to answer this question in your interview.

The primary skills that the Data Scientists at an entry level are expected to possess are: 

i.) Good knowledge of Statistics and Mathematics

ii.) Ability to think logically coupled with an analytical approach to things

iii.) A good understanding of Data Science models with an ability to rework the existing models

iv.) And, of course, an ability to code

However, Data Scientists are not expected to be high-level coders like Software Developers.

If you are a mid-level Data Scientist, in addition to all the above skills, you should: 

i.) Be able to identify the flaws in a model and put it into production.

ii.) Know the details of the products you have worked on, though you are not expected to know the complete architecture.

iii.) Understand the problems that the business is facing, even if you do not yet have solutions.

iv.) Possess good communication skills and be able to work as a bridge between higher management and junior Data Scientists.

The Data Scientists at the next level are the actual brains behind every project. The most important skills they bring with them are:

i.) Business Acumen - They clearly understand the business problems and try to solve them

ii.) Ability to create high-end Data Science projects

iii.) Ability to lead a cross-functional team

iv.) Ideas about new products which Product managers adapt as per the market requirements

v.) Good communication and people skills.

2. What tips would you give to a person to excel in a Data Science career?

Read this question as: "What tips would YOU practice to succeed in this profession?"

So, some of the things you can talk about to answer this question effectively are: 

i.) Strong foundation - Build a strong foundation in statistics and mathematics.

ii.) Quality of Data - Ensure that the quality of the data you are using is good. The size of the sample and the authenticity of the data are both very important.

iii.) Be doubtful of your own assumptions - Question them even if you believe them to be correct. This becomes all the more important when you are dealing with human behavior and issues that cannot be predicted with a high level of certainty. Being overly confident can lead to failures.

iv.) Acknowledge if you are biased - It is possible for human beings to fall victim to their own biases. If the issue is close to your heart, you may not be able to make unbiased assumptions. Beware of this!

v.) Don't let your curiosity die down - Data Scientists are people who think from various angles and always ask "what if". Ensure that this curiosity always stays alive.

vi.) Know the purpose of data - Before you start working on a dataset, ask what it will be used for. This helps you take the right approach: if your understanding and the actual purpose don't match, the data you collect will not be useful and the effort will be wasted.

vii.) Dedicate time to master some tools - You won't be able to master all of them, but you can definitely master some.

viii.) Learn new things - This is an evolving field. Those who keep themselves updated will rise faster than others.

ix.) Practice - Just reading about things or seeing your colleagues do them won't help. Get your hands dirty. Practice them.

3. What are the various types of analytics used in Data Science?

Data Scientists work to derive human-understandable meaning from data.

The four major types of analytics they carry out are: 

i.) Descriptive Analytics - As the name suggests, Descriptive Analytics describes in layman's language what the raw data says about an event. It helps in identifying patterns that can inform future decisions.

ii.) Diagnostic Analytics - Here the Data Scientists dig deeper to find the source of the problem.

iii.) Predictive Analytics - Predictive Analytics models use various related factors or variables to estimate the probability or timing of a future event or trend. This helps businesses gear themselves up for the future.

iv.) Prescriptive Analytics - As the name suggests, this type of analytics prescribes the actions that can be taken in the future to get the desired results.

4. What are Predictor variables? Would you have many or just a few of them in a model? Why?

Predictor variables are also referred to as independent variables or x-variables. 

In a model, you try to see how the change in a predictor variable affects the outcome. 

We would prefer to have only a few relevant predictor variables in a model because: 

i. Having too many predictor variables might mean that some of them have a similar effect on the model. This introduces redundancy into the model.

ii. It is also possible that not all the predictor variables are relevant to the model, which makes it less effective and slower to run.

iii. Having too many predictor variables in the model may increase its complexity and ultimately hurt its performance in real-world scenarios.

So, to get a good model, it is advisable to select a limited number of the most relevant predictor variables.
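
One simple way to spot the redundancy described in the first point is to look at pairwise correlations between predictors. The sketch below is only an illustration under stated assumptions: the helper names `pearson` and `redundant_pairs`, the 0.9 threshold, and the sample data are all hypothetical, and a high correlation is a hint to investigate rather than proof that a variable should be dropped.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(predictors, threshold=0.9):
    """Return pairs of predictor names whose |correlation| exceeds threshold."""
    return [
        (a, b)
        for a, b in combinations(predictors, 2)
        if abs(pearson(predictors[a], predictors[b])) > threshold
    ]


# Made-up data: height_cm and height_in carry the same information.
predictors = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34.0, 21.0, 45.0, 29.0, 38.0],
}
print(redundant_pairs(predictors))  # [('height_cm', 'height_in')]
```

Here only the two height columns are flagged, so one of them could be left out of the model without losing information.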

5. Explain False Positive and False Negative.

i. False Positive - When a test wrongly indicates the presence of a condition that is actually absent, it is called a False Positive.

For example, if a medical test indicates the presence of a medical condition that the patient does not actually have, it is a false positive. In such a case, the patient may unnecessarily take medicine or go through treatment that may cause further harm.

ii. False Negative - When a test wrongly indicates the absence of a condition that is actually present, it is called a False Negative.

For example, if a medical test says that a person doesn't have a medical condition or disease when the person actually has it, it is a False Negative.

This situation is bad because the patient will either go without the required treatment or have to take further tests, which cost more money.
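
The two kinds of error can be counted directly once you have the true condition and the test result for each case. The following is a minimal sketch; the function name `confusion_counts` and the eight patient records are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (True = condition present)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1
        elif guess and not truth:
            counts["FP"] += 1  # test says present, condition is absent
        elif not guess and truth:
            counts["FN"] += 1  # test says absent, condition is present
        else:
            counts["TN"] += 1
    return counts


# Made-up results for eight patients.
actual = [True, True, False, False, True, False, False, True]
predicted = [True, False, False, True, True, False, True, True]

counts = confusion_counts(actual, predicted)
false_positive_rate = counts["FP"] / (counts["FP"] + counts["TN"])
false_negative_rate = counts["FN"] / (counts["FN"] + counts["TP"])

print(counts)               # {'TP': 3, 'FP': 2, 'TN': 2, 'FN': 1}
print(false_positive_rate)  # 0.5
print(false_negative_rate)  # 0.25
```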

6. What is the difference between Linear and Non-linear regression models?

A lot of students believe that linear equations are the ones that produce a straight line when plotted on a chart, while non-linear equations produce curves. But the difference between the two is not that simple.

The terms in a linear regression model will fall into one of the following categories: 

i. The constant

ii. A parameter multiplied by an independent variable

The equation would be: 

Y = a + b*X + c*X1

The function must be linear in its parameters, while the independent variables may be squared to form a curve. The model will still be linear.

So, Y = a + b*X + c*X1^2

is linear. 

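Log or inverse terms of the independent variables can change the shape of the curve, but the model remains linear because it is still linear in its parameters.

In a non-linear model, at least one parameter appears in some other way, for example as an exponent or inside a function, so the equation is no longer just a sum of parameters multiplied by terms. Any model that cannot be written in the linear form above is non-linear.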
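
A quick numeric way to see what "linear in parameters" means is to check superposition: for a linear model, adding two parameter sets and then predicting gives the same result as predicting with each set and adding the outputs. The sketch below is an illustration only. The exponential model Y = a * exp(b*X) is an assumed example of a non-linear model that does not appear in the text above, and a check at a single point is a sanity test, not a proof.

```python
from math import exp, isclose


def quadratic_model(params, x):
    """Y = a + b*X + c*X^2: curved in X, but linear in (a, b, c)."""
    a, b, c = params
    return a + b * x + c * x ** 2


def exponential_model(params, x):
    """Y = a * exp(b*X): b sits inside exp, so the model is not linear in b."""
    a, b = params
    return a * exp(b * x)


def is_linear_in_params(model, p1, p2, x):
    """Check superposition at x: model(p1 + p2) == model(p1) + model(p2)."""
    summed = [u + v for u, v in zip(p1, p2)]
    return isclose(model(summed, x), model(p1, x) + model(p2, x))


print(is_linear_in_params(quadratic_model, (1.0, 2.0, 3.0), (0.5, -1.0, 2.0), x=4.0))  # True
print(is_linear_in_params(exponential_model, (1.0, 0.5), (2.0, 0.3), x=4.0))           # False
```

The quadratic model passes even though its plot is a curve, which is exactly why it still counts as a linear regression model.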

7. What is Regression Analysis? What are its major types?

Regression analysis is a predictive modelling technique that tries to find the relationship between dependent and independent variables, referred to as "target" and "predictor" respectively.

The technique is mainly used for forecasting and for estimating the relationships between variables.

Regression analysis is divided into various types depending upon the following factors: 

i.) Number of predictors, i.e. independent variables

ii.) Shape of Regression Line

iii.) Type of dependent variable

The various types of Regression Analysis are: 

i. Linear Regression - Linear regression models a linear relationship between the dependent variable and one or more independent variables. It is the most widely used form of regression. There are two types of linear regression - Simple linear regression (when there is only one predictor) and Multiple linear regression (when there are many predictors).

ii. Logistic Regression - When the dependent variable has a binary value, Logistic regression is used. Beyond the binary case, there are two extensions - Ordinal Logistic Regression (for ordered categories) and Multinomial Logistic Regression (for unordered categories).

iii. Polynomial Regression

iv. Stepwise Regression

v. Ridge Regression

vi. Lasso Regression

vii. ElasticNet Regression

8. What are outliers? What are the various ways to detect them?

Outliers are also referred to as anomalies. These are the values in the dataset that lie far away from most other values.

It is important to deal with them because when you are creating Machine Learning models, you have to either explain why they occur or get rid of them so that they don't disturb the model.

The following are the most commonly used methods to detect them:

i.) Standard Deviation - Here, a value is usually treated as an outlier if it lies more than three standard deviations away from the mean.

ii.) Boxplots - Here the data is summarized in a box-and-whisker plot. The upper and lower whiskers mark the boundaries of the typical data, commonly placed 1.5 times the interquartile range beyond the quartiles. Any values that lie beyond these whiskers are anomalies.

iii.) DBSCAN Clustering - This method groups the data into clusters made up of core points and border points. Two points are treated as neighbours, and so can belong to the same cluster, only if the distance between them is at most "eps". Points that are neither core points nor border points are called noise points and are treated as anomalies.

The biggest challenge with this method is choosing the right value of "eps".

iv.) Isolation Forest - This method works differently from the other methods. It assumes that anomalies are few in number and have attribute values that differ from normal values, which makes them easier to isolate, and it assigns an anomaly score to each data point on that basis. This method works well with large datasets.

v.) Random Cut Forest - This method also works by associating a score with each data point. A low score means that the data point is normal, while a high score tags it as an anomaly. This method works with both online and offline data and can handle high-dimensional data.
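
The first two methods are simple enough to sketch with Python's standard library. The code below is a rough illustration, not a production detector: the function names, the sample data, and the cut-offs (three standard deviations, and 1.5 times the interquartile range as the usual boxplot whisker rule) are assumptions chosen for the example. Note that on very small samples the standard deviation rule may flag nothing, because a single extreme value inflates the standard deviation itself.

```python
from statistics import mean, quantiles, stdev


def zscore_outliers(values, threshold=3.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


def iqr_outliers(values, k=1.5):
    """Values beyond the boxplot whiskers: more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up measurements: twenty ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 12, 10, 13, 11, 95]

print(zscore_outliers(readings))  # [95]
print(iqr_outliers(readings))     # [95]
```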

9. What are some common statistical problems that you would always stay attentive to as a Data Scientist?

Some of the statistical issues that I would stay attentive to as a Data Scientist are:

i.) Ensure that the dataset is of high quality with no missing or redundant values. 

ii.) Understand the objective function clearly so that you can build a good model that meets your expectations. 

iii.) Look at the data closely and ask which model would work the best and why. 

iv.) Make sure that the data entering the system at run time matches your assumptions. Different data would give you wrong predictions.

v.) Run your model in an actual out-of-sample environment to ensure that it runs well under all conditions.

vi.) Work with a small set of data to begin with and ensure that your approach is right. A wrong output doesn't always mean a lack of data; it can also point to a flaw in your approach.
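
As a small illustration of the out-of-sample point, the sketch below fits a straight line on part of a synthetic dataset and measures the error on points the model never saw. Everything here is assumed for the example: the data is generated from y = 2x + 1 with random noise, the 40/10 split is arbitrary, and the helper names are hypothetical. A real project would hold out genuinely new data rather than a shuffled slice of the same set.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic, made-up data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

# Hold out 10 of the 50 points; the model never sees them while fitting.
indices = list(range(50))
rng.shuffle(indices)
train, test = indices[:40], indices[40:]

train_x, train_y = [xs[i] for i in train], [ys[i] for i in train]
test_x, test_y = [xs[i] for i in test], [ys[i] for i in test]

slope, intercept = fit_line(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, slope, intercept)
test_mse = mean_squared_error(test_x, test_y, slope, intercept)
print(f"train MSE = {train_mse:.2f}, out-of-sample MSE = {test_mse:.2f}")
```

If the out-of-sample error were much larger than the training error, that would be a sign that the model or the approach needs another look.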
