OVER 100 Data Scientist Interview Questions and Answers!

Category: IT Technology · Published: 4 years ago


Statistics, Probability, and Mathematics

Q: The probability that an item is at location A is 0.6, and 0.8 at location B. What is the probability that the item would be found on the Amazon website?

We need to make some assumptions about this question before we can answer it. Let’s assume that there are two possible places to purchase a particular item on Amazon and that the probability of finding it at location A is 0.6 and at location B is 0.8. The probability of finding the item on Amazon can then be worked out as follows:

We can reword the above as P(A) = 0.6 and P(B) = 0.8. Furthermore, let’s assume that these are independent events, meaning that the probability of one event is not impacted by the other. We can then use the formula:

P(A or B) = P(A) + P(B) - P(A and B)

P(A or B) = 0.6 + 0.8 - (0.6*0.8)

P(A or B) = 0.92
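
The arithmetic above can be double-checked with a short Python sketch. It relies on the same independence assumption as the answer and also cross-checks the result through the complement rule; the variable names are illustrative only.

```python
# Probability that the item is found in at least one of two locations,
# assuming (as the answer does) that the two events are independent.
p_a = 0.6
p_b = 0.8

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = p_a + p_b - p_a * p_b

# Cross-check with the complement: 1 - P(neither A nor B)
p_via_complement = 1 - (1 - p_a) * (1 - p_b)

print(round(p_a_or_b, 2), round(p_via_complement, 2))
```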

Q: You randomly draw a coin from 100 coins (1 unfair coin with heads on both sides and 99 fair coins with heads and tails) and flip it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?

This can be answered using Bayes’ Theorem. The extended equation for Bayes’ Theorem is the following:

P(A|B) = P(B|A) * P(A) / [P(B|A) * P(A) + P(B|¬A) * P(¬A)]

Let A be the event of picking the unfair coin and B the event of flipping 10 heads in a row. Then P(A) is equal to 0.01, P(B|A) is equal to 1, P(B|¬A) is equal to 0.5^10, and P(¬A) is equal to 0.99.

If you fill in the equation, then P(A|B) = 0.9118 or 91.18%.
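
As a rough check, the same calculation can be written out in Python. This is only a sketch of the Bayes' Theorem arithmetic for this question; the variable names are mine.

```python
# Posterior probability that the drawn coin is the two-headed one,
# given that it produced 10 heads in 10 flips.
p_unfair = 1 / 100               # prior: 1 unfair coin out of 100
p_fair = 1 - p_unfair            # prior: 99 fair coins out of 100
p_heads_given_unfair = 1.0       # a two-headed coin always shows heads
p_heads_given_fair = 0.5 ** 10   # ten heads in a row with a fair coin

numerator = p_heads_given_unfair * p_unfair
evidence = numerator + p_heads_given_fair * p_fair
posterior_unfair = numerator / evidence

print(round(posterior_unfair, 4))  # 0.9118
```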

Q: Difference between convex and non-convex cost function; what does it mean when a cost function is non-convex?

Taken from Cho-Jui Hsieh, UCLA

A convex function is one where a line segment drawn between any two points on the graph lies on or above the graph. Any local minimum of a convex function is also a global minimum.

A non-convex function is one where a line segment drawn between two points on the graph may cross or fall below the graph. It is often characterized as “wavy”.

When a cost function is non-convex, an optimization algorithm may converge to a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

Q: Walk through the probability fundamentals

For this, I’m going to look at the eight rules of probability laid out here and the four different counting methods (see more here).

Eight rules of probability

Rule #1: For any event A, 0 ≤ P(A) ≤ 1 ; in other words, the probability of an event can range from 0 to 1.
Rule #2: The sum of the probabilities of all possible outcomes always equals 1.
Rule #3: P(not A) = 1 - P(A) ; This rule explains the relationship between the probability of an event and its complement event. A complement event is one that includes all possible outcomes that aren’t in A.
Rule #4: If A and B are disjoint events (mutually exclusive), then P(A or B) = P(A) + P(B) ; this is called the addition rule for disjoint events
Rule #5: P(A or B) = P(A) + P(B) - P(A and B) ; this is called the general addition rule.
Rule #6: If A and B are two independent events, then P(A and B) = P(A) * P(B) ; this is called the multiplication rule for independent events.
Rule #7: The conditional probability of event B given event A is P(B|A) = P(A and B) / P(A)
Rule #8: For any two events A and B, P(A and B) = P(A) * P(B|A) ; this is called the general multiplication rule

Counting Methods

Factorial Formula: n! = n x (n - 1) x (n - 2) x … x 2 x 1

Use when the number of items is equal to the number of places available.

Eg. Find the total number of ways 5 people can sit in 5 empty seats.

= 5 x 4 x 3 x 2 x 1 = 120

Fundamental Counting Principle (multiplication)

This method should be used when repetitions are allowed and the number of ways to fill an open place is not affected by previous fills. Eg. There are 3 types of breakfasts, 4 types of lunches, and 5 types of desserts. The total number of combinations is 3 x 4 x 5 = 60.

Permutations: P(n,r)= n! / (n−r)!

This method is used when replacements are not allowed and the order of the items matters.
Eg. A code has 4 digits in a particular order and the digits range from 0 to 9. How many permutations are there if each digit can only be used once?

P(n,r) = 10!/(10-4)! = (10x9x8x7x6x5x4x3x2x1)/(6x5x4x3x2x1) = 5040

Combinations Formula: C(n,r)=(n!)/[(n−r)!r!]

This is used when replacements are not allowed and the order in which items are ranked does not matter.

Eg. To win the lottery, you must select the 5 correct numbers in any order from 1 to 52. What is the number of possible combinations?

C(n,r) = 52! / [(52-5)! 5!] = 2,598,960
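
The four counting examples above can be reproduced with Python's standard `math` module (`math.perm` and `math.comb` require Python 3.8 or later). This is just a sketch that re-runs the worked examples; the variable names are illustrative.

```python
import math

# Factorial: 5 people in 5 seats
seatings = math.factorial(5)

# Fundamental counting principle: 3 breakfasts x 4 lunches x 5 desserts
meal_plans = 3 * 4 * 5

# Permutations: 4-digit code from digits 0-9, no digit reused, order matters
codes = math.perm(10, 4)

# Combinations: choose 5 numbers out of 52, order does not matter
lottery_tickets = math.comb(52, 5)

print(seatings, meal_plans, codes, lottery_tickets)
```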

Q: Describe Markov chains?

Brilliant provides a great definition of Markov chains (here):

“A Markov chain is a mathematical system that experiences transitions from one state to another according to certain probabilistic rules. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the possible future states are fixed. In other words, the probability of transitioning to any particular state is dependent solely on the current state and time elapsed.”

The actual math behind Markov chains requires knowledge of linear algebra and matrices, so I’ll leave some links below in case you want to explore this topic further on your own.

See more here or here.
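
To make the definition a little more concrete, here is a minimal sketch of a two-state Markov chain in plain Python. The states ("sunny", "rainy") and the transition probabilities are made up for illustration and do not come from the quoted source. The point is only that the next distribution depends solely on the current one.

```python
# Hypothetical two-state weather chain (illustrative numbers only).
states = ["sunny", "rainy"]
transition = [
    [0.9, 0.1],  # from sunny: P(next sunny), P(next rainy)
    [0.5, 0.5],  # from rainy: P(next sunny), P(next rainy)
]


def step(distribution, matrix):
    """Advance a probability distribution over states by one transition."""
    n = len(matrix)
    return [
        sum(distribution[i] * matrix[i][j] for i in range(n))
        for j in range(n)
    ]


distribution = [1.0, 0.0]  # start in "sunny" with certainty
for _ in range(2):
    distribution = step(distribution, transition)

print(dict(zip(states, distribution)))
```

After two steps the chain started in "sunny" is still most likely sunny, and the probabilities continue to sum to 1.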

Q: A box has 12 red cards and 12 black cards. Another box has 24 red cards and 24 black cards. You want to draw two cards at random from one of the two boxes, one card at a time. Which box has a higher probability of getting cards of the same color and why?

The box with 24 red cards and 24 black cards has a higher probability of getting two cards of the same color. Let’s walk through each step.

Let’s say the first card you draw from each box is red. (The same reasoning applies if the first card is black.)

This means that in the box with 12 reds and 12 blacks, there are now 11 reds and 12 blacks. Therefore the probability of drawing another red is equal to 11/(11+12) or 11/23.

In the box with 24 reds and 24 blacks, there would then be 23 reds and 24 blacks. Therefore the probability of drawing another red is equal to 23/(23+24) or 23/47.

Since 23/47 > 11/23, the second box with more cards has a higher probability of yielding two cards of the same color.
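
The comparison can be verified with exact fractions. The helper below is a sketch (the function name is mine) that computes the chance that the second card matches the color of the first when drawing without replacement.

```python
from fractions import Fraction


def p_second_matches(cards_per_color):
    """P(second card has the same color as the first) for a box holding
    `cards_per_color` red and `cards_per_color` black cards."""
    remaining_same_color = cards_per_color - 1
    remaining_total = 2 * cards_per_color - 1
    return Fraction(remaining_same_color, remaining_total)


small_box = p_second_matches(12)
large_box = p_second_matches(24)

print(small_box, large_box, large_box > small_box)
```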

Q: You are at a casino and have two dice to play with. You win $10 every time you roll a 5. If you play till you win and then stop, what is the expected payout?

Let’s assume that it costs $5 every time you want to play.
There are 36 possible combinations with two dice.
Of the 36 combinations, there are 4 combinations that result in rolling a five: (1,4), (2,3), (3,2), and (4,1). This means that there is a 4/36 or 1/9 chance of rolling a 5.
A 1/9 chance of winning means that, on average, you’ll play nine times: eight losses followed by one win.
Therefore, your expected payout is equal to $10.00 * 1 - $5.00 * 9 = -$35.00.
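
The reasoning above can be sketched in Python by enumerating the dice outcomes and using the fact that the expected number of plays until the first win is 1/p. The $5 cost per play is the answer's own assumption, not part of the question.

```python
from fractions import Fraction
from itertools import product

win_amount = 10
cost_per_play = 5  # assumption made in the answer, not given in the question

# Enumerate all 36 outcomes of two dice and count those that sum to 5.
rolls = list(product(range(1, 7), repeat=2))
p_win = Fraction(sum(1 for a, b in rolls if a + b == 5), len(rolls))

# Playing until the first win is a geometric process with mean 1/p plays.
expected_plays = 1 / p_win
expected_payout = win_amount - cost_per_play * expected_plays

print(p_win, expected_plays, expected_payout)
```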

Edit: Thank you guys for commenting and pointing out that it should be -$35!

Q: How can you tell if a given coin is biased?

This isn’t a trick question. The answer is simply to perform a hypothesis test:

The null hypothesis is that the coin is not biased and the probability of flipping heads equals 50% (p = 0.5). The alternative hypothesis is that the coin is biased and p != 0.5.
Flip the coin 500 times.
Calculate the Z-score (if the sample is smaller than 30, you would calculate the t-statistic instead).
Compute the two-tailed p-value and compare it against alpha (with alpha = 0.05, that is 0.05/2 = 0.025 in each tail).
If p-value > alpha, the null is not rejected and there is no evidence that the coin is biased.
If p-value < alpha, the null is rejected and the coin is deemed biased.

Learn more about hypothesis testing here.
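
Here is a minimal sketch of that test using only the Python standard library. The observed count of 270 heads is a hypothetical value chosen for illustration, and the test uses the normal approximation to the binomial.

```python
from math import sqrt
from statistics import NormalDist

n_flips = 500
n_heads = 270   # hypothetical observation, for illustration only
p_null = 0.5    # null hypothesis: the coin is fair
alpha = 0.05

p_hat = n_heads / n_flips
standard_error = sqrt(p_null * (1 - p_null) / n_flips)
z_score = (p_hat - p_null) / standard_error

# Two-tailed p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))
reject_null = p_value < alpha

print(round(z_score, 3), round(p_value, 4), reject_null)
```

With these made-up numbers the p-value is above 0.05, so the null hypothesis of a fair coin would not be rejected.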

Q: Make an unfair coin fair

Since a coin flip is a binary outcome, you can make an unfair coin fair by flipping it twice. If you flip it twice, there are two outcomes that you can bet on: heads followed by tails or tails followed by heads.

P(heads) * P(tails) = P(tails) * P(heads)

This makes sense since each coin toss is an independent event. This means that if you get heads → heads or tails → tails, you would need to reflip the coin.

Q: You are about to get on a plane to London and you want to know whether you have to bring an umbrella or not. You call three of your random friends and ask each one of them if it’s raining. The probability that your friend is telling the truth is 2/3 and the probability that they are playing a prank on you by lying is 1/3. If all 3 of them tell you that it is raining, then what is the probability that it is actually raining in London?

You can tell that this question is related to Bayesian theory because of the last statement which essentially follows the structure, “What is the probability A is true given B is true?” Therefore we need to know the probability of it raining in London on a given day. Let’s assume it’s 25%.

P(A) = probability of it raining = 25%

P(B) = probability that all 3 friends say that it’s raining

P(A|B) = probability that it’s raining given that all 3 friends say that it is raining

P(B|A) = probability that all 3 friends say that it’s raining given that it’s raining = (2/3)³ = 8/27

P(B|not A) = probability that all 3 friends say that it’s raining given that it’s not raining = (1/3)³ = 1/27

Step 1: Solve for P(B)

Bayes’ Theorem states that P(A|B) = P(B|A) * P(A) / P(B), where P(B) can be expanded as

P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25*8/27 + 0.75*1/27

Step 2: Solve for P(A|B)

P(A|B) = 0.25 * (8/27) / ( 0.25*8/27 + 0.75*1/27)

P(A|B) = 8 / (8 + 3) = 8/11

Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
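
The two steps can be checked with exact fractions in Python. As in the answer, the 25% prior probability of rain is an assumption, and the friends are treated as answering independently of one another.

```python
from fractions import Fraction

p_rain = Fraction(1, 4)    # assumed prior probability of rain (from the answer)
p_truth = Fraction(2, 3)   # each friend tells the truth
p_lie = 1 - p_truth        # each friend lies
n_friends = 3

# All three say "raining": truthful if it rains, lying if it does not.
p_all_yes_given_rain = p_truth ** n_friends
p_all_yes_given_dry = p_lie ** n_friends

# Step 1: law of total probability for P(B)
p_all_yes = p_all_yes_given_rain * p_rain + p_all_yes_given_dry * (1 - p_rain)

# Step 2: Bayes' Theorem for P(A|B)
p_rain_given_all_yes = p_all_yes_given_rain * p_rain / p_all_yes

print(p_rain_given_all_yes)  # 8/11
```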

Q: You are given 40 cards with four different colors- 10 Green cards, 10 Red Cards, 10 Blue cards, and 10 Yellow cards. The cards of each color are numbered from one to ten. Two cards are picked at random. Find out the probability that the cards picked are not of the same number and same color.

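Since the two draws are not independent, we can use the general multiplication rule:

P(A and B) = P(A) * P(B|A), which applies equally to complement events:

P(not A and not B) = P(not A) * P(not B | not A)

For example, suppose the first card drawn is the yellow 4. For the second card, drawn from the remaining 39 cards:

P(not 4 and not yellow) = P(not 4) * P(not yellow | not 4)

P(not 4 and not yellow) = (36/39) * (27/36)

P(not 4 and not yellow) = 0.692

The same calculation holds whichever card is drawn first. Therefore, the probability that the two cards picked share neither the same number nor the same color is 69.2%.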
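
One way to sanity-check this figure is to enumerate every possible pair of cards by brute force. The sketch below reads the question as asking for pairs that differ in both number and color, which is the interpretation the calculation above uses.

```python
from fractions import Fraction
from itertools import combinations

colors = ["green", "red", "blue", "yellow"]
deck = [(color, number) for color in colors for number in range(1, 11)]

# Every unordered pair of distinct cards is equally likely.
pairs = list(combinations(deck, 2))
differ_in_both = sum(
    1 for (c1, n1), (c2, n2) in pairs if c1 != c2 and n1 != n2
)

probability = Fraction(differ_in_both, len(pairs))
print(probability, round(float(probability), 3))
```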

Q: How do you assess the statistical significance of an insight?

You would perform hypothesis testing to determine statistical significance. First, you would state the null hypothesis and alternative hypothesis and set the significance level (alpha). Second, you would calculate the p-value, the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. Last, if the p-value is less than alpha, you would reject the null; in other words, the result is statistically significant.

Q: Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

Example of a long tail distribution

A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.

3 practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best selling products vs others).

It’s important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values make up the majority of the population. This can ultimately change the way that you deal with outliers, and it also conflicts with machine learning techniques that assume the data is normally distributed.

Q: What is the Central Limit Theorem? Explain it. Why is it important?

From Wikipedia

Statistics How To provides the best definition of CLT, which is:

“The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.” [1]

The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.

Q: What is the statistical power?

‘Statistical power’ refers to the power of a binary hypothesis test, which is the probability that the test rejects the null hypothesis given that the alternative hypothesis is true. [2]

Q: Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

Selection bias is the phenomenon of selecting individuals, groups or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.

Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.

Types of selection bias include:

sampling bias : a biased sample caused by non-random sampling
time interval : selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
exposure : includes clinical susceptibility bias, protopathic bias, indication bias. Read more here .
data : includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
attrition : attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where only those that ‘failed’ are included
observer selection : related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it. [3]

Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you’re assuming that the data is not as spread out as it might actually be.

Q: Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?

Observational data comes from observational studies, in which you observe certain variables and try to determine if there is any correlation.

Experimental data comes from experimental studies, in which you control certain variables and hold them constant to determine if there is any causality.

An example of experimental design is the following: split a group up into two. The control group lives their lives normally. The test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects sleep.

Q: Is mean imputation of missing data acceptable practice? Why or why not?

Mean imputation is the practice of replacing null values in a data set with the mean of the data.

Mean imputation is generally bad practice because it doesn’t take into account feature correlation. For example, imagine we have a table showing age and fitness score and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score from an age range of 15 to 80, then the eighty-year-old will appear to have a much higher fitness score than they actually should.

Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.

Q: What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset.

An outlier is a data point that differs significantly from other observations.

Depending on the cause of the outlier, they can be bad from a machine learning perspective because they can worsen the accuracy of a model. If the outlier is caused by a measurement error, it’s important to remove them from the dataset. There are a couple of ways to identify outliers:

Z-score/standard deviations: if we know that 99.7% of data in a data set lie within three standard deviations of the mean, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points that are outside of this range. Likewise, we can calculate the z-score of a given point, and if its absolute value is greater than 3, then it’s an outlier.

Note that there are a few contingencies that need to be considered when using this method: the data must be normally distributed, the method is not applicable to small data sets, and the presence of too many outliers can throw off the z-score.

Interquartile Range (IQR): IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. For normally distributed data, this comes to approximately 2.698 standard deviations.

Photo from Michael Galarnyk
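
The IQR rule can be sketched with the standard library as follows. The sample data is made up, and the quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

# Hypothetical sample with one obviously extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Quartiles (Q1, median, Q3) using the inclusive interpolation method.
q1, _median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, outliers)
```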

Other methods include DBScan clustering, Isolation Forests, and Robust Random Cut Forests.

An inlier is a data observation that lies within the rest of the dataset but is unusual or an error. Since it lies within the dataset, it is typically harder to identify than an outlier and requires external data to identify. Should you identify any inliers, you can simply remove them from the dataset to address them.

Q: How do you handle missing data? What imputation techniques do you recommend?

There are several ways to handle missing data:

Delete rows with missing data
Mean/Median/Mode imputation
Assigning a unique value
Predicting the missing values
Using an algorithm which supports missing values, like random forests

The best method is to delete rows with missing data as it ensures that no bias or variance is added or removed, and ultimately results in a robust and accurate model. However, this is only recommended if there’s a lot of data to start with and the percentage of missing values is low.

Q: You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?

First I would conduct EDA (Exploratory Data Analysis) to clean, explore, and understand my data. See my article on EDA here. As part of my EDA, I could compose a histogram of the duration of calls to see the underlying distribution.

My guess is that the duration of calls would follow a lognormal distribution (see below). The reason that I believe it’s positively skewed is that the lower end is limited to 0, since a call can’t last a negative number of seconds. However, on the upper end, it’s likely for there to be a small proportion of calls that are relatively very long.

Lognormal Distribution Example

You could use a QQ plot to confirm whether the duration of calls follows a lognormal distribution or not. See here to learn more about QQ plots.

Q: Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?

Administrative datasets are typically datasets used by governments or other organizations for non-statistical reasons.

Administrative datasets are usually larger and more cost-efficient than experimental studies. They are also regularly updated, assuming that the organization associated with the administrative dataset is active and functioning. At the same time, administrative datasets may not capture all of the data that one may want and may not be in the desired format either. They are also prone to quality issues and missing entries.

Q: You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

There are a number of potential reasons for a spike in photo uploads:

A new feature may have been implemented in October which involves uploading photos and gained a lot of traction by users. For example, a feature that gives the ability to create photo albums.
Similarly, it’s possible that the process of uploading photos before was not intuitive and was improved in the month of October.
There may have been a viral social media movement that involved uploading photos that lasted for all of October. Eg. Movember but something more scalable.
It’s possible that the spike is due to people posting pictures of themselves in costumes for Halloween.

The method of testing depends on the cause of the spike, but you would conduct hypothesis testing to determine if the inferred cause is the actual cause.

Q: Give examples of data that does not have a Gaussian distribution, nor log-normal.

Any type of categorical data won’t have a Gaussian distribution or lognormal distribution.
Exponential distributions — eg. the amount of time that a car battery lasts or the amount of time until an earthquake occurs.

Q: What is root cause analysis? How to identify a cause vs. a correlation? Give examples

Root cause analysis: a method of problem-solving used for identifying the root cause(s) of a problem [5]

Correlation measures the relationship between two variables and ranges from -1 to 1. Causation is when a first event appears to have caused a second event. Causation essentially looks at direct relationships while correlation can look at both direct and indirect relationships.

Example: a higher crime rate is associated with higher sales in ice cream in Canada, aka they are positively correlated. However, this doesn’t mean that one causes another. Instead, it’s because both occur more when it’s warmer outside.

You can test for causation using hypothesis testing or A/B testing.

Q: Give an example where the median is a better measure than the mean

When there are a number of outliers that positively or negatively skew the data.
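
A quick illustration with made-up numbers: a single extreme value drags the mean far away from the typical observation, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; one extreme value skews the data.
salaries = [30, 35, 40, 45, 1000]

salary_mean = mean(salaries)
salary_median = median(salaries)

print(salary_mean, salary_median)  # 230 40
```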

Q: Given two fair dice, what is the probability of getting scores that sum to 4? to 8?

There are 3 combinations of rolling a 4 (1+3, 3+1, 2+2):

P(rolling a 4) = 3/36 = 1/12

There are 5 combinations of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):

P(rolling an 8) = 5/36
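
Both probabilities can be confirmed by enumerating all 36 equally likely outcomes. The helper name below is mine; this is just a sketch.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling two fair six-sided dice.
rolls = list(product(range(1, 7), repeat=2))


def p_sum(target):
    """Probability that the two dice sum to `target`."""
    favorable = sum(1 for a, b in rolls if a + b == target)
    return Fraction(favorable, len(rolls))


p_four = p_sum(4)
p_eight = p_sum(8)
print(p_four, p_eight)  # 1/12 5/36
```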

Q: What is the Law of Large Numbers?

The Law of Large Numbers is a theorem that states that as the number of trials increases, the average of the results will become closer to the expected value.

Eg. the proportion of heads from flipping a fair coin 100,000 times should be closer to 0.5 than from flipping it 100 times.

Q: How do you calculate the needed sample size?

Formula for margin of error: ME = (t or z score) * S / sqrt(n)

You can use the margin of error (ME) formula to determine the desired sample size by solving it for n: n = ((t or z score) * S / ME)^2

t/z = t/z score used to calculate the confidence interval
ME = the desired margin of error
S = sample standard deviation
n = sample size

Q: When you sample, what bias are you inflicting?

Potential biases include the following:

Sampling bias: a biased sample caused by non-random sampling
Undercoverage bias: sampling too few observations from a segment of the population
Survivorship bias: error of overlooking observations that did not make it past a form of selection process.

Q: How do you control for biases?

There are many things that you can do to control and minimize bias. Two common things include randomization, where participants are assigned by chance, and random sampling, sampling in which each member has an equal probability of being chosen.

Q: What are confounding variables?

A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not causally related.

Q: What is A/B testing?

A/B testing is a form of two-sample hypothesis testing used to compare two versions, the control and the variant, of a single variable. It is commonly used to improve and optimize user experience and marketing.

Check out my article, A Simple Guide to A/B Testing for Data Science.

Q: How do you prove that males are on average taller than females by knowing just gender height?

You can use hypothesis testing to prove that males are taller on average than females.

The null hypothesis would state that males and females are the same height on average, while the alternative hypothesis would state that the average height of males is greater than the average height of females.

Then you would collect a random sample of heights of males and females and use a t-test to determine if you reject the null or not.

Q: Infection rates at a hospital above a 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.

Since we are looking at the number of events (# of infections) occurring within a given timeframe, this is a Poisson distribution question.

The probability of observing k events in an interval: P(k) = (lambda^k * e^(-lambda)) / k!

Null (H0): 1 infection per 100 person-days

Alternative (H1): <1 infection per 100 person-days

k (actual) = 10 infections

lambda (theoretical) = (1/100)*1787 = 17.87

p = P(X <= 10) = 0.032372 or 3.2372%, calculated using POISSON.DIST in Excel or ppois in R

Since p-value < alpha (assuming 5% level of significance), we reject the null and conclude that the hospital is below the standard.
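
For readers without Excel or R at hand, the same lower-tail Poisson probability can be computed with a few lines of standard-library Python. The helper `poisson_cdf` is my own small function, not a library call.

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the pmf term by term."""
    return sum(lam ** i * exp(-lam) / factorial(i) for i in range(k + 1))


rate_per_person_day = 1 / 100   # standard: 1 infection per 100 person-days
person_days = 1787
expected_infections = rate_per_person_day * person_days   # lambda = 17.87

# One-sided p-value: probability of 10 or fewer infections under the null.
p_value = poisson_cdf(10, expected_infections)
print(round(p_value, 4))
```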

Q: You flip a biased coin (p(head)=0.8) five times. What’s the probability of getting three or more heads?

Use the General Binomial Probability formula to answer this question:

General Binomial Probability Formula: P(X = k) = C(n,k) * p^k * (1 - p)^(n - k)

p = 0.8

n = 5

k = 3,4,5

P(3 or more heads) = P(3 heads) + P(4 heads) + P(5 heads) = 0.94 or 94%
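
The sum can be reproduced with the binomial probability mass function, written out here with `math.comb`. The helper name is mine; this is only a sketch of the calculation.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


p_heads = 0.8
n_flips = 5

p_three_or_more = sum(binomial_pmf(k, n_flips, p_heads) for k in (3, 4, 5))
print(round(p_three_or_more, 4))  # 0.9421
```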

Q: A random variable X is normal with mean 1020 and a standard deviation 50. Calculate P(X>1200)

Using Excel…

p =1-norm.dist(1200, 1020, 50, true)

p= 0.000159
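
The same upper-tail probability can be obtained in Python with `statistics.NormalDist` (available since Python 3.8), shown here as an alternative to the Excel formula.

```python
from statistics import NormalDist

# X ~ Normal(mean=1020, standard deviation=50)
distribution = NormalDist(mu=1020, sigma=50)

# P(X > 1200) is the complement of the cumulative probability at 1200.
p_above_1200 = 1 - distribution.cdf(1200)
print(round(p_above_1200, 6))
```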

Q: Consider the number of people that show up at a bus station is Poisson with mean 2.5/h. What is the probability that at most three people show up in a four hour period?

x = 3

mean = 2.5*4 = 10

using Excel…

p = poisson.dist(3,10,true)

p = 0.010336

Q: An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject from a population of prevalence 0.1% receives a positive test result. What is the precision of the test (i.e the probability he is HIV positive)?

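Equation for Precision (PV): PV = (prevalence * sensitivity) / [(prevalence * sensitivity) + (1 - prevalence) * (1 - specificity)]

Precision = Positive Predictive Value = PV

PV = (0.001*0.997)/[(0.001*0.997)+((1-0.001)*(1-0.985))]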

PV = 0.0624 or 6.24%

See more about this equation here .
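
The calculation is an application of Bayes' Theorem and can be sketched in Python as follows; the function name is illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(condition present | positive test) via Bayes' Theorem."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


ppv = positive_predictive_value(
    sensitivity=0.997, specificity=0.985, prevalence=0.001
)
print(round(ppv, 4))  # 0.0624
```

Even with a very accurate test, the low prevalence means most positive results are false positives.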

Q: You are running for office and your pollster polled a hundred people. Sixty of them claimed they will vote for you. Can you relax?

Assume that there’s only you and one other opponent.
Also, assume that we want a 95% confidence interval. This gives us a z-score of 1.96.

Confidence interval formula: p-hat +/- z* * sqrt(p-hat * (1 - p-hat) / n)

p-hat = 60/100 = 0.6

z* = 1.96

n = 100

This gives us a confidence interval of [50.4%, 69.6%]. The lower bound is only just above 50%, so at a 95% confidence level you can relax, but only barely: with 59 out of 100 the interval would already include 50%.
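
The interval can be reproduced with a short sketch, assuming the usual normal-approximation interval for a proportion, p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n). The function name is mine.

```python
from math import sqrt


def proportion_confidence_interval(successes, n, z_score=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    margin = z_score * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


lower, upper = proportion_confidence_interval(60, 100)
print(round(lower, 3), round(upper, 3))  # 0.504 0.696
```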

Q: Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour.

Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean
a 95% confidence interval implies a z score of 1.96
one standard deviation = sqrt(100) = 10

Therefore the confidence interval for the 5-minute count = 100 +/- 19.6 = [80.4, 119.6]. Multiplying by 12 to convert to decays per hour gives [964.8, 1435.2].
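
The steps above (interval for the 5-minute count, then scaling to an hourly rate) can be sketched in Python. The variable names are illustrative.

```python
from math import sqrt

observed_count = 100    # decays recorded
window_minutes = 5
z_score = 1.96          # 95% confidence

# For a Poisson count, the standard deviation is the square root of the mean.
std_dev = sqrt(observed_count)
lower_count = observed_count - z_score * std_dev
upper_count = observed_count + z_score * std_dev

# Scale the 5-minute interval to decays per hour.
windows_per_hour = 60 / window_minutes
lower_per_hour = lower_count * windows_per_hour
upper_per_hour = upper_count * windows_per_hour

print(round(lower_per_hour, 1), round(upper_per_hour, 1))  # 964.8 1435.2
```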

Q: The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?

Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean
a 95% confidence interval implies a z score of 1.96
one standard deviation = sqrt(115) = 10.724

Therefore the confidence interval = 115 +/- 21.02 = [93.98, 136.02]. Since 99 is within this confidence interval, we can assume that this change is not very noteworthy.

Q: Consider influenza epidemics for two-parent heterosexual families. Suppose that the probability is 17% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 12% while the probability that both the mother and father have contracted the disease is 6%. What is the probability that the mother has contracted influenza?

Using the General Addition Rule in probability:

P(mother or father) = P(mother) + P(father) - P(mother and father)

P(mother) = P(mother or father) + P(mother and father) - P(father)

P(mother) = 0.17 + 0.06 - 0.12

P(mother) = 0.11

Q: Suppose that diastolic blood pressures (DBPs) for men aged 35–44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35–44 year old has a DBP less than 70?

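Since 70 is one standard deviation below the mean, take the area of the Gaussian distribution to the left of one standard deviation below the mean. By the empirical rule, this is the area more than two standard deviations below the mean plus the area between two and one standard deviations below the mean:

= 2.3% + 13.6% = 15.9%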

Q: In a population of interest, a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student’s T confidence interval for the mean brain volume in this new population?

Confidence interval for sample: mean +/- t-score * (standard deviation / sqrt(sample size))

Given a confidence level of 95% and degrees of freedom equal to 8, the t-score = 2.306

Confidence interval = 1100 +/- 2.306*(30/3)

Confidence interval = [1076.94, 1123.06]
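
A small sketch of this interval in Python follows. The standard library has no t-distribution quantile function, so the t-score of 2.306 is passed in as a constant taken from the text above; the function name is mine.

```python
from math import sqrt


def t_confidence_interval(sample_mean, sample_std, n, t_score):
    """Confidence interval for a mean: mean +/- t * s / sqrt(n)."""
    margin = t_score * sample_std / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# t_score = 2.306 for 95% confidence with 8 degrees of freedom (from the text).
lower, upper = t_confidence_interval(
    sample_mean=1100, sample_std=30, n=9, t_score=2.306
)
print(round(lower, 2), round(upper, 2))  # 1076.94 1123.06
```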

Q: A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow-up - baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?

Upper bound = mean + t-score*(standard deviation/sqrt(sample size))

0 = -2 + 2.306*(s/3)

2 = 2.306 * s / 3

s = 2.601903

Therefore the standard deviation would have to be approximately 2.60 for the upper bound of the 95% T confidence interval to touch 0.

Q: In a study of emergency room waiting times, investigators consider a new and the standard triage system. To test the systems, administrators selected 20 nights and randomly assigned the new triage system to be used on 10 nights and the standard system on the remaining 10 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 3 hours with a variance of 0.60 while the average MWT for the old system was 5 hours with a variance of 0.68. Consider the 95% confidence interval estimate for the differences of the mean MWT associated with the new system. Assume a constant variance. What is the interval? Subtract in this order (New System - Old System).

See here for full tutorial on finding the Confidence Interval for Two Independent Samples.

Confidence Interval = mean +/- t-score * standard error

mean = new mean - old mean = 3 - 5 = -2

t-score = 2.101 given df=18 (20-2) and confidence interval of 95%

standard error = sqrt((0.60*9+0.68*9)/(10+10-2)) * sqrt(1/10+1/10)

standard error = 0.358

confidence interval = [-2.75, -1.25]
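
The interval can be reproduced with a pooled-variance sketch in Python. Note that 0.60 and 0.68 are given as variances in the question, so they enter the pooled variance directly without squaring; the t-score of 2.101 is taken from the text rather than computed.

```python
from math import sqrt

n_new, n_old = 10, 10
mean_new, mean_old = 3, 5
var_new, var_old = 0.60, 0.68   # variances, as stated in the question
t_score = 2.101                 # 95% confidence, df = 18 (from the text)

mean_difference = mean_new - mean_old

# Pooled variance under the constant-variance assumption.
pooled_variance = (
    (n_new - 1) * var_new + (n_old - 1) * var_old
) / (n_new + n_old - 2)
standard_error = sqrt(pooled_variance) * sqrt(1 / n_new + 1 / n_old)

lower = mean_difference - t_score * standard_error
upper = mean_difference + t_score * standard_error
print(round(lower, 2), round(upper, 2))  # -2.75 -1.25
```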

Q: To further test the hospital triage system, administrators selected 200 nights and randomly assigned a new triage system to be used on 100 nights and a standard system on the remaining 100 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 4 hours with a standard deviation of 0.5 hours while the average MWT for the old system was 6 hours with a standard deviation of 2 hours. Consider the hypothesis of a decrease in the mean MWT associated with the new treatment. What does the 95% independent group confidence interval with unequal variances suggest vis a vis this hypothesis? (Because there’s so many observations per group, just use the Z quantile instead of the T.)

Assuming we subtract in this order (New System - Old System):

confidence interval formula for two independent samples with unequal variances: mean +/- z-score * sqrt(s1^2/n1 + s2^2/n2)

mean = new mean - old mean = 4 - 6 = -2

z-score = 1.96 given a confidence interval of 95%

st. error = sqrt(0.5^2/100 + 2^2/100)

standard error = 0.206155

lower bound = -2 - 1.96*0.206155 = -2.40406

upper bound = -2 + 1.96*0.206155 = -1.59594

confidence interval = [-2.40406, -1.59594]

Since the entire interval lies below zero, it supports the hypothesis of a decrease in the mean MWT with the new system.
