Data Science

Data Science has written 144 posts for Ravi Mandal's Blog

Quiz on Python Dictionaries


Another useful data type built into Python is the dictionary (see Mapping Types — dict). Dictionaries are sometimes found in other languages as “associative memories” or “associative arrays”. Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples; if a tuple contains any mutable object either directly or indirectly, it cannot be used as a key.
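The key rules above can be checked with a short sketch. The names `inventory` and `can_be_key` are made up for illustration; the only assumption is that "usable as a key" means "hashable", which is how Python decides it in practice.

```python
# Strings, numbers, and tuples of immutables are hashable, so they work as keys.
inventory = {
    "apples": 3,        # string key
    42: "answer",       # integer key
    ("x", 1): "point",  # tuple containing only immutable objects
}

def can_be_key(obj):
    """Return True if obj is hashable and can therefore be a dict key.

    Hypothetical helper for illustration, not part of the standard library.
    """
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(can_be_key(("a", (1, 2))))  # nested tuple of immutables
print(can_be_key(("a", [1, 2])))  # tuple that contains a list
print(can_be_key([1, 2]))         # a list on its own
```

The second and third calls report that the object cannot be a key, because a list is mutable, whether it appears directly or inside a tuple.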

Machine Learning: An Introduction

Quora Question Pairs

Can you identify question pairs that have the same intent?

Where else but Quora can a physicist help a chef with a math problem and get cooking tips in return? Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.


Why this Google engineer gave up on Silicon Valley and moved back to India

By all accounts, including her own, Nupur Dave had the dream life.

A native of India, she had spent the past decade living in the US. She was working at Google at the perk-filled “Googleplex” headquarters in Mountain View, California, at a job she loved. And she had obtained permanent residency: her green card.

She was a manager for a part of Google called Network Content Distribution, the network tech that makes Google run faster (in geek speak: it’s Google’s homegrown alternative to a content distribution network like Akamai).


Google Cloud & YouTube-8M Video Understanding Challenge

Can you produce the best video tag predictions?

Video captures a cross-section of our society. And major advances in analyzing and understanding video have the potential to touch all aspects of life from learning and communication to entertainment and play. In this competition, Google is inviting the Kaggle community to join efforts to accelerate research in large-scale video understanding, while giving participants early access to the Google Cloud Machine Learning (Cloud ML) beta platform.

Today, one of the greatest obstacles to rapid improvements in video understanding research is the lack of large-scale, labeled datasets open to the public. In image recognition, by contrast, the availability of large, labeled datasets such as ImageNet has enabled continued breakthroughs in machine learning and machine perception. To that end, Google’s recent release of the YouTube-8M (YT-8M) dataset represents a significant step in this direction. Making this resource open to everyone, from students to industry professionals, is expected to kickstart innovation in areas such as representation learning and video modeling architectures.

In this competition, you are challenged to develop classification algorithms which accurately assign video-level labels using the new and improved YT-8M V2 dataset. The dataset was created from over 7 million YouTube videos (450,000 hours of video) and includes video labels from a vocabulary of 4716 classes (3.4 labels/video on average). It also comes with pre-extracted audio & visual features from every second of video (3.2B feature vectors in total). By taking part, Kagglers will not only play a pivotal role in setting state-of-the-art benchmarks, but also improve search and organization of video archives.


Top Analytics/Data Science Tools

R remains the leading tool, with 49% share, but Python grows faster and almost catches up to R. RapidMiner remains the most popular general Data Science platform. Big Data tools are used by almost 40%, and Deep Learning usage doubles.


Analytics Practice Question

1. You fit a multiple regression to examine the effect of a particular variable that a co-worker in another department is interested in. The variable comes back insignificant, but your co-worker says this is impossible because it is known to have an effect. What would you say or do?

2. You have 1000 variables and 100 observations. You would like to find the significant variables for a particular response. What would you do?

3. What is the difference between linear regression and logistic regression? Can you explain the assumptions/conditions of each?

4. If ANOVA is about comparing means, why is it called analysis of variance?

Hints: Suppose we gather data for a simple case, comparing weight across groups of people. Let’s say we find the weights of (say) 20 single people, 30 married people, 10 cohabiting people, and so on. We can then take the mean of each group. The means will not be the same in the different groups. Nor will all people in any group weigh the same; for example, not all single people weigh the same amount.

We have variation within each group, and we have variation between groups. One way of measuring variation is with a statistic called the variance. If most of the variance is within groups, then we cannot conclude that the groups are different with regard to weight. On the other hand, if most of the variance is between groups that is evidence that the groups differ. How do we decide which variance is bigger, and what it means? We analyze it. That is, we perform an ANOVA, an analysis of variance.
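The within-group versus between-group comparison in the hint can be sketched by hand. The weights below are made-up numbers, and `one_way_anova_f` is a hypothetical helper that computes the usual one-way F statistic (mean square between groups divided by mean square within groups); it is an illustration, not a replacement for a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)      # number of groups
    n = len(all_values)  # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up weights in kg, for illustration only.
single = [70, 72, 68, 71]
married = [75, 78, 74, 77]
cohabiting = [69, 73, 70, 72]

f_stat, df_b, df_w = one_way_anova_f([single, married, cohabiting])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means most of the variance lies between groups, which is evidence that the group means differ; an F near 1 or below means the within-group spread dominates.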

5. Explain the difference between R² and adjusted R². When is R² (or adjusted R²) not useful?

6. What is the Hosmer-Lemeshow goodness-of-fit test? Which option requests it in SAS?

7. What do you understand by a standard normal variable?

8. Explain the Poisson distribution with an example.

9. What is the difference between skewness and kurtosis?

10. Why do we use the DESC keyword in PROC LOGISTIC?

11. What do you understand by the residual chi-square test?

12. What is the difference between exploratory data analysis and confirmatory data analysis?

13. Explain the four scales of measurement.

14. Can we draw a bar diagram for quantitative data? Explain.

15. What is the difference between the chi-square test of independence and correlation?

16. What is a p-value? Why do we fail to reject the null hypothesis when the p-value is greater than the chosen level of significance?

17. What do you understand by hypothesis testing?

18. What are the different methods of estimation?

19. What do you understand by standard error?

20. What is a random variable?

21. What are odds?

22. What is an odds ratio?

23. What do you understand by the ordinary least squares technique, and where do we apply it?

24. What do you understand by the maximum likelihood technique, and where do we apply it?

25. What are one-tailed and two-tailed tests?

26. Under what circumstances do we apply a t-test or a z-test?

27. What do you understand by stepwise selection in logistic regression?

28. What is the difference between R-square and adjusted R-square?

29. Why do we distinguish between simple linear regression and multiple linear regression?

30. What is a confidence interval?

31. What is the level of significance?

32. When do we use factor analysis?

33. What is the KMO-MSA measure?

34. What is an eigenvalue, and what is the MINEIGEN criterion for retaining factors?

35. What is a factor loading?

36. What is cluster analysis?

37. What are the different methods of clustering?

38. What is the single linkage method?

39. What is a dendrogram?

40. What are the demerits of the chi-square test of independence?

41. What is the purpose of Tukey’s test in two-way ANOVA?

42. Why do we use the MODEL keyword in PROC GLM or PROC ANOVA code?

43. When do we use PROC GLM over PROC ANOVA?

44. What is goodness of fit in regression?

45. What are the assumptions of the classical linear regression model? Explain.

46. What is white noise?

47. How do we check for multicollinearity?

48. How do we overcome the problem of multicollinearity?

49. What is the SPEC test?

50. What do you understand by autocorrelation in regression?

51. Why do we split the data set into two i.e. training and validation?

52. What is VIF, and what is it used for?

53. What is ROC?

54. What is the measure of ROC?

55. What is a classification table? Explain.

56. What are false positives and false negatives?

57. What is the error term in regression equation?

58. What is the Box-Jenkins methodology?

59. What is stationarity?

60. How do we make a time series stationary?

61. What do you understand by mean stationary?

62. What do you understand by variance stationary?

63. What are the parameters of the normal distribution?

64. What is the use of the RANUNI function?

65. What are the four models of time series? Explain.

66. What is correlation and partial correlation?

67. What is an individual null hypothesis?

68. What is a global null hypothesis?

69. What is 1-specificity?

70. What is the Durbin-Watson test?

71. What do you understand by a paired-sample t-test? What is the assumption for a paired-sample t-test? How do we test it?

72. What do you understand by degrees of freedom?

73. Under what conditions do we apply a t-test or ANOVA?

74. What is a scatter plot?

75. What do you mean by explanatory variable or independent variable?

76. What is a Type I error?

77. What is a Type II error?

78. Why do we need to standardize the variables before doing cluster analysis?

SAS Quiz Question 1:

Here is a dataset of answers to 5 questions from a customer survey.
Only A, B, C, D, and E are valid values for the answers.

data qest;
     /* read the ID, then five 1-character answers, skipping one blank after each */
     input cusmid $ (answ1-answ5) ($1. +1);
     datalines;
001 A D C A B
002 b A C A D
003 C C D F B
004 S C B E B
005 E A C E e
006 B a c e B
007 N A D A C
008 A S W B B
009 D A E E B
010 Z B V F B
;
run;

a. Convert all lowercase letters to uppercase.
b. Create a variable INVALNUM that counts the number of invalid values
   for each customer ID, using an ARRAY statement.
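
The quiz asks for a SAS solution with an ARRAY statement. For readers who want to check their SAS output, the same logic can be sketched in Python. This is only an analogue under stated assumptions: `VALID`, `RAW`, and `clean_row` are made-up names, and the rows are copied from the quiz data above.

```python
VALID = {"A", "B", "C", "D", "E"}  # the only valid answers, per the quiz

# The ten survey rows from the quiz, copied as-is.
RAW = """\
001 A D C A B
002 b A C A D
003 C C D F B
004 S C B E B
005 E A C E e
006 B a c e B
007 N A D A C
008 A S W B B
009 D A E E B
010 Z B V F B
"""

def clean_row(line):
    """Uppercase the five answers and count those outside VALID."""
    cusmid, *answers = line.split()
    answers = [a.upper() for a in answers]           # part (a)
    invalnum = sum(a not in VALID for a in answers)  # part (b)
    return cusmid, answers, invalnum

for row in RAW.splitlines():
    print(*clean_row(row))
```

Uppercasing happens before the validity check, so a lowercase `b` counts as a valid `B` rather than as an invalid value.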


3rd Annual Data Science Bowl Launch: Join the fight against cancer

Can you improve lung cancer detection?

In the United States, lung cancer strikes 225,000 people every year, and accounts for $12 billion in health care costs. Early detection is critical to give patients the best chance at recovery and survival.

Two years ago, the office of the U.S. Vice President spearheaded a bold new initiative, the Cancer Moonshot, to make a decade’s worth of progress in cancer prevention, diagnosis, and treatment in just 5 years.

In 2017, the Data Science Bowl will be a critical milestone in support of the Cancer Moonshot by convening the data science and medical communities to develop lung cancer detection algorithms.

Using a data set of thousands of high-resolution lung scans provided by the National Cancer Institute, participants will develop algorithms that accurately determine when lesions in the lungs are cancerous. This will dramatically reduce the false positive rate that plagues the current detection technology, get patients earlier access to life-saving interventions, and give radiologists more time to spend with their patients.

This year, the Data Science Bowl will award $1 million in prizes to those who observe the right patterns, ask the right questions, and in turn, create unprecedented impact around cancer screening care and prevention. The funds for the prize purse will be provided by the Laura and John Arnold Foundation.

Visit DataScienceBowl.com to:
• Sign up to receive news about the competition
• Learn about the history of the Data Science Bowl and past competitions
• Read our latest insights on emerging analytics techniques


Ans: Puzzle #1 The Gold Chain

