
# Archive for

### Top Analytics/Data Science Tools

R remains the leading tool, with a 49% share, but Python is growing faster and has almost caught up with R. RapidMiner remains the most popular general Data Science platform. Big Data tools are used by almost 40%, and Deep Learning usage has doubled.

For more details, visit:

http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html

### Analytics Practice Questions

1. You fit a multiple regression to examine the effect of a particular variable a worker in another department is interested in. The variable comes back insignificant, but your co-worker says that this is impossible as it is known to have an effect. What would you say/do?

2. You have 1000 variables and 100 observations. You would like to find the significant variables for a particular response. What would you do?

3. What is the difference between Regression and Logistic Regression? Can you explain the Assumptions/Conditions?

4. If ANOVA is about comparing means, why is it called analysis of variance?

Hints: Suppose we gather the data for the simplest case above. Let’s say we find the weights of 20 single people, 30 married people, 10 cohabiting people, and so on. We can then take the mean of each group. The means will not be the same in the different groups. Nor will all people in any group weigh the same. That is, for example, not all single people weigh the same amount.

We have variation within each group, and we have variation between groups. One way of measuring variation is with a statistic called the variance. If most of the variance is within groups, then we cannot conclude that the groups are different with regard to weight. On the other hand, if most of the variance is between groups, that is evidence that the groups differ. How do we decide which variance is bigger, and what it means? We analyze it. That is, we perform an ANOVA, an analysis of variance.
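As a rough illustration of the hint above, here is a minimal SAS sketch. The data set name `weights` and all of the weight values are made up for illustration. The F statistic that PROC GLM reports is the ratio of the between-group mean square to the within-group mean square, which is exactly the comparison of variances described above.

```sas
/* Hypothetical weights (kg) for three groups; the values are invented for illustration. */
data weights;
    input status $ weight @@;
    datalines;
single 68 single 72 single 65 single 70
married 78 married 81 married 75 married 80
cohab 71 cohab 69 cohab 74 cohab 73
;
run;

/* One-way ANOVA: F = between-group mean square / within-group mean square. */
proc glm data=weights;
    class status;
    model weight = status;
    means status;          /* group means, for comparison with the F test */
run;
quit;
```

A large F value (small p-value) suggests that most of the variation is between groups rather than within them.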

5. Explain the difference between R² and adjusted R². When is R² (or adjusted R²) not useful?

6. What is the Hosmer-Lemeshow goodness-of-fit test? Which option requests it in SAS?

7. What do you understand by a standard normal variable?

8. Explain the Poisson distribution with an example.

9. What is the difference between skewness and kurtosis?

10. Why do we use the DESC keyword in proc logistic?

11. What do you understand by the residual chi-square test?

12. What is the difference between exploratory data analysis and confirmatory data analysis?

13. Explain the four scales of measurement.

14. Can we draw a bar diagram for quantitative data? Explain.

15. What is the difference between the chi-square test of independence and correlation?

16. What is a p-value? Why do we fail to reject the null hypothesis when the p-value is greater than the given level of significance?

17. What do you understand by hypothesis testing?

18. What are the different methods of estimation?

19. What do you understand by standard error?

20. What is a random variable?

21. What are odds?

22. What is an odds ratio?

23. What do you understand by the ordinary least squares technique, and where do we apply it?

24. What do you understand by the maximum likelihood technique, and where do we apply it?

25. What are one-tailed and two-tailed tests?

26. Under what circumstances do we apply a t-test or a z-test?

27. What do you understand by stepwise selection in logistic regression?

28. What is the difference between R-square and adjusted R-square?

29. Why do we call a regression simple linear regression or multiple linear regression?

30. What is a confidence interval?

31. What is the level of significance?

32. When do we use factor analysis?

33. What is the KMO-MSA measure?

34. What is an eigenvalue, and what is the MINEIGEN criterion for keeping factors?

36. What is cluster analysis?

37. What are the different methods of clustering?

38. What is the single linkage method?

39. What is a dendrogram?

40. What are the demerits of the chi-square test of independence?

41. What is the purpose of Tukey’s test in two-way ANOVA?

42. Why do we use the model keyword in proc glm or proc anova code?

43. When do we use proc glm over proc anova?

44. What is goodness of fit in regression?

45. What are the assumptions of the classical linear regression model? Explain.

46. What is white noise?

47. How do we check for multicollinearity?

48. How do we overcome the problem of multicollinearity?

49. What is the spec test?

50. What do you understand by autocorrelation in regression?

51. Why do we split the data set into two i.e. training and validation?

52. What is VIF and its use?

53. What is ROC?

54. What is the measure of ROC?

55. What is a classification table? Explain.

56. What are false positives and false negatives?

57. What is the error term in a regression equation?

58. What is the Box-Jenkins methodology?

59. What is stationarity?

60. How do we make time series data stationary?

61. What do you understand by mean stationarity?

62. What do you understand by variance stationarity?

63. What are the parameters of the normal distribution?

64. What is the use of the RANUNI function?

65. What are the four models of time series? Explain.

66. What is correlation and partial correlation?

67. What is an individual null hypothesis?

68. What is a global null hypothesis?

69. What is 1-specificity?

70. What is the Durbin-Watson test?

71. What do you understand by a paired-sample t-test? What is its assumption, and how do we test it?

72. What do you understand by degrees of freedom?

73. Under what conditions do we apply a t-test or ANOVA?

74. What is a scatter plot?

75. What do you mean by an explanatory variable or independent variable?

76. What is a type I error?

77. What is a type II error?

78. Why do we need to standardize the variables before doing cluster analysis?

### SAS Quiz Question 1

Here is a dataset of answers to 5 questions from a customer survey.
Only A, B, C, D, and E are valid values for the answers.

data qest;
input cusmid \$ (answ1-answ5) (\$1. +1);
datalines;
001 A D C A B
002 b A C A D
003 C C D F B
004 S C B E B
005 E A C E e
006 B a c e B
007 N A D A C
008 A S W B B
009 D A E E B
010 Z B V F B
;
run;

a. Convert all lowercase letters to uppercase.
b. Using an ARRAY statement, create a variable INVALNUM that counts the number of invalid values
for each customer ID.
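
One possible approach to both parts is sketched below. It assumes the `qest` data set above has already been created; the output data set name `qest_clean`, the array name `ans`, and the loop index `i` are arbitrary choices for illustration, not part of the original quiz.

```sas
/* A sketch of one possible solution; other approaches work equally well. */
data qest_clean;
    set qest;
    array ans{5} $ answ1-answ5;      /* group the five answer variables */
    invalnum = 0;                    /* reset the counter for each customer */
    do i = 1 to 5;
        ans{i} = upcase(ans{i});     /* part a: convert to uppercase */
        if ans{i} not in ('A','B','C','D','E') then invalnum + 1;   /* part b */
    end;
    drop i;
run;

proc print data=qest_clean;
run;
```

Because the answers are uppercased before they are checked, a lowercase `b` counts as valid. A missing answer would be counted as invalid under this sketch.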

### Can you improve lung cancer detection?

In the United States, lung cancer strikes 225,000 people every year, and accounts for \$12 billion in health care costs. Early detection is critical to give patients the best chance at recovery and survival.

Two years ago, the office of the U.S. Vice President spearheaded a bold new initiative, the Cancer Moonshot, to make a decade’s worth of progress in cancer prevention, diagnosis, and treatment in just 5 years.

In 2017, the Data Science Bowl will be a critical milestone in support of the Cancer Moonshot by convening the data science and medical communities to develop lung cancer detection algorithms.

Using a data set of thousands of high-resolution lung scans provided by the National Cancer Institute, participants will develop algorithms that accurately determine when lesions in the lungs are cancerous. This will dramatically reduce the false positive rate that plagues the current detection technology, get patients earlier access to life-saving interventions, and give radiologists more time to spend with their patients.

This year, the Data Science Bowl will award \$1 million in prizes to those who observe the right patterns, ask the right questions, and in turn, create unprecedented impact around cancer screening care and prevention. The funds for the prize purse will be provided by the Laura and John Arnold Foundation.

Visit DataScienceBowl.com to:
• Learn about the history of the Data Science Bowl and past competitions
• Read our latest insights on emerging analytics techniques

For more details:

https://www.kaggle.com/c/data-science-bowl-2017

### Business Analyst – Amazon Bangalore, IN

##### Job description

Are you customer obsessed, flexible, smart and analytical, execution focused and passionate about e-commerce?
Amazon seeks a Business Analyst to be a key member of its wholesale team. Amazon’s wholesale business serves as the procurement solution provider for businesses of all sizes and types by offering a wide selection, consistently low prices, and a convenient shopping experience to members. This business within Amazon offers a creative, fast-paced, entrepreneurial work environment where you’ll be at the center of Amazon business innovation. We are looking for an experienced Business Analyst who relishes diving deep into data. You will have the exciting opportunity to deliver on a strategy to enable broad use of AmazonBusiness by small/medium firms and large enterprises in India.
Working in a dynamic environment, you will be responsible for monitoring key input metrics, identifying problem areas and business challenges and collaboratively shaping solutions to augment business growth. The successful candidate has a passion for extracting actionable insights from data. He/she rolls up his/her sleeves, innovates, and quickly becomes a subject matter expert to assess business performance across sellers and market segments. He/she has significant experience working with customers, analyzing data, identifying trends, extracting conclusions, and presenting findings in a simple and clear manner. He/she enjoys problem solving and is proficient using Excel and other tools to analyze large data sets.
Key Responsibilities

• Create organisation’s reporting metrics deck and craft performance metrics for various business teams. Own and partner with cross-functional teams to influence these metrics
• Build models to understand cause-effect relationships between inputs and output metrics
• Conduct ad hoc data analysis and data quality investigations. Use Amazon’s tools to problem solve and validate solutions
• Build forecasting models for financial planning based on past data trends, including developing reports from ERP for key insights
• Recommend business actions based on analytical findings, define new metrics, techniques, and strategies to improve performance

The Successful Candidate Will

• Demonstrate intense customer focus and highest level of integrity, intellectual honesty, and strong work ethic.
• Be sharp, analytical, and thoughtful. Thoroughly thinks through problems to come up with solutions.
• Proven ability to successfully thrive in an ambiguous environment and changing market conditions.
• Takes initiative. Doesn’t wait to be asked. Plans efficiently while avoiding analysis paralysis.
• Strong team player, acts like an owner, and ultimately focused on delivering results with high standards.
• High attention to detail and proven ability to manage multiple, competing priorities simultaneously.
• Superior verbal and written communications skills; particularly, the ability to deliver analysis in a clear and actionable format. Deliver effective presentations of findings and recommendations to multiple levels of stakeholders.
• A bachelor’s degree from a well-regarded institution in an analytical field (e.g. Economics, Computer Science, Mathematics, Statistics, or Finance). 3+ years of work experience analyzing data with strong statistical, quantitative modeling, and forecasting skills.
• Experience in a highly analytical, results-oriented environment with external customer interaction.
• Proven analytical skills and demonstrated ability to manage the business “by the numbers”. Must be metrics-driven.
• Proven analytical and quantitative skills (including proficiency with tools such as Excel and Access) and an ability to use data to back up assumptions and develop business cases.
• B.E /B.Tech
• Proficiency in SQL
• An ability to think quantitatively and qualitatively about operating processes and outcomes
• Experience with e-commerce, retail, advertising, or media would be an advantage

### Extracting First Name, Middle Name, Last Name from Name using SAS

Data char_fun;
input name \$30.;
length first \$10 middle \$10 last \$10;

/* split the name on blanks */
first = scan(name,1,' ');
middle = scan(name,2,' ');
last = scan(name,3,' ');
/* two-word names: treat the second word as the last name */
if missing(last) then do;
last = middle;
middle = " ";
end;

cards;
Ravi mandal
Binay kumar
Shashi Bhushan Mandal
;

proc print data = char_fun;
run;

### Concatenate multiple rows into a single value

Try this:

Data in table:

Data new;
Input Id name \$;
cards;
1 A
1 B
1 C
1 h
2 f
;
run;
proc sort data=new;by id;run;

Use one of the solutions below:

/* Method 1 */
Data final;
retain id;
do until(last.id);
set new;
by id;
new=catx(" ",new,name);
end;
run;

/* Method 2*/
data method2(drop=name);
set new ;
by id;
length l \$ 4000;
retain l;
l=catx(',',l,name);
if last.id then do; output;call missing(l);end;
run;

/* Method 3*/
proc transpose data=new out=new2 ;
by id ;
var name ;
run ;
data method3(keep = id cat) ;
set new2 ;
length cat \$20 ;
cat = catx( ',' , of col: ) ;
run ;

/* Method 4*/

data method4 (drop = name);
set new;
by id;
length outvar \$200;
retain outvar;
if first.id then call missing(outvar);
outvar = catx(', ', outvar, Name);
if last.id;
run;

/* Method 5*/

data method5 ;
length cat \$200 ;
retain cat ;
set new ;
by id notsorted ;
if first.id then cat = cats(name) ;
else cat = catx(',',cat,name) ;
if last.id then output ;
run ;

### An Introduction to SAS (SQL & Macro)

Macro programming is generally considered an advanced topic. But, while macros certainly can be challenging, it is also true that the basic concepts are not difficult to learn. This paper is designed for people who know the basics of SAS programming, but know nothing about SAS macro programming. We explain how the macro processor works, and how to use macros and macro variables. Using these techniques you can create flexible, reusable code that can save you time and effort.

Because macro code takes longer to write and debug than standard SAS code, you generally won’t use macros in programs that will be run only a few times. But if you find yourself writing similar code over and over again, then macros may make your job easier. Macros can help in several ways. First, with macros you can make one small change in your program and have SAS echo that change throughout your program. Second, macros can allow you to write a piece of code and use it over and over again in the same program or in different programs. Third, you can make your programs data driven, letting SAS decide what to do based on actual data values.
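
The three uses described above can be sketched roughly as follows. This is only an illustration: the macro names `show` and `report`, the macro variables `dsn` and `nobs`, and the threshold of 10 rows are all assumptions made for the example. It uses the `sashelp.class` and `sashelp.cars` sample data sets that ship with SAS, and the `trimmed` keyword assumes SAS 9.3 or later.

```sas
/* 1. A macro variable: change the value once and every reference picks it up. */
%let dsn = sashelp.class;

/* 2. A macro: the same code reused with different parameters. */
%macro show(data=, obs=5);
    title "First &obs rows of &data";
    proc print data=&data(obs=&obs);
    run;
    title;
%mend show;

%show(data=&dsn)
%show(data=sashelp.cars, obs=3)

/* 3. Data-driven logic: store a value computed from the data in a macro
      variable, then let it decide which code SAS generates. */
proc sql noprint;
    select count(*) into :nobs trimmed from &dsn;
quit;

%macro report;
    %if &nobs > 10 %then %do;
        proc means data=&dsn;   /* larger data: summarize */
        run;
    %end;
    %else %do;
        proc print data=&dsn;   /* small data: list every row */
        run;
    %end;
%mend report;

%report
```

Changing the single `%let` statement at the top redirects every step to a different data set, which is the "one small change" benefit in practice.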