Sunday, December 2, 2018

Exploratory Data Analysis (EDA) Using Python (Jupyter Notebook)



In this video you will learn how to perform Exploratory Data Analysis using Python. We will see how to slice data using Pandas, how to compute summary statistics using NumPy, and how to visualize data using Matplotlib and Seaborn.
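
A minimal EDA sketch along these lines (the CSV file name and column names are placeholders, not taken from the video):

# Minimal EDA sketch (placeholder file/column names): slicing, summary statistics and plots
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data.csv")                               # placeholder data set

print(df.head())                                           # first rows
print(df.describe())                                       # summary statistics per column

subset = df[df["age"] > 30][["age", "income"]]             # slicing with Pandas (placeholder columns)
print(np.mean(subset["income"]), np.median(subset["income"]))  # summary statistics with NumPy

sns.histplot(df["income"])                                 # distribution of one variable
plt.show()
sns.scatterplot(x="age", y="income", data=df)              # relationship between two variables
plt.show()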



Exploratory data analysis is very useful when building statistical/machine learning models. It helps you understand the structure of the data so that you can build a good predictive model.








Monday, September 3, 2018

Credit Risk Analytics Interview Questions and Answers

In this video you will learn about 50 very important credit risk modelling interview questions and their answers.



To learn credit risk modelling (development of POP, PD, LGD and EAD models, model validation, stress testing, back testing), contact: analyticsuniversity@gmail.com



Credit Risk Analytics Study packs: http://analyticuniversity.com/credit-risk-analytics-study-pack/



Some of the credit risk modelling questions discussed in the videos (and many more):



1- What were the main objectives of Basel I?

2- What were the main objectives of Basel II?

3- What is the Capital Adequacy Ratio?

4- What are Tier 1 & Tier 2 capital?

5- How does IFRS 9 affect credit loss modelling?

6- What is CCAR?

7- What is PPNR?

8- What are the objectives of a credit rating model?

9- What are LCR & NSFR?

10- What is the difference between expected loss and unexpected loss?

11- What is the main difference between wholesale & retail banking?

12- How do we test for multicollinearity?

13- How do you deal with autocorrelation?

14- How do you deal with heteroskedasticity?

15- What are the metrics used for model monitoring?

16- What are the aspects of model risk?

17- What are the guidelines for model development?

18- What are the different aspects of model validation?

19- What are the aspects of a model audit?

20- How do you perform back testing?




Thursday, July 12, 2018

Data Science Projects in Insurance Industry

The insurance industry has been using quantitative research for a very long time. Actuaries have been using statistical modelling to price insurance products and quantify risk for decades. But with advances in data science, the industry is being disrupted further. Here you will learn about the projects you can do in the insurance industry using data science:



1- Consumer Targeting

2- Risk based pricing

3- Better claim management

4- Customer retention

5- Fraud Detection

6- Automated Underwriting

7- Cyber Crime





Tuesday, July 10, 2018

10 Data Science Projects in the Retail Industry

In this video you will learn about 10 data science projects that you can do in the retail industry. The retail industry is leveraging Big Data analytics to serve customers better. Here is the list of projects:



1- Repeat Purchase - Predicting the chances of repurchase

2- Cross Sell - Selling an additional product to an existing customer

3- Personalized Recommendations

4- Pricing products

5- Loyalty analysis

6- Campaign Analysis

7- Market Basket Analysis

8- A/B testing

9- CTR (click-through rate) prediction

10- Segmentation Analysis



You can get data sets for such projects on Kaggle, Data.gov and the UCI Machine Learning Repository.








Saturday, June 23, 2018

8 Data Science Projects in Banking

Financial data analysis is as broad an area as finance itself. You can use it for managing/mitigating different types of financial risk, making investment decisions, managing portfolios, valuing assets, etc. Below are a few beginner-level projects you can try working on.



1- Build a Credit Scorecard Model - Credit scorecards are used to assess the creditworthiness of customers. Use the German Credit data set (publicly available credit data) to build a credit scorecard for customers. The data set has historical data on the default status of 1,000 customers and the different factors that are possibly correlated with a customer's chances of defaulting, such as salary, age and marital status, as well as attributes of the loan contract such as term and APR. Build a classification model (using techniques like Logistic Regression, LDA, Decision Trees, Random Forests, Boosting, Bagging) to classify good and bad customers (non-default and default customers), then use the model to score new customers in the future and lend to those that achieve a minimum score. Credit scorecards are heavily used in the industry for taking decisions on granting credit, monitoring portfolios, calculating expected loss, etc.
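
A minimal sketch of such a scorecard-style classifier in Python, assuming you have downloaded the German credit data as a CSV with a binary default column (the file and column names below are placeholders):

# Credit-scoring sketch (assumed file/column names; adapt to your copy of the German credit data)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("german_credit.csv")                 # placeholder file name
X = pd.get_dummies(df.drop(columns=["default"]))      # one-hot encode categorical attributes
y = df["default"]                                     # 1 = bad customer, 0 = good customer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]             # predicted probability of default
print("Test AUC:", roc_auc_score(y_test, probs))

# A crude score: scale (1 - PD) to a 0-1000 range; real scorecards use points-to-double-odds scaling
scores = (1 - probs) * 1000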



2- Build a Stock Price Forecasting Model - These models are used to predict the price of a stock or an index over a given time period in the future. You can download the stock price of any publicly listed company such as Apple, Microsoft, Facebook or Google from Yahoo Finance. Such data is known as univariate time series data. You can use the ARIMA (AR, MA, ARMA, ARIMA) class of models or Exponential Smoothing models.
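
A minimal ARIMA sketch with statsmodels, assuming daily closing prices loaded via the third-party yfinance package (the ticker, dates and ARIMA order are placeholders, not recommendations):

# Univariate price-forecasting sketch (assumed ticker and order; tune the order with AIC/ACF/PACF in practice)
import yfinance as yf
from statsmodels.tsa.arima.model import ARIMA

prices = yf.download("AAPL", start="2017-01-01", end="2018-06-01")["Close"]

model = ARIMA(prices, order=(1, 1, 1))     # ARIMA(p, d, q); d=1 differences the non-stationary price series
fit = model.fit()

forecast = fit.forecast(steps=30)          # forecast the next 30 trading days
print(forecast.head())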



3- Portfolio Optimization Problem - Assume you are working as an adviser to a high-net-worth individual who wants to diversify his 1 million in cash across 20 different stocks. How would you advise him? You can find the 20 least correlated stocks (which mitigates risk) using a correlation matrix, and then use optimization algorithms (OR algorithms) to work out how to distribute the 1 million among these 20 stocks.
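
One possible formulation is a minimum-variance portfolio; a sketch with SciPy, assuming `returns` is a pandas DataFrame of daily returns for the 20 chosen stocks (an assumption, not data from the post):

# Minimum-variance weights sketch (assumes `returns` holds daily returns, one column per stock)
import numpy as np
from scipy.optimize import minimize

cov = returns.cov().values                 # sample covariance matrix of the 20 stocks
n = cov.shape[0]

def portfolio_variance(w):
    return w @ cov @ w                     # w' * Sigma * w

constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]   # weights sum to 1 (fully invested)
bounds = [(0.0, 1.0)] * n                                        # long-only positions

result = minimize(portfolio_variance, x0=np.full(n, 1.0 / n),
                  bounds=bounds, constraints=constraints, method="SLSQP")

weights = result.x
allocation = weights * 1_000_000           # dollar allocation of the 1 million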



4- Segmentation Modelling - Financial services are increasingly becoming tailor-made. Doing so helps banks target customers in a more efficient way. How do banks do this? They use segmentation modelling to cater differently to different segments of customers. You need historical data on customer attributes and on financial products/services to build a segmentation model. Techniques such as Decision Trees and Clustering are used to build segmentation models.
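
A minimal clustering-based segmentation sketch with scikit-learn, assuming `customers` is a DataFrame of numeric customer attributes (the column names are placeholders):

# Customer segmentation sketch (assumed DataFrame `customers` with numeric attributes)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

features = customers[["income", "balance", "age"]]          # placeholder attribute columns
scaled = StandardScaler().fit_transform(features)           # scale so no attribute dominates the distance

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)   # number of segments is a modelling choice
customers["segment"] = kmeans.fit_predict(scaled)

print(customers.groupby("segment")[["income", "balance", "age"]].mean())  # profile each segment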



5- Revenue Forecasting - Revenue forecasting can also be done using statistical analysis (apart from the conventional accounting practices that companies follow). You can take data on the factors affecting the revenue of a company or a group of companies over a set of equal-interval periods (monthly, quarterly, half-yearly, annual) to build a regression model. Make sure you correct for autocorrelation, as the data has a time series component and the errors are likely to be correlated (which violates the assumptions of regression analysis).
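
A sketch of checking and correcting for autocorrelation with statsmodels, assuming `df` is a quarterly DataFrame with a revenue column and a couple of driver columns (the names are placeholders):

# Regression with autocorrelation check (assumed quarterly DataFrame `df` with placeholder columns)
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

X = sm.add_constant(df[["marketing_spend", "gdp_growth"]])   # placeholder revenue drivers
y = df["revenue"]

ols = sm.OLS(y, X).fit()
print("Durbin-Watson:", durbin_watson(ols.resid))            # values far from 2 suggest autocorrelated errors

# One common correction: HAC (Newey-West) standard errors that are robust to autocorrelation
robust = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(robust.summary())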



6- Pricing Financial Products - You can build models to price financial products such as mortgages, auto loans and credit card transactions (pricing in this case means charging the right interest rate to account for the risk involved, earn a profit from the contract and yet remain competitive in the market). You can also build models to price forwards, futures, options and swaps (relatively more complicated, though).
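
As one illustration on the derivatives side, a Black-Scholes sketch for a European call option; the inputs are made-up example values, not market data:

# Black-Scholes European call price sketch (example inputs)
from math import log, sqrt, exp
from scipy.stats import norm

def bs_call(S, K, T, r, sigma):
    """S: spot, K: strike, T: years to expiry, r: risk-free rate, sigma: volatility."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm.cdf(d1) - K * exp(-r * T) * norm.cdf(d2)

print(bs_call(S=100, K=105, T=0.5, r=0.02, sigma=0.25))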



7- Prepayment Models - Prepayment is a problem for banks in loan contracts. Use loan data to predict which customers could potentially prepay. You can build another model in parallel to estimate, if a customer does prepay, when in the lifetime of the loan he is likely to prepay (time to prepay). You may also build a model to estimate how much loss the company would incur if a section of the portfolio of customers prepays in the future.



8- Fraud Model - These models are used to determine whether a particular transaction is fraudulent. Historical data containing details of fraud and non-fraud transactions can be used to build a classification model that predicts the chances of fraud in a transaction. Since we normally have a high volume of data, one can try not just relatively simple models like Logistic Regression or Decision Trees but also more sophisticated ensemble models.
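
A minimal ensemble sketch with scikit-learn, assuming a `transactions` DataFrame with numeric features and a binary `is_fraud` label (names are placeholders); class weighting is one simple way to handle the usual fraud/non-fraud imbalance:

# Fraud classification sketch (assumed DataFrame `transactions` with an `is_fraud` label)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = transactions.drop(columns=["is_fraud"])
y = transactions["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))   # look at recall on the fraud class, not just accuracy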










Wednesday, June 20, 2018

ANOVA | Analysis of Variance

Analysis of variance (ANOVA) is used to compare the means of two or more samples. While a t-test can be used to compare the means of two samples, it cannot be used for more than two. ANOVA is used in that situation.
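
A quick one-way ANOVA sketch with SciPy on three made-up samples:

# One-way ANOVA sketch (made-up samples representing three groups)
from scipy.stats import f_oneway

group_a = [23, 25, 21, 22, 24]
group_b = [30, 28, 27, 31, 29]
group_c = [22, 24, 23, 21, 25]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value suggests at least one group mean differs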



ANOVA was invented by Sir Ronald Fisher, who first applied the technique to agriculture and the cotton industry.



It is now a popular technique used in many areas, most notably in design of experiments.













Sunday, June 10, 2018

Common Mistakes Made in Cross Validation | Machine Learning



In this video we will discuss a few of the common mistakes often made while performing cross validation of machine learning models. While root mean square error and accuracy rate are the two most popular metrics used to evaluate model performance in cross validation, these have limitations when performance matters more to the researcher in one section of the data than in the others.



For example, we could be interested in better performance when predicting house prices for one segment of the sample (say the mid-priced houses) than for the other segments. Similarly, we could be more interested in correctly predicting default customers than non-default customers in a classification setup.
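
One way to encode such a preference is a custom cross validation scorer; a sketch with scikit-learn, assuming `X` and `y` form a regression data set and we only care about error on mid-priced houses (the price band is a placeholder):

# Segment-focused CV scoring sketch (placeholder price band; assumes X, y are a regression data set)
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

def mid_price_rmse(y_true, y_pred):
    """RMSE computed only on the 'mid-priced' segment of the true values."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mask = (y_true >= 100_000) & (y_true <= 300_000)   # placeholder segment definition
    if not mask.any():
        return 0.0
    return np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2))

scorer = make_scorer(mid_price_rmse, greater_is_better=False)   # sklearn maximizes, so RMSE is negated
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring=scorer)
print(scores)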





Sunday, June 3, 2018

Occam's Razor (Parsimony) in Machine Learning | Model Selection

Occam's Razor (parsimony) is a principle which states that out of all possible models that provide similar results (or performance), the simplest one should be selected as the final model.



It dates back many centuries, when it was studied in general rather than in relation to ML. It is now a widely accepted means of selecting the best model out of many.




Why everyone should learn some Data Science?



Everyone, irrespective of career choice, should learn some data science. Data science skills are useful everywhere. As part of data science you learn statistical analysis, forecasting, data visualization and mathematical programming, which are very useful no matter which career you are interested in. Computational skills are going to be very important in any job in the future.










Friday, June 1, 2018

No Free Lunch theorem in Machine Learning

The No Free Lunch theorem in Machine Learning says that no single machine learning algorithm is universally the best. In fact, the goal of machine learning is not to find one algorithm that will always be the best.

If one algorithm works well for a given problem, it may not work well for some other problem. So there is no universally best algorithm that works very well in all cases.








Wednesday, May 30, 2018

IID Assumption & Machine Learning Models



IID stands for independent and identically distributed: it is assumed that data points are independent of each other and drawn from the same distribution. Because the IID assumption holds, we are able to use cross validation to evaluate models.



Since data points are assumed to be IID, we are able to split the data into training and test sets. That is because we assume both the test and training data are created from the same data-generating process.
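
A minimal illustration of that split with scikit-learn (using a built-in example data set):

# Train/test split sketch: both sets are treated as draws from the same data-generating process
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)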







Tuesday, May 29, 2018

9 Types of Machine Learning Problems/Tasks

In this video I discuss 9 types of Machine Learning problems and tasks. These broad categories are covered in detail in the video:



1- Regression



2- Classification



3- Transcription



4- Machine Translation



5- Structured Output



6- Anomaly detection



7 - Missing value Imputation



8 - Denoising



9- Probability density or PMF estimation







Monday, May 28, 2018

General Data Protection Regulation(GDPR) and Machine Learning

GDPR stands for General Data Protection Regulation, which came into force on 25th May 2018. It is a data protection regulation to protect the personal data of European Union & EEA citizens.



In this video we discuss how GDPR will affect model building due to the increased regulation on personal data.










Saturday, May 26, 2018

Law of Large Numbers (R demo) | Statistics & probability | Machine Learning

The Law of Large Numbers is a fundamental concept in Statistics and Probability which states that if a random experiment is performed a large number of times, the average of the empirical outcomes will be close to the actual or theoretical average outcome.



Here we take the example of a dice experiment to show how the empirical average outcome converges to the theoretical average as the experiment is repeated a very large number of times (very large N).
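
The video demonstrates this in R; for reference, here is a comparable sketch in Python (not the video's code):

# Law of Large Numbers sketch: the running mean of die rolls approaches the theoretical mean of 3.5
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)                     # 100,000 fair six-sided die rolls
running_mean = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)

print(running_mean[9], running_mean[999], running_mean[-1])  # drifts toward 3.5 as N grows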








Wednesday, May 23, 2018

Stem and Leaf Plot | Data Visualization | Statistics

Stem and leaf plots are similar to histograms but have one major difference: they retain the actual data points on the plot, unlike histograms, which only show ranges.
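
There is no stem-and-leaf function in the standard Python plotting libraries, so here is a tiny hand-rolled sketch on made-up data:

# Stem-and-leaf sketch on made-up data: stems are the tens digits, leaves are the units digits
data = [12, 15, 21, 24, 24, 28, 31, 33, 37, 42, 45]

stems = {}
for value in sorted(data):
    stems.setdefault(value // 10, []).append(value % 10)

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
# 1 | 2 5
# 2 | 1 4 4 8
# 3 | 1 3 7
# 4 | 2 5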




















Tuesday, May 22, 2018

Box and Whisker Plot | Application in R | Data Visualization

A box and whisker plot is a way to visualize a continuous variable. It helps in understanding how the data is concentrated. The box ranges from the first quartile (Q1) to the third quartile (Q3), whereas the whiskers range from the beginning of the data to Q1 and from Q3 to the end of the data.
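
The video shows the application in R; a comparable sketch in Python with Matplotlib, on made-up data:

# Box and whisker plot sketch on made-up data
import matplotlib.pyplot as plt

values = [7, 15, 13, 22, 8, 19, 25, 11, 14, 30, 18, 16]

plt.boxplot(values)          # box spans Q1-Q3, with whiskers (and any outliers) beyond it
plt.title("Box and Whisker Plot")
plt.show()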









