{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Descriptive Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Descriptive statistics summarise a dataset by measuring its centre, dispersion and shape. The sample is described on its own terms, and no inferences are drawn from it about the whole population. Inferential statistics, by contrast, models the data using probability theory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Elements of Descriptive Statistics\n", "\n", "### Measures Of Central Tendency\n", "\n", "* Mean\n", "* Median\n", "* Mode\n", "\n", "### Measures Of Spread\n", "\n", "* Range\n", "* Outliers\n", "* Interquartile Range\n", "* Variance\n", "\n", "### Dependence\n", "\n", "* Correlation vs. Causation\n", "* Covariance" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipywidgets as widgets\n", "from ipywidgets import interact\n", "from ipywidgets import interact_manual" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# house price prediction\n", "data = pd.read_csv('Datasets/train.csv')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1460, 81)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "0 1 60 RL 65.0 8450 Pave NaN Reg \n", "1 2 20 RL 80.0 9600 Pave NaN Reg \n", "2 3 60 RL 68.0 11250 Pave NaN IR1 \n", "3 4 70 RL 60.0 9550 Pave NaN IR1 \n", "4 5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities LotConfig LandSlope Neighborhood Condition1 \\\n", "0 Lvl AllPub Inside Gtl CollgCr Norm \n", "1 Lvl AllPub FR2 Gtl Veenker Feedr \n", "2 Lvl AllPub Inside Gtl CollgCr Norm \n", "3 Lvl AllPub Corner Gtl Crawfor Norm \n", "4 Lvl AllPub FR2 Gtl NoRidge Norm \n", "\n", " Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt \\\n", "0 Norm 1Fam 2Story 7 5 2003 \n", "1 Norm 1Fam 1Story 6 8 1976 \n", "2 Norm 1Fam 2Story 7 5 2001 \n", "3 Norm 1Fam 2Story 7 5 1915 \n", "4 Norm 1Fam 2Story 8 5 2000 \n", "\n", " YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType \\\n", "0 2003 Gable CompShg VinylSd VinylSd BrkFace \n", "1 1976 Gable CompShg MetalSd MetalSd None \n", "2 2002 Gable CompShg VinylSd VinylSd BrkFace \n", "3 1970 Gable CompShg Wd Sdng Wd Shng None \n", "4 2000 Gable CompShg VinylSd VinylSd BrkFace \n", "\n", " MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure \\\n", "0 196.0 Gd TA PConc Gd TA No \n", "1 0.0 TA TA CBlock Gd TA Gd \n", "2 162.0 Gd TA PConc Gd TA Mn \n", "3 0.0 TA TA BrkTil TA Gd No \n", "4 350.0 Gd TA PConc Gd TA Av \n", "\n", " BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF \\\n", "0 GLQ 706 Unf 0 150 856 \n", "1 ALQ 978 Unf 0 284 1262 \n", "2 GLQ 486 Unf 0 434 920 \n", "3 ALQ 216 Unf 0 540 756 \n", "4 GLQ 655 Unf 0 490 1145 \n", "\n", " Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF \\\n", "0 GasA Ex Y SBrkr 856 854 0 \n", "1 GasA Ex Y SBrkr 1262 0 0 \n", "2 GasA Ex Y SBrkr 920 866 0 \n", "3 GasA Gd Y SBrkr 961 756 0 \n", "4 GasA Ex Y SBrkr 1145 1053 0 \n", "\n", " GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr 
\\\n", "0 1710 1 0 2 1 3 \n", "1 1262 0 1 2 0 3 \n", "2 1786 1 0 2 1 3 \n", "3 1717 1 0 1 0 3 \n", "4 2198 1 0 2 1 4 \n", "\n", " KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu \\\n", "0 1 Gd 8 Typ 0 NaN \n", "1 1 TA 6 Typ 1 TA \n", "2 1 Gd 6 Typ 1 TA \n", "3 1 Gd 7 Typ 1 Gd \n", "4 1 Gd 9 Typ 1 TA \n", "\n", " GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual \\\n", "0 Attchd 2003.0 RFn 2 548 TA \n", "1 Attchd 1976.0 RFn 2 460 TA \n", "2 Attchd 2001.0 RFn 2 608 TA \n", "3 Detchd 1998.0 Unf 3 642 TA \n", "4 Attchd 2000.0 RFn 3 836 TA \n", "\n", " GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch \\\n", "0 TA Y 0 61 0 0 \n", "1 TA Y 298 0 0 0 \n", "2 TA Y 0 42 0 0 \n", "3 TA Y 0 35 272 0 \n", "4 TA Y 192 84 0 0 \n", "\n", " ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold \\\n", "0 0 0 NaN NaN NaN 0 2 2008 \n", "1 0 0 NaN NaN NaN 0 5 2007 \n", "2 0 0 NaN NaN NaN 0 9 2008 \n", "3 0 0 NaN NaN NaN 0 2 2006 \n", "4 0 0 NaN NaN NaN 0 12 2008 \n", "\n", " SaleType SaleCondition SalePrice \n", "0 WD Normal 208500 \n", "1 WD Normal 181500 \n", "2 WD Normal 223500 \n", "3 WD Abnorml 140000 \n", "4 WD Normal 250000 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option('display.max_columns', 81)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d9fc4ea95ff84b3bae672a13217a9354", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='column', options=('Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'O…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@interact\n", "def check(column = list(data.select_dtypes('number').columns),\n", " column2 = list(data.select_dtypes('number').columns)[1:]):\n", " print(\"Correlation : \",data[column].corr(data[column2]))" ] }, { 
"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1a8d35cbe5fd4e45b2fea9583634c515", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='column1', options=('MSZoning', 'Street', 'Alley', 'LotShape', 'Lan…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.rcParams['figure.figsize'] = (15, 5)\n", "plt.style.use('fivethirtyeight')\n", "\n", "@interact_manual\n", "def check(column1 = list(data.select_dtypes('object').columns),\n", " column2 = list(data.select_dtypes('number').columns)):\n", " # pass x and y as keywords; recent seaborn versions reject positional vectors\n", " sns.boxplot(x=data[column1], y=data[column2])\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "21e49d0dd61f403a85075f80dc43f5be", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='column1', options=('MSSubClass', 'LotFrontage', 'LotArea', 'Overal…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.rcParams['figure.figsize'] = (15, 5)\n", "plt.style.use('fivethirtyeight')\n", "\n", "@interact_manual\n", "def check(column1 = list(data.select_dtypes('number').columns)[1:],\n", " column2 = list(data.select_dtypes('number').columns)[2:]):\n", " sns.scatterplot(x=data[column1], y=data[column2])\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1460, 81)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's check the shape of the data\n", "data.shape" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',\n", " 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',\n", " 'LandSlope', 
'Neighborhood', 'Condition1', 'Condition2', 'BldgType',\n", " 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',\n", " 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',\n", " 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',\n", " 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',\n", " 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',\n", " 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',\n", " 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',\n", " 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',\n", " 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',\n", " 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',\n", " 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',\n", " 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',\n", " 'SaleCondition', 'SalePrice'],\n", " dtype='object')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# lets check the column names\n", "data.columns" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "0 1 60 RL 65.0 8450 Pave NaN Reg \n", "1 2 20 RL 80.0 9600 Pave NaN Reg \n", "2 3 60 RL 68.0 11250 Pave NaN IR1 \n", "3 4 70 RL 60.0 9550 Pave NaN IR1 \n", "4 5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities LotConfig LandSlope Neighborhood Condition1 \\\n", "0 Lvl AllPub Inside Gtl CollgCr Norm \n", "1 Lvl AllPub FR2 Gtl Veenker Feedr \n", "2 Lvl AllPub Inside Gtl CollgCr Norm \n", "3 Lvl AllPub Corner Gtl Crawfor Norm \n", "4 Lvl AllPub FR2 Gtl NoRidge Norm \n", "\n", " Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt \\\n", "0 Norm 1Fam 2Story 7 5 2003 \n", "1 Norm 1Fam 1Story 6 8 1976 \n", "2 Norm 1Fam 2Story 7 5 2001 \n", "3 Norm 1Fam 2Story 7 5 1915 \n", "4 Norm 1Fam 2Story 8 5 2000 \n", "\n", " YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType \\\n", "0 2003 Gable CompShg VinylSd VinylSd BrkFace \n", "1 1976 Gable CompShg MetalSd MetalSd None \n", "2 2002 Gable CompShg VinylSd VinylSd BrkFace \n", "3 1970 Gable CompShg Wd Sdng Wd Shng None \n", "4 2000 Gable CompShg VinylSd VinylSd BrkFace \n", "\n", " MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure \\\n", "0 196.0 Gd TA PConc Gd TA No \n", "1 0.0 TA TA CBlock Gd TA Gd \n", "2 162.0 Gd TA PConc Gd TA Mn \n", "3 0.0 TA TA BrkTil TA Gd No \n", "4 350.0 Gd TA PConc Gd TA Av \n", "\n", " BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF \\\n", "0 GLQ 706 Unf 0 150 856 \n", "1 ALQ 978 Unf 0 284 1262 \n", "2 GLQ 486 Unf 0 434 920 \n", "3 ALQ 216 Unf 0 540 756 \n", "4 GLQ 655 Unf 0 490 1145 \n", "\n", " Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF \\\n", "0 GasA Ex Y SBrkr 856 854 0 \n", "1 GasA Ex Y SBrkr 1262 0 0 \n", "2 GasA Ex Y SBrkr 920 866 0 \n", "3 GasA Gd Y SBrkr 961 756 0 \n", "4 GasA Ex Y SBrkr 1145 1053 0 \n", "\n", " GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr 
\\\n", "0 1710 1 0 2 1 3 \n", "1 1262 0 1 2 0 3 \n", "2 1786 1 0 2 1 3 \n", "3 1717 1 0 1 0 3 \n", "4 2198 1 0 2 1 4 \n", "\n", " KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu \\\n", "0 1 Gd 8 Typ 0 NaN \n", "1 1 TA 6 Typ 1 TA \n", "2 1 Gd 6 Typ 1 TA \n", "3 1 Gd 7 Typ 1 Gd \n", "4 1 Gd 9 Typ 1 TA \n", "\n", " GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual \\\n", "0 Attchd 2003.0 RFn 2 548 TA \n", "1 Attchd 1976.0 RFn 2 460 TA \n", "2 Attchd 2001.0 RFn 2 608 TA \n", "3 Detchd 1998.0 Unf 3 642 TA \n", "4 Attchd 2000.0 RFn 3 836 TA \n", "\n", " GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch \\\n", "0 TA Y 0 61 0 0 \n", "1 TA Y 298 0 0 0 \n", "2 TA Y 0 42 0 0 \n", "3 TA Y 0 35 272 0 \n", "4 TA Y 192 84 0 0 \n", "\n", " ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold \\\n", "0 0 0 NaN NaN NaN 0 2 2008 \n", "1 0 0 NaN NaN NaN 0 5 2007 \n", "2 0 0 NaN NaN NaN 0 9 2008 \n", "3 0 0 NaN NaN NaN 0 2 2006 \n", "4 0 0 NaN NaN NaN 0 12 2008 \n", "\n", " SaleType SaleCondition SalePrice \n", "0 WD Normal 208500 \n", "1 WD Normal 181500 \n", "2 WD Normal 223500 \n", "3 WD Abnorml 140000 \n", "4 WD Normal 250000 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# head of the dataset\n", "pd.set_option('display.max_columns', 81)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2ecb2f518a6246b6930d0a0ac7b7ccac", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='column', options=('MSZoning', 'Street', 'Alley', 'LotShape', 'Land…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# stats at a glance\n", "@interact\n", "def check(column = list(data.select_dtypes('object').columns)):\n", " return 
data[[column,'SalePrice']].groupby(column).agg(['max','min','mean','median','std','sum','count'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 208500\n", "1 181500\n", "2 223500\n", "3 140000\n", "4 250000\n", "5 143000\n", "6 307000\n", "7 200000\n", "8 129900\n", "9 118000\n", "10 129500\n", "11 345000\n", "12 144000\n", "13 279500\n", "14 157000\n", "15 132000\n", "16 149000\n", "17 90000\n", "18 159000\n", "19 139000\n", "Name: SalePrice, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's check the target column of the data\n", "data['SalePrice'].head(20)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "The values are spread over a wide range.\n", "Prices run from 90,000 dollars to 345,000 dollars,\n", "and that is within just the first 20 observations of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mean SalePrice" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "180921.19589041095\n" ] } ], "source": [ "# checking the average price of houses\n", "mean = np.mean(data['SalePrice'])\n", "print(mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Disadvantage of Mean\n", "\n", "* The mean is sensitive to outliers: if a few observations are much larger or much smaller than the majority, the mean is pulled towards those values.\n", "\n", "* More generally, when the distribution of the data is skewed (for example by outliers), the mean is a poor summary of the centre. In that case we use the median instead." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Median of SalePrice" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "163000.0\n" ] } ], "source": [ "# checking the median price of houses\n", "median = np.median(data['SalePrice'])\n", "print(median)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The mean (about 180,921) is well above the median (163,000). A gap of this size suggests that the column contains outliers that pull the mean upwards." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Median and Interquartile Range\n", "\n", "* Taking the concept of the median a step further, we can define the interquartile range (IQR).\n", "* The IQR is a measure of variability based on dividing a data set into quartiles.\n", "* Quartiles split the ordered observations into four groups, each containing a quarter of the data." ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "![image.png](attachment:image.png)\n", "\n", "**The interquartile range is equal to Q3 minus Q1.**\n", "\n", "**For example,** \n", "* consider the following numbers: 1, 3, 4, 5, 5, 6, 7, 11.\n", "\n", " * Q1 is the middle value in the first half of the data set.\n", " * Since there are an even number of data points in the first half of the data set, the middle value is the average of the two middle values; that is, Q1 = (3 + 4)/2 or Q1 = 3.5. Q3 is the middle value in the second half of the data set. 
\n", " * Again, since the second half of the data set has an even number of observations, the middle value is the average of the two middle values; that is, Q3 = (6 + 7)/2 or Q3 = 6.5.\n", " * The interquartile range is Q3 minus Q1, so IQR = 6.5 - 3.5 = 3.\n", " \n", " \n", "### Box Plot View for IQR\n", "\n", "![image.png](attachment:image.png)\n", "\n", "\n", "### Outliers with Box Plot\n", "\n", "* The box plot above shows some additional observations below MINIMUM and above MAXIMUM. These are outliers.\n", "* There are many ways to define outliers mathematically. One common rule uses the IQR: a value is treated as an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "326099.9999999999" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = data['SalePrice'].quantile(0.95)\n", "x" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Median : 163000.0\n", "Q1: 129975.0\n", "Q3: 214000.0\n", "IQR: 84025.0\n" ] } ], "source": [ "### IQR \n", "\n", "# Median\n", "median = np.median(data['SalePrice'])\n", "print(\"Median :\",median)\n", "\n", "# lower quartile \n", "q1 = data['SalePrice'].quantile(0.25)\n", "\n", "# upper quartile\n", "q3 = data['SalePrice'].quantile(0.75)\n", "\n", "# printing Results\n", "print(\"Q1:\", q1)\n", "print(\"Q3:\", q3)\n", "print(\"IQR:\", q3 - q1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The IQR spans the middle 50% of the values in the SalePrice column. The large gap between the mean and the median suggests that the column contains outliers, so let's look for them with a box plot." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# pass the column as x; recent seaborn versions reject positional vectors\n", "sns.boxplot(x=data['SalePrice'])\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Outlier Lower Limit : 3937.5\n", "Outlier Upper Limit : 340037.5\n" ] } ], "source": [ "## let's find the number of outliers\n", "\n", "# for that we have to find the upper and lower outlier limits\n", "outlier_lower_limit = q1 - 1.5*(q3 - q1)\n", "outlier_upper_limit = q3 + 1.5*(q3 - q1)\n", "print(\"Outlier Lower Limit :\", outlier_lower_limit)\n", "print(\"Outlier Upper Limit :\", outlier_upper_limit)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lower_limit_outliers: 0\n", "upper_limit_outliers: 61\n", "total outliers: 61\n" ] } ], "source": [ "Sales_price = data['SalePrice']\n", "\n", "lower_limit_outliers = Sales_price[Sales_price < outlier_lower_limit].count()\n", "\n", "upper_limit_outliers = Sales_price[Sales_price > outlier_upper_limit].count()\n", "\n", "print(\"lower_limit_outliers:\", lower_limit_outliers)\n", "print(\"upper_limit_outliers:\", upper_limit_outliers)\n", "print(\"total outliers:\", upper_limit_outliers + lower_limit_outliers)" ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "## Skewness\n", "\n", "In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. 
The skewness value can be positive, zero, negative, or undefined.\n", "\n", "![image.png](attachment:image.png)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Check the skewness of the data (histplot with kde=True replaces the deprecated distplot)\n", "sns.histplot(data['SalePrice'], kde=True)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The histogram is positively (right) skewed: its long tail extends toward higher sale prices.\n", "\n", "The image above shows different examples of skewness and how the mean and the median are affected in each distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mode" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 140000\n", "dtype: int64\n" ] } ], "source": [ "mode = Sales_price.mode()\n", "print(mode)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the histogram with the mean, median, and mode marked as vertical lines\n", "\n", "plt.figure(figsize=(10, 6))\n", "plt.hist(Sales_price, bins=40, color='yellow')\n", "# mode() returns a Series, so take its first value\n", "plt.axvline(mode.iloc[0], color='black', label='mode')\n", "plt.axvline(median, color='tab:blue', label='median')\n", "plt.axvline(mean, color='tab:orange', label='mean')\n", "plt.ylim(0, 250)\n", "plt.legend()\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Spread of the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Let's take the value 250,000 from the SalePrice column and check how far it lies from the mean, relative to the typical spread of the data set.\n", "* We measure this as follows:\n", "    (250,000 - mean) / standard deviation\n", "* We already know the mean; we computed it earlier.\n", "\n", "* What is the standard deviation?\n", "    * It measures the typical deviation of the data from the mean, and is the square root of the variance.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Range of the Data\n", "\n", "* The range of the data is simply:\n", "    * maximum value - minimum value\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "720100" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Range = np.max(Sales_price) - np.min(Sales_price)\n", "Range\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variance of the Data" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6311111264.297451\n" ] } ], "source": [ "# pandas .var() uses ddof=1 by default, i.e. the sample variance\n", "variance = Sales_price.var()\n", "print(variance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Standard Deviation" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"79442.50288288663\n" ] } ], "source": [ "from math import sqrt\n", "\n", "# The standard deviation is the square root of the variance\n", "std = sqrt(variance)\n", "print(std)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "If the data is normally distributed, then approximately:\n", "~68% of the data lies within 1 standard deviation of the mean\n", "~95% of the data lies within 2 standard deviations of the mean\n", "~99.7% of the data lies within 3 standard deviations of the mean" ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "![image.png](attachment:image.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation" ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "![image.png](attachment:image.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* A correlation coefficient of 1 means a perfect positive linear relationship: whenever one variable increases, the other increases by a proportional amount (not necessarily by the same number of units).\n", "* A correlation coefficient of -1 means a perfect negative linear relationship: whenever one variable increases, the other decreases by a proportional amount.\n", "* A coefficient of 0 means there is no linear relationship between the two variables. They may still be related in a non-linear way." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the Correlation between the Sale Price and the Above-Ground Living Area (GrLivArea)?\n", "\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "864 22\n", "1040 14\n", "894 11\n", "848 10\n", "1456 10\n", " ..\n", "2792 1\n", "2794 1\n", "1349 1\n", "1347 1\n", "2054 1\n", "Name: GrLivArea, Length: 861, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['GrLivArea'].value_counts()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Correlation between SalePrice and GrLivArea is 0.71\n" ] } ], "source": [ "# Find the correlation between the sale price and the above-ground living area\n", "\n", "living_area = data.GrLivArea\n", "\n", "# np.corrcoef returns the matrix of Pearson correlation coefficients; [0, 1] is the off-diagonal entry\n", "corr = np.corrcoef(Sales_price, living_area)[0, 1]\n", "print(\"Correlation between SalePrice and GrLivArea is {0:.2f}\".format(corr))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " LotArea GrLivArea GarageArea SalePrice\n", "LotArea 1.000000 0.263116 0.180403 0.263843\n", "GrLivArea 0.263116 1.000000 0.468997 0.708624\n", "GarageArea 0.180403 0.468997 1.000000 0.623431\n", "SalePrice 0.263843 0.708624 0.623431 1.000000\n" ] } ], "source": [ "# Take four continuous variables and compute their pairwise correlations\n", "\n", "x = data[['LotArea','GrLivArea','GarageArea','SalePrice']]\n", "corr = x.corr()\n", "print(corr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation doesn't imply Causation\n", "\n", "* Correlation alone does not imply causation. 
There may be, for example, an unknown factor that influences both variables similarly.\n", "\n", "* Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.\n", "\n", "* A statistically significant correlation has been reported, for example, between yellow cars and a lower incidence of accidents. That does not indicate that yellow cars are safer, but just that fewer yellow cars are involved in accidents. A third factor, such as the personality type of the purchaser of yellow cars, is more likely to be responsible than the color of the paint itself." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
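<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>LotArea</th>\n", "      <th>GrLivArea</th>\n", "      <th>GarageArea</th>\n", "      <th>SalePrice</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>LotArea</th>\n", "      <td>9.962565e+07</td>\n", "      <td>1.380033e+06</td>\n", "      <td>3.849872e+05</td>\n", "      <td>2.092111e+08</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>GrLivArea</th>\n", "      <td>1.380033e+06</td>\n", "      <td>2.761296e+05</td>\n", "      <td>5.269198e+04</td>\n", "      <td>2.958187e+07</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>GarageArea</th>\n", "      <td>3.849872e+05</td>\n", "      <td>5.269198e+04</td>\n", "      <td>4.571251e+04</td>\n", "      <td>1.058910e+07</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>SalePrice</th>\n", "      <td>2.092111e+08</td>\n", "      <td>2.958187e+07</td>\n", "      <td>1.058910e+07</td>\n", "      <td>6.311111e+09</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>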
" ], "text/plain": [ " LotArea GrLivArea GarageArea SalePrice\n", "LotArea 9.962565e+07 1.380033e+06 3.849872e+05 2.092111e+08\n", "GrLivArea 1.380033e+06 2.761296e+05 5.269198e+04 2.958187e+07\n", "GarageArea 3.849872e+05 5.269198e+04 4.571251e+04 1.058910e+07\n", "SalePrice 2.092111e+08 2.958187e+07 1.058910e+07 6.311111e+09" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Covariance\n", "\n", "data[['LotArea','GrLivArea','GarageArea','SalePrice']].cov()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }