{"id":1449,"date":"2024-12-04T18:46:23","date_gmt":"2024-12-04T18:46:23","guid":{"rendered":"https:\/\/www.allendowney.com\/blog\/?p=1449"},"modified":"2024-12-04T18:46:23","modified_gmt":"2024-12-04T18:46:23","slug":"multiple-regression-with-statsmodels","status":"publish","type":"post","link":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/","title":{"rendered":"Multiple Regression with StatsModels"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>This is the third in a series of excerpts from <em>Elements of Data Science<\/em>, which is <a href=\"https:\/\/www.lulu.com\/shop\/allen-downey\/elements-of-data-science\/paperback\/product-9dyrwn.html\">available from Lulu.com<\/a> and online booksellers. It&#8217;s from Chapter 10, which is about multiple regression. You can read the complete chapter <a href=\"https:\/\/allendowney.github.io\/ElementsOfDataScience\/10_regression.html\">here<\/a>, or run the Jupyter notebook on <a href=\"https:\/\/colab.research.google.com\/github\/AllenDowney\/ElementsOfDataScience\/blob\/v1\/08_distributions.ipynb\">Colab<\/a>.<\/p>\n<\/blockquote>\n\n\n\n<p>In the previous chapter we used simple linear regression to quantify the relationship between two variables. In this chapter we\u2019ll get farther into regression, including multiple regression and one of my all-time favorite tools, logistic regression. These tools will allow us to explore relationships among sets of variables. As an example, we will use data from the General Social Survey (GSS) to explore the relationship between education, sex, age, and income.<\/p>\n\n\n\n<p>The GSS dataset contains hundreds of columns. We\u2019ll work with an extract that contains just the columns we need, as we did in Chapter 8. 
Instructions for downloading the extract are in the notebook for this chapter.<\/p>\n\n\n\n<p>We can read the <code>DataFrame<\/code> like this and display the first few rows.<\/p>\n\n\n\n<pre id=\"codecell0\" class=\"wp-block-preformatted\">import pandas as pd\n\ngss = pd.read_hdf('gss_extract_2022.hdf', 'gss')\ngss.head()\n<\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><\/th><th>year<\/th><th>id<\/th><th>age<\/th><th>educ<\/th><th>degree<\/th><th>sex<\/th><th>gunlaw<\/th><th>grass<\/th><th>realinc<\/th><\/tr><\/thead><tbody><tr><th>0<\/th><td>1972<\/td><td>1<\/td><td>23.0<\/td><td>16.0<\/td><td>3.0<\/td><td>2.0<\/td><td>1.0<\/td><td>NaN<\/td><td>18951.0<\/td><\/tr><tr><th>1<\/th><td>1972<\/td><td>2<\/td><td>70.0<\/td><td>10.0<\/td><td>0.0<\/td><td>1.0<\/td><td>1.0<\/td><td>NaN<\/td><td>24366.0<\/td><\/tr><tr><th>2<\/th><td>1972<\/td><td>3<\/td><td>48.0<\/td><td>12.0<\/td><td>1.0<\/td><td>2.0<\/td><td>1.0<\/td><td>NaN<\/td><td>24366.0<\/td><\/tr><tr><th>3<\/th><td>1972<\/td><td>4<\/td><td>27.0<\/td><td>17.0<\/td><td>3.0<\/td><td>2.0<\/td><td>1.0<\/td><td>NaN<\/td><td>30458.0<\/td><\/tr><tr><th>4<\/th><td>1972<\/td><td>5<\/td><td>61.0<\/td><td>12.0<\/td><td>1.0<\/td><td>2.0<\/td><td>1.0<\/td><td>NaN<\/td><td>50763.0<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>We\u2019ll start with a simple regression, estimating the parameters of real income as a function of years of education. 
First we\u2019ll select the subset of the data where both variables are valid.<\/p>\n\n\n\n<pre id=\"codecell1\" class=\"wp-block-preformatted\">data = gss.dropna(subset=['realinc', 'educ'])\nxs = data['educ']\nys = data['realinc']\n<\/pre>\n\n\n\n<p>Now we can use <code>linregress<\/code> to fit a line to the data.<\/p>\n\n\n\n<pre id=\"codecell2\" class=\"wp-block-preformatted\">from scipy.stats import linregress\nres = linregress(xs, ys)\nres._asdict()\n<\/pre>\n\n\n\n<pre id=\"codecell3\" class=\"wp-block-preformatted\">{'slope': 3631.0761003894995,\n 'intercept': -15007.453640508655,\n 'rvalue': 0.37169252259280877,\n 'pvalue': 0.0,\n 'stderr': 35.625290800764,\n 'intercept_stderr': 480.07467595184363}\n<\/pre>\n\n\n\n<p>The estimated slope is about 3631, which means that each additional year of education is associated with an additional $3631 of income.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Regression with StatsModels<\/h2>\n\n\n\n<p>SciPy doesn\u2019t do multiple regression, so we\u2019ll switch to a new library, StatsModels. Here\u2019s the import statement.<\/p>\n\n\n\n<pre id=\"codecell4\" class=\"wp-block-preformatted\">import statsmodels.formula.api as smf\n<\/pre>\n\n\n\n<p>To fit a regression model, we\u2019ll use <code>ols<\/code>, which stands for \u201cordinary least squares\u201d, another name for regression.<\/p>\n\n\n\n<pre id=\"codecell5\" class=\"wp-block-preformatted\">results = smf.ols('realinc ~ educ', data=data).fit()\n<\/pre>\n\n\n\n<p>The first argument is a <strong>formula string<\/strong> that specifies that we want to regress income as a function of education. The second argument is the <code>DataFrame<\/code> containing the subset of valid data. 
The names in the formula string correspond to columns in the <code>DataFrame<\/code>.<\/p>\n\n\n\n<p>The result from <code>ols<\/code> is an object that represents the model \u2013 it provides a method called <code>fit<\/code> that does the actual computation.<\/p>\n\n\n\n<p>The result from <code>fit<\/code> is a <code>RegressionResultsWrapper<\/code>, which contains a <code>Series<\/code> called <code>params<\/code>, which contains the estimated intercept and the slope associated with <code>educ<\/code>.<\/p>\n\n\n\n<pre id=\"codecell6\" class=\"wp-block-preformatted\">results.params\n<\/pre>\n\n\n\n<pre id=\"codecell7\" class=\"wp-block-preformatted\">Intercept   -15007.453641\neduc          3631.076100\ndtype: float64\n<\/pre>\n\n\n\n<p>The results from StatsModels are the same as the results we got from SciPy, so that\u2019s good!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Multiple Regression<\/h2>\n\n\n\n<p>In the previous section, we saw that income depends on education, and in the exercise we saw that it also depends on <code>age<\/code>. Now let\u2019s put them together in a single model.<\/p>\n\n\n\n<pre id=\"codecell8\" class=\"wp-block-preformatted\">results = smf.ols('realinc ~ educ + age', data=gss).fit()\nresults.params\n<\/pre>\n\n\n\n<pre id=\"codecell9\" class=\"wp-block-preformatted\">Intercept   -17999.726908\neduc          3665.108238\nage             55.071802\ndtype: float64\n<\/pre>\n\n\n\n<p>In this model, <code>realinc<\/code> is the variable we are trying to explain or predict, which is called the <strong>dependent variable<\/strong> because it depends on the other variables \u2013 or at least we expect it to. The other variables, <code>educ<\/code> and <code>age<\/code>, are called <strong>independent variables<\/strong> or sometimes \u201cpredictors\u201d. 
The <code>+<\/code> sign indicates that we expect the contributions of the independent variables to be additive.<\/p>\n\n\n\n<p>The result contains an intercept and two slopes, which estimate the average contribution of each predictor with the other predictor held constant.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The estimated slope for <code>educ<\/code> is about 3665 \u2013 so if we compare two people with the same age, and one has an additional year of education, we expect their income to be higher by $3665.<\/li>\n\n\n\n<li>The estimated slope for <code>age<\/code> is about 55 \u2013 so if we compare two people with the same education, and one is a year older, we expect their income to be higher by $55.<\/li>\n<\/ul>\n\n\n\n<p>In this model, the contribution of age is quite small, but as we\u2019ll see in the next section, that might be misleading.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Grouping by Age<\/h2>\n\n\n\n<p>Let\u2019s look more closely at the relationship between income and age. We\u2019ll use a Pandas method we have not seen before, called <code>groupby<\/code>, to divide the <code>DataFrame<\/code> into age groups.<\/p>\n\n\n\n<pre id=\"codecell10\" class=\"wp-block-preformatted\">grouped = gss.groupby('age')\ntype(grouped)\n<\/pre>\n\n\n\n<pre id=\"codecell11\" class=\"wp-block-preformatted\">pandas.core.groupby.generic.DataFrameGroupBy\n<\/pre>\n\n\n\n<p>The result is a <code>GroupBy<\/code> object that contains one group for each value of <code>age<\/code>. The <code>GroupBy<\/code> object behaves like a <code>DataFrame<\/code> in many ways. 
You can use brackets to select a column, like <code>realinc<\/code> in this example, and then invoke a method like <code>mean<\/code>.<\/p>\n\n\n\n<pre id=\"codecell12\" class=\"wp-block-preformatted\">mean_income_by_age = grouped['realinc'].mean()\n<\/pre>\n\n\n\n<p>The result is a Pandas <code>Series<\/code> that contains the mean income for each age group, which we can plot like this.<\/p>\n\n\n\n<pre id=\"codecell13\" class=\"wp-block-preformatted\">import matplotlib.pyplot as plt\n\nplt.plot(mean_income_by_age, 'o', alpha=0.5)\nplt.xlabel('Age (years)')\nplt.ylabel('Income (1986 $)')\nplt.title('Average income, grouped by age');\n<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"417\" height=\"264\" src=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png\" alt=\"_images\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png\" class=\"wp-image-1451\" srcset=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png 417w, https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38-300x190.png 300w\" sizes=\"auto, (max-width: 417px) 100vw, 417px\" \/><\/figure>\n\n\n\n<p>Average income increases from age 20 to age 50, then starts to fall. And that explains why the estimated slope is so small, because the relationship is non-linear. 
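<\/p>\n\n\n\n<p>To see how a rise-then-fall pattern can hide from a straight line, here\u2019s a small synthetic sketch \u2013 this is made-up data, not the GSS \u2013 where income peaks at age 50. The opposing trends on the two sides of the peak nearly cancel, so a single linear fit reports a slope close to zero even though the relationship is strong.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import numpy as np\nfrom scipy.stats import linregress\n\n# synthetic income that peaks at age 50, plus noise\nrng = np.random.default_rng(0)\nage = rng.uniform(20, 80, 1000)\nincome = 50000 - 40 * (age - 50)**2 + rng.normal(0, 2000, 1000)\n\nyoung = age &lt; 50\nslope_young = linregress(age[young], income[young]).slope\nslope_old = linregress(age[~young], income[~young]).slope\nslope_all = linregress(age, income).slope\nslope_young, slope_old, slope_all   # strongly positive, strongly negative, close to zero\n<\/pre>\n\n\n\n<p>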
To describe a non-linear relationship, we\u2019ll create a new variable called <code>age2<\/code> that equals <code>age<\/code> squared \u2013 so it is called a <strong>quadratic term<\/strong>.<\/p>\n\n\n\n<pre id=\"codecell14\" class=\"wp-block-preformatted\">gss['age2'] = gss['age']**2\n<\/pre>\n\n\n\n<p>Now we can run a regression with both <code>age<\/code> and <code>age2<\/code> on the right side.<\/p>\n\n\n\n<pre id=\"codecell15\" class=\"wp-block-preformatted\">model = smf.ols('realinc ~ educ + age + age2', data=gss)\nresults = model.fit()\nresults.params\n<\/pre>\n\n\n\n<pre id=\"codecell16\" class=\"wp-block-preformatted\">Intercept   -52599.674844\neduc          3464.870685\nage           1779.196367\nage2           -17.445272\ndtype: float64\n<\/pre>\n\n\n\n<p>In this model, the slope associated with <code>age<\/code> is substantial, about $1779 per year.<\/p>\n\n\n\n<p>The slope associated with <code>age2<\/code> is about -$17. It might be unexpected that it is negative \u2013 we\u2019ll see why in the next section.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Visualizing Regression Results<\/h2>\n\n\n\n<p>In the previous section we ran a multiple regression model to characterize the relationships between income, age, and education. When a model includes quadratic terms, the parameters are hard to interpret. For example, you might notice that in the following model the parameter for <code>educ<\/code> is negative, and that might be a surprise, because it suggests that higher education is associated with lower income. But the parameter for <code>educ2<\/code> is positive, and that makes a big difference. 
In this section we\u2019ll see a way to interpret the model visually and validate it against data.<\/p>\n\n\n\n<p>Here\u2019s the model, with quadratic terms for both <code>age<\/code> and <code>educ<\/code>.<\/p>\n\n\n\n<pre id=\"codecell17\" class=\"wp-block-preformatted\">gss['educ2'] = gss['educ']**2\n\nmodel = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss)\nresults = model.fit()\nresults.params\n<\/pre>\n\n\n\n<pre id=\"codecell18\" class=\"wp-block-preformatted\">Intercept   -26336.766346\neduc          -706.074107\neduc2          165.962552\nage           1728.454811\nage2           -17.207513\ndtype: float64\n<\/pre>\n\n\n\n<p>The <code>results<\/code> object provides a method called <code>predict<\/code> that uses the estimated parameters to generate predictions. It takes a <code>DataFrame<\/code> as a parameter and returns a <code>Series<\/code> with a prediction for each row in the <code>DataFrame<\/code>. To use it, we\u2019ll create a new <code>DataFrame<\/code> with <code>age<\/code> running from 18 to 89, and <code>age2<\/code> set to <code>age<\/code> squared.<\/p>\n\n\n\n<pre id=\"codecell19\" class=\"wp-block-preformatted\">import numpy as np\n\ndf = pd.DataFrame()\ndf['age'] = np.linspace(18, 89)\ndf['age2'] = df['age']**2\n<\/pre>\n\n\n\n<p>Next, we\u2019ll pick a level for <code>educ<\/code>, like 12 years, which is the most common value. When you assign a single value to a column in a <code>DataFrame<\/code>, Pandas makes a copy for each row.<\/p>\n\n\n\n<pre id=\"codecell20\" class=\"wp-block-preformatted\">df['educ'] = 12\ndf['educ2'] = df['educ']**2\n<\/pre>\n\n\n\n<p>Then we can use <code>results<\/code> to predict the average income for each age group, holding education constant.<\/p>\n\n\n\n<pre id=\"codecell21\" class=\"wp-block-preformatted\">pred12 = results.predict(df)\n<\/pre>\n\n\n\n<p>The result from <code>predict<\/code> is a <code>Series<\/code> with one prediction for each row. 
So we can plot it with age on the x-axis and the predicted income for each age group on the y-axis. And we\u2019ll plot the data for comparison.<\/p>\n\n\n\n<pre id=\"codecell22\" class=\"wp-block-preformatted\">plt.plot(mean_income_by_age, 'o', alpha=0.5)\nplt.plot(df['age'], pred12, label='High school', color='C4')\n\nplt.xlabel('Age (years)')\nplt.ylabel('Income (1986 $)')\nplt.title('Income versus age, grouped by education level')\nplt.legend();\n<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"417\" height=\"264\" src=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/891bc6bf5b349cf7c4b5d16771cfabd3322ae5486b9e9501ed58f7f79426ffb8.png\" alt=\"_images\/891bc6bf5b349cf7c4b5d16771cfabd3322ae5486b9e9501ed58f7f79426ffb8.png\" class=\"wp-image-1452\" srcset=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/891bc6bf5b349cf7c4b5d16771cfabd3322ae5486b9e9501ed58f7f79426ffb8.png 417w, https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/891bc6bf5b349cf7c4b5d16771cfabd3322ae5486b9e9501ed58f7f79426ffb8-300x190.png 300w\" sizes=\"auto, (max-width: 417px) 100vw, 417px\" \/><\/figure>\n\n\n\n<p>The dots show the average income in each age group. The line shows the predictions generated by the model, holding education constant. 
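<\/p>\n\n\n\n<p>As a quick check on where the curve peaks, we can solve for the vertex of the quadratic in <code>age<\/code> \u2013 for a parabola <code>a * age**2 + b * age<\/code>, the peak is at <code>-b \/ (2 * a)<\/code>. Using the coefficients printed above:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">b_age = 1728.454811    # coefficient of age, from results.params\na_age2 = -17.207513    # coefficient of age2\npeak_age = -b_age \/ (2 * a_age2)\npeak_age               # about 50 years, matching the peak in the data\n<\/pre>\n\n\n\n<p>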
This plot shows the shape of the model, a downward-facing parabola.<\/p>\n\n\n\n<p>We can do the same thing with other levels of education, like 14 years, which is the nominal time to earn an Associate\u2019s degree, and 16 years, which is the nominal time to earn a Bachelor\u2019s degree.<\/p>\n\n\n\n<pre id=\"codecell23\" class=\"wp-block-preformatted\">df['educ'] = 16\ndf['educ2'] = df['educ']**2\npred16 = results.predict(df)\n\ndf['educ'] = 14\ndf['educ2'] = df['educ']**2\npred14 = results.predict(df)\n<\/pre>\n\n\n\n<pre id=\"codecell24\" class=\"wp-block-preformatted\">plt.plot(mean_income_by_age, 'o', alpha=0.5)\nplt.plot(df['age'], pred16, ':', label='Bachelor')\nplt.plot(df['age'], pred14, '--', label='Associate')\nplt.plot(df['age'], pred12, label='High school', color='C4')\n\nplt.xlabel('Age (years)')\nplt.ylabel('Income (1986 $)')\nplt.title('Income versus age, grouped by education level')\nplt.legend();\n<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"417\" height=\"264\" src=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/582915f2a0348e2f09c863fdeb4f0a9f36c532b1594572564761a207119c7684.png\" alt=\"_images\/582915f2a0348e2f09c863fdeb4f0a9f36c532b1594572564761a207119c7684.png\" class=\"wp-image-1453\" srcset=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/582915f2a0348e2f09c863fdeb4f0a9f36c532b1594572564761a207119c7684.png 417w, https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/582915f2a0348e2f09c863fdeb4f0a9f36c532b1594572564761a207119c7684-300x190.png 300w\" sizes=\"auto, (max-width: 417px) 100vw, 417px\" \/><\/figure>\n\n\n\n<p>The lines show expected income as a function of age for three levels of education. This visualization helps validate the model, since we can compare the predictions with the data. 
And it helps us interpret the model since we can see the separate contributions of age and education.<\/p>\n\n\n\n<p>Sometimes we can understand a model by looking at its parameters, but often it is better to look at its predictions. In the exercises, you\u2019ll have a chance to run a multiple regression, generate predictions, and visualize the results.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is the third is a series of excerpts from Elements of Data Science which available from Lulu.com and online booksellers. It&#8217;s from Chapter 10, which is about multiple regression. You can read the complete chapter here, or run the Jupyter notebook on Colab. In the previous chapter we used simple linear regression to quantify the relationship between two variables. In this chapter we\u2019ll get farther into regression, including multiple regression and one of my all-time favorite tools, logistic regression&#8230;.<\/p>\n<p class=\"read-more\"><a class=\"btn btn-default\" href=\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/\"> Read More<span class=\"screen-reader-text\">  Read 
More<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-1449","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Multiple Regression with StatsModels - Probably Overthinking It<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Multiple Regression with StatsModels - Probably Overthinking It\" \/>\n<meta property=\"og:description\" content=\"This is the third is a series of excerpts from Elements of Data Science which available from Lulu.com and online booksellers. It&#8217;s from Chapter 10, which is about multiple regression. You can read the complete chapter here, or run the Jupyter notebook on Colab. In the previous chapter we used simple linear regression to quantify the relationship between two variables. 
In this chapter we\u2019ll get farther into regression, including multiple regression and one of my all-time favorite tools, logistic regression.... Read More Read More\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/\" \/>\n<meta property=\"og:site_name\" content=\"Probably Overthinking It\" \/>\n<meta property=\"article:published_time\" content=\"2024-12-04T18:46:23+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png\" \/>\n<meta name=\"author\" content=\"AllenDowney\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@AllenDowney\" \/>\n<meta name=\"twitter:site\" content=\"@AllenDowney\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"AllenDowney\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/\"},\"author\":{\"name\":\"AllenDowney\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/#\/schema\/person\/4e5bfb2e9af6c3446cb0031a7bf83207\"},\"headline\":\"Multiple Regression with StatsModels\",\"datePublished\":\"2024-12-04T18:46:23+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/\"},\"wordCount\":1184,\"publisher\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png\",\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/\",\"url\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/\",\"name\":\"Multiple Regression with StatsModels - Probably Overthinking 
It\",\"isPartOf\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png\",\"datePublished\":\"2024-12-04T18:46:23+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#primaryimage\",\"url\":\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png\",\"contentUrl\":\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png\",\"width\":417,\"height\":264},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.allendowney.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Multiple Regression with StatsModels\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/#website\",\"url\":\"https:\/\/www.allendowney.com\/blog\/\",\"name\":\"Probably Overthinking It\",\"description\":\"Data science, Bayesian Statistics, and other 
ideas\",\"publisher\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.allendowney.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/#organization\",\"name\":\"Probably Overthinking It\",\"url\":\"https:\/\/www.allendowney.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2025\/03\/probably_logo.png\",\"contentUrl\":\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2025\/03\/probably_logo.png\",\"width\":714,\"height\":784,\"caption\":\"Probably Overthinking It\"},\"image\":{\"@id\":\"https:\/\/www.allendowney.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/AllenDowney\",\"https:\/\/www.linkedin.com\/in\/allendowney\/\",\"https:\/\/bsky.app\/profile\/allendowney.bsky.social\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/#\/schema\/person\/4e5bfb2e9af6c3446cb0031a7bf83207\",\"name\":\"AllenDowney\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.allendowney.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/fb01b3a7f7190bea1bbf7f0852e686c2f8c03b099222df2ce4bc7926f15bcb43?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/fb01b3a7f7190bea1bbf7f0852e686c2f8c03b099222df2ce4bc7926f15bcb43?s=96&d=mm&r=g\",\"caption\":\"AllenDowney\"},\"url\":\"https:\/\/www.allendowney.com\/blog\/author\/allendowney_6dbrc4\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Multiple Regression with StatsModels - Probably Overthinking It","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/","og_locale":"en_US","og_type":"article","og_title":"Multiple Regression with StatsModels - Probably Overthinking It","og_description":"This is the third is a series of excerpts from Elements of Data Science which available from Lulu.com and online booksellers. It&#8217;s from Chapter 10, which is about multiple regression. You can read the complete chapter here, or run the Jupyter notebook on Colab. In the previous chapter we used simple linear regression to quantify the relationship between two variables. In this chapter we\u2019ll get farther into regression, including multiple regression and one of my all-time favorite tools, logistic regression.... Read More Read More","og_url":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/","og_site_name":"Probably Overthinking It","article_published_time":"2024-12-04T18:46:23+00:00","og_image":[{"url":"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png","type":"","width":"","height":""}],"author":"AllenDowney","twitter_card":"summary_large_image","twitter_creator":"@AllenDowney","twitter_site":"@AllenDowney","twitter_misc":{"Written by":"AllenDowney","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#article","isPartOf":{"@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/"},"author":{"name":"AllenDowney","@id":"https:\/\/www.allendowney.com\/blog\/#\/schema\/person\/4e5bfb2e9af6c3446cb0031a7bf83207"},"headline":"Multiple Regression with StatsModels","datePublished":"2024-12-04T18:46:23+00:00","mainEntityOfPage":{"@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/"},"wordCount":1184,"publisher":{"@id":"https:\/\/www.allendowney.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#primaryimage"},"thumbnailUrl":"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png","inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/","url":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/","name":"Multiple Regression with StatsModels - Probably Overthinking 
It","isPartOf":{"@id":"https:\/\/www.allendowney.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#primaryimage"},"image":{"@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#primaryimage"},"thumbnailUrl":"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png","datePublished":"2024-12-04T18:46:23+00:00","breadcrumb":{"@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#primaryimage","url":"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png","contentUrl":"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/12\/ecc1aef34032bb07cf1639367d00ddbe2fc8a8ed7532628b9ddddafed10f7f38.png","width":417,"height":264},{"@type":"BreadcrumbList","@id":"https:\/\/www.allendowney.com\/blog\/2024\/12\/04\/multiple-regression-with-statsmodels\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.allendowney.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Multiple Regression with StatsModels"}]},{"@type":"WebSite","@id":"https:\/\/www.allendowney.com\/blog\/#website","url":"https:\/\/www.allendowney.com\/blog\/","name":"Probably Overthinking It","description":"Data science, Bayesian Statistics, and other 