{"id":1390,"date":"2024-10-15T13:39:54","date_gmt":"2024-10-15T13:39:54","guid":{"rendered":"https:\/\/www.allendowney.com\/blog\/?p=1390"},"modified":"2024-10-15T13:39:54","modified_gmt":"2024-10-15T13:39:54","slug":"bootstrapping-a-proportion","status":"publish","type":"post","link":"https:\/\/www.allendowney.com\/blog\/2024\/10\/15\/bootstrapping-a-proportion\/","title":{"rendered":"Bootstrapping a Proportion"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>It&#8217;s another installment in <em>Data Q&amp;A: Answering the real questions with Python<\/em>. Previous installments are available from the <a href=\"https:\/\/allendowney.github.io\/DataQnA\/index.html\"><em>Data Q&amp;A<\/em> landing page<\/a>.<\/p>\n<\/blockquote>\n\n\n\n<p>Here\u2019s a <a href=\"https:\/\/www.reddit.com\/r\/statistics\/comments\/1g2pmfg\/comment\/lruz2xp\">question from the Reddit statistics forum<\/a>.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>How do I use bootstrapping to generate confidence intervals for a proportion\/ratio? The situation is this:<\/p>\n\n\n\n<p>I obtain samples of text with differing numbers of lines. From several tens to over a million. I have no control over how many lines there are in any given sample. Each line of each sample may or may not contain a string S. Counting lines according to S presence or S absence generates a ratio of S to S\u2019 for that sample. I want to use bootstrapping to calculate confidence intervals for the found ratio (which of course will vary with sample size).<\/p>\n\n\n\n<p>To do this I could either:<\/p>\n\n\n\n<p>A. Literally resample (10,000 times) of size (say) 1,000 from the original sample (with replacement) then categorise S (and S\u2019), and then calculate the ratio for each resample, and finally identify highest and lowest 2.5% (for 95% CI), or<\/p>\n\n\n\n<p>B. 
Generate 10,000 samples of 1,000 random numbers between 0 and 1, scoring each stochastically as above or below original sample ratio (equivalent to S or S\u2019). Then calculate CI as in A.<\/p>\n\n\n\n<p>Programmatically A is slow and B is very fast. Is there anything wrong with doing B? The confidence intervals generated by each are almost identical.<\/p>\n<\/blockquote>\n\n\n\n<p>The answer to the immediate question is that A and B are equivalent, so there\u2019s nothing wrong with B. But in follow-up responses, a few related questions were raised:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Is resampling a good choice for this problem?<\/li>\n\n\n\n<li>What size should the resamplings be?<\/li>\n\n\n\n<li>How many resamplings do we need?<\/li>\n<\/ol>\n\n\n\n<p>I don\u2019t think resampling is really necessary here, and I\u2019ll show some alternatives. And I\u2019ll answer the other questions along the way.<\/p>\n\n\n\n<p><a href=\"https:\/\/colab.research.google.com\/github\/AllenDowney\/DataQnA\/blob\/main\/examples\/proportions.ipynb\">Click here to run this notebook on Colab<\/a>.<\/p>\n\n\n\n<p>I\u2019ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Pallor and Probability<\/h2>\n\n\n\n<p>As an example, let\u2019s use one of the exercises from <em>Think Python<\/em>:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><em>The Count of Monte Cristo<\/em> is a novel by Alexandre Dumas that is considered a classic. 
Nevertheless, in the introduction of an English translation of the book, the writer Umberto Eco confesses that he found the book to be \u201cone of the most badly written novels of all time\u201d.<\/p>\n\n\n\n<p>In particular, he says it is \u201cshameless in its repetition of the same adjective,\u201d and mentions in particular the number of times \u201cits characters either shudder or turn pale.\u201d<\/p>\n\n\n\n<p>To see whether his objection is valid, let\u2019s count the number of lines that contain the word pale in any form, including pale, pales, paled, and paleness, as well as the related word pallor. Use a single regular expression that matches all of these words and no others.<\/p>\n<\/blockquote>\n\n\n\n<p>The following cell downloads the text of the book from Project Gutenberg.<\/p>\n\n\n\n<pre id=\"codecell3\" class=\"wp-block-preformatted\">download('https:\/\/www.gutenberg.org\/cache\/epub\/1184\/pg1184.txt');\n<\/pre>\n\n\n\n<p>We\u2019ll use the following functions to remove the additional material that appears before and after the text of the book.<\/p>\n\n\n\n<pre id=\"codecell4\" class=\"wp-block-preformatted\">def is_special_line(line):\n    return line.startswith('*** ')\n<\/pre>\n\n\n\n<pre id=\"codecell5\" class=\"wp-block-preformatted\">def clean_file(input_file, output_file):\n    reader = open(input_file)\n    writer = open(output_file, 'w')\n\n    for line in reader:\n        if is_special_line(line):\n            break\n\n    for line in reader:\n        if is_special_line(line):\n            break\n        writer.write(line)\n        \n    reader.close()\n    writer.close()\n\nclean_file('pg1184.txt', 'pg1184_cleaned.txt')\n<\/pre>\n\n\n\n<p>And we\u2019ll use the following function to count the number of lines that contain a particular pattern of characters.<\/p>\n\n\n\n<pre id=\"codecell6\" class=\"wp-block-preformatted\">import re\n\ndef count_matches(lines, pattern):\n    count = 0\n    for line in lines:\n        result = 
re.search(pattern, line)\n        if result:\n            count += 1\n    return count\n<\/pre>\n\n\n\n<p><code>readlines<\/code> reads the file and creates a list of strings, one for each line.<\/p>\n\n\n\n<pre id=\"codecell7\" class=\"wp-block-preformatted\">lines = open('pg1184_cleaned.txt').readlines()\nn = len(lines)\nn\n<\/pre>\n\n\n\n<pre id=\"codecell8\" class=\"wp-block-preformatted\">61310\n<\/pre>\n\n\n\n<p>There are about 61,000 lines in the file.<\/p>\n\n\n\n<p>The following pattern matches \u201cpale\u201d and several related words.<\/p>\n\n\n\n<pre id=\"codecell9\" class=\"wp-block-preformatted\">pattern = r'\\b(pale|pales|paled|paleness|pallor)\\b'\nk = count_matches(lines, pattern)\nk\n<\/pre>\n\n\n\n<pre id=\"codecell10\" class=\"wp-block-preformatted\">223\n<\/pre>\n\n\n\n<p>These words appear in 223 lines of the file.<\/p>\n\n\n\n<pre id=\"codecell11\" class=\"wp-block-preformatted\">p_est = k \/ n\np_est\n<\/pre>\n\n\n\n<pre id=\"codecell12\" class=\"wp-block-preformatted\">0.0036372533028869677\n<\/pre>\n\n\n\n<p>So the estimated proportion is about 0.0036. To quantify the precision of that estimate, we\u2019ll compute a confidence interval.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Resampling<\/h2>\n\n\n\n<p>First we\u2019ll use the method OP called A \u2013 literally resampling the lines of the file. The following function takes a list of lines and selects a sample, with replacement, that has the same size.<\/p>\n\n\n\n<pre id=\"codecell13\" class=\"wp-block-preformatted\">def resample(lines):\n    return np.random.choice(lines, len(lines), replace=True)\n<\/pre>\n\n\n\n<p>In a resampled list, the same line can appear more than once, and some lines might not appear at all. So in any resampling, the forbidden words might appear more times than in the original text, or fewer. 
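To make the with-replacement behavior concrete, here\u2019s a minimal sketch using a made-up five-line \u201ctext\u201d (the line names are hypothetical):

```python
import numpy as np

# A made-up five-line "text", just to illustrate sampling with replacement.
lines = np.array(['line1', 'line2', 'line3', 'line4', 'line5'])

rng = np.random.default_rng(1)
resampled = rng.choice(lines, len(lines), replace=True)

# The resample has the same size as the original, but some lines
# may repeat and others may be missing.
print(resampled)
print(len(resampled) == len(lines))  # True
```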
Here\u2019s an example.<\/p>\n\n\n\n<pre id=\"codecell14\" class=\"wp-block-preformatted\">np.random.seed(1)\ncount_matches(resample(lines), pattern)\n<\/pre>\n\n\n\n<pre id=\"codecell15\" class=\"wp-block-preformatted\">201\n<\/pre>\n\n\n\n<p>In this resampling, the words appear in 201 lines, fewer than in the original (223).<\/p>\n\n\n\n<p>If we repeat this process many times, we can compute a sample of possible values of <code>k<\/code>. Because this method is slow, we\u2019ll only repeat it 101 times.<\/p>\n\n\n\n<pre id=\"codecell16\" class=\"wp-block-preformatted\">ks_resampling = [count_matches(resample(lines), pattern) for i in range(101)]\n<\/pre>\n\n\n\n<p>With these different values of <code>k<\/code>, we can divide by <code>n<\/code> to get the corresponding values of <code>p<\/code>.<\/p>\n\n\n\n<pre id=\"codecell17\" class=\"wp-block-preformatted\">ps_resampling = np.array(ks_resampling) \/ n\n<\/pre>\n\n\n\n<p>To see what the distribution of those values looks like, we\u2019ll plot the CDF.<\/p>\n\n\n\n<pre id=\"codecell18\" class=\"wp-block-preformatted\">from empiricaldist import Cdf\n\nCdf.from_seq(ps_resampling).plot(label='resampling')\ndecorate(xlabel='Resampled proportion', ylabel='Density')\n<\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"441\" height=\"255\" src=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image.png\" alt=\"\" class=\"wp-image-1391\" srcset=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image.png 441w, https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-300x173.png 300w\" sizes=\"auto, (max-width: 441px) 100vw, 441px\" \/><\/figure>\n\n\n\n<p>So that\u2019s the slow way to compute the sampling distribution of the proportion. The method OP calls B is to simulate a Bernoulli trial with size <code>n<\/code> and probability of success <code>p_est<\/code>. 
One way to do that is to draw random numbers from 0 to 1 and count how many are less than <code>p_est<\/code>.<\/p>\n\n\n\n<pre id=\"codecell19\" class=\"wp-block-preformatted\">(np.random.random(n) &lt; p_est).sum()\n<\/pre>\n\n\n\n<pre id=\"codecell20\" class=\"wp-block-preformatted\">229\n<\/pre>\n\n\n\n<p>Equivalently, we can draw a sample from a Bernoulli distribution and add it up.<\/p>\n\n\n\n<pre id=\"codecell21\" class=\"wp-block-preformatted\">from scipy.stats import bernoulli\n\nbernoulli(p_est).rvs(n).sum()\n<\/pre>\n\n\n\n<pre id=\"codecell22\" class=\"wp-block-preformatted\">232\n<\/pre>\n\n\n\n<p>These values follow a binomial distribution with parameters <code>n<\/code> and <code>p_est<\/code>. So we can simulate a large number of trials quickly by drawing values from a binomial distribution.<\/p>\n\n\n\n<pre id=\"codecell23\" class=\"wp-block-preformatted\">from scipy.stats import binom\n\nks_binom = binom(n, p_est).rvs(10001)\n<\/pre>\n\n\n\n<p>Dividing by <code>n<\/code>, we can compute the corresponding sample of proportions.<\/p>\n\n\n\n<pre id=\"codecell24\" class=\"wp-block-preformatted\">ps_binom = np.array(ks_binom) \/ n\n<\/pre>\n\n\n\n<p>Because this method is so much faster, we can generate a large number of values, which means we get a more precise picture of the sampling distribution.<\/p>\n\n\n\n<p>The following figure compares the CDFs of the values we got by resampling and the values we got from the binomial distribution.<\/p>\n\n\n\n<pre id=\"codecell25\" class=\"wp-block-preformatted\">Cdf.from_seq(ps_resampling).plot(label='resampling')\nCdf.from_seq(ps_binom).plot(label='binomial')\ndecorate(xlabel='Resampled proportion', ylabel='CDF')\n<\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"442\" height=\"255\" src=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-1.png\" alt=\"\" class=\"wp-image-1392\" 
srcset=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-1.png 442w, https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-1-300x173.png 300w\" sizes=\"auto, (max-width: 442px) 100vw, 442px\" \/><\/figure>\n\n\n\n<p>If we run the resampling method longer, these CDFs converge, so the two methods are equivalent.<\/p>\n\n\n\n<p>To compute a 90% confidence interval, we can use the values we sampled from the binomial distribution.<\/p>\n\n\n\n<pre id=\"codecell26\" class=\"wp-block-preformatted\">np.percentile(ps_binom, [5, 95])\n<\/pre>\n\n\n\n<pre id=\"codecell27\" class=\"wp-block-preformatted\">array([0.0032458 , 0.00404502])\n<\/pre>\n\n\n\n<p>Or we can use the inverse CDF of the binomial distribution, which is even faster than drawing a sample. And it\u2019s deterministic \u2013 that is, we get the same result every time, with no randomness.<\/p>\n\n\n\n<pre id=\"codecell28\" class=\"wp-block-preformatted\">binom(n, p_est).ppf([0.05, 0.95]) \/ n\n<\/pre>\n\n\n\n<pre id=\"codecell29\" class=\"wp-block-preformatted\">array([0.0032458 , 0.00404502])\n<\/pre>\n\n\n\n<p>Using the inverse CDF of the binomial distribution is a good way to compute confidence intervals. But before we get to that, let\u2019s see how resampling behaves as we increase the sample size and the number of iterations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Sample Size<\/h2>\n\n\n\n<p>In the example, the sample size is more than 60,000, so the CI is very small. 
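As a rule of thumb, the width of the interval shrinks like 1\/sqrt(n). Here\u2019s a quick sketch based on the normal approximation, using p = 0.1 and a 90% interval (about 2 &times; 1.645 standard errors) as an example:

```python
import numpy as np

# Width of a 90% CI under the normal approximation: 2 * 1.645 * sqrt(p(1-p)/n)
p = 0.1
for n in [50, 500, 5000]:
    width = 2 * 1.645 * np.sqrt(p * (1 - p) / n)
    print(n, round(width, 3))
```

Each tenfold increase in n shrinks the width by about a factor of sqrt(10), roughly 3.2, consistent with the simulated widths computed next.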
The following figure shows what it looks like for more moderate sample sizes, using <code>p=0.1<\/code> as an example.<\/p>\n\n\n\n<pre id=\"codecell30\" class=\"wp-block-preformatted\">p = 0.1\nns = [50, 500, 5000]\nci_df = pd.DataFrame(index=ns, columns=['low', 'high'])\n\nfor n in ns:\n    ks = binom(n, p).rvs(10001)\n    ps = ks \/ n\n    Cdf.from_seq(ps).plot(label=f\"n = {n}\")\n    ci_df.loc[n] = np.percentile(ps, [5, 95])\n    \ndecorate(xlabel='Proportion', ylabel='CDF')\n<\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"442\" height=\"255\" src=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-2.png\" alt=\"\" class=\"wp-image-1393\" srcset=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-2.png 442w, https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-2-300x173.png 300w\" sizes=\"auto, (max-width: 442px) 100vw, 442px\" \/><\/figure>\n\n\n\n<p>As the sample size increases, the spread of the sampling distribution gets smaller, and so does the width of the confidence interval.<\/p>\n\n\n\n<pre id=\"codecell31\" class=\"wp-block-preformatted\">ci_df['width'] = ci_df['high'] - ci_df['low']\nci_df\n<\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><\/th><th>low<\/th><th>high<\/th><th>width<\/th><\/tr><\/thead><tbody><tr><th>50<\/th><td>0.04<\/td><td>0.18<\/td><td>0.14<\/td><\/tr><tr><th>500<\/th><td>0.078<\/td><td>0.122<\/td><td>0.044<\/td><\/tr><tr><th>5000<\/th><td>0.0932<\/td><td>0.1072<\/td><td>0.014<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>With resampling methods, it is important to draw samples with the same size as the original dataset \u2013 otherwise the result is wrong.<\/p>\n\n\n\n<p>But the number of iterations doesn\u2019t matter as much. 
The following figure shows the sampling distribution if we run the sampling process 101, 1001, and 10,001 times.<\/p>\n\n\n\n<pre id=\"codecell32\" class=\"wp-block-preformatted\">p = 0.1\nn = 100\niter_seq = [101, 1001, 10001]\n\nfor iters in iter_seq:\n    ks = binom(n, p).rvs(iters)\n    ps = ks \/ n\n    Cdf.from_seq(ps).plot(label=f\"iters = {iters}\")\n    \ndecorate()\n<\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"442\" height=\"255\" src=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-3.png\" alt=\"\" class=\"wp-image-1394\" srcset=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-3.png 442w, https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-3-300x173.png 300w\" sizes=\"auto, (max-width: 442px) 100vw, 442px\" \/><\/figure>\n\n\n\n<p>The sampling distribution is the same, regardless of how many iterations we run. But with more iterations, we get a better picture of the distribution and a more precise estimate of the confidence interval. For most problems, 1001 iterations is enough, but if you can generate larger samples fast enough, more is better.<\/p>\n\n\n\n<p>However, for this problem, resampling isn\u2019t really necessary. As we\u2019ve seen, we can use the binomial distribution to compute a CI without drawing a random sample at all. 
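As a quick check of the claim that more iterations only sharpen the estimate, here\u2019s a sketch that repeats the percentile computation 20 times at two iteration counts (with the same n = 100 and p = 0.1 as above) and compares how much the estimates themselves vary:

```python
import numpy as np
from scipy.stats import binom

n, p = 100, 0.1

def low_end(iters, seed):
    # Estimate the 5th percentile of the sampling distribution of the proportion
    ks = binom(n, p).rvs(iters, random_state=seed)
    return np.percentile(ks / n, 5)

# Repeat the whole estimate 20 times at each iteration count and
# measure how much the estimates vary from run to run.
spreads = {}
for iters in [101, 10001]:
    estimates = [low_end(iters, seed) for seed in range(20)]
    spreads[iters] = np.std(estimates)
    print(iters, spreads[iters])
```

With more iterations, the repeated estimates of the 5th percentile cluster more tightly, but their center doesn\u2019t move.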
And for this problem, there are approximations that are even easier to compute \u2013 although they come with some caveats.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Approximations<\/h2>\n\n\n\n<p>If <code>n<\/code> is large and <code>p<\/code> is not too close to 0 or 1, the sampling distribution of a proportion is well modeled by a normal distribution, and we can approximate a confidence interval with just a few calculations.<\/p>\n\n\n\n<p>For a given confidence level, we can use the inverse CDF of the normal distribution to compute a <code>z<\/code> score, which is the number of standard deviations the CI should span \u2013 above and below the observed value of <code>p<\/code> \u2013 in order to include the given confidence.<\/p>\n\n\n\n<pre id=\"codecell33\" class=\"wp-block-preformatted\">from scipy.stats import norm\n\nconfidence = 0.9\nz = norm.ppf(1 - (1 - confidence) \/ 2)\nz\n<\/pre>\n\n\n\n<pre id=\"codecell34\" class=\"wp-block-preformatted\">1.6448536269514722\n<\/pre>\n\n\n\n<p>A 90% confidence interval spans about 1.64 standard deviations.<\/p>\n\n\n\n<p>Now we can use the following function, which uses <code>p<\/code>, <code>n<\/code>, and this <code>z<\/code> score to compute the confidence interval.<\/p>\n\n\n\n<pre id=\"codecell35\" class=\"wp-block-preformatted\">def confidence_interval_normal_approx(k, n, z):\n    p = k \/ n\n    margin_of_error = z * np.sqrt(p * (1 - p) \/ n)\n    \n    lower_bound = p - margin_of_error\n    upper_bound = p + margin_of_error\n    return lower_bound, upper_bound\n<\/pre>\n\n\n\n<p>To test it, we\u2019ll compute <code>n<\/code> and <code>k<\/code> for the example again.<\/p>\n\n\n\n<pre id=\"codecell36\" class=\"wp-block-preformatted\">n = len(lines)\nk = count_matches(lines, pattern)\nn, k\n<\/pre>\n\n\n\n<pre id=\"codecell37\" class=\"wp-block-preformatted\">(61310, 223)\n<\/pre>\n\n\n\n<p>Here\u2019s the confidence interval based on the normal approximation.<\/p>\n\n\n\n<pre id=\"codecell38\" 
class=\"wp-block-preformatted\">ci_normal = confidence_interval_normal_approx(k, n, z)\nci_normal\n<\/pre>\n\n\n\n<pre id=\"codecell39\" class=\"wp-block-preformatted\">(0.003237348046298746, 0.00403715855947519)\n<\/pre>\n\n\n\n<p>In the example, <code>n<\/code> is large, which is good for the normal approximation, but <code>p<\/code> is small, which is bad. So it\u2019s not obvious whether we can trust the approximation.<\/p>\n\n\n\n<p>An alternative that\u2019s more robust is the Wilson score interval, which is reliable for values of <code>p<\/code> close to 0 and 1, and sample sizes bigger than about 5.<\/p>\n\n\n\n<pre id=\"codecell40\" class=\"wp-block-preformatted\">def confidence_interval_wilson_score(k, n, z):    \n    p = k \/ n\n    factor = z**2 \/ n\n    denominator = 1 + factor\n    center = p + factor \/ 2\n    half_width = z * np.sqrt((p * (1 - p) + factor \/ 4) \/ n)\n    \n    lower_bound = (center - half_width) \/ denominator\n    upper_bound = (center + half_width) \/ denominator\n    \n    return lower_bound, upper_bound\n<\/pre>\n\n\n\n<p>Here\u2019s the 90% CI based on Wilson scores.<\/p>\n\n\n\n<pre id=\"codecell41\" class=\"wp-block-preformatted\">ci_wilson = confidence_interval_wilson_score(k, n, z)\nci_wilson\n<\/pre>\n\n\n\n<pre id=\"codecell42\" class=\"wp-block-preformatted\">(0.003258660468175958, 0.00405965209814987)\n<\/pre>\n\n\n\n<p>Another option is the Clopper-Pearson interval, which is what we computed earlier with the inverse CDF of the binomial distribution. 
Here\u2019s a function that computes it.<\/p>\n\n\n\n<pre id=\"codecell43\" class=\"wp-block-preformatted\">from scipy.stats import binom\n\ndef confidence_interval_exact_binomial(k, n, confidence=0.9):\n    alpha = 1 - confidence\n    p = k \/ n\n\n    lower_bound = binom.ppf(alpha \/ 2, n, p) \/ n if k &gt; 0 else 0\n    upper_bound = binom.ppf(1 - alpha \/ 2, n, p) \/ n if k &lt; n else 1\n    \n    return lower_bound, upper_bound\n<\/pre>\n\n\n\n<p>And here\u2019s the interval we get.<\/p>\n\n\n\n<pre id=\"codecell44\" class=\"wp-block-preformatted\">ci_binomial = confidence_interval_exact_binomial(k, n)\nci_binomial\n<\/pre>\n\n\n\n<pre id=\"codecell45\" class=\"wp-block-preformatted\">(0.003245800032621106, 0.0040450171260805745)\n<\/pre>\n\n\n\n<p>A final alternative is the Jeffreys interval, which is derived from Bayes\u2019s Theorem. If we start with a Jeffreys prior and observe <code>k<\/code> successes out of <code>n<\/code> attempts, the posterior distribution of <code>p<\/code> is a beta distribution with parameters <code>a = k + 1\/2<\/code> and <code>b = n - k + 1\/2<\/code>. 
So we can use the inverse CDF of the beta distribution to compute a CI.<\/p>\n\n\n\n<pre id=\"codecell46\" class=\"wp-block-preformatted\">from scipy.stats import beta\n\ndef bayesian_confidence_interval_beta(k, n, confidence=0.9):\n    alpha = 1 - confidence    \n    a, b = k + 1\/2, n - k + 1\/2\n    \n    lower_bound = beta.ppf(alpha \/ 2, a, b)\n    upper_bound = beta.ppf(1 - alpha \/ 2, a, b)\n    \n    return lower_bound, upper_bound\n<\/pre>\n\n\n\n<p>And here\u2019s the interval we get.<\/p>\n\n\n\n<pre id=\"codecell47\" class=\"wp-block-preformatted\">ci_beta = bayesian_confidence_interval_beta(k, n)\nci_beta\n<\/pre>\n\n\n\n<pre id=\"codecell48\" class=\"wp-block-preformatted\">(0.003254420914221609, 0.004054683138668112)\n<\/pre>\n\n\n\n<p>The following figure shows the four intervals we just computed graphically.<\/p>\n\n\n\n<pre id=\"codecell49\" class=\"wp-block-preformatted\">intervals = {\n    'Normal Approximation': ci_normal,\n    'Wilson Score': ci_wilson,\n    'Clopper-Pearson': ci_binomial,\n    'Jeffreys': ci_beta\n}\ny_pos = np.arange(len(intervals))\n\nfor i, (label, (lower, upper)) in enumerate(intervals.items()):\n    middle = (lower + upper) \/ 2\n    xerr = [[(middle - lower)], [(upper - middle)]]\n    plt.errorbar(x=middle, y=i-0.2, xerr=xerr, fmt='o', capsize=5)\n    plt.text(middle, i, label, ha='center', va='top')\n    \ndecorate(xlabel='Proportion', ylim=[3.5, -0.8], yticks=[])\n<\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"442\" height=\"255\" src=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-4.png\" alt=\"\" class=\"wp-image-1395\" srcset=\"https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-4.png 442w, https:\/\/www.allendowney.com\/blog\/wp-content\/uploads\/2024\/10\/image-4-300x173.png 300w\" sizes=\"auto, (max-width: 442px) 100vw, 442px\" \/><\/figure>\n\n\n\n<p>In this example, because <code>n<\/code> is so large, 
the intervals are all similar \u2013 the differences are too small to matter in practice. For smaller values of <code>n<\/code>, the normal approximation becomes unreliable, and for very small values, none of them are reliable.<\/p>\n\n\n\n<p>The normal approximation and Wilson score interval are easy and fast to compute. On my old laptop, they take 1-2 microseconds.<\/p>\n\n\n\n<pre id=\"codecell50\" class=\"wp-block-preformatted\">%timeit confidence_interval_normal_approx(k, n, z)\n<\/pre>\n\n\n\n<pre id=\"codecell51\" class=\"wp-block-preformatted\">1.04 \u00b5s \u00b1 4.04 ns per loop (mean \u00b1 std. dev. of 7 runs, 1,000,000 loops each)\n<\/pre>\n\n\n\n<pre id=\"codecell52\" class=\"wp-block-preformatted\">%timeit confidence_interval_wilson_score(k, n, z)\n<\/pre>\n\n\n\n<pre id=\"codecell53\" class=\"wp-block-preformatted\">1.64 \u00b5s \u00b1 28.6 ns per loop (mean \u00b1 std. dev. of 7 runs, 1,000,000 loops each)\n<\/pre>\n\n\n\n<p>Evaluating the inverse CDFs of the binomial and beta distributions involves more complex computations \u2013 they take about 100 times longer.<\/p>\n\n\n\n<pre id=\"codecell54\" class=\"wp-block-preformatted\">%timeit confidence_interval_exact_binomial(k, n)\n<\/pre>\n\n\n\n<pre id=\"codecell55\" class=\"wp-block-preformatted\">195 \u00b5s \u00b1 7.53 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n<\/pre>\n\n\n\n<pre id=\"codecell56\" class=\"wp-block-preformatted\">%timeit bayesian_confidence_interval_beta(k, n)\n<\/pre>\n\n\n\n<pre id=\"codecell57\" class=\"wp-block-preformatted\">269 \u00b5s \u00b1 4.6 \u00b5s per loop (mean \u00b1 std. dev. 
of 7 runs, 1,000 loops each)\n<\/pre>\n\n\n\n<p>But they still take less than 300 microseconds, so unless you need to compute millions of confidence intervals per second, the difference in computation time doesn\u2019t matter.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Discussion<\/h2>\n\n\n\n<p>If you took a statistics class and learned one of these methods, you probably learned the normal approximation. That\u2019s because it is easy to explain and, because it is based on a form of the Central Limit Theorem, it helps to justify time spent learning about the CLT. But in my opinion it should never be used in practice because it is dominated by the Wilson score interval \u2013 that is, it is worse than Wilson in at least one way and better in none.<\/p>\n\n\n\n<p>I think the Clopper-Pearson interval is equally easy to explain, but when <code>n<\/code> is small, there are few possible values of <code>k<\/code>, and therefore few possible values of <code>p<\/code> \u2013 and the interval can be wider than it needs to be.<\/p>\n\n\n\n<p>The Jeffreys interval is based on Bayesian statistics, so it takes a little more explaining, but it behaves well for all values of <code>n<\/code> and <code>p<\/code>. And when <code>n<\/code> is small, it can be extended to take advantage of background information about likely values of <code>p<\/code>.<\/p>\n\n\n\n<p>For these reasons, the Jeffreys interval is my usual choice, but in a computational environment that doesn\u2019t provide the inverse CDF of the beta distribution, I would use a Wilson score interval.<\/p>\n\n\n\n<p>OP is working in LiveCode, which doesn\u2019t provide a lot of math and statistics libraries, so Wilson might be a good choice. 
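For checking a port to another language, reference values help. Here\u2019s the same Wilson formula restated as a self-contained Python sketch with z = 1.96 (a 95% interval), evaluated at the k = 223 and n = 61310 from the example:

```python
import numpy as np

def wilson_ci(k, n, z=1.96):
    # Wilson score interval for a proportion; z = 1.96 gives a 95% interval
    p = k / n
    factor = z**2 / n
    center = (p + factor / 2) / (1 + factor)
    half_width = z * np.sqrt((p * (1 - p) + factor / 4) / n) / (1 + factor)
    return center - half_width, center + half_width

low, high = wilson_ci(223, 61310)
print(low, high)
```

An implementation in any language should reproduce these bounds to several significant digits.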
Here\u2019s a LiveCode implementation generated by ChatGPT.<\/p>\n\n\n\n<pre id=\"codecell58\" class=\"wp-block-preformatted\">-- Function to calculate the z-score for a 95% confidence level (z \u2248 1.96)\nfunction zScore\n    return 1.96\nend zScore\n\n-- Function to calculate the Wilson Score Interval with distinct bounds\nfunction wilsonScoreInterval k n\n    -- Calculate proportion of successes\n    put k \/ n into p\n    put zScore() into z\n    \n    -- Common term for the interval calculation\n    put (z^2 \/ n) into factor\n    put (p + factor \/ 2) \/ (1 + factor) into adjustedCenter\n    \n    -- Asymmetric bounds\n    put sqrt((p * (1 - p) + factor \/ 4) \/ n) into sqrtTerm\n    \n    -- Lower bound calculation\n    put adjustedCenter - (z * sqrtTerm \/ (1 + factor)) into lowerBound\n    \n    -- Upper bound calculation\n    put adjustedCenter + (z * sqrtTerm \/ (1 + factor)) into upperBound\n    \n    return lowerBound &amp; comma &amp; upperBound\nend wilsonScoreInterval\n<\/pre>\n\n\n\n<p><a href=\"https:\/\/allendowney.github.io\/DataQnA\/index.html\"><em>Data Q&amp;A: Answering the real questions with Python<\/em><\/a><\/p>\n\n\n\n<p>Copyright 2024 <a href=\"https:\/\/allendowney.com\">Allen B. Downey<\/a><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>It&#8217;s another installment in Data Q&amp;A: Answering the real questions with Python. Previous installments are available from the Data Q&amp;A landing page. Here\u2019s a question from the Reddit statistics forum. How do I use bootstrapping to generate confidence intervals for a proportion\/ratio? The situation is this: I obtain samples of text with differing numbers of lines. From several tens to over a million. I have no control over how many lines there are in any given sample. 
Each line of&#8230;<\/p>\n<p class=\"read-more\"><a class=\"btn btn-default\" href=\"https:\/\/www.allendowney.com\/blog\/2024\/10\/15\/bootstrapping-a-proportion\/\"> Read More<span class=\"screen-reader-text\">  Read More<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-1390","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Bootstrapping a Proportion - Probably Overthinking It<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.allendowney.com\/blog\/2024\/10\/15\/bootstrapping-a-proportion\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Bootstrapping a Proportion - Probably Overthinking It\" \/>\n<meta property=\"og:description\" content=\"It&#8217;s another installment in Data Q&amp;A: Answering the real questions with Python. Previous installments are available from the Data Q&amp;A landing page. Here\u2019s a question from the Reddit statistics forum. 