Logistic Regression Calculator

Introduction

This tool performs logistic regression on a dataset with a binary outcome. It is designed for students and researchers exploring the relationship between a continuous predictor variable $x$ and a binary response $y$ . By modelling the probability $p$ that the event occurs ( $y = 1$ ), it shows how changes in the predictor shift the likelihood of that outcome.

What this calculator does

The calculator takes pairs of numerical data and estimates the coefficients of a logit model. Users provide a series of predictor values and the corresponding binary responses (0 or 1). The calculator screens the predictor values for outliers using modified Z-scores, then fits the model by Maximum Likelihood Estimation using the Newton-Raphson method. It reports the intercept, the slope coefficient, their standard errors, classification metrics (accuracy, precision, recall, and F1-score), and goodness-of-fit measures (the Akaike Information Criterion and McFadden's R-squared).

Formula used

The model estimates the probability of the response being 1 using the sigmoid function. The linear predictor $z$ is calculated as $β_{0} + β_{1} x$ . Parameters are refined iteratively to maximise the log-likelihood function. Goodness of fit is evaluated using the Akaike Information Criterion and McFadden's pseudo-R-squared based on the model and null likelihoods.

Logistic Function (Probability):

$P(Y = 1) = \frac{1}{1 + e^{-(β_{0} + β_{1} x)}}$

McFadden's Pseudo-R²:

$R^{2} = 1 - \frac{\ln(L_{\text{model}})}{\ln(L_{\text{null}})}$
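The two fit statistics can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function names are hypothetical, the responses and fitted probabilities are made-up values, and the AIC is taken in its standard form $2k - 2\ln(L)$ with $k$ estimated parameters.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary responses y given probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def null_log_likelihood(y):
    """Log-likelihood of the intercept-only model, which predicts the sample
    proportion of 1s for every observation (requires both classes present)."""
    p_bar = sum(y) / len(y)
    return log_likelihood(y, [p_bar] * len(y))

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R-squared: 1 - ln(L_model) / ln(L_null)."""
    return 1.0 - ll_model / ll_null

def aic(ll_model, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2.0 * ll_model

# Hypothetical responses and fitted probabilities (illustrative only)
y = [0, 0, 1, 1]
p = [0.2, 0.4, 0.6, 0.8]

ll_model = log_likelihood(y, p)
ll_null = null_log_likelihood(y)
r2 = mcfadden_r2(ll_model, ll_null)
model_aic = aic(ll_model, n_params=2)  # intercept + slope

print(f"ln(L_model) = {ll_model:.4f}, ln(L_null) = {ll_null:.4f}")
print(f"McFadden R^2 = {r2:.3f}, AIC = {model_aic:.3f}")
```

For these made-up values the sketch gives a McFadden R² of about 0.471 and an AIC of about 6.936.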

How to use this calculator

Enter the predictor variable $x$ values as a comma-separated list.
Input the corresponding response variable $y$ values, ensuring only 0s and 1s are used.
Select the preferred outlier sensitivity and decimal place precision.
Execute the calculation to view the coefficients, statistical metrics, and visual probability curves.

Example calculation

Scenario: A researcher in environmental science investigates whether the concentration of a soil mineral (mg/kg) predicts the presence (1) or absence (0) of a plant species.

Inputs: Predictor $x$ values: 2, 4, 6, 8, 10; Response $y$ values: 1, 0, 1, 0, 1.

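The responses alternate (1, 0, 1, 0, 1) and are symmetric about the centre of the $x$ range: the mean of $x$ among the successes, $(2 + 6 + 10)/3 = 6$ , equals the overall mean of $x$ . With no upward or downward trend to fit, the optimal slope is $β_{1} = 0$ . The model then reduces to finding the intercept $β_{0}$ at which the predicted probability matches the overall sample proportion: $P = \frac{3}{5} = 0.6$ , which gives $β_{0} = \ln(0.6 / 0.4) = 0.4055$ .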

Step 1: Calculate the Log-Odds (z)

Using the solved coefficients $β_{0} = 0.4055$ and $β_{1} = 0$ :

For all points: $z = 0.4055 + (0 \times x) = 0.4055$

Step 2: Map to Probabilities (P) Using the Logistic Function

$P = \frac{1}{1 + e^{- 0.4055}} = \frac{1}{1 + 0.6667} = 0.600$
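Steps 1 and 2 can be reproduced with a short Python sketch. It simply evaluates the linear predictor and the logistic function for the example inputs; the function name `sigmoid` and the variable names are illustrative choices, not part of the calculator.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients and predictor values from the worked example
beta0, beta1 = 0.4055, 0.0
xs = [2, 4, 6, 8, 10]

zs = [beta0 + beta1 * x for x in xs]   # Step 1: log-odds
ps = [sigmoid(z) for z in zs]          # Step 2: probabilities

for x, z, p in zip(xs, zs, ps):
    print(f"x = {x}: z = {z:.4f}, P(Y = 1) = {p:.3f}")
```

Every row prints the same log-odds and a probability of 0.600, because the slope is zero.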

Step 3: Evaluate the Log-Likelihood (ln L)

We sum the log-probabilities for the 3 successes (y = 1) and 2 failures (y = 0):

$\ln(L) = 3 \times \ln(0.6) + 2 \times \ln(0.4)$

$\ln(L) = 3 \times (-0.5108) + 2 \times (-0.9163) = -3.3651$

Step 4: Verify Optimization Gradient (Newton-Raphson Check)

The gradient (first derivative) for the intercept must equal 0 at convergence:

Gradient = $\sum (y_{i} - p_{i})$ = (1 - 0.6) + (0 - 0.6) + (1 - 0.6) + (0 - 0.6) + (1 - 0.6)

Gradient = 0.4 - 0.6 + 0.4 - 0.6 + 0.4 = 0

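The gradient for the slope must vanish as well: $\sum x_{i} (y_{i} - p_{i}) = 2(0.4) + 4(-0.6) + 6(0.4) + 8(-0.6) + 10(0.4) = 0.8 - 2.4 + 2.4 - 4.8 + 4.0 = 0$ . Because both components of the gradient are 0, the Newton-Raphson update step is zero, confirming that the algorithm has reached the maximum likelihood solution.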
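The fitting procedure can be sketched in plain Python. This is a minimal illustration of Newton-Raphson for a one-predictor logit model, not the calculator's own implementation: the function name `fit_logistic`, the zero starting values, the tolerance, and the iteration cap are all assumptions, so the iteration count may differ from the one the calculator reports.

```python
import math

def fit_logistic(xs, ys, max_iter=25, tol=1e-10):
    """Newton-Raphson MLE for P(Y=1) = 1 / (1 + exp(-(b0 + b1 * x))).

    Returns (b0, b1, iterations). Starting values and tolerance are
    illustrative assumptions.
    """
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]

        # Gradient of the log-likelihood with respect to b0 and b1
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))

        # Information matrix X^T W X with weights w_i = p_i * (1 - p_i)
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01  # zero if all x values are identical

        # Newton step: solve the 2x2 system H * delta = g in closed form
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1

        if max(abs(d0), abs(d1)) < tol:
            return b0, b1, iteration
    raise RuntimeError("Newton-Raphson did not converge")

b0, b1, n_iter = fit_logistic([2, 4, 6, 8, 10], [1, 0, 1, 0, 1])
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

On the example data the sketch recovers an intercept of 0.4055 and a slope of 0 (up to floating-point rounding). With perfectly separable data the coefficients would keep growing and the function would raise the convergence error instead.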

Calculation Process

Identified 0 outliers using Modified Z-score (threshold 3.5).
Performed Maximum Likelihood Estimation via Newton-Raphson. Converged in 4 iterations.

Point-wise predictions:

Point 1: x = 2 -> z = 0.405 -> P(Y = 1) = 0.600 (Actual: 1)
Point 2: x = 4 -> z = 0.405 -> P(Y = 1) = 0.600 (Actual: 0)
Point 3: x = 6 -> z = 0.405 -> P(Y = 1) = 0.600 (Actual: 1)
Point 4: x = 8 -> z = 0.405 -> P(Y = 1) = 0.600 (Actual: 0)
Point 5: x = 10 -> z = 0.405 -> P(Y = 1) = 0.600 (Actual: 1)

Results

Estimated Coefficients:

Intercept (β₀): 0.405

Slope (β₁): 0.000

Interpretation

The slope coefficient $β_{1} = 0$ indicates that mineral concentration has no detectable effect on the log-odds of plant presence.

The alternating pattern of responses (1, 0, 1, 0, 1) shows no upward or downward trend with $x$ , so the fitted logistic regression collapses to the intercept-only model.

Summary

The logistic regression model converged but found no relationship between mineral concentration and plant presence, resulting in an intercept-only model with a constant predicted probability of 0.600 across all values of x.

The zero slope coefficient ( $β_{1} = 0$ ) reflects the alternating response pattern, which provides no monotonic trend for the model to learn, leading to McFadden's R² = 0.000 and identical predictions for all observations.

Overall, the model indicates that mineral concentration does not explain species presence in this dataset, and additional predictors or a larger sample may be required to identify meaningful ecological relationships.

Understanding the result

The fitted coefficients link the predictor to the outcome probability through the logistic function. An accuracy of 100% may indicate perfect separation rather than a well-behaved fit. McFadden's R-squared values run lower than the R-squared of linear regression, and values between 0.2 and 0.4 are commonly taken to indicate a very good model fit. The F1-score is the harmonic mean of precision and recall, summarising predictive performance for the positive class in a single number.
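The classification metrics can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, and a probability of 0.5 or more is assumed to be classified as 1, which may differ from the cut-off the calculator uses.

```python
def classification_metrics(y_true, probs, threshold=0.5):
    """Accuracy, precision, recall and F1 for the positive class (1).

    The 0.5 threshold is an assumed default, not taken from the calculator.
    """
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yh in zip(y_true, preds) if y == 1 and yh == 1)
    tn = sum(1 for y, yh in zip(y_true, preds) if y == 0 and yh == 0)
    fp = sum(1 for y, yh in zip(y_true, preds) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, preds) if y == 1 and yh == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted/present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Worked example: every observation has a predicted probability of 0.6
metrics = classification_metrics([1, 0, 1, 0, 1], [0.6] * 5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under the assumed threshold, the constant prediction of 0.6 labels every point as 1, giving accuracy 0.600, precision 0.600, recall 1.000 and F1 0.750.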

Assumptions and limitations

The calculation assumes the observations are independent and that the relationship between the log-odds and the predictor is linear. It requires that the data is not perfectly separable; otherwise, the Newton-Raphson iteration may fail to converge as coefficients tend toward infinity.

Common mistakes to avoid

A frequent error is entering non-binary values for the response variable; this calculator strictly requires 0 or 1. Another mistake is reading the slope coefficient as a direct change in probability. It is the change in the log-odds of the response for a one-unit increase in $x$ , so the corresponding change in probability depends on where along the curve it is evaluated.
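The difference between log-odds and probability can be seen in a short sketch. The coefficients below are hypothetical values chosen for illustration and do not come from the worked example.

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (illustrative only)
beta0, beta1 = -3.0, 0.5

# A one-unit increase in x always multiplies the odds by exp(beta1) ...
odds_ratio = math.exp(beta1)
print(f"odds ratio per unit of x: {odds_ratio:.4f}")

# ... but the change in probability depends on the starting value of x
changes = []
for x in (2, 6, 10):
    before = sigmoid(beta0 + beta1 * x)
    after = sigmoid(beta0 + beta1 * (x + 1))
    changes.append(after - before)
    print(f"x = {x} -> {x + 1}: P rises from {before:.4f} to {after:.4f} "
          f"(change {after - before:.4f})")
```

The odds ratio is the same at every $x$ , while the probability change is largest near the middle of the curve and smaller in the tails.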

Sensitivity and robustness

The model is sensitive to outliers, which is why a modified Z-score filter is included. High sensitivity settings will remove more points, potentially stabilising the curve but reducing the sample size. Small datasets may result in high standard errors, making the estimated parameters less reliable for broader inference.
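A modified Z-score filter can be sketched as below. This assumes the commonly used form $M_{i} = 0.6745 (x_{i} - \text{median}) / \text{MAD}$ ; whether the calculator uses exactly this constant, and how it treats a zero MAD, is not stated here, so both are assumptions. The function names are hypothetical, and only the 3.5 threshold appears in the calculator's output.

```python
import statistics

def modified_z_scores(values):
    """Median-based modified Z-scores: 0.6745 * (x - median) / MAD.

    The 0.6745 constant is the conventional choice and is assumed here.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Assumed handling: with no spread around the median, flag nothing
        return [0.0] * len(values)
    return [0.6745 * (v - med) / mad for v in values]

def find_outliers(values, threshold=3.5):
    """Indices of values whose absolute modified Z-score exceeds threshold."""
    scores = modified_z_scores(values)
    return [i for i, m in enumerate(scores) if abs(m) > threshold]

example = [2, 4, 6, 8, 10]        # predictor values from the worked example
contaminated = [2, 4, 6, 8, 100]  # hypothetical data with one extreme value

print(find_outliers(example))       # no index exceeds the threshold
print(find_outliers(contaminated))  # flags the last value
```

A lower threshold flags more points, which is the trade-off between a more stable curve and a smaller sample.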

Troubleshooting

If a convergence error occurs, the dataset might be "perfectly separable," meaning a single threshold on $x$ divides all 0s from all 1s. Ensure that both classes (0 and 1) are present in the response list and that the number of entries matches the predictor list exactly.
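These checks can be run before fitting. The sketch below is a hypothetical pre-flight function, not part of the calculator; it tests only for strict separation on a single predictor and does not detect quasi-separation, where the two classes share a boundary value.

```python
def check_inputs(xs, ys):
    """Return a list of problems that would prevent a reliable fit."""
    problems = []
    if len(xs) != len(ys):
        problems.append("predictor and response lists differ in length")
    if any(y not in (0, 1) for y in ys):
        problems.append("response contains values other than 0 and 1")
    if len(set(ys)) < 2:
        problems.append("response contains only one class")

    # Strict separation: every x of one class lies below every x of the other
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if zeros and ones and (max(zeros) < min(ones) or max(ones) < min(zeros)):
        problems.append("data are perfectly separable")
    return problems

ok = check_inputs([2, 4, 6, 8, 10], [1, 0, 1, 0, 1])
separable = check_inputs([1, 2, 3, 4], [0, 0, 1, 1])
one_class = check_inputs([1, 2, 3], [1, 1, 1])

print(ok)
print(separable)
print(one_class)
```

The worked example passes all checks, while the second and third inputs illustrate the two situations most likely to cause a convergence error.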

Frequently asked questions

What is the sigmoid function?

It is an S-shaped curve that maps any real-valued number into a value between 0 and 1, representing a probability.

How does the calculator handle outliers?

It uses a median-based modified Z-score to identify and exclude extreme values based on the selected sensitivity level (Strict, Standard, or Loose).

What does the AIC value indicate?

The Akaike Information Criterion estimates the relative quality of the model; lower values generally indicate a better fit while penalising for unnecessary complexity.

Where this calculation is used

This statistical method is fundamental in population studies and social research for predicting binary events, such as whether a student will pass an exam based on study hours or whether a participant will respond to a specific stimulus. It is widely used in applied statistics and predictive modelling to understand the drivers behind categorical outcomes. In academic settings, it is a common first introduction to generalised linear models, moving beyond standard linear regression to handle non-normal error distributions.

Results are based on standard mathematical and statistical methods and may involve rounding or approximation. If precise accuracy is required, please verify results independently.