Introduction
This tool performs logistic regression on a dataset to analyse binary outcomes. It is designed for students and researchers exploring the relationship between a continuous predictor variable and a binary (two-category) response. By modelling the probability that a specific event occurs, it shows how changes in the independent variable influence the likelihood of that event.
What this calculator does
The calculator processes pairs of numerical data to estimate the coefficients of a logit model. Users provide a series of predictor values and corresponding binary responses (0 or 1). The calculator applies outlier detection via modified Z-scores and uses the Newton-Raphson method for Maximum Likelihood Estimation. It produces the intercept, the slope coefficient, and their standard errors, along with classification metrics (accuracy, precision, and recall) and McFadden's pseudo-R-squared as a measure of fit.
Formula used
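The model estimates the probability of the response being 1 using the sigmoid function, $p = 1 / (1 + e^{-z})$. The linear predictor is calculated as $z = \beta_0 + \beta_1 x$, where $\beta_0$ is the intercept, $\beta_1$ is the slope coefficient, and $x$ is the predictor value. Parameters are refined iteratively to maximise the log-likelihood function. Goodness of fit is evaluated using the Akaike Information Criterion and McFadden's pseudo-R-squared, which compares the log-likelihood of the fitted model with that of the null (intercept-only) model.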
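The iterative fit described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the calculator's actual code: the function names (`sigmoid`, `log_likelihood`, `fit_logistic`), the starting values of zero, the iteration cap, and the tolerance are all hypothetical choices.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(xs, ys, b0, b1):
    """Bernoulli log-likelihood: sum of y*z - log(1 + e^z)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        # log(1 + e^z) computed in a numerically stable form
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Newton-Raphson for a one-predictor logit model (illustrative sketch).

    Returns (intercept, slope, se_intercept, se_slope).
    """
    b0, b1 = 0.0, 0.0  # assumed starting values
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        i00 = i01 = i11 = 0.0    # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        if det <= 1e-12:
            raise ArithmeticError("information matrix is singular")
        # Newton step: beta <- beta + I^{-1} g, using the 2x2 inverse
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            # Standard errors from the diagonal of the inverse information
            # matrix, evaluated just before the final (negligible) step
            return b0, b1, math.sqrt(i11 / det), math.sqrt(i00 / det)
    raise ArithmeticError("Newton-Raphson did not converge")
```

On data where the two classes overlap, a call such as `fit_logistic([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 1, 0, 1, 0, 1, 1])` should settle within a handful of iterations. On perfectly separated data the loop is expected to end in one of the two error branches.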
How to use this calculator
1. Enter the predictor variable values as a comma-separated list.
2. Input the corresponding response variable values, ensuring only 0s and 1s are used.
3. Select the preferred outlier sensitivity and decimal place precision.
4. Execute the calculation to view the coefficients, statistical metrics, and visual probability curves.
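The input rules in the steps above can be expressed as a small validation routine. The sketch below assumes the two lists arrive as comma-separated strings. The function name `parse_inputs` and its error messages are hypothetical and are not taken from the calculator itself.

```python
def parse_inputs(predictor_text, response_text):
    """Parse two comma-separated strings into a list of floats and a list of 0/1 ints."""
    xs = [float(tok) for tok in predictor_text.split(",") if tok.strip()]

    ys = []
    for tok in response_text.split(","):
        tok = tok.strip()
        if not tok:
            continue  # ignore empty entries such as a trailing comma
        if tok not in ("0", "1"):
            raise ValueError(f"response values must be 0 or 1, got {tok!r}")
        ys.append(int(tok))

    if len(xs) != len(ys):
        raise ValueError(
            f"{len(xs)} predictor values but {len(ys)} response values"
        )
    if len(set(ys)) < 2:
        raise ValueError("both classes (0 and 1) must be present")
    return xs, ys
```

Raising an error early, before any fitting is attempted, keeps malformed input from surfacing later as a confusing convergence failure.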
Example calculation
Scenario: A researcher in environmental science analyses whether a specific concentration of a soil mineral (in mg/kg) results in the presence (1) or absence (0) of a particular plant species.
Inputs: Predictor values: 10, 20, 30, 40; Response values: 0, 0, 1, 1.
Working:
Step 1:
Step 2:
Step 3:
Step 4:
Result: Intercept and Coefficient values are generated alongside a probability curve.
Interpretation: The slope coefficient indicates the change in the log-odds of the plant presence for every unit increase in mineral concentration.
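Summary: The fitted curve locates the concentration at which the probability of plant presence exceeds 0.5, which for these four points lies between 20 and 30 mg/kg. Because every 0 falls below every 1, this small dataset is perfectly separated, so the coefficient estimates may be unstable (see Assumptions and limitations).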
Understanding the result
The results link the predictor to the outcome probability. An accuracy of 100% may indicate perfect separation rather than a genuinely reliable model, while McFadden's R-squared values between 0.2 and 0.4 are conventionally taken to represent a very good model fit. The F1-score is the harmonic mean of precision and recall and summarises predictive performance for the positive class.
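The metrics named above follow from standard definitions, sketched here for reference. The function names, the 0.5 classification threshold, and the choice to report 0.0 when a ratio is undefined are assumptions made for illustration. The calculator's own conventions are not stated on this page.

```python
import math


def classification_metrics(ys, probs, threshold=0.5):
    """Accuracy, precision, recall and F1 for the positive class (y = 1)."""
    tp = fp = tn = fn = 0
    for y, p in zip(ys, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(ys),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def fit_statistics(ys, probs, n_params=2):
    """Log-likelihood, McFadden's pseudo-R-squared and AIC.

    Assumes every fitted probability lies strictly between 0 and 1
    and that both classes are present in ys.
    """
    ll_model = sum(math.log(p) if y == 1 else math.log(1.0 - p)
                   for y, p in zip(ys, probs))
    p_bar = sum(ys) / len(ys)  # null model predicts the overall event rate
    ll_null = sum(math.log(p_bar) if y == 1 else math.log(1.0 - p_bar)
                  for y in ys)
    return {
        "log_likelihood": ll_model,
        "mcfadden_r2": 1.0 - ll_model / ll_null,
        "aic": 2 * n_params - 2 * ll_model,
    }
```

Here `probs` stands for the fitted probabilities from any logistic model. The default `n_params=2` corresponds to one intercept and one slope.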
Assumptions and limitations
The calculation assumes the observations are independent and that the relationship between the log-odds and the predictor is linear. It requires that the data is not perfectly separable; otherwise, the Newton-Raphson iteration may fail to converge as coefficients tend toward infinity.
Common mistakes to avoid
A frequent error is inputting non-binary values for the response variable, as this calculator strictly requires 0 or 1. Another mistake is misinterpreting the coefficient as a direct change in probability; it actually represents the change in the log-odds of the response variable.
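The distinction between log-odds and probability is easiest to see numerically. In the sketch below the coefficients `b0 = -5.0` and `b1 = 0.2` are hypothetical values chosen for illustration and are not output from this calculator.

```python
import math


def probability(b0, b1, x):
    """Predicted probability from a one-predictor logit model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


b0, b1 = -5.0, 0.2  # hypothetical coefficients

# A one-unit increase in x always multiplies the odds by exp(b1) ...
odds_ratio = math.exp(b1)
ratio_at_25 = odds(probability(b0, b1, 26)) / odds(probability(b0, b1, 25))
ratio_at_45 = odds(probability(b0, b1, 46)) / odds(probability(b0, b1, 45))

# ... but the change in probability depends on where x sits on the curve.
delta_mid = probability(b0, b1, 26) - probability(b0, b1, 25)    # near p = 0.5
delta_tail = probability(b0, b1, 46) - probability(b0, b1, 45)   # near p = 1

print(f"odds ratio per unit: {odds_ratio:.4f}")
print(f"probability change near the midpoint: {delta_mid:.4f}")
print(f"probability change in the upper tail: {delta_tail:.4f}")
```

The odds ratio is the same at both locations, while the probability change is much larger near the midpoint of the curve than in the tail. This is why the slope should not be read as a fixed change in probability.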
Sensitivity and robustness
The model is sensitive to outliers, which is why a modified Z-score filter is included. High sensitivity settings will remove more points, potentially stabilising the curve but reducing the sample size. Small datasets may result in high standard errors, making the estimated parameters less reliable for broader inference.
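A median-based modified Z-score filter of the kind described here might look like the following sketch. The scaling constant 0.6745 and the cutoff of 3.5 are the commonly cited conventions for this statistic. The thresholds behind the calculator's own sensitivity settings are not given on this page, so the default `cutoff` is an assumption, as are the function names.

```python
import statistics


def modified_z_scores(values):
    """Modified Z-scores based on the median and the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption: with zero MAD the score is undefined, so flag nothing
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def filter_outliers(xs, ys, cutoff=3.5):
    """Drop (x, y) pairs whose predictor has |modified Z| above the cutoff."""
    scores = modified_z_scores(xs)
    kept = [(x, y) for x, y, s in zip(xs, ys, scores) if abs(s) <= cutoff]
    return [x for x, _ in kept], [y for _, y in kept]
```

A lower `cutoff` removes more points, which corresponds to a stricter setting, while a higher one retains more of the sample.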
Troubleshooting
If a convergence error occurs, the dataset might be "perfectly separable," meaning a single threshold on the predictor divides all 0s from all 1s. Also ensure that both classes (0 and 1) are present in the response list and that the number of entries matches the predictor list exactly.
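With a single predictor, perfect separation can be checked before fitting by comparing the ranges of the two classes. The sketch below is a hypothetical helper, not part of the calculator. It tests only for strict separation, so datasets where the classes merely touch at one shared predictor value are not flagged, even though they can also cause estimation problems.

```python
def is_perfectly_separated(xs, ys):
    """Return True if one predictor threshold splits all 0s from all 1s."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:
        raise ValueError("both classes (0 and 1) must be present")
    # Separated if the two classes occupy non-overlapping ranges
    return max(zeros) < min(ones) or max(ones) < min(zeros)
```

Applied to the example inputs on this page (predictors 10, 20, 30, 40 with responses 0, 0, 1, 1), the check returns `True`, because the largest predictor among the 0s (20) is below the smallest among the 1s (30).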
Frequently asked questions
What is the sigmoid function?
It is an S-shaped curve that maps any real-valued number into a value between 0 and 1, representing a probability.
How does the calculator handle outliers?
It uses a median-based modified Z-score to identify and exclude extreme values based on the selected sensitivity level (Strict, Standard, or Loose).
What does the AIC value indicate?
The Akaike Information Criterion estimates the relative quality of the model; lower values generally indicate a better fit while penalising for unnecessary complexity.
Where this calculation is used
This statistical method is fundamental in population studies and social research for predicting binary events, such as whether a student will pass an exam based on study hours or whether a participant will respond to a specific stimulus. It is widely used in applied statistics and predictive modelling to understand the drivers behind categorical outcomes. In academic settings, it often serves as the first introduction to generalised linear models, moving beyond standard linear regression to handle non-normal error distributions.
Results are based on standard mathematical and statistical methods and may involve rounding or approximation. If precise accuracy is required, please verify results independently.