INTRODUCTION
The significant impact of family instability on children’s behavioral and emotional well-being has been well documented across various studies (Fomby et al., 2016). However, the challenge of pinpointing the specific aspects of family instability that profoundly affect child behavior remains. The definition of the family as the cornerstone of society has undergone significant evolution due to demographic shifts and sociocultural changes. The diversification of family structures, including marriage, divorce, cohabitation, remarriage, same-sex unions, and nonmarital fertility, has introduced complexities into the notion of family stability within the United States (Cavanagh and Fomby, 2019). Consequently, a considerable proportion of American children experience transitions across multiple family structures during their developmental years (Brown et al., 2017). This changing landscape has prompted scholars to examine the impacts of diverse family configurations on child behaviors more closely (Coe et al., 2020), noting that chronic family instability, marked by erratic parenting practices, unpredictable routines, and fluctuating economic and social resources, can severely hinder children’s adaptability (Cooper et al., 2015).
This study utilizes artificial intelligence (AI) and machine learning (ML) as advanced methodologies to analyze and predict the intricate impacts of fragile family environments on children. Prior scholarly endeavors have highlighted the potential efficacy of these approaches: Kumar’s (2022) examination of gender-related disparities in academic performance predictors through tree-based regression; Carnegie and Wu’s (2019) use of Bayesian generalized linear models and Bayesian additive regression trees (BART) to forecast child outcomes within fragile family contexts; and the work by Rigobon et al. (2019), which emphasized collaborative methods encompassing data preprocessing and feature selection techniques such as mutual information (MI) and shrinkage methods, alongside ML models like Random Forest and Gradient-Boosted Trees. Nevertheless, these investigations commonly encounter challenges in accurately identifying the influence of individual predictors and achieving optimal prediction or classification performance.
Building upon this foundation, our research introduces a comprehensive methodology that employs ML to identify key determinants and predict outcomes for children from vulnerable family backgrounds. Addressing the multifaceted challenges these children face, including learning disabilities [evidenced by grade point average (GPA) scores], resilience (measured by grit), and material hardship (correlated with disabilities), this study aims to provide actionable insights to inform policy and enhance intervention strategies. By deploying advanced ML techniques and rigorous statistical modeling, we delve into a detailed examination of demographic, familial, and neighborhood characteristics to predict children’s future outcomes more accurately. Our contributions are manifold:
We implemented meticulous data preprocessing, including the removal of records with extensive missing values, discarding features with minimal variance, and the strategic imputation of medians for categorical data and means for numerical data. A novel feature selection approach was adopted, utilizing MI (Vergara and Estévez, 2014), the least absolute shrinkage and selection operator (LASSO) (Fonti and Belitser, 2017), and tree-based methods (Hasan et al., 2016) to refine the dataset and mitigate overfitting.
Recognizing the critical challenge of class imbalance within our dataset, we integrated the Synthetic Minority Over-sampling Technique (SMOTE) (Elreedy and Atiya, 2019) to equalize the classes, thereby enhancing the models’ ability to generalize across diverse scenarios.
A suite of ML models, including Random Forest (Biau and Scornet, 2016), XGBoost (Chen et al., 2015), Neural Networks [multilayer perceptron (MLP)] (Abiodun et al., 2018), and a Stacking Ensemble Model, were evaluated on the Future of Families and Child Wellbeing Study (FFCWS, n.d.) dataset. These models were fine-tuned using Bayesian optimization to achieve superior prediction accuracy.
Through the application of these methodologies, our study offers significant empirical insights for developing tailored policies and interventions that address the unique needs of children from fragile families.
In essence, our research extends beyond the limitations of previous studies to offer a holistic and nuanced understanding of the developmental factors affecting children within fragile family contexts. Our innovative use of ML techniques aims to automate the prediction and classification of behavioral impacts, paving the way for more effective and focused support mechanisms.
RELATED WORKS
Numerous studies have explored the application of ML to understand the dynamics of child disabilities within fragile families. This section offers a review of notable contributions in this area.
The study by Salganik et al. (2020) engaged over 100 global research teams in predicting various life outcomes using data from the FFCWS, a longitudinal survey in the United States. Teams utilized socioeconomic, health, education, and family data at ages 1, 3, 5, 9, and 15 to predict outcomes at age 15, including material hardship, GPA, grit, household eviction, job training participation, and caregiver layoff. Despite extensive data and ML techniques, predictability of outcomes at age 15 remained relatively low, with the best model achieving an R² of 0.19 for GPA. Notably, classic statistical models performed comparably to advanced ML algorithms (Ahearn and Brand, 2019). However, the challenge spurred research into effective data wrangling and predictive modeling techniques, with gradient boosting and regularized regression models showing promising performance in predicting GPA (Raes, 2019).
Rigobon et al. (2019) detail their participation in the Fragile Families Challenge, utilizing a prediction challenge dataset from the FFCWS. Their collaborative and modular approach encompassed data preprocessing, feature engineering, feature selection, model development, and prediction aggregation. Utilizing data science techniques such as MI, LASSO, elastic net, Random Forest, and Gradient-Boosted Trees, they generated predictions for six outcomes. Their entries ranked highly, achieving first place in predicting GPA, grit, and layoff, third in job training, ninth in material hardship, and eleventh in eviction. They also reflect on the challenges encountered and propose directions for future research.
Compton (2019) endeavors to explore the potential of employing data-driven ML techniques in addressing sociological challenges, departing from traditional theoretical approaches. The study specifically focuses on the application of feature engineering and the optimization of predictive models to identify families at risk within the Fragile Families Challenge context. Through the utilization of principal-component analysis and decision tree modeling, the author aims to predict six primary dependent variables. While demonstrating success in modeling one binary variable, the study reveals constraints in accurately predicting continuous dependent variables. This observation underscores the nuanced nature of predictability concerning dependent variables, suggesting that varying levels of complexity in independent variables may influence predictive outcomes.
Kindel et al. (2019) outline a redesign of the metadata system for the FFCWS, inspired by experiences from the Fragile Families Challenge. By treating metadata as data, the authors aim to simplify data preparation processes for various analyses. This approach, exemplified through open-source tools, offers potential for enhancing machine learning applications in longitudinal surveys and stimulating research on data preparation in social sciences. Traditionally, social scientists rely on metadata systems for navigating and interpreting datasets, which often require significant investment to master. However, by reimagining metadata as data and streamlining access through machine-actionable formats, the redesigned system seeks to address scalability challenges and facilitate more efficient research methodologies in the social sciences.
A research article by social scientist Stephen McKay (2019) reports on his experience in the Fragile Families Challenge. McKay leveraged his background in social science and statistical methods for variable selection and modeling to predict six outcomes. His models, especially for material hardship and layoff, proved competitive against ML approaches, highlighting the value of integrating social science insights into predictive modeling.
The study by Filippova et al. (2019) explores the integration of human expertise with ML for predicting six outcomes in the FFCWS. The authors solicited expert and lay opinions to evaluate variable relevance, informing data selection or weighting in regression models. Their findings suggest that human-augmented methods did not consistently improve—and sometimes detract from—prediction accuracy, leading to a discussion on the approach’s limitations and potential areas for future investigation.
The study by Carnegie and Wu (2019) describes a nuanced approach to the Fragile Families Challenge, employing BART alongside collaborative and modular data processing and variable selection techniques. The authors assessed various variable selection methods, including LASSO and horseshoe prior, and examined the influence of tree quantity in BART models. While recognizing BART’s strengths in predictive modeling and causal inference, they acknowledge the need for deeper analysis to elucidate significant associations.
Finally, Prendergast and MacPhee (2021) examine the impact of family risk factors at birth on kindergarten success and Child Protective Services (CPS) engagement. Through cumulative risk and latent class analysis, the study identifies correlations between risk factors and academic and behavioral outcomes, as well as differing patterns of CPS involvement. This research underscores the potential of early risk screening to inform preventative programs and services.
Our study offers several advantages over existing research. First, we employ a comprehensive methodology, integrating multiple ML techniques and feature selection methods. This approach allows for a holistic prediction of outcomes for children in fragile families, considering various factors simultaneously. Additionally, we address class imbalance by applying SMOTE, enhancing model generalizability. Furthermore, evaluating multiple ML models enables us to identify the most effective approach for predicting outcomes in fragile family contexts. This thorough analysis ensures optimal model selection for accurate predictions. Lastly, our research provides policy-relevant insights, informing the development of tailored policies and interventions for children in fragile families, based on empirical evidence.
DATA ANALYSIS AND DATA PREPROCESSING
In this paper, we propose a novel hybrid two-way feature selection technique that integrates MI regression and the LASSO to identify relevant and impactful features for our regression analysis of GPA and grit. Figure 1 provides an overview of the workflow adopted in this study. After the feature selection process, we trained our selected base models, which include Random Forest, XGBoost, and Neural Networks, to generate predictions for material hardship, GPA, and grit. For performance evaluation, we utilized mean squared error (MSE), accuracy score, area under the curve (AUC), and F1 score for material hardship; and MSE, mean absolute error (MAE), and R² for grit and GPA. Moreover, we applied a stacking-based ensemble technique leveraging the three selected base models and evaluated the Ensemble Model with appropriate performance metrics, comparing the results with those of the base models across all three targeted features. Our findings indicate that the use of an ensemble technique significantly enhances model performance, yielding superior outcomes. Detailed comparative analyses of all base models and the Stacking Ensemble Model are presented in the Result Analysis section.
Dataset
The dataset utilized in this study is the FFCWS (n.d.). This dataset encompasses information collected from 4,898 families, representing a diverse cross-section of ethnicities, including Black, Hispanic, and low-income families. The survey was conducted across major US cities, each with a population exceeding 200,000, between the years 1998 and 2000. The FFCWS dataset comprises a core survey targeting primarily mothers, fathers, and primary caregivers. The subsequent sections will delve into further analysis and preprocessing of this dataset.
Data analysis
The dataset includes data about 4,898 families and 17,002 variables, assigning a unique ID number to each family. The dataset potentially houses approximately 83 million entries. However, 68.38% of these entries were found to be missing. The predominant reasons for missing data include: (i) noninclusion of participants in certain survey waves, accounting for roughly 29.71% of the missing entries; (ii) refusal or inability of respondents to answer specific questions, contributing to less than 1% of the missing entries; (iii) loss of data due to various reasons, which accounts for 8.18% of the missing entries; and (iv) survey questions being skipped because they did not apply to the participant or the answers could be inferred from other provided information, constituting 28.20% of the missing entries. For more details, see Figure 2.
Data preprocessing
To ensure the accuracy and validity of our analysis, we employed several strategies for managing missing data and outliers. Variables exhibiting more than 60% missing data were excluded, as their contribution to the analysis was deemed insignificant. For categorical variables, median values were imputed to maintain their categorical nature, while mean values were imputed for numerical variables to preserve data distribution. Furthermore, unrecognized strings were removed to prevent errors and inconsistencies, and variables with standard deviations below 0.05 were excluded to concentrate on variables demonstrating substantial variability. These measures were aimed at enhancing data quality and bolstering the robustness of our results.
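For concreteness, a minimal pandas sketch of these cleaning rules is shown below. The thresholds (60% missingness, 0.05 standard deviation) come from the text; the function name and the `cat_cols`/`num_cols` column lists are illustrative, and categorical variables are assumed to be integer-coded, as in the FFCWS survey files.

```python
import pandas as pd

def preprocess(df, cat_cols, num_cols):
    # Drop variables with more than 60% missing entries.
    df = df.loc[:, df.isna().mean() <= 0.60]
    cat_cols = [c for c in cat_cols if c in df.columns]
    num_cols = [c for c in num_cols if c in df.columns]

    # Coerce unrecognized strings in numeric columns to NaN before imputing.
    df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")

    # Impute medians for integer-coded categorical variables,
    # means for numerical variables.
    df[cat_cols] = df[cat_cols].fillna(df[cat_cols].median())
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

    # Drop near-constant variables (standard deviation below 0.05).
    stds = df[num_cols].std()
    return df.drop(columns=stds[stds < 0.05].index)
```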
It was observed that the dataset utilized in this study exhibited a class imbalance problem (see Fig. 3). Class imbalance, a prevalent challenge in ML, occurs when there is a significant discrepancy in the number of samples between classes. This issue is known to adversely affect model performance, as models trained on imbalanced data may overfit the majority class while neglecting the minority class. To mitigate this problem, we implemented SMOTE in our research (Elreedy and Atiya, 2019).
SMOTE, an oversampling method, creates new synthetic samples for the minority class by interpolating between existing samples, thus achieving a balanced training set without eliminating valuable data from the majority class. This technique has been demonstrated to enhance model performance on imbalanced datasets across various fields by allowing models to learn more comprehensive representations of the minority class.
In our analysis, SMOTE was applied to address the significant class imbalance observed in the target variable, specifically material hardship, where the majority of samples were from the negative class and the positive class was underrepresented (see Fig. 3). This imbalance posed potential issues for model performance on the minority class. By generating synthetic positive samples through SMOTE, we created a more balanced training set, enabling the development of a model with improved generalization capabilities across classes. Utilizing SMOTE to handle class imbalance emerged as a vital preprocessing step, significantly enhancing model performance in the challenge of predicting material hardship in an imbalanced dataset.
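A minimal sketch of this step with the imbalanced-learn library follows; the 80/20 split, the random seed, and the variable names are assumptions, and SMOTE is fit on the training split only so that no synthetic samples leak into evaluation.

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# X: preprocessed features; y: binary material-hardship label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Interpolate synthetic minority-class samples on the training set only.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```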
In Figure 4, it is evident that SMOTE oversampling has been applied to the imbalanced dataset, resulting in a balanced class distribution.

Figure 4. Data distribution showing balanced data after applying SMOTE on the target variable: material hardship.
Moreover, an analysis of the numerical features revealed disparities in their value scales. To address this issue, min-max normalization was employed to rescale the features to a uniform range between 0 and 1. This normalization technique is widely utilized for feature scaling and is instrumental in enhancing the performance of ML algorithms that are sensitive to the scale of input features.
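Min-max normalization maps each feature value x to (x − x_min)/(x_max − x_min). A sketch with scikit-learn’s MinMaxScaler, fitting on the training data only (variable names carry over from the sketches above and are likewise illustrative):

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on training data and reuse the same transform on the test
# set, so test-set statistics never influence the scaling.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train_bal)
X_test_scaled = scaler.transform(X_test)
```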
FEATURE SELECTION
Feature selection is a crucial step in the data preprocessing pipeline. It aims to reduce the dimensionality of input features. High-dimensional datasets can be complex and contain many irrelevant, redundant, or noisy features that degrade the performance of ML models. Feature selection can improve the generalization ability of ML models, reduce overfitting, and enhance computational efficiency by selecting relevant and informative feature subsets.
In this research, three feature selection methods were used: (i) MI (Battiti, 1994; Vergara and Estévez, 2014), (ii) LASSO (Zou, 2006; Fonti and Belitser, 2017), and (iii) tree-based methods (Hasan et al., 2016). MI and LASSO were used as a two-layer feature selection technique for the target variables GPA and grit. For the binary classification task on material hardship, a tree-based approach was used to select features. These methods effectively identified and selected the most informative features, improving model performance, interpretability, and computational efficiency. Brief details of these feature selection techniques are provided below.
Mutual information
MI is a measure between two (potentially multidimensional) random variables, X and Y, that quantifies the amount of information obtained about one variable through the other. As a nonparametric measure, it can accommodate both continuous and categorical variables, making it easy to interpret and computationally efficient. MI has been widely used in feature selection since the seminal work by Battiti (1994). The equation for MI is shown in Equation (1):

$$I(X;Y) = \int_{Y}\int_{X} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy \tag{1}$$
In this context, p(x, y) denotes the joint probability density function of X and Y, whereas p(x) and p(y) represent their respective marginal density functions. MI gauges the similarity between the joint distribution p(x, y) and the product of the individual marginal distributions. When X and Y are entirely unrelated (thus independent), p(x, y) equals p(x)p(y), resulting in the MI value being zero. This principle underlies its utility in determining the extent of dependence or independence between variables (Doquire and Verleysen, 2013).
In this study, we employed MI as one of the key feature selection techniques to identify important features from the dataset. MI is utilized to evaluate the dependency between each feature and the target variable, enabling the effective selection of relevant features for our analysis. The principle is straightforward: the higher the MI score between a feature and the target, the greater the predictive power of that feature regarding the target variable. Consequently, features highly correlated with the target variable exhibit high MI scores, indicating their relevance. In contrast, irrelevant features, which contribute little to understanding the target variable, register low scores. As part of initial feature screening, we eliminated features with an MI score of zero, refining the feature set.
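A sketch of this screening layer is given below, assuming scikit-learn’s nearest-neighbor-based MI estimator (the paper does not name a specific estimator) and a continuous target such as GPA or grit; no oversampling applies to the regression targets, and variable names are illustrative.

```python
from sklearn.feature_selection import mutual_info_regression

# Estimate MI between each feature and the continuous target, then drop
# features whose estimated MI is zero (first selection layer).
mi_scores = mutual_info_regression(X_train_scaled, y_train, random_state=42)
keep = mi_scores > 0
X_train_mi = X_train_scaled[:, keep]
X_test_mi = X_test_scaled[:, keep]
```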
LASSO
The LASSO was employed as the second layer of feature selection (Fonti and Belitser, 2017). LASSO regression, a regularization technique, is utilized to enhance prediction accuracy beyond that of standard regression methods. It introduces a penalty term to the cost function of the linear regression model, proportional to the absolute values of the model coefficients. Consequently, the coefficients are “shrunk” toward zero, resulting in a sparser model that relies on fewer features. “Shrinkage” describes the process by which data values are reduced toward a central point, like the mean. Equation (2) illustrates the concept of LASSO regularization:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 + \lambda\,\|\beta\|_1\right\} \tag{2}$$

This equation is used to perform LASSO feature selection, where $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$ and $\beta_j$ is the coefficient of the jth feature. λ is a nonnegative regularization parameter that tunes the intensity of this penalty term. The second term in (2) is the so-called “L1 penalty,” which is crucial for the success of the LASSO (Fonti and Belitser, 2017).
The primary objective of LASSO feature selection is to discern a subset of the most impactful predictor variables with respect to the response variable. This is achieved by minimizing the residual sum of squares plus the L1-norm of the regression coefficients, the latter multiplied by the regularization parameter λ. As λ increases, the coefficients associated with less important features are progressively reduced to zero and ultimately excluded from the model. This method effectively conducts feature selection by preserving only those features with the most significant coefficients.
The choice of λ is crucial, as it dictates the number of features that are retained in the model. Employing cross-validation to select the optimal λ value ensures that the model generalizes well to new data samples (Browne, 2000), fine-tuning the regularization strength and the extent of shrinkage applied to the model’s coefficients. The GridSearchCV method was utilized to ascertain the most suitable λ value.
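A sketch of this second layer, assuming scikit-learn’s Lasso (which exposes λ as `alpha`) and an illustrative logarithmic grid; the paper specifies GridSearchCV but not the grid itself:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Cross-validated grid search over the regularization strength lambda.
grid = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": np.logspace(-4, 1, 20)},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_train_mi, y_train)

# Retain only features whose coefficients were not shrunk to zero.
selected = np.flatnonzero(grid.best_estimator_.coef_)
X_train_sel = X_train_mi[:, selected]
X_test_sel = X_test_mi[:, selected]
```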
Tree-based methods
In this study, to address the binary classification problem for material hardship, a tree-based approach was utilized to identify the influential features from the dataset. More precisely, the well-known Random Forest algorithm was employed for feature selection (Dimitriadis et al., 2018). Random Forest feature selection is recognized as a robust method that leverages the power of Random Forest algorithms to discern and prioritize the most vital features within large, high-dimensional datasets. By implementing a metric-based ranking system, this strategy effectively eliminates features that do not meet a predetermined relevance threshold, concentrating exclusively on those of utmost significance. The primary goal of this technique is to reduce the dataset’s complexity by focusing on essential features. This not only enhances the model’s accuracy but also reduces the likelihood of overfitting. The feature selection process plays a crucial role in improving the overall performance of predictive models by ensuring they are trained on the most informative data.
The feature selection process is systematically divided into two steps to ensure both precision and effectiveness. Initially, the permutation importance method is employed to evaluate and rank each feature’s importance based on the impact that permuting its values has on model performance. Features whose permutation significantly degrades the model’s accuracy are deemed more critical, according to a user-defined threshold. Subsequently, the “SelectFromModel” technique from the sklearn library is utilized to identify the optimal subset of features. This selection is further refined by a fivefold cross-validation procedure, which examines the performance of a Random Forest model trained with the chosen features. The resulting mean cross-validation score is then plotted against the importance threshold, illustrating the relationship between the two, as depicted in Figure 5.
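A sketch of this two-step procedure follows; the forest size and the candidate thresholds are assumptions, and `SelectFromModel` is pointed at the permutation importances through its `importance_getter` argument:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

# Step 1: rank features by permutation importance on a fitted forest.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_bal, y_train_bal)
perm = permutation_importance(rf, X_train_bal, y_train_bal,
                              n_repeats=5, random_state=42)

# Step 2: for each importance threshold, select a feature subset and score
# a fresh Random Forest with fivefold cross-validation.
for thr in [0.0, 0.0005, 0.001, 0.005]:
    selector = SelectFromModel(rf, threshold=thr, prefit=True,
                               importance_getter=lambda est: perm.importances_mean)
    X_sub = selector.transform(X_train_bal)
    score = cross_val_score(RandomForestClassifier(random_state=42),
                            X_sub, y_train_bal, cv=5).mean()
    print(f"threshold={thr}: mean CV score={score:.3f}")
```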
MODEL SELECTION
In this part of our proposed method, a number of ML models are deployed to predict three outcomes for children from fragile families: learning capacity, as reflected by GPA; resilience, as gauged by grit; and material hardship. Because GPA and grit are continuous, their prediction is a regression task, while the prediction of material hardship is a binary classification task. The ML models used in this study are Neural Network, Random Forest, and Gradient-Boosted Tree. These models were chosen for their proven effectiveness in handling both regression and classification problems, as well as their widespread usage in similar research. By considering multiple models, the study aimed to identify the most effective model for each of the three variables, based on their performance on a hold-out test set. For completeness, a brief introduction to all three models is provided here.
Neural Network (MLP)
Neural Network (Abiodun et al., 2018) is an ML model inspired by the structure and function of the human brain. Figure 6 shows the three-layer structure of the neural network: input layer, hidden layer, and output layer. Each node, called an artificial neuron, is connected to others and has an associated weight and threshold. The input layer simply receives the raw input data, that is, the values of the independent variables. The hidden layer processes and transforms the input data through a series of weighted connections and activation functions. The output layer produces the final prediction based on the transformed input data. During training, the weights and biases of the neurons are adjusted using an optimization algorithm to minimize the difference between the predicted output of the network and the true output.
In our proposed method, we used the MLP from Scikit-Learn (sklearn) in Python. The MLP is a fully connected feed-forward network whose neurons are called perceptrons. In each hidden layer, the nodes use the sigmoid activation function. The total number of hidden layers and neurons in each layer, along with the learning rate of the network, is optimized using Bayesian optimization. A brief description of this optimization method is provided in the Hyperparameter Tuning section.
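A sketch of this configuration (scikit-learn names the sigmoid activation "logistic"); the layer sizes and learning rate below are placeholders that Bayesian optimization would tune:

```python
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    hidden_layer_sizes=(64, 32),  # number of layers and neurons: tuned
    activation="logistic",        # sigmoid units in the hidden layers
    learning_rate_init=1e-3,      # learning rate: tuned
    max_iter=500,
    random_state=42,
)
mlp.fit(X_train_sel, y_train)
```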
Random Forest
A Random Forest is an ML model that consists of multiple decision trees. Figure 7 shows the basic structure of a Random Forest, which includes several decision trees. Each decision tree is constructed by randomly selecting a subset of features and training the tree to predict the target variable. During prediction, the Random Forest combines the outputs of all decision trees to make a final prediction (Biau and Scornet, 2016; Huljanah et al., 2019). Random Forest also belongs to the ensemble learning family, which combines multiple ML models to make more accurate predictions than individual models.
Gradient Boosting Tree
The Gradient Boosting Tree (GBT) (Bentéjac et al., 2021) is an ML model that, similar to a Random Forest, uses a set of decision trees to make predictions. However, unlike in a Random Forest, in GBT the decision trees are not constructed independently. The basic idea behind gradient boosting is to iteratively train decision trees, each correcting the errors of the trees before it.
The process starts by training a single decision tree on the input data. If the output of this tree is not sufficiently accurate, another decision tree is trained to predict the residual errors of the first tree. The output of the second tree is then combined with that of the first tree to produce a more accurate prediction.
This process is repeated, with each new tree trained to predict the residuals left by all previous trees. The final prediction is the sum of the outputs of all trees.
Extreme Gradient Boosting (XGBoost) (Chen et al., 2015) is an optimized version of gradient boosting that includes a regularized objective function and advanced regularization techniques to prevent overfitting. XGBoost’s inclusion of L1 and L2 regularization, along with an approximate greedy algorithm for splitting nodes, helps reduce overfitting and computational costs. Furthermore, XGBoost’s support for parallel processing and distributed computing enhances its efficiency and scalability, enabling it to handle large datasets effectively.
In this study, the XGBRegressor was utilized for predicting both continuous-valued and binary-valued outcomes.
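A sketch with illustrative hyperparameters (the actual values come from Bayesian optimization); `reg_alpha` and `reg_lambda` are XGBoost’s L1 and L2 penalties noted above:

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=300,    # boosting rounds: tuned
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    max_depth=6,         # depth of each tree
    reg_alpha=0.1,       # L1 regularization on leaf weights
    reg_lambda=1.0,      # L2 regularization on leaf weights
    n_jobs=-1,           # parallel tree construction
)
xgb.fit(X_train_sel, y_train)
```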
Stacking Ensemble Model
An Ensemble Model is a type of ML model that combines the predictions generated by multiple base models, which can be of various types. These models are generally used to improve the accuracy of the ML model, as they combine the predictions of multiple models, thereby reducing the variance of the predictions and making the models more robust to overfitting. There are various methods for creating an Ensemble Model. In this research, we have created a Stacking-based Ensemble Model to combine the predictions of the selected base models. An overview of the proposed Stacking Ensemble Model is shown in Figure 8.
Figure 8. Overview of the Stacking Ensemble Model using the three selected base models. Abbreviation: FFCWS, Future of Families and Child Wellbeing Study.
Stacking Ensemble works by initially training a set of base models on the original data. The predictions generated by these base models are then combined to create features for a selected metamodel. This metamodel is trained on the predictions of the base models to learn how best to combine them, effectively learning to weigh the base models’ predictions to produce the best predictions for the target variable.
In our proposed method, we initially trained three base models (Random Forest, XGBoost, and Neural Network) and generated predictions from each individual model for model evaluation. The predictions from the base models were then stored, horizontally stacked, and fed into an optimized Random Forest regressor, which acts as our metamodel. The final predictions are evaluated and compared against the performances of each of the base models.
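A sketch of this stacking step for the regression targets; `rf_model`, `xgb_model`, and `mlp_model` are assumed to be the fitted base models, and `X_valid`/`y_valid` a held-out fold on which the metamodel is trained:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Horizontally stack the base models' predictions into meta-features.
meta_train = np.column_stack([
    rf_model.predict(X_valid),
    xgb_model.predict(X_valid),
    mlp_model.predict(X_valid),
])

# Train the (Bayesian-optimized) Random Forest metamodel on those features.
meta_model = RandomForestRegressor(n_estimators=200, random_state=42)
meta_model.fit(meta_train, y_valid)

# Final predictions: stack base-model outputs on the test set and combine.
meta_test = np.column_stack([
    rf_model.predict(X_test_sel),
    xgb_model.predict(X_test_sel),
    mlp_model.predict(X_test_sel),
])
final_pred = meta_model.predict(meta_test)
```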
HYPERPARAMETER TUNING
Hyperparameter tuning is a crucial step in optimizing the performance of an ML model. In our proposed method, Bayesian optimization (Frazier, 2018; Greenhill et al., 2020) was utilized as the primary method for adjusting and optimizing the model’s hyperparameters. Bayesian optimization addresses the problem of finding the minimum of a function, represented as in Equation (3):

$$x^{*} = \arg\min_{x \in \mathcal{X}} f(x) \tag{3}$$

where f is the expensive-to-evaluate objective and $\mathcal{X}$ is the hyperparameter search space.
For hyperparameter tuning using Bayesian optimization, the performance of the base model on the validation dataset was treated as the function to optimize. A probabilistic model, specifically the Gaussian process, was employed to model the relationship between hyperparameters and validation set performance. This model enables the prediction of performance under new hyperparameter settings and estimation of the uncertainty in those predictions.
To find the function’s minimum using Bayesian optimization, we began by sampling a set of initial input locations, evaluating the function at these points, and using these evaluations to build the initial model. Iteratively, new input locations were selected based on the acquisition function, which balanced exploration and exploitation to guide the search toward regions where the minimum was likely. Cross-validation was used to evaluate the function at each new input location, providing a more accurate estimate of the function’s performance (Browne, 2000).
The optimal set of hyperparameters was determined based on the cross-validation score, identifying the hyperparameters that yielded the best performance within the defined input space ranges. Bayesian optimization facilitated an efficient exploration of the input space, leading to the identification of the optimal hyperparameters for the best function performance.
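The paper does not name an optimization library; one convenient way to reproduce this loop is scikit-optimize’s Gaussian-process-backed BayesSearchCV, sketched here with assumed search ranges for a Random Forest:

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import RandomForestRegressor

search = BayesSearchCV(
    RandomForestRegressor(random_state=42),
    search_spaces={
        "n_estimators": Integer(100, 500),
        "max_depth": Integer(3, 20),
        "min_samples_leaf": Integer(1, 10),
        "max_features": Real(0.1, 1.0),
    },
    n_iter=30,                         # acquisition-function steps
    cv=5,                              # cross-validated objective
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_train_sel, y_train)
print(search.best_params_)             # best hyperparameters found
```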
MODEL PERFORMANCE
In this study, material hardship was approached as a binary classification task, employing a classifier for prediction. Conversely, GPA and grit were treated as continuous regression variables, with a regressor used for their prediction. The classifier’s performance was evaluated using several metrics, including accuracy, MSE, confusion matrix, receiver operating characteristic (ROC) curve, AUC, and F1 score, while the regressor’s performance was assessed using MSE, MAE, and the R² score. These metrics, defined below and followed by a brief computation sketch, provided a comprehensive evaluation of the accuracy, precision, and overall performance of the classification and regression models.
Accuracy: the proportion of correctly classified instances among all instances, offering a basic measure of overall classifier performance.
Confusion matrix: a table displaying the counts of true positives, false positives, true negatives, and false negatives, facilitating the calculation of accuracy, precision, and recall.
ROC curve: a graphical representation of a binary classifier’s performance, illustrating the trade-off between the true positive rate and false positive rate across various thresholds.
AUC: the area under the ROC curve, measuring the classifier’s overall performance.
MSE: quantifies the average squared difference between predicted and actual values, calculated as shown in Equation (4):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{4}$$

MAE: the mean absolute error, a metric for regression model accuracy, calculating the average absolute difference between predicted and actual values as in Equation (5):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{5}$$

R²: the coefficient of determination, indicating the proportion of variance in the dependent variable explained by the independent variables, as depicted in Equation (6):

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \tag{6}$$

F1 score: a measure of a binary classifier’s accuracy and precision, the harmonic mean of precision and recall, as given by Equation (7):

$$F_1 = 2\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} \tag{7}$$
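All of these metrics are available in scikit-learn; a brief sketch with assumed variable names (`y_prob` holds predicted probabilities for the positive class):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             r2_score, roc_auc_score)

# Classification metrics for material hardship.
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(roc_auc_score(y_test, y_prob))   # AUC from the ROC curve
print(f1_score(y_test, y_pred))

# Regression metrics for GPA and grit.
print(mean_squared_error(y_true_reg, y_pred_reg))
print(mean_absolute_error(y_true_reg, y_pred_reg))
print(r2_score(y_true_reg, y_pred_reg))
```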
RESULT ANALYSIS
The selected models were tested on the FFCWS dataset, and the outcome of each model was evaluated using appropriate performance metrics. The observations from each model revealed noteworthy performance outcomes for each target variable. In this section, we discuss the results of our research.
The results in Table 1 demonstrate the impressive performance of the Random Forest, XGBoost, Neural Network, and Stacking Ensemble models in classifying material hardship after applying SMOTE oversampling. All models achieved F1 scores over 98%, indicating excellent predictive capabilities on this imbalanced classification task.
Performance comparison of different models on material hardship classification (after SMOTE).
Model | MSE | ACC | AUC | F1 Score
---|---|---|---|---
Random Forest Classifier | 0.013 | 0.987 | 0.998 | 0.987
XGBoost | 0.015 | 0.985 | 0.997 | 0.985
Neural Network (MLP) | 0.013 | 0.894 | 0.962 | 0.987
Stacking Ensemble Model | 0.014 | 0.986 | 0.996 | 0.986
Abbreviations: ACC, accuracy; AUC, area under the curve; MLP, multilayer perceptron; MSE, mean squared error; SMOTE, Synthetic Minority Over-sampling Technique.
Compared to the results before SMOTE was applied, as shown in Table 2, the performance improvement is stark. Without addressing the class imbalance, the models struggled to effectively predict the minority positive class of material hardship cases. For example, the Random Forest F1 jumped from 0.80 to 0.987 after using SMOTE.
Performance comparison of different models on material hardship classification (before SMOTE).
Model | MSE | ACC | AUC | F1 Score
---|---|---|---|---
Random Forest Classifier | 0.29 | 0.69 | 0.83 | 0.80
XGBoost | 0.25 | 0.71 | 0.86 | 0.77
Neural Network | 0.15 | 0.80 | 0.68 | 0.79
Stacking Ensemble Model | 0.05 | 0.82 | 0.85 | 0.85
Abbreviations: ACC, accuracy; AUC, area under the curve; MLP, multilayer perceptron; MSE, mean squared error; SMOTE, Synthetic Minority Over-sampling Technique.
The AUC scores in Table 1 are all above 0.96, also confirming the models’ strong ability to distinguish between the positive and negative classes after oversampling. While accuracy and MSE can be misleading metrics for imbalanced data, their values improved as well with SMOTE.
Given the skewed class distribution illustrated in Figure 3, the F1 score and AUC were the most informative metrics for this problem. The F1 score captures performance on both classes, while the AUC evaluates how well the model ranks positive cases higher than negative ones. The impressive values for both metrics indicate that handling the class imbalance was key to the models’ success.
Figure 9. ROC curves of the Random Forest, XGBoost, Neural Network, and Stacking Ensemble models for material hardship. Abbreviations: AUC, area under the curve; ROC, receiver operating characteristic.
Oversampling enabled the models to learn from a more representative training set with enough minority examples. This allowed accurate learning of patterns in positive material hardship cases, rather than just focusing on the majority class. The resultant models are robust classifiers for predicting material hardship, validated by the confusion matrix results in Figure 10 showing strong true-positive and low false-negative rates.
In summary, SMOTE oversampling proved highly effective at improving model performance on this imbalanced classification task, enabling accurate prediction of the underrepresented positive class. The results motivate the use of oversampling techniques when applying ML to skewed real-world data.
The empirical results presented in Tables 3 and 4 elucidate the comparative efficacy of various ML algorithms—namely, Random Forest, XGBoost, Neural Network (MLP), and Stacking Ensemble Model—in predicting two continuous variables: grit and GPA. The evaluation leans on MAE and R-squared (R²) as primary metrics, given their independence from scale, which renders them particularly relevant for regression analysis.
Performance comparison of different models on grit.
Model | MSE | MAE | R²
---|---|---|---
Random Forest Regressor | 0.347 | 0.439 | 0.684
XGBoost | 0.344 | 0.440 | 0.251
Neural Network (MLP) | 0.371 | 0.482 | 0.249
Stacking Ensemble Model | 0.253 | 0.239 | 0.699
Abbreviations: MAE, mean absolute error; MLP, multilayer perceptron; MSE, mean squared error.
Performance comparison of different models on GPA.
Model | MSE | MAE | R²
---|---|---|---
Random Forest Regressor | 0.647 | 0.586 | 0.431
XGBoost | 0.637 | 0.592 | 0.440
Neural Network (MLP) | 0.671 | 0.675 | 0.339
Stacking Ensemble Model | 0.30 | 0.540 | 0.859
Abbreviations: GPA, grade point average; MAE, mean absolute error; MLP, multilayer perceptron; MSE, mean squared error.
In the context of grit prediction, the Stacking Ensemble Model performed best, with an MAE of 0.239 and an R² of 0.699, substantially surpassing the base models. Notably, the Random Forest algorithm emerged as the strongest base model, registering an MAE of 0.439 and an R² of 0.684. While the MSE is a common metric, MAE and R² are more reflective of model performance in these regression tasks, underscoring the Stacking Ensemble’s advantage.
Correspondingly, for GPA prediction, the Stacking Ensemble Model again performed best, with an MAE of 0.54 and an R² of 0.859, marking a significant improvement over the individual models. Among the base models, XGBoost exhibited the highest performance, with an MAE of 0.592 and an R² of 0.44. The Ensemble Model’s superior performance on these scale-independent metrics affirms the utility of combining multiple models to amplify predictive accuracy—a well-documented strength of ensemble methods.
Interestingly, the Neural Network model did not fare as well as its counterparts. This could be attributed to several factors inherent to neural networks, such as their demand for substantial data volumes to effectively learn, their susceptibility to overfitting, and the necessity for meticulous optimization of their architecture and hyperparameters. Conversely, tree-based models, like Random Forest and XGBoost, may offer more robust generalization from limited datasets and integrate intrinsic feature selection, thereby concentrating on pertinent predictors.
The robustness of the Stacking Ensemble Model is further bolstered by the strategic feature selection process delineated in the Feature Selection section. This process aids in distilling the most influential features, thus enabling the models to more accurately map the input-output relationship.
In summary, the Stacking Ensemble Model, augmented by a thoughtful feature selection, emerged as the most potent predictor for the continuous outcomes of grit and GPA. The MAE and R 2 statistics corroborate the superiority of the ensemble strategy over conventional regression algorithms. These findings underscore the potential for leveraging analogous methodologies in constructing predictive models for the analysis of learning disabilities in fragile families. With additional refinement, these models hold promise for enhancing automated diagnostics and interventions for this at-risk demographic.
CONCLUSION
In this study, we applied advanced ML techniques to forecast behavioral outcomes for children from fragile families, emphasizing learning disabilities, grit, and material hardship. By rigorously preparing data and employing feature selection methods such as MI, LASSO, and tree-based techniques, we ensured the precision of our model training. Our analysis utilized diverse predictive models, including Random Forest, Neural Networks (MLP), GBT, and a Stacking Ensemble Model, each optimized through Bayesian optimization. We also tackled class imbalance in our dataset by applying SMOTE, achieving balanced classes and improved model generalization.
Our experimental results demonstrated that the Random Forest and XGBoost models were particularly effective in classifying material hardship, with the Random Forest model standing out for its exceptional performance. In the prediction of GPA and grit, the Stacking Ensemble Model proved to be the most proficient, surpassing other methodologies in accuracy and reliability. However, Neural Networks exhibited limitations in binary classification tasks, indicating a need for refinement through regularization and hyperparameter tuning.
This study not only advances our understanding of the predictive factors affecting children from vulnerable backgrounds but also provides a solid empirical foundation for informing future educational and social policies tailored to these challenges. Our findings advocate for the effectiveness of ensemble methods and sophisticated algorithms in surpassing traditional models, with the Stacking Ensemble Model particularly enhancing predictions. By highlighting the importance of targeted interventions and policy-making based on rigorous data analysis, this research contributes valuable insights toward improving the life prospects of children in fragile families. Future research could benefit from evaluating a broader spectrum of ML algorithms, including probabilistic models and dimensionality reduction techniques, to gain deeper insights into the dynamics affecting fragile families.