

      Exploring Childhood Disabilities in Fragile Families: Machine Learning Insights for Informed Policy Interventions


            Abstract

            This study delves into the multifaceted challenges confronting children from vulnerable or fragile families, with a specific focus on learning disabilities, resilience (measured by grit), and material hardship—a factor intricately linked with children’s disabilities. Leveraging the predictive capabilities of machine learning (ML), our research aims to discern the determinants of these outcomes, thereby facilitating evidence-based policy formulation and targeted interventions for at-risk populations. The dataset underwent meticulous preprocessing, including the elimination of records with extensive missing values, the removal of features with minimal variance, and the imputation of medians for categorical data and means for numerical data. Advanced feature selection techniques, incorporating mutual information, the least absolute shrinkage and selection operator (LASSO), and tree-based methods, were employed to refine the dataset and mitigate overfitting. Additionally, we addressed the challenge of class imbalance through the implementation of the Synthetic Minority Over-sampling Technique (SMOTE) to enhance model generalization. Various ML models, encompassing Random Forest, Neural Networks [multilayer perceptron (MLP)], Gradient-Boosted Trees (XGBoost), and a Stacking Ensemble Model, were evaluated on the Future of Families and Child Wellbeing Study (FFCWS) dataset, with fine-tuning facilitated by Bayesian optimization techniques. The experimental findings highlighted the superior predictive performance of Random Forest and XGBoost models in classifying material hardship, while the Stacking Ensemble Model emerged as the most effective predictor of grade point average (GPA) and grit. Our research underscores the critical importance of tailored policy interventions grounded in empirical evidence to address childhood disabilities within fragile families, thus offering invaluable insights for policymakers and practitioners alike.


            INTRODUCTION

            The significant impact of family instability on children’s behavioral and emotional well-being has been well-documented across various studies ( Fomby et al., 2016). However, the challenge of pinpointing the specific aspects of family instability that profoundly affect child behavior remains. The definition of the family as the cornerstone of society has undergone significant evolution due to demographic shifts and sociocultural changes. The diversification of family structures, including marriage, divorce, cohabitation, remarriage, same-sex unions, and nonmarital fertility, has introduced complexities into the notion of family stability within the United States ( Cavanagh and Fomby, 2019). Consequently, a considerable proportion of American children experience transitions across multiple family structures during their developmental years ( Brown et al., 2017). This changing landscape has prompted scholars to examine the impacts of diverse family configurations on child behaviors more closely ( Coe et al., 2020), noting that chronic family instability, marked by erratic parenting practices, unpredictable routines, and fluctuating economic and social resources, can severely hinder children’s adaptability ( Cooper et al., 2015).

This study utilizes artificial intelligence (AI) and machine learning (ML) as advanced methodologies to analyze and predict the intricate impacts of fragile family environments on children. Prior scholarly endeavors have highlighted the potential efficacy of these approaches: Kumar’s (2022) examination of gender-related disparities in academic performance predictors through tree-based regression; Carnegie and Wu’s (2019) utilization of Bayesian generalized linear models and Bayesian additive regression trees (BART) to forecast child outcomes within fragile family contexts; and the work by Rigobon et al. (2019), which emphasized collaborative methods encompassing data preprocessing and feature selection techniques such as mutual information (MI) and shrinkage methods, alongside ML models like Random Forest and Gradient-Boosted Trees. Nevertheless, these investigations commonly encounter challenges in accurately identifying the influence of individual predictors and achieving optimal prediction or classification performance.

            Building upon this foundation, our research introduces a comprehensive methodology that employs ML to identify key determinants and predict outcomes for children from vulnerable family backgrounds. Addressing the multifaceted challenges these children face, including learning disabilities [evidenced by grade point average (GPA) scores], resilience (measured by grit), and material hardship (correlated with disabilities), this study aims to provide actionable insights to inform policy and enhance intervention strategies. By deploying advanced ML techniques and rigorous statistical modeling, we delve into a detailed examination of demographic, familial, and neighborhood characteristics to predict children’s future outcomes more accurately. Our contributions are manifold:

            • We implemented meticulous data preprocessing, including the removal of records with extensive missing values, discarding features with minimal variance, and the strategic imputation of medians for categorical data and means for numerical data. A novel feature selection approach was adopted, utilizing MI ( Vergara and Estévez, 2014), the least absolute shrinkage and selection operator (LASSO) ( Fonti and Belitser, 2017), and tree-based methods ( Hasan et al., 2016) to refine the dataset and mitigate overfitting.

            • Recognizing the critical challenge of class imbalance within our dataset, we integrated the Synthetic Minority Over-sampling Technique (SMOTE) ( Elreedy and Atiya, 2019) to equalize the classes, thereby enhancing the models’ ability to generalize across diverse scenarios.

• A suite of ML models, including Random Forest ( Biau and Scornet, 2016), XGBoost ( Chen et al., 2015), Neural Networks [multilayer perceptron (MLP)] ( Abiodun et al., 2018), and a Stacking Ensemble Model, were evaluated on the Future of Families and Child Wellbeing Study ( FFCWS, n.d.) dataset. These models were fine-tuned using Bayesian optimization to achieve superior prediction accuracy.

            • Through the application of these methodologies, our study offers significant empirical insights for developing tailored policies and interventions that address the unique needs of children from fragile families.

            In essence, our research extends beyond the limitations of previous studies to offer a holistic and nuanced understanding of the developmental factors affecting children within fragile family contexts. Our innovative use of ML techniques aims to automate the prediction and classification of behavioral impacts, paving the way for more effective and focused support mechanisms.

            RELATED WORKS

            Numerous studies have explored the application of ML to understand the dynamics of child disabilities within fragile families. This section offers a review of notable contributions in this area.

The study by Salganik et al. (2020) engaged over 100 global research teams in predicting various life outcomes using data from the FFCWS, a longitudinal survey in the United States. Teams utilized socioeconomic, health, education, and family data at ages 1, 3, 5, 9, and 15 to predict outcomes at age 15, including material hardship, GPA, grit, household eviction, job training participation, and caregiver layoff. Despite extensive data and ML techniques, predictability of outcomes at age 15 remained relatively low, with the best model achieving an R² of 0.19 for GPA. Notably, classic statistical models performed comparably to advanced ML algorithms ( Ahearn and Brand, 2019). However, the challenge spurred research into effective data wrangling and predictive modeling techniques, with gradient boosting and regularized regression models showing promising performance in predicting GPA ( Raes, 2019).

            Rigobon et al. (2019) detail their participation in the Fragile Families Challenge, utilizing a prediction challenge dataset from the FFCWS. Their collaborative and modular approach encompassed data preprocessing, feature engineering, feature selection, model development, and prediction aggregation. Utilizing data science techniques such as MI, LASSO, elastic net, Random Forest, and Gradient-Boosted Trees, they generated predictions for six outcomes. Their entries ranked highly, achieving first place in predicting GPA, grit, and layoff, third in job training, ninth in material hardship, and eleventh in eviction. They also reflect on the challenges encountered and propose directions for future research.

            Compton (2019) endeavors to explore the potential of employing data-driven ML techniques in addressing sociological challenges, departing from traditional theoretical approaches. The study specifically focuses on the application of feature engineering and the optimization of predictive models to identify families at risk within the Fragile Families Challenge context. Through the utilization of principal-component analysis and decision tree modeling, the author aims to predict six primary dependent variables. While demonstrating success in modeling one binary variable, the study reveals constraints in accurately predicting continuous dependent variables. This observation underscores the nuanced nature of predictability concerning dependent variables, suggesting that varying levels of complexity in independent variables may influence predictive outcomes.

            Kindel et al. (2019) outline a redesign of the metadata system for the FFCWS, inspired by experiences from the Fragile Families Challenge. By treating metadata as data, the authors aim to simplify data preparation processes for various analyses. This approach, exemplified through open-source tools, offers potential for enhancing machine learning applications in longitudinal surveys and stimulating research on data preparation in social sciences. Traditionally, social scientists rely on metadata systems for navigating and interpreting datasets, which often require significant investment to master. However, by reimagining metadata as data and streamlining access through machine-actionable formats, the redesigned system seeks to address scalability challenges and facilitate more efficient research methodologies in the social sciences.

            A research article by social scientist Stephen McKay (2019) reports on his experience in the Fragile Families Challenge. McKay leveraged his background in social science and statistical methods for variable selection and modeling to predict six outcomes. His models, especially for material hardship and layoff, proved competitive against ML approaches, highlighting the value of integrating social science insights into predictive modeling.

            The study by Filippova et al. (2019) explores the integration of human expertise with ML for predicting six outcomes in the FFCWS. The authors solicited expert and lay opinions to evaluate variable relevance, informing data selection or weighting in regression models. Their findings suggest that human-augmented methods did not consistently improve—and sometimes detract from—prediction accuracy, leading to a discussion on the approach’s limitations and potential areas for future investigation.

            The study by Carnegie and Wu (2019) describes a nuanced approach to the Fragile Families Challenge, employing BART alongside collaborative and modular data processing and variable selection techniques. The authors assessed various variable selection methods, including LASSO and horseshoe prior, and examined the influence of tree quantity in BART models. While recognizing BART’s strengths in predictive modeling and causal inference, they acknowledge the need for deeper analysis to elucidate significant associations.

            Finally, Prendergast and MacPhee (2021) examine the impact of family risk factors at birth on kindergarten success and Child Protective Services (CPS) engagement. Through cumulative risk and latent class analysis, the study identifies correlations between risk factors and academic and behavioral outcomes, as well as differing patterns of CPS involvement. This research underscores the potential of early risk screening to inform preventative programs and services.

            Our study offers several advantages over existing research. Firstly, we employ a comprehensive methodology, integrating multiple ML techniques and feature selection methods. This approach allows for a holistic prediction of outcomes for children in fragile families, considering various factors simultaneously. Additionally, we address class imbalance by implementing the SMOTE, enhancing model generalizability. Furthermore, evaluating multiple ML models enables us to identify the most effective approach for predicting outcomes in fragile family contexts. This thorough analysis ensures optimal model selection for accurate predictions. Lastly, our research provides policy-relevant insights, informing the development of tailored policies and interventions for children in fragile families, based on empirical evidence.

            DATA ANALYSIS AND DATA PREPROCESSING

In this paper, we propose a novel hybrid two-way feature selection technique that integrates MI regression and the LASSO to identify relevant and impactful features for our regression analysis of GPA and grit. Figure 1 provides an overview of the workflow adopted in this study. After the feature selection process, we trained our selected base models, which include Random Forest, XGBoost, and Neural Networks, to generate predictions for material hardship, GPA, and grit. For performance evaluation, we utilized mean squared error (MSE), accuracy score, area under the curve (AUC), and F1 score for material hardship; and MSE, mean absolute error (MAE), and R² for grit and GPA. Moreover, we applied a stacking-based ensemble technique leveraging the three selected base models and evaluated the Ensemble Model with appropriate performance metrics, comparing the results with those of the base models across all three targeted features. Our findings indicate that the use of an ensemble technique significantly enhances model performance, yielding superior outcomes. Detailed comparative analyses of all base models and the Stacking Ensemble Model are presented in the Result Analysis section.

            Figure 1:

Top-level overview of our proposed method.

            Dataset

The dataset utilized in this study is the FFCWS (n.d.). This dataset encompasses information collected from 4898 families, representing a diverse cross-section of ethnicities, including Black, Hispanic, and low-income families. The survey was conducted across major US cities, each with a population exceeding 200,000, between the years 1998 and 2000. The FFCWS dataset comprises a core survey targeting primarily mothers, fathers, and primary caregivers. The subsequent sections will delve into further analysis and preprocessing of this dataset.

            Data analysis

            The dataset includes data about 4898 families and 17,002 variables, assigning a unique ID number to each family. The dataset potentially houses approximately 83 million entries. However, 68.38% of these entries were found to be missing. The predominant reasons for missing data include: (i) noninclusion of participants in certain survey waves, accounting for roughly 29.71% of the missing entries; (ii) refusal or inability of respondents to answer specific questions, contributing to less than 1% of the missing entries; (iii) loss of data due to various reasons, which accounts for 8.18% of the missing entries; and (iv) survey questions being skipped because they did not apply to the participant or the answers could be inferred from other provided information, constituting 28.20% of the missing entries. For more details, see Figure 2.

            Figure 2:

            Histogram representing the number of missing values and corresponding reasons.

            Data preprocessing

            To ensure the accuracy and validity of our analysis, we employed several strategies for managing missing data and outliers. Variables exhibiting more than 60% missing data were excluded, as their contribution to the analysis was deemed insignificant. For categorical variables, median values were imputed to maintain their categorical nature, while mean values were imputed for numerical variables to preserve data distribution. Furthermore, unrecognized strings were removed to prevent errors and inconsistencies, and variables with standard deviations below 0.05 were excluded to concentrate on variables demonstrating substantial variability. These measures were aimed at enhancing data quality and bolstering the robustness of our results.
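For concreteness, these steps can be sketched in Python with pandas; the DataFrame df and the set of categorical column names are placeholders, and the thresholds match those stated above (60% missingness, 0.05 standard deviation).

```python
import pandas as pd

def preprocess(df: pd.DataFrame, categorical_cols: set) -> pd.DataFrame:
    # Drop variables with more than 60% missing values.
    df = df.loc[:, df.isna().mean() <= 0.60].copy()

    for col in df.columns:
        # Coerce unrecognized strings to NaN, effectively removing them
        # (FFCWS variables are numeric-coded, so this is safe here).
        df[col] = pd.to_numeric(df[col], errors="coerce")
        if col in categorical_cols:
            # Categorical codes: impute the median to stay on a valid category.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Numerical variables: impute the mean to preserve the distribution.
            df[col] = df[col].fillna(df[col].mean())

    # Drop near-constant variables (standard deviation below 0.05).
    stds = df.std(numeric_only=True)
    return df.drop(columns=stds[stds < 0.05].index)
```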

            It was observed that the dataset utilized in this study exhibited a class imbalance problem (see Fig. 3). Class imbalance, a prevalent challenge in ML, occurs when there is a significant discrepancy in the number of samples between classes, leading to an imbalanced dataset. This issue is known to adversely affect model performance, as models trained on imbalanced data may overfit the majority class while neglecting the minority class. To mitigate this problem, we implemented the SMOTE in our research ( Elreedy and Atiya, 2019).

            Figure 3:

            Data distribution showing class imbalance of the target variable: material hardship.

            SMOTE, an oversampling method, creates new synthetic samples for the minority class by interpolating between existing samples, thus achieving a balanced training set without eliminating valuable data from the majority class. This technique has been demonstrated to enhance model performance on imbalanced datasets across various fields by allowing models to learn more comprehensive representations of the minority class.

            In our analysis, SMOTE was applied to address the significant class imbalance observed in the target variable, specifically material hardship, where the majority of samples were from the negative class and the positive class was underrepresented (see Fig. 3). This imbalance posed potential issues for model performance on the minority class. By generating synthetic positive samples through SMOTE, we created a more balanced training set, enabling the development of a model with improved generalization capabilities across classes. Utilizing SMOTE to handle class imbalance emerged as a vital preprocessing step, significantly enhancing model performance in the challenge of predicting material hardship in an imbalanced dataset.
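As an illustration, SMOTE from the imbalanced-learn package can be applied to the training split only, so that the test set retains the real-world class distribution; X and y are placeholders for the preprocessed features and the binary material-hardship label.

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Hold out a stratified test set, then oversample the training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```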

            In Figure 4, it is evident that SMOTE oversampling has been applied to the imbalanced dataset, resulting in a balanced class distribution.

            Figure 4:

            Data distribution showing balanced data after applying SMOTE on the target variable: material hardship.

            Moreover, an analysis of the numerical features revealed disparities in their value scales. To address this issue, max-min normalization was employed to rescale the features to a uniform range, typically between 0 and 1. This normalization technique is widely utilized for feature scaling and is instrumental in enhancing the performance of ML algorithms that are sensitive to the scale of input features.
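A minimal sketch using scikit-learn's MinMaxScaler, with the same placeholder names as above:

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then apply the same transform
# to the test data to avoid leaking test-set statistics into training.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```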

            FEATURE SELECTION

            Feature selection is a crucial step in the data preprocessing pipeline. It aims to reduce the dimensionality of input features. High-dimensional datasets can be complex and contain many irrelevant, redundant, or noisy features that degrade the performance of ML models. Feature selection can improve the generalization ability of ML models, reduce overfitting, and enhance computational efficiency by selecting relevant and informative feature subsets.

In this research, three feature selection methods were used: (i) MI ( Battiti, 1994; Vergara and Estévez, 2014), (ii) LASSO ( Zou, 2006; Fonti and Belitser, 2017), and (iii) tree-based methods ( Hasan et al., 2016). MI and LASSO were used as a two-layer feature selection technique for the target variables GPA and grit. For the binary classification task on material hardship, a tree-based approach was used to select features. These methods effectively identified and selected the most informative features, improving model performance, interpretability, and computational efficiency. Brief details of these feature selection techniques are provided below.

            Mutual information

            MI is a measure between two (potentially multidimensional) random variables, X and Y, that quantifies the amount of information obtained about one variable through the other. As a nonparametric measure, it can accommodate both continuous and categorical variables, making it easy to interpret and computationally efficient. MI has been widely used in feature selection since the seminal work by Battiti ( Battiti, 1994). The equation for MI is shown in Equation (1).

(1) $I(X;Y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$

In this context, p(x, y) denotes the joint probability density function of X and Y, whereas p(x) and p(y) represent their respective marginal density functions. MI gauges the similarity between the joint distribution p(x, y) and the product of the individual marginal distributions. When X and Y are entirely unrelated (thus independent), p(x, y) equals p(x)p(y), resulting in an MI value of zero. This principle underlies its utility in determining the extent of dependence or independence between variables ( Doquire and Verleysen, 2013).

            In this study, we employed MI as one of the key feature selection techniques, among others, to identify important features from the dataset. MI is utilized to evaluate the dependency between each feature and the target variable, enabling the effective selection of relevant features for our analysis. The principle is straightforward: the higher the MI score between a feature and the target, the greater the predictive power of that feature regarding the target variable. Consequently, features highly correlated with the target variable exhibit high MI scores, indicating their relevance. In contrast, irrelevant features, which contribute little to understanding the target variable, register low scores. As part of initial feature screening, we opted to eliminate features with an MI score of zero, aiming to refine the feature set.
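This initial screen can be sketched with scikit-learn's mutual_info_regression; X_train and y_train are placeholders for the preprocessed feature matrix and a continuous target such as GPA.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Score each feature against the continuous target, then drop features
# whose estimated mutual information is zero.
mi_scores = pd.Series(
    mutual_info_regression(X_train, y_train, random_state=42),
    index=X_train.columns,
)
selected = mi_scores[mi_scores > 0].index
X_train_mi = X_train[selected]
```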

            LASSO

            The LASSO was employed as the second layer of feature selection ( Fonti and Belitser, 2017). LASSO regression, a regularization technique, is utilized to enhance prediction accuracy beyond that of standard regression methods. It introduces a penalty term to the cost function of the linear regression model, proportional to the absolute values of the model coefficients. Consequently, this leads to the coefficients being “shrunk” toward zero, resulting in a sparser model that relies on fewer features. “Shrinkage” describes the process by which data values are reduced toward a central point, like the mean. Equation (2) illustrates the concept of LASSO regularization.

(2) $\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\operatorname{arg\,min}} \left\{ \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\}$

This equation performs the LASSO feature selection, where $\lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert$ and $\beta_j$ is the coefficient of the jth feature. $\lambda$ is a nonnegative regularization parameter that tunes the intensity of the penalty term. The second term in (2) is the so-called “L1 penalty,” which is crucial to the success of the LASSO ( Fonti and Belitser, 2017).

            The primary objective of LASSO feature selection is to discern a subset of the most impactful predictor variables concerning the response variable. This process involves minimizing the sum of the residual sum of squares along with the L1-norm of the regression coefficients, the latter being multiplied by a regularization parameter, λ. As λ, the penalty term, increases, the coefficients associated with less important features are progressively reduced to zero and ultimately excluded from the model. This method effectively conducts feature selection by preserving only those features that have the most significant coefficients.

            The choice of λ is crucial as it dictates the number of features that are maintained in the model. Employing cross-validation to select the optimal λ value ensures that the model achieves a good generalization to new data samples ( Browne, 2000). It fine-tunes the regularization strength and the extent of shrinkage applied to the model’s coefficients. The GridSearchCV method was utilized to ascertain the most suitable λ value.
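A minimal sketch of this second layer, using scikit-learn's Lasso with GridSearchCV over the regularization strength (the alpha parameter plays the role of λ); the grid bounds are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Cross-validated grid search over the regularization strength.
grid = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": np.logspace(-4, 1, 30)},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train_mi, y_train)

# Keep only the features whose coefficients were not shrunk to exactly zero.
best_lasso = grid.best_estimator_
kept = X_train_mi.columns[best_lasso.coef_ != 0]
```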

            Tree-based methods

            In this study, to address the binary-class classification problem, specifically material hardship, a tree-based approach was utilized to identify the influential features from the dataset. More precisely, the well-known Random Forest algorithm was employed for feature selection ( Dimitriadis et al., 2018). Random Forest feature selection is recognized as a robust method that leverages the power of Random Forest algorithms to discern and prioritize the most vital features within large, high-dimensional datasets. By implementing a metric-based ranking system, this strategy effectively eliminates features that do not meet a predetermined relevance threshold, concentrating exclusively on those of utmost significance. The primary goal of this technique is to simplify the dataset’s complexity by focusing on essential features. This not only enhances the model’s accuracy but also reduces the likelihood of overfitting. The feature selection process plays a crucial role in improving the overall performance of predictive models by ensuring they are trained on the most informative data.

            The feature selection process is systematically divided into two steps to ensure both precision and effectiveness. Initially, the permutation importance method is employed to evaluate and rank each feature’s importance based on the impact of permuting their values on model performance. Features that significantly influence the model’s accuracy upon alteration of their values are deemed more critical, according to a user-defined threshold. Subsequently, the “SelectFromModel” technique from the sklearn library is utilized to identify the optimal subset of features. This selection process is further honed by a fivefold cross-validation procedure, which examines the performance of a Random Forest model trained with the chosen features. The resulting mean cross-validation score is then visualized through a plot, illustrating the correlation between the importance threshold and the mean cross-validation score, as depicted in Figure 5.

            Figure 5:

            Feature selection for material hardship. The mean cross-validation score changes as the threshold for feature importance varies. The best subset is selected based on the highest cross-validation score.
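A condensed sketch of the two-step procedure under the same placeholder names; for brevity the threshold cut is applied directly to the permutation importances rather than through SelectFromModel, and the forest hyperparameters are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

# Step 1: rank features by permutation importance on a fitted forest.
rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
perm = permutation_importance(rf, X_train, y_train, n_repeats=5, random_state=42)
importances = perm.importances_mean

# Step 2: sweep importance thresholds and keep the subset with the best
# fivefold cross-validation score.
cv_scores = {}
for thresh in np.linspace(0.0, importances.max(), 20)[1:]:
    mask = importances >= thresh
    if not mask.any():
        continue
    cv_scores[thresh] = cross_val_score(
        RandomForestClassifier(n_estimators=300, random_state=42),
        X_train.loc[:, mask], y_train, cv=5,
    ).mean()

best_thresh = max(cv_scores, key=cv_scores.get)
best_mask = importances >= best_thresh
```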

            MODEL SELECTION

In this part of our proposed method, a number of ML models are deployed to predict three outcomes for children from fragile families: learning capacity, as reflected by GPA; resilience, as gauged by grit; and material hardship. Because GPA and grit are continuous, their prediction is a regression task, while the prediction of material hardship is a binary classification task. The list of ML models used in this study includes Neural Network, Random Forest, and Gradient-Boosted Tree. These models were chosen for their proven effectiveness in handling both regression and classification problems, as well as their widespread usage in similar research. By considering multiple models, the study aimed to identify the most effective model for each of the three variables, based on their performance on a hold-out test set. For the sake of completeness, a brief introduction to all three models is provided here.

            Neural Network (MLP)

A Neural Network ( Abiodun et al., 2018) is a machine learning model inspired by the structure and function of the human brain. Figure 6 shows the three-layer structure of the neural network: input layer, hidden layer, and output layer. Each node, called an artificial neuron, is connected to others and has an associated weight and threshold. The input layer simply receives the raw input data, that is, the values of the independent variables. The hidden layer processes and transforms the input data through a series of weighted connections and activation functions. The output layer produces the final prediction based on the transformed input data. During training, the weights and biases of the neurons are adjusted using an optimization algorithm to minimize the difference between the network’s predicted output and the true output.

            Figure 6:

            Neural Network structure.

In our proposed method, we used the MLP from Scikit-Learn (sklearn) in Python. The MLP is a fully connected feedforward network whose neurons are called perceptrons. In each hidden layer, the nodes, or neurons, use the sigmoid activation function. The number of hidden layers and the number of neurons in each layer, along with the learning rate of the network, are optimized using Bayesian optimization. A brief description of this optimization method is provided in the Hyperparameter Tuning section.
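A sketch of such an MLP in sklearn, where the architecture and learning rate shown are placeholders standing in for the Bayesian-optimized values:

```python
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    hidden_layer_sizes=(64, 32),   # placeholder architecture
    activation="logistic",         # sklearn's name for the sigmoid activation
    learning_rate_init=1e-3,       # placeholder learning rate
    max_iter=500,
    random_state=42,
)
mlp.fit(X_train_scaled, y_train)   # MLPs benefit from the min-max scaled inputs
```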

            Random Forest

            A Random Forest is an ML model that consists of multiple decision trees. Figure 7 shows the basic structure of a Random Forest, which includes several decision trees. Each decision tree is constructed by randomly selecting a subset of features and training the tree to predict the target variable. During prediction, the Random Forest combines the outputs of all decision trees to make a final prediction ( Biau and Scornet, 2016; Huljanah et al., 2019). Random Forest also belongs to the ensemble learning family, which combines multiple ML models to make more accurate predictions than individual models.

            Figure 7:

            Random Forest structure.

            Gradient Boosting Tree

            The Gradient Boosting Tree (GBT) ( Bentéjac et al., 2021) is an ML model that, similar to a Random Forest, uses a set of decision trees to make predictions. However, unlike Random Forest, in GBT, decision trees are not constructed independently. The basic idea behind gradient boosting is to iteratively train decision trees to correct the errors of the preceding tree.

            The process starts by training a single decision tree on the input data. If the output of this tree is not sufficiently accurate, another decision tree is trained to predict the residual errors of the first tree. The output of the second tree is then combined with that of the first tree to produce a more accurate prediction.

            This process is repeated, with each new tree trained to predict the residuals left by all previous trees. The final prediction is the sum of the outputs of all trees.

            Extreme Gradient Boosting (XGBoost) ( Chen et al., 2015) is an optimized version of gradient boosting that includes a regularized objective function and advanced regularization techniques to prevent overfitting. XGBoost’s inclusion of L1 and L2 regularization, along with an approximate greedy algorithm for splitting nodes, helps reduce overfitting and computational costs. Furthermore, XGBoost’s support for parallel processing and distributed computing enhances its efficiency and scalability, enabling it to handle large datasets effectively.

            In this study, the XGBRegressor was utilized for predicting both continuous-valued and binary-valued outcomes.
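For illustration, a minimal XGBRegressor configuration; the hyperparameter values shown are placeholders for those found by Bayesian optimization.

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    reg_alpha=0.1,    # L1 regularization
    reg_lambda=1.0,   # L2 regularization
    random_state=42,
)
xgb.fit(X_train, y_train)
preds = xgb.predict(X_test)
```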

            Stacking Ensemble Model

            An Ensemble Model is a type of ML model that combines the predictions generated by multiple base models, which can be of various types. These models are generally used to improve the accuracy of the ML model, as they combine the predictions of multiple models, thereby reducing the variance of the predictions and making the models more robust to overfitting. There are various methods for creating an Ensemble Model. In this research, we have created a Stacking-based Ensemble Model to combine the predictions of the selected base models. An overview of the proposed Stacking Ensemble Model is shown in Figure 8.

            Figure 8:

            Overview of the Stacking Ensemble Model using the selected three base models. Abbreviation: FFCWS, Future of Families and Child Wellbeing Study.

            Stacking Ensemble works by initially training a set of base models on the original data. Following this process, the predictions generated by the best models are combined to create features for a selected metamodel. This metamodel is then trained on the predictions of the base models to learn how to best combine them, effectively learning to weigh the predictions of the base models to make the best predictions for the target variable.

In our proposed method, we initially trained three base models (Random Forest, XGBoost, and Neural Network) and generated predictions from each individual model for model evaluation. The predictions from the base models were then stored, horizontally stacked, and fed into an optimized Random Forest regressor, which acts as our metamodel. The final predictions are evaluated and compared with the performances of each of the base models.
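A simplified sketch of this stacking step, assuming rf_reg, xgb_reg, and mlp_reg are the fitted base models (with mlp_reg assumed to wrap its own scaling step); for brevity it builds meta-features from training-set predictions, whereas a leakage-safe variant would use out-of-fold predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Horizontally stack base-model predictions into a new feature matrix.
meta_X_train = np.column_stack([
    rf_reg.predict(X_train),
    xgb_reg.predict(X_train),
    mlp_reg.predict(X_train),
])
# Train the Random Forest metamodel on the stacked predictions.
meta_model = RandomForestRegressor(random_state=42).fit(meta_X_train, y_train)

meta_X_test = np.column_stack([
    rf_reg.predict(X_test),
    xgb_reg.predict(X_test),
    mlp_reg.predict(X_test),
])
final_predictions = meta_model.predict(meta_X_test)
```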

            HYPERPARAMETER TUNING

            Hyperparameter tuning is a crucial step in optimizing the performance of an ML model. In our proposed method, Bayesian optimization ( Frazier, 2018; Greenhill et al., 2020) was utilized as the primary method for adjusting and optimizing the model’s hyperparameters. Bayesian optimization is adept at solving the problem of finding the minimum of a function, represented as in Equation (3):

(3) $x_{\min} = \underset{x}{\operatorname{arg\,min}}\, f(x)$

            For hyperparameter tuning using Bayesian optimization, the performance of the base model on the validation dataset was treated as the function to optimize. A probabilistic model, specifically the Gaussian process, was employed to model the relationship between hyperparameters and validation set performance. This model enables the prediction of performance under new hyperparameter settings and estimation of the uncertainty in those predictions.

            To find the function’s minimum using Bayesian optimization, we began by sampling a set of initial input locations, evaluating the function at these points, and using these evaluations to build the initial model. Iteratively, new input locations were selected based on the acquisition function, which balanced exploration and exploitation to guide the search toward regions where the minimum was likely. Cross-validation was used to evaluate the function at each new input location, providing a more accurate estimate of the function’s performance ( Browne, 2000).

            The optimal set of hyperparameters was determined based on the cross-validation score, identifying the hyperparameters that yielded the best performance within the defined input space ranges. Bayesian optimization facilitated an efficient exploration of the input space, leading to the identification of the optimal hyperparameters for the best function performance.
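The text does not name a specific optimization library; one common choice consistent with this description is scikit-optimize's BayesSearchCV, which fits a Gaussian-process model over the hyperparameter space and scores candidates by cross-validation. A sketch under that assumption, with a placeholder search space for a Random Forest:

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import RandomForestRegressor

# Gaussian-process-driven search over a placeholder hyperparameter space,
# scored by 5-fold cross-validation.
opt = BayesSearchCV(
    RandomForestRegressor(random_state=42),
    search_spaces={
        "n_estimators": Integer(100, 1000),
        "max_depth": Integer(3, 30),
        "max_features": Real(0.1, 1.0),
    },
    n_iter=40,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
opt.fit(X_train, y_train)
best_params = opt.best_params_
```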

            MODEL PERFORMANCE

In this study, material hardship was approached as a binary classification task, employing a classifier for prediction. Conversely, GPA and grit were treated as continuous regression variables, with a regressor used for their prediction. The classifier’s performance was evaluated using several metrics, including accuracy, MSE, confusion matrix, receiver operating characteristic (ROC) curve, AUC, and F1 score, while the regressor’s performance was assessed using MSE, MAE, and R² score. These metrics provided a comprehensive evaluation of the accuracy, precision, and overall performance of the classification and regression models; a short computational sketch follows the definitions below.

            • Accuracy: the proportion of correctly classified instances among all instances, offering a basic measure of overall classifier performance.

            • Confusion matrix: a table displaying the counts of true positives, false positives, true negatives, and false negatives, facilitating the calculation of accuracy, precision, and recall.

            • ROC curve: a graphical representation of a binary classifier’s performance, illustrating the trade-off between the true positive rate and false positive rate across various thresholds.

            • AUC: the area under the ROC curve, measuring the classifier’s overall performance.

            • MSE: quantifies the average squared difference between predicted and actual values, calculated as shown in Equation (4):

  (4) $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

            • MAE: represents the mean absolute error, a metric for regression model accuracy, calculating the average absolute difference between predicted and actual values as in Equation (5):

  (5) $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$

• R²: the coefficient of determination, indicating the proportion of variance in the dependent variable explained by the independent variables, as depicted in Equation (6):

  (6) $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

            • F1-Score: a measure of a binary classifier’s accuracy and precision, the harmonic mean of precision and recall, as given by Equation (7):

  (7) $F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
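All of these metrics are available in scikit-learn. A minimal computation sketch, where y_test, y_pred, y_prob (positive-class probabilities), y_true, and y_hat are placeholder arrays:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, roc_auc_score,
                             f1_score, mean_squared_error, mean_absolute_error,
                             r2_score)

# Classification metrics for material hardship.
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)   # y_prob from predict_proba(...)[:, 1]
f1 = f1_score(y_test, y_pred)

# Regression metrics for GPA and grit.
mse = mean_squared_error(y_true, y_hat)
mae = mean_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)
```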

            RESULT ANALYSIS

The selected models were tested on the FFCWS dataset, and the outcome of each model was evaluated using appropriate performance metrics. Each model delivered notable performance on each target variable. In this section, we discuss the results of our analysis.

            The results in Table 1 demonstrate the impressive performance of the Random Forest, XGBoost, Neural Network, and Stacking Ensemble models in classifying material hardship after applying SMOTE oversampling. All models achieved F1 scores over 98%, indicating excellent predictive capabilities on this imbalanced classification task.

            Table 1:

            Performance comparison of different models on material hardship classification (after SMOTE).

Model                     | MSE   | ACC   | AUC   | F1 Score
Random Forest Classifier  | 0.013 | 0.987 | 0.998 | 0.987
XGBoost                   | 0.015 | 0.985 | 0.997 | 0.985
Neural Network (MLP)      | 0.013 | 0.894 | 0.962 | 0.987
Stacking Ensemble Model   | 0.014 | 0.986 | 0.996 | 0.986

            Abbreviations: ACC, accuracy; AUC, area under the curve; MLP, multilayer perceptron; MSE, mean squared error; SMOTE, Synthetic Minority Over-sampling Technique.

            Compared to the results before SMOTE was applied, as shown in Table 2, the performance improvement is stark. Without addressing the class imbalance, the models struggled to effectively predict the minority positive class of material hardship cases. For example, the Random Forest F1 jumped from 0.80 to 0.987 after using SMOTE.

            Table 2:

            Performance comparison of different models on material hardship classification (before SMOTE).

Model                     | MSE  | ACC  | AUC  | F1 Score
Random Forest Classifier  | 0.29 | 0.69 | 0.83 | 0.80
XGBoost                   | 0.25 | 0.71 | 0.86 | 0.77
Neural Network            | 0.15 | 0.80 | 0.68 | 0.79
Stacking Ensemble Model   | 0.05 | 0.82 | 0.85 | 0.85

            Abbreviations: ACC, accuracy; AUC, area under the curve; MLP, multilayer perceptron; MSE, mean squared error; SMOTE, Synthetic Minority Over-sampling Technique.

            The AUC scores in Table 1 are all above 0.96, also confirming the models’ strong ability to distinguish between the positive and negative classes after oversampling. While accuracy and MSE can be misleading metrics for imbalanced data, their values improved as well with SMOTE.

Given the skewed class distribution illustrated in Figure 3, the F1 score and AUC were the most informative metrics for this problem. The scale-independent F1 captures performance on both classes, while AUC evaluates how well the model ranks positive cases higher than negative ones. The impressive values for both metrics indicate that handling the class imbalance was key to the models’ success.

            Figure 9:

            ROC curve of Random Forest, XGBoost, Neural Network and Stacking Ensemble Model for material hardship. Abbreviations: AUC, area under the curve; ROC, receiver operating characteristic.

            Oversampling enabled the models to learn from a more representative training set with enough minority examples. This allowed accurate learning of patterns in positive material hardship cases, rather than just focusing on the majority class. The resultant models are robust classifiers for predicting material hardship, validated by the confusion matrix results in Figure 10 showing strong true-positive and low false-negative rates.

            Figure 10:

            Material hardship confusion matrix.

            In summary, SMOTE oversampling proved highly effective at improving model performance on this imbalanced classification task, enabling accurate prediction of the underrepresented positive class. The results motivate the use of oversampling techniques when applying ML to skewed real-world data.

The empirical results presented in Tables 3 and 4 elucidate the comparative efficacy of various ML algorithms—namely, Random Forest, XGBoost, Neural Network (MLP), and Stacking Ensemble Model—in predicting two continuous variables: grit and GPA. The evaluation leans on MAE and R-squared (R²) as primary metrics, given their independence from scale, which renders them particularly relevant for regression analysis.

            Table 3:

            Performance comparison of different models on grit.

Model                     | MSE   | MAE   | R²
Random Forest Regressor   | 0.347 | 0.439 | 0.684
XGBoost                   | 0.344 | 0.440 | 0.251
Neural Network (MLP)      | 0.371 | 0.482 | 0.249
Stacking Ensemble Model   | 0.253 | 0.239 | 0.699

            Abbreviations: MAE, mean absolute error; MLP, multilayer perceptron; MSE, mean squared error.

            Table 4:

            Performance comparison of different models on GPA.

Model                     | MSE   | MAE   | R²
Random Forest Regressor   | 0.647 | 0.586 | 0.431
XGBoost                   | 0.637 | 0.592 | 0.440
Neural Network (MLP)      | 0.671 | 0.675 | 0.339
Stacking Ensemble Model   | 0.300 | 0.540 | 0.859

            Abbreviations: GPA, grade point average; MAE, mean absolute error; MLP, multilayer perceptron; MSE, mean squared error.

In the context of grit prediction, the Stacking Ensemble Model demonstrated paramount performance with an MAE of 0.239 and an R² of 0.699, surpassing the base models substantially. Notably, the Random Forest algorithm emerged as the superior base model, registering an MAE of 0.439 and an R² of 0.684. While the MSE is a common metric, MAE and R² are more reflective of model performance in regression tasks, thus underscoring the Stacking Ensemble’s advancements.

Correspondingly, for GPA prediction, the Stacking Ensemble Model again attained preeminence with an MAE of 0.54 and an R² of 0.859, marking a significant enhancement over the individual models. Among the base models, XGBoost exhibited the highest performance with an MAE of 0.592 and an R² of 0.44. The Ensemble Model’s superior performance, as indicated by these scale-independent metrics, affirms the utility of combining multiple models to amplify predictive accuracy—a well-documented strength of ensemble methods.

            Interestingly, the Neural Network model did not fare as well as its counterparts. This could be attributed to several factors inherent to neural networks, such as their demand for substantial data volumes to effectively learn, their susceptibility to overfitting, and the necessity for meticulous optimization of their architecture and hyperparameters. Conversely, tree-based models, like Random Forest and XGBoost, may offer more robust generalization from limited datasets and integrate intrinsic feature selection, thereby concentrating on pertinent predictors.

            The robustness of the Stacking Ensemble Model is further bolstered by the strategic feature selection process delineated in the Feature Selection section. This process aids in distilling the most influential features, thus enabling the models to more accurately map the input-output relationship.

In summary, the Stacking Ensemble Model, augmented by a thoughtful feature selection, emerged as the most potent predictor for the continuous outcomes of grit and GPA. The MAE and R² statistics corroborate the superiority of the ensemble strategy over conventional regression algorithms. These findings underscore the potential for leveraging analogous methodologies in constructing predictive models for the analysis of learning disabilities in fragile families. With additional refinement, these models hold promise for enhancing automated diagnostics and interventions for this at-risk demographic.

            CONCLUSION

            In this study, we applied advanced ML techniques to forecast behavioral outcomes for children from fragile families, emphasizing learning disabilities, grit, and material hardship. By rigorously preparing data and employing feature selection methods such as MI, LASSO, and tree-based techniques, we ensured the precision of our model training. Our analysis utilized diverse predictive models, including Random Forest, Neural Networks (MLP), GBT, and a Stacking Ensemble Model, each optimized through Bayesian optimization. We also tackled class imbalance in our dataset by applying the SMOTE, achieving balanced classes and improved model generalization.

            Our experimental results demonstrated that the Random Forest and XGBoost models were particularly effective in classifying material hardship, with the Random Forest model standing out for its exceptional performance. In the prediction of GPA and grit, the Stacking Ensemble Model proved to be the most proficient, surpassing other methodologies in accuracy and reliability. However, Neural Networks exhibited limitations in binary classification tasks, indicating a need for refinement through regularization and hyperparameter tuning.

            This study not only advances our understanding of the predictive factors affecting children from vulnerable backgrounds but also provides a solid empirical foundation for informing future educational and social policies tailored to these challenges. Our findings advocate for the effectiveness of ensemble methods and sophisticated algorithms in surpassing traditional models, with the Stacking Ensemble Model particularly enhancing predictions. By highlighting the importance of targeted interventions and policy-making based on rigorous data analysis, this research contributes valuable insights toward improving the life prospects of children in fragile families. Future research could benefit from evaluating a broader spectrum of ML algorithms, including probabilistic models and dimensionality reduction techniques, to gain deeper insights into the dynamics affecting fragile families.

            CONFLICTS OF INTEREST

            The authors declare no conflicts of interest in association with the present study.

            DATA AVAILABILITY

            The data that support the findings of this study are openly available at https://ffcws.princeton.edu/documentation.

            REFERENCES

            1. Abiodun OI, Jantan A, Omolara AE, Dada KV, Mohamed NA, Arshad H. 2018. State-of-the-art in artificial neural network applications: a survey. Heliyon. Vol. 4(11):e00938

            2. Ahearn CE, Brand JE. 2019. Predicting layoff among fragile families. Socius. Vol. 5:2378023118809757

            3. Battiti R. 1994. Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw. Vol. 5(4):537–550

4. Bentéjac C, Csörgö A, Martínez-Muñoz G. 2021. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. Vol. 54:1937–1967

            5. Biau G, Scornet E. 2016. A random forest guided tour. Test. Vol. 25:197–227

            6. Brown KA, Patel DR, Darmawan D. 2017. Participation in sports in relation to adolescent growth and development. Transl Pediatr. Vol. 6(3):150

            7. Browne MW. 2000. Cross-validation methods. J Math Psychol. Vol. 44(1):108–132

            8. Carnegie NB, Wu J. 2019. Variable selection and parameter tuning for BART modeling in the fragile families challenge. Socius. Vol. 5:2378023119825886

            9. Cavanagh SE, Fomby P. 2019. Family instability in the lives of American children. Ann. Rev. Sociol. Vol. 45:493–513

10. Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, et al. 2015. Xgboost: extreme gradient boosting. R Package Version 0.4-2. Vol. 1(4):1–4

            11. Coe JL, Davies PT, Hentges RF, Sturge-Apple ML. 2020. Understanding the nature of associations between family instability, unsupportive parenting, and children’s externalizing symptoms. Dev. Psychopathol. Vol. 32(1):257–269

            12. Compton R. 2019. A data-driven approach to the fragile families challenge: prediction through principal-components analysis and random forests. Socius. Vol. 5:2378023118818720

            13. Cooper CE, Beck AN, Högnäs RS, Swanson J. 2015. Mothers’ partnership instability and coparenting among fragile families. Soc. Sci. Q. Vol. 96(4):1103–1116

            14. Dimitriadis SI, Liparas D; Alzheimer’s Disease Neuroimaging Initiative. 2018. How random is the random forest? Random forest algorithm on the service of structural imaging biomarkers for Alzheimer’s disease: from Alzheimer’s disease neuroimaging initiative (ADNI) database. Neural Regen. Res. Vol. 13(6):962

            15. Doquire G, Verleysen M. 2013. Mutual information-based feature selection for multilabel classification. Neurocomputing. Vol. 122:148–155

            16. Elreedy D, Atiya AF. 2019. A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Inf. Sci. Vol. 505:32–64

17. Filippova A, Gilroy C, Kashyap R, Kirchner A, Morgan AC, Polimis K, et al. 2019. Humans in the loop: incorporating expert and crowd-sourced knowledge for predictions using survey data. Socius. Vol. 5:2378023118820157

            18. Fomby P, Goode JA, Mollborn S. 2016. Family complexity, siblings, and children’s aggressive behavior at school entry. Demography. Vol. 53(1):1–26

            19. Fonti V, Belitser E. 2017. Feature selection using LASSO. VU Amsterdam Research Paper in Business Analytics. Vol. 30:1–25

20. Frazier PI. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811

            21. Future of Families and Child Wellbeing Study. n.d. Data and documentation. https://ffcws.princeton.edu/documentation

            22. Greenhill S, Rana S, Gupta S, Vellanki P, Venkatesh S. 2020. Bayesian optimization for adaptive experimental design: a review. IEEE Access. Vol. 8:13937–13948

            23. Hasan MAM, Nasser M, Ahmad S, Molla KI. 2016. Feature selection for intrusion detection using random forest. J. Inf. Secur. Vol. 7(3):129–140

24. Huljanah M, Rustam Z, Utama S, Siswantining T. 2019. Feature selection using random forest classifier for predicting prostate cancer. IOP Conference Series: Materials Science and Engineering. Vol. 546(5):052031

25. Kindel AT, Bansal V, Catena KD, Hartshorne TH, Jaeger K, Koffman D, et al. 2019. Improving metadata infrastructure for complex surveys: insights from the fragile families challenge. Socius. Vol. 5:2378023118817378

26. Kumar T. 2022. Gendered Differences in the Factors Influencing Adolescent Academic Outcomes: An Analysis of the Fragile Families and Child Wellbeing Study (FFCWS) Using Decision Trees, a Supervised Machine Learning Method. Linköping University, Sweden.

            27. McKay S. 2019. When 4 ≈ 10,000: the power of social science knowledge in predictive performance. Socius. Vol. 5:2378023118811774

            28. Prendergast S, MacPhee D. 2021. Risk assessments at birth predict kindergarten achievement and involvement with child protective services. Prev. Sci. Vol. 22(4):432–442

            29. Raes L. 2019. Predicting GPA at age 15 in the fragile families and child wellbeing study. Socius. Vol. 5:2378023118824803

30. Rigobon DE, Jahani E, Suhara Y, AlGhoneim K, Alghunaim A, Pentland A, et al. 2019. Winning models for grade point average, grit, and layoff in the fragile families challenge. Socius. Vol. 5:2378023118820418

31. Salganik MJ, Lundberg I, Kindel AT, Ahearn CE, Al-Ghoneim K, Almaatouq A, et al. 2020. Measuring the predictability of life outcomes with a scientific mass collaboration. Proc. Natl. Acad. Sci. Vol. 117(15):8398–8403

            32. Vergara JR, Estévez PA. 2014. A review of feature selection methods based on mutual information. Neural Comput. Appl. Vol. 24:175–186

            33. Zou H. 2006. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. Vol. 101(476):1418–1429

            Author and article information

            Journal
            jdr
            Journal of Disability Research
King Salman Centre for Disability Research (Riyadh, Saudi Arabia)
            1658-9912
25 April 2024
3(4): e20240032
Affiliations
[1] Department of Mathematics and Computer Science, University of Maine at Presque Isle, Presque Isle, ME, USA (https://ror.org/00nk17n43)
[2] Prep Excellence LLC, Dayton, NJ 08810, USA
[3] Department of Sociology, University of Massachusetts Boston, Boston, MA, USA (https://ror.org/04ydmy275)
[4] Department of Computer Science, College of Arts and Sciences, University of Maine at Presque Isle, Presque Isle, ME, USA (https://ror.org/00nk17n43)
[5] Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia (https://ror.org/04jt46d36)
[6] Department of Computer Science and Engineering, BRAC University, Dhaka 1212, Bangladesh (https://ror.org/00sge8677)
[7] Inrush Electrical and Technology, Calgary, AB, Canada
[8] Department of Information Systems, College of Computer and Information Sciences, King Saud University, and King Salman Centre for Disability Research, Riyadh 11543, Saudi Arabia (https://ror.org/02f81g417)
            Author notes
            Author information
            https://orcid.org/0000-0002-3479-3606
            Article
            10.57197/JDR-2024-0032
            Copyright © 2024 The Authors.

            This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY) 4.0, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.

            History
            : 13 October 2023
            : 20 March 2024
            : 20 March 2024
            Page count
            Figures: 10, Tables: 4, References: 33, Pages: 13
            Funding
            Funded by: King Salman Center for Disability Research
            Award ID: KSRG-2023-118
The authors extend their appreciation to the King Salman Center for Disability Research (funder ID: http://dx.doi.org/10.13039/501100019345) for funding this work through Research Group no. KSRG-2023-118.
            Categories

Artificial intelligence, Human-computer interaction
disability, fragile family, grit, GPA, material hardship, machine learning, Bayesian optimization, SMOTE
