Machine learning based time series forecasting of chemical levels from the East Palestine, Ohio train derailment site
=====================================================================================================================

* Vikas Ramachandra
* Mohit Sethi

## Abstract

The paper presents a time series analysis of the air, water and soil data from the East Palestine, Ohio, train derailment site. The goal of the analysis is to investigate the temporal pattern of chemical pollutant levels in air, water and soil and to build machine learning and statistical models for forecasting various chemical concentrations over time.

## Introduction

The release of hazardous materials into the environment due to industrial accidents, transportation mishaps, and natural disasters poses a significant risk to public health and safety. Such incidents require prompt and effective response to mitigate the impact on the affected population and the surrounding environment. In the case of the train derailment that occurred in East Palestine, Ohio on 3 February 2023, 38 rail cars carrying hazardous material derailed, resulting in the release of hazardous material into the air, water, and soil.

The United States Environmental Protection Agency (EPA) and other agencies conducted air sampling in the area to monitor the concentration of the hazardous material and assess the potential risk to human health. The air sampling data collected at various locations in the vicinity of the incident site is a valuable source of information for evaluating the extent of contamination and predicting the future trend of the pollutant concentration. Time series analysis of the air sampling data can provide insights into the temporal behavior of the pollutant concentration and help identify any underlying patterns or trends.

In this paper, we perform a time series analysis of the air sampling data collected by the EPA following the East Palestine train derailment. Our objective is to identify any temporal patterns or trends in the pollutant concentration and evaluate the effectiveness of the remediation efforts. We focus on the “Result\_Final\_Txt” variable, which represents the concentration of the hazardous material in the air at different sampling locations over time. We then present our methodology for time series analysis and discuss the results of our analysis. Finally, we conclude with a discussion of the implications of our findings and the potential for future research in this area.

## Dataset

### EPA Lab Results: Air (through 02/22/23) - East Palestine, OH Response (csv) [1]

([https://www.epa.gov/oh/air-sampling-data-east-palestine-ohio-train-derailment](https://www.epa.gov/oh/air-sampling-data-east-palestine-ohio-train-derailment))

The dataset contains 1553 records starting from 04/02/2023 to 22/02/2023. The dataset contains following features: [‘Location’, ‘Samp\_No’, ‘SampleDate\_txt’, ‘SampleTime’, ‘Matrix’, ‘SampleMedia’, ‘Activity’, Analytical\_Method’, ‘CAS\_NO’, ‘Analyte’, ‘Result\_Units’, ‘Reporting\_Limit’, ‘Validation\_Level’, ‘Result\_Final\_Txt’, ‘Result\_Qualifier\_Final’, ‘RL\_Comparison’].

Our analysis is focussed on the ‘Results\_Final\_Txt’ which represents the concentration level of the hazardous chemical. The ‘Analyte’ and ‘CAS_NO” (CAS Registry Number) [4] help identify the chemicals. There are 49 unique chemicals. The records before 17/02/2023 are used for the training set(1036) and the remaining are used for the validation set(517).

### EPA Lab Results: Water (through 02/14/23) - East Palestine, OH Response (csv) [2]

([https://www.epa.gov/oh/water-sampling-data-east-palestine-ohio-train-derailment](https://www.epa.gov/oh/water-sampling-data-east-palestine-ohio-train-derailment))

The dataset contains 1822 records starting from 04/02/2023 to 17/02/2023. The dataset contains following features: [‘Location’, ‘Samp\_No’, ‘SampleDate\_txt’, ‘SampleTime’, ‘Matrix’, ‘SampleMedia’, ‘Activity’, Analytical\_Method’, ‘CAS\_NO’, ‘Analyte’, ‘Result\_Units’, ‘Reporting\_Limit’, ‘Validation\_Level’, ‘Result\_Final\_Txt’, ‘Result\_Qualifier\_Final’, ‘RL\_Comparison’].

Our analysis is focussed on the ‘Results\_Final\_Txt’ which represents the concentration level of the hazardous chemical. The ‘Analyte’ and ‘CAS_NO” (CAS Registry Number) [4] help identify the chemicals. There are 134 unique chemicals. The records before 10/02/2023 are used for the training set(1166) and the remaining are used for the validation set(656).

### EPA Lab Results: Soil and Sediment (through 02/10/23) - East Palestine, OH Response (csv) [3]

([https://www.epa.gov/oh/soil-and-sediment-sampling-data-east-palestine-ohio-train-derailment](https://www.epa.gov/oh/soil-and-sediment-sampling-data-east-palestine-ohio-train-derailment)) The dataset contains 1082 records starting from 08/02/2023 to 10/02/2023. The dataset contains following features: [‘Location’, ‘Samp\_No’, ‘SampleDate\_txt’, ‘SampleTime’, ‘Matrix’, ‘SampleMedia’, ‘Activity’, Analytical\_Method’, ‘CAS\_NO’, ‘Analyte’, ‘Result\_Units’, ‘Reporting\_Limit’, ‘Validation\_Level’, ‘Result\_Final\_Txt’, ‘Result\_Qualifier\_Final’, ‘RL\_Comparison’].

Our analysis is focussed on the ‘Results\_Final\_Txt’ which represents the concentration level of the hazardous chemical. The ‘Analyte’ and ‘CAS_NO” (CAS Registry Number) [4] help identify the chemicals. There are 128 unique chemicals. The records before 10/02/2023 are used for the training set(512) and the remaining are used for the validation set(570).

We combined the ‘SampleDate_txt’ and ‘SampleTime’ features into a single feature called ‘Date-Time’

### Model & Results

#### Air Sample

We use the ‘Analyte’ feature to group the records for each chemical. There are 49 unique hazardous chemicals. We trained separate models for every chemical. We selected the following features for the prediction of our target variable ‘Result\_Final\_Txt’. ‘Location’, ‘Date\_Time’ ‘SampleMedia’, ‘Analytical\_Method’, ‘Reporting\_Limit’, ‘Validation\_Level’, ‘, ‘RL_Comparison’. We use the Augmented Dickey–Fuller test whether the time series is stationary or not. [5] After experimentation with several forecasting models, ARIMA, SARIMA, LSTM Forecasting [8]. We finally selected the FBProphet forecasting model as it best fitted our training data.FBProphet [6][7] is a time series forecasting model developed by Facebook’s Data Science team. It is designed to make accurate predictions of future values based on historical time series data, while also handling missing values, outliers, and trend changes. FBProphet uses a decomposable time series model with three main components: trend, seasonality, and holidays. It employs Bayesian inference to fit the model and provides uncertainty intervals for the predictions. We also use the metrics mean square error and mean absolute percentage error [9] to evaluate the performance of the model in predicting the target variable ‘Result\_Final\_Txt’, which tells the concentration level of the hazardous chemical in the air. Fig.1, Fig.2, Fig.3, Fig.4 show the prediction of our model against the test data for 4 different chemicals.

![Fig. 1](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F1.medium.gif)

[Fig. 1](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F1)

Fig. 1 

![Fig. 2](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F2.medium.gif)

[Fig. 2](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F2)

Fig. 2 

![Fig. 3](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F3.medium.gif)

[Fig. 3](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F3)

Fig. 3 

![Fig. 4](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F4.medium.gif)

[Fig. 4](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F4)

Fig. 4 

Mean Square Error (MSE) is a commonly used statistical metric that measures the average squared difference between the predicted values and the actual values.

![Formula][1]</img>  Mean Absolute Percentage Error (MAPE) is a statistical metric that measures the average absolute percentage difference between the predicted values and the actual values. ![Formula][2]</img>  Table 1 shows the performance metrics for each chemical in the Air Sample Data.

View this table:
[Table 1](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T1)

Table 1 

#### Water Sample

We use the exact same procedure for the Water Sample Data. We use the ‘Analyte’ feature to group the records for each chemical. There are 134 unique hazardous chemicals. We trained separate models for every chemical.Fig.5, Fig.6, Fig.7, Fig.8 show the prediction of our model against the test data for 4 different chemicals.

![Fig. 5](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F5.medium.gif)

[Fig. 5](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F5)

Fig. 5 

![Fig. 6](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F6.medium.gif)

[Fig. 6](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F6)

Fig. 6 

![Fig. 7](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F7.medium.gif)

[Fig. 7](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F7)

Fig. 7 

![Fig. 8](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F8.medium.gif)

[Fig. 8](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F8)

Fig. 8 

Table 2, 3, 4, 5 shows the performance metrics for each chemical in the Water Sample Data.

View this table:
[Table 2](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T2)

Table 2 

View this table:
[Table 3](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T3)

Table 3 

View this table:
[Table 4](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T4)

Table 4 

View this table:
[Table 5](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T5)

Table 5 

#### Soil Sample

We use the exact same procedure for the Soil Sample Data. We use the ‘Analyte’ feature to group the records for each chemical. There are 128 unique hazardous chemicals. We trained separate models for every chemical.Fig.9, Fig.10, Fig.11, Fig.12 show the prediction of our model against the test data for 4 different chemicals.

![Fig. 9](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F9.medium.gif)

[Fig. 9](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F9)

Fig. 9 

![Fig. 10](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F10.medium.gif)

[Fig. 10](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F10)

Fig. 10 

![Fig. 11](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F11.medium.gif)

[Fig. 11](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F11)

Fig. 11 

![Fig. 12](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/09/2023.03.07.23286923/F12.medium.gif)

[Fig. 12](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/F12)

Fig. 12 

Table 6, 7, 8, 9 shows the performance metrics for each chemical in the Soil Sample Data.

View this table:
[Table 6](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T6)

Table 6 

View this table:
[Table 7](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T7)

Table 7 

View this table:
[Table 8](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T8)

Table 8 

View this table:
[Table 9](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T9)

Table 9 

## Air Sample with Water Sample

Here we try to predict the target for Air Sample data by combining both Air Sample and Water Sample Data and using Water Sample data as an auxiliary variable. Table 10, 11 shows the performance metrics for each chemical

View this table:
[Table 10](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T10)

Table 10 

View this table:
[Table 11](http://medrxiv.org/content/early/2023/03/09/2023.03.07.23286923/T11)

Table 11 

The mean and median for MSE for Air Sample Data is 73.4789 and 0.0006 respectively. The mean and median for MAPE for Air Sample Data is 4.2328 and 0.198 respectively.

The mean and median for MSE for Water Sample Data is 938175511251.79 and 28.4845 respectively.

The mean and median for MAPE for Water Sample Data is 466.434 and 5.7653 respectively.

The mean and median for MSE for Soil Sample Data is17885529841550.645 and 436.4504 respectively.

The mean and median for MAPE for Soil Sample Data is 663.1171 and 57.37025 respectively.

The mean and median for MSE for combined Air and Water Ssmple Data is 267.11765 and 0.0014 respectively.

The mean and median for MAPE for Soil Sample Data is 10.8287 and 0.1722 respectively.

## Conclusion

In conclusion the time series forecasting of Ohio train derailment data has provided valuable insights into the patterns of the hazardous chemical present in air, water and soil.

The time series forecasting models used in this research have demonstrated strong predictive power when using Air Sample Data and Air Sample Data with Water Sample Data used as an auxiliary variable.

The relatively poor performance of the soil and water model with respect to the air model can be alluded to the fact that there is a lack of temporal data for both soil and water. While Air Sample Data has a total data collection of 18 days, data samples for water and soil have only been collected for 11 and 2 days respectively.

We hope that this research comes to use and the models can be used for forecasting hazardous chemical concentration over time.

## Data Availability

All data produced are available online on EPA website.

## Footnotes

*   vikas.ramachandra{at}jacobs.ucsd.edu

*   Received March 7, 2023.
*   Revision received March 7, 2023.
*   Accepted March 9, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  1.[https://www.epa.gov/oh/air-sampling-data-east-palestine-ohio-train-derailment](https://www.epa.gov/oh/air-sampling-data-east-palestine-ohio-train-derailment)
    
    

2.  2.[https://www.epa.gov/oh/water-sampling-data-east-palestine-ohio-train-derailment](https://www.epa.gov/oh/water-sampling-data-east-palestine-ohio-train-derailment)
    
    

3.  3.[https://www.epa.gov/oh/soil-and-sediment-sampling-data-east-palestine-ohio-train-derailment](https://www.epa.gov/oh/soil-and-sediment-sampling-data-east-palestine-ohio-train-derailment)
    
    

4.  4.[https://en.wikipedia.org/wiki/CAS\_Registry\_Number](https://en.wikipedia.org/wiki/CAS_Registry_Number)
    
    

5.  5.[https://analyticsindiamag.com/complete-guide-to-dickey-fuller-test-in-time-series-analysis/](https://analyticsindiamag.com/complete-guide-to-dickey-fuller-test-in-time-series-analysis/)
    
    

6.  6.[https://facebook.github.io/prophet/](https://facebook.github.io/prophet/)
    
    

7.  7.[https://github.com/facebook/prophet](https://github.com/facebook/prophet)
    
    

8.  8.[https://medium.com/mlearning-ai/multivariate-time-series-forecasting-using-rnn-lstm-8d840f3f9aa7](https://medium.com/mlearning-ai/multivariate-time-series-forecasting-using-rnn-lstm-8d840f3f9aa7)
    
    

9.  9.[https://iopscience.iop.org/article/10.1088/1742-6596/930/1/012002/meta](https://iopscience.iop.org/article/10.1088/1742-6596/930/1/012002/meta)

 [1]: /embed/graphic-5.gif
 [2]: /embed/graphic-6.gif