Analysis of big data from New York taxi trip 2023: revenue prediction using ordinary least squares solution and limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithms
International Journal of Electrical and Computer Engineering
Abstract
This study explores the prediction of taxi trip fares using two linear regression methods: the normal equations (the ordinary least squares (OLS) solution) and limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS). Using a dataset of New York City yellow taxi trips from 2023, the analysis involves data cleaning, feature engineering, and model training. The data consist of over 12 million records and are managed and processed with Spark, whose driver and executor memory are configured to handle the Parquet-format data stored on the Hadoop Distributed File System (HDFS) efficiently. Key variables, including passenger count, trip distance, fare amount, and tip amount, were analyzed for correlation. Models were trained on an 80-20 train-test split, and their performance was evaluated using root-mean-square error (RMSE) and mean squared error (MSE). Results show that both methods provide comparable accuracy, with slight differences in coefficients and training time. Additionally, vendor performance metrics, including total trips, average trip distance, fare amount, and tip amount, were analyzed to reveal trends and inform strategic decisions for fleet management. The analysis demonstrates the efficacy of linear regression techniques in predicting taxi fares and offers insights for optimizing taxi operations.
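To illustrate why the two methods are expected to yield comparable accuracy, the following is a minimal sketch in pure Python, not the study's Spark pipeline. It fits fare = b0 + b1 * distance on synthetic trips (the generating coefficients 3.0 and 2.5, the noise level, and all function names are assumptions made for illustration), once with the closed-form normal equations and once with a small hand-written L-BFGS loop, and then reports MSE and RMSE on the held-out 20%.

```python
import math
import random

def make_trips(n, seed=0):
    """Synthetic (distance, fare) pairs; illustrative only, NOT the 2023 taxi data."""
    rng = random.Random(seed)
    trips = []
    for _ in range(n):
        d = rng.uniform(0.5, 15.0)
        trips.append((d, 3.0 + 2.5 * d + rng.gauss(0.0, 1.0)))  # assumed fare model
    return trips

def fit_normal_equations(data):
    """Closed-form OLS solution for one feature plus an intercept."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return [(sy - b1 * sx) / n, b1]

def loss_and_grad(w, data):
    """Mean squared error and its gradient with respect to [b0, b1]."""
    n = len(data)
    loss = g0 = g1 = 0.0
    for x, y in data:
        r = w[0] + w[1] * x - y
        loss += r * r
        g0 += 2.0 * r
        g1 += 2.0 * r * x
    return loss / n, [g0 / n, g1 / n]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def fit_lbfgs(data, m=5, max_iter=100, tol=1e-8):
    """L-BFGS with two-loop recursion and a backtracking (Armijo) line search."""
    w = [0.0, 0.0]
    loss, g = loss_and_grad(w, data)
    history = []  # most recent (s, y) pairs, oldest first
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break
        q, alphas = g[:], []
        for s, y in reversed(history):           # first loop: newest to oldest
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:                               # initial inverse-Hessian scaling
            s, y = history[-1]
            gamma = dot(s, y) / dot(y, y)
            q = [gamma * qi for qi in q]
        for (s, y), a in zip(history, reversed(alphas)):  # second loop: oldest to newest
            b = dot(y, q) / dot(y, s)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        d = [-qi for qi in q]                     # search direction
        step, gd = 1.0, dot(g, d)
        while True:                               # halve the step until Armijo holds
            w_new = [wi + step * di for wi, di in zip(w, d)]
            new_loss, new_g = loss_and_grad(w_new, data)
            if new_loss <= loss + 1e-4 * step * gd or step < 1e-12:
                break
            step *= 0.5
        s = [a - b for a, b in zip(w_new, w)]
        y = [a - b for a, b in zip(new_g, g)]
        if dot(s, y) > 1e-12:                     # keep only pairs with positive curvature
            history = (history + [(s, y)])[-m:]
        w, loss, g = w_new, new_loss, new_g
    return w

trips = make_trips(500)
split = int(0.8 * len(trips))                     # 80-20 train-test split
train, test = trips[:split], trips[split:]
w_ols = fit_normal_equations(train)
w_lbfgs = fit_lbfgs(train)
for name, w in (("normal equations", w_ols), ("L-BFGS", w_lbfgs)):
    test_mse, _ = loss_and_grad(w, test)
    print(f"{name}: b0={w[0]:.4f} b1={w[1]:.4f} "
          f"MSE={test_mse:.4f} RMSE={math.sqrt(test_mse):.4f}")
```

Because the squared-error objective is convex, both solvers target the same minimizer; any differences in coefficients come from the iterative solver's stopping tolerance, which is consistent with the "comparable accuracy, slight differences" result reported above. At the scale of millions of records, a distributed implementation would be used instead of this single-machine sketch.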