Raw data of real analytical use cases in a number of industries and companies is frequently provided in an Excel-based form. These files usually cannot be processed directly in machine learning models, but must first be cleaned and preprocessed. In this procedure, many different types of pitfalls may occur. This makes data preprocessing an essential time factor in the daily work of a data scientist.
Here, an Excel spreadsheet will be presented which in this form is closely oriented to a real case but contains only simulated figures for reasons of data and business results protection. The form and structure of the file correspond to a real case and could be encountered by a data scientist in a company in this way. Such a file can be the result of a download from a financial controlling system, e.g. SAP.
The data includes information about sold goods resp. product units, the associated turnover and hours worked. This information is grouped by month, store and department of the retailer. Moreover, information about the sales area in a specific department as well as about the opening hours of the store is provided.
The following goals of data cleansing might be addressed:
Furthermore, the data can be investigated with regard to correlations between different features and/or a regression model.