In this blog post, I am going to explain how to Automate IBM DataStage Director using Python with RegEx and pandas.
Bonus: I will also include the script to automate the mail of logs in excel format if I get comments asking for it.
Motivation:
The log file director can export is in txt format, which obviously is not very readable, as I will show you the example of it soon.
Excel is much more readable than the text file.
Also, to reduce the manual work and improve the efficiency of the team.
Obtaining Raw Data:
To obtain the job logs, we can simply print the job logs and select all entries to print all the entries in the current view to a text file. Click the link below to read the official documentation.
https://www.ibm.com/support/pages/how-do-i-save-job-log-datastage-director-file
Now we have a log.txt file that contains out raw data like this:
DataStage Report - Detail Status of jobs: Produced on: 02-05-2020 15:47:27 Project: testing Host system: TESTING Items: 1 - 3 Sorted on: Job name Item #: 1 Job name: J_CTRL_JOB Status: Finished Started: 01:46 On date: 02-05-2020 Last ran: 01:57 On date: 02-05-2020 Elapsed time: 00:11:39 Description: Item #: 2 Job name: J_TEST1 Status: Finished Started: 01:46 On date: 02-05-2020 Last ran: 01:48 On date: 02-05-2020 Elapsed time: 00:02:39 Description: Item #: 3 Job name: J_TEST2 Status: Finished Started: 01:48 On date: 02-05-2020 Last ran: 01:49 On date: 02-05-2020 Elapsed time: 00:00:28 Description: End of report.
Requirements:
- Python (Version 3.x Recommended)
- Basic knowledge of Regular Expressions (Or you can just copy-paste my code)
- Pandas library of Python to save data to excel (a bit overkill but works like a charm)
Let’s do it now:
- Just put the log.txt file and the excel.py in the same folder
- Open Terminal
- run this command:
python excel.py
Here is the script for excel.py
import re from pandas import DataFrame with open('log.txt', 'r', encoding='utf-8') as f: log = f.read() search_pattern = ( r"""\bItem #: (\d+)\s+""" r"""Job name: (.+)\s+""" r"""Status: (.+)\s+""" r"""Started: (.+)""" r"""\s+On date: (.+)\s+""" r"""Last ran: (.+)\s+""" r"""On date: (.+)\s+""" r"""Elapsed time: (.+)\s+""" """Description:(.*)""" ) pattern = re.compile(search_pattern) item, job_name, status, \ start_date_time, end_date_time, \ elapsed_time, description = ([] for _ in range(7)) matches = pattern.finditer(log) for match in matches: item.append(match.group(1)) job_name.append(match.group(2)) status.append(match.group(3)) start_date_time.append(match.group(4)+' '+match.group(5)) end_date_time.append(match.group(6)+' '+match.group(7)) elapsed_time.append(match.group(8)) description.append(match.group(9)) df = DataFrame( {'Item': item, 'Job Name': job_name, 'Status': status, 'Start Time': start_date_time, 'End Time': end_date_time, 'Elapsed Time': elapsed_time, 'Description': description, }) #print(df) df.to_excel('job_status.xlsx', sheet_name='sheet1', index=False)
On Successfully, following all the steps, and running the python script, you should see a new excel file named ‘job_status.xlsx’ in the same folder. That’s it. It is this simple automating IBM DataStage Director. Yayyy!
Mailing this excel file to the team members:
Now that we have the excel file ready which looks so much better in terms of readability and usability(you can use filters and colour the blocks and whatnot!). Some people may not need to mail this file by automation. So I am not writing about it as of now.
Github:
You can find all the code with log file on my Github Repo:
https://github.com/iamkaushal/Log-Automation-IBM-Infosphere-DataStage
Check out my other posts:
Squared Error vs Absolute Error
why use harmonic mean to calculate the f1 score
Read my StackExchange answers:
https://stats.stackexchange.com/users/272121/kaushal-sharma
Feel free to write comments in case of any doubt or query, I will reply asap.
Originally published at semanticerror.com on May 3, 2020.