Project 1: Log Parser

Introduction

This page has 3 general sections:

  • the basic log parser script that was written by Gemini

  • the questions I asked and what I learned from seeing how it's written

  • the final log parser script I wrote which will include minor improvements such as allowing the user to define the name of the log file to be parsed, to define the status(es) of interest, and an output which categorizes the IPs by status.

Getting Gemini to create the initial script

Here's my prompt to Gemini:

This is the output script that I got:

Understanding what is going on in each line

  • What is "re", why "re", and how did we get to "re"? We asked Gemini to find IPs in lines of text, and an IP follows a predictable pattern, which is exactly what the regular expression module (re) is for. How does a beginner like me figure out which libraries/modules to use? Googling and experience are the answers I kept getting.

  • What is "try"? The try...except block lets us run a block of code and handle any errors it raises, so that the script doesn't crap out on us.

  • What is "__name__" and "__main__"? The if __name__ == "__main__": check is a conditional block that ensures the code runs only when the file is executed directly, not when it's imported. Apparently it's best practice, so... better start using it now.

  • Why is there "print()" and "print(f"...")"? print() just prints whatever is in the brackets. print(f"...") lets us embed variables and expressions directly inside the string. For example, print("my name is", name, "and I live in", location) and print(f"my name is {name} and I live in {location}") print the same thing, and name and location have to be defined first in both cases. The f-string version also lets us put operations inside the curly brackets.
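To make the four answers above concrete, here is a minimal sketch that uses all of them in one small file. It is not the Gemini script; the names find_ips and read_lines, the sample log line, and the simple IPv4-shaped pattern are assumptions made for illustration.

```python
import re

# Simple IPv4-shaped pattern: four groups of 1-3 digits separated by dots.
# It does not check that each group is in the range 0-255.
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def find_ips(line):
    """Return every IPv4-looking string found in one line of text."""
    return IP_PATTERN.findall(line)


def read_lines(path):
    """Return the file's lines, or an empty list if the file is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.readlines()
    except FileNotFoundError:
        # Handle the error instead of letting the script stop with a traceback.
        print(f"Could not find {path}")
        return []


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    sample = "192.168.1.10 - GET /index.html 200 OK"  # hypothetical log line
    ips = find_ips(sample)
    print("found", len(ips), "IP(s):", ips)  # plain print, values separated by commas
    print(f"found {len(ips)} IP(s): {ips}")  # f-string, expressions inside the {}
```

Both print calls produce the same line; the f-string version just keeps the text and the values in one place.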

This section is to share an idea of the granularity that I got down to. There are other questions I asked obviously, but I won't be documenting everything for the sake of brevity and dignity.

Writing my own script with improvements

As a beginner, I chart out the following points before I start writing the program:

  • Objective Write a program that takes a log file containing IPs and statuses, and prints out the IPs of the lines that have any of 5 statuses. There are way too many HTTP statuses and this is just my beginner project, so I'm choosing these 5: 200 OK, 401 Unauthorized, 403 Forbidden, 404 Not Found, 418 I'm a teapot. Unlike the Gemini script, which runs on a fixed input, my script should let the user pass the log file and the desired statuses as arguments in the command line. These two should not be hard-coded into the script itself. For instance, the command should look like: python logparserV2.py log.txt "200 OK" "401 Unauthorized". The resulting IPs should also be categorized by status, not dumped as a blob of IPs.

  • Boundaries The input will be a log file and the output will be a list of IPs of interest, categorized by status and printed in the command line.

  • Core logic
      i. validate the number of arguments to ensure there is at least 1 python file, 1 log file, and 1 status code
      ii. open the log text file
      iii. read the file one line at a time, so we don't load the whole thing into memory in one go
      iv. for each line, check if it has any of the defined statuses
      v. if it does, add its IP to the appropriate list
      vi. after all lines are checked, close the file
      vii. create a txt file and save the lists of collected IPs into it
      viii. if none of the lists has any IPs, print a message saying so

  • Structure
      i. we need lists to hold the IPs of the desired statuses
      ii. but since there are multiple statuses, we need a function that creates a new list for each status

  • Error handling To prevent our program from crashing, we need to address potential errors such as the log file not being found or no IPs with the desired statuses.

  • Output The output should be IPs categorized by status code and can be printed in the command line.
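The "read one line at a time" and "error handling" points from the plan can be sketched as follows. This is only an illustration under assumptions: the function names matching_lines and report, the messages, and the sample log lines are made up here, and a status is matched by simply checking whether its text appears in the line.

```python
import os
import tempfile


def matching_lines(path, statuses):
    """Yield the lines of the file at path that contain any wanted status.

    Looping over the file object reads one line at a time, so the whole
    log never sits in memory. The with block closes the file for us.
    """
    with open(path, encoding="utf-8") as log_file:
        for line in log_file:
            if any(status in line for status in statuses):
                yield line.rstrip("\n")


def report(path, statuses):
    """Return matching lines, or a short message if none matched or the file is missing."""
    try:
        found = list(matching_lines(path, statuses))
    except FileNotFoundError:
        return f"Log file not found: {path}"
    if not found:
        return "No lines with the requested statuses."
    return "\n".join(found)


# Small demo on a throwaway file with two hypothetical log lines.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "log.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("10.0.0.1 - GET / 200 OK\n")
        handle.write("10.0.0.2 - GET /x 403 Forbidden\n")
    print(report(sample_path, ["200 OK"]))            # one matching line
    print(report(sample_path, ["418 I'm a teapot"]))  # nothing matches
    print(report(os.path.join(folder, "missing.txt"), ["200 OK"]))  # file not found
```

Returning a message instead of raising keeps the "file not found" and "no matches" cases from crashing the program, which is the behaviour the error handling point asks for.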

Learning about the sys module (sys.argv)

If I run this command: python script.py data.txt "200 OK", the pieces of the command end up in the sys.argv list like so:

  • index 0 is the name of the script - script.py

  • index 1 is the first argument - data.txt

  • index 2 is the second argument - 200 OK (the quotes keep it together as one argument); any further arguments follow at index 3, 4, etc.
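A small sketch of that indexing, assuming a hypothetical helper called parse_args that takes an argv-style list. It uses a hard-coded list so it can be run anywhere; in the real script the list would be sys.argv.

```python
import sys


def parse_args(argv):
    """Split an argv-style list into (log_path, statuses).

    argv[0] is the script name, argv[1] the log file, argv[2:] the statuses.
    Raises ValueError if the log file or all statuses are missing.
    """
    if len(argv) < 3:
        raise ValueError("usage: python script.py <log file> <status> [<status> ...]")
    return argv[1], argv[2:]


# Simulates the command: python script.py data.txt "200 OK"
example_argv = ["script.py", "data.txt", "200 OK"]
log_path, statuses = parse_args(example_argv)
print(f"log file: {log_path}")
print(f"statuses: {statuses}")

# For comparison, this is what the current process was actually started with.
print(f"real sys.argv: {sys.argv}")
```

Slicing with argv[2:] is what lets the user pass one status or several without the script needing to know the count in advance.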

Learning about Dictionary (key-value pairs)

I need this because there are several statuses, each with multiple IPs, and the single set from before doesn't cut it anymore. So the script needs to create a key-value pair for each status that has at least 1 corresponding IP.
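Here is a minimal sketch of that idea, not the final script. The function name group_ips_by_status, the sample log lines, and the simple IPv4-shaped pattern are assumptions for illustration. A key is only created the first time a status is actually seen, so statuses with no IPs never appear in the dictionary.

```python
import re

# Simple IPv4-shaped pattern; it does not validate the 0-255 range.
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def group_ips_by_status(lines, statuses):
    """Map each wanted status to the list of IPs from lines containing it."""
    grouped = {}
    for line in lines:
        match = IP_PATTERN.search(line)
        if match is None:
            continue  # no IP on this line, nothing to record
        for status in statuses:
            if status in line:
                # setdefault creates the empty list the first time a status shows up.
                grouped.setdefault(status, []).append(match.group())
    return grouped


# Hypothetical log lines; the real log format may differ.
sample_lines = [
    "10.0.0.1 - GET /index.html 200 OK",
    "10.0.0.2 - GET /admin 401 Unauthorized",
    "10.0.0.3 - GET /about.html 200 OK",
    "10.0.0.4 - GET /missing 404 Not Found",
]
result = group_ips_by_status(sample_lines, ["200 OK", "401 Unauthorized"])
for status, ips in result.items():
    print(f"{status}: {', '.join(ips)}")
```

With these sample lines the output has one row per status: 200 OK with 10.0.0.1 and 10.0.0.3, then 401 Unauthorized with 10.0.0.2. The 404 line is skipped because that status wasn't requested.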

Here's the script I came up with after hours of raging and crying:

Time to run this turd. If I get an error on line whatever one more time...

Setting up our virtual environment first. It isn't strictly necessary here since the script is simple, but it's a best practice and will be helpful once we get to more complex scripts with multiple package versions and potential conflicts.

Ran the script and boom, neat output.

This is a super basic script with no external libraries, so there's no need for a requirements.txt file; it would be empty anyway.

I've documented exactly what each line does in my Jupyter notebook here. Check it out on the GitHub link because this post is already goddamn long.

Bonus section:

We can also run this in a Jupyter notebook. Here are the steps from the command line:

  1. Set up the virtual environment: python -m venv <name>

  2. Activate the virtual environment (on Windows): <name>\Scripts\activate.bat

  3. Install Jupyter and any packages: pip install jupyterlab

  4. Launch Jupyter: jupyter lab

  5. This will open Jupyter in your default browser.
