Lecture 4

  1. Take this short interactive regex tutorial

  2. Find the number of words (in /usr/share/dict/words) that contain at least three as and don’t have a 's ending. What are the three most common last two letters of those words? sed’s y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?

    If cat /usr/share/dict/words doesn't return anything, you can install with:
    sudo apt install wamerican
    1. Find the number of words with 3 a's that don't end with 's.
      grep -iE '^[^aA]*[aA][^aA]*[aA][^aA]*[aA]' /usr/share/dict/words | grep -v "'s$" | tr '[:upper:]' '[:lower:]' | wc -l
    2. What are the three most common last two letters of those words?
      grep -iE '^[^aA]*[aA][^aA]*[aA][^aA]*[aA]' /usr/share/dict/words | grep -v "'s$" | tr '[:upper:]' '[:lower:]' | sed 's/.*\(..\)$/\1/' | sort | uniq -c | sort -nr | head -n 3
    3. How many of those two-letter combinations are there?
      grep -iE '^[^aA]*[aA][^aA]*[aA][^aA]*[aA]' /usr/share/dict/words | grep -v "'s$" | tr '[:upper:]' '[:lower:]' | sed 's/.*\(..\)$/\1/' | sort | uniq | wc -l
    4. Which combinations do not occur?

      First, we need to generate all two-letter combinations.
      echo {a..z}{a..z} | tr ' ' '\n' | sort > all_combinations.txt
      Then, we find the endings of words with three a's that don't end in 's.
      grep -iE '^[^aA]*[aA][^aA]*[aA][^aA]*[aA]' /usr/share/dict/words | grep -v "'s$" | tr '[:upper:]' '[:lower:]' | sed 's/.*\(..\)$/\1/' | sort | uniq > existing_combinations.txt
      Then, to find which lines differ:
      diff --new-file all_combinations.txt existing_combinations.txt | grep '^<' | sed 's/^< //'

      diff prints the lines that differ between the two files. Lines found only in the first file (all_combinations.txt) are prefixed with <, and lines found only in the second file are prefixed with >. --new-file treats a missing file as empty.

      grep '^<' selects the lines starting with <, i.e. the combinations that never occur as an ending.

      sed 's/^< //' removes the leading < character and the space that follows.
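    As an alternative to diff, the same question can be answered with comm, which compares two sorted files directly. The following is a minimal sketch, not the only way to do it: the function names endings and missing_combinations are made up for illustration, the simpler pattern 'a.*a.*a' is applied after lowercasing, and awk generates the combinations so the sketch does not depend on bash brace expansion.

```shell
# Hypothetical helper: read a word list on stdin and print the lowercase
# last two letters of every word that has at least three a's and does
# not end in 's.
endings() {
  tr '[:upper:]' '[:lower:]' \
    | grep -E 'a.*a.*a' \
    | grep -v "'s$" \
    | sed 's/.*\(..\)$/\1/'
}

# Hypothetical helper: print the combinations aa..zz that never occur
# as an ending in the word list file given as $1.
missing_combinations() {
  all=$(mktemp) || return 1
  seen=$(mktemp) || return 1
  # Generate aa..zz from the ASCII codes 97 (a) to 122 (z); already sorted.
  awk 'BEGIN { for (i = 97; i <= 122; i++)
                 for (j = 97; j <= 122; j++) printf "%c%c\n", i, j }' > "$all"
  endings < "$1" | LC_ALL=C sort -u > "$seen"
  # comm -23 keeps only the lines unique to the first file.
  LC_ALL=C comm -23 "$all" "$seen"
  rm -f "$all" "$seen"
}
```

    Usage, assuming the word list is installed: missing_combinations /usr/share/dict/words. Both inputs are sorted under LC_ALL=C because comm expects its inputs in the same collation order.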

  3. To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.

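    The shell truncates input.txt when it sets up the > redirection, before sed even starts, so sed reads an empty file and the original contents are lost. This is not particular to sed: any command whose output is redirected to its own input file behaves the same way.

    Instead, use the in-place option -i, like so: sed -i 's/REGEX/SUBSTITUTION/' input.txt. (BSD/macOS sed requires a backup suffix after -i, which may be empty: sed -i '' 's/REGEX/SUBSTITUTION/' input.txt.)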
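    To see the truncation in action, and one portable workaround for tools that have no in-place option, here is a small sketch. The helper name sed_in_place is hypothetical; it writes to a temporary file first and only then copies the result back.

```shell
# Demonstrate the problem on a scratch file.
demo=$(mktemp)
printf 'hello world\n' > "$demo"
sed 's/hello/goodbye/' "$demo" > "$demo"   # the shell truncates "$demo" before sed reads it
wc -c < "$demo"                            # byte count of the now-empty file

# Hypothetical helper: apply a sed script ($1) to a file ($2) through a
# temporary file, touching the original only if sed succeeds.
sed_in_place() {
  tmp=$(mktemp) || return 1
  if sed "$1" "$2" > "$tmp"; then
    # Copy the contents back instead of mv so the original file keeps
    # its permissions.
    cat "$tmp" > "$2"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

printf 'hello world\n' > "$demo"
sed_in_place 's/hello/goodbye/' "$demo"
cat "$demo"
```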

  4. Find your average, median, and max system boot time over the last ten boots. Use journalctl on Linux and log show on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:
    Logs begin at ...

    and

    systemd[577]: Startup finished in ...

    If your system supports the -b option of systemd-analyze, you can use that to get the time of each boot easily.

    Otherwise, this script will work:

    boot_time.sh

    #!/bin/bash
    
    # Script to calculate average, median, and max boot times for the last 10 boots
    
    boot_times=()
    
    for i in {-0..-9}; do
      echo "Boot $i:"
      
      # Get the first kernel message for the start of the boot
      start=$(journalctl --boot $i --grep "kernel" | head -n 1 | awk '{print $1, $2, $3}')
      
      # Get the "Reached target Multi-User System" message for the end of the boot
      end=$(journalctl --boot $i --grep "Reached target Multi-User System" | head -n 1 | awk '{print $1, $2, $3}')
      
      echo "Start time: $start"
      echo "End time: $end"
      
      # Calculate time difference if both start and end times are found
      if [[ -n "$start" && -n "$end" ]]; then
        start_ts=$(date -d "$start" +%s)
        end_ts=$(date -d "$end" +%s)
        boot_time=$((end_ts - start_ts))
        echo "Boot time: $boot_time seconds"
        
        # Add boot time to array
        boot_times+=("$boot_time")
      else
        echo "Could not find boot time."
      fi
      echo "------------------"
    done
    
    # Calculate and display average, median, and max boot times
    
    if [[ ${#boot_times[@]} -gt 0 ]]; then
      # Sort boot times for median calculation
      sorted_times=($(printf '%s\n' "${boot_times[@]}" | sort -n))
    
      # Calculate average
      total=0
      for time in "${boot_times[@]}"; do
        total=$((total + time))
      done
      average=$((total / ${#boot_times[@]}))
    
      # Calculate median
      mid_index=$(( ${#sorted_times[@]} / 2 ))
      if (( ${#sorted_times[@]} % 2 == 0 )); then
        median=$(( (sorted_times[mid_index-1] + sorted_times[mid_index]) / 2 ))
      else
        median=${sorted_times[$mid_index]}
      fi
    
      # Calculate max
      max=${sorted_times[-1]}
    
      # Display the results
      echo "Average boot time: $average seconds"
      echo "Median boot time: $median seconds"
      echo "Max boot time: $max seconds"
    else
      echo "No boot times found."
    fi
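    The statistics part of the script can also be written as a short sort and awk pipeline. This is a sketch under the assumption that the boot times arrive as one number per line on standard input; the function name boot_stats is made up for illustration. Unlike the integer arithmetic above, it keeps one decimal place.

```shell
# Hypothetical helper: read one number per line on stdin and print the
# average, median, and max.
boot_stats() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }          # values arrive in ascending order
    END {
      if (NR == 0) { print "No boot times found."; exit }
      # Middle value for odd counts, mean of the two middle values otherwise.
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "Average: %.1f\nMedian: %.1f\nMax: %.1f\n", sum / NR, median, v[NR]
    }'
}
```

    For example, printf '%s\n' "${boot_times[@]}" | boot_stats would replace the final if block of boot_time.sh.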
    
  5. Look for boot messages that are not shared between your past three reboots (see journalctl’s -b flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use sed '0,/STRING/d' to remove all lines previous to one that matches STRING. Next, remove any parts of the line that always vary (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).

    diff_boots.sh

    #!/bin/bash
    
    # Get logs from the last three boots
    journalctl --boot -0 > boot_logs_0.txt
    journalctl --boot -1 > boot_logs_1.txt
    journalctl --boot -2 > boot_logs_2.txt
    
    # Remove the timestamp and keep only the log message
    sed 's/^[^ ]* [^ ]* [^ ]* //' boot_logs_*.txt > cleaned_logs.txt
    
    # Sort the cleaned logs and count occurrences
    sort cleaned_logs.txt | uniq -c > counted_logs.txt
    
    # Drop lines with a count of 3 and strip the count prefix from the rest
    awk '$1 != 3 { sub(/^ *[0-9]+ /, ""); print }' counted_logs.txt > unique_boot_messages.txt
    
    echo "Unique boot messages that were not shared among the last three boots are stored in unique_boot_messages.txt"
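    One caveat with counting over the concatenated logs: a message that repeats three times within a single boot also gets a count of 3 and is dropped. A possible refinement, sketched below, de-duplicates each boot's log before counting. The function name unshared_messages is hypothetical, and it assumes each argument is a file holding one boot's log in the default journalctl format (three timestamp fields at the start of each line).

```shell
# Hypothetical helper: given one log file per boot, print the messages
# that do not appear in every one of them.
unshared_messages() {
  for log in "$@"; do
    # Strip the three timestamp fields, then keep each message once per boot.
    sed 's/^[^ ]* [^ ]* [^ ]* //' "$log" | sort -u
  done \
    | sort | uniq -c \
    | awk -v n="$#" '$1 != n { sub(/^ *[0-9]+ /, ""); print }'
}
```

    With the files created by diff_boots.sh, this would be called as unshared_messages boot_logs_0.txt boot_logs_1.txt boot_logs_2.txt.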
  6. Find an online data set like this one, this one, or maybe one from here. Fetch it using curl and extract out just two columns of numerical data. If you’re fetching HTML data, pup might be helpful. For JSON data, try jq. Find the min and max of one column in a single command, and the difference of the sum of each column in another.
    1. First you need to install pup if you don't already have it
      sudo apt-get install pup
    2. Fetch the data with curl and put it in a file so you only have to do it once.
      curl -s https://stats.wikimedia.org/EN/TablesWikipediaZZ.htm -o wikimedia_data.html
    3. Then, to get the min and max:
      pup 'table:nth-of-type(1) tr td:nth-of-type(3) text{}' < wikimedia_data.html | grep -Eo '[0-9]+' | sort -n | awk 'NR==1{print "Min:", $1} END{print "Max:", $1}'

      pup extracts the third column of the first table, grep -Eo '[0-9]+' keeps only the runs of digits (dropping whitespace and any other characters), and sort -n sorts the values numerically. awk then prints the first line (min) and the last line (max).

    4. And for the difference of the sums of the two columns:
      echo "Difference of Sums: $(echo "$(pup 'table:nth-of-type(1) tr td:nth-of-type(3) text{}' < wikimedia_data.html | grep -Eo '[0-9]+' | paste -sd+ - | bc) - $(pup 'table:nth-of-type(1) tr td:nth-of-type(4) text{}' < wikimedia_data.html | grep -Eo '[0-9]+' | paste -sd+ - | bc)" | bc)"

      This command calculates the sum of each column inside the inner $( ) substitutions and then subtracts one from the other.

      Inside each substitution, pup and grep work as in the previous step. paste -sd+ - joins all lines into a single line (-s), separated by + (-d+). The trailing - tells paste to read from standard input.

      Finally, bc evaluates the resulting arithmetic expression.
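    Both calculations can also be done in a single awk pass each, without sort or nested bc calls. The sketch below assumes the two columns have already been extracted as whitespace-separated numbers, one row per line; the function names min_max and sum_difference are made up for illustration.

```shell
# Hypothetical helper: print the min and max of the first column on stdin.
min_max() {
  awk 'NR == 1      { min = max = $1 + 0 }
       $1 + 0 < min { min = $1 + 0 }
       $1 + 0 > max { max = $1 + 0 }
       END          { if (NR > 0) print "Min:", min, "Max:", max }'
}

# Hypothetical helper: print sum(column 1) - sum(column 2) from stdin.
sum_difference() {
  awk '{ a += $1; b += $2 } END { print a - b }'
}
```

    For instance, two single-column files could be combined with paste col3.txt col4.txt | sum_difference, where col3.txt and col4.txt are placeholder names for the extracted columns.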