
Extract and sort columns from log files on Linux – CloudSavvy IT




Sorting a log file by a specific column is useful for finding information quickly. Logs are usually stored as human-readable text, so you can use command-line text manipulation tools to process them and view them more legibly.

Extract columns with cut and awk

The cut and awk utilities are two different ways to extract a column of information from text files. Both assume that the columns in your log files are separated by spaces, for example:

  column column column 

This poses a problem if the data in a column contains whitespace, such as a date ("Wed June 12"). Although cut sees this as three separate columns, you can still extract all three at once, assuming the structure of your log file is consistent.

cut is very easy to use:

  cat system.log | cut -d ' ' -f 1-6 

The cat command reads the contents of system.log and pipes it to cut. The -d flag specifies the delimiter, in this case a space. (The default is the tab character, \t.) The -f flag specifies which fields to output. This command prints the first six columns of system.log. If you only wanted to print the third column, you would use the flag -f 3.
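
For example, assuming system.log is space-separated as above, printing just the third column is a one-liner:

  # print only the third space-separated field of each line
  cat system.log | cut -d ' ' -f 3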

awk is more powerful but less succinct. cut is useful for extracting columns, for example if you want to pull a list of IP addresses out of your Apache logs. awk can rearrange entire lines, which is useful for sorting an entire document by a specific column. awk is a full programming language, but you can use a simple command to print columns:

  cat system.log | awk '{print $1, $2}' 

awk runs your command against each line in the file. By default, each line is split on whitespace and each column is stored in the variables $1, $2, $3, and so on. The print $1 command prints the first column, but there is no easy way to print a range of columns without using loops.
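
If you do need a range of columns from awk, a minimal sketch of the loop approach looks like this (the field range here is just an illustration):

  # print fields 2 through the last field (NF) of each line
  cat system.log | awk '{for (i = 2; i <= NF; i++) printf "%s ", $i; print ""}'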

An advantage of awk is that the command can refer to the whole line at once. The contents of the line are stored in the variable $0, which you can use to print the whole line. For example, you can print the third column before printing the rest of the line:

  awk '{print $3 " " $0}' 

The " " prints a space between $3 and $0. This command prints column three twice, but you can work around that by setting $3 to an empty string:

  awk '{printf $3; $3=""; print " " $0}' 

The printf command does not print a newline. Likewise, you can exclude specific columns from the output by setting them to empty strings before printing $0:

  awk '{$1=$2=$3=""; print $0}' 

You can do a lot more with awk, including regex matching, but the out-of-the-box column extraction works well for this use case.
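
As a quick illustration of that regex matching, a sketch like the following prints only the lines whose fifth column matches a pattern (the column number and the pattern are assumptions, not part of any standard log format):

  # print whole lines whose fifth field contains "error"
  cat system.log | awk '$5 ~ /error/ {print $0}'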

Sort columns with sort and uniq

The sort command can be used to order a list of data based on a specific column. The syntax is:

  sort -k 1 

where the -k flag denotes the column number. You pipe input into the command, and it spits out an ordered list. By default, sort uses alphabetical order, but it supports more options via flags, such as -n for numeric sort, -h for human-readable numeric sort (1M > 1K), -M for sorting month abbreviations, and -V for sorting file version numbers (file-1.2.3 > file-1.2.1).
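
For example, to sort on the second column numerically instead of alphabetically (the file name here is just a placeholder):

  # sort by the second whitespace-separated column, treated as a number
  sort -k 2 -n data.txt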

The uniq command filters out duplicate lines, leaving only unique ones. It only works on adjacent lines (for performance reasons), so you must use it after sort to remove duplicates across the whole file. The syntax is simple:

  sort -k 1 | uniq 

To list only the duplicate lines, use the -d flag.
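
A minimal sketch, again assuming the input is sorted first:

  # show only the lines that appear more than once
  sort system.log | uniq -d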

uniq can also count the number of duplicates with the -c flag, which makes it very good for frequency tracking. For example, to get a list of the top IP addresses hitting your Apache server, you can run the following command on your access.log:

  cat access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -nr | head 

This chain of commands cuts out the IP address column, sorts it so duplicates sit next to each other, collapses the duplicates while counting each occurrence, and then sorts by the count column in descending numerical order, giving you a list that looks like this:

  21 192.168.1.1
12 10.0.0.1
5 1.1.1.1
2 8.0.0.8 

You can apply the same techniques to your own log files, alongside other utilities such as awk and sed, to extract useful information. These chained commands are long, but you don't have to type them in every time: you can always save them in a bash script or alias them in your ~/.bashrc.
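
For instance, here is a hedged sketch of wrapping the top-IP pipeline above in a ~/.bashrc function; the function name and default log path are assumptions:

  # list the most frequent client IPs in an Apache-style access log
  topips() {
      cut -d ' ' -f 1 "${1:-/var/log/apache2/access.log}" | sort | uniq -c | sort -nr | head
  }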

Filtering data with grep and awk

grep is a very simple command: you give it a search term and some input, and it spits out every line that contains that search term. For example, to search your Apache access log for 404 errors, you can do the following:

  cat access.log | grep "404" 

which would spit out a list of log entries matching the given text. However, grep cannot narrow the search to a specific column, so this command will also match lines that happen to contain "404" somewhere else. If you only want to search the HTTP status code column, you must use awk:

  cat access.log | awk '{if ($9 == "404") print $0;}' 

With awk you also have the advantage of being able to perform negative searches. For example, you can search for all log entries that did not return with status code 200 (OK):

  cat access.log | awk '{if ($9 != "200") print $0;}' 

You also have access to all of the programmatic features that awk provides.
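
For example, a small sketch that combines a regex with the status column lists every 5xx server error (the column number assumes the default combined log format):

  # match any status code in the 500 range
  cat access.log | awk '$9 ~ /^5/ {print $0}'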

GUI Options for Web Logs


GoAccess is a CLI tool for monitoring your web server's access log in real time, and it can sort by any useful field. It runs entirely in your terminal, so you can use it over SSH, but it also has a much more intuitive web interface.
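
A typical invocation looks something like this; the log path and the COMBINED log format are assumptions that may need adjusting for your server:

  # interactive terminal dashboard
  goaccess /var/log/apache2/access.log --log-format=COMBINED
  # or generate an HTML report for the web interface
  goaccess /var/log/apache2/access.log --log-format=COMBINED -o report.html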

apachetop is another Apache-specific utility that can filter and sort by columns in your access log. It runs directly against your access.log in real time.
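
A typical way to point it at your log (the path is an assumption) is:

  apachetop -f /var/log/apache2/access.log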

