
Parsing HTML in Bash – CloudSavvy IT




I have a process where I need to copy all the images from a web page. I had always done this with xmllint, which processes an XML or HTML file and prints the entries you specify. But when my server hosting provider upgraded their systems, the new system didn't include xmllint. So I had to find another way to extract a list of images from an HTML page. It turns out you can do this in Bash.

You might not think Bash can parse data files, but it can with a little clever thinking. Bash, like other UNIX shells before it, can parse a file one line at a time with the built-in read statement.

By default, the read statement scans a line of data and splits it into fields. Usually, read splits fields at spaces and tabs, and a newline ends each line, but you can change this behavior by setting the internal field separator (IFS) and the end-of-line delimiter (-d).
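
To make the two settings concrete, here is a minimal sketch that changes each one separately. The sample record and the variable names (user, uid, home, first) are invented for this illustration, and the -r option (which keeps backslashes literal) is an addition the article's own snippets do not use.

```shell
#!/bin/bash
# A colon-separated record, made up for this example.
record='alice:1001:/home/alice'

# Setting IFS on the same line changes the field separator for this read only.
IFS=':' read -r user uid home <<< "$record"
echo "user=$user uid=$uid home=$home"

# -d ';' makes read stop at the first semicolon instead of at a newline.
read -r -d ';' first <<< 'one;two;three'
echo "first=$first"
```

The HTML trick in this article combines both settings: one character ends each read, and a different character splits what was read.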

To parse an HTML file with read, set IFS to a greater-than symbol (>) and the delimiter to a less-than symbol (<). Each time Bash scans a line, it reads up to the next < (the start of an HTML tag), then splits that data at the > (the end of an HTML tag). This sample code reads one such line and splits the data into the TAG and VALUE variables:

local IFS='>'
read -d '<' TAG VALUE

Let's see how this works. Consider this simple HTML file:

<img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/logo.png"
alt="My logo" />
<p>some text</p>

The first time read parses this file, it stops at the first < symbol. Since < is the first character of this sample input, Bash finds only an empty string. The resulting TAG and VALUE strings are also empty. But that's fine for my use case.

The next time Bash reads the input, it gets img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/logo.png"↲alt="My logo" />↲ with a newline just before alt, and stops before the < symbol on the next line. Then read splits the line at the > symbol, which leaves TAG with img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/logo.png"↲alt="My logo" / and VALUE with just a newline.

The third time read parses the HTML file, it gets p>some text. Bash splits the string at the >, leaving TAG with p and VALUE with some text.
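
The three reads described above can be traced with a short sketch. It assumes a sample with an img tag split across two lines followed by a p element, with the image URL shortened to logo.png for readability. The function name trace_reads and the counter n are hypothetical helpers, and -r is added to keep backslashes literal.

```shell
#!/bin/bash
# Sample input: an img tag split over two lines, then a paragraph.
sample='<img src="logo.png"
alt="My logo" />
<p>some text</p>'

# Print TAG and VALUE after every successful read.
trace_reads () {
    local IFS='>'
    local n=0 TAG VALUE
    while read -r -d '<' TAG VALUE; do
        n=$((n + 1))
        printf 'read %d: TAG=[%s] VALUE=[%s]\n' "$n" "$TAG" "$VALUE"
    done
}

trace_reads <<< "$sample"
```

The loop body runs three times. The text after the last < (the closing /p tag) never reaches the loop body, because read hits the end of the input before finding another < and returns a non-zero status.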

Now that you understand how read works, it is easy to parse a longer HTML file with Bash. Start with a Bash function called xmlgetnext to parse the data with read, since you will do this over and over in the script. I named my function xmlgetnext to remind me that it is a replacement for the Linux xmllint program, but I could just as well have named it htmlgetnext.

xmlgetnext () {
local IFS='>'
read -d '<' TAG VALUE
}

Now call that xmlgetnext function to parse the HTML file. This is my complete htmltags script:

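#!/bin/bash
# print a list of all html tags

# read up to the next '<', then split at '>' into TAG and VALUE
xmlgetnext () {
local IFS='>'
read -d '<' TAG VALUE
}

# $TAG is left unquoted on purpose so embedded newlines collapse to spaces
cat "$1" | while xmlgetnext ; do echo $TAG ; done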

The last line is the key. It loops through the file with xmlgetnext to parse the HTML, and prints only the TAG entries. And because $TAG is unquoted, the shell splits it at the standard field separators before echo prints it, so any entry that contains a newline, such as img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/logo.png"↲alt="My logo" /, is printed on a single line, such as img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/logo.png" alt="My logo" /.
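
The newline-joining behavior is easy to check in isolation. This is a small sketch, assuming a TAG value like the one in the example but with the URL shortened to logo.png:

```shell
#!/bin/bash
# A value with an embedded newline, like the img entry above.
TAG='img src="logo.png"
alt="My logo" /'

# Unquoted: the shell splits $TAG at spaces, tabs, and newlines,
# and echo joins the resulting words with single spaces.
echo $TAG

# Quoted: the value is passed through unchanged, newline included.
echo "$TAG"
```

The first echo prints one line and the second prints two. One caveat: an unquoted expansion is also subject to filename globbing, so a tag containing a character such as * could print differently than expected.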

Screenshot: parsing HTML in Bash

To get just the list of images, I pipe the output of this script through grep to print only the lines that begin with an img tag.
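
One way that pipeline might look is sketched below. The list_images function, the temporary demo page, and the '^img ' pattern are assumptions for illustration, not the exact command used here. The pattern matches only img tags that carry at least one attribute.

```shell
#!/bin/bash
xmlgetnext () {
    local IFS='>'
    read -r -d '<' TAG VALUE
}

# Print only the tags that start with "img ".
# $1 is the HTML file to scan.
list_images () {
    while xmlgetnext; do echo $TAG; done < "$1" | grep '^img '
}

# Demo on a throwaway page with made-up content.
page=$(mktemp)
printf '%s\n' '<h1>Title</h1>' '<img src="logo.png"' 'alt="My logo" />' '<p>some text</p>' > "$page"
result=$(list_images "$page")
rm -f "$page"
echo "$result"
```

Because each img entry has already been joined onto a single line, a line-oriented tool like grep can match it even when the tag spans several lines in the source file.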

