How to Parse File Names Correctly in Bash – CloudSavvy IT

Bash Shell

Bash's file naming rules are very permissive, and it is easy to write a script or one-liner that parses file names incorrectly. Learn to parse file names correctly and make sure your scripts work as intended!

The problem with correctly parsing file names in Bash

If you have been using Bash for a while and have written scripts in it, you have likely run into problems parsing file names. Let's take a look at a simple example of what can go wrong:

touch 'a
b'

Setting up a file with a newline character in the file name

Here we have created a file with an actual newline (line feed) in its name, entered by pressing Enter after the a. Bash's file naming rules are very permissive, and while it is neat in some ways that special characters like this can be used in a file name, let's see how this file fares when we act on it:

ls | xargs rm

The problem when dealing with a file name containing a newline

That didn't work. xargs takes its input from ls (via the | pipe) and passes it to rm, but something went wrong along the way!

What went wrong is that xargs takes the output of ls literally: the newline (the Enter we typed) inside the file name is treated by xargs as an item terminator, rather than being passed on to rm as part of the name, as it should be.
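To see why xargs cannot tell the difference, it can help to look at the raw bytes ls writes into the pipe. The following is a minimal sketch, assuming a scratch directory created with mktemp -d and a GNU-style ls that prints names unmodified when its output is not a terminal; the variable names are illustrative only.

```shell
# Scratch directory so no real files are touched (illustrative assumption).
workdir=$(mktemp -d)

# Create one file whose name is: a, newline, b.
touch -- "$workdir/$(printf 'a\nb')"

# Hex dump of what ls writes into the pipe for this single file.
# 61 = a, 0a = newline, 62 = b.
ls_hex=$(ls "$workdir" | od -An -tx1 | tr -d ' \n')
echo "$ls_hex"

rm -rf -- "$workdir"
```

The newline inside the name and the newline ls appends after the name are the same byte, so a line-based reader has no way to know that the first one belongs to the file name.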

Let’s illustrate this in a different way:

ls | xargs -I{} echo '{}|'

Demonstrating how xargs treats the newline in the file name as a separator and splits the data on it

It is obvious: xargs processes the input as two separate lines, splitting the original file name in half! Even if we worked around this problem with some fancy parsing in sed, we would soon run into other problems once we started using other special characters, such as spaces, backslashes, quotes and more!
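The split can also be confirmed by counting. This is a small sketch under the same assumptions as before (a throwaway mktemp -d directory, hypothetical variable names, and an ls that prints names unmodified into a pipe); it is not taken from the original walk-through.

```shell
# Scratch directory containing exactly one file: a, newline, b.
workdir=$(mktemp -d)
touch -- "$workdir/$(printf 'a\nb')"

# Line-based counting reports two entries for the one file.
ls_lines=$(ls "$workdir" | wc -l | tr -d ' ')

# xargs splits on the newline as well, so echo runs once per fragment.
xargs_items=$(ls "$workdir" | xargs -n 1 echo | wc -l | tr -d ' ')

echo "files: 1, lines from ls: $ls_lines, items seen by xargs: $xargs_items"

rm -rf -- "$workdir"
```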

touch 'a
b'
touch 'a b'
touch 'a\b'
touch 'a"b'
touch "a'b"

All kinds of special characters in file names

Even if you are an experienced Bash developer, you may shudder at seeing such file names, as most common Bash tools would have great difficulty handling them correctly. You would have to do all kinds of string manipulation to make this work. Unless you have the secret recipe.

Before we get into that, there is one more thing, a must-know, that you might run into when parsing ls output. If you use color coding for directory listings, which is enabled by default on Ubuntu, it is easy to run into a different set of ls parsing problems.

These are not really related to the names of the files, but rather to how the files are presented in the output of ls. The ls output then contains escape codes that tell your terminal which color to use.

To avoid running into these, just pass --color=never as an option to ls:
ls --color=never

In Mint 20 (a great Ubuntu-derived operating system) this problem appears to be resolved, although the problem may still be present in many other or older versions of Ubuntu etc. I didn’t see this problem on Ubuntu until mid-August 2020.

Even if you do not color-code your directory listings, your script may run on other systems that you neither own nor control. In that case you will also want to use this option, so that users of such a machine do not run into the problem described above.
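As a rough illustration of what those color codes look like to a script, the sketch below compares forced color output with --color=never. It assumes GNU ls (the --color option is not part of POSIX), a scratch directory from mktemp -d, and an explicitly set LS_COLORS value that is there only to make the demonstration predictable; none of these details come from the original article.

```shell
# Scratch directory holding a single subdirectory (directories are
# normally among the entries ls colors).
workdir=$(mktemp -d)
mkdir -- "$workdir/subdir"

# Force color on, as a script might encounter on a system it does not control.
# LS_COLORS is set here only so the demo does not depend on the environment.
colored=$(LS_COLORS='di=01;34' ls --color=always "$workdir" | od -An -c)

# With --color=never only the name itself is printed.
plain=$(ls --color=never "$workdir")

echo "colored output as bytes: $colored"
echo "plain output:            $plain"

rm -rf -- "$workdir"
```

Any escape bytes in the first dump would become part of the "file name" if that output were parsed, which is why the option matters for scripts.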

Let’s go back to our secret recipe and see how to make sure we don’t have trouble with special characters in Bash filenames. The solution provided avoids any use of ls, which one could well avoid in general, so the color coding problems don’t apply either.

There are still times when ls parsing is quick and convenient, but it will always be tricky and probably ‘dirty’ once special characters are introduced – not to mention unsafe (special characters can be used to introduce all kinds of problems).

The secret recipe: NULL termination

The developers of the Bash tools realized this same problem many years ago and provided us with a solution: NULL termination!

What is NULL termination, you ask? Consider how, in the examples above, the newline (literally Enter) was the main terminating character.

We’ve also seen how special characters such as quotes, spaces, and backslashes can be used in file names, even though they have special functions when it comes to other Bash text parsing and editing tools, such as sed. Now compare this with the -0 option to xargs, from man xargs:

-0, --null Input items are terminated by a null character instead of by whitespace, and the quotes and backslash are not special (every character is taken literally). Disables the end of file string, which is treated like any other argument. Useful when input items might contain white space, quote marks, or backslashes. The GNU find -print0 option produces input suitable for this mode.
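The behavior described in this excerpt can be tried without touching the file system. The sketch below feeds three hand-made NULL-terminated items to xargs -0; the items and variable names are invented for illustration.

```shell
# Three NULL-terminated items: one containing a newline, one containing a
# space, and one containing a backslash and a double quote.
out=$(printf 'a\nb\0c d\0e\\f"g\0' | xargs -0 -n 1 printf '[%s]\n')

# Each item is wrapped in brackets, so the item boundaries are visible.
printf '%s\n' "$out"

# Count the items by counting the opening brackets at line start.
item_count=$(printf '%s\n' "$out" | grep -c '^\[')
echo "items: $item_count"
```

Although the first item spans two lines of output, xargs hands it to the command as a single argument, and the space, backslash and quote in the other items are passed through untouched.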

And the -print0 option to find, from man find:

-print0 True; print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output. This option corresponds to the -0 option of xargs.

The "True;" here means that the -print0 action always evaluates to true, so it never filters out a file. Also interesting are the two clear warnings given elsewhere in the same manual:

  • If you are piping the output of find into another program and there is the faintest possibility that the files you are searching for might contain a newline, you should seriously consider using the -print0 option instead of -print. See the UNUSUAL FILENAMES section for information about how unusual characters in file names are handled.
  • If you are using find in a script or in a situation where the matching files might have arbitrary names, you should consider using -print0 instead of -print.
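The difference between -print and -print0 that these warnings describe can be made concrete by counting files both ways. This is a minimal sketch, assuming a scratch directory from mktemp -d and illustrative variable names.

```shell
# Scratch directory with two files: one with a newline in its name,
# one with a space.
workdir=$(mktemp -d)
touch -- "$workdir/$(printf 'a\nb')" "$workdir/a b"

# Newline-terminated output over-counts: the embedded newline adds a line.
by_lines=$(find "$workdir" -type f -print | wc -l | tr -d ' ')

# NUL-terminated output: keep only the NUL bytes and count them.
by_nuls=$(find "$workdir" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')

echo "counted via -print: $by_lines, counted via -print0: $by_nuls"

rm -rf -- "$workdir"
```

A NULL byte cannot appear inside a file name, so counting terminators in the -print0 output gives the true number of files.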

These clear warnings remind us that parsing file names in Bash can be, and is, tricky. With the right options to find, namely -print0, and to xargs, namely -0, all our file names containing special characters can be parsed correctly:

find . -name 'a*' -print0 
find . -name 'a*' -print0 | xargs -0 ls
find . -name 'a*' -print0 | xargs -0 rm

The solution: find -print0 and xargs -0

First we check our directory listing. All our file names with special characters are present. We then run a simple find ... -print0 to see the output. We note that the strings are NULL terminated (the NULL, or \0, character itself is not visible).

We also note that there is a single newline in the output, which matches the single newline we entered in the first file name, consisting of a followed by Enter followed by b.

Finally, the output does not end with a newline before the $ terminal prompt returns, as the strings were NULL terminated rather than newline terminated. We press Enter at the $ terminal prompt to make things a bit clearer.

Next we add xargs with the -0 option, which lets xargs handle the NULL-terminated input correctly. We see that the input passed to ls, and the output received from it, looks clean, with no garbled text.

Finally we try our rm command again, this time on all files, including the original one with the newline that gave us problems. The rm works perfectly, and no errors or parsing problems are observed. Great!
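The whole sequence can be reproduced safely in a throwaway directory. The sketch below is one possible way to do so, not the exact session from the screenshots: the mktemp -d directory, the count_files helper and the added -type f test are assumptions made for illustration.

```shell
# Scratch directory so the rm below cannot touch real files.
workdir=$(mktemp -d)

# Recreate the awkward names: newline, space, backslash, double quote,
# single quote.
touch -- "$workdir/$(printf 'a\nb')" "$workdir/a b" "$workdir/a\\b" \
         "$workdir/a\"b" "$workdir/a'b"

# Hypothetical helper: count regular files by counting NUL terminators.
count_files() {
  find "$workdir" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' '
}

before=$(count_files)

# The recipe from the article: NUL-terminated names from find, read by xargs -0.
# The -- stops rm from treating any name as an option.
find "$workdir" -name 'a*' -type f -print0 | xargs -0 rm --

after=$(count_files)
echo "files before: $before, files after: $after"

rm -rf -- "$workdir"
```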

Wrapping up

We have seen how important it is, in many cases, to parse and handle file names properly in Bash. While learning to use find correctly is a bit more challenging than simply using ls, the benefits it provides pay off in the end: increased security and no problems with special characters.

If you liked this article, you may also want to read How to Bulk Rename Files to Numeric File Names in Linux, which features an interesting and somewhat complex find -print0 | xargs -0 statement. Enjoy!
