Linux.com

Feature: Shell & CLI

Patterns and string processing in shell scripts

By Peter Seebach on December 26, 2008 (2:00:00 PM)

Shell programming is heavily dependent on string processing. The term string is used generically to refer to any sequence of characters; typical examples of strings might be a line of input or a single argument to a command. Users enter responses to prompts, file names are generated, and commands produce output. Recurring throughout this is the need to determine whether a given string conforms to a given pattern; this process is called pattern matching. The shell has a fair amount of built-in pattern matching functionality.

This article is excerpted from the newly published book Beginning Portable Shell Scripting.

Furthermore, many common Unix utilities, such as grep and sed, provide features for pattern matching. These programs usually use a more powerful kind of pattern matching, called regular expressions. Regular expressions, while different from shell patterns, are crucial to most effective shell scripting. While there is no portable regular expression support built into the shell itself, shell programs rely heavily on external utilities, many of which use regular expressions.

Shell patterns

Shell patterns are used in a number of contexts. The most common usage is in the case statement. Given two shell variables string and pattern, the following code determines whether string matches pattern:

    case $string in
      $pattern) echo "Match" ;;
      *) echo "No match" ;;
    esac

If $string matches $pattern, the shell echoes "Match" and leaves the case statement. Otherwise, it checks to see whether $string matches *. Since * matches anything in a shell pattern, the shell prints "No match" when there was not a match against $pattern. (The case statement executes only one branch, even if more than one pattern matches.)
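The one-branch rule can be seen in a short sketch. The function name classify and its messages are assumptions made for this illustration, not part of any standard tool. A string beginning with hello matches both patterns, but only the first matching branch runs.

```shell
#!/bin/sh
# classify is a hypothetical helper used only for this illustration.
classify() {
    case $1 in
        hello*) echo "greeting" ;;        # tried first
        *)      echo "something else" ;;  # would also match, but is reached only if nothing above matched
    esac
}

classify "hello, there"   # prints: greeting
classify "goodbye"        # prints: something else
```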

For exploring pattern matching, you might find it useful to create a shell script based on this. The following self-contained script performs matching tests of a number of words against a pattern:

    #!/bin/sh
    pattern="$1"
    shift
    echo "Matching against '$pattern':"
    for string
    do
      case $string in
        $pattern) echo "$string: Match." ;;
        *) echo "$string: No match." ;;
      esac
    done

Save this script to a file named pattern, make it executable (chmod a+x pattern), and you can use it to perform your own tests:

    $ ./pattern '*' 'hello'
    Matching against '*':
    hello: Match.
    $ ./pattern 'hello*' 'hello' 'hello, there' 'well, hello'
    Matching against 'hello*':
    hello: Match.
    hello, there: Match.
    well, hello: No match.

Remember to use single quotes around the arguments. An unquoted word containing pattern characters such as the asterisk (*) is subject to globbing (sometimes called file name expansion), where the shell replaces such words with any files with names matching the pattern. This can produce misleading results for tests like this.
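To see the effect of globbing for yourself, something like the following sketch may help. It assumes the mktemp utility is available (it is common but not guaranteed everywhere), and the file names hello1 and hello2 and the helper count_args are made up for the demonstration.

```shell
#!/bin/sh
# Sketch only: the scratch directory and the file names hello1 and hello2
# are assumptions made for this demonstration.
oldpwd=$(pwd)
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
touch hello1 hello2

# count_args is a hypothetical helper that reports how many arguments it received.
count_args() {
    echo "$#"
}

unquoted=$(count_args hello*)     # the shell expands hello* to hello1 hello2
quoted=$(count_args 'hello*')     # the quotes keep hello* as a single literal word

echo "unquoted: $unquoted argument(s)"   # prints: unquoted: 2 argument(s)
echo "quoted: $quoted argument(s)"       # prints: quoted: 1 argument(s)

# Clean up the scratch directory.
cd "$oldpwd" || exit 1
rm -r "$dir"
```

The unquoted word never reaches the command as a pattern at all; the command sees the matching file names instead.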

Pattern-matching basics

In a pattern, most characters match themselves, and only themselves. The word hello is a perfectly valid pattern; it matches the word hello, and nothing else. A pattern that matches only part of a string is not considered to have matched that string. The word hello does not match the text hello, world. For a pattern to match a string, two things must be true:

    Every character in the pattern must match the string.
    Every character in the string must match the pattern.

Now, if this were all there were to patterns, a pattern would be another way of describing string comparison, and the rest of this chapter would consist of filler text like "a ... consists of sequences of nonblank characters separated by blanks," or possibly some wonderful cookie recipes. Sadly, this is not so. Instead, there are some characters in a pattern that have special meaning and can match something other than themselves. Characters that have special meaning in a pattern are called wildcards or metacharacters. Some users prefer to restrict the term wildcard to refer only to the special characters that can match anything. In talking about patterns, I prefer to call them all wildcards to avoid confusion with characters that have special meaning to the shell. Wildcards make those two simple rules much more complicated; a single character in a pattern could match a very long string, or a group of characters in the pattern might match only one character or even none at all. What matters is that there are no mismatches and nothing left over of the string after the match.

The most common wildcards are the question mark (?), which matches any character, and the asterisk (*), which matches anything at all, even an empty string.

The ? is easy to use in patterns; you use it when you know there will be exactly one character, but you are not sure exactly what it will be. For instance, if you are not sure what accent the user will greet you in, you might use the pattern h?llo, in case your user prefers to write hallo or hullo. This leaves you with two problems. The first is that users are typically verbose, and write things like hello, there, or hello little computer, or possibly even hello how do i send email. If you just want to verify that you are getting something that sounds a bit like a greeting, you need a way to say "this, or this plus any other stuff on the end."

That is what * is for. Because * matches anything, the pattern hello* matches anything starting with hello, or even just hello with nothing after it. However, that pattern doesn't match the string well, hello because there is nothing in the pattern that can match characters before the word hello. A common idiom when you want to match a word if it is present at all is to use asterisks on both sides of a pattern: *hello* matches a broad range of greetings.
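The behavior of ? and * described so far can be checked with a small sketch. The helper name matches is an assumption for this example; it simply wraps a case statement and prints yes or no.

```shell
#!/bin/sh
# matches is a hypothetical helper: it prints "yes" if the string in $2
# matches the shell pattern in $1, and "no" otherwise.
matches() {
    case $2 in
        $1) echo yes ;;
        *)  echo no ;;
    esac
}

matches 'h?llo'   'hallo'          # yes: ? matches exactly one character
matches 'h?llo'   'hello, there'   # no: characters are left over
matches 'hello*'  'hello, there'   # yes: * matches the rest
matches 'hello*'  'hello'          # yes: * may match nothing
matches 'hello*'  'well, hello'    # no: nothing matches the text before hello
matches '*hello*' 'well, hello'    # yes: asterisks on both sides
```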

If you want to match something, but you are not sure what it is or how long it will be, you can combine these. The pattern hello ?* matches hello world but does not match hello alone. However, this pattern introduces a new problem. The space character is not special in a pattern, but it is special in the shell. This leads to a bit of a dilemma. If you do not quote the pattern, the shell splits it into multiple words, and it does not match what you expected. If you do quote it, the shell ignores the wildcards. There are two solutions available; the first is to quote spaces, the second is to unquote wildcards. So, you could write hello" "?*, or you could write "hello "?*.
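A brief sketch of the quoting choices follows. The function names are assumptions made for this illustration. The first function quotes only the space; the second quotes the whole pattern, which turns the wildcards into ordinary characters.

```shell
#!/bin/sh
# Hypothetical helper: only the space is quoted, so ? and * stay special.
check_quoted_space() {
    case $1 in
        hello" "?*) echo "match" ;;
        *)          echo "no match" ;;
    esac
}

# Hypothetical helper: the whole pattern is quoted, so ? and * are literal.
check_all_quoted() {
    case $1 in
        "hello ?*") echo "match" ;;
        *)          echo "no match" ;;
    esac
}

check_quoted_space 'hello world'   # match
check_quoted_space 'hello'         # no match: nothing follows hello
check_all_quoted   'hello world'   # no match: the wildcards were quoted
check_all_quoted   'hello ?*'      # match: only this literal text matches
```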

In the contexts where the shell performs pattern matching (such as case statements), you do not need to worry about spaces resulting from variable substitution; the shell doesn't perform splitting on variable substitutions in those contexts. (A disclaimer is in order: zsh's behavior differs here, unless it is running in sh emulation mode.)
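As a minimal sketch of this point, assuming a POSIX-style sh rather than zsh in its native mode, a pattern containing a space can be stored in a variable and used unquoted in a case statement. The variable names here are only examples.

```shell
#!/bin/sh
# Both values contain spaces; neither expansion is split into words
# inside the case statement.
string='hello little computer'
pattern='hello ?*'

case $string in
    $pattern) result="Match" ;;
    *)        result="No match" ;;
esac

echo "$result"   # prints: Match
```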

Character classes

The h?llo pattern has another flaw, which is that it is too permissive. While your friends who type with a thick accent will doubtless appreciate your consideration, you might reasonably draw the line at hzllo, h!llo, or hXllo. The shell provides a mechanism for more restrictive matches, called a character class. A character class matches any one of a set of characters, but nothing else; it is like ?, only more restrictive. A character class is surrounded in square brackets ([]), and looks like [characters]. The greeting described previously could be written using a character class as h[aeu]llo. A character class matches exactly one of the characters in it; it never matches more than one character.

Character classes may specify ranges of characters. A typical usage would be to match any digit, with [0-9]. In a range, two characters separated by a hyphen are treated as every character between them in the character set; mostly, this is used for letters and numbers. Patterns are case sensitive; if you want to match all standard ASCII letters, use [a-zA-Z]. The behavior of a range where the second character comes before the first in the character set is not predictable; do not do that.

Sometimes, rather than knowing what you do want, you know what you don't want; you can invert a character class by using an exclamation mark (!) as its first character. The character class [!0-9] matches any character that is not a digit. When a character class is inverted, it matches any character not in the range, not just any reasonable or common character; if you write [!aeiou] hoping to get consonants, you will also match punctuation or control characters.

Wildcards do not have special meaning in a character class; [?*] matches a question mark or an asterisk, but not anything else.
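The following sketch exercises the character class rules described above. The helper name matches is an assumption for this example; it wraps a case statement and prints yes or no.

```shell
#!/bin/sh
# matches is a hypothetical helper: it prints "yes" if the string in $2
# matches the shell pattern in $1, and "no" otherwise.
matches() {
    case $2 in
        $1) echo yes ;;
        *)  echo no ;;
    esac
}

matches 'h[aeu]llo' 'hullo'   # yes: u is in the class
matches 'h[aeu]llo' 'hzllo'   # no: z is not in the class
matches '[0-9]'     '7'       # yes: a single digit
matches '[0-9]'     '42'      # no: a class matches exactly one character
matches '[!0-9]'    'x'       # yes: x is not a digit
matches '[!aeiou]'  '!'       # yes: punctuation is "not a vowel" too
matches '[?*]'      '?'       # yes: wildcards are literal inside a class
matches '[?*]'      'a'       # no
```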

Character classes are one of the most complicated aspects of shell pattern matching. Left and right square brackets ([]), hyphens (-), and exclamation marks (!) are all special to them. A hyphen can easily be included in a class by specifying it as the last character of the class, with no following character. An exclamation mark can be included by specifying it as any character but the first. (What if there are no other characters? Then you are specifying only one character and probably don't need a character class.) The left bracket is actually easy; include it anywhere, it won't matter. The right bracket (]) is special; if you want a right bracket, put it either at the very beginning of the list or immediately after the ! for a negated class. Otherwise, the shell might think that the right bracket was intended to close the character class. Even apart from the intended feature set, be aware that some shells have plain and simple bugs having to do with right brackets in character classes; avoid them if you can.

If you want to match any left or right bracket, exclamation mark, or hyphen, but no other characters, here is a way to do it:

[][!-]

The first left bracket begins the definition of the class. The first right bracket does not close the class because there is nothing in it yet; it is taken as a plain literal right bracket. The second left bracket and the exclamation mark have no special meaning; neither is in a position where it would have any. Finally, the hyphen is not between two other characters in the class because the right square bracket ends the definition of the character class, so the hyphen must be a plain character.
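A quick way to try this class is a sketch like the one below. The function name is_special is an assumption for this example, and the results shown are what a correctly behaving shell should produce; as noted above, some shells have bugs in this area.

```shell
#!/bin/sh
# is_special is a hypothetical helper: it prints "yes" if its argument is
# exactly one left bracket, right bracket, exclamation mark, or hyphen.
is_special() {
    case $1 in
        [][!-]) echo yes ;;
        *)      echo no ;;
    esac
}

is_special ']'    # yes
is_special '['    # yes
is_special '!'    # yes
is_special '-'    # yes
is_special 'a'    # no
is_special ']]'   # no: a class matches only one character
```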

Many users have the habit of using a caret (^) instead of ! in shell character classes. This is not portable, but it is a common extension some shells offer because habitual users of regular expressions may be more used to it. This can create an occasional surprise if you have never seen it used, and want to match a caret in a class.

Table 2-1 explains the behavior of a number of characters that may have special meaning within a character class, as well as how to include them literally in a class when you want to.

Table 2-1. Special Characters in Character Classes

Character  Meaning             Portability  How to Include It
]          End of class        Universal    Put at the beginning of the class (or first after the negation character)
[          Beginning of class  Universal    Put it anywhere in the class
^          Inversion           Common       Put after some other character
!          Inversion           Universal    Put after some other character
-          Range               Universal    Put at the beginning or end of the class

Ranges have an additional portability problem that is often overlooked, especially by English speakers. There is no guarantee that the range [a-z] matches every lowercase letter, and strictly speaking there is not even a guarantee that it matches only lowercase letters. The problem is that most people assume the ASCII character set, which defines only unaccented characters. In ASCII, the uppercase letters are contiguous, and the lowercase letters are also contiguous (but there are characters between them; [A-z] matches a few punctuation characters). However, there are Unix-like systems on which either or both of these assumptions may be wrong. In practice, it is very nearly portable to assume that [a-z] matches 26 lowercase letters. However, accented variants of lowercase letters do not match this pattern. There is no generally portable way to match additional characters, or even to find out what they are. Scripts may be run in different environments with different character sets.

Some shells also support additional character class notations; these were introduced by POSIX but so far are rare outside of ksh (not pdksh) and bash. The notation is [[:class:]], where class is a word like digit, alpha, or punct. This matches any character for which the corresponding C isclass() function (isdigit(), isalpha(), ispunct(), and so on) would return true. For example, [[:digit:]] is equivalent to [0-9]. These classes may be combined with other characters; [[:digit:][:alpha:]_] matches any letter or number or an underscore (_). Additional similar rules use [.name.] to match a special collating symbol (for instance, some languages might have a special rule for matching and sorting certain combinations of letters, so a ch might sort differently from a c followed by an h) and [=name=] to match equivalence classes, such as a lowercase letter and any accented variant of it. These rules are particularly useful for internationalized scripts but not sufficiently widely available to be used in portable scripts yet. To avoid any possible misunderstandings, avoid using a left bracket followed immediately by a period (.), equals sign (=), or colon (:) in a character class. Note that this applies only to a left bracket within the character class, not the initial bracket that opens the class; [.] matches a period. (This is more significant in regular expressions, where a period would otherwise have special meaning.)

Character classes are, as you can see, substantially more complicated than the rest of the shell pattern matching rules.

Shell patterns are quite powerful, but they have a number of limitations. There is no way to specify repetition of a character class; no shell pattern matches an arbitrary number of digits. You can't make part of a pattern optional; the closest you get to optional components is the asterisk.
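Although no single pattern means "one or more digits," a common workaround, shown here as a sketch rather than as something from the text above, is to turn the question around: reject the empty string and any string containing a non-digit, and accept whatever is left. The function name is_all_digits is an assumption for this example.

```shell
#!/bin/sh
# is_all_digits is a hypothetical helper: it prints "yes" if its argument
# is a nonempty string consisting only of the characters 0 through 9.
is_all_digits() {
    case $1 in
        '')       echo no ;;    # empty string: no digits at all
        *[!0-9]*) echo no ;;    # contains at least one non-digit
        *)        echo yes ;;   # everything left is digits only
    esac
}

is_all_digits 12345   # yes
is_all_digits 12a45   # no
is_all_digits ''      # no
```

This works because an inverted class matches any character outside the range, so one stray character anywhere in the string is enough to trigger the second branch.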

Patterns as a whole generally match as much as they can; this is called being greedy. However, if matching too many things with an asterisk prevents a match, the asterisk gives up the extra characters and lets other pattern components match them. If you match the pattern b* to the string banana, the * matches the text anana. However, if you use the pattern b*na, the * matches only the text ana. The rule is that the * grabs the largest number of characters it can without preventing a match. Other pattern components, such as character classes, literal characters, or question marks, get first priority on consuming characters, and the asterisk gets what's left.

Some of the limitations of shell patterns can be overcome by creative usage. One way to store lists of items in the shell is to have multiple items joined with a delimiter; for instance, you might store the value a,b,c to represent a list of three items. The following example code illustrates how such a list might be used. (The case statement, used here, executes code when a pattern matches a given string.)

    list=orange,apple,banana
    case $list in
      *apple*) echo "How do you like them apples?";;
    esac

    How do you like them apples?

This script has a subtle bug, however. It does not check for exact matches. If you try to check against a slightly different list, the problem becomes obvious:

    list=orange,crabapple,banana
    case $list in
      *apple*) echo "How do you like them apples?";;
    esac

    How do you like them apples?

The problem is that the asterisks can match anything, even the commas used as delimiters. However, if you add the delimiters to the pattern, you can no longer match the ends of the list:

    list=orange,apple,banana
    case $list in
      *,orange,*) echo "The only fruit for which there is no Cockney slang.";;
    esac

    [no output]

To resolve this, wrap the list in an extra set of delimiters when expanding it:

    list=orange,apple,banana
    case ,$list, in
      *,orange,*) echo "The only fruit for which there is no Cockney slang.";;
    esac

    The only fruit for which there is no Cockney slang.

The expansion of $list now has a comma appended to each end, ensuring that every member of the list has a comma on both sides of it.
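The delimiter-wrapping idiom can be packaged as a small function. This is a sketch only; the name list_contains and its argument order are assumptions for this example. Quoting the item makes it match literally, even if it happens to contain wildcard characters.

```shell
#!/bin/sh
# list_contains is a hypothetical helper: it succeeds (returns 0) if the
# comma-separated list in $1 contains the item in $2 as a whole member.
list_contains() {
    case ,$1, in
        *,"$2",*) return 0 ;;
        *)        return 1 ;;
    esac
}

list=orange,crabapple,banana

if list_contains "$list" apple; then
    echo "apple: found"
else
    echo "apple: not found"       # this branch runs: crabapple is not apple
fi

if list_contains "$list" orange; then
    echo "orange: found"          # this branch runs: orange is the first member
else
    echo "orange: not found"
fi
```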

Comments

on Patterns and string processing in shell scripts

Note: Comments are owned by the poster. We are not responsible for their content.

Why do people still write shell scripts?

Posted by: Anonymous [ip: 82.192.250.149] on December 26, 2008 08:29 PM
Why do people still write shell scripts, when a Perl script is easier to write and easier to read?

Maybe for installation scripts there is still a case for shell scripts - but nowhere else.

#

Re: Why do people still write shell scripts?

Posted by: Anonymous [ip: 81.109.203.77] on December 26, 2008 09:31 PM
Why do people still write perl scripts, when a shell scripting is easier to write and easier to read?

#

Re(1): Why do people still write shell scripts?

Posted by: Anonymous [ip: 71.32.249.43] on December 27, 2008 08:31 AM
Because Perl, Python and Ruby are better at string processing than shell scripts.

They have built-in regular expressions, built-in ranges (similar to seq 0 10), can iterate over matched patterns/lines/characters and allow you to do more complex programming (such as OOP or semi-functional) if you need to.

#

Re(2): Why do people still write shell scripts?

Posted by: Johannes Truschnigg on December 28, 2008 12:36 PM
GNU bash features regex and ranges, too.

Personally, I'm just much more familiar with my shell than with any of the available scripting languages, because I interact with it on a daily basis, to do so many different things.
I'd probably spend more time with perl, python or ruby if they could also act as a comfortable, interactive shell for me.

#

Re(1): Why do people still write shell scripts?

Posted by: Anonymous [ip: 83.99.169.132] on December 27, 2008 07:17 PM
because:
eval{
open(blhablahblah)
read(blhablahblah)
...etc
} or die(now wee catch error $!)

is impossible in shell scripts.

#

Re(2): Why do people still write shell scripts?

Posted by: Anonymous [ip: 64.81.141.153] on December 29, 2008 08:37 AM
(
set -e
blah
boo
bar
) || echo "error"
# There. I fixed that for you.

#

Re: Why do people still write perl scripts, when a shell scripting is easier to write and easier to

Posted by: Anonymous [ip: 69.249.21.116] on December 29, 2008 02:40 AM
+1

#

Re(1): Why do people still write shell scripts?

Posted by: Anonymous [ip: 202.83.42.27] on December 29, 2008 12:01 PM
How do you know if perl, python interpreters are there by default in the target system..in most gnu/linux systems you can be sure of a bash or a Bourne shell at the least. Why the hell would you or I write small day to day scripts in python when bash could easily accomplish that in a couple of lines. I eagerly read a book called "Python for Unix and Linux System Administration"..and I saw nothing compelling to do things differently. I totally agree that there is a case for python in production environment to keep things clean but perl is a no go..as far as readability is concerned.

#

Re: Why do people still write shell scripts?

Posted by: Anonymous [ip: 74.212.28.172] on December 26, 2008 10:48 PM
Why not write shell scripts. To use a Perlism, TIMTOWTDI. The goal is usually getting a job done with the resources available.

Speaking for myself (I began using UNIX V3 while working for Western Electric in the early 70s), UNIX, the Bourne shell, sed, and awk allowed one to do a hell of a lot of useful work with amazingly little effort long before Perl came along. The shell, sed, awk combo (early 70s) preceded Perl (late 80s) by about 15 years. Prior to Perl, if you needed more speed you redid your shell work in C. When Larry Wall's Perl came along, IMHO, in the *NIX world it added a middle ground between shell and C - both Perl's execution speed and programming effort are greater than shell and less than C.

Having said all that, the reason I use Perl (in addition to shell, sed, awk, and their numerous *NIX provided friends) is CPAN - a truly awesome resource that has no real competition.

#

Re: Why do people still write shell scripts?

Posted by: Anonymous [ip: 79.150.60.129] on December 26, 2008 11:00 PM
Why do people still write perl scripts, when inserting needles under your own fingernails is so much more pleasurable?

#

Why do people still write books about shell scripting?

Posted by: Anonymous [ip: 71.136.225.18] on December 27, 2008 01:37 AM
When this one:

http://tldp.org/LDP/abs/html/

has been available for free -as in beer- for years.

#

Patterns and string processing in shell scripts

Posted by: PerlCoder on December 27, 2008 03:00 AM
For many activities, Perl is easier to write and more portable. However, I would think that if you knew you were going to be passing a lot of data back and forth between *nix programs on a *nix system, you would want shell scripts. With the backticks, quick expansion, easy access to shell variables, and i/o operators, a shell script would probably be better suited for such a situation.

If you're planning to write an advanced text filter, I would think a Perl script would be better.

#

Patterns and string processing in shell scripts

Posted by: Anonymous [ip: 85.29.99.245] on December 27, 2008 08:39 AM
Shell scripts can be quite powerful. Source-based distros Lunar and SourceMage have an advanced package manager that is written in Bash.

#

Patterns and string processing in shell scripts

Posted by: Anonymous [ip: 80.101.93.201] on December 27, 2008 10:22 AM
Use shell scripts (I prefer ash) to automate your command-line usage. If you need to take this a bit further, use Tcl as a gluing language. For advanced pattern matching (and e.g. on a lot of files), use Perl, but this language is less appropriate as a simple 'glue'.

#

Re: Patterns and string processing in shell scripts

Posted by: Anonymous [ip: 62.85.117.242] on December 30, 2008 10:25 AM
Been using Tcl + sh for past 2 years. A good combo. Also I've noticed that Tcl works for me better than sh since I tend to make fewer bugs in Tcl. I've had lots of code that works just after first draft! With sh there is usually tedious debugging involved. I guess there is something clean about Tcl syntax compared to sh.

#

Patterns and string processing in shell scripts

Posted by: Anonymous [ip: 83.241.11.135] on December 27, 2008 04:47 PM
Use right tool for right jom

#

Patterns and string processing in shell scripts

Posted by: Anonymous [ip: 83.241.11.135] on December 27, 2008 04:48 PM
I mean:
Use right tool for right job

sorry for error

#

Shell tools for LDAP?

Posted by: Anonymous [ip: 66.31.67.63] on December 30, 2008 11:34 PM
Anyone know of any good tools for working with LDAP from shell scripts? I have been using ldif files and running sed -e ... | ldapadd .. but it is very clunky. I'm looking for something that's cleaner and easier to work with.

#

This story has been archived. Comments can no longer be posted.



 