How do you actually use Regex? – CloudSavvy IT



Regex, short for regular expression, is often used in programming languages for matching patterns in strings, find and replace, input validation, and reformatting text. Learning how to use Regex properly can make working with text much easier.

Regex Syntax, Explained

Regex has a reputation for having terrible syntax, but it is much easier to write than to read. For example, here's a general regex for an RFC 5322 compliant email validator:

  (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

If it looks like someone smashed their face into the keyboard, you're not alone. But under the hood, all of this mess is actually programming a finite-state machine. This machine runs for each character, chugging along and matching based on rules you've set. Plenty of online tools will render railroad diagrams showing how your Regex machine works. Here's that same Regex in visual form:

Still very confusing, but it's a lot more understandable. It's a machine with moving parts, governed by rules that define how it all fits together. You can see how someone assembled this; it's not just a big glob of text.
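To make the state-machine idea concrete, here is a minimal hand-built sketch in JavaScript for the tiny pattern ab+c. The transition table and the runMachine function are made up for illustration; real Regex engines are far more elaborate, and many rely on backtracking rather than a simple table like this.

```javascript
// Hypothetical transition table for the pattern ab+c.
// Each state maps an input character to the next state.
const transitions = {
  0: { a: 1 },        // expect "a"
  1: { b: 2 },        // expect at least one "b"
  2: { b: 2, c: 3 },  // loop on extra "b"s, or finish with "c"
};
const ACCEPT_STATE = 3;

// Feed the input to the machine one character at a time.
function runMachine(input) {
  let state = 0;
  for (const ch of input) {
    const row = transitions[state];
    const next = row ? row[ch] : undefined;
    if (next === undefined) return false; // no rule for this character
    state = next;
  }
  return state === ACCEPT_STATE;
}

console.log(runMachine("abbbc")); // true
console.log(runMachine("ac"));    // false
```

The loop from state 2 back to itself plays the same role as the loop a railroad diagram draws for the + quantifier.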

First: Use a Regex Debugger

Before we begin, unless your Regex is particularly short or you're particularly proficient, you should use an online debugger when writing and testing it. It makes understanding the syntax much easier. We recommend Regex101 and RegExr, both of which offer testing and a built-in syntax reference.

How Does Regex Work?

For now, let's focus on something much simpler. This is a diagram from Regulex for a very short (and definitely not RFC 5322 compliant) email-matching Regex:

The Regex engine starts at the left and travels down the lines, matching characters as it goes. Group #1 matches any character except a line break, and continues to match characters until the next block finds a match. In this case, it stops when it reaches an @ symbol, which means Group #1 captures the name of the email address and everything after it matches the domain.

The Regex that defines Group # 1 in our email example is:

  (.+)

The parentheses define a capture group, which tells the Regex engine to store the contents of this group's match in a special variable. When you run a Regex on a string, the default return is the full match (in this case, the entire email). But it also returns each capture group, which makes this Regex useful for extracting names from emails.

The dot is the symbol for "Any character except Newline". This matches everything on a line, so if you pass this email Regex an address like:

  %$#^&%*#%$#^@gmail.com

It matches %$#^&%*#%$#^ as the name, even though that's ridiculous.

The plus sign (+) is a control structure that means "match the preceding character or group one or more times." It ensures that the whole name is matched, not just the first character. This is what creates the loop in the railroad diagram.

The rest of the Regex is quite easy to decipher:

  (.+)@(.+\..+)

The first group stops when it hits the @ symbol. The next group then begins, which again matches multiple characters until it reaches a period character.

Because characters like periods, parentheses, and slashes are used as part of the syntax in Regex, any time you want to match those characters you need to escape them with a backslash. In this example, to match the period, we write \. and the parser treats it as one symbol meaning "match a period."
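As a rough illustration, this is how the short email Regex behaves when run from JavaScript. The address jane.doe@example.com and the variable names are invented for the example; the second address is the silly one from above.

```javascript
// Simplified email pattern from the text: name, "@", then a domain containing a period.
const emailPattern = /(.+)@(.+\..+)/;

const emailMatch = "jane.doe@example.com".match(emailPattern);

const fullMatch = emailMatch[0];   // the entire match
const namePart = emailMatch[1];    // capture group #1
const domainPart = emailMatch[2];  // capture group #2

console.log(fullMatch);   // "jane.doe@example.com"
console.log(namePart);    // "jane.doe"
console.log(domainPart);  // "example.com"

// The dot accepts almost anything, so a nonsense name still matches.
const sillyMatch = "%$#^&%*#%$#^@gmail.com".match(emailPattern);
console.log(sillyMatch[1]); // "%$#^&%*#%$#^"
```

Index 0 of the result holds the full match, and each capture group follows in order.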

Character Matching

If you have non-control characters in your Regex, the Regex engine assumes that those characters form a matching block. For example, the Regex:

  he+llo

matches the word "hello" with any number of e's (at least one). Any control characters that you want to match literally need to be escaped to work properly.

Regex also has character classes, which act as shorthand for a set of characters. These can vary based on the Regex implementation, but these few are standard:

  • . – matches anything except newline.
  • \w – matches any "word" character, including digits and underscores.
  • \d – matches digits.
  • \s – matches whitespace characters (i.e., space, tab, newline).

The last three all have uppercase counterparts that invert their function. For example, \D matches anything that isn't a digit.
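Here is a small sketch of the shorthand classes in JavaScript, which supports \w, \d, \s, and their uppercase inverses. The sample string and variable names are made up for the example.

```javascript
// Made-up sample text containing letters, digits, an underscore, and a tab.
const sampleText = "Order 66 shipped_to Bob\tyesterday";

const firstNumber = sampleText.match(/\d+/)[0];   // first run of digits: "66"
const firstWord = sampleText.match(/\w+/)[0];     // first run of word characters: "Order"
const pieces = sampleText.split(/\s+/);           // split on spaces and tabs alike
const isOneWord = /^\w+$/.test("shipped_to");     // underscores count as word characters

// \D is the inverse of \d, so this strips everything that is not a digit.
const digitsOnly = "a1b22c".replace(/\D/g, "");   // "122"

console.log(firstNumber, firstWord, pieces.length, isOneWord, digitsOnly);
```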

Regex also has character-set matching. For example:

  [abc]

matches either a, b, or c. This acts as one block, and the square brackets are just control structures. Alternatively, you can specify a range of characters:

  [a-c]

Or negate the set, which matches any character that isn't in the set:

  [^a-c] 
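The three set forms can be tried side by side. This is only a sketch in JavaScript, and the test strings are arbitrary.

```javascript
// Each set matches exactly one character per use.
const eachOfAbc = "cab".match(/[abc]/g);              // ["c", "a", "b"]
const rangeHit = /[a-c]/.test("xyz");                 // false: no a, b, or c present
const firstOutside = "bad dog".match(/[^a-c]/)[0];    // "d": first character outside a-c

// Negated sets are handy for stripping unwanted characters.
const keptOnlyAToC = "abcdef".replace(/[^a-c]/g, ""); // "abc"

console.log(eachOfAbc, rangeHit, firstOutside, keptOnlyAToC);
```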

Quantifiers

Quantifiers are an important part of Regex. They let you match strings when you don't know the exact format, but you have a pretty good idea.

The + operator from the email example is a quantifier, specifically the "one or more" quantifier. If we don't know how long a certain string is, but we know that it's made up of alphanumeric characters (and isn't empty), we can write:

  \w+

Besides +, there is also:

  • The * operator, which matches "zero or more." Essentially the same as +, except that it has the option of not finding a match.
  • The ? operator, which matches "zero or one." It makes a character optional; either it's there or it isn't, and it won't match more than once.
  • Numeric quantifiers. These can be a single number like {3}, which means "exactly 3 times," or a range like {3,6}. You can omit the second number to make it unlimited. For example, {3,} means "3 or more times." Oddly enough, you can't omit the first number, so if you want "3 or fewer times," you must spell out the range as {0,3}.
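The quantifiers above can be checked quickly in JavaScript. In this sketch the ^ and $ anchors (not covered in this article) pin each pattern to the whole string so the counts are easy to see; the test strings are arbitrary.

```javascript
// + needs at least one character, * is happy with none.
const plusOnEmpty = /^\w+$/.test("");        // false
const starOnEmpty = /^\w*$/.test("");        // true

// ? makes the preceding character optional.
const optionalU = /^colou?r$/;
const american = optionalU.test("color");    // true
const british = optionalU.test("colour");    // true
const tooMany = optionalU.test("colouur");   // false: ? never matches twice

// Numeric quantifiers.
const exactlyThree = /^\d{3}$/.test("1234");     // false
const threeToSix = /^\d{3,6}$/.test("1234");     // true
const threeOrMore = /^\d{3,}$/.test("12345678"); // true
const upToThree = /^\d{0,3}$/.test("12");        // true

console.log(plusOnEmpty, starOnEmpty, american, british, tooMany);
console.log(exactlyThree, threeToSix, threeOrMore, upToThree);
```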

Greedy and Lazy Quantifiers

Under the hood, the * and + operators are greedy. They match as much as possible and give back only what is needed to start the next block. This can be a massive problem.

Here's an example: say you're trying to match HTML, or anything else with closing brackets. Your input text is:

  <div>Hello World</div>

And you want to match everything inside the angle brackets. You could write something like:

  <.*> 

This is the right idea, but it fails for one crucial reason: the Regex engine matches "div>Hello World</div>" for the sequence .* and then backtracks until the next block matches, in this case a closing bracket (>). You would expect it to backtrack only far enough to match "div" and then repeat to match the closing div. But the backtracker runs from the end of the string and stops on the final bracket, which ends up matching everything between the first < and the last >.

The solution is to make our quantifier lazy, which means it will match as few characters as possible. Under the hood, this starts with the smallest possible match and then expands to fill the space until the next block matches, which can also make it more performant on large Regex operations. Making a quantifier lazy is done by adding a question mark directly after the quantifier. This is a bit confusing because ? is already a quantifier (and is actually greedy by default). For our HTML example, the Regex is fixed with this simple addition:

  <.*?> 

The lazy operator can be applied to any quantifier, including +?, {0,3}?, and even ??. The last one has only a subtle effect: because it matches zero or one characters anyway, there is little room to expand, so it merely makes the engine try zero first.
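The difference between the greedy and lazy versions is easy to see by running both against the same input. This is a small JavaScript sketch using the HTML snippet from the example; the variable names are arbitrary.

```javascript
const html = "<div>Hello World</div>";

// Greedy: .* runs to the end of the string and backtracks to the LAST ">".
const greedyMatch = html.match(/<.*>/)[0];   // "<div>Hello World</div>"

// Lazy: .*? expands only until the FIRST ">" it can reach.
const lazyMatch = html.match(/<.*?>/)[0];    // "<div>"

// With the g flag, the lazy version finds each tag separately.
const everyTag = html.match(/<.*?>/g);       // ["<div>", "</div>"]

console.log(greedyMatch, lazyMatch, everyTag);
```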

Grouping and Lookarounds

Groups in Regex serve many purposes. At a basic level, they join multiple tokens into one block. For example, you can create a group and then apply a quantifier to the whole group:

  ba(na)+

This groups the repeated "na" to match banana, banananana, and so on. Without the group, the quantifier would apply only to the final character, which would be matched over and over.

This type of group, with two simple parentheses, is called a capture group, and its contents are included in the output:

If you want to avoid this and group tokens purely for execution reasons, you can use a non-capturing group:

  ba(?:na)

The question mark (a reserved character) defines a non-standard group, and the following character defines what kind of group it is. Starting groups with a question mark is ideal because otherwise, if you wanted to match colons in a group, you would have to escape them for no good reason. But you always have to escape question marks in Regex.
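A short JavaScript sketch shows what capturing and non-capturing groups each put in the output. A + quantifier is added to both groups here so the two can be compared on the same word; the variable names are arbitrary.

```javascript
// Capturing group: the result holds the full match plus the group's contents.
const capturing = "banana".match(/ba(na)+/);
console.log(capturing[0]);      // "banana"
console.log(capturing[1]);      // "na" (the last repetition captured)
console.log(capturing.length);  // 2

// Non-capturing group: same match, but nothing extra is stored.
const nonCapturing = "banana".match(/ba(?:na)+/);
console.log(nonCapturing[0]);     // "banana"
console.log(nonCapturing.length); // 1

// Without any group, + applies only to the final "a".
const ungrouped = "banana".match(/bana+/)[0];
console.log(ungrouped);           // "bana"
```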

You can also name your groups for convenience when working with the output:

  (?'group')

You can reference these groups in your Regex, which makes them work much like variables. You can reference unnamed groups by number with the token \1, but numbered references quickly become hard to track (and single-digit references stop at \9), after which you will want to start naming groups. The syntax for referencing named groups is:

  \k{group}

This references the results of the named group, which can be dynamic. Essentially, it checks whether the same text occurs multiple times, but doesn't care about the position. For example, this can be used to match all text between three identical words:
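Here is a rough JavaScript version of that idea. Note that JavaScript spells named groups as (?<name>...) and references them with \k<name>, rather than the quote and brace forms shown above. The \b word-boundary token, the sample sentences, and the variable names are additions for the sake of the example.

```javascript
// Named group "word", which must then appear two more times.
const threeWords = /\b(?<word>\w+)\b(.*?)\b\k<word>\b(.*?)\b\k<word>\b/;
const threeMatch = "spam eggs spam ham spam".match(threeWords);

const repeatedWord = threeMatch.groups.word;  // the text the named group captured
const betweenFirst = threeMatch[2].trim();    // text between occurrences 1 and 2
const betweenSecond = threeMatch[3].trim();   // text between occurrences 2 and 3
console.log(repeatedWord, betweenFirst, betweenSecond); // "spam" "eggs" "ham"

// The same trick with a numbered reference: find an accidentally doubled word.
const doubledMatch = "this is is a test".match(/\b(\w+)\s+\1\b/);
console.log(doubledMatch[0]); // "is is"
```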

The group class is where you will find most of Regex's control structures, including lookaheads. Lookaheads require an expression to match, but do not include it in the result. In a way, a lookahead is like an if statement: the overall match fails if it returns false.

The syntax for a positive lookahead is (?=). Here's an example:

This matches the name portion of an email address very cleanly, by stopping execution at the dividing @. Lookaheads don't consume any characters, so if you want to keep running after a lookahead succeeds, you can still match the character used in the lookahead.
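A lookahead of this kind might look like the following in JavaScript. The pattern .+(?=@) and the address are assumptions made for the sketch, not necessarily the exact example the article pictured.

```javascript
const address = "jane@example.com"; // made-up address

// Match everything up to, but not including, the @.
const nameOnly = address.match(/.+(?=@)/)[0];
console.log(nameOnly); // "jane"

// The lookahead consumed nothing, so the @ is still there to be matched.
const nameWithAt = address.match(/.+(?=@)@/)[0];
console.log(nameWithAt); // "jane@"

// If the lookahead cannot succeed, the whole match fails.
const noAtSign = "jane.example.com".match(/.+(?=@)/);
console.log(noAtSign); // null
```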

In addition to positive lookaheads, there are also:

  • (?!) – Negative lookaheads, which require that an expression does not match.
  • (?<=) – Positive lookbehinds, which are not supported everywhere due to some technical limitations. These are placed before the expression you want to match, and they must have a fixed width (i.e., no quantifiers except {number}). In the email example, you could use (?<=@)\w+\.\w+ to match the domain portion of the email.
  • (?<!) – Negative lookbehinds, which are the same as positive lookbehinds, but negated.
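The other three lookarounds can be sketched in JavaScript as well, assuming an engine with lookbehind support (current versions of Node.js and the major browsers have it). The sample strings, the \b and ^/$ tokens, and the variable names are illustrative additions.

```javascript
// Positive lookbehind: match the domain only when an @ sits right before it.
const domainOnly = "jane@example.com".match(/(?<=@)\w+\.\w+/)[0];
console.log(domainOnly); // "example.com"

// Negative lookahead: accept a username only if it does not start with "admin".
const notAdmin = /^(?!admin)\w+$/;
console.log(notAdmin.test("alice"));   // true
console.log(notAdmin.test("admin42")); // false

// Negative lookbehind: find a number that is not preceded by a dollar sign.
const plainNumber = "cost $40 for 12 items".match(/(?<!\$)\b\d+/)[0];
console.log(plainNumber); // "12"
```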

Differences Between Regex Engines

Not all Regex is created equal. Most Regex engines don't follow a specific standard, and some switch things up a little to suit their language. Some features that work in one language may not work in another.

For example, the versions of sed compiled for macOS and FreeBSD do not support using \t to represent a tab character. You have to copy a tab character and paste it into the terminal to use a tab in command-line sed.

Most of this tutorial is compatible with PCRE, the mainstream Regex engine used for PHP. But JavaScript's Regex engine is different: it doesn't support named capture groups with quotation marks (it wants angle brackets) and can't do recursion, among other things. Even PCRE isn't fully compatible across its own versions, and it has many differences from Perl's regex.

There are too many small differences to list here, so you can use this reference table to compare the differences between multiple Regex engines. In addition, Regex debuggers like Regex101 let you switch Regex engines, so make sure you debug with the correct engine.

How to Run Regex

So far we have discussed only the matching portion of regular expressions, which makes up most of what makes a Regex. But when you actually want to run your Regex, you need to form it into a full regular expression.

This usually has the format:

  /match/g

Everything inside the forward slashes is our match. The g is a mode modifier. In this case, it tells the engine not to stop after finding the first match. For find-and-replace Regex, you often need to format it as:

  /find/replace/g

This replaces every match throughout the file. You can use capture group references when replacing, which makes Regex very good at reformatting text. For example, this Regex matches all HTML tags and replaces the standard brackets with square brackets:

  /<(.+?)>/[\1]/g

When this runs, the engine matches <div> and </div>, so you can replace this text (and only this text). As you can see, the inner HTML remains unchanged:
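The same replacement can be tried in JavaScript, which writes the group reference in the replacement string as $1 rather than \1. This sketch also runs the pattern without the g modifier to show what the modifier changes; the variable names are arbitrary.

```javascript
const page = "<div>Hello World</div>";

// With g: every tag is rewritten, and the text between the tags is untouched.
const allTags = page.replace(/<(.+?)>/g, "[$1]");
console.log(allTags);   // "[div]Hello World[/div]"

// Without g: the engine stops after the first match.
const firstTag = page.replace(/<(.+?)>/, "[$1]");
console.log(firstTag);  // "[div]Hello World</div>"
```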

This makes Regex very useful for finding and replacing text. The command-line program to do this is sed, which uses the basic format of:

  sed 's/find/replace/g' file > newfile

This runs on a file and outputs to STDOUT. To change the file on disk, redirect the output to a new file (as shown here) and then move it over the original; redirecting straight back into the input file would truncate it before sed has a chance to read it.

Regex is also supported in many text editors and can really speed up your workflow when performing batch operations. Vim, Atom and VS Code all have built-in Regex search and replace.

Of course, Regex can also be used programmatically, and is usually built into many languages. The exact implementation depends on the language, so you should consult your language's documentation.

In JavaScript, for example, a regex can be created as a literal or dynamically using the global RegExp object:

  var re = new RegExp('abc')

This can be used directly by calling the .exec() method of the newly created regex object, or by using the .replace(), .match(), and .matchAll() methods on strings.
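Put together, those methods might be used like this. It is only a sketch: the input string and variable names are invented, and the doubled backslash is needed because the pattern is written inside an ordinary string.

```javascript
// Literal and dynamic forms describe the same pattern.
const literalRe = /abc/;
const dynamicRe = new RegExp("abc");
console.log(literalRe.source === dynamicRe.source); // true

// In a string, a backslash must be doubled to reach the Regex engine.
const digitsRe = new RegExp("\\d+", "g");
const sentence = "room 12, floor 3";

// exec() on a g-flagged regex remembers its position between calls.
const firstHit = digitsRe.exec(sentence);
const secondHit = digitsRe.exec(sentence);
console.log(firstHit[0], firstHit.index); // "12" 5
console.log(secondHit[0]);                // "3"

// String methods that accept a regex.
const viaMatch = sentence.match(/\d+/g);                              // ["12", "3"]
const viaMatchAll = [...sentence.matchAll(/\d+/g)].map((m) => m.index); // [5, 15]
const viaReplace = sentence.replace(/\d+/g, "#");                     // "room #, floor #"
console.log(viaMatch, viaMatchAll, viaReplace);
```

matchAll() requires the g flag and yields full match objects, which is useful when you need positions or capture groups for every match.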

