[CLUE-Tech] Re: PHP - str_replace, preg_replace, wtf?

David L. Willson DLWillson at TheGeek.NU
Fri Feb 27 17:02:30 MST 2004


I fixed it.  Apparently, my newlines and backslash n's weren't making it
all the way through to the replacement engine.  I needed to double my
backslashes.  As a bonus, when I replaced str_replace with preg_replace,
I was able to reduce the number of elements in the replacement arrays. 
The program looks like this now: 
-----------------------------------------------------------------
#!/usr/bin/php -q

<?php

/**
* @return string
* @param string $strHTML
* @desc Converts ~all~ HTML amp-codes to actual characters.
*/
function html_decode($strHTML) {
   $strHTML = html_entity_decode($strHTML);
   // Callback form: the /e modifier is not available in PHP 7 and later.
   $strHTML = preg_replace_callback(
      "/&#(\d{1,4});/",
      function ($m) { return chr((int) $m[1]); },
      $strHTML
   );
   return $strHTML;
}

/**
* @return void
* @param int $urgency (1 for errors, 2 for warnings, 3 for informational chatter)
* @param any $message
* @desc Show something, if it is more urgent than the threshold
*/
function chat ($urgency,$message) {
   global $chatlevel;
   if ($chatlevel >= $urgency){
      /* 
      What I ~should~ do is check if $message is a string 
      with a \n for the last char, and echo a \n, if not.
      */
      print_r($message);
   }
}
$chatlevel = 3; // 0 = no chatting, 3 = chatty

$e_info = 3; $e_warn = 2; $e_err = 1;

$files = glob("/home/dlwillson/gatstuph/example.*");

$old[0] = "/&#39;/"       ; $new[0] = "''"   ;
$old[] = "/&#160;/"       ; $new[] = chr(32) ;
$old[] = "/".chr(160)."/" ; $new[] = chr(32) ;
$old[] = "/&nbsp;/"       ; $new[] = chr(32) ;
$old[] = "/ \ +/"         ; $new[] = chr(32) ;
$old[] = "/^ +/"          ; $new[] = ""      ;
$old[] = "/ '/"           ; $new[] = "'"     ;
$old[] = "/\\\\r/"        ; $new[] = ""      ;
$old[] = "/\\r/"          ; $new[] = ""      ;
$old[] = "/\\\\n'/"       ; $new[] = "'"     ;
$old[] = "/'\\\\n/"       ; $new[] = "'"     ;
$old[] = "/\s*\\\\n\s*/"  ; $new[] = "\\\\n" ;
$old[] = "/\\\\n\\\\n/"   ; $new[] = "\\\\n" ;
$old[] = "/\s*\\n\s*/"    ; $new[] = "\\n"   ;
$old[] = "/\\n\\n+/"      ; $new[] = "\\n"   ;

print_r($old);
print_r($new);

foreach ( $files as $file_num => $infile) {
   chat($e_warn, "Begin processing on file $file_num: $infile\n");
   $outfile = fopen(str_replace("gatstuph","gatstuph/testing",$infile),'w');
   $arrFile = file($infile);
   //print_r($arrFile);
   foreach ( $arrFile as $line_num => $line_text ) {
      chat($e_info, "$line_num:$line_text");
      // Single-quotes are most trouble-some.  Getting rid of them first.
      $line_text = rtrim($line_text);
      $newString = preg_replace($old,$new,$line_text);
      while ($line_text <> $newString) {
         $line_text = $newString;
         $newString = preg_replace($old,$new,$line_text);
      }
      chat($e_info, "$line_num:$line_text\n");
      $newString = html_decode($line_text);
      while ($line_text <> $newString) {
         $line_text = $newString;
         $newString = html_decode($line_text);
      }
      $line_text = strip_tags($line_text);
      chat($e_info, "$line_num:$line_text\n");
      $newString = preg_replace($old,$new,$line_text);
      while ($line_text <> $newString) {
         $line_text = $newString;
         $newString = preg_replace($old,$new,$line_text);
      }
      $line_text .= "\n";
      chat($e_info, "$line_num:$line_text\n");
      fwrite($outfile, $line_text);
   }
   fclose($outfile);
}

?>
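
For anyone who hits the same thing, here is a minimal sketch of the
backslash doubling in isolation.  The sample string and the variable
names are made up for illustration and are not from the real data.  It
assumes the text holds a literal backslash followed by n (two
characters), not a real newline.

```php
<?php
// Hypothetical sample: single quotes keep the backslash as-is, so this
// string contains the two characters backslash + n, not a newline.
$text = 'first \n second';

// str_replace sees the PHP string "\\n" as backslash + n, which is
// exactly what is in the text.
$plain = str_replace(" \\n ", "\\n", $text);

// preg_replace hands the string to the regex engine, which needs \\ to
// match one backslash.  That means four backslashes in the PHP source.
$regex = preg_replace("/\s*\\\\n\s*/", "\\\\n", $text);

// With only two backslashes the regex engine sees \n, which matches a
// real newline character.  There is none here, so nothing changes.
$unchanged = preg_replace("/\s*\\n\s*/", "X", $text);

var_dump($plain === 'first\nsecond');
var_dump($regex === 'first\nsecond');
var_dump($unchanged === $text);
```

That would explain why the arrays in the program carry both a
four-backslash and a two-backslash version of each pattern: one for the
literal text, one for real control characters.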
On Thu, 2004-02-26 at 15:38, David L. Willson wrote:
> I'm trying to do some stripping and cleaning of data.
> These are my goals:
>  - Remove all the encoded HTML by decoding it, and then stripping it
> out.
>  - Remove all extraneous whitespace ('\r', doubled spaces,
> doubled-newlines, mixed adjacent whitespace, etc...)
> 
> I'll show you a code snippet, then an output snippet.  The bugger is
> exhibited from 'Line 1432' of the output onward.  Why won't the spaces
> near the '\n's go away?  I'd like to know what I'm doing wrong, or
> failing that, a method that works!
> -------------------------------------------------------------
> #!/usr/bin/php -q
> 
> <?php
> 
> /**
> * @return string
> * @param string $strHTML
> * @desc Converts ~all~ HTML amp-codes to actual characters.
> */
> function html_decode($strHTML) {
>    $strHTML = html_entity_decode($strHTML);
>    $strPattern = "/&#(\d{1,4});/e";
>    $strReplace = "chr($1)";
>    $strHTML = preg_replace($strPattern,$strReplace,$strHTML);
>    return $strHTML;
> }
> 
> $chatlevel = 3; // 0 = no chatting, 3 = chatty
> 
> $e_info = 3; $e_warn = 2; $e_err = 1;
> 
> $files = glob("/home/dlwillson/gatstuph/data-sql/airland*.sql");
> 
> $old[0] = "&#39;" ; $new[0] = "''"   ;
> $old[] = "&#160;" ; $new[] = chr(32) ;
> $old[] = chr(160) ; $new[] = chr(32) ;
> $old[] = "&nbsp;" ; $new[] = chr(32) ;
> $old[] = "  "     ; $new[] = chr(32) ;
> $old[] = "\\r"    ; $new[] = ""      ;
> $old[] = "\r"     ; $new[] = ""      ;
> $old[] = "\\n'"   ; $new[] = "'"     ;
> $old[] = "'\\n"   ; $new[] = "'"     ;
> $old[] = " \\n "  ; $new[] = "\\n"   ;
> $old[] = "\\n "   ; $new[] = "\\n"   ;
> $old[] = " \\n"   ; $new[] = "\\n"   ;
> 
> foreach ( $files as $file_num => $infile) {
> //   chat($e_warn, "Begin processing on file: $infile\n");
>    $outfile = fopen(str_replace("data-sql","testing",$infile),'w');
>    $arrFile = file($infile);
>    foreach ( $arrFile as $line_num => $line_text ) {
>       chat($e_info, "Line $line_num: $line_text");
>       // Single-quotes are most trouble-some.  Getting rid of them first.
>       $line_text = rtrim($line_text);
>       while ($line_text <> str_replace($old,$new,$line_text)) {
>          $line_text = str_replace($old,$new,$line_text);
>       }
>       $line_text = html_decode($line_text);
>       $line_text = strip_tags($line_text);
>       while ($line_text <> str_replace("\\n\\n", "\\n", $line_text)) {
>          $line_text = str_replace("\\n\\n", "\\n", $line_text);
>       }
>       $line_text .= "\n";
>       chat($e_info, "Line $line_num: $line_text");
>       fwrite($outfile, $line_text);
>    }
> }
> ?>
> --------------------------------Output------------------------------------
> Line 1408: INSERT INTO MSTR_ProductRefs (item_id, "ProductRefID",
> "ProductRefName")
> Line 1408: INSERT INTO MSTR_ProductRefs (item_id, "ProductRefID",
> "ProductRefName")
> Line 1409:     VALUES ('wiannircocra', 'index', 'Home');
> Line 1409:  VALUES ('wiannircocra', 'index', 'Home');
> Line 1410: INSERT INTO MSTR_ProductRefs (item_id, "ProductRefID",
> "ProductRefName")
> Line 1410: INSERT INTO MSTR_ProductRefs (item_id, "ProductRefID",
> "ProductRefName")
> Line 1411:     VALUES ('wiannircocra', 'racobo1', 'Radio Control
> Boats');
> Line 1411:  VALUES ('wiannircocra', 'racobo1', 'Radio Control Boats');
> Line 1412: INSERT INTO MSTR_Products (storename, tablename, table_id,
> item_id,
> Line 1412: INSERT INTO MSTR_Products (storename, tablename, table_id,
> item_id,
> Line 1413:     "product-url",
> Line 1413:  "product-url",
> Line 1414:     "name",
> Line 1414:  "name",
> Line 1415:     "image",
> Line 1415:  "image",
> Line 1416:     "code",
> Line 1416:  "code",
> Line 1417:     "price",
> Line 1417:  "price",
> Line 1418:     "sale-price",
> Line 1418:  "sale-price",
> Line 1419:     "orderable",
> Line 1419:  "orderable",
> Line 1420:     "caption",
> Line 1420:  "caption",
> Line 1421:     "features",
> Line 1421:  "features",
> Line 1422:     "specification",
> Line 1422:  "specification",
> Line 1423:     "taxable")
> Line 1423:  "taxable")
> Line 1424:   VALUES ('RCGATELYS', 'item.', 'solidtype', 'wiannircocra',
> Line 1424:  VALUES ('RCGATELYS', 'item.', 'solidtype', 'wiannircocra',
> Line 1425:     'http://store.yahoo.com/rcgatelys/wiannircocra.html',
> Line 1425:  'http://store.yahoo.com/rcgatelys/wiannircocra.html',
> Line 1426:     'Wicked Angel Nitro R/C Ocean Racer',
> Line 1426:  'Wicked Angel Nitro R/C Ocean Racer',
> Line 1427:     'http://edit.store.yahoo.com/I/rcgatelys_1772_14861613',
> Line 1427:  'http://edit.store.yahoo.com/I/rcgatelys_1772_14861613',
> Line 1428:     'MTC6801, MTC7500',
> Line 1428:  'MTC6801, MTC7500',
> Line 1429:     '$425.00',
> Line 1429:  '$425.00',
> Line 1430:     '$354.99',
> Line 1430:  '$354.99',
> Line 1431:     'T',
> Line 1431:  'T',
> Line 1432:     'The ready&#45;to&#45;race Wicked Angel is a
> pro&#45;competition nitro ocean racer. This 30&#45;inch long, 35MPH boat
> featres Megatech&#39;s 2.7cc Nitro Mariner engine with unique AquaRam
> water cooling feature. From stem to stern, every design feature found on
> this model was put there for one reason&#45;to make it go
> faste!&#60;br&#62;\r\n&#60;p&#62;Wicked Angel&#39;s modified V hull
> design featuring multiple planning strakes for breaking up wate surface
> tension, the biggest drag&#45;causing factor on any race boat, and the
> Wicked Angel&#39;s planing strakes are placed at serveral points on the
> hull&#39;s bottom to defeat this speed&#45;robbing phenomenon. As a
> result&#45;this boat literallly flies over the water&#39;s
> surface.&#60;br&#62;\r\n	&#60;br&#62;\r\n&#60;/p&#62;\r\n&#60;p&#62;Other full race features include heavy&#45;duty drive shaft with bronze phosphorous bushings, triple shoe centrifugal clutch, quick&#45;fill racing tank, shaft oiler reservoir and sealed radio box. The Wicked Angel comes fully assembled with engine and 2 channel radio installed and guaranteed to ecite any onlookers!&#60;br&#62;\r\n&#60;/p&#62;\r\n&#60;p&#62;&#60;/p&#62;',
> Line 1432:  'The ready-to-race Wicked Angel is a pro-competition nitro
> ocean racer. This 30-inch long, 35MPH boat featres Megatech''s 2.7cc
> Nitro Mariner engine with unique AquaRam water cooling feature. From
> stem to stern, every design feature found on this model was put there
> for one reason-to make it go faste!\nWicked Angel''s modified V hull
> design featuring multiple planning strakes for breaking up wate surface
> tension, the biggest drag-causing factor on any race boat, and the
> Wicked Angel''s planing strakes are placed at serveral points on the
> hull''s bottom to defeat this speed-robbing phenomenon. As a result-this
> boat literallly flies over the water''s surface.\n	\nOther full race
> features include heavy-duty drive shaft with bronze phosphorous
> bushings, triple shoe centrifugal clutch, quick-fill racing tank, shaft
> oiler reservoir and sealed radio box. The Wicked Angel comes fully
> assembled with engine and 2 channel radio installed and guaranteed to
> ecite any onlookers!\n',
> Line 1433:     '&#60;li type=&#34;disc&#34;&#62;M16 1+ HP Nitro Water
> Cooled Motor &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;Tuned
> Exhaust System &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;Custom
> tuned High Speed Prop &#60;br&#62;\r\n	&#60;li
> type=&#34;disc&#34;&#62;Water proof Sealed Radio Box
> &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;35+ mph Out Of The Box
> &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;Adjustable trim tabs
> and skid fins &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;15&#45;20
> Minute run time &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;2
> Channel FM Radio &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;Gas
> Completer Combo &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;Fuel
> &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;Glow ignitor
> &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;Glow Plug,
> &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;Glow Plug Wrench
> &#60;br&#62;\r\n	&#60;li type=&#34;disc&#34;&#62;12 AA Batteries
> \r\n        &#60;li type=&#34;disc&#34;&#62;Gas Completer Combo
> (MTC7500) Fuel &#60;br&#62;',
> Line 1433:  'M16 1+ HP Nitro Water Cooled Motor \n	Tuned Exhaust System
> \n	Custom tuned High Speed Prop \n	Water proof Sealed Radio Box \n	35+
> mph Out Of The Box \n	Adjustable trim tabs and skid fins \n	15-20 Minute
> run time \n	2 Channel FM Radio \n	Gas Completer Combo \n	Fuel \n	Glow
> ignitor \n	Glow Plug, \n	Glow Plug Wrench \n	12 AA Batteries\nGas
> Completer Combo (MTC7500) Fuel ',
> Line 1434:     'Hull Length :  30&#38;quot; &#60;br&#62;Weight:   48 oz
> &#60;br&#62;',
> Line 1434:  'Hull Length : 30&quot; Weight: 48 oz ',
> Line 1435:     'T');
> Line 1435:  'T');
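
Looking at the quoted output, two other things may have contributed to
the stray whitespace, though this is a reading of the output rather than
a confirmed diagnosis.  The old patterns only covered spaces, and the
data has tabs after the \n.  Also, the cleanup ran before the entities
were decoded and the tags stripped, so a space in front of an encoded
<br> was not yet next to the \n.  The revised program matches \s* and
re-runs the replacements after strip_tags().  A small sketch of the
ordering effect follows.  The sample string and the helper names
squeeze() and decode_and_strip() are hypothetical.

```php
<?php
// Hypothetical sample modelled on the 'Line 1433' data: a space sits
// before the encoded <br>, and a tab follows the literal backslash-n.
$raw = 'Motor &#60;br&#62;\n' . "\t" . '&#60;li&#62;Tuned';

// Collapse any whitespace (spaces or tabs) around a literal backslash-n.
function squeeze($s) {
    return preg_replace("/\s*\\\\n\s*/", "\\\\n", $s);
}

// Decode the entities, then drop the tags that appear.
function decode_and_strip($s) {
    return strip_tags(html_entity_decode($s));
}

// Cleaning first: the space is next to "&#60;", not next to the \n,
// so it survives and ends up adjacent to the \n after the tags go.
$clean_first = decode_and_strip(squeeze($raw));

// Cleaning last: the space is adjacent to the \n by now and is removed.
$clean_last = squeeze(decode_and_strip($raw));

var_dump($clean_first);
var_dump($clean_last);
```

Running the cleanup both before and after decoding, as the revised
program does, covers either case.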



