Text manipulation in PHP

PHP is a widely used language for server-side development, but if you don’t understand how it handles strings (text), you can run into unexpected bugs or garbled characters. This article is a step-by-step guide, suitable even for beginners, that runs from the basics of handling text in PHP through the key points for dealing with strings safely. Each section includes executable code examples, so feel free to follow along and try them out as you read.

Basics of Handling Strings in PHP ①: Two Types of Quotation Marks

In PHP, strings can be defined using either single or double quotation marks. Double quotes expand variables and escape sequences such as \n, while single quotes leave them as written, so it’s important to choose between them depending on the situation. You can also handle multi-line templates concisely by using heredoc and nowdoc syntax.

<?php
$name = 'Taro';
echo "Hello, {$name}\n";
// Variables are expanded ⇒ "Hello, Taro"

echo 'Hello, $name\n';
// No expansion; \n is also printed literally ⇒ "Hello, $name\n"

?>

Moreover, even though strings in PHP are not objects, there are many built-in functions available. Examples include strlen(), substr(), and str_replace(). Throughout this article, we’ll look at when and how to use these functions effectively.
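
As a quick preview of the three functions named above, here is a minimal sketch. The sample text and variable names are illustrative only, and the string is plain ASCII so byte counts and character counts are the same.

```php
<?php
$text = 'Hello, World';

// strlen() returns the length in bytes (12 here, since the text is ASCII).
$length = strlen($text);

// substr() extracts 5 bytes starting at offset 7 (offsets start at 0).
$part = substr($text, 7, 5);            // "World"

// str_replace() returns a new string; $text itself is not modified.
$swapped = str_replace('World', 'PHP', $text);   // "Hello, PHP"

echo $length, "\n", $part, "\n", $swapped, "\n";
```

All three functions return a new value and leave the original string untouched, which is why the results are assigned to separate variables.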

Basics of Handling Strings in PHP ②: Heredoc and Nowdoc

In PHP, there are many situations where you want to handle multi-line strings, for example when outputting HTML templates or building long messages. This is where heredoc and nowdoc come in handy.

Both are syntaxes that allow for clear, multi-line string declarations. They are more readable and easier to maintain than surrounding the string with simple quotation marks.

Heredoc

Heredoc syntax begins with <<<identifier and ends with the same identifier placed at the beginning of a new line. Like double quotes, it allows variable interpolation.

<?php
$name = "Taro";
$text = <<<EOD

Hello, {$name}.<br>
Thank you for using our service today.<br>
We look forward to serving you again.

EOD;

echo $text;

//【Display Example】
//Hello, Taro.
//Thank you for using our service today.
//We look forward to serving you again.

?>

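Note that before PHP 7.3 the closing identifier EOD had to be written at the very beginning of its line, with no spaces or tabs before it. From PHP 7.3 onward the closing identifier may be indented, and that same amount of indentation is removed from every line of the string.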

Nowdoc

Nowdoc syntax uses a single-quoted identifier (e.g., <<<'EOD'). Variable interpolation is not performed; the string is output exactly as written. It’s ideal when you want to handle templates statically.

<?php
$name = "Taro";
$text = <<<'EOD'
Hello, $name.<br>
In this text, variables are not expanded.

EOD;
echo $text; 

//【Display Example】
//Hello, $name.<br>
//In this text, variables are not expanded.

?>

Nowdoc is convenient when embedding static content that doesn’t require variable interpolation, such as HTML code or scripts.

Both heredoc and nowdoc are powerful tools that improve code readability and allow you to manage long strings safely and concisely. They are especially useful in PHP development where template processing is common, so be sure to master them.

Concatenating Strings with the Dot Operator ( . )

In PHP, the dot operator (.) is used when you want to concatenate strings. This is different from the + operator commonly used in other languages, so be aware of this distinction.

Basic Usage

<?php
$firstName = "Taro";
$lastName = "Yamada";

$fullName = $firstName . "---" . $lastName;
echo $fullName;
// Output: Taro---Yamada

?>

As shown above, you can combine multiple strings or variables with . to treat them as a single string.

Concatenation with Assignment

By using .= (dot equals), you can append to an existing string.

<?php
$message = "Hello";
$message .= ", Taro.";
$message .= " Thank you for visiting.";

echo $message;
// Output: Hello, Taro. Thank you for visiting.

?>

This is useful when you want to add multiple pieces of information in sequence.

Common Pitfalls

Put a space on both sides of the . operator. Omitting spaces reduces readability, and next to a numeric literal the dot can be misread as a decimal point.
PHP converts numbers to strings automatically when concatenating, so strval() is rarely required. Use number_format() when you need a specific display format, and wrap any arithmetic in parentheses so the order of evaluation is explicit.

<?php
$price = 1500;
echo "The price is " . $price . " yen.";

// Output: The price is 1500 yen.

?>
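
The two points about numbers can be sketched as follows. The `$shipping` value is a hypothetical addition for illustration. Parentheses are used because PHP 8 gives `+` higher precedence than `.`, while older versions evaluated the two left to right, so the explicit grouping should behave the same on both.

```php
<?php
$price = 1500;
$shipping = 500;   // hypothetical extra charge, not part of the original example

// number_format() returns a string with thousands separators.
$formatted = number_format($price);                    // "1,500"

// The parentheses force the addition to happen before the concatenation.
$total = "Total: " . ($price + $shipping) . " yen";    // "Total: 2000 yen"

echo "The price is " . $formatted . " yen.\n";
echo $total . "\n";
```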

Practical Example

String concatenation is also frequently used when outputting HTML tags.

<?php
$title = "Welcome";
$content = "<h1>" . $title . "</h1>";
echo $content;

?>

As shown, string concatenation is one of the most fundamental operations in PHP development, especially for building templates or generating email content.

Counting and Extracting Characters: `strlen`, `substr`, and `mb_` Functions

When working with strings in PHP, operations like “counting characters” or “extracting a part of a string” are frequently used. One important thing to note is the handling of multibyte characters such as Japanese.

`strlen()` Returns the Number of Bytes

<?php
$str = 'こんにちは';
echo strlen($str);

// Output: 15

?>

As shown in this example, each Japanese character takes 3 bytes (in UTF-8), so strlen() counts the 5-character string “こんにちは” as 15 bytes. While this isn’t a problem for single-byte characters like alphanumerics, it can cause mismatches with visual character counts for multibyte strings, leading to display or validation issues.
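
The same byte-based behavior applies to substr(). The sketch below assumes the source file is saved as UTF-8 and that the mbstring extension is loaded. It cuts the string after 4 bytes, which splits the second character in the middle.

```php
<?php
$str = 'こんにちは';

// substr() counts bytes: 4 bytes = all of "こ" plus the first byte of "ん".
$broken = substr($str, 0, 4);

// The result is no longer valid UTF-8, which is what shows up as garbled text.
$isValidUtf8 = mb_check_encoding($broken, 'UTF-8');

echo bin2hex($broken), "\n";   // e38193e3
var_dump($isValidUtf8);        // bool(false)
```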

To Accurately Count Characters: Use `mb_strlen()`

<?php
$str = 'こんにちは';
echo mb_strlen($str);

// Output: 5

?>

mb_strlen() **counts characters (Unicode code points) rather than bytes**, so the result matches what the user sees for ordinary Japanese text. It supports encodings like UTF-8 and can be used safely with Japanese text.

To Extract a Substring: Use `mb_substr()`

When extracting part of a string, it’s also safer to use mb_substr() instead of substr().

<?php
$str = 'こんにちは';
echo mb_substr($str, 2, 2);

// Output: にち

?>

This code extracts 2 characters starting from the 3rd character (“にち”). With mb_substr(), the start position is based on characters, not bytes.

Avoid Repeating Encoding Settings with `mb_internal_encoding()`

mb_strlen() accepts the encoding as its second argument and mb_substr() as its fourth, but it’s cumbersome to specify it every time. You can simplify this by setting the internal encoding in advance.

<?php
mb_internal_encoding('UTF-8');

$str = 'こんにちは';
echo mb_strlen($str);

// Output: 5

?>

With this setting, mb_ functions will default to UTF-8, making your code simpler and cleaner.
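
For comparison, this is roughly what passing the encoding explicitly on each call looks like. It is a small sketch that assumes a UTF-8 source file; note where the encoding parameter sits in each function.

```php
<?php
$str = 'こんにちは';

// mb_strlen(string, encoding): the encoding is the second argument.
$length = mb_strlen($str, 'UTF-8');        // 5

// mb_substr(string, start, length, encoding): the encoding is the fourth argument.
$part = mb_substr($str, 2, 2, 'UTF-8');    // "にち"

echo $length, "\n", $part, "\n";
```

Passing the encoding explicitly can still be worthwhile in library code that should not depend on a global setting.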

Supplement: List of Corresponding `mb_` Functions (Partial)

Standard Function	Multibyte-Compatible Version	Usage
`strlen()`	`mb_strlen()`	Count characters
`substr()`	`mb_substr()`	Extract part of a string
`strtoupper()`	`mb_strtoupper()`	Convert to uppercase (multilingual support)
`strpos()`	`mb_strpos()`	Search for a substring
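
The last two rows of the table can be illustrated with a short sketch, again assuming a UTF-8 source file and the mbstring extension. The sample word "café" is an illustrative choice.

```php
<?php
$str = 'こんにちは';

// strpos() reports a byte offset; each preceding character occupies 3 bytes.
$bytePos = strpos($str, 'に');                  // 6

// mb_strpos() reports a character offset.
$charPos = mb_strpos($str, 'に', 0, 'UTF-8');   // 2

// mb_strtoupper() also converts non-ASCII letters such as "é".
$upper = mb_strtoupper('café', 'UTF-8');        // "CAFÉ"

echo $bytePos, "\n", $charPos, "\n", $upper, "\n";
```

Mixing the two families is a common source of bugs: an offset from strpos() should not be passed to mb_substr(), and vice versa.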

Summary

For multibyte characters (e.g., Japanese), use mb_ functions by default
strlen() and substr() are suitable for alphanumerics, but may not work correctly with Japanese
Setting the internal encoding in advance leads to safer and cleaner code

Handling strings in a multibyte environment is a critical topic that directly affects web application display and validation. Always choose the right function for the job.

Advanced Text Matching with Regular Expressions: preg_match and preg_replace

Regular expressions are a powerful mechanism that can detect and replace complex patterns. In PHP, the PCRE (Perl Compatible Regular Expressions) engine is used, and functions like preg_match() and preg_replace() are available.

Example: Validating Email Address Format

<?php
// The target string to validate as an email address
$email = 'user@example.com';

// Use a regular expression to check if the email format is valid
// ^: start of string, $: end of string
// [\w\.-]+: one or more of alphanumeric, underscore, dot, or hyphen
// @: at symbol
// \.: dot (period), \w+: one or more alphanumeric characters

if (preg_match('/^[\w\.-]+@[\w\.-]+\.\w+$/', $email)) {
  echo 'Valid email address'; 
}

// Output: Valid email address
?>
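
A hand-written pattern like the one above is easy to read, but it rejects some addresses that are valid in practice, such as ones containing a plus sign. As an alternative sketch, PHP’s built-in filter_var() with FILTER_VALIDATE_EMAIL can do the check. The helper name `is_valid_email` is hypothetical.

```php
<?php
// Hypothetical helper: wraps filter_var() so the result is a plain boolean.
function is_valid_email(string $email): bool {
    // filter_var() returns the address on success and false on failure.
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_valid_email('user@example.com'));        // bool(true)
var_dump(is_valid_email('first+tag@example.com'));   // bool(true)
var_dump(is_valid_email('user@@example'));           // bool(false)

// The regular expression from the example above rejects the "+" address.
var_dump(preg_match('/^[\w\.-]+@[\w\.-]+\.\w+$/', 'first+tag@example.com'));  // int(0)
```

Either approach only checks the format. Whether the mailbox actually exists can only be confirmed by sending a message to it.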

Example: Convert Date Format from Slash to Hyphen

<?php

// Convert date format: change slashes to hyphens
$text = '2025/05/20';

// Use a regular expression to capture the "year/month/day" format and convert to hyphen-separated
// (\d{4}): 4-digit number (year)
// (\d{2}): 2-digit number (month)
// (\d{2}): 2-digit number (day)
// $1, $2, $3 correspond to captured values
$fixed = preg_replace('/(\d{4})\/(\d{2})\/(\d{2})/', '$1-$2-$3', $text);
echo $fixed;

// Output: 2025-05-20

?>
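
The pattern above matches any digits in the right shape, so a value like 2025/02/30 would also be rewritten. If the captured parts need checking, one possible sketch is to read them with preg_match() and pass them to checkdate(). The function name `parse_slash_date` is hypothetical.

```php
<?php
// Hypothetical helper: returns [year, month, day] or null if the date is not valid.
function parse_slash_date(string $text): ?array {
    // Anchors (^ and $) require the whole string to be in YYYY/MM/DD form.
    if (preg_match('/^(\d{4})\/(\d{2})\/(\d{2})$/', $text, $m) !== 1) {
        return null;
    }
    [$year, $month, $day] = [(int) $m[1], (int) $m[2], (int) $m[3]];

    // checkdate() takes month, day, year (in that order).
    return checkdate($month, $day, $year) ? [$year, $month, $day] : null;
}

var_dump(parse_slash_date('2025/05/20') !== null);   // bool(true)
var_dump(parse_slash_date('2025/02/30') !== null);   // bool(false)
```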

Normalizing User Input: Prevent Mismatches by Trimming Spaces and Unifying Case

Even if user-entered data looks correct, it may not be processed as expected due to extra spaces or differences in letter case. For example, “Hello” and “ hello ” may look alike, but they are different strings in code.

To prevent such mismatches, it’s important to apply **normalization of input values**.

Removing Whitespace: `trim()` Function

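The trim() function removes whitespace from the beginning and end of a string. By default it strips only ASCII whitespace (space, tab, newline, carriage return, NUL byte, and vertical tab); it does not remove the full-width space (U+3000) that is common in Japanese input. To strip that as well, use mb_trim() (available from PHP 8.4) or a Unicode-aware regular expression.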

<?php
$raw = '  Hello World  ';
$clean = trim($raw);
echo $clean;

// Output: "Hello World" (leading and trailing spaces removed)
?>
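
One way to strip full-width spaces explicitly, regardless of what trim() does by default, is a regular expression with the u (UTF-8) modifier. This is a sketch under that assumption, and the function name `trim_unicode` is hypothetical.

```php
<?php
// Hypothetical helper: trims ASCII whitespace and full-width spaces (U+3000).
function trim_unicode(string $input): string {
    // \s covers ASCII whitespace; \x{3000} adds the full-width space explicitly.
    $result = preg_replace('/^[\s\x{3000}]+|[\s\x{3000}]+$/u', '', $input);

    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return $result ?? $input;
}

$raw = "\u{3000} Hello World \u{3000}";
echo '[' . trim_unicode($raw) . ']';   // [Hello World]
```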

Unifying Case: `strtolower()` and `strtoupper()`

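If you want to ignore case differences when comparing data, it’s common to convert strings to either all lowercase or all uppercase. For text that may contain non-ASCII letters, mb_strtolower() and mb_strtoupper() are the safer choice.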

<?php
$clean = 'Hello World';
echo strtoupper($clean);
// Output: HELLO WORLD (converted to uppercase)

echo strtolower($clean);
// Output: hello world (converted to lowercase)

?>

For example, when comparing user IDs, email addresses, or category names during login or search, it’s often more natural to ignore case differences, so comparing after normalization is recommended.
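
Comparing after normalization might look like the sketch below. The helper name `normalize_for_compare` and the sample values are hypothetical; mb_strtolower() is used so that non-ASCII letters are handled as well.

```php
<?php
// Hypothetical helper: trims the value and lowercases it for comparison.
function normalize_for_compare(string $value): string {
    return mb_strtolower(trim($value), 'UTF-8');
}

$stored = 'User@Example.com';
$input  = '  user@example.COM ';

// The raw strings differ, but the normalized forms are identical.
var_dump($stored === $input);                                                 // bool(false)
var_dump(normalize_for_compare($stored) === normalize_for_compare($input));  // bool(true)
```

Normalizing both sides with the same function keeps the comparison rule in one place, so stored data and new input cannot drift apart.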

Encoding and Garbled Text Prevention: UTF-8 and mbstring Settings

Garbled characters are primarily caused by encoding mismatches. In web applications, it’s essential to standardize on UTF-8. On the PHP side, explicitly set default_charset, and in HTML, use <meta charset="UTF-8">.

<?php
// Example setting in php.ini
// default_charset = "UTF-8"

// Explicitly set in the script
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

?>

Additionally, by aligning file save formats and the database character set and collation with UTF-8 (for example, utf8mb4 in MySQL), you can significantly reduce environment-related issues.
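
When data arrives in another encoding, such as Shift_JIS from an older system, it can be converted at the boundary. The sketch below simulates such input with mb_convert_encoding() and converts it back. It assumes a UTF-8 source file and that the incoming encoding is already known; the variable names are illustrative.

```php
<?php
$original = 'こんにちは';

// Simulate legacy input: the same text encoded as Shift_JIS (2 bytes per character).
$sjis = mb_convert_encoding($original, 'SJIS', 'UTF-8');

// These bytes are not valid UTF-8, so printing them on a UTF-8 page would be garbled.
$looksLikeUtf8 = mb_check_encoding($sjis, 'UTF-8');   // false

// Convert to UTF-8 once, at the point where the data enters the application.
$utf8 = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');

var_dump($looksLikeUtf8);        // bool(false)
var_dump($utf8 === $original);   // bool(true)
```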

Practical Example: Safely Processing Form Input with Text Sanitization

Finally, here is a practical example of how to safely handle form input using the techniques discussed. We use trim() to remove whitespace, mb_substr() to limit length, and htmlspecialchars() to prevent XSS attacks.

<?php
function sanitize_input(string $input, int $max = 255): string {
  $input = trim($input); // Remove leading/trailing whitespace
  $input = mb_substr($input, 0, $max); // Limit max character length
  return htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
}

$name = $_POST['name'] ?? '';
$safe_name = sanitize_input($name);
echo "Hello, {$safe_name}!";

?>

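By following this process, you can eliminate unnecessary spaces and excessive characters while also preventing user input from being interpreted as HTML tags. Note that htmlspecialchars() is an output-escaping step, so apply it when the value is written into HTML rather than before storing it. In real-world development, also combine CSRF tokens and server-side validation to ensure comprehensive security.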
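
As a variation on the example above, validation and output escaping can be kept in separate functions, so the unescaped value can be stored and the escaping applied only when printing. This is a sketch, not a complete security layer. The names `validate_name` and `e` and the 50-character limit are hypothetical.

```php
<?php
// Hypothetical validator: returns the trimmed name, or null if empty or too long.
function validate_name(string $input, int $max = 50): ?string {
    $name = trim($input);
    if ($name === '' || mb_strlen($name, 'UTF-8') > $max) {
        return null;
    }
    return $name;
}

// Hypothetical output helper: escapes a value for use inside HTML.
function e(string $value): string {
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$name = validate_name('<b>Taro</b>');

// Tags in the input are displayed as text instead of being interpreted.
echo 'Hello, ' . e($name ?? 'guest') . '!';   // Hello, &lt;b&gt;Taro&lt;/b&gt;!
```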

Conclusion

This article covered essential techniques for handling strings in PHP. By mastering everything from quoting and concatenation to multibyte support, regular expressions, encoding, and practical sanitization, you’ll be equipped to tackle a wide range of challenges in real-world development. Use the sample code provided as a base and try combining these functions to suit your specific use cases.