Searching inside HTMLy content

Posted on 12 January 2025 by andyrew 7 min

While looking for a flat-file calendaring agent, I came across LuxFind.

LuxFind is very good at finding matches quickly and recursively, though it dumps full URLs and line matches (great for debugging, but I ultimately wanted neither in production).

"Search" in HTMLy only queries the file basename (tags & slug), not the content. Totally fantastic! But this doesn't really let me drill down to a particular context.

Since I had already been trying to work out how to search for content inside HTMLy blog posts, this piqued my interest.

So I thought: can I adapt the ideas in LuxFind to search the contents of all the HTMLy blog posts?

I modified the original script to match whole words only, use the slug as the title, and pull the year and month out individually, so the resulting link is href='site/year/month/the-respective-blog-post'.

[Here is my modified search in all its full-screen glory.]

My rewrite is very particular to searching for specific, whole-word content in HTMLy.
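To make "whole-word" concrete, here is a minimal sketch of the kind of pattern involved. The helper name wholeWordPattern() is made up for this illustration; the \b boundaries and the i and u modifiers are the same ones used in the functions further down.

```php
<?php
// Build a whole-word, case-insensitive, UTF-8 aware pattern for a search term.
// wholeWordPattern() is a hypothetical helper name used only for this illustration.
function wholeWordPattern(string $term): string
{
    // preg_quote() escapes regex metacharacters and the "~" delimiter.
    return '~\b' . preg_quote($term, '~') . '\b~iu';
}

$pattern = wholeWordPattern('cat');

var_dump(preg_match($pattern, 'The Cat sat down.'));    // int(1): whole word, case ignored
var_dump(preg_match($pattern, 'A new category page.')); // int(0): "cat" is only part of a word
```

So a search for "cat" finds posts that mention a cat, but not posts that only mention a category.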

There are two important sections.

A) The functions:

$site = "https://andyrew.info/blog/"; // where is the blog?
$rootDir = "./content/andyrew/blog/"; // folder from which the files should be searched

function getFilePaths($dir) { // get all files to be searched
    global $filePaths;
    if ($items = preg_grep("~^[^.].+~",scandir($dir))) { // get folders and files (skips dot-files)
        foreach($items as $item) {
            if (is_dir($dir."/".$item)) { // is folder?
                getFilePaths($dir."/".$item); // recurse into sub-folder
            } elseif ($item !== basename(__FILE__) and preg_match("~\.md$~i",$item)) { // valid, individual files only
                $filePaths[] = "$dir/$item";
            }
        }
    }
}

function searchFile($file, $findRx) { // searches $file with find regex
    global $counter, $site, $year, $month, $title;

    // $findRx arrives already escaped by the caller, so only the "~" delimiter is escaped here
    $findRxWhole = '\b' . str_replace('~', '\~', $findRx) . '\b'; // Add word boundaries (whole-words only)
    $findRx = "~{$findRxWhole}~iu"; // Case-insensitive and UTF-8 mode
    $lines = file($file); // load the file as an array of lines
    $hit = 0; // set to 1 once any line matches
    $fileName = pathinfo(substr($file, strpos($file, "/")))['filename']; // filename without extension
    $date = '/^(\d{4})-(\d{2})/'; // isolate the first XXXX-YY (YEAR-MO) group from the fileName
    if (preg_match($date, $fileName, $matches)) {
        $year = $matches[1]; // XXXX
        $month = $matches[2]; // YY
    }
    $title = preg_replace('(^.*_)', '', $fileName); // remove date-timestamp_tags_ from filename
    foreach ($lines as &$line) {
        $line = preg_replace($findRx,"║$0║",$line,-1,$count); //enclose hit in ║ chars
        if ($count) {
            $hit = 1;
        } else {
            $line = '';
        }
    }
    if ($hit) {
        $counter++;
        showHits($site, $file, $title, $year, $month); // search hits
    }
}

function showHits($site, $file, $title, $year, $month) { // displays the hit(s)

    $fLink = ($title ? "<span class='fTitle'>{$title}</span>\n\n" : '');
    $hRef = $site.$year.'/'.$month.'/'.$title; // specific to HTMLy site/year/month/slug format
    echo "<p class='openLink' title='{$hRef}'><a href='{$hRef}' target='_blank'>{$fLink}</a></p>\n\n";
}
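To show the filename-to-link step on its own, here is a small sketch. The function name postUrlFromFile() and the sample filename are hypothetical; the only assumption about HTMLy is the timestamp_tags_slug.md naming that the regexes above already rely on. Unlike searchFile(), this version returns null when no date is found instead of reusing whatever $year and $month were set by the previous file.

```php
<?php
// Map an HTMLy-style post filename to its public URL.
// postUrlFromFile() and the sample filename are hypothetical, for illustration only.
function postUrlFromFile(string $site, string $file): ?string
{
    $fileName = pathinfo($file, PATHINFO_FILENAME); // strip directory and ".md"

    // The filename is expected to start with YYYY-MM (year and month).
    if (!preg_match('/^(\d{4})-(\d{2})/', $fileName, $m)) {
        return null; // not a dated post file
    }

    // Everything up to the last underscore is "timestamp_tags_"; the rest is the slug.
    $slug = preg_replace('~^.*_~', '', $fileName);

    return rtrim($site, '/') . "/{$m[1]}/{$m[2]}/{$slug}";
}

echo postUrlFromFile(
    'https://andyrew.info/blog/',
    './content/andyrew/blog/2025-01-12-09-30-00_php,search_searching-inside-htmly-content.md'
), "\n";
// prints https://andyrew.info/blog/2025/01/searching-inside-htmly-content
```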

B) The Main Program:

$url = (($_SERVER['HTTPS'] ?? '') === 'on') ? 'https://' : 'http://'; // scheme (HTTPS may be unset on plain-HTTP requests)
$url .= $_SERVER['SERVER_NAME'] . $_SERVER['REQUEST_URI']; // get full URL
$find = trim($_REQUEST['s'] ?? ''); // get form input
$rootDir = rtrim($rootDir,'/'); // globals
$webPath = dirname($url); // globals
$filePaths = []; // file paths to search

echo "<div class='head'>
<form name='form' action='' method='post'>
<p>\n";
echo "<input type='text' name='s' placeholder='Search blog...' value='".htmlspecialchars($find,ENT_QUOTES)."' title='please type search text...'> 
</p>
</form>
</div>\n";

$counter = 0;
if (isset($_REQUEST['s'])) {
    if (strlen($find) >= 3) {
        $findRx = str_replace(['\?','\*'],['.','.+?'],preg_quote($find)); // prepare find regex (escape spec. chars and substitute wildcards)
        echo "<p class='head'>Search Results:</p><br />\n"; // Starts searching
        echo "<div id='scroll' class='hitSet'>\n";
        getFilePaths($rootDir); // get files to be searched
        natcasesort($filePaths); // natural sort file list
        foreach ($filePaths as $file) {
            searchFile($file,$findRx);
        }
        echo "</div>\n";
        echo "<br><div class='results'>Blog Posts Found: {$counter}</div>\n";
    } else {
        echo "<br><div class='head hilite'>Enter a search text of at least three characters!</div>";
    }
}
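The wildcard line in the main program is easy to misread, so here is a sketch of what it does to a few inputs. The wrapper name wildcardToRegex() is hypothetical; the str_replace() call is the same one as above, with the "~" delimiter passed to preg_quote() as an extra precaution. Note that the result is already escaped, so it should not be run through preg_quote() a second time, or the wildcards turn back into literal characters.

```php
<?php
// Convert a user's search text into a regex fragment, treating "?" and "*" as wildcards.
// wildcardToRegex() is a hypothetical name; the str_replace() call mirrors the main program.
function wildcardToRegex(string $find): string
{
    // preg_quote() turns "?" into "\?" and "*" into "\*"; those escaped forms
    // are then swapped for "." (any one character) and ".+?" (one or more, lazy).
    return str_replace(['\?', '\*'], ['.', '.+?'], preg_quote($find, '~'));
}

echo wildcardToRegex('b?g'), "\n";   // b.g
echo wildcardToRegex('post*'), "\n"; // post.+?
echo wildcardToRegex('a.b'), "\n";   // a\.b
```

Wrapped in word boundaries, '~\b' . wildcardToRegex('b?g') . '\b~iu' would match "big" or "bag" as whole words.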

With all this in mind, I pulled the "search" function out of functions.php:

// Return search page.
function get_keyword($keyword, $page, $perpage)
{
    $posts = get_blog_posts();

    $tmp = array();

    $words = explode(' ', $keyword);

    foreach ($posts as $index => $v) {
        $arr = explode('_', $v['basename']);
        $filter = $arr[1] . ' ' . $arr[2];
        foreach ($words as $word) {
            if (stripos($filter, $word) !== false) {
                if (!in_array($v, $tmp)) {
                    $tmp[] = $v;
                }
            }
        }
    }

    if (empty($tmp)) {
        return false;
    }

    return $tmp = get_posts($tmp, $page, $perpage);

}
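To see why this stock function never touches the post body, it helps to run the explode() step by hand. This is only a sketch; the filename is a hypothetical example of the timestamp_tags_slug.md pattern.

```php
<?php
// Show which part of a post filename the stock search looks at.
// The filename below is a hypothetical example of the timestamp_tags_slug pattern.
$basename = '2025-01-12-09-30-00_php,search_searching-inside-htmly-content.md';

$arr    = explode('_', $basename); // [timestamp, tags, slug.md]
$filter = $arr[1] . ' ' . $arr[2]; // "php,search searching-inside-htmly-content.md"

var_dump(stripos($filter, 'htmly') !== false);   // bool(true): the word is in the slug
var_dump(stripos($filter, 'luxfind') !== false); // bool(false): only appears in the post body
```

Because stripos() is a plain substring test, a keyword such as "arch" would also match the tag "search" here, which is another reason I wanted whole-word matching.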

I'm coming into all this with a basic, layman’s understanding of PHP (and programming languages in general), so it's a lot to take in and try to understand.

Some relevant code I found in functions.php that might help glue these ideas together:

   foreach ($posts as $index => $v) {

        $post = new stdClass;

        $filepath = $v['dirname'] . '/' . $v['basename'];

        $post->file = $filepath;

        $content = file_get_contents($filepath);

        // Get the contents and convert it to HTML
        $post->body = MarkdownExtra::defaultTransform(remove_html_comments($content));
        // or
        $postContent = MarkdownExtra::defaultTransform(remove_html_comments($content));

So, since I now have a few functions and parameters that seem to relate to my quest, I can start to merge these ideas and produce some measurable results.


Update: So, it's done. Both get_keyword() and keyword_count() now search the post contents:

// Return search page.
function get_keyword($keyword, $page, $perpage)
{
    $posts = get_blog_posts();

    $tmp = array();

    foreach ($posts as $index => $v) {

        $filepath = $v['dirname'] . '/' . $v['basename'];

        $findRxWhole = '\b' . preg_quote($keyword, '~') . '\b'; // Add word boundaries (find whole-words only)

        $findRx = "~{$findRxWhole}~iu"; // Case-insensitive and UTF-8 mode

        $lines = file($filepath);

        foreach ($lines as $line) {
            if (preg_match ($findRx, $line)) {
                if (!in_array($v, $tmp)) {
                    $tmp[] = $v;
                }
            }
        }
    }

    if (empty($tmp)) {
        return false;
    }

    return $tmp = get_posts($tmp, $page, $perpage);

}
// Return search result count
function keyword_count($keyword)
{
    $posts = get_blog_posts();

    $tmp = array();

    foreach ($posts as $index => $v) {

        $filepath = $v['dirname'] . '/' . $v['basename'];

        $findRxWhole = '\b' . preg_quote($keyword, '~') . '\b'; // Add word boundaries (find whole-words only)

        $findRx = "~{$findRxWhole}~iu"; // Case-insensitive and UTF-8 mode

        $lines = file($filepath);

        foreach ($lines as $line) {
            if (preg_match ($findRx, $line)) {
                if (!in_array($v, $tmp)) {
                    $tmp[] = $v;
                }
            }
        }
    }

    $tmp = array_unique($tmp, SORT_REGULAR);

    return count($tmp);
}
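Both functions repeat the same per-file check, so one possible tidy-up (a sketch only, not what went into HTMLy) is a small shared helper. fileContainsWord() is a hypothetical name. It builds the regex once per call and stops reading at the first matching line.

```php
<?php
// Return true as soon as any line of the file contains the keyword as a whole word.
// fileContainsWord() is a hypothetical helper, not part of HTMLy.
function fileContainsWord(string $filepath, string $keyword): bool
{
    $findRx = '~\b' . preg_quote($keyword, '~') . '\b~iu';

    $lines = @file($filepath); // false if the file cannot be read
    if ($lines === false) {
        return false;
    }

    foreach ($lines as $line) {
        if (preg_match($findRx, $line) === 1) {
            return true; // stop at the first hit
        }
    }
    return false;
}

// Usage with a throwaway file:
$tmp = tempnam(sys_get_temp_dir(), 'post');
file_put_contents($tmp, "# Title\n\nSearching inside HTMLy content.\n");
var_dump(fileContainsWord($tmp, 'htmly'));  // bool(true)
var_dump(fileContainsWord($tmp, 'search')); // bool(false): "Searching" is a longer word
unlink($tmp);
```

With something like this, the loop body in each function would shrink to a single if (fileContainsWord($filepath, $keyword)) test.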

[edit: 20250128]

My PR was accepted by danpros, but I forgot to limit the search $keyword to at least three characters, so I wrapped the function body in a strlen() check:

// Return search result count
function keyword_count($keyword)
{
    if (strlen($keyword) >= 3) { // three-character minimum

        // ...

    }
}

Without a minimum length, one could search for "t", which goes against the whole idea of a specific keyword search.
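One caveat with strlen() is that it counts bytes, not characters. A possible variant, sketched here under the assumption that the mbstring extension is available, uses mb_strlen() instead. isSearchableKeyword() and its $min parameter are hypothetical names.

```php
<?php
// Decide whether a keyword is long enough to search for.
// isSearchableKeyword() and the $min parameter are hypothetical names.
function isSearchableKeyword(string $keyword, int $min = 3): bool
{
    // mb_strlen() counts characters; strlen() would count bytes,
    // so a two-character UTF-8 keyword could slip past a strlen() check.
    return mb_strlen(trim($keyword), 'UTF-8') >= $min;
}

$twoKanji = "\u{65E5}\u{672C}"; // two characters, six bytes in UTF-8

var_dump(isSearchableKeyword('t'));       // bool(false)
var_dump(isSearchableKeyword('php'));     // bool(true)
var_dump(isSearchableKeyword($twoKanji)); // bool(false), although strlen($twoKanji) is 6
```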

The additional, modified PR was accepted, and I now consider this endeavour Closed ;)


Along the way, I also worked out how to modify the title in the $is_search block of main.html.php so that it labels the search and shows $search->count:

<?php if (isset($is_search)):?>
<!-- main.html.php -->
<div class="row justify-content-center" style="padding-top: 3rem;">
    <div class="col-md-12 text-center">
        <h2 class="mt-0">Search: <span style='color: #628B48;'><?php echo $search->title;?></span> (<?php echo $search->count;?>)</h2>
        <form><input type="search" name="search" class="form-control is-search" placeholder="<?php echo i18n('Type_to_search');?>"></form>
    </div>
</div>
<?php endif;?>

fini