Script to Automatically Mirror Blog on Seperate Domain

Script to Automatically Mirror Blog on Seperate Domain

The idea of having a blog mirror on a personal server stuck me when Reliance Jio, my ISP arbitrarily started blocking GitLab Pages sites, which also houses my blog. Now doing it the manual way by copying the generated file HTML/CSS/JS file onto the server wouldn’t be fun, so just mirroring the blog through a script via wget and scheduling it via cronjob was the initial thought.

My blog is rather simple. It’s a bunch of static files generated through hugo in a git repository for hosting through GitLab pages. A wget to grab the live files and serving them via nginx is pretty simple due to this.

The plan was divided into the following parts:

Now let me walk you through how I came up with the script.

Script to download content

Initially I thought of using wget to mirror the website and writing a script to change all references to blog.sahilister.in to point at blog.mirror.sahilister.in through some grep/awk/sed magic. Did a simple internet search to get all required flags for wget. Tried downloading with the following flags 1:

wget -m -p -E -k blog.sahilister.in

And to my surprise it automatically handled relative linking due to -k flag, leaving the requirement of changing references. The downloaded document was perfect for directly hosting on a server.

The flag explanation as taken from the answer 1:

-m, --mirror            Turns on recursion and time-stamping, sets infinite 
                        recursion depth, and keeps FTP directory listings.
-p, --page-requisites   Get all images, etc. needed to display HTML page.
-E, --adjust-extension  Save HTML/CSS files with .html/.css extensions.
-k, --convert-links     Make links in downloaded HTML point to local files.

I wanted the website files to be downloaded in a specific, pre-defined directory. Read wget --help and found -P flag. -P or --directory-prefix flag allows defining a location for downloaded files. Resulting query was:

wget -m -p -E -k blog.sahilister.in -P /var/www/html/

Putting everything in bash script and using variables in places of URL and location, the script until now:

#!/bin/bash
URL=https://blog.sahilister.in
LOC=/var/www/html/

wget -mpEk ${URL} -P ${LOC}

Check website status before mirroring

As it was an unattended wget, checking if the site is up was important. If the website was down for some reason, wget will override the existing, working pages with wrong pages. Again searched the internet for any command to check website status. Found this:

curl -I -s https://blog.sahilister.in

Here is the response site header:

HTTP/2 200 
cache-control: max-age=600
content-type: text/html; charset=utf-8
expires: Thu, 08 Apr 2021 10:12:27 UTC
vary: Origin
content-length: 9175
date: Thu, 08 Apr 2021 10:02:27 GMT

curl with -s or --silent triggers the silent mode that doesn’t show in process messages. -I or --head flags fetches the headers only, as we didn’t require the whole document for checking website status.

A 200 http status code in the first line shows site is up and OK. The rest of the header wasn’t required, so the initial curl command was piped into head command to output only the first line. The query became:

curl -I -s https://blog.sahilister.in | head -1

-1 is a shorthand for -n 1 flag which tells head command to output only the first line.

Now the response became:

HTTP/2 200 

The relevant part is 200, so another round of piping through cut for stdout on the status code:

curl -I -s https://blog.sahilister.in | head -1 | cut -d ' ' -f 2

The -d or --delimiter flag tells the delimiter to be looked for, in this case space using ' '. -f or --field flag, using 2 as argument prints only the relevant status code part.

Now comes the part to add conditional statements to download from the website only when the site is up (status code 200).

if [[ $(curl -Is ${URL} | head -1 | cut -d ' ' -f 2) -eq 200 ]]; then
    wget -m -p -E -k blog.sahilister.in -P /var/www/html/
else 
    exit 1
fi

Putting everything in bash script and using variables in places of URL and location, the script until now:

#!/bin/bash
URL=https://blog.sahilister.in
LOC=/var/www/html/
siteStatus=$(curl -Is ${URL} | head -1 | cut -d ' ' -f 2)

if [[ siteStatus -eq 200 ]]; then
    wget -mpEk ${URL} -P ${LOC}
else 
    exit 1
fi

Adding logging

Now, I wanted some logs to occasionally check everything is fine or if something is going amiss. Somethings, I wanted in logs were total download time for wget, timings, sites status if download doesn’t happen and wget errors.

First came total download time. wget does stdout certain logs but extracting it was a hassle. There was an option of running the script with time command, but that would have to be used outside the script, also I didn’t bother making a short function by enclosing everything and doing time func(). After all the script is quite short in itself. So again internet came to the rescue. Saving the initial time and subtracting it with time in the end, gives the run-time for wget. Following was the implementation:

start=$(date +%s)
duration=$(echo "$(date +%s) - $start" | bc)
echo $duration

+%s flag in date gives seconds since 1970-01-01 00:00:00 UTC (Unix/epoch time). bc command was used for calculation.

Next came the run and error logs, which would add date, time and download time taken if successful or error code if it fails. On a successful run, the following statement was applied:

echo "$(date --rfc-3339=s) download completed in ${duration}s." >> run.log

In a situation where wget fails, the following statement would run:

echo "$(date --rfc-3339=s) wget failed with ${?}." >> error.log

Plus if some status code other than 200 is returned in website up check and download doesn’t happen, the following statement would be executed:

echo "$(date --rfc-3339=s) Site returned ${siteStatus}, exiting." >> error.log

Putting everything in bash script and using variables, the final bash script is:

#!/bin/bash
URL=https://blog.sahilister.in
LOC=/var/www/html/
RUN_LOG=~/mirror-logs/run.log
ERROR_LOG=~/mirror-logs/error.log

printf "
+-+-+-+-+-+-+ +-+-+-+-+-+-+-+
|s|c|r|i|p|t| |s|t|a|r|t|e|d|
+-+-+-+-+-+-+ +-+-+-+-+-+-+-+  
"
start=$(date +%s)
siteStatus=$(curl -Is ${URL} | head -1 | cut -d ' ' -f 2)
if [[ siteStatus -eq 200 ]]; then 
        wget -mpEkq ${URL} -P ${LOC}
        duration=$(echo "$(date +%s) - $start" | bc)
        if [[ ${?} -eq 0 ]]; then
                echo "$(date --rfc-3339=s) download completed in ${duration}s." >> ${RUN_LOG}
        else
                echo "$(date --rfc-3339=s) wget failed with ${?}." >> ${ERROR_LOG}
        fi
else 
        echo "$(date --rfc-3339=s) Site returned ${siteStatus}, exiting." >> ${ERROR_LOG}
        exit 1
fi

Added a “script started” in the beginning through figlet. Also added -q or --quiet flag for no stdout from wget.

Adding a cronjob for running the script

As it’s an inexpensive operation (about 1 MB of download and 16-18s of runtime) and I want to keep the mirror as synced as possible, so scheduled the cronjob to run twice daily at 1am and 1pm UTC. The entry added through crontab -e was as follows:

0 1,13 * * * ~/mirror-script.sh

Nginx configurations

Added a nginx configuration for blog.mirror.sahilister.in with root pointing towards /var/www/html/blog.sahilister.in (the directory where mirroring happens, didn’t add renaming in wget). Every time the script runs, it overrides this directory with an updated version of the blog.

Final thoughts

That completes the setup. The mirror is live now at blog.mirror.sahilister.in. Git repository for the script can be found here. Code on this page and in git are licensed under Apache License 2.0.


  1. “How Can I Download an Entire Website?” Super User↩︎