Script to Automatically Mirror Blog on Separate Domain
The idea of having a blog mirror on a personal server struck me when Reliance Jio, my ISP, arbitrarily started blocking GitLab Pages sites, which also host my blog. Doing it the manual way by copying the generated HTML/CSS/JS files onto the server wouldn’t be fun, so the initial thought was to mirror the blog through a script via wget and schedule it via a cronjob.
My blog is rather simple. It’s a bunch of static files generated through hugo in a git repository, hosted through GitLab Pages. Because of this, grabbing the live files with wget and serving them via nginx is pretty simple.
The plan was divided into the following parts:
- Script to download content
- Check website status before mirroring
- Adding logging
- Adding a cronjob for running the script
- Nginx configurations
- Final thoughts
Now let me walk you through how I came up with the script.
Script to download content
Initially, I thought of using wget to mirror the website and writing a script to change all references to blog.sahilister.in to point at blog.mirror.sahilister.in through some grep/awk/sed magic. A simple internet search turned up the required flags for wget, and I tried downloading with the following flags 1:
wget -m -p -E -k blog.sahilister.in
To my surprise, it automatically handled relative linking due to the -k flag, removing the need to change references (a quick way to spot-check this is sketched after the flag list below). The downloaded files were perfect for hosting directly on a server.
The flag explanations, as taken from the answer 1:
-m, --mirror Turns on recursion and time-stamping, sets infinite
recursion depth, and keeps FTP directory listings.
-p, --page-requisites Get all images, etc. needed to display HTML page.
-E, --adjust-extension Save HTML/CSS files with .html/.css extensions.
-k, --convert-links Make links in downloaded HTML point to local files.
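One hypothetical way to spot-check the conversion (assuming the mirror was downloaded into the current directory, so the pages sit under ./blog.sahilister.in/):
grep -o 'href="[^"]*"' blog.sahilister.in/index.html | head
After -k has done its work, the listed hrefs point at local relative paths instead of absolute https://blog.sahilister.in URLs.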
I wanted the website files to be downloaded into a specific, pre-defined directory. Reading wget --help, I found the -P flag. -P or --directory-prefix allows defining a location for the downloaded files. The resulting command was:
wget -m -p -E -k blog.sahilister.in -P /var/www/html/
Putting everything in a bash script and using variables in place of the URL and location, the script so far:
#!/bin/bash
URL=https://blog.sahilister.in
LOC=/var/www/html/
wget -mpEk ${URL} -P ${LOC}
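One detail worth noting: with -m, wget saves everything under a host-named subdirectory inside the prefix, so the mirrored files end up in /var/www/html/blog.sahilister.in/ rather than directly in /var/www/html/ (this is where the nginx root points later). A quick check after a run:
ls /var/www/html/
which, assuming nothing else lives there, lists just:
blog.sahilister.in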
Check website status before mirroring
As it was an unattended wget, checking whether the site is up was important. If the website were down for some reason, wget would overwrite the existing, working pages with wrong ones. I again searched the internet for a command to check website status and found this:
curl -I -s https://blog.sahilister.in
Here is the site’s response header:
HTTP/2 200
cache-control: max-age=600
content-type: text/html; charset=utf-8
expires: Thu, 08 Apr 2021 10:12:27 UTC
vary: Origin
content-length: 9175
date: Thu, 08 Apr 2021 10:02:27 GMT
curl with -s or --silent triggers silent mode, which suppresses progress messages. The -I or --head flag fetches the headers only, as we didn’t require the whole document for checking the website status.
A 200 HTTP status code in the first line shows the site is up and OK. The rest of the header wasn’t required, so the initial curl command was piped into the head command to output only the first line. The command became:
curl -I -s https://blog.sahilister.in | head -1
-1 is shorthand for the -n 1 flag, which tells the head command to output only the first line.
Now the response became:
HTTP/2 200
The relevant part is 200, so another round of piping, through cut, prints only the status code:
curl -I -s https://blog.sahilister.in | head -1 | cut -d ' ' -f 2
The -d or --delimiter flag specifies the delimiter to look for, in this case a space using ' '. The -f or --fields flag, with 2 as its argument, prints only the relevant status code part.
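So, with the site up, the full pipeline outputs just:
200
and nothing else, which is exactly what the conditional check below compares against.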
Now comes the part of adding a conditional statement so the download happens only when the site is up (status code 200).
if [[ $(curl -Is ${URL} | head -1 | cut -d ' ' -f 2) -eq 200 ]]; then
    wget -m -p -E -k blog.sahilister.in -P /var/www/html/
else
    exit 1
fi
Putting everything in a bash script and using variables in place of the URL and location, the script now became:
#!/bin/bash
URL=https://blog.sahilister.in
LOC=/var/www/html/
siteStatus=$(curl -Is ${URL} | head -1 | cut -d ' ' -f 2)
if [[ ${siteStatus} -eq 200 ]]; then
    wget -mpEk ${URL} -P ${LOC}
else
    exit 1
fi
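A quick manual test of the script at this stage (the file name here matches the one used for the cronjob later; adjust if yours differs):
chmod +x ~/mirror-script.sh
~/mirror-script.sh
echo $?
An exit code of 0 means the mirror was refreshed; anything non-zero means either the site check or wget itself failed.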
Adding logging
Now, I wanted some logs to occasionally check that everything is fine or whether something is going amiss. The things I wanted in the logs were the total download time for wget, timestamps, the site status if the download doesn’t happen, and wget errors.
First came the total download time. wget does print certain logs to stdout, but extracting the timing from them was a hassle. There was the option of running the script with the time command, but that would have to be used outside the script, and I didn’t bother making a short function by enclosing everything and doing time func(); after all, the script is quite short in itself. So again the internet came to the rescue: saving the initial time and subtracting it from the time at the end gives the run-time for wget.
Following was the implementation:
start=$(date +%s)
duration=$(echo "$(date +%s) - $start" | bc)
echo $duration
The +%s format in date gives seconds since 1970-01-01 00:00:00 UTC (Unix/epoch time). The bc command was used for the calculation.
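A quick way to sanity-check the arithmetic, with sleep standing in for the wget run:
start=$(date +%s)
sleep 2
echo "$(date +%s) - $start" | bc
This prints 2 (give or take a second).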
Next came the run and error logs, which would record the date, time, and download time taken if successful, or the error code if it fails. On a successful run, the following statement would be executed:
echo "$(date --rfc-3339=s) download completed in ${duration}s." >> run.log
In a situation where wget fails, the following statement would run:
echo "$(date --rfc-3339=s) wget failed with ${?}." >> error.log
Plus, if some status code other than 200 is returned by the website-up check and the download doesn’t happen, the following statement would be executed:
echo "$(date --rfc-3339=s) Site returned ${siteStatus}, exiting." >> error.log
Putting everything in a bash script and using variables, the final bash script is:
#!/bin/bash
URL=https://blog.sahilister.in
LOC=/var/www/html/
RUN_LOG=~/mirror-logs/run.log
ERROR_LOG=~/mirror-logs/error.log
printf "
+-+-+-+-+-+-+ +-+-+-+-+-+-+-+
|s|c|r|i|p|t| |s|t|a|r|t|e|d|
+-+-+-+-+-+-+ +-+-+-+-+-+-+-+
"
start=$(date +%s)
siteStatus=$(curl -Is ${URL} | head -1 | cut -d ' ' -f 2)
if [[ ${siteStatus} -eq 200 ]]; then
    wget -mpEkq ${URL} -P ${LOC}
    wgetStatus=${?}    # capture wget's exit code before later commands overwrite ${?}
    duration=$(echo "$(date +%s) - $start" | bc)
    if [[ ${wgetStatus} -eq 0 ]]; then
        echo "$(date --rfc-3339=s) download completed in ${duration}s." >> ${RUN_LOG}
    else
        echo "$(date --rfc-3339=s) wget failed with ${wgetStatus}." >> ${ERROR_LOG}
    fi
else
    echo "$(date --rfc-3339=s) Site returned ${siteStatus}, exiting." >> ${ERROR_LOG}
    exit 1
fi
Added a “script started” banner at the beginning, generated through figlet. Also added the -q or --quiet flag to suppress stdout from wget.
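One small prerequisite: the ~/mirror-logs directory has to exist before the first run, otherwise the >> redirections fail, so it needs to be created once:
mkdir -p ~/mirror-logs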
Adding a cronjob for running the script
As it’s an inexpensive operation (about 1 MB of download and 16-18s of runtime) and I want to keep the mirror as synced as possible, I scheduled the cronjob to run twice daily, at 1am and 1pm UTC. The entry added through crontab -e was as follows:
0 1,13 * * * ~/mirror-script.sh
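Listing the crontab afterwards confirms the entry was registered (the script also needs to be executable, as done earlier):
crontab -l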
Nginx configurations
Added an nginx configuration for blog.mirror.sahilister.in with root pointing towards /var/www/html/blog.sahilister.in (the directory where the mirroring happens; I didn’t add renaming in wget). Every time the script runs, it overwrites this directory with an updated version of the blog.
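For reference, a minimal sketch of what such a server block could look like; the listen, index, and try_files details here are assumptions rather than the exact configuration in use:
server {
    listen 80;
    server_name blog.mirror.sahilister.in;

    # wget mirrors the site into this host-named directory
    root /var/www/html/blog.sahilister.in;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }
}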
Final thoughts
That completes the setup. The mirror is now live at blog.mirror.sahilister.in. The git repository for the script can be found here. The code on this page and in the git repository is licensed under the Apache License 2.0.
1. “How Can I Download an Entire Website?” Super User.