#!/bin/bash
#fix wget mirrored files to contain no forbidden chars in names (?, &)
#and to have the right extension for static files (i.e. html and css)
#David Buchmann, 31.1.2003
#Changelog:
# 22.11.2003: Simplified regexp for file rename, added comment on rename
# 29.11.2003: Corrected bug: no longer hangs if there are no files in the current directory, but gives a warning
#
#Please send comments to budda@budda.ch
#
#Important note on rename:
# fixwget relies on a perl rename script. rename must have this syntax:
#   rename [ -v ] perlexpr [ files ]
#
# I use debian woody, where this is the case. But older SuSE and possibly other distros have a different rename syntax.
# If your rename does not work with this script, you can download the perl rename here:
#   http://www.budda.ch/code/download/rename.pl.txt
# and put it into your path. Maybe you want to give it a specific name, e.g. regexp-rename, to make your normal
# rename command work as usual. In this case you will have to edit this script and replace all calls to rename
# with your new name.
#
#Usage:
#first mirror your page with wget -r -k -E http://url (see http://www.budda.ch/code/ for tips)
#then call fixwget.sh in the root dir where the page downloaded by wget resides.
#make sure that the dir was clean before wget was run and make sure you only run fixwget.sh once.
#(this is because rename does not want to rename if the target file exists)
#
#I have stylesheets with a .php extension to create different css depending on the browser type.
#This script can rename the stylesheet with the wrong extension to .css
#Jump to the end of this file to change this if you want to use it.
#
#I am sure there are still bugs in this script. If something goes wrong, try to figure out where and tell me.
#Under no circumstances may I be held liable for damage done by this script: Use it at your own risk.
#
#Known Issues:
# -no bugs known
# -quite inefficient, be careful with large sites...
# -rename of debian/woody has a stupid bug: when using backreferences, it warns that \1 should be written as $1.
#  if you use $1 however, this variable is empty.
#
#--------------------------
#Implementation:
#
#We always replace in files using find -type f to find all files and use
# -exec to execute a perl regexp replace which replaces all href=... occurrences
# (except href="protocol:... i.e. absolute links)
#
#Then we rename files/dirs accordingly.
#
#For replacing in files, it's important to match just the string inside the href argument up to the anchor #.
#This means matching anything different from " and #. In perl regexp, "not some char" is [^char]
#Note that sometimes we escape chars from perl regexp, but sometimes also from the shell.
#(minimize shell escapes by using single quotes).
#first of all, we fix eventually existing invalid