DeveloperDen: Small utility to extract a website based on a pattern of URLs

I needed to download the web pages matching a pattern of URLs from one of the online GRE sites. I tried my hand at a small Java program that saves the contents locally so that I can browse them offline.

Performance statistics:
It took approximately 5 seconds to download each file of about 80 KB of HTML text.
In total, it took about 25 minutes to download 300 HTML pages of around 80 KB each.
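As a quick sanity check on these figures, 300 pages at roughly 5 seconds each comes to 1500 seconds, or 25 minutes. The sketch below shows one way the per-page time could be measured with a monotonic clock and the total estimated from it. The class and method names (`ElapsedTimer`, `timeSeconds`, `estimateTotalMinutes`) are hypothetical and are not part of the original program.

```java
public class ElapsedTimer {

    // Hypothetical helper: runs a task and returns the elapsed wall time in seconds.
    // System.nanoTime() is monotonic, so the result does not wrap around at minute boundaries.
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000_000.0;
    }

    // Hypothetical helper: estimated total minutes for a batch of pages.
    static double estimateTotalMinutes(int pages, double secondsPerPage) {
        return pages * secondsPerPage / 60.0;
    }

    public static void main(String[] args) {
        // Stand-in for a page download: sleep briefly instead of touching the network.
        double secs = timeSeconds(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Elapsed seconds: " + secs);

        // 300 pages at 5 seconds each
        System.out.println("Estimated minutes: " + estimateTotalMinutes(300, 5.0)); // prints 25.0
    }
}
```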

=============================================================

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

// Wrapper class so the listing compiles; the class name is chosen here for illustration
public class SiteExtractor {

    public static void main(String[] args) {
        try {
            // This file collects the performance metric for each page extracted
            File metFile = new File("C:\\phoenix school\\Exams\\GRE Wordlist\\english-test\\Metrics.txt");
            metFile.createNewFile();
            FileWriter metFileWr = new FileWriter(metFile);

            // Repeat for each page in the site
            for (int i = 1; i <= 300; i++) {
                try {
                    // Note the start time in milliseconds since the epoch
                    long start = System.currentTimeMillis();

                    // The fileID varies for the 300 pages and is the dynamic part of the URL pattern
                    String fileID = Integer.toString(1000 + i).substring(1);
                    // Substitute the real URL pattern (built from fileID) before running
                    URL url = new URL("URL is not provided here for confidentiality");
                    System.out.println("Downloading URL : " + url);
                    File f = new File("C:\\Wordlist\\" + fileID + ".htm");
                    FileWriter fw = new FileWriter(f);

                    // Read all the text returned by the server
                    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
                    String str;
                    while ((str = in.readLine()) != null) {
                        fw.write(str);
                        // readLine() strips the line terminator, so write it back
                        fw.write(System.lineSeparator());
                    }
                    in.close();
                    fw.flush();
                    fw.close();

                    // Calculate the elapsed time in seconds (floating-point division)
                    long end = System.currentTimeMillis();
                    double elapsedSecs = (end - start) / 1000.0;
                    System.out.println("Start : " + start + " end : " + end);
                    metFileWr.write(url + " -> " + elapsedSecs + "\n\n");
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                } // End of try catch block
            } // End of for
            metFileWr.flush();
            metFileWr.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
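The two core steps of the program, building the zero-padded fileID and copying a page to disk, could also be written as below. This is only a sketch under stated assumptions: the class and method names (`PageFetcher`, `pageId`, `savePage`) are hypothetical, and because the real URL is not given, the demo copies a local temporary file through a `file:` URL instead of contacting the site. Copying raw bytes rather than lines keeps the page's original line breaks and encoding intact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PageFetcher {

    // Zero-pads the page number to three digits, e.g. 7 -> "007".
    // Gives the same result as Integer.toString(1000 + i).substring(1) for 1..999.
    static String pageId(int i) {
        return String.format("%03d", i);
    }

    // Copies the raw bytes behind the URL to the target file,
    // closing the stream even if the copy fails.
    static void savePage(URL url, Path target) throws IOException {
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Local stand-in for a remote page (assumption: the real site URL is not available)
        Path src = Files.createTempFile("page", ".htm");
        Files.write(src, "<html>\n<body>demo</body>\n</html>\n".getBytes(StandardCharsets.UTF_8));

        Path dst = Files.createTempFile("copy", ".htm");
        savePage(src.toUri().toURL(), dst);

        System.out.println("fileID for page 7: " + pageId(7)); // prints 007
        System.out.println("Bytes copied: " + Files.size(dst));
    }
}
```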

Thursday, September 20, 2007
