FtpLocate - make your own FTP Search Engine
FtpLocate is an FTP search engine written in Perl.
It has the following features:
It supports indexing of multiple FTP servers. The username, password and
initial index directory can be defined for each ftp server.
It is very fast! FtpLocate uses glimpse as the indexer. One query on 4
million records takes less than 3 seconds on average!
It is very easy to install.
It provides the user with two types of search:
search by name:
The results are grouped by FTP server. The server nearest to the client
is displayed first (it is chosen by domain name). Besides the file
name, file size and file date, the file description is also provided.
search by description:
Users can find the files they want without knowing the filename. For
example, searching with the keyword "windows;ftp;server"
returns all ftp server programs available for the Windows platform.
Files with the same description are grouped together.
It has both a web and a text version. The web version consists of CGI programs.
The text version is a simple web client that communicates with the CGI programs
through the http protocol. It is handy when no browser is available.
It can collect the filelist from three sources:
The filelist is collected by sending the 'ls -lR' command directly to the
remote ftp server.
The filelist is collected by parsing the ls-lR file on the
remote ftp server. Supported formats are Z, gz, zip and plain text.
a URL like http://other.ftplocate/cgi-bin/ftplocate/flserv.pl
The filelist is collected by requesting it from the filelist database
of another FtpLocate server. This is useful if the ftp server is far away
from your FtpLocate server.
ps: The transfer of filelists between FtpLocate
servers uses the http protocol and supports proxies.
Just set the environment
variable 'http_proxy' to 'http://your_proxy_server:3128/'.
It generates summaries for all indexed FTP servers, such as directory
count, file count and total file size.
It generates the maplist of FtpLocate servers on the Internet. Each FtpLocate
server will register itself to the master server and get the most up to
date server list from the master server.
It caches the results of user queries to speed up the response for repeated
queries.
It generates a hot list of user queries. To increase the cache hit ratio,
a training program is provided that rebuilds the results of these queries
in the cache after the database is re-indexed.
It generates the history list of user queries.
The output is separated into pages (100 records each), so users don't have
to wait for a large result transfer to complete.
It is designed to minimize unavailability. The search engine is
unavailable only during the indexing stage.
ps: The time needed for indexing is short
compared to data collecting, so the search engine serves users
most of the time.
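
The description search listed among the features takes keywords joined by semicolons, as in "windows;ftp;server". The following is a minimal sketch of that query style in Python, assuming the semicolon means that all keywords must match. FtpLocate itself hands queries to glimpse; the function names and sample data here are hypothetical and plain substring matching stands in for the real index lookup.

```python
from collections import defaultdict


def parse_query(query):
    """Split a 'windows;ftp;server' style query into lowercase keywords."""
    return [kw.strip().lower() for kw in query.split(";") if kw.strip()]


def matches(description, keywords):
    """True if every keyword occurs in the description (assumed AND semantics)."""
    text = description.lower()
    return all(kw in text for kw in keywords)


def search_grouped(descriptions, query):
    """Return {description: [file names]} for all matching files.

    Files that share a description end up in the same group, mirroring
    the grouping feature described in the text.
    """
    keywords = parse_query(query)
    groups = defaultdict(list)
    for name, desc in descriptions.items():
        if matches(desc, keywords):
            groups[desc].append(name)
    return dict(groups)


# Hypothetical sample data, not taken from a real FtpLocate database.
SAMPLE = {
    "wftpd32.zip": "FTP server for Windows",
    "wftpd16.zip": "FTP server for Windows",
    "ncftp.tar.gz": "FTP client for Unix",
}

if __name__ == "__main__":
    for desc, names in search_grouped(SAMPLE, "windows;ftp;server").items():
        print(desc, names)
```
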
We have indexed 27 FTP sites with a Celeron 450/256MB RAM machine.
924322 directories and 3823568 files were found, with a total size of 1064GB.
Collecting the file lists takes about 5 hours; their total size is 410MB.
Indexing the file lists takes 30 minutes; the file list index size is 18MB.
After filelist indexing is done, we use the index to get the list of files
containing description information. The description parser currently recognizes
Linux lsm, FreeBSD package index, Simtel 00index and RFC index. If a file
is unrecognized, the description parser will try a wild guess. :)
It takes about 2 hours to get the description files, the total size
Parsing and indexing of descriptions takes 5 minutes, index size is
Most queries in this example finish within 3 seconds.
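
To put the example figures in proportion, here is a rough back-of-the-envelope calculation in Python. It only uses the numbers quoted above. It assumes that GB and MB are binary units and that the file list holds roughly one entry per directory and per file; neither assumption is stated in the original figures.

```python
# Figures quoted in the example above.
DIRS = 924_322
FILES = 3_823_568
TOTAL_SIZE_GB = 1064
FILELIST_MB = 410
INDEX_MB = 18

# Assumption: GB and MB are binary units (2**30 and 2**20 bytes).
GB = 2**30
MB = 2**20

# Average size of an indexed file.
avg_file_bytes = TOTAL_SIZE_GB * GB / FILES

# Assumption: one file list entry per directory and per file.
entries = DIRS + FILES
bytes_per_entry = FILELIST_MB * MB / entries

# Size of the glimpse index relative to the raw file list.
index_ratio = INDEX_MB / FILELIST_MB

print(f"average file size:      {avg_file_bytes / 1024:.0f} KB")
print(f"file list bytes/entry:  {bytes_per_entry:.0f}")
print(f"index size / list size: {index_ratio:.1%}")
```
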
ps: An example ftplocate is available at
ps: FtpLocate was developed on FreeBSD 3.1 Release, Perl 5.00502, Apache 1.3.4
and Glimpse 4.1
Perl 5.005 or above
Apache or any web server able to execute CGI programs
Glimpse 4.1 (a great indexing tool by cs.arizona.edu)
the auto install program
Chinese help file
English help file
data collecting and indexing programs
the most important file; specifies the ftp servers to be indexed
defines most global variables
string definition file for Chinese
string definition file for English
functions used in various ftplocate programs
ftp protocol related routines from Mirror 2.9 by Lee McLoughlin
search CGI programs
a shell script to do data collecting and indexing, used in cron table
collect file list from ftp servers
index the collected file list
collect the description files from ftp servers
parse the description files and index them
program used by flindex.pl when calling the glimpse indexer
program used to train the search engine
misc CGI programs
FtpLocate filename search engine
FtpLocate description search engine
text based client
list summaries of indexed ftp servers
list other FtpLocate servers
list the hottest queries
list the query history
log files (created by install program)
text based client
data directories (created by install program)
FtpLocate system log file, log the data collecting and indexing history
FtpLocate user query log file
FtpLocate server list map
used to store filelists of different ftp servers
used to store description files
used to store results of user queries (cleared each time the data is
re-indexed)
untar ftplocate-2.xx.tar.gz, then change to the extracted directory 'ftplocate-2.xx'
execute './install.pl'. The install program will check the requirements
and determine most settings for you.
edit the file 'config.site' to specify the ftp sites to be indexed
by the FtpLocate server
execute indexer.sh to do the data collecting and indexing
use your browser to test your FTP search engine...:)
If you have any problems, please
check that your CGI system is working
check the disk space for $TMPDIR and $CACHEDIR defined in 'config'
check the permissions of $USERLOG, $TMPDIR and $CACHEDIR; they must be writable by
your CGI user
check the paths of the external programs defined in $CMD_xxx
check log.system to see what happened
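
The disk space and permission checks above can be scripted. The following Python sketch is one possible way to do it and is not part of FtpLocate; the function name and the `min_free_bytes` threshold are hypothetical. It must run as the same user as your CGI programs, otherwise the writability result says nothing about the CGI user.

```python
import os
import shutil
import tempfile


def check_dir(path, min_free_bytes=0):
    """Return a list of problems found for one directory (empty if none).

    Pass the directories configured as $TMPDIR and $CACHEDIR in 'config';
    the threshold min_free_bytes is an illustrative parameter.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path}: not a directory")
        return problems
    # Creating files in a directory needs write and search (execute) permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path}: not writable by the current user")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path}: only {free} bytes free")
    return problems


if __name__ == "__main__":
    # Example run against the system temp directory; replace it with your
    # own $TMPDIR and $CACHEDIR values.
    for problem in check_dir(tempfile.gettempdir(), min_free_bytes=100 * 2**20):
        print(problem)
```
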
Support ftp server in Microsoft IIS (thanks to
Support ls-lr.bz2 filelist (thanks to
Put ftp.pl and lchat.pl into this package, because newer versions of Perl no longer support ftp.pl
Ftplocate official site is changed to
Change filelist source keyword "file" to "file://..." and support Z, zip formats
Fix a bug in dcollect.pl which caused ftpget to fail
Fix maplist related function
Fix domainname related function
More modular design
An install program is provided to ease the installation.
The username, password and initial index directory are now assignable
Filelist source now can be direct, file, or other ftplocate server
Maplist function is added to maintain FtpLocate server list
The text server is now a CGI program. The text client now acts like
a web client.
Fix the ftp timeout problem in filelist collecting
Fix the DNS timeout problem in user query
Support description search
Display descriptions when listing files
Use glimpse for indexing
make the description parser recognize more formats
search over multiple FtpLocate servers at the same time
better server choice algorithm
Any help or suggestions are welcome.
Distributed System Lab.