The key component of this interesting project is the web crawler, which is used mostly for personal job searching. There are many similar applications that use crawlers, but few are of good quality. The project took on the difficult problem of adapting a third-party crawler and enhanced it into a "job search engine". It also employed Xerces2 to overcome the failures that crawlers with weak parsers often suffer. The project team ensured the application's extensibility and scalability by implementing it in Java. These sound decisions make a successful application predictable. The presentation flowed smoothly and clearly illustrated how the crawler and parser function. Thanks to a flow chart, the program logic was easy to follow. Nevertheless, a few things are worth considering:
After the presentation I was really excited by your demonstration of both the web system and the meta-crawler. I will be graduating in one semester, and an online resource like JobBot would be invaluable to me in finding a job! The ability to go to one site and find a comprehensive job listing is a major time saver, and I think it would be far superior to something like the newspaper classifieds or even a single company's job postings. Also, I appreciated the fact that you talked about some of the implementation issues you have had up to this point. Detecting strings such as 'C++' and converting them into 'Cplus' so as not to cause problems with your system is a definite real-world implementation issue. Finally, I must congratulate the developer of the meta-crawler Java application. The use of an XML parser to parse XHTML and XML documents was most impressive.
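To illustrate the kind of sanitization described above, here is a minimal sketch in Java; the class, the method, and any mapping beyond 'C++' are my own assumptions, not the team's actual code.

```java
import java.util.Map;

/** Hypothetical sketch of the keyword sanitization mentioned above. */
public class KeywordSanitizer {

    // Map troublesome search terms to safe internal tokens (assumed examples).
    private static final Map<String, String> SPECIAL_TERMS = Map.of(
            "C++", "Cplus",
            "C#", "Csharp");

    /** Replace a special term with its safe token before it reaches the query layer. */
    public static String sanitize(String keyword) {
        String trimmed = keyword.trim();
        return SPECIAL_TERMS.getOrDefault(trimmed, trimmed);
    }

    public static void main(String[] args) {
        System.out.println(sanitize("C++")); // prints "Cplus"
    }
}
```

The point is simply that the substitution happens once, before the keyword touches the parser or the database query.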
Obviously, the main problem with this web project is the legality of taking job postings from other websites. You did state this problem during your presentation, but it is a major one that would probably prevent you from deploying your website publicly. I am curious to know what type of agreement monster.com has with the people posting job listings. It would have been nice if you had done more research into the legal side of your system, since it is a rather big issue. With regard to your current system, I have one suggestion. In your demonstration of the JobBot search engine I do not remember seeing a field for choosing a specific city/location to narrow down the results returned from your centralized database. I think this would be a valuable and easy-to-implement feature given your current architecture and website layout. Chances are that I would only be interested in looking at jobs pertaining to 'Java programming' in a certain city. The more robust the search engine, the more efficient and useful the JobBot website will be to its users.
This team is working on a very popular kind of web site nowadays. They are developing a job-searching web site using a variety of algorithms and technologies. In their project, they used many different technologies, such as Java, JSP, MySQL, and Lucene. They capture all jobs that satisfy the client's requirements by searching different web sites, companies' databases, and the words in each job description. They integrate both employee and employer functions in the same application. In the demo, they showed what each technique is and how it works, such as the MetaCrawler and the parser. They also explained why they chose each technique and how powerful each is.
They could work on some security issues, such as how to keep clients' information secure. They also identified some problems in the search algorithm that they could address. How to obtain all companies' URLs is a problem they need to solve; a change in some companies' URLs might corrupt their search results. When searching job descriptions for keywords, they could apply natural language processing to the search algorithm. That way, jobs that merely contain the keyword but are in fact unrelated would not be included in the results, saving space and improving efficiency. Finally, they could work on managing the database for growth and on ranking the results in a specific order.
The idea of a "centralized job search service" is ambitious and has practical significance. The metacrawler is an interesting information retrieval mechanism. The presentation showed a clear functional specification. The overall architecture is clear and makes sense. The website interface is simple and clear, and the explanation during the demo was also clear. The metacrawler architecture diagram was well presented and explained. The team did a great job with the third-party parsers. The most appealing part of the presentation was that the team introduced the problems encountered in full-text searching and their solutions, acknowledged some unsolved problems, and asked the class for suggestions. The whole presentation was concise and the speakers were relaxed.
However, the metacrawler demo should have been better prepared and should have provided a larger and more understandable interface. Although this application is based on an ambitious vision, it is doubtful that it can exhaustively search all existing job postings such that a job seeker would visit only this website for a job search. A more practical idea would be for the application to function as an indicator system that directs job seekers to appropriate websites, enabling a more effective and time-saving search. It would then act as a pre-search aid to job seeking rather than attempting to cover all the functions of all job search websites.
I'm currently looking for a job, and the JobBot project definitely suits my needs. With so many people looking for jobs right now, my sense is that the project will gain wide user acceptance. I believe this system really saves the user's time by aggregating information from websites into one place so that the user doesn't waste time navigating. Another thing I like about the system is its flexibility in expanding to more web sources. Creating a new parser and a next-page URL creator is usually not a difficult task, not to mention that the core functionality of the system does not need to be changed. In particular, the project has the potential to incorporate information gathered from other job-seeking websites (for example, monster.ca). As a result, the information available in the system will be at least as extensive as, if not more extensive than, that of those websites. Consequently, this will attract more users, since more information is available in a single place.
However, a potential problem with a static parser is that the format of the company's web page might change and the parser will then be useless. As a result, as the system grows, the maintenance task of updating the parsers will increase. Also, how would you distinguish between the case where no jobs are currently available and the case where the company's web page format has changed? In both cases, the search will not return any jobs. But in the case where the company's website has changed, action will have to be taken to update the parser. I believe that currently someone has to manually visit the web page and see which case applies. As for searching, it would be nice if the search results could be sorted by some criterion, say company name or date retrieved. This would let the user view the results more easily, since they may have a preference to look at specific jobs first. Also, it would be nice if there were information about where the system has searched, so that users do not have to look at those places themselves. For example, if the system has searched the Crystal Decisions website but not Microsoft, then the user would know that he or she does not have to go to the Crystal Decisions website and should spend some time on the Microsoft website instead. Another issue regarding searching is the possibility of retrieving redundant jobs. For example, the same job could be retrieved from monster.ca and also directly from the company itself. As a result, using the URL as an identifier, as your group suggested, will not work in this case to distinguish duplicate jobs. Finally, a useful addition to the system would be tips and resources on resumes, cover letters, and interviews. These topics are very relevant to the users and would make the system a bit more attractive.
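Since the URL alone cannot identify duplicates across sites, one alternative, sketched below with assumed field names rather than the group's actual schema, is to key each posting on a normalized combination of title, company, and location:

```java
import java.util.Locale;
import java.util.Objects;

/**
 * Hypothetical duplicate-detection key that does not rely on the posting URL,
 * since the same job may appear on monster.ca and on the company's own site.
 * The fields are assumptions, not the project's actual schema.
 */
public class JobKey {
    private final String title;
    private final String company;
    private final String location;

    public JobKey(String title, String company, String location) {
        this.title = normalize(title);
        this.company = normalize(company);
        this.location = normalize(location);
    }

    /** Lower-case and collapse whitespace so cosmetic differences do not defeat matching. */
    private static String normalize(String s) {
        return s == null ? "" : s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof JobKey)) return false;
        JobKey other = (JobKey) o;
        return title.equals(other.title)
                && company.equals(other.company)
                && location.equals(other.location);
    }

    @Override
    public int hashCode() {
        return Objects.hash(title, company, location);
    }
}
```

Two postings with the same key could then be treated as one job regardless of which site they were crawled from.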
JobBot is a robot that retrieves job postings from many different sites and presents them to the user all at once. This is a unique idea, especially given how many job posting websites exist on the internet. In the demo, the system seemed to run fast, which is good. Even though the parser uses keyword search, it seemed to return no false pages. The ability to save postings in a user account is a very useful feature. Another useful feature is the ability to filter out old postings. The presenters didn't demo it, but it would be nice if the system could mark a viewed posting with a check mark icon beside the document.
One of the main concerns I have with the system is the accuracy of search results. I understand that the system uses keyword search, which in my mind is not as accurate as other methods such as natural language processing. The second issue has to do with copyrighted material. Many websites have fine print at the bottom of the main page prohibiting the reproduction of their content or the linking of any of their pages except the homepage. Another issue concerns login. Many job posting websites require the user to create a user account. Is JobBot able to get past the login process to retrieve the postings?
The JobBot web information system's cornerstone is the metacrawler. It is responsible for gathering data for the database. The task of automatically retrieving all job postings and their descriptions from a target website is impressive. The use of multi-threading was appropriate and increases the efficiency of the system. The metacrawler was also built with growth in mind. Although it is not a trivial task, only two components, a parser and a next-page URL creator, need to be added for the metacrawler to search another target website. The metacrawler provides a service that would otherwise require tedious, repetitive work by an administrator.
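As a rough illustration of that extensibility, the sketch below shows what such an extension point and a multi-threaded driver might look like; the interface and method names are my own assumptions, not the team's actual design.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Assumed extension point: each target site supplies these two pieces. */
interface TargetSite {
    /** Parse one listing page into job records. */
    List<String> parseJobPostings(String pageHtml);

    /** Build the URL of the next listing page, or return null when there is none. */
    String nextPageUrl(String currentPageUrl, String pageHtml);
}

/** Hypothetical driver showing how target sites could be crawled on separate threads. */
public class MetaCrawlerSketch {
    public static void crawlAll(List<TargetSite> sites) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sites.size()));
        for (TargetSite site : sites) {
            pool.submit(() -> {
                // Fetch pages, call site.parseJobPostings(...) and site.nextPageUrl(...),
                // and store the results in the centralized database (omitted here).
            });
        }
        pool.shutdown();
    }
}
```

Adding a new target website then means implementing TargetSite once; the driver and the rest of the system stay unchanged.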
Legal issues must be considered when determining a target website for the metacrawler. Two example websites used by the group were Crystal Decisions and Monster.ca. Although it may be acceptable to relay job postings and descriptions from Crystal Decisions, it may not be acceptable for a competitor website such as Monster.ca. Monster.ca provides a similar service as JobBot and may be disgruntled to learn that a rival website is stealing job postings using a metacrawler. Thus, permission should be obtained by JobBot before enlisting a website as a target.
As with any system, JobBot must be maintained. Since a parser is no longer valid once the owners of a target website change its layout, the metacrawler should have a mechanism to indicate to the maintenance staff when a target website has been modified. Another solution would be to build a more generic parser. Currently, the maintenance staff is required to determine the target sites manually. It may be appropriate to create another metacrawler to automatically locate job posting sites.
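A minimal sketch of one such signalling mechanism is shown below, under the assumption that a page yielding zero parsed jobs is ambiguous on its own; the class, the fingerprinting heuristic, and the method names are illustrative, not taken from the project.

```java
/**
 * Illustrative sketch: flag a target site for maintenance when its page structure
 * appears to have changed, instead of silently returning no jobs.
 */
public class FormatChangeDetector {

    /** Crude structural fingerprint: the number of table rows in the listing page. */
    public static int fingerprint(String pageHtml) {
        return pageHtml.split("(?i)<tr", -1).length - 1;
    }

    /**
     * If parsing yields no jobs but the page structure differs from the structure
     * recorded when the parser last worked, assume the layout changed and notify
     * the maintenance staff rather than reporting "no jobs available".
     */
    public static boolean layoutProbablyChanged(int jobsParsed,
                                                int currentFingerprint,
                                                int lastKnownGoodFingerprint) {
        return jobsParsed == 0 && currentFingerprint != lastKnownGoodFingerprint;
    }
}
```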
Overall, the project's design, implementation, and security issues seem to have been thought out very well. The fact that it will save job seekers valuable time, by letting them go to a single location to access many job postings instead of visiting each employer's website, is a good idea. The user's ability to specify things such as skills, salary, title, and education, together with features such as user sign-up, login/logout, updating personal information, searching job postings, posting a resume, and posting jobs, mirrors the features of other common job searching services such as Monster.com, so it would be familiar to use. One advantage over Monster.com and other services is that the crawler automatically goes to the employer's site to search for new postings instead of having the employer go to the job service site to post new jobs. I also liked the way they provided reasons and benefits for using certain technologies in their implementation over others; for example, why they used PHP instead of CGI/Perl. They also presented different alternatives, which gives different perspectives. Multithreading was a good feature to make the searching more efficient. Their discussion about protecting users, transmission security, and database access control indicates that they have not ignored this issue.
The main problem, which was pointed out by this group, is the potentially time-consuming job of customizing the parsing and formatting for each different company or target website in order to get the job information they need. On the other hand, employers don't have to waste time posting, because this is done automatically. This leads to a tradeoff over who spends more time doing the work: the employer posting a job or the JobBot crew. Personally, I thought the formatting and parsing process looked pretty complicated already; for example, two or three parsers were needed. Doing this for many different sites could be tough. An upside is that this needs to be done only once per company website and can then be reused during the periodic checkups. The issues of finding out when job postings expire, handling redundant job postings, and creating a more generic parser are important issues that were discussed in the presentation, so it is good that they thought of these things. Also, if I remember correctly, the link titles for the postings found were pretty long: they listed the job title, employer, date, and location, which may be too long and hard to read. There may be repetitive elements in those links. For example, if you have many links to Crystal Decisions job postings in Vancouver, BC, but for different positions, you may have many links with Crystal Decisions and Vancouver, BC in the link title. Maybe they could be listed in a tabular format with just the job title in the link, or something else that would make them simpler and therefore easier to read. This is just a minor problem, though.
The ability to search web pages automatically for jobs and bring the jobs together on one site is very useful. Implementing multi-threading is a great idea, since the searches can be separate from one another as long as they aren't searching the same page. The slide on how the metacrawler works was very helpful. Also, it was a good idea to allow users to search for a job based on certain aspects of the job.
For the login, is security a big issue? I noticed that HTTPS was not used, but it might be helpful. Is it necessary for the user to log in to search for jobs? More information is needed on submitting a job to the website. Also, how can the website be sure that the person submitting a job actually works at that company? For the crawler, is it at all possible to let the crawler search for its own links instead of having to add links for every new job? The problem of jobs expiring was considered, but what if the job information has been updated, for example a change in the description? Will the crawler replace the previous version, or will it add a second, updated version and still keep the old one? Finally, I don't recall seeing this, but perhaps a browsable directory of job listings would be possible, instead of just searching the jobs.
One great feature was their web search system called the 'Metacrawler', a hybrid of a web crawler, which only retrieves static web content, and a metasearch engine, which does not crawl the web to compile its own searchable database. Another appealing feature is the provision for users to post resumes and for employers to post available job positions; this empowers both types of users. Finally, the idea of a user being able to store their job search results (no matter how many) for later reference anytime they want will be greatly welcomed by any user; it is a very convenient feature.
Storing search results is a great feature, but considering it is a world wide web system, there is no telling how many users will use it, possibly tens of millions if not more. In that case, if search results (no matter how many) can be stored for as long as one chooses, huge amounts of storage space will be tied up unnecessarily, since jobs are only available for a limited time anyway. One way of dealing with this wasted space would be to track each job's expiry date and, on that date, send the user a notice asking them to delete the information. This can be done by having the system generate a letter from a template and send it to the user automatically. Another way of dealing with the wastage is to set up a policy informing all users that search results can only be stored for a certain period of time, say one month, and then having the system delete each search result one month after it is stored.
JobBot, a job search web service, proved its efficiency, time savings, and usability through the presentation. Instead of spending hours surfing all over the web to find a job, a job seeker now only needs to visit JobBot and it will give him what he wants. Since JobBot stores condensed job information on its server, job seekers can retrieve the information instantly. Indeed, JobBot saves a lot of time and effort for job seekers. Moreover, JobBot lets job seekers post their resumes on the server so that employers have a chance to look for employees actively. Another important point is that JobBot is extendable simply by adding a parser for a target web page and a page URL creator, after which the system will automatically search for new jobs. For efficiency, JobBot is implemented with a multithreaded approach. Even though the project is complicated, the presenting team did a good job of describing the applications, methods, and parsers used in the project. With a clear diagram of how the system works, listeners could understand the system immediately at first glance. The team concluded the presentation with thoughtful questions and solutions.
Even though JobBot is efficient and usable, it still raises some questions from users. First of all, will the JobBot server still be fast enough when it handles all of the job posting websites on the Internet? In the demo, the system used only one thread, targeted only one website, and still took more than a minute. To solve this problem, a multi-server system should be set up, with each server probably handling a number of job posting websites. Second, security is a big issue, but it was not a focus of the presentation. Although job seekers can opt not to post their resumes on the server, this eliminates their chance of getting a job when employers search through posted resumes. In this case, JobBot should encode the private information on the resumes and act as an intermediary, letting an employer and a job seeker communicate anonymously at first, with the job seeker revealing his or her information only when willing. Third, how can a job seeker search for a job that requires no experience? JobBot should provide a quoted-phrase search, as Google does, to group keywords together (e.g. "no experience"). Fourth, JobBot cannot match posted jobs with posted resumes automatically. It would be great if JobBot could automatically match the keywords from posted jobs against posted resumes and then email the owners of the resumes about the matching jobs. Fifth, the presentation focused too much on technical aspects; it should have included more about the business case for the project, the differences between JobBot and other job search services, and future plans.
The selling point of this web site is that it allows users to save time when searching for job postings. By visiting the home pages of companies in the relevant industry on a regular basis to search for their current job openings, and then storing the results of such searches in the web site's database for future queries, the site saves its users the trouble of visiting dozens of company web sites. It also saves the user the time spent waiting for a search to return, since the information is stored in a single database, and it allows the user to compare all job postings from different companies simultaneously.
While this web site provides a list of available job postings based on criteria chosen by the user, these postings may not be the most up-to-date, since the search is triggered by a timer and the results stored in the database may not match the current job postings displayed on the companies' web sites. A more realistic approach would be to conduct a real-time search when a user request arrives. However, this means that the response time for such a request will be much longer, as the metacrawler requires some time to visit each company and retrieve the relevant web pages, as observed during the demonstration. Another issue is that the URLs of the target websites are on a static list, which may limit the number of web sites visited during the search unless the web site administrator frequently updates this list by manually inserting the URLs of new company websites. Should this be the case, functionality that allows non-technical staff to add such data to the database would make the list expandable.
I'm not really sure where to start, because I found the whole concept of the centralised job advertisement site intriguing. I am somewhat biased because I tend to favour the paradigm of centralising information, but even that aside, I was impressed with the work done, as I would not know how to do half of what was done. The presentation was very descriptive, and I was pleased to see the different technologies that were under consideration; I had never heard of Lucene before and was unaware of some of the options found in MySQL, such as the "IN BOOLEAN MODE" option. That brings me to the next point I liked: I hadn't noticed much description of the problems encountered and how they were solved in the first day's presentations, but the JobBot group nicely went over their issues and solutions, like "IN BOOLEAN MODE."
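For readers who, like me, had not seen this feature before: the option referred to is MySQL's MATCH ... AGAINST full-text search with the IN BOOLEAN MODE modifier. A hypothetical query against an assumed jobs(title, description, url) table might look like the sketch below; the schema, connection details, and query are my guesses, not the group's actual code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Hypothetical boolean-mode full-text query; table and column names are assumptions. */
public class BooleanModeSearch {
    public static void main(String[] args) throws Exception {
        // Requires a FULLTEXT index on (title, description).
        String sql = "SELECT title, url FROM jobs "
                   + "WHERE MATCH(title, description) AGAINST(? IN BOOLEAN MODE)";
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/jobbot", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            // '+java +vancouver -intern': must contain java and vancouver, must not contain intern.
            ps.setString(1, "+java +vancouver -intern");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("title") + " -> " + rs.getString("url"));
                }
            }
        }
    }
}
```

Boolean mode also supports quoted phrases, which is relevant to the phrase-search suggestion raised elsewhere in these comments.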
I am concerned, however, that some companies have descriptions of how they want applications to be formatted, in many cases in a completely different place from the job descriptions! So it would be very nice if another function were added to retrieve these and display them with the job postings, which could most likely be done very similarly to the job retrieval. This could be made somewhat easier by allowing companies to sign up via the web, where they could specify which pages to parse. One other thing I noticed was that there was no ranking of jobs, which could easily, albeit crudely, be done with limiting fields such as location or salary range.
The JobBot project is quite amazing: it connects to other online job posting applications or websites (such as monster.com and crystaldecsion.com) and extracts the useful and relevant information from them. This is a very clever and efficient design that uses web resources to provide an organized and centralized job-searching service to users. The application not only saves much time and effort for people looking for a job, but also reduces the cost of hosting and maintaining the job-searching service, since it does not have to store all the jobs in its own database but instead extracts the information from the web. I benefited a lot from their presentation on current searching technology, especially the "MetaCrawler". Most impressive was their impromptu MetaCrawler demo showing how it connects to other websites and returns job search results in a command-line interface. I think most people in the class found that searching technology innovative. Another remarkable thing is that they implemented and used HTML parsers to parse information across different web sites.
The most challenging part of maintaining this project seems to be handling the parsing differences among websites. In future development, a standard way of handling these parsing differences would definitely be a great improvement. Furthermore, they could provide functionality to sort the search results by date or by relevance to the search keywords. Also, as shown in the demo, only the employee side of job searching exists; they could extend the service to the employer side, for example for resume searching. In addition, jobs that users have saved in the database may become outdated and unavailable a few days or weeks after being saved. An automatic update checker is therefore needed to ensure that saved jobs are still valid; if a saved job is outdated or unavailable, a warning message should appear beside it prompting the user to delete it.
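A minimal sketch of such an automatic update checker is given below; the SavedJob record and the isStillPosted hook are assumptions made for illustration, not part of the project's code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical periodic checker that flags saved jobs whose postings appear to have expired. */
public class SavedJobChecker {

    /** Minimal stand-in for a saved job as stored in a user's account. */
    public record SavedJob(String jobId, String sourceUrl, Instant savedAt) {}

    /** Assumed hook: re-fetch the source page and check whether the posting is still there. */
    static boolean isStillPosted(SavedJob job) {
        // In a real system this would re-crawl job.sourceUrl() and re-run the parser.
        return true;
    }

    /** Warn about jobs that are older than maxAge or no longer posted at their source. */
    public static void flagStaleJobs(List<SavedJob> savedJobs, Duration maxAge) {
        Instant cutoff = Instant.now().minus(maxAge);
        for (SavedJob job : savedJobs) {
            if (job.savedAt().isBefore(cutoff) || !isStillPosted(job)) {
                System.out.println("Warn user: saved job " + job.jobId() + " may have expired.");
            }
        }
    }
}
```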
Generally speaking, I think the JobBot project team has done relatively deep research into web-based information searching and retrieval techniques. If the project team can fine-tune their work in a few aspects, this can be a really good project.
Good research on web search techniques: In this project, the group members compared web crawler and metasearch techniques and combined the two technologies to create metacrawling, a hybrid of web crawling and metasearch. The concept of metacrawling provides a powerful tool for searching web-based information, even though it is at an early stage.
Technical challenge of the HTML/XHTML parser: The project also did relatively deep research on how to parse the retrieved web pages. The team compared third-party parsers, such as the Xerces2 and NekoHTML parsers, and even provided their own parser, even though it is still imperfect. I think this is very good practice for our course project.
Clear data flow structure: The project also presents a clear system architecture of how the data flows, from how the metacrawler is triggered, through how the retrieved data are stored in the centralized database, to how a concrete job search result is obtained. This gives me a more concrete idea of how a web search engine works, based on what we have learned in this course.
A suggestion for improving search efficiency: In this project, the team uses full-text search over the text stored in the local database to produce the result of each job search. This approach makes it difficult to keep the system scalable as more and more data accumulates in the database. If the metacrawler could also retrieve metadata from the original web page, or create its own metadata for the retrieved information when none is available, and store this metadata in its database, this would greatly facilitate job searches and improve search efficiency.
Could there be more discussion comparing JSP vs. PHP? I have read the project design document, which said the team would adopt PHP scripts on the server side. In their presentation, however, they seem to have switched to Java/JSP to develop their dynamic web pages, with Tomcat as the container, but they did not give any reason for this. This would be a good chance for them to compare the two server-side technologies and gain some insight into the advantages and disadvantages of each.
Another suggestion for obtaining URLs: In this project, the metacrawler only retrieves data from a list of manually supplied URLs. I think this limits the project's practicality for future development. If the system provided a function that lets each company or job supplier submit an application form to sign up for the JobBot service, as many web search engines do, and then automatically added the new website's URL to its own crawling list, I am sure JobBot would be more powerful than it is today.
After reading the design document and watching the demo, I must say the project members really spent a lot of time on research; their document is clear and the demo was successful. The most significant achievement of this project is that it successfully connects to other web pages, retrieves specific data based on user input, then parses and stores the data, and finally displays it to the user in some format. The members have not only finished the high-level design but have also implemented basic features such as user login/logout and search by keyword.
The following are what I found most appealing:
However, I still have the following questions for discussion:
The technology of JobBot is exciting and intriguing. It has the potential to dramatically reduce the time and effort needed for a job search. Searching for a job can be stressful enough, and currently it is extraordinarily time consuming because you have to search so many sites. JobBot drastically reduces the effort by combining the search results from many of these sites, including all of the major ones, such as Monster. The interface was simple and easy to use, making this a program that any individual, technically oriented or not, can use.
Search results are returned without ranking or order. This can be confusing, and the result lists can be incredibly long. Ranking them by date or relevance would be extremely helpful and would increase the usability of the site. Another problem noted is that the returned results included expired job postings, which makes it possible to apply for jobs that have already been filled, wasting a significant amount of the user's time and energy. The other problem is that it only searches selected sites, not all job search sites. A published list of the sites it searches should be available so that users can separately check the sites not included in JobBot's search.
From the perspective of a soon-to-be job seeker, I find that JobBot offers a very appealing service. Firstly, it provides a centralized job database to all subscribed users. Users of the system can save a huge amount of the valuable time otherwise spent searching employers' job posting websites. Moreover, a searchable database provides a very efficient tool for job seekers to match their skills with available jobs. In addition, a crawler has the advantage of retrieving the most up-to-date job postings, as long as employers keep their websites up to date. Finally, I think the technology behind this project could easily be applied to other areas of interest, for instance, retrieval of the latest video game releases or restaurant menus.
Before the project proceeds to the final development stage, the project team has to overcome the following challenges. Firstly, many large enterprises such as Microsoft may use dynamic web pages to post their career opportunities. The inability to retrieve information from dynamic websites can be a serious limitation. In addition, how does JobBot realize that a posting has expired? As the database accumulates jobs, will the program have enough resources to check every site to verify that each posting is still active? Finally, from the in-class demo, it seems that the crawler retrieves job information using a list of websites, so someone has to manually generate this list before the crawler can be sent off to retrieve information. Perhaps future extensions of JobBot could automate this process by retrieving a list of websites from Google. Alternatively, the system could allow users to submit their preferred list of URLs to monitor job openings at companies they are interested in working for.
Commentary -- JobBot
A very well prepared and organized presentation. The implementation work and demo showed your effort on the project. The ideas behind JobBot, being a centralized job search engine and automatically collecting job postings from company websites, are very attractive. In particular, if JobBot can access and collect job postings from a large number of companies, the time-saving benefit will definitely convince me to use their service. Besides these special features, JobBot also provides the features, such as job-seeker and recruiter functions, found on typical job searching websites. I like the idea of the job-seeker account: I can save the job postings that interest me while searching the website for jobs, and at the end of the job searching process I can review those jobs, or review them the next time I log in. That again saves me time on the next visit. On the technical side, the group utilizes many kinds of technology, such as a crawler, parsers, MySQL, multithreading, Java, Tomcat, and Apache. Also, with the diagrams and demonstration, the explanation of the technical side of JobBot was very clear.
One question I have about JobBot concerns the method they mentioned for distinguishing a new job posting from a redundant one: comparing the URL of the detailed job description. The content at a URL can change over time, and in that case JobBot would miss collecting a new job posting. Another question concerns the job-seeker account. JobBot stores a saved job posting in an account as only a job ID. What happens if an old job posting has been deleted from the database, but the corresponding job ID is still stored in one of the job-seeker accounts? That job ID becomes meaningless. One suggestion is to periodically warn job seekers that some of the job postings in their account have expired and provide an option to delete them. My last question is how JobBot ensures the privacy of the job seekers' resumes. Once an employer/recruiter registers with JobBot, they can view all the resumes posted on JobBot. What are the eligibility requirements for gaining this kind of access to job seekers' resumes?