Intelligent Analytic Technology Division
Computational Intelligence Technology Center
Industrial Technology Research Institute
Capturing users’ information needs is essential for lowering the barriers to information access. This paper mines sequences of actions, called search scripts, from search query logs, which record large-scale users’ search experiences. Search scripts can be applied to guide users to satisfy their information needs, improve the search effectiveness of retrieval systems, recommend advertisements at suitable places, and so on. Information quality, query ambiguity, topic diversity, and document relevancy are four major challenging issues in search script mining. In this paper, we determine the relevance of URLs for a query, adopt the Open Directory Project (ODP) categories to disambiguate queries and URLs, explore various features and clustering algorithms for intent clustering, identify critical actions from each intent cluster to form a search script, generate a natural language description for each action, and summarize a topic for each search script. Experiments show that the complete-link hierarchical clustering algorithm with the features of query terms, relevant URLs, and disambiguated ODP categories performs the best. Applying the intent clusters created by the best model to intent boundary identification achieves an F-score of 0.6666. The intent clusters are then applied to generate search scripts. When only search scripts containing a single intent are considered correct, the accuracy of the best action identification algorithm is 0.4650. If search scripts containing a major intent are also counted, the accuracy increases to 0.7315.
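As an illustration of the intent clustering step, the following minimal sketch applies complete-link hierarchical clustering to toy session representations built from query terms, clicked URLs, and ODP-style category labels. The feature construction, the cut threshold, and the data are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: complete-link hierarchical clustering of search sessions, where
# each session is represented by a bag of query terms, clicked-URL hosts, and
# (hypothetical) ODP category labels.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

sessions = [
    "harry potter book amazon.com Arts/Literature",        # toy sessions; real input
    "harry potter movie imdb.com Arts/Movies",              # comes from a query log
    "python tutorial docs.python.org Computers/Programming",
]

# One feature space mixing query terms, relevant URLs, and ODP categories.
X = TfidfVectorizer(token_pattern=r"[^ ]+").fit_transform(sessions).toarray()

# Complete-link clustering with cosine distance; the cut threshold is a guess.
Z = linkage(X, method="complete", metric="cosine")
labels = fcluster(Z, t=0.8, criterion="distance")
print(labels)   # sessions sharing a label form one intent cluster
```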
User queries to the Web tend to have more than one interpretation due to their ambiguity and other characteristics. How to diversify the ranking results to meet users’ various potential information needs has attracted considerable attention recently. This paper aims at mining the subtopics of a query, either indirectly from the returned results of retrieval systems or directly from the query itself, to diversify the search results. For the indirect subtopic mining approach, clustering the retrieval results and summarizing the content of the clusters are investigated. In addition, labeling topic categories and concept tags on each returned document is explored. For the direct subtopic mining approach, several external resources, such as Wikipedia, the Open Directory Project, search query logs, and the related search services of search engines, are consulted. Furthermore, we propose a diversified retrieval model to rank documents with respect to the mined subtopics for balancing relevance and diversity. Experiments are conducted on the ClueWeb09 dataset with the topics of the TREC09 and TREC10 Web Track diversity tasks. Experimental results show that the proposed subtopic-based diversification algorithm significantly outperforms the state-of-the-art models in the TREC09 and TREC10 Web Track diversity tasks. The best performance our proposed algorithm achieves is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214 on TREC09, as well as α-nDCG@10 0.421, IA-P@10 0.201, and α#-nDCG@10 0.311 on TREC10. The results indicate that subtopic mining based on up-to-date users’ search query logs is the most effective way to generate the subtopics of a query, and that the proposed subtopic-based diversification algorithm can select documents covering various subtopics.
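The diversified retrieval model is not spelled out in the abstract, but a greedy subtopic-coverage re-ranking in this spirit can be sketched as follows. The scoring function, the weight lam, and the toy data are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a greedy subtopic-aware re-ranking step that balances
# relevance against coverage of not-yet-represented subtopics.
def diversify(docs, rel, cover, k=10, lam=0.5):
    """docs: candidate ids; rel[d]: relevance score; cover[d]: set of subtopics d covers."""
    selected, covered = [], set()
    while docs and len(selected) < k:
        # Reward relevance plus coverage of subtopics not yet represented.
        best = max(docs, key=lambda d: lam * rel[d] +
                                       (1 - lam) * len(cover[d] - covered))
        selected.append(best)
        covered |= cover[best]
        docs = [d for d in docs if d != best]
    return selected

# Toy usage: two subtopics of the ambiguous query "jaguar".
rel = {"d1": 0.9, "d2": 0.8, "d3": 0.7}
cover = {"d1": {"car"}, "d2": {"car"}, "d3": {"animal"}}
print(diversify(["d1", "d2", "d3"], rel, cover, k=2))   # -> ['d1', 'd3']
```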
Detecting intent shift is fundamental for learning users’ behaviors and applying their experiences. In this paper, we propose a search query log based system to predict users’ intent shift. We begin by selecting sessions from search query logs for training, extracting features from the selected sessions, and clustering sessions of similar intent. The resulting clusters are used to predict intent shift in the testing data. The experimental results show that the proposed model achieves an accuracy of 0.5099, which is significantly better than the baseline. Moreover, the miss rate and spurious rate of the model are 0.0954 and 0.0867, respectively.
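One way to use intent clusters for shift prediction is sketched below, purely as an assumption about how the clusters could be consulted: an intent shift is flagged when the next action is closer to a different cluster centroid than to the centroid of the session so far.

```python
# Minimal sketch (an assumption, not the paper's model): flag an intent shift
# when the next action's feature vector falls into a different intent cluster
# than the actions observed so far in the session.
import numpy as np

def predict_shift(session_vecs, next_vec, centroids, margin=0.0):
    """session_vecs: feature vectors of actions so far; centroids: intent cluster centers."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    current = np.mean(session_vecs, axis=0)
    cur_cluster = max(range(len(centroids)), key=lambda i: cos(current, centroids[i]))
    new_cluster = max(range(len(centroids)), key=lambda i: cos(next_vec, centroids[i]))
    return new_cluster != cur_cluster and \
           cos(next_vec, centroids[new_cluster]) - cos(next_vec, centroids[cur_cluster]) > margin

centroids = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(predict_shift([np.array([0.9, 0.1])], np.array([0.1, 0.9]), centroids))   # -> True
```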
The credit card industry has been growing rapidly in recent years, and thus a huge amount of consumers’ credit data has been collected by the credit departments of banks. Credit scoring managers often evaluate a consumer’s credit based on intuition and experience. However, with the support of a credit classification model, the manager can accurately evaluate the applicant’s credit score. Support Vector Machine (SVM) classification is currently an active research area and has successfully solved classification problems in many domains. This study used three strategies to construct hybrid SVM-based credit scoring models that evaluate an applicant’s credit score from the applicant’s input features. Two credit datasets from the UCI database are selected as the experimental data to demonstrate the accuracy of the SVM classifier. Compared with neural networks, genetic programming, and decision tree classifiers, the SVM classifier achieved identical classification accuracy with relatively few input features. Additionally, by combining genetic algorithms with the SVM classifier, the proposed hybrid GA-SVM strategy can simultaneously perform feature selection and model parameter optimization. Experimental results show that SVM is a promising addition to the existing data mining methods.
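The GA-SVM idea can be sketched as follows, assuming a chromosome that concatenates a binary feature mask with log-scaled (C, gamma) and a cross-validated accuracy fitness. The dataset (scikit-learn's breast cancer data standing in for the UCI credit sets), the population size, and the mutation operators are simplifications for illustration.

```python
# Hedged GA-SVM sketch: joint feature selection and parameter optimization.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in for a UCI credit dataset
rng = np.random.default_rng(0)
n_feat = X.shape[1]

def fitness(chrom):
    mask, log_c, log_g = chrom[:n_feat].astype(bool), chrom[n_feat], chrom[n_feat + 1]
    if not mask.any():
        return 0.0
    clf = SVC(C=10 ** log_c, gamma=10 ** log_g)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def random_chrom():
    return np.concatenate([rng.integers(0, 2, n_feat), rng.uniform(-2, 2, 2)])

pop = [random_chrom() for _ in range(10)]
for _ in range(5):                               # a few GA generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]
    children = []
    for p in parents:
        child = p.copy()
        flip = rng.random(n_feat) < 0.1          # mutate the feature mask
        child[:n_feat] = np.where(flip, 1 - child[:n_feat], child[:n_feat])
        child[n_feat:] += rng.normal(0, 0.2, 2)  # mutate C and gamma on a log scale
        children.append(child)
    pop = parents + children
best = max(pop, key=fitness)
print("best CV accuracy:", round(fitness(best), 3))
```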
Support Vector Machines, one of the newer techniques for pattern classification, have been widely used in many application areas. The setting of kernel parameters for SVM in the training process affects the classification accuracy. Feature selection is another factor that affects classification accuracy. The objective of this research is to simultaneously optimize the parameters and the feature subset without degrading the SVM classification accuracy. We present a genetic algorithm approach for feature selection and parameter optimization to solve this kind of problem. We evaluated the proposed GA-based approach and the Grid algorithm, a traditional method of performing parameter search, on several real-world datasets. Compared with the Grid algorithm, our proposed GA-based approach significantly improves the classification accuracy and requires fewer input features for support vector machines.
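For comparison, the Grid-algorithm baseline amounts to an exhaustive cross-validated search over (C, gamma). The sketch below uses a coarse logarithmic grid and a stand-in dataset; both are assumptions for illustration, not the paper's exact settings.

```python
# Minimal sketch of a grid-search baseline over SVM hyper-parameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"C": [2 ** k for k in range(-5, 16, 4)],      # coarse log-scale grid
              "gamma": [2 ** k for k in range(-15, 4, 4)]}  # (illustrative ranges)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```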
Because credit cards are widely used, banks have accumulated a considerable amount of credit card related data, so how to use these data to judge the credit of future card applicants is very important. The Support Vector Machine (SVM) classification method has been widely applied in recent years to solve classification problems in various domains. This study attempts to solve the credit card classification problem with support vector machines. However, building an SVM classification system with different kernel functions requires different parameters to be set, and different parameter settings affect the classification accuracy of the SVM. This study uses the grid search algorithm to tune the SVM parameters so that the SVM achieves its best classification ability. The experimental data are taken from two credit card datasets in the UCI database, and classifiers are built with SVM. The experimental results show that the classification system built with the SVM algorithm achieves good classification accuracy.
In this paper, we present a question-answering (QA) system as a virtual tutor for students in the 5th and 6th grades. Students ask questions, and the QA system gives answers to their questions based on a knowledge base. Teaching materials for history and geography are considered as the knowledge source. Because question logs are not available for developing QA systems, multiple-choice questions (MCQs) in the learning and testing materials are regarded as a training corpus to learn question types, answer types and keywords for retrieval, where an MCQ consists of a stem and a set of options. Options from the same MCQ are grouped into a cluster. Clusters with common elements are merged into a larger cluster. A cluster is labeled with a nominal element selected from the corresponding stems. We also mine question patterns from the stems for question type analysis in the QA system. Because the questions created by instructors in MCQs and the questions asked by students may be different, we develop a procedure to collect possible questions from students in the 6th grade. In the experiments, we first evaluate the question type classification systems using the MCQ corpus and the student corpus with 5-fold cross validation, respectively. Then we train question type classifiers with the complete MCQ corpus and test them on the student corpus. The students’ and the instructors’ questions are compared and analyzed.
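The option clustering described above (options of one MCQ share a cluster; clusters with common elements are merged) behaves like a connected-components computation. The sketch below illustrates this view with a union-find structure and toy options; the actual grouping criteria are the paper's.

```python
# Hedged sketch of merging MCQ option clusters via union-find.
from collections import defaultdict

def cluster_options(mcq_options):
    """mcq_options: list of option lists, one list per multiple-choice question."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for options in mcq_options:
        for opt in options[1:]:
            union(options[0], opt)          # options of one MCQ share a cluster
    clusters = defaultdict(set)
    for opt in parent:
        clusters[find(opt)].add(opt)
    return list(clusters.values())

print(cluster_options([["Taipei", "Tainan"], ["Tainan", "Kaohsiung"], ["Paris", "Rome"]]))
# -> [{'Taipei', 'Tainan', 'Kaohsiung'}, {'Paris', 'Rome'}]
```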
In Internet ad campaigns, the ranking of an ad on search result pages depends on the cost-per-click (CPC) of ad words offered by an advertiser and a quality score estimated by the search engine. Bidding for ad words with a higher CPC is more competitive than bidding for the same ad words with a lower CPC in the ad ranking competition. However, offering a higher CPC increases the burden on advertisers. In contrast, offering a lower CPC may decrease the exposure rate of their ads. Thus, selecting an appropriate CPC for ad words is indispensable for advertisers. In this paper, we extract features at different semantic levels, such as named entities, topic terminologies, and individual words, from a large-scale real-world ad words corpus, and explore various learning-based prediction algorithms. The thorough experimental results show that the CPC prediction models considering more ad word semantics achieve better prediction performance, and the prediction model using support vector regression (SVR) and features from all semantic levels performs the best.
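A hedged sketch of the best-performing configuration, SVR over ad-word features, follows. A plain TF-IDF bag of words stands in for the named-entity, terminology, and word-level features, and the toy CPC values are invented for illustration.

```python
# Minimal SVR-based CPC prediction sketch with bag-of-words features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

ad_words = ["cheap flight tickets taipei", "luxury hotel booking", "used car loan rate"]
cpc = [1.2, 2.5, 3.1]                      # toy CPC values in arbitrary currency units

model = make_pipeline(TfidfVectorizer(), SVR(kernel="rbf", C=1.0))
model.fit(ad_words, cpc)
print(model.predict(["cheap hotel taipei"]))   # predicted CPC for a new ad-word phrase
```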
A benchmark evaluation dataset which reflects users’ search behaviors in the real world is indispensable for evaluating the performance of information retrieval applications. A typical evaluation dataset consists of a document set, a topic set and relevance judgments. Manual preparation of an evaluation dataset requires substantial human effort, and human-made topics may not fully capture users’ real search needs. This paper aims at automatically constructing an evaluation dataset for information retrieval applications from the wisdom of the crowds in search query logs. We begin by collecting the clicked documents in search query logs, selecting suitable queries in terms of topics, sampling documents from the document collection for each query, and estimating the multi-level relevance of the document samples based on click count, normalized count and average count functions. The machine-made evaluation dataset is used to train and test three learning-to-rank algorithms, including linear regression, SVMRank and FRank. We compare their performance on the testing collection MQ2007 of LETOR, which is a well-known human-made benchmark dataset for learning to rank. The experimental results show that the performance tendency is similar when using the machine-made and human-made evaluation datasets. This demonstrates that our proposed models can construct an evaluation dataset with quality similar to that of a human-made one.
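A minimal sketch of turning click statistics into multi-level relevance grades, in the spirit of the click-count, normalized-count and average-count functions mentioned above, is given below. The thresholds and function names are assumptions; the paper estimates the grades from the log itself.

```python
# Hedged sketch: map click statistics of a query-URL pair to a graded relevance label.
def relevance_from_clicks(clicks, impressions, levels=3):
    """Return a graded relevance label in {0, ..., levels} from click statistics."""
    if impressions == 0 or clicks == 0:
        return 0
    ctr = clicks / impressions                 # normalized count
    thresholds = [0.1, 0.3, 0.6][:levels]      # illustrative thresholds only
    return sum(ctr >= t for t in thresholds)

print(relevance_from_clicks(clicks=8, impressions=20))   # -> 2
```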
This paper proposes a method to construct an evaluation dataset from microblogs for the development of recommendation systems. We extract the relationships among the three main entities in a recommendation event, i.e., who recommends what to whom. User-to-user friend relationships and user-to-resource interest relationships in social media, and resource-to-metadata descriptions in an external ontology, are employed. In the experiments, the resources are restricted to visual entertainment media, movies in particular. A sequence of ground truths varying with time is generated, which reflects the dynamics of the real world.
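The recommendation-event representation implied above (who recommends what to whom, time-stamped so that ground truths can be sliced by time) can be sketched as below; the field names and toy events are illustrative, not the paper's schema.

```python
# Hedged sketch of a time-stamped recommendation event and a time-sliced ground truth.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RecommendationEvent:
    recommender: str          # who
    resource: str             # what (e.g., a movie linked to ontology metadata)
    recipient: str            # to whom
    timestamp: datetime

events = [
    RecommendationEvent("alice", "Inception", "bob", datetime(2011, 3, 1)),
    RecommendationEvent("bob", "Up", "carol", datetime(2011, 4, 2)),
]
# Ground truth at a given time = events observed up to that time.
ground_truth_march = [e for e in events if e.timestamp <= datetime(2011, 3, 31)]
print(len(ground_truth_march))   # -> 1
```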
Users express their information needs in terms of queries to find relevant documents on the web. However, users’ queries are usually short, so that search engines may not have enough information to determine their exact intents. How to diversify web search results to cover users’ possible intents as widely as possible is an important research issue. In this paper, we propose several subtopic mining approaches and show how to diversify the search results by the mined subtopics. For the Subtopic Mining subtask, we explore various subtopic mining algorithms that mine subtopics of a query from the enormous number of documents on the web. For the Document Ranking subtask, we propose re-ranking algorithms that keep the top-ranked results containing as many popular subtopics as possible. The re-ranking algorithms apply the subtopics mined by the subtopic mining algorithms to diversify the search results. The best performance of our system achieves an I-rec@10 (Intent Recall) of 0.4683, a D-nDCG@10 of 0.6546 and a D#-nDCG@10 of 0.5615 on the Chinese Subtopic Mining subtask of the NTCIR-9 Intent task, and an I-rec@10 of 0.6180, a D-nDCG@10 of 0.3314 and a D#-nDCG@10 of 0.4747 on the Chinese Document Ranking subtask. In addition, our system achieves an I-rec@10 of 0.4442, a D-nDCG@10 of 0.4244 and a D#-nDCG@10 of 0.4343 on the Japanese Subtopic Mining subtask, and an I-rec@10 of 0.5975, a D-nDCG@10 of 0.2953 and a D#-nDCG@10 of 0.4464 on the Japanese Document Ranking subtask.
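For reference, I-rec@k (intent recall) reported above measures the fraction of a query's known intents covered by the top-k documents. The sketch below computes it on toy intent assignments, which are assumptions rather than NTCIR judgments.

```python
# Hedged sketch of the I-rec@k (intent recall) computation.
def intent_recall_at_k(ranking, doc_intents, all_intents, k=10):
    covered = set()
    for doc in ranking[:k]:
        covered |= doc_intents.get(doc, set())
    return len(covered & all_intents) / len(all_intents)

doc_intents = {"d1": {"i1"}, "d2": {"i1", "i2"}, "d3": {"i3"}}
print(intent_recall_at_k(["d1", "d2"], doc_intents, {"i1", "i2", "i3"}, k=2))  # -> 0.666...
```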
Predicting users’ potential advertisement clicks is important for advertisement recommendation, advertisement placement, presentation pricing, and so on. In this paper, several machine learning algorithms, such as conditional random fields (CRF), support vector machines (SVM), decision trees (DT) and back-propagation neural networks (BPN), are developed to learn users’ click behaviors from advertisement search and click logs. In addition, four levels of features are extracted to represent users’ search and click intents. Given a user’s search session and a query, machine learning algorithms along with different features are proposed to predict whether the user will click the advertisements displayed for the query. We further study the impact of feature selection algorithms on the prediction models. Random subspace (RS), F-score (FS) and information gain (IG) are employed to search for a predictive subset of features. The experiments show that the CRF model with the random subspace feature selection algorithm achieves the best performance.
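One of the feature selection steps named above, information-gain ranking, can be approximated with scikit-learn's mutual information estimator, as sketched below; the stand-in function and toy features are assumptions, and the paper's exact IG formulation may differ.

```python
# Hedged sketch: rank session features by an information-gain-style score.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.array([[1, 0, 3], [0, 1, 2], [1, 0, 5], [0, 1, 1]])   # toy session features
y = np.array([1, 0, 1, 0])                                   # clicked an ad or not

scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
top_k = np.argsort(scores)[::-1][:2]          # keep the 2 most informative features
print(top_k, scores.round(3))
```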
This paper mines sequences of actions, called search scripts, from query logs, which record large-scale users’ search experiences. Search scripts can be applied to predict users’ search needs, improve retrieval effectiveness, recommend advertisements, and so on. Information quality, topic diversity, query ambiguity, and URL relevancy are major challenging issues in search script mining. In this paper, we calculate the relevance of URLs, adopt the Open Directory Project (ODP) categories to disambiguate queries and URLs, explore various features and clustering algorithms for intent clustering, and identify critical actions from each intent cluster to form a search script. Experiments show that the model consisting of the complete-link hierarchical clustering algorithm with the features of query terms, relevant URLs, and disambiguated ODP categories performs the best. Search scripts are generated from the best model. When only search scripts containing a single intent are considered correct, the accuracy of the action identification algorithm is 0.4650. If search scripts containing a major intent are also counted, the accuracy increases to 0.7315.
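The action-identification step can be sketched as follows, assuming a simple frequency criterion: within one intent cluster, actions (queries or clicked URLs) that occur in many sessions are kept as the critical actions of the search script. The support threshold and toy sessions are assumptions for illustration.

```python
# Hedged sketch: pick critical actions of a search script by session support.
from collections import Counter

def critical_actions(cluster_sessions, min_support=0.5):
    """cluster_sessions: list of action lists from sessions in one intent cluster."""
    counts = Counter(a for session in cluster_sessions for a in dict.fromkeys(session))
    n = len(cluster_sessions)
    return [a for a, c in counts.most_common() if c / n >= min_support]

sessions = [
    ["q:cheap flights", "url:skyscanner.com", "q:flight taipei tokyo"],
    ["q:cheap flights", "url:expedia.com", "url:skyscanner.com"],
    ["q:cheap flights", "url:skyscanner.com"],
]
print(critical_actions(sessions))   # -> ['q:cheap flights', 'url:skyscanner.com']
```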
Capturing users’ future search actions has many potential applications, such as query recommendation, web page reranking, advertisement arrangement, and so on. This paper predicts users’ future queries and URL clicks based on their current access behaviors and global users’ query logs. We explore various features from the queries and clicked URLs in users’ current search sessions, select similar intents from query logs, and use them for prediction. Because of the intent shift problem in search sessions, this paper discusses which actions have more effect on the prediction, which representations are more suitable for representing users’ intents, how intent similarity is measured, and how the retrieved similar intents affect the prediction. The MSN Search Query Log excerpt (RFP 2006 dataset) is taken as the experimental corpus. Three methods and the back-off models are presented.
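A back-off style prediction can be sketched as below, purely as an assumption about the structure of such models: try to retrieve similar intents using the richest session representation first, and fall back to coarser representations when no match is found.

```python
# Hedged sketch of back-off prediction over session representations.
def predict_next_action(session, retrieve,
                        representations=("query+url", "query", "last_query")):
    """retrieve(rep_name, session) -> list of candidate next actions, possibly empty."""
    for rep in representations:
        candidates = retrieve(rep, session)
        if candidates:                  # stop at the most specific level with matches
            return candidates[0]
    return None                         # no similar intent found at any level

# Toy retrieval function standing in for similarity search over the query log.
def toy_retrieve(rep, session):
    log = {"query": {"cheap flights": ["url:skyscanner.com"]}}
    return log.get(rep, {}).get(session[-1], [])

print(predict_next_action(["cheap flights"], toy_retrieve))   # -> 'url:skyscanner.com'
```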