-- download from file server.
-- run command as DownloadExt.`` where
-- from="public/SogouCS.reduced.tar" and
-- to="/tmp/nlp/sogo";
-- or you can use the command line:
-- !saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;

-- load data in xml format
load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData;

-- extract `sports` from url [http://sports.sohu.com/20070422/n249599819.shtml]
select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp
where temp.labelStr is not null
as rawData;

-- try the following SQL to explore how many labels we have and what they look like.
-- (these read from xmlData, because rawData no longer has the url column)
--
-- select distinct(split(split(url,"/")[2],"\\.")[0]) as labelStr from xmlData as output;
-- select split(split(url,"/")[2],"\\.")[0] as labelStr,url from xmlData as output;

-- train a model which can map label to number and vice versa
train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and outputCol="label";

-- convert label to number
predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;

-- you can use register to convert a model to a function
register StringIndex.`/tmp/nlp/label_mapping` as convert_label;

-- we can reduce the dataset: with too much data and limited resources,
-- training may take too long. you can use the command line below,
-- or you can use the raw ET:
--
-- run xmlData as RateSampler.``
-- where labelCol="url" and sampleRate="0.9,0.1"
-- as xmlDataArray;
!split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;

-- then we fetch the rows of xmlDataArray at position one to get the 10% of the data.
select * from xmlDataArray where __split__=1 as miniXmlData;

-- we can save the result data, because computing it takes a long time.
save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;
load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;
-- select * from miniXmlData limit 10 as output;

-- convert the content to tfidf format
train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;
save overwrite trainData as parquet.`/tmp/nlp/trainData`;
load parquet.`/tmp/nlp/trainData` as trainData;

-- again, register a model as a function
register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;

-- use the RandomForest algorithm to train
train trainData as RandomForest.`/tmp/nlp/rf` where
keepVersion="true"
and fitParam.0.featuresCol="content"
and fitParam.0.inputLabel="labelCol"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
;

-- register the RF model as a function
register RandomForest.`/tmp/nlp/rf` as rf_predict;

-- end-to-end predict; you can also deploy this as an API service
select rf_predict(tfidf_predict("新闻不错")) as predicted as output;
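
The label-extraction expression used in the script, split(split(url,"/")[2],"\\.")[0], can be checked outside MLSQL before running the whole pipeline. The following is a minimal Python sketch of the same logic, not part of the MLSQL script. The function name extract_label is hypothetical, and returning None for a URL with too few segments is an assumption that mirrors the script's "labelStr is not null" filter.

```python
from typing import Optional


def extract_label(url: str) -> Optional[str]:
    """Mirror split(split(url, "/")[2], "\\.")[0] from the script.

    Hypothetical helper for illustration only.
    """
    parts = url.split("/")
    # "http://host/path" splits into ["http:", "", "host", "path"],
    # so the host name is the third element (index 2).
    if len(parts) < 3:
        # Assumption: treat a missing host segment as "no label",
        # like the rows the script drops with `labelStr is not null`.
        return None
    # The first dot-separated piece of the host is used as the label.
    return parts[2].split(".")[0]


if __name__ == "__main__":
    print(extract_label("http://sports.sohu.com/20070422/n249599819.shtml"))
```

For the sample URL quoted in the script's comment, this yields the label sports.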