MLSQL NLP Example

Category: Programming Tools · Published: 6 years ago

-- Download the dataset from the file server:
-- run command as DownloadExt.`` where 
-- from="public/SogouCS.reduced.tar" and
-- to="/tmp/nlp/sogo";
-- Or use the command-line form:
--  !saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;

-- Load the data, which is GBK-encoded XML with one record per <doc> tag.
load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData; 


-- Extract the label from the URL host, e.g. `sports` from
-- http://sports.sohu.com/20070422/n249599819.shtml
select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp 
where temp.labelStr is not null 
as rawData;
-- Try the following queries to see how many labels there are and what they look like.
-- They read from xmlData because rawData no longer contains the url column.
--
-- select distinct(split(split(url,"/")[2],"\\.")[0]) as labelStr from xmlData as output;
-- select split(split(url,"/")[2],"\\.")[0] as labelStr,url from xmlData as output;

-- Train a model that maps each label to a number and vice versa.
train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
outputCol="label";

-- Convert each label to its number.
predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;

-- You can use register to turn a model into a function.
register StringIndex.`/tmp/nlp/label_mapping` as convert_label; 

-- Optionally shrink the dataset: with limited resources, training on the full data
-- may take too long. You can sample with the !split command below,
-- or use the equivalent raw ET:
--
-- run rawDataWithLabel as RateSampler.`` 
-- where labelCol="label" and sampleRate="0.9,0.1" 
-- as xmlDataArray;
!split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;
-- Then select the rows with __split__=1 to get the 10% sample.
select * from xmlDataArray where __split__=1 as miniXmlData;

-- Save the sampled data, since producing it takes a long time.
save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;

load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;
-- select * from miniXmlData limit 10 as output;

-- Convert the content column to TF-IDF vectors.
train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;

save overwrite trainData as parquet.`/tmp/nlp/trainData`;
load parquet.`/tmp/nlp/trainData` as trainData;
-- Again, register the model as a function.
register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;

-- Train a classifier with the RandomForest algorithm.
train trainData as RandomForest.`/tmp/nlp/rf` where 
keepVersion="true" 
and fitParam.0.featuresCol="content" 
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
;

-- Register the RF model as a function.
register RandomForest.`/tmp/nlp/rf` as rf_predict;

-- End-to-end prediction; you can also deploy this as an API service.
select rf_predict(tfidf_predict("新闻不错")) as predicted as output;
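
The label handling in the script can be checked outside MLSQL. The following is a minimal Python sketch, not part of the original script, that mirrors the `split(split(url,"/")[2],"\\.")[0]` expression and then builds a label-to-number mapping. It assumes that StringIndex behaves like Spark ML's StringIndexer with its default ordering (the most frequent label gets index 0); the alphabetical tie-break, the helper names `extract_label` and `build_label_index`, and every URL except the first are hypothetical and used only for illustration.

```python
from collections import Counter


def extract_label(url):
    """Mirror split(split(url, "/")[2], "\\.")[0]: first component of the host.

    Returns None when the URL has fewer than three "/"-separated parts,
    which corresponds to the null that the script filters out.
    """
    parts = url.split("/")
    if len(parts) < 3:
        return None
    return parts[2].split(".")[0]


def build_label_index(labels):
    """Map labels to 0.0, 1.0, ... with the most frequent label first.

    Assumption: ties are broken alphabetically.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


urls = [
    "http://sports.sohu.com/20070422/n249599819.shtml",  # URL from the script comment
    "http://sports.sohu.com/a.shtml",  # hypothetical
    "http://news.sohu.com/b.shtml",    # hypothetical
    "not-a-url",                       # yields None and is dropped
]

# Keep only rows with a label, as the "is not null" filter does.
labels = [label for label in map(extract_label, urls) if label is not None]

label_to_index = build_label_index(labels)
index_to_label = {index: label for label, index in label_to_index.items()}

print(labels)
print(label_to_index)
```

The reverse dictionary `index_to_label` plays the role of the "vice versa" direction, turning a predicted number back into a readable label.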
