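7. Extracting Information from Text
Questions to answer:
- How can structured data be extracted from unstructured text?
- How can the entities and relationships described in a text be identified?
- Which corpora are suitable for this task?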
7.1 Information Extraction
Preprocessing pipeline: sentence segmentation -> word tokenization -> part-of-speech tagging
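import nltk

# Requires NLTK's sentence tokenizer and POS tagger models, installed
# via nltk.download(); the exact resource names vary by NLTK version.
def ie_preprocess(document):
    """Return the document as a list of sentences, each a list of (token, tag) pairs."""
    sentences = nltk.sent_tokenize(document)                       # sentence segmentation
    sentences = [nltk.word_tokenize(sent) for sent in sentences]   # word tokenization
    sentences = [nltk.pos_tag(sent) for sent in sentences]         # part-of-speech tagging
    return sentences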
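
To make the shape of the pipeline's output concrete, here is a minimal sketch that mirrors the three stages using only the standard library. Everything prefixed with toy_ and the TOY_TAGS lexicon are hypothetical stand-ins invented for illustration; they are much cruder than nltk.sent_tokenize, nltk.word_tokenize and nltk.pos_tag, and are only meant to show that each stage maps over the result of the previous one.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only);
# a real tagger such as nltk.pos_tag is a trained model, not a lookup table.
TOY_TAGS = {"the": "DT", "a": "DT", "dog": "NN", "cat": "NN",
            "barked": "VBD", "ran": "VBD", ".": "."}

def toy_sent_tokenize(document):
    # Naive stand-in: split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]

def toy_word_tokenize(sentence):
    # Naive stand-in: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence)

def toy_pos_tag(tokens):
    # Look each token up in the toy lexicon; unknown words default to "NN"
    return [(tok, TOY_TAGS.get(tok.lower(), "NN")) for tok in tokens]

def toy_ie_preprocess(document):
    sentences = toy_sent_tokenize(document)                 # str -> list of str
    sentences = [toy_word_tokenize(s) for s in sentences]   # -> list of token lists
    sentences = [toy_pos_tag(s) for s in sentences]         # -> list of (token, tag) lists
    return sentences

result = toy_ie_preprocess("The dog barked. A cat ran.")
for sent in result:
    print(sent)
```

The result is a list with one entry per sentence, and each entry is a list of (token, tag) pairs. ie_preprocess returns the same nested structure, which is the form that later steps such as chunking usually take as input.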