NLTK-Based Word Frequency Analysis for Autonomous Driving

  1. Description
  2. Approach
  3. Single-Article Practice
    1. Article content
    2. Tokenization
    3. Word frequency analysis
    4. Removing noise
  4. Adding the Scraping Part
    1. Choosing the data source
    2. HTML content processing
  5. Questions
  6. References

Because of my work, I need to master the English vocabulary of a particular field (autonomous driving, for example), especially its high-frequency terms: it helps me see what the industry discussion is focused on, and it also matters for day-to-day communication in English.

Description

  1. Scrape 100 autonomous-driving-related articles, from news stories to papers (BeautifulSoup + nltk);
  2. Identify the high-frequency key terms related to autonomous driving (requires rule matching);
  3. Automatically attach definitions and usage, and finally build a stored high-frequency vocabulary for autonomous driving.

Approach

I had some prior exposure to natural language processing and its most basic techniques; this project is a chance to put that knowledge into practice.

NLTK (Natural Language Toolkit) is the most commonly used Python library in the NLP field.
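The later code relies on NLTK corpora that are downloaded separately; a minimal setup sketch, assuming the stop-word list and the nltk.book collection are not yet installed locally:

import nltk

# One-time downloads for the resources used below.
nltk.download('stopwords')   # English stop-word list, used for filtering
nltk.download('book')        # corpora loaded by "from nltk.book import *"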

Main workflow:

Text processing pipeline

Below, we first work through a single piece of text to establish the general flow for obtaining word frequencies, and then scrape autonomous-driving news as HTML from web data sources for further processing.

Single-Article Practice

Before anything can be processed in batch, a single document has to be handled correctly. Below, working from one document, we walk through splitting text into paragraphs and sentences and tokenizing it into words.

Article content

Toyota Will Test Their AI-Powered Driverless Cars in 2020:
https://futurism.com/toyota-will-test-their-ai-powered-driverless-cars-in-2020/

We could work from the HTML itself and strip out the HTML tags to get the content we actually care about; for this first pass, the article text is simply copied into a standalone data file.
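If we did start from the live page instead, a minimal sketch of stripping the HTML tags with BeautifulSoup might look like the following (the parser choice and the paragraph-only extraction are assumptions about what counts as article text):

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'https://futurism.com/toyota-will-test-their-ai-powered-driverless-cars-in-2020/'
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Keep only the visible paragraph text and drop all HTML tags.
raw = '\n'.join(p.get_text() for p in soup.find_all('p'))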

Tokenization

tokens = tokenizer.tokenize(raw)
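The line above assumes the article text has already been read into raw and a regular-expression tokenizer has been created; a minimal setup sketch, using the news1 data file that also appears in the later code:

from nltk.tokenize import RegexpTokenizer

# Read the copied article text; r'\w+' keeps alphanumeric runs and drops punctuation.
raw = open('news1').read().decode('utf-8')
tokenizer = RegexpTokenizer(r'\w+')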

Word frequency analysis

from nltk import FreqDist
fdist2 = FreqDist(tokenizer.tokenize(raw))  # frequency distribution over the word tokens
fdist2.plot(50, cumulative=True)
>>> fdist2.items()
[('all', 1), ('concept', 1), ('Powered', 1), ('expects', 1), ('month', 1), ('Okabe', 1), ('driverless', 1), ('planning', 2), ('through', 1), ('emotions', 1), ('Test', 1), ('its', 1), ('boundaries', 1), ('supposedly', 1), ('title', 1), ('drive', 1), ('better', 1), ('to', 12), ('only', 1), ('going', 1), ('equipped', 1), ('has', 1), ('By', 1), ('meant', 1), ('division', 1), ('tests', 1), ('hit', 1), ('unveiling', 1), ('get', 1), ('assistant', 1), ('BRIEF', 1), ('know', 1), ('using', 2), ('now', 1), ('governments', 1), ('runs', 2), ('trucks', 1), ('bring', 1), ('manager', 1), ('drivers', 1), ('this', 1), ('habits', 1), ('t', 2), ('enhance', 1), ('works', 1), ('futuristic', 1), ('Tokyo', 1), ('Management', 1), ('isn\xe2', 2), ('intelligence', 1), ('testing', 2), ('are', 2), ('combining', 1), ('arm', 1), ('transport', 1), ('said', 1), ('preferences', 1), ('for', 5), ('Driverless', 1), ('artificial', 1), ('content', 1), ('capital', 1), ('new', 1), ('be', 4), ('we', 1), ('transportation', 1), ('business', 1), ('cars', 4), ('pushing', 1), ('notable', 1), ('however', 1), ('Yui', 4), ('EV', 1), ('by', 2), ('manufacturer', 1), ('on', 1), ('Concept', 2), ('of', 6), ('Image', 2), ('experience', 1), ('trial', 2), ('s', 2), ('games', 1), ('Makoto', 1), ('year', 1), ('billions', 1), ('Their', 1), ('learning', 1), ('roads', 1), ('your', 1), ('Sure', 1), ('Olympic', 1), ('powered', 2), ('spending', 1), ('Cars', 1), ('flying', 1), ('system', 1), ('start', 1), ('expand', 1), ('way', 1), ('vehicle', 1), ('Toyota\xe2', 1), ('speaking', 1), ('wants', 2), ('that', 2), ('company', 1), ('Credit', 1), ('line', 1), ('general', 1), ('with', 4), ('builds', 1), ('these', 3), ('car', 2), ('will', 2), ('future', 1), ('2020', 5), ('venture', 1), ('carmakers', 1), ('making', 1), ('autonomous', 3), ('called', 2), ('affection', 1), ('and', 4), ('promises', 1), ('later', 1), ('want', 1), ('Cartivator', 1), ('is', 4), ('partnership', 1), ('them', 1), ('deep', 1), ('an', 2), ('as', 1), ('chat', 1), ('in', 7), ('seem', 1), ('technology', 1), ('their', 6), ('Reuters', 1), ('again', 1), ('different', 1), ('credit', 1), ('hydrogen', 1), ('self', 1), ('After', 1), ('able', 1), ('virtual', 1), ('also', 1), ('which', 1), ('test', 1), ('development', 2), ('product', 1), ('Resource', 1), ('AI', 7), ('object', 1), ('Toyota', 10), ('Will', 1), ('paving', 1), ('waves', 1), ('The', 3), ('the', 10), ('carmaker', 1), ('Yui\xe2', 1), ('a', 4), ('\xe2', 4), ('driving', 3), ('i', 2), ('vehicles', 3), ('Japanese', 2), ('time', 1), ('model', 1), ('looking', 1), ('make', 1), ('typical', 1)]
>>> fdist2.max()
'to'
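To list the top-ranked words directly, FreqDist behaves like a counter; a small sketch, assuming NLTK 3.x where FreqDist subclasses collections.Counter:

# The 15 most frequent tokens as (word, count) pairs; this is the data behind the top-15 plots below.
print(fdist2.most_common(15))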

Top 15 words by frequency:

Cumulative frequency of the top 15 words:

Removing noise

What counts as noise?
Any text that is irrelevant to the data's context and to the final output can be treated as noise.

Problem: the top-ranked words are prepositions, conjunctions, and the like. These are known as stop words (the words a language uses most: linking verbs such as is and am, the definite article the, prepositions such as of and in). None of them are content we care about, so they need to be filtered out.

Solution: remove the stop words.

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.book import *

f = open('news1')
raw = f.read()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(raw.decode('utf-8'))

# Drop English stop words before counting word frequencies.
stop_words = set(stopwords.words('english'))
filtered = [word for word in tokens if word not in stop_words]
fd = FreqDist(filtered)
fd.plot(20, cumulative=False)

After removing the stop words, the key terms clearly stand out:

news2:

news3:

Remaining issues:

TODO solutions:

To build our own high-frequency vocabulary, the next step is to connect these words with the phrases they appear in, that is, with how they are actually used.
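NLTK already provides some tools for attaching usage context to a word; the sketch below, built on the filtered token list from above, uses concordance lines and bigram collocations (the query word 'autonomous' is only an illustrative choice):

import nltk

# Show each occurrence of a word together with its surrounding context.
text = nltk.Text(filtered)
text.concordance('autonomous')

# Rank two-word phrases by how strongly their words are associated
# (for example "autonomous vehicles").
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(filtered)
print(finder.nbest(bigram_measures.likelihood_ratio, 10))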

So far the approach has proven feasible. To turn it into a fully automated process, the complete logic of the remaining part follows:

Adding the Scraping Part

Choosing the data source

HTML content processing

The scraping part relies mainly on BeautifulSoup. The base class handles fetching an article content page, while each site-specific subclass extracts the list of article links from that site's listing page. For example, scraping content from The Guardian:

#! /usr/bin/env python
# coding=utf8
from urllib2 import urlopen
from bs4 import BeautifulSoup

class Scrapy:
    # Fetch one news content page: title plus all <p> text, then save it to a file.
    def fetchContent(self, url):
        soup = BeautifulSoup(urlopen(url))
        news = ''

        # title
        title = soup.title.string
        news += soup.title.string + '\r\n\n'
        # content
        for string in soup.find_all('p'):
            if string.string is not None:
                news += string.string

        news += '\r\n\n' + url

        self.save_file(title, news)
        return news

    def save_file(self, title, news_content):
        f = open('news_sample/' + title + '.news', 'w+')
        f.write(news_content.encode('utf-8'))
        f.close()

class Guardian(Scrapy):
    # Listing page for The Guardian's self-driving cars tag.
    data_source = "https://www.theguardian.com/technology/self-driving-cars"

    def fetchArticleHref(self):
        # Walk the tag page, pull the link out of each <h2>, and fetch that article.
        soup = BeautifulSoup(urlopen(Guardian.data_source))
        for article in soup.find_all('h2'):
            article_url = article.a.get('href')
            print article_url
            news = self.fetchContent(article_url)
            print news
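A usage sketch for the classes above, assuming it is run as a script from the project root so that the news_sample/ directory exists:

if __name__ == '__main__':
    # Crawl the Guardian tag page and save every linked article under news_sample/.
    Guardian().fetchArticleHref()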

Questions

At first, the HTML content obtained through BeautifulSoup was too coarse-grained and very noisy, and the JavaScript embedded in the pages had to be stripped out as well; the more precise approach in the end was simply to extract the title and the body text separately.
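For the first problem, a hedged sketch of dropping script and style content from the parsed page before extracting any text (decompose removes the tags from the parse tree):

# Remove <script> and <style> elements so their contents never reach the news body.
for tag in soup(['script', 'style']):
    tag.decompose()

plain_text = soup.get_text()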

The processed news content is stored as one file per article, in the following format:

Title

Article body
Article URL

Next, we run an overall analysis across the 124 articles scraped so far.

import glob
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.book import *

raw = ''
# The three manually copied articles from the earlier experiments.
for i in range(1, 4):
    f = open('news_sample/news' + str(i))
    raw += f.read()

# All articles scraped into news_sample/*.news.
news_files = glob.glob('news_sample/*.news')
for name in news_files:
    with open(name, 'rb') as infile:
        raw += infile.read()

tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(raw.decode('utf-8'))

# Remove English stop words, then plot the 100 most frequent remaining words.
stop_words = set(stopwords.words('english'))
filtered = [word for word in tokens if word not in stop_words]
fd = FreqDist(filtered)
fd.plot(100, cumulative=False)

Below is the word-frequency distribution over the top 100 words, with no manual filtering applied:
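Toward step 3 of the plan (a stored high-frequency vocabulary), a minimal sketch that writes the current top words to a file; the file name and the tab-separated format are arbitrary choices:

# Persist the 100 most frequent words and their counts as a simple vocabulary file.
with open('autonomous_driving_vocab.txt', 'w') as out:
    for word, count in fd.most_common(100):
        out.write(u'{0}\t{1}\n'.format(word, count).encode('utf-8'))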


My thoughts:
