Data Import and Preprocessing, Experiment 2: Converting JSON-Format Files

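I. Experiment Overview
[Objectives]

Gain a basic understanding of data collection methods;

Gain a basic understanding of using a crawler to scrape data from the web;

Learn how to convert between different data formats.

[Environment] (materials, equipment, and software used) Linux or Windows, a MySQL database, and Python or another high-level language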

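II. Experiment Content
Question 1: Scraping web data
[Requirements]

Scrape the title, artist, and duration of each of the top 500 songs on the chart of the Kugou Music website (https://www.kugou.com/).

Save the scraped data as a JSON file.

Read some data back from the JSON file to check that the file format is correct.

[Procedure] (steps, records, data, programs, etc.)
Provide the steps and screenshots as evidence.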

from bs4 import BeautifulSoup
import requests
import time
import re
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

keys = ['songName', 'singer', 'time']
song = []  # one [songName, singer, time] list per track

def get_info(url, encoding):
    # Download one chart page and decode it with the site's own encoding
    res = requests.get(url, headers=headers)
    res.encoding = encoding
    soup = BeautifulSoup(res.text, 'lxml')
    titles = soup.select('a.pc_temp_songname')
    times = soup.select('span.pc_temp_time')
    # The loop variable is named duration so it does not shadow the time module
    for title, duration in zip(titles, times):
        # Titles have the form "singer - song name"; split on the first separator only
        singer, songName = title.get_text().strip().split(' - ', 1)
        song.append([songName, singer, duration.get_text().strip()])

def output(file):
    # Build the whole structure first and let json.dump place the commas and brackets
    songInfo = [dict(zip(keys, s)) for s in song]
    json.dump({'songInfo': songInfo}, file, ensure_ascii=False, indent=4)

def get_website_encoding(url):
    # Pages of one site normally share an encoding, so checking the home page once is enough
    res = requests.get(url, headers=headers)
    charset = re.search("charset=(.*?)>", res.text)
    if charset is not None:
        blocked = ['\'', ' ', '\"', '/']
        cleaned = [c for c in charset.group(1) if c not in blocked]
        return ''.join(cleaned)  # use the page's declared encoding to avoid garbled text
    else:
        return res.encoding  # no charset found: fall back to the encoding requests detected

if __name__ == '__main__':
    encoding = get_website_encoding('http://www.kugou.com')
    urls = ['http://www.kugou.com/yy/rank/home/{}-8888.html?from=rank'.format(str(i)) for i in range(1, 23)]
    # Write UTF-8 so the file can be read back with encoding='UTF-8'
    with open('kugou_500.json', 'w', encoding='utf-8') as f:
        for url in urls:
            get_info(url, encoding)
            time.sleep(1)  # pause one second so requests are not sent too quickly
        output(f)
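
The scraper relies on every title having the form "singer - song name". As a minimal sketch of how titles that deviate from that form could be handled, the helper below splits on the first separator only and falls back gracefully when there is none. The name split_title and the choice to return None for a missing singer are assumptions for illustration, not part of the original script.

```python
def split_title(title):
    """Split a 'singer - song' string into (singer, song).

    Hypothetical helper: splits on the first ' - ' only and returns
    (None, title) when the separator is missing.
    """
    singer, sep, song = title.strip().partition(' - ')
    if not sep:
        # No separator at all: keep the whole text as the song name
        return None, title.strip()
    return singer.strip(), song.strip()


print(split_title('Singer A - Song B'))
print(split_title('Singer A - Song B - Live'))
print(split_title('Untitled'))
```

Using partition instead of an unpacked split means a malformed title cannot raise a ValueError in the middle of a scraping run.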

This produces the JSON file.

Open the file with json.load. If it loads and prints without an error, the file format is correct.

import json

with open("kugou_500.json",'r',encoding='UTF-8') as f:
    new_dict = json.load(f)
    print(new_dict)
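
When the file is not valid JSON, json.load raises json.JSONDecodeError, which carries the position of the problem. The sketch below wraps the check in a function that reports that position instead of stopping with a traceback. The name check_json_file and the two throwaway demo files are assumptions for illustration.

```python
import json
import os
import tempfile


def check_json_file(path, encoding='utf-8'):
    """Return (True, data) if the file parses as JSON, else (False, message)."""
    try:
        with open(path, 'r', encoding=encoding) as f:
            return True, json.load(f)
    except json.JSONDecodeError as err:
        # lineno and colno point at the first character the parser rejected
        return False, 'line {}, column {}: {}'.format(err.lineno, err.colno, err.msg)


# Demo on two throwaway files: one valid, one with a trailing comma
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.json')
    bad = os.path.join(tmp, 'bad.json')
    with open(good, 'w', encoding='utf-8') as f:
        f.write('{"songInfo": [{"songName": "a"}]}')
    with open(bad, 'w', encoding='utf-8') as f:
        f.write('{"songInfo": [{"songName": "a"},]}')
    print(check_json_file(good))
    print(check_json_file(bad))
```

A trailing comma is the kind of mistake that hand-written bracket and comma output tends to produce, which is why it is used as the failing case here.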

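Question 2: Generating a CSV file in code and converting it to JSON
[Requirements]

Write a program that generates a CSV file with the following content (the columns are name, gender, hometown, and department):

姓名,性别,籍贯,系别
张迪,男,重庆,计算机系
兰博,男,江苏,通信工程系
黄飞,男,四川,物联网系
邓玉春,女,陕西,计算机系
周丽,女,天津,艺术系
李云,女,上海,外语系

Convert this CSV file to JSON, and query the records of all female students in the file.

[Procedure] (steps, records, data, programs, etc.)
Provide the steps and screenshots as evidence.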

import csv

# Header row first, then one row per student, as listed in the requirements
rows = [
    ["姓名", "性别", "籍贯", "系别"],
    ["张迪", "男", "重庆", "计算机系"],
    ["兰博", "男", "江苏", "通信工程系"],
    ["黄飞", "男", "四川", "物联网系"],
    ["邓玉春", "女", "陕西", "计算机系"],
    ["周丽", "女", "天津", "艺术系"],
    ["李云", "女", "上海", "外语系"],
]
# newline="" stops the csv module from writing blank lines between rows on Windows
with open("question02.csv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

Convert to JSON format:

import csv
import json

# DictReader uses the header row as keys, so each data row becomes a dict
with open("question02.csv", "r", encoding="utf-8", newline="") as csvFile:
    people = list(csv.DictReader(csvFile))

# json.dump writes the brackets and commas, so the row count need not be hard-coded
with open("question02.json", "w", encoding="utf-8") as jsonFile:
    json.dump({"personInfo": people}, jsonFile, ensure_ascii=False, indent=4)

import json

with open("question02.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Print every record whose 性别 (gender) field is 女 (female)
for person in data["personInfo"]:
    if person["性别"] == "女":
        print(person)

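Question 3. Converting between XML and JSON
[Content and Requirements]
(1) Read an XML file with the following content (the elements are book, title, author, synopsis, and publisher):
<?xml version="1.0" encoding="gb2312"?>
<图书>
    <书名>红楼梦</书名>
    <作者>曹雪芹</作者>
    <主要内容>描述贾宝玉和林黛玉的爱情故事</主要内容>
    <出版社>人民文学出版社</出版社>
</图书>
(2) Convert the XML file above to JSON.

[Procedure] (steps, records, data, programs, etc.)
Provide the code and screenshots of the program running.

Create the XML file (question_03.xml).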

import json
import xmltodict

# Read the XML file as text. The file is assumed to be saved as UTF-8;
# use encoding="gb2312" instead if it was saved in that encoding.
with open("question_03.xml", "r", encoding="utf-8") as file:
    xmlStr = file.read()

# xmltodict.parse returns a dict that mirrors the XML tree
xmlDict = xmltodict.parse(xmlStr)

# Write the dict as JSON, keeping the Chinese characters unescaped
with open("question03JSON.json", "w", encoding="utf-8") as f:
    json.dump(xmlDict, f, ensure_ascii=False, indent=4)
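
xmltodict is a third-party package. For a document as simple as this one, the same conversion can be sketched with only the standard library. The function element_to_dict below is a hypothetical helper written for this example. It ignores attributes and mixed text, and the XML declaration is left out of the inline string so that the example does not depend on how the file was encoded.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element to nested dicts; leaf elements become their text."""
    children = list(elem)
    if not children:
        return (elem.text or '').strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags are collected into a list
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


xml_text = (
    '<图书><书名>红楼梦</书名><作者>曹雪芹</作者>'
    '<主要内容>描述贾宝玉和林黛玉的爱情故事</主要内容>'
    '<出版社>人民文学出版社</出版社></图书>'
)
root = ET.fromstring(xml_text)
book = {root.tag: element_to_dict(root)}
print(json.dumps(book, ensure_ascii=False, indent=4))
```

For XML that uses attributes or namespaces, a dedicated library such as xmltodict remains the more practical choice.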
