I’ve spent the last two days trying to figure out how to scrape data from a website and turn the relevant information into flashcards.
This is the procedure I’ve come up with:
-
Download the
RTF
file provided by the website developer. -
Save it as
htm
file. -
Scrape this file using pandas (the information is contained in tables that “cannot” be scraped directly from the website using BS4 in a practical way as there’s multiple
div
using the sameclass
name). -
Export the data to an Excel file with multiple sheets.
-
Save every sheet as a
.txt
file. -
Edit every file to add a custom separator.
-
Change the codification of these files to UTF-8.
-
Import the
txt
files to Anki.
Well… I’ve got 100+ rtf
files to scrape and each one of them contains something around 20 tables.
I’m definitely not interested in repeating the procedure I described above that many times so, I wonder if there’s a more efficient way to get the same result.
Can you help me with that?
For more context, this is the code I'm using to scrape the `htm` file:
import pandas as pd
url = 'C:/Users/Me/Desktop/Inf0731.htm'
df = pd.read_html(url)
q0 = {
'a' : [
df[2][3][0],
df[2][3][1],
df[2][3][2],
df[3][0][1],
df[4][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d0 = pd.DataFrame(data= q0)
q1 = {
'a' : [
df[6][3][0],
df[6][3][1],
df[6][3][2],
df[7][0][1],
df[8][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d1 = pd.DataFrame(data= q1)
q2 = {
'a' : [
df[9][3][0],
df[9][3][1],
df[9][3][2],
df[10][0][1],
df[11][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d2 = pd.DataFrame(data= q2)
q3 = {
'a' : [
df[12][3][0],
df[12][3][1],
df[12][3][2],
df[13][0][1],
df[14][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d3 = pd.DataFrame(data= q3)
q4 = {
'a' : [
df[16][3][0],
df[16][3][1],
df[16][3][2],
df[17][0][1],
df[18][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d4 = pd.DataFrame(data= q4)
q5 = {
'a' : [
df[20][3][0],
df[20][3][1],
df[20][3][2],
df[21][0][1],
df[22][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::',
]}
d5 = pd.DataFrame(data= q5)
q6 = {
'a' : [
df[23][3][0],
df[23][3][1],
df[23][3][2],
df[24][0][1],
df[25][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d6 = pd.DataFrame(data= q6)
q7 = {
'a' : [
df[26][3][0],
df[26][3][1],
df[26][3][2],
df[27][0][1],
df[28][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d7 = pd.DataFrame(data= q7)
q8 = {
'a' : [
df[29][3][0],
df[29][3][1],
df[29][3][2],
df[30][0][1],
df[31][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d8 = pd.DataFrame(data= q8)
q9 = {
'a' : [
df[32][3][0],
df[32][3][1],
df[32][3][2],
df[33][0][1],
df[34][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d9 = pd.DataFrame(data= q9)
q10 = {
'a' : [
df[35][3][0],
df[35][3][1],
df[35][3][2],
df[36][0][1],
df[37][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d10 = pd.DataFrame(data= q10)
q11 = {
'a' : [
df[39][3][0],
df[39][3][1],
df[39][3][2],
df[40][0][1],
df[41][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d11 = pd.DataFrame(data= q11)
q12 = {
'a' : [
df[42][3][0],
df[42][3][1],
df[42][3][2],
df[43][0][1],
df[44][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d12 = pd.DataFrame(data= q12)
q13 = {
'a' : [
df[46][3][0],
df[46][3][1],
df[46][3][2],
df[47][0][1],
df[48][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d13 = pd.DataFrame(data= q13)
q14 = {
'a' : [
df[49][3][0],
df[49][3][1],
df[49][3][2],
df[50][0][1],
df[51][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d14 = pd.DataFrame(data= q14)
q15 = {
'a' : [
df[52][3][0],
df[52][3][1],
df[52][3][2],
df[53][0][1],
df[54][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d15 = pd.DataFrame(data= q15)
q16 = {
'a' : [
df[55][3][0],
df[55][3][1],
df[55][3][2],
df[56][0][1],
df[57][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d16 = pd.DataFrame(data= q16)
q17 = {
'a' : [
df[58][3][0],
df[58][3][1],
df[58][3][2],
df[59][0][1],
df[60][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d17 = pd.DataFrame(data= q17)
q18 = {
'a' : [
df[62][3][0],
df[62][3][1],
df[62][3][2],
df[63][0][1],
df[64][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d18 = pd.DataFrame(data= q18)
q19 = {
'a' : [
df[65][3][0],
df[65][3][1],
df[65][3][2],
df[66][0][1],
df[67][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d19 = pd.DataFrame(data= q19)
q20 = {
'a' : [
df[68][3][0],
df[68][3][1],
df[68][3][2],
df[69][0][1],
df[70][0][1],
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
]}
d20 = pd.DataFrame(data= q20)
with pd.ExcelWriter('stj_731.xlsx') as writer:
d0.to_excel(writer, sheet_name='1', header=False, index=False)
d1.to_excel(writer, sheet_name='2', header=False, index=False)
d2.to_excel(writer, sheet_name='3', header=False, index=False)
d3.to_excel(writer, sheet_name='4', header=False, index=False)
d4.to_excel(writer, sheet_name='5', header=False, index=False)
d5.to_excel(writer, sheet_name='6', header=False, index=False)
d6.to_excel(writer, sheet_name='7', header=False, index=False)
d7.to_excel(writer, sheet_name='8', header=False, index=False)
d8.to_excel(writer, sheet_name='9', header=False, index=False)
d9.to_excel(writer, sheet_name='10', header=False, index=False)
d10.to_excel(writer, sheet_name='11', header=False, index=False)
d11.to_excel(writer, sheet_name='12', header=False, index=False)
d12.to_excel(writer, sheet_name='13', header=False, index=False)
d13.to_excel(writer, sheet_name='14', header=False, index=False)
d14.to_excel(writer, sheet_name='15', header=False, index=False)
d15.to_excel(writer, sheet_name='16', header=False, index=False)
d16.to_excel(writer, sheet_name='17', header=False, index=False)
d17.to_excel(writer, sheet_name='18', header=False, index=False)
d18.to_excel(writer, sheet_name='19', header=False, index=False)
d19.to_excel(writer, sheet_name='20', header=False, index=False)
d20.to_excel(writer, sheet_name='21', header=False, index=False)