I need help to import a huge number of cards from text files in a more efficient way

I’ve spent the last two days trying to figure out how to scrape data from a website and turn the relevant information into flashcards.

This is the procedure I’ve come up with:

  1. Download the RTF file provided by the website developer.

  2. Save it as htm file.

  3. Scrape this file using pandas (the information is contained in tables that “cannot” be scraped directly from the website using BS4 in a practical way as there’s multiple div using the same class name).

  4. Export the data to an Excel file with multiple sheets.

  5. Save every sheet as a .txt file.

  6. Edit every file to add a custom separator.

  7. Change the codification of these files to UTF-8.

  8. Import the txt files to Anki.

Well… I’ve got 100+ rtf files to scrape and each one of them contains something around 20 tables.

I’m definitely not interested in repeating the procedure I described above that many times so, I wonder if there’s a more efficient way to get the same result.

Can you help me with that?

For more context, this is the code I'm using to scrape the `htm` file:
import pandas as pd

url = 'C:/Users/Me/Desktop/Inf0731.htm'

df = pd.read_html(url)

q0 = {  

     'a' : [

       df[2][3][0],

       df[2][3][1],

       df[2][3][2],

       df[3][0][1],

       df[4][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'

       ]}

d0 = pd.DataFrame(data= q0)

q1 = {  

     'a' : [

       df[6][3][0],

       df[6][3][1],

       df[6][3][2],

       df[7][0][1],

       df[8][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'

       ]}

d1 = pd.DataFrame(data= q1)

q2 = {  

     'a' : [

       df[9][3][0],

       df[9][3][1],

       df[9][3][2],

       df[10][0][1],

       df[11][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'

       ]}

d2 = pd.DataFrame(data= q2)

q3 = {  

     'a' : [

       df[12][3][0],

       df[12][3][1],

       df[12][3][2],

       df[13][0][1],

       df[14][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'

       ]}

d3 = pd.DataFrame(data= q3)

q4 = {  

     'a' : [

       df[16][3][0],

       df[16][3][1],

       df[16][3][2],

       df[17][0][1],

       df[18][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'

       ]}

d4 = pd.DataFrame(data= q4)

q5 = {  

     'a' : [

       df[20][3][0],

       df[20][3][1],

       df[20][3][2],

       df[21][0][1],

       df[22][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::',

       ]}

d5 = pd.DataFrame(data= q5)

q6 = {  

     'a' : [

       df[23][3][0],

       df[23][3][1],

       df[23][3][2],

       df[24][0][1],

       df[25][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d6 = pd.DataFrame(data= q6)

q7 = {  

     'a' : [

       df[26][3][0],

       df[26][3][1],

       df[26][3][2],

       df[27][0][1],

       df[28][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d7 = pd.DataFrame(data= q7)

q8 = {  

     'a' : [

       df[29][3][0],

       df[29][3][1],

       df[29][3][2],

       df[30][0][1],

       df[31][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d8 = pd.DataFrame(data= q8)

q9 = {  

     'a' : [

       df[32][3][0],

       df[32][3][1],

       df[32][3][2],

       df[33][0][1],

       df[34][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d9 = pd.DataFrame(data= q9)

q10 = {  

     'a' : [

       df[35][3][0],

       df[35][3][1],

       df[35][3][2],

       df[36][0][1],

       df[37][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d10 = pd.DataFrame(data= q10)

q11 = {  

     'a' : [

       df[39][3][0],

       df[39][3][1],

       df[39][3][2],

       df[40][0][1],

       df[41][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d11 = pd.DataFrame(data= q11)

q12 = {  

     'a' : [

       df[42][3][0],

       df[42][3][1],

       df[42][3][2],

       df[43][0][1],

       df[44][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d12 = pd.DataFrame(data= q12)

q13 = {  

     'a' : [

       df[46][3][0],

       df[46][3][1],

       df[46][3][2],

       df[47][0][1],

       df[48][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d13 = pd.DataFrame(data= q13)

q14 = {  

     'a' : [

       df[49][3][0],

       df[49][3][1],

       df[49][3][2],

       df[50][0][1],

       df[51][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d14 = pd.DataFrame(data= q14)

q15 = {  

     'a' : [

       df[52][3][0],

       df[52][3][1],

       df[52][3][2],

       df[53][0][1],

       df[54][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d15 = pd.DataFrame(data= q15)

q16 = {  

     'a' : [

       df[55][3][0],

       df[55][3][1],

       df[55][3][2],

       df[56][0][1],

       df[57][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d16 = pd.DataFrame(data= q16)

q17 = {  

     'a' : [

       df[58][3][0],

       df[58][3][1],

       df[58][3][2],

       df[59][0][1],

       df[60][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d17 = pd.DataFrame(data= q17)

q18 = {  

     'a' : [

       df[62][3][0],

       df[62][3][1],

       df[62][3][2],

       df[63][0][1],

       df[64][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d18 = pd.DataFrame(data= q18)

q19 = {  

     'a' : [

       df[65][3][0],

       df[65][3][1],

       df[65][3][2],

       df[66][0][1],

       df[67][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d19 = pd.DataFrame(data= q19)

q20 = {  

     'a' : [

       df[68][3][0],

       df[68][3][1],

       df[68][3][2],

       df[69][0][1],

       df[70][0][1],

       df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'      

       ]}

d20 = pd.DataFrame(data= q20)

with pd.ExcelWriter('stj_731.xlsx') as writer:

       d0.to_excel(writer, sheet_name='1', header=False, index=False)

       d1.to_excel(writer, sheet_name='2', header=False, index=False)

       d2.to_excel(writer, sheet_name='3', header=False, index=False)

       d3.to_excel(writer, sheet_name='4', header=False, index=False)

       d4.to_excel(writer, sheet_name='5', header=False, index=False)

       d5.to_excel(writer, sheet_name='6', header=False, index=False)

       d6.to_excel(writer, sheet_name='7', header=False, index=False)

       d7.to_excel(writer, sheet_name='8', header=False, index=False)

       d8.to_excel(writer, sheet_name='9', header=False, index=False)

       d9.to_excel(writer, sheet_name='10', header=False, index=False)

       d10.to_excel(writer, sheet_name='11', header=False, index=False)

       d11.to_excel(writer, sheet_name='12', header=False, index=False)

       d12.to_excel(writer, sheet_name='13', header=False, index=False)

       d13.to_excel(writer, sheet_name='14', header=False, index=False)

       d14.to_excel(writer, sheet_name='15', header=False, index=False)

       d15.to_excel(writer, sheet_name='16', header=False, index=False)

       d16.to_excel(writer, sheet_name='17', header=False, index=False)

       d17.to_excel(writer, sheet_name='18', header=False, index=False)

       d18.to_excel(writer, sheet_name='19', header=False, index=False)

       d19.to_excel(writer, sheet_name='20', header=False, index=False)

       d20.to_excel(writer, sheet_name='21', header=False, index=False)

You would have to provide a link to download the files for someone to be able to help.

But anyway, your code looks very repetitive. Can’t you use a for loop here?

Usa um loop.
É o QC ou TEC? Bs4 funciona.

You’re right.

Here is the link for the actual website: https://processo.stj.jus.br/jurisprudencia/externo/informativo/?acao=pesquisarumaedicao&livre='0731'.cod.

Here is the link to download the rtf file: https://processo.stj.jus.br/docs_internet/informativos//RTF/Inf0731.rtf

It’s all written in Portuguese, but I’ll try to make it easier to understand what I want below:

The content that matters to me is inside:

  • <span class="clsInformativoTitulo">
    It appears just once. In my python file it’s equivalent to:
    df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'

  • <div class="divCell clsInformativoTexto">
    It appears three times on each table, as shown in the image below. In my python file it’s equivalent to:
    df[NUMBER FROM 2 TO 68 HERE][3][0], df[NUMBER FROM 2 TO 68 HERE][3][1], df[NUMBER FROM 2 TO 68 HERE][3][2],

  • <div class="divCell clsInformativoTextoFormatado">
    It appears two times on each table, see image. In my python file it’s equivalent to:
    df[NUMBER FROM **3** TO 69 HERE][0][1], df[NUMBER FROM **4** TO 70 HERE][0][1]

:warning: The numbers inside the first [] doesn’t follow a perfect sequence. (e.g. it jumps from df[44] in q12 to df[46] in q13.

Image

:arrow_right: I want the content inside of each one of these red boxes - from the second to the last - to be in a field in the back side of the card. I also want the content inside of the first red box to be the tag for the card. This particular document has 21 tables, and I want to turn each one of them into a card.

Sincerely, I don’t know… My python skills are very basic and I just learned about pandas yesterday. I’d be interested in learning about it, though. Maybe you could share a link to some tutorial or documentation?

By the way, thank you for your time and your help!

STJ, rsrsrs. Mas também quero fazer do QC no futuro. Você tem alguma orientação sobre isso?

I recommend going through this free book, which teaches the basics of Python and how to pull data from websites.

The website is refusing to open for me for some reason.

1 Like

Qual parte voce quer pegar? o DESTAQUE?