I need help to import a huge number of cards from text files in a more efficient way

gbrl.sc · April 6, 2022, 2:39pm

I’ve spent the last two days trying to figure out how to scrape data from a website and turn the relevant information into flashcards.

This is the procedure I’ve come up with:

1. Download the RTF file provided by the website developer.
2. Save it as an .htm file.
3. Scrape this file using pandas (the information is in tables that “cannot” practically be scraped straight from the website with BS4, because multiple divs share the same class name).
4. Export the data to an Excel file with multiple sheets.
5. Save every sheet as a .txt file.
6. Edit every file to add a custom separator.
7. Change the encoding of these files to UTF-8.
8. Import the .txt files into Anki.
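
The export, separator, and encoding steps could probably be collapsed into one. Below is a minimal sketch, assuming each card is already a list of strings, that writes tab-separated UTF-8 text with the standard library only. The function name `write_anki_text` and the sample row are made up for illustration.

```python
import csv
from pathlib import Path


def write_anki_text(rows, path):
    """Write one note per row to a tab-separated UTF-8 text file."""
    # newline="" lets the csv module control line endings; it also quotes
    # any field that itself contains a tab, a quote, or a line break.
    with Path(path).open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerows(rows)


# Hypothetical row: five back-side fields followed by the tag.
rows = [
    ["Processo", "Ramo do direito", "Tema", "Destaque", "Inteiro teor",
     "JURISPRUDÊNCIA::INFORMATIVOS::STJ_731::"],
]
```

Calling `write_anki_text(rows, 'stj_731.txt')` would then give a single file for the whole edition, with no Excel sheets to re-save and no encoding to change afterwards.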

Well… I’ve got 100+ RTF files to scrape, and each one contains around 20 tables.

I’m definitely not interested in repeating the procedure above that many times, so I wonder if there’s a more efficient way to get the same result.

Can you help me with that?
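
For the 100+ files, one possible shape is a loop over a folder. This is only a sketch: `convert_folder` and its `convert_one` callback are hypothetical names, and it assumes the files have already been saved as .htm in a single folder.

```python
from pathlib import Path


def convert_folder(folder, convert_one):
    """Call convert_one(htm_path, txt_path) for every .htm file in folder."""
    written = []
    for htm_path in sorted(Path(folder).glob("*.htm")):
        # Keep the same base name, e.g. Inf0731.htm -> Inf0731.txt
        txt_path = htm_path.with_suffix(".txt")
        convert_one(htm_path, txt_path)
        written.append(txt_path)
    return written
```

`convert_one` would be whatever function scrapes one file and writes its cards, so the per-file logic only has to be written once.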

For more context, this is the code I'm using to scrape the `htm` file:

```python
import pandas as pd

url = 'C:/Users/Me/Desktop/Inf0731.htm'

# pd.read_html returns a list of DataFrames, one per table in the file
df = pd.read_html(url)

q0 = {
    'a': [
        df[2][3][0],
        df[2][3][1],
        df[2][3][2],
        df[3][0][1],
        df[4][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d0 = pd.DataFrame(data=q0)

q1 = {
    'a': [
        df[6][3][0],
        df[6][3][1],
        df[6][3][2],
        df[7][0][1],
        df[8][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d1 = pd.DataFrame(data=q1)

q2 = {
    'a': [
        df[9][3][0],
        df[9][3][1],
        df[9][3][2],
        df[10][0][1],
        df[11][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d2 = pd.DataFrame(data=q2)

q3 = {
    'a': [
        df[12][3][0],
        df[12][3][1],
        df[12][3][2],
        df[13][0][1],
        df[14][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d3 = pd.DataFrame(data=q3)

q4 = {
    'a': [
        df[16][3][0],
        df[16][3][1],
        df[16][3][2],
        df[17][0][1],
        df[18][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d4 = pd.DataFrame(data=q4)

q5 = {
    'a': [
        df[20][3][0],
        df[20][3][1],
        df[20][3][2],
        df[21][0][1],
        df[22][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d5 = pd.DataFrame(data=q5)

q6 = {
    'a': [
        df[23][3][0],
        df[23][3][1],
        df[23][3][2],
        df[24][0][1],
        df[25][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d6 = pd.DataFrame(data=q6)

q7 = {
    'a': [
        df[26][3][0],
        df[26][3][1],
        df[26][3][2],
        df[27][0][1],
        df[28][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d7 = pd.DataFrame(data=q7)

q8 = {
    'a': [
        df[29][3][0],
        df[29][3][1],
        df[29][3][2],
        df[30][0][1],
        df[31][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d8 = pd.DataFrame(data=q8)

q9 = {
    'a': [
        df[32][3][0],
        df[32][3][1],
        df[32][3][2],
        df[33][0][1],
        df[34][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d9 = pd.DataFrame(data=q9)

q10 = {
    'a': [
        df[35][3][0],
        df[35][3][1],
        df[35][3][2],
        df[36][0][1],
        df[37][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d10 = pd.DataFrame(data=q10)

q11 = {
    'a': [
        df[39][3][0],
        df[39][3][1],
        df[39][3][2],
        df[40][0][1],
        df[41][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d11 = pd.DataFrame(data=q11)

q12 = {
    'a': [
        df[42][3][0],
        df[42][3][1],
        df[42][3][2],
        df[43][0][1],
        df[44][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d12 = pd.DataFrame(data=q12)

q13 = {
    'a': [
        df[46][3][0],
        df[46][3][1],
        df[46][3][2],
        df[47][0][1],
        df[48][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d13 = pd.DataFrame(data=q13)

q14 = {
    'a': [
        df[49][3][0],
        df[49][3][1],
        df[49][3][2],
        df[50][0][1],
        df[51][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d14 = pd.DataFrame(data=q14)

q15 = {
    'a': [
        df[52][3][0],
        df[52][3][1],
        df[52][3][2],
        df[53][0][1],
        df[54][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d15 = pd.DataFrame(data=q15)

q16 = {
    'a': [
        df[55][3][0],
        df[55][3][1],
        df[55][3][2],
        df[56][0][1],
        df[57][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d16 = pd.DataFrame(data=q16)

q17 = {
    'a': [
        df[58][3][0],
        df[58][3][1],
        df[58][3][2],
        df[59][0][1],
        df[60][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d17 = pd.DataFrame(data=q17)

q18 = {
    'a': [
        df[62][3][0],
        df[62][3][1],
        df[62][3][2],
        df[63][0][1],
        df[64][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d18 = pd.DataFrame(data=q18)

q19 = {
    'a': [
        df[65][3][0],
        df[65][3][1],
        df[65][3][2],
        df[66][0][1],
        df[67][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d19 = pd.DataFrame(data=q19)

q20 = {
    'a': [
        df[68][3][0],
        df[68][3][1],
        df[68][3][2],
        df[69][0][1],
        df[70][0][1],
        df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    ]}
d20 = pd.DataFrame(data=q20)

# One sheet per card
with pd.ExcelWriter('stj_731.xlsx') as writer:
    d0.to_excel(writer, sheet_name='1', header=False, index=False)
    d1.to_excel(writer, sheet_name='2', header=False, index=False)
    d2.to_excel(writer, sheet_name='3', header=False, index=False)
    d3.to_excel(writer, sheet_name='4', header=False, index=False)
    d4.to_excel(writer, sheet_name='5', header=False, index=False)
    d5.to_excel(writer, sheet_name='6', header=False, index=False)
    d6.to_excel(writer, sheet_name='7', header=False, index=False)
    d7.to_excel(writer, sheet_name='8', header=False, index=False)
    d8.to_excel(writer, sheet_name='9', header=False, index=False)
    d9.to_excel(writer, sheet_name='10', header=False, index=False)
    d10.to_excel(writer, sheet_name='11', header=False, index=False)
    d11.to_excel(writer, sheet_name='12', header=False, index=False)
    d12.to_excel(writer, sheet_name='13', header=False, index=False)
    d13.to_excel(writer, sheet_name='14', header=False, index=False)
    d14.to_excel(writer, sheet_name='15', header=False, index=False)
    d15.to_excel(writer, sheet_name='16', header=False, index=False)
    d16.to_excel(writer, sheet_name='17', header=False, index=False)
    d17.to_excel(writer, sheet_name='18', header=False, index=False)
    d18.to_excel(writer, sheet_name='19', header=False, index=False)
    d19.to_excel(writer, sheet_name='20', header=False, index=False)
    d20.to_excel(writer, sheet_name='21', header=False, index=False)
```

abdo · April 6, 2022, 3:53pm

You would have to provide a link to download the files for someone to be able to help.

But anyway, your code looks very repetitive. Can’t you use a for loop here?
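
A rough sketch of what that loop could look like, assuming the table positions stay as they are in the posted script. `CARD_STARTS` and `build_cards` are names invented here; the start indices are the first table index of each of the q0 to q20 blocks.

```python
# First table index of each card, copied from the q0 ... q20 blocks in the question.
CARD_STARTS = [2, 6, 9, 12, 16, 20, 23, 26, 29, 32, 35,
               39, 42, 46, 49, 52, 55, 58, 62, 65, 68]


def build_cards(tables, starts=CARD_STARTS):
    """Return one six-item list per card, mirroring the q0 ... q20 dictionaries."""
    # The tag is the same for every card, so compute it once.
    tag = tables[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    cards = []
    for start in starts:
        cards.append([
            tables[start][3][0],      # three cells from column 3 of the first table
            tables[start][3][1],
            tables[start][3][2],
            tables[start + 1][0][1],  # second row of the next table
            tables[start + 2][0][1],  # second row of the table after that
            tag,
        ])
    return cards
```

With the original script this would be called as `cards = build_cards(df)`, since the indexing is the same `df[table][column][row]` pattern. The hard-coded list only fits this one file, so other editions would need their own start indices.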

cjdduarte · April 6, 2022, 4:32pm

Use a loop.
Is it QC or TEC? BS4 works.

gbrl.sc · April 6, 2022, 5:26pm

You’re right.

Here is the link for the actual website: https://processo.stj.jus.br/jurisprudencia/externo/informativo/?acao=pesquisarumaedicao&livre='0731'.cod.

Here is the link to download the rtf file: https://processo.stj.jus.br/docs_internet/informativos//RTF/Inf0731.rtf

It’s all written in Portuguese, but I’ll try to make it easier to understand what I want below:

The content that matters to me is inside these elements:

<span class="clsInformativoTitulo">
It appears just once. In my Python file it’s equivalent to:
df[0][0][1].replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'

<div class="divCell clsInformativoTexto">
It appears three times in each table, as shown in the image below. In my Python file it’s equivalent to:
df[N][3][0], df[N][3][1], df[N][3][2], where N goes from 2 to 68

<div class="divCell clsInformativoTextoFormatado">
It appears twice in each table (see image). In my Python file it’s equivalent to:
df[N + 1][0][1] and df[N + 2][0][1], so the first index goes from 3 to 69 and from 4 to 70
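
Since the three class names are known, the text could also be pulled out by class rather than by table position. The sketch below swaps in the standard library's `html.parser` instead of BS4 or pandas. `ClassTextCollector` is an invented name, the sample markup is hypothetical, and the approach assumes every non-void tag inside the target elements is properly closed.

```python
from html.parser import HTMLParser

# Class names quoted in the post.
TARGET_CLASSES = {
    "clsInformativoTitulo",
    "clsInformativoTexto",
    "clsInformativoTextoFormatado",
}
# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "col", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect (class_name, text) pairs for elements carrying a target class."""

    def __init__(self):
        super().__init__()
        self.found = []        # finished (class_name, text) pairs, in page order
        self._current = None   # class name of the element being captured
        self._depth = 0        # open tags inside the captured element
        self._parts = []       # text pieces seen so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if tag == "br" and self._current is not None:
                self._parts.append("\n")  # keep a gap where the line break was
            return
        if self._current is not None:
            self._depth += 1
            return
        for name in (dict(attrs).get("class") or "").split():
            if name in TARGET_CLASSES:
                self._current, self._depth, self._parts = name, 1, []
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._parts).split())
            self.found.append((self._current, text))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._parts.append(data)


# Hypothetical markup, only to show the shape of the result.
sample = """
<span class="clsInformativoTitulo">Número 731</span>
<div class="divCell clsInformativoTexto">Processo <b>exemplo</b><br>Ramo do direito</div>
<div class="divCell clsInformativoTextoFormatado"><p>Texto do destaque.</p></div>
"""
parser = ClassTextCollector()
parser.feed(sample)
print(parser.found)
```

Because each result carries its class name, the pairs can be grouped into cards in page order without relying on table numbers.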

The numbers inside the first [] don’t follow a perfect sequence (e.g. they jump from df[44] in q12 to df[46] in q13).
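
Because of those jumps, hard-coding the indices will not carry over to other files. One hedged alternative is to detect the first table of each card by its shape. The sketch assumes, without having checked the real files, that only those tables have a value at column 3, row 2. Both function names are invented.

```python
def looks_like_card_start(table):
    """True if the table has a value at column 3, row 2 (the assumed card layout)."""
    try:
        table[3][2]
    except (KeyError, IndexError):
        # Column 3 or row 2 is missing, so this is some other kind of table.
        return False
    return True


def find_card_starts(tables):
    """Return the positions of tables that look like the first table of a card."""
    return [i for i, table in enumerate(tables) if looks_like_card_start(table)]
```

If the assumption holds, `find_card_starts(df)` should give back the same numbers that were typed by hand (2, 6, 9, ...), which is an easy way to check it on Inf0731 before trusting it on the other files.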

Image

I want the content inside of each one of these red boxes - from the second to the last - to be in a field in the back side of the card. I also want the content inside of the first red box to be the tag for the card. This particular document has 21 tables, and I want to turn each one of them into a card.
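
Put as code, that layout might look like the sketch below: one row per card, the five pieces of content first and the tag last. `make_note` is a made-up helper. Turning line breaks into `<br>` is an assumption that only works if HTML is allowed in fields when importing, and the last column would still have to be mapped to Tags in the import dialog.

```python
def make_note(back_fields, edition_title):
    """Return one import row: the back-side fields followed by the tag."""
    # Same replacement the posted script applies to df[0][0][1].
    tag = edition_title.replace('Número ', 'JURISPRUDÊNCIA::INFORMATIVOS::STJ_') + '::'
    # Line breaks inside a field become <br> so each note stays on one line.
    cleaned = [str(field).strip().replace("\n", "<br>") for field in back_fields]
    return cleaned + [tag]
```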

Honestly, I don’t know… My Python skills are very basic and I only learned about pandas yesterday. I’d be interested in learning about loops, though. Maybe you could share a link to some tutorial or documentation?

By the way, thank you for your time and your help!

gbrl.sc · April 6, 2022, 5:28pm

STJ, haha. But I also want to do this for QC in the future. Do you have any guidance on that?

abdo · April 7, 2022, 6:13am

I recommend going through this free book, which teaches the basics of Python and how to pull data from websites.

The website is refusing to open for me for some reason.

cjdduarte · April 7, 2022, 9:14am

Which part do you want to grab? The DESTAQUE?
